Abstract Scaling laws suggest that more than a trillion species inhabit our planet, but only a minuscule and unrepresentative fraction (less than 0.00001%) has been studied or sequenced to date. Deep learning models, including those applied to tasks in the life sciences, depend on the quality and size of their training or reference datasets. Given the large knowledge gap concerning life on Earth, we present a data-centric approach to improving deep learning models in biology: we built partnerships with nature parks and biodiversity stakeholders across 5 continents, covering 50% of global biomes, and established a global metagenomics and biological data supply chain. Because this dataset captures higher protein sequence diversity than existing public data, we apply this data advantage to the protein folding problem by supplementing the multiple sequence alignments (MSAs) used during AlphaFold2 inference. Our model, BaseFold, exceeds traditional AlphaFold2 performance across targets from CASP15 and CAMEO: 60% of targets show improved pLDDT scores, and RMSD values are reduced by up to 80%. In addition, the improved quality of the predicted structures can yield better docking results. By sharing benefits with the stakeholders from whom this data originates, we present a way of simultaneously improving deep learning models for biology and incentivising the protection of our planet’s biodiversity.