Proteins carry out the vast majority of functions in all biological domains, but for technological reasons their large-scale investigation has lagged behind the study of genomes. Since the first essentially complete eukaryotic proteome was reported [1], advances in mass-spectrometry-based proteomics [2] have enabled increasingly comprehensive identification and quantification of the human proteome [3–6]. However, there have been few comparisons across species [7,8], in stark contrast with genomics initiatives [9]. Here we use an advanced proteomics workflow (in which the peptide separation step is performed by a microstructured and extremely reproducible chromatographic system) for the in-depth study of 100 taxonomically diverse organisms. With two million peptide and 340,000 stringent protein identifications obtained in a standardized manner, we double the number of proteins with solid experimental evidence known to the scientific community. The data also provide a large-scale case study for sequence-based machine learning, as we demonstrate by experimentally confirming the predicted properties of peptides from Bacteroides uniformis. Our results offer a comparative view of the functional organization of organisms across the entire evolutionary range. A remarkably high fraction of the total proteome mass in all kingdoms is dedicated to protein homeostasis and folding, highlighting the biological challenge of maintaining protein structure in all branches of life. Likewise, a universally high fraction is involved in supplying energy resources, although these pathways range from photosynthesis through iron–sulfur metabolism to carbohydrate metabolism. Generally, however, proteins and proteomes are remarkably diverse between organisms, and they can readily be explored and functionally compared at www.proteomesoflife.org.
The MS-based proteomics data have been deposited in the ProteomeXchange Consortium via the PRIDE partner repository and are available via ProteomeXchange with identifiers PXD014877 and PXD019483.
Custom computer code is available at https://github.com/MannLabs/proteomesoflife.
de Godoy, L. M. F. et al. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 455, 1251–1254 (2008).
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Nagaraj, N. et al. System-wide perturbation analysis with nearly complete coverage of the yeast proteome by single-shot ultra HPLC runs on a bench top Orbitrap. Mol. Cell. Proteomics 11, M111.013722 (2012).
Kim, M.-S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Bekker-Jensen, D. B. et al. An optimized shotgun strategy for the rapid generation of comprehensive human proteomes. Cell Syst. 4, 587–599 (2017).
Weiss, M., Schrimpf, S., Hengartner, M. O., Lercher, M. J. & von Mering, C. Shotgun proteomics data from multiple organisms reveals remarkable quantitative conservation of the eukaryotic core proteome. Proteomics 10, 1297–1306 (2010).
Marx, H. et al. A proteomic atlas of the legume Medicago truncatula and its nitrogen-fixing endosymbiont Sinorhizobium meliloti. Nat. Biotechnol. 34, 1198–1205 (2016).
Shendure, J. et al. DNA sequencing at 40: past, present and future. Nature 550, 345–353 (2017); correction Nature 568, E11 (2019).
Kulak, N. A., Pichler, G., Paron, I., Nagaraj, N. & Mann, M. Minimal, encapsulated proteomic-sample processing applied to copy-number estimation in eukaryotic cells. Nat. Methods 11, 319–324 (2014).
Geyer, P. E. et al. Plasma proteome profiling to assess human health and disease. Cell Syst. 2, 185–195 (2016).
De Beeck, J. O. et al. Digging deeper into the human proteome: a novel nanoflow LCMS setup using micro pillar array columns (μPAC™). Preprint at bioRxiv https://doi.org/10.1101/472134 (2018).
Kulak, N. A., Geyer, P. E. & Mann, M. Loss-less nano-fractionator for high sensitivity, high coverage proteomics. Mol. Cell. Proteomics 16, 694–705 (2017).
Zhou, X.-X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).
Tiwary, S. et al. High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis. Nat. Methods 16, 519–525 (2019).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47 (D1), D506–D515 (2019).
Muñoz, J. & Heck, A. J. R. From the human genome to the human proteome. Angew. Chem. Int. Edn 53, 10864–10866 (2014).
Cox, J. et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol. Cell. Proteomics 13, 2513–2526 (2014).
Altenhoff, A. M. et al. Standardized benchmarking in the quest for orthologs. Nat. Methods 13, 425–430 (2016).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47 (D1), D309–D314 (2019).
The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 47 (D1), D330–D338 (2019).
Geer, L. Y. et al. The NCBI BioSystems database. Nucleic Acids Res. 38, D492–D496 (2010).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47 (D1), D427–D432 (2019).
Santos, A. et al. Clinical knowledge graph integrates proteomics data into clinical decision-making. Preprint at bioRxiv https://doi.org/10.1101/2020.05.09.084897 (2020).
Cox, J. & Mann, M. 1D and 2D annotation enrichment: a statistical method integrating quantitative proteomics with complementary high-throughput data. BMC Bioinformatics 13 (Suppl 16), S12 (2012).
Chi, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat. Biotechnol. 36, 1059–1061 (2018).
Zielinska, D. F., Gnad, F., Schropp, K., Wiśniewski, J. R. & Mann, M. Mapping N-glycosylation sites across seven evolutionarily distant species reveals a divergent substrate proteome despite a common core machinery. Mol. Cell 46, 542–548 (2012).
Wiśniewski, J. R., Wegler, C. & Artursson, P. Multiple-enzyme-digestion strategy improves accuracy and sensitivity of label- and standard-free absolute quantification to a level that is achievable by analysis with stable isotope-labeled standard spiking. J. Proteome Res. 18, 217–224 (2019).
Kelstrup, C. D. et al. Performance evaluation of the Q Exactive HF-X for shotgun proteomics. J. Proteome Res. 17, 727–738 (2018).
Scheltema, R. A. & Mann, M. SprayQc: a real-time LC-MS/MS quality monitoring system to maximize uptime using off the shelf components. J. Proteome Res. 11, 3458–3466 (2012).
Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).
Cox, J. et al. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 10, 1794–1805 (2011).
Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731–740 (2016).
Wichmann, C. et al. MaxQuant.Live enables global targeting of more than 25,000 peptides. Mol. Cell. Proteomics 18, 982–994 (2019).
Perez-Riverol, Y. et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 47 (D1), D442–D450 (2019).
Perkel, J. M. Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 145–146 (2018).
We thank all members of the Proteomics and Signal Transduction Group and the Clinical Proteomics Group at the Max Planck Institute of Biochemistry, Martinsried, for help and discussions, and in particular I. Paron, C. Deiml, A. Strasser and B. Splettstoesser for technical assistance. We further thank the P. Bork group for supplying bacteria, the A. Pichlmair group for virus samples, F. Hosp for A. thaliana, I. Sinning for Neurospora crassa and the K.-P. Janssen group for cell line samples. Our work was partially supported by the Max Planck Society for the Advancement of Science, by the European Union’s Horizon 2020 research and innovation program with the Microb-Predict project (grant 825694), by grants from the Novo Nordisk Foundation (NNF15CC0001 and NNF15OC0016692), and by the Deutsche Forschungsgemeinschaft (DFG) project ‘Chemical proteomics inside us’ (grant 412136960).
The authors declare no competing interests.
Peer review information: Nature thanks Joshua Coon, Vera van Noort and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
Extended Data Fig. 1 Comparison of the peptide retention times obtained by a μPAC and a fused silica capillary column.
a, The histograms illustrate the distribution of coefficients of variation (CVs) calculated from peptide retention times obtained by a μPAC and a fused silica capillary column. The CVs were calculated for peptides from 12 measurements of a HeLa cell digest on each column. b, All components, including lines, connectors, the column and the emitter, are displayed together with grounding and spray voltage connections. The pico tip emitter is from New Objective (catalogue number FS360-20-10-N-5-105CT).
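As a rough illustration of the per-peptide statistic behind panel a, the coefficient of variation can be computed from repeated retention-time measurements of the same peptide. This is a minimal sketch, not the authors' code: the function name `retention_time_cv`, the example values and the use of the sample standard deviation are all assumptions.

```python
from statistics import mean, stdev


def retention_time_cv(retention_times):
    """Return the coefficient of variation (percent) of repeated measurements.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the caption does not state which estimator was applied.
    """
    if len(retention_times) < 2:
        raise ValueError("need at least two measurements")
    return 100.0 * stdev(retention_times) / mean(retention_times)


# Hypothetical retention times (minutes) for one peptide across replicate runs.
example_runs = [42.10, 42.15, 42.08, 42.12]
cv_percent = retention_time_cv(example_runs)
```

Applying such a function to every peptide measured on each column would give the two CV distributions that the histograms compare.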
Extended Data Fig. 2 Interlaboratory reproducibility and prediction of peptide retention time on the μPAC column.
a, The ability to produce chip-based columns in a reproducible manner, coupled with the statically fixed micrometre-sized pillars, results in highly reproducible performance and interlaboratory transferability of the μPAC-based approach. Shown are the corrected retention times of an excerpt of 5,000 peptides from the 43,000 overlapping peptides measured in two different HeLa cell digests by our Munich and Copenhagen laboratories, resulting in a Pearson correlation coefficient of peptide retention times of 0.995. b, To validate our model for predicting peptide retention times, we plot an excerpt of 1,000 peptides from the complete test-set of 54,490 peptides, with experimentally determined values on the x axis and predicted values on the y axis. The Pearson’s R2 correlation value for the complete predicted peptide set is 0.99.
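The correlation values quoted in this caption can be illustrated with a plain Pearson calculation on paired retention times (laboratory A versus laboratory B, or measured versus predicted). The sketch below is a generic textbook implementation under assumed names (`pearson_r`, the example lists); it is not taken from the study's code.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical retention times (minutes) for the same peptides in two runs.
measured = [12.0, 25.5, 40.2, 61.3]
predicted = [12.4, 25.1, 41.0, 60.8]
r = pearson_r(measured, predicted)
r_squared = r ** 2  # for a simple linear fit, R^2 equals r squared
```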
Extended Data Fig. 3 Total numbers of identified peptides from 100 organisms across the tree of life.
Peptides identified uniquely in a single organism are colour-coded to distinguish them from peptides identified in multiple species. Orange, archaea; blue, eukaryotes; green, bacteria.
Extended Data Fig. 4 Comparison and characterization of the LSTM model for predicting peptide retention times.
a, Box plots comparing R2 scores obtained from different models of peptide retention time, calculated from linear regressions of the predicted test-set retention times against the measured peptide retention times. Sample sizes are shown in b. b, Table comparing the different models of peptide retention time. The training set was reduced in size (number of peptides included) in order to account for the exponentially growing calculation time of certain models. Statistics are from the linear regression of the predicted test-set retention times against the measured retention times. c, Characterization of the LSTM model applied here for different sizes of training peptide set.
Extended Data Fig. 5
a, Illustration of all direct taxonomic levels below the superkingdom level that are covered by our data set. DPANN, Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaeota; FCB, Fibrobacteres, Chlorobi and Bacteroidetes; PCV, Planctomycetes, Chlamydiae and Verrucomicrobia; TACK, Thaumarchaeota, Crenarchaeota and Korarchaeota. b, Number of protein identification codes (IDs) in this study and their relation to TrEMBL IDs found in the PRIDE archive. c, Comparison of the Swiss-Prot database to the data set in this study with regards to organism and protein numbers. d, Numbers of identified protein groups and UniProt protein entries for all 100 organisms in our data set. The UniProt protein-entry identifications are colour-coded into Swiss-Prot (reviewed) and TrEMBL (predicted) entries.
Extended Data Fig. 6
Protein intensities are log10-scaled and plotted against the abundance rank of each protein.
Extended Data Fig. 7
On the x axis, proteins are ranked according to their abundance; the y axis shows the cumulative protein intensity. Proteins missing biological-process annotation are highlighted by grey lines in the background.
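The ranking and cumulative curve described in this caption can be sketched as follows. This is an illustrative example only: the function name, the protein identifiers and the choice to normalize the cumulative intensity to the total are assumptions, not details given in the source.

```python
from itertools import accumulate


def cumulative_intensity_by_rank(intensities):
    """Rank proteins by intensity (highest first) and accumulate their share.

    `intensities` maps a protein identifier to its (linear-scale) intensity.
    Returns a list of (rank, protein_id, cumulative_fraction) tuples.
    Assumption: the cumulative value is normalized to the summed intensity.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ranked)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    running = accumulate(value for _, value in ranked)
    return [
        (rank, protein_id, cumulative / total)
        for rank, ((protein_id, _), cumulative) in enumerate(zip(ranked, running), start=1)
    ]


# Hypothetical intensities for three proteins.
curve = cumulative_intensity_by_rank({"B": 30.0, "A": 60.0, "C": 10.0})
```

Reading the rank at which the cumulative fraction first reaches a chosen threshold gives the number of proteins that account for that share of the measured protein mass.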
Extended Data Fig. 8 Quantitative analysis of different enzyme classes and functional protein domains across the tree of life.
a, We classified the contribution of peptides to the top 90% of protein mass within all 100 organisms according to the enzyme commission (EC) number, using the Unipept web-tool (https://unipept.ugent.be/). The alluvial plot illustrates the proportions of each enzyme class across all organisms in our study. b, Comparison of the three domains of life with respect to their normalized contribution of peptides to each enzyme class. c, Proteins that contribute to the top 90% of the protein mass within all 100 organisms studied herein were annotated according to their known functional protein domains, and the intensities for different functional domains of an organism were summed to display the most abundant functional protein domains across the tree of life. The intensity is displayed on a log10 scale.
Extended Data Fig. 9 Quantitative analysis of specific biological processes across the tree of life.
a, Linear display showing a global view of the expression levels of functional groups across the 100 organisms from Fig. 4. Summed intensities for functional terms are shown as grey lines, with the top ten most abundant terms in all organisms colour-coded according to the top key. b, Quantitative analysis of specific biological processes from the superkingdom of eukaryotes. Proteins were annotated with biological processes, and the intensities for each annotation term within an organism were summed. Those biological processes that display differential expression across the superkingdom as well as photosynthetic processes are highlighted according to the bottom key.
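The per-term summation described in panel b can be sketched for a single organism as below. The function name, identifiers and terms are hypothetical, and the rule that a protein with several annotations contributes its full intensity to each term is an assumption; the caption does not say how multiply annotated proteins were handled.

```python
from collections import defaultdict


def sum_intensity_per_term(protein_intensity, protein_terms):
    """Sum protein intensities for each annotation term within one organism.

    protein_intensity: dict mapping protein ID -> intensity
    protein_terms:     dict mapping protein ID -> iterable of annotation terms
    Proteins without any annotation are skipped.
    """
    totals = defaultdict(float)
    for protein_id, intensity in protein_intensity.items():
        for term in protein_terms.get(protein_id, ()):
            totals[term] += intensity
    return dict(totals)


# Hypothetical example: P3 has no biological-process annotation.
intensity = {"P1": 5.0, "P2": 3.0, "P3": 2.0}
terms = {"P1": ["translation"], "P2": ["translation", "protein folding"]}
per_term = sum_intensity_per_term(intensity, terms)
```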
Supplementary Table 1: Organisms analyzed in the study. All organisms analyzed in the study are listed with source and taxonomy.
Supplementary Table 2: Identified and quantified protein groups. All identified protein groups for the 100 organisms are listed and quantitative information is added for quantified protein groups.
Supplementary Table 3: Reported modified peptides. Peptides with biologically relevant modifications as found by the pFind tool are listed.
Supplementary Table 4: Identified and quantified protein groups for 14 human cell lines. The deep human proteome derived from 14 human cell lines is listed with all identified and quantified protein groups.
Supplementary Table 5: Detailed summary information for technical and biological proteomics data. Technically relevant information on the mass spectrometry data for the 100 organism proteomes is listed.
Supplementary Table 6: Annotation data for the 100 most abundant proteins of the 100 organisms. The 100 most abundant protein groups per organism are listed with annotation data.
Müller, J.B., Geyer, P.E., Colaço, A.R. et al. The proteome landscape of the kingdoms of life. Nature 582, 592–596 (2020). https://doi.org/10.1038/s41586-020-2402-x