The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires

A preprint version of the article is available at bioRxiv.

Abstract

Adaptive immune receptor repertoires (AIRR) are key targets for biomedical research as they record past and ongoing adaptive immune responses. The capacity of machine learning (ML) to identify complex discriminative sequence patterns renders it an ideal approach for AIRR-based diagnostic and therapeutic discovery. So far, widespread adoption of AIRR ML has been inhibited by a lack of reproducibility, transparency and interoperability. immuneML (immuneml.uio.no) addresses these concerns by implementing each step of the AIRR ML process in an extensible, open-source software ecosystem that is based on fully specified and shareable workflows. To facilitate widespread user adoption, immuneML is available as a command-line tool and through an intuitive Galaxy web interface, and extensive documentation of workflows is provided. We demonstrate the broad applicability of immuneML by (1) reproducing a large-scale study on immune state prediction, (2) developing, integrating and applying a novel deep learning method for antigen specificity prediction and (3) showcasing streamlined interpretability-focused benchmarking of AIRR ML.

Fig. 1: Overview of immuneML.
Fig. 2: Use cases demonstrating ML model training, benchmarking and platform extension.

Data availability

All data for the analyses presented in the manuscript are openly available. The detailed result files for use cases 1–3 are available as zip files at https://doi.org/10.11582/2021.00008 (ref. 78; use case 1), https://doi.org/10.11582/2021.00009 (ref. 81; use case 2) and https://doi.org/10.11582/2021.00005 (ref. 82; use case 3). Input data for use case 1 were downloaded from https://doi.org/10.21417/B7001Z.

Code availability

The immuneML source code is openly available on GitHub (github.com/uio-bmi/immuneML) under a free software license (AGPL-3.0). immuneML version 2.0.2 has been deposited on Zenodo at https://doi.org/10.5281/zenodo.5118741 (ref. 75). The immuneML Python package can be downloaded from PyPI (pypi.org/project/immuneML).
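
For readers who want to check which release they are working with, the following is a minimal sketch, using only the Python standard library, of how one might compare an installed immuneML distribution against the archived version 2.0.2 named above. The helper names `installed_version` and `matches_archived` are hypothetical and are not part of immuneML; the sketch assumes the distribution is registered under the name `immuneML`, as the PyPI URL suggests.

```python
from importlib import metadata
from typing import Optional

# Version deposited on Zenodo according to the Code availability statement.
ARCHIVED_VERSION = "2.0.2"


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def matches_archived(distribution: str = "immuneML") -> bool:
    """Return True only if the distribution is installed at the archived version."""
    return installed_version(distribution) == ARCHIVED_VERSION


if __name__ == "__main__":
    found = installed_version("immuneML")
    if found is None:
        print("immuneML is not installed in this environment")
    else:
        print(f"immuneML {found} installed; archived release is {ARCHIVED_VERSION}")
```

A mismatch does not imply an error; it only signals that results may differ from those produced with the archived release.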

References

  1. Brown, A. J. et al. Augmenting adaptive immunity: progress and challenges in the quantitative engineering and analysis of adaptive immune receptor repertoires. Mol. Syst. Des. Eng. 4, 701–736 (2019).

  2. Georgiou, G. et al. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat. Biotechnol. 32, 158–168 (2014).

  3. Yaari, G. & Kleinstein, S. H. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 7, 121 (2015).

  4. Csepregi, L., Ehling, R. A., Wagner, B. & Reddy, S. T. Immune literacy: reading, writing, and editing adaptive immunity. iScience 23, 101519 (2020).

  5. DeWitt, W. S. III et al. Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity. eLife 7, e38358 (2018).

  6. Emerson, R. O. et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat. Genet. 49, 659–665 (2017).

  7. Krishna, C., Chowell, D., Gönen, M., Elhanati, Y. & Chan, T. A. Genetic and environmental determinants of human TCR repertoire diversity. Immun. Ageing 17, 26 (2020).

  8. Britanova, O. V. et al. Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J. Immunol. 192, 2689–2698 (2014).

  9. Schneider-Hohendorf, T. et al. Sex bias in MHC I-associated shaping of the adaptive immune system. Proc. Natl Acad. Sci. USA 115, 2168–2173 (2018).

  10. Shemesh, O., Polak, P., Lundin, K. E. A., Sollid, L. M. & Yaari, G. Machine learning analysis of naïve B-cell receptor repertoires stratifies celiac disease patients and controls. Front. Immunol. 12, https://doi.org/10.3389/fimmu.2021.627813 (2021).

  11. Ostmeyer, J., Christley, S., Toby, I. T. & Cowell, L. G. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. https://doi.org/10.1158/0008-5472.CAN-18-2292 (2019).

  12. Beshnova, D. et al. De novo prediction of cancer-associated T cell receptors for noninvasive cancer detection. Sci. Transl. Med. 12, eaaz3738 (2020).

  13. Liu, X. et al. T cell receptor β repertoires as novel diagnostic markers for systemic lupus erythematosus and rheumatoid arthritis. Ann. Rheum. Dis. 78, 1070–1078 (2019).

  14. Arnaout, R. A. et al. The future of blood testing is the immunome. Front. Immunol. 12, 626793 (2021).

  15. Greiff, V., Yaari, G. & Cowell, L. Mining adaptive immune receptor repertoires for biological and clinical information using machine learning. Curr. Opin. Syst. Biol. https://doi.org/10.1016/j.coisb.2020.10.010 (2020).

  16. Akbar, R. et al. A compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding. Cell Rep. 34, 108856 (2021).

  17. Dash, P. et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547, 89–93 (2017).

  18. Glanville, J. et al. Identifying specificity groups in the T cell receptor repertoire. Nature 547, 94–98 (2017).

  19. Springer, I., Besser, H., Tickotsky-Moskovitz, N., Dvorkin, S. & Louzoun, Y. Prediction of specific TCR-peptide binding from large dictionaries of TCR-peptide pairs. Front. Immunol. 11, 1803 (2020).

  20. Friedensohn, S. et al. Convergent selection in antibody repertoires is revealed by deep learning. Preprint at bioRxiv https://doi.org/10.1101/2020.02.25.965673 (2020).

  21. Mason, D. M. et al. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat. Biomed. Eng. 5, 600–612 (2021).

  22. Moris, P. et al. Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification. Brief. Bioinform. https://doi.org/10.1093/bib/bbaa318 (2020).

  23. Graves, J. et al. A review of deep learning methods for antibodies. Antibodies 9, 12 (2020).

  24. Narayanan, H. et al. Machine learning for biologics: opportunities for protein engineering, developability, and formulation. Trends Pharmacol. Sci. 42, 151–165 (2021).

  25. Fischer, D. S., Wu, Y., Schubert, B. & Theis, F. J. Predicting antigen specificity of single T cells based on TCR CDR3 regions. Mol. Syst. Biol. 16, e9416 (2020).

  26. Laustsen, A. H., Greiff, V., Karatt-Vellatt, A., Muyldermans, S. & Jenkins, T. P. Animal immunization, in vitro display technologies, and machine learning for antibody discovery. Trends Biotechnol. https://doi.org/10.1016/j.tibtech.2021.03.003 (2021).

  27. Jokinen, E., Huuhtanen, J., Mustjoki, S., Heinonen, M. & Lähdesmäki, H. Predicting recognition between T cell receptors and epitopes with TCRGP. PLoS Comput. Biol. 17, e1008814 (2021).

  28. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

  29. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).

  30. Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. https://doi.org/10.1038/s41573-019-0024-5 (2019).

  31. Wainberg, M., Merico, D., Delong, A. & Frey, B. J. Deep learning in biomedicine. Nat. Biotechnol. 36, 829–838 (2018).

  32. Lythe, G., Callard, R. E., Hoare, R. L. & Molina-París, C. How many TCR clonotypes does a body maintain? J. Theor. Biol. 389, 214–224 (2016).

  33. Mora, T. & Walczak, A. M. How many different clonotypes do immune repertoires contain? Curr. Opin. Syst. Biol. 18, 104–110 (2019).

  34. Briney, B., Inderbitzin, A., Joyce, C. & Burton, D. R. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566, 393–397 (2019).

  35. Greiff, V. et al. Learning the high-dimensional immunogenomic features that predict public and private antibody repertoires. J. Immunol. https://doi.org/10.4049/jimmunol.1700594 (2017).

  36. Parameswaran, P. et al. Convergent antibody signatures in human dengue. Cell Host Microbe 13, 691–700 (2013).

  37. Thomas, N. et al. Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinformatics 30, 3181–3188 (2014).

  38. Christophersen, A. et al. Tetramer-visualized gluten-specific CD4+ T cells in blood as a potential diagnostic marker for coeliac disease without oral gluten challenge. United Eur. Gastroenterol. J. 2, 268–278 (2014).

  39. Widrich, M. et al. Modern Hopfield networks and attention for immune repertoire classification. Adv. Neural Inf. Process. Syst. 33, 18832–18845 (2020).

  40. Sidhom, J.-W., Larman, H. B., Pardoll, D. M. & Baras, A. S. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nat. Commun. 12, 1605 (2021).

  41. Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods 16, 315–318 (2019).

  42. Kopp, W., Monti, R., Tamburrini, A., Ohler, U. & Akalin, A. Deep learning for genomics using Janggu. Nat. Commun. 11, 3488 (2020).

  43. Feng, J. et al. Firmiana: towards a one-stop proteomic cloud platform for data processing and analysis. Nat. Biotechnol. 35, 409–412 (2017).

  44. Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).

  45. Tomic, A. et al. SIMON: Open-source knowledge discovery platform. Patterns 2, 100178 (2021).

  46. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

  47. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  48. Paszke, A. et al. in Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8026–8037 (Curran Associates, Inc., 2019).

  49. Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).

  50. Rubelt, F. et al. Adaptive immune receptor repertoire community recommendations for sharing immune-repertoire sequencing data. Nat. Immunol. 18, 1274–1278 (2017).

  51. Vander Heiden, J. A. et al. AIRR community standardized representations for annotated immune repertoires. Front. Immunol. 9, 2206 (2018).

  52. Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).

  53. Gupta, N. T. et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31, 3356–3358 (2015).

  54. Vander Heiden, J. A. et al. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30, 1930–1932 (2014).

  55. Nazarov, V., immunarch.bot & Rumynskiy, E. immunomind/immunarch: 0.6.5: basic single-cell support. Zenodo https://doi.org/10.5281/zenodo.3893991 (2020).

  56. Christley, S. et al. The ADC API: a web API for the programmatic query of the AIRR data commons. Front. Big Data 3, 22 (2020).

  57. Corrie, B. D. et al. iReceptor: a platform for querying and analyzing antibody/B-cell and T-cell receptor repertoire data across federated repositories. Immunol. Rev. 284, 24–41 (2018).

  58. Bagaev, D. V. et al. VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium. Nucleic Acids Res. 48, D1057–D1062 (2020).

  59. Huang, H., Wang, C., Rubelt, F., Scriba, T. J. & Davis, M. M. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0505-4 (2020).

  60. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

  61. Nolan, S. et al. A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-51964/v1 (2020).

  62. Weber, C. R. et al. immuneSIM: tunable multi-feature simulation of B- and T-cell receptor repertoires for immunoinformatics benchmarking. Bioinformatics 36, 3594–3596 (2020).

  63. Marcou, Q., Mora, T. & Walczak, A. M. High-throughput immune repertoire analysis with IGoR. Nat. Commun. 9, 561 (2018).

  64. Sethna, Z., Elhanati, Y., Callan, C. G., Walczak, A. M. & Mora, T. OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs. Bioinformatics 35, 2974–2981 (2019).

  65. FAIR principles for data stewardship. Nat. Genet. 48, 343 (2016).

  66. Scott, J. K. & Breden, F. The adaptive immune receptor repertoire community as a model for FAIR stewardship of big immunology data. Curr. Opin. Syst. Biol. 24, 71–77 (2020).

  67. Breden, F. et al. Reproducibility and reuse of adaptive immune receptor repertoire data. Front. Immunol. 8, 1418 (2017).

  68. Software with impact. Nat. Methods 11, 211 (2014).

  69. Goodman, S. N., Fanelli, D. & Ioannidis, J. P. A. What does research reproducibility mean? Sci. Transl. Med. 8, 341ps12 (2016).

  70. Mayer-Blackwell, K. et al. TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection. Preprint at bioRxiv https://doi.org/10.1101/2020.12.24.424260 (2020).

  71. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th USENIX Conference on Operating Systems Design and Implementation 265–283 (USENIX Association, 2016).

  72. Vujovic, M. et al. T cell receptor sequence clustering and antigen specificity. Comput. Struct. Biotechnol. J. 18, 2166–2173 (2020).

  73. Davidsen, K. et al. Deep generative models for T cell receptor protein sequences. eLife 8, e46935 (2019).

  74. Bareinboim, E. & Pearl, J. Causal inference and the data-fusion problem. Proc. Natl Acad. Sci. USA 113, 7345–7352 (2016).

  75. Pavlovic, M. et al. immuneML: v2.0.2. Zenodo https://doi.org/10.5281/zenodo.5118741 (2021).

  76. Fowler, M. Domain-Specific Languages (Addison-Wesley Professional, 2010).

  77. Zenger, M. Programming Language Abstractions for Extensible Software Components Ch. 1.3 (Swiss Federal Institute of Technology, 2004).

  78. Pavlović, M. immuneML use case 1: replication of a published study inside immuneML. NIRD Research Data Archive https://doi.org/10.11582/2021.00008 (2021).

  79. Ploenzke, M. S. & Irizarry, R. A. Interpretable convolution methods for learning genomic sequence motifs. Preprint at bioRxiv https://doi.org/10.1101/411934 (2018).

  80. Heikkilä, N. et al. Human thymic T cell repertoire is imprinted with strong convergence to shared sequences. Mol. Immunol. 127, 112–123 (2020).

  81. Pavlović, M. immuneML use case 2: extending immuneML with a deep learning component for predicting antigen specificity of paired receptor data. NIRD Research Data Archive https://doi.org/10.11582/2021.00009 (2021).

  82. Scheffer, L. immuneML use case 3: benchmarking ML methods for AIRR classification on ground-truth synthetic data. NIRD Research Data Archive https://doi.org/10.11582/2021.00005 (2021).

Acknowledgements

We acknowledge generous support by The Leona M. and Harry B. Helmsley Charitable Trust (grant number 2019PG-T1D011, to V.G. and T.M.B.), the UiO World-Leading Research Community (to V.G. and L.M.S.), the UiO:LifeScience Convergence Environment Immunolingo (to V.G. and G.K.S.), EU Horizon 2020 iReceptorplus (grant number 825821, to V.G. and L.M.S.), a Research Council of Norway FRIPRO project (grant number 300740, to V.G.), a Research Council of Norway IKTPLUSS project (grant number 311341, to V.G. and G.K.S.), the National Institutes of Health (grant numbers P01 AI042288 and HIRN UG3 DK122638 to T.M.B.) and Stiftelsen Kristian Gerhard Jebsen (K.G. Jebsen Coeliac Disease Research Centre, to L.M.S. and G.K.S.). We acknowledge support from ELIXIR Norway in recognizing immuneML as a national node service.

Author information

Contributions

M.P., V.G. and G.K.S. conceived the study. M.P. and G.K.S. designed the overall software architecture. M.P., L.S. and K.M. developed the main platform code. M.P. and L.S. performed all analyses. M.P., L.S., C.K., F.L.M.B., R.A., G.S.A.H., G.B., M.C., R.F., I.G., S.G., P.-H.H., K.R., E.R., P.A.R., A.S., D.T., C.R.W. and M.W. created software or documentation content. R.K., N.V., K.W., L.S., M.P., A.A.C. and B.C. designed and developed the Galaxy tools. C.K., R.A., T.M.B., M.C., S.C., L.G.C., I.H.H., E.H., G.K., M.L.K., C.L.-A., A.M., T.M., J.P., K.R., P.A.R., A.R., I.S., L.M.S. and G.Y. provided critical feedback. M.P., L.S., V.G. and G.K.S. drafted the manuscript. V.G. and G.K.S. supervised the project. All authors read and approved the final manuscript and are personally accountable for its content.

Corresponding author

Correspondence to Geir Kjetil Sandve.

Ethics declarations

Competing interests

V.G. declares advisory board positions in aiNET GmbH and Enpicom B.V., and is a consultant for Roche/Genentech.

Additional information

Peer review information: Nature Machine Intelligence thanks Pieter Meysman, Ryan Emerson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–6 and Tables 1–4.

Reporting Summary

About this article

Cite this article

Pavlović, M., Scheffer, L., Motwani, K. et al. The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires. Nat Mach Intell 3, 936–944 (2021). https://doi.org/10.1038/s42256-021-00413-z

  • DOI: https://doi.org/10.1038/s42256-021-00413-z
