The physical sciences community is increasingly taking advantage of the possibilities offered by modern data science to solve problems in experimental chemistry and potentially to change the way we design, conduct and understand results from experiments. Successfully exploiting these opportunities involves considerable challenges. In this Expert Recommendation, we focus on experimental co-design and its importance to experimental chemistry. We provide examples of how data science is changing the way we conduct experiments, and we outline opportunities for further integration of data science and experimental chemistry to advance these fields. Our recommendations include establishing stronger links between chemists and data scientists; developing chemistry-specific data science methods; integrating algorithms, software and hardware to ‘co-design’ chemistry experiments from inception; and combining diverse and disparate data sources into a data network for chemistry research.
This article evolved from presentations and discussions at the workshop ‘At the Tipping Point: A Future of Fused Chemical and Data Science’ held in September 2020, sponsored by the Council on Chemical Sciences, Geosciences, and Biosciences of the US Department of Energy, Office of Science, Office of Basic Energy Sciences. The authors thank the members of the Council for their encouragement and assistance in developing this workshop. In addition, the authors are indebted to the agencies responsible for funding their individual research efforts, without which this work would not have been possible.
The authors declare no competing interests.
Peer review information
Nature Reviews Chemistry thanks Martin Green, Venkatasubramanian Viswanathan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Cambridge Structural Database: https://www.ccdc.cam.ac.uk/
Chemotion Repository: https://www.chemotion-repository.net/welcome
FAIR principles: https://www.go-fair.org/fair-principles/
IBM RXN: https://rxn.res.ibm.com/
Inorganic Crystal Structure Database: https://www.psds.ac.uk/icsd
Open Reaction Database: http://open-reaction-database.org
Protein Data Bank: https://www.rcsb.org/
PuRe Data Resources: https://www.energy.gov/science/office-science-pure-data-resources
Cite this article
Yano, J., Gaffney, K.J., Gregoire, J. et al. The case for data science in experimental chemistry: examples and recommendations. Nat Rev Chem 6, 357–370 (2022). https://doi.org/10.1038/s41570-022-00382-w