Without a proper accounting of known and unknown systematics and uncertainties, combining information across multiple surveys, wavelengths, and detectors may be risky. Realizing the true potential of multi-messenger and panchromatic astrophysics requires getting data integration right.
Astronomy is currently awash in data. From past surveys such as the Sloan Digital Sky Survey (SDSS)1 to current and upcoming surveys such as Gaia2 and the Legacy Survey of Space and Time (LSST) with the Vera C. Rubin Observatory3, astronomers will soon be collecting terabytes of data from billions of Solar System bodies, stars, and galaxies every night. With the synchronous operation of telescopes across the electromagnetic (EM) spectrum such as the James Webb Space Telescope, instruments measuring cosmic particles such as the IceCube Neutrino Observatory, and instruments measuring gravitational waves (GW) such as the Laser Interferometer Gravitational-Wave Observatory, the next decade of astronomy is poised to combine these varied datasets in full pursuit of ‘multi-messenger astronomy’. However, combining often disparate datasets is fraught with risks that, if not properly accounted for, could lead to serious degradation of downstream scientific analysis.
The sheer amount of data (big data) that needs to be processed has spurred the astronomy community to investigate data-driven methods such as artificial intelligence and machine learning (AIML). AIML methods are becoming an integral part of data reduction pipelines as well as of scientific analyses of individual data streams4,5. Substantial attention, however, must also be paid to the fundamental promise — and challenge — of multi-messenger astronomy that arises from trying to combine information from different streams (wide data). While statistical uncertainties, systematic trends, and selection effects may be small within individual datasets, their impact can be compounded many times over when multiple datasets are combined. These effects (both random and systematic) run the risk of overwhelming the underlying signals of interest, possibly entirely subverting the motivation for combining different data streams in the first place.
The idea of combining datasets from different sources is not a new one. The terms ‘data fusion’ and ‘data integration’ have been used in other scientific disciplines and in military and defence research for a number of years to describe techniques for combining data from different catalogues, databases, and sensors6,7. Similar to how astronomers hope to combine spectroscopic, kinematic, and other datasets to study our Milky Way Galaxy, scientists in fields such as geostatistics have been grappling with comparable problems for decades8. This external literature offers a wealth of knowledge that may help the astronomy community as we begin to confront these problems both within and between surveys. For instance, the Simonyi Survey Telescope being constructed at the Vera C. Rubin Observatory will have six optical filters. While only one filter can be used at any given time, each will eventually be used to gather data on the same sources on the sky at different times. In addition, while the LSST itself will generate numerous alerts and time-series data on specific objects, these datasets will need to be supplemented, integrated, and fused with datasets from other telescopes and instruments. Indeed, this fusion will be extremely important for upcoming surveys such as Euclid both to achieve their immediate science goals and to attain high legacy value9.
Given the forthcoming quantity and quality of astronomical data, figuring out how to integrate datasets robustly and effectively will soon become ubiquitous when studying both individual objects and large populations. A perfect example of this is the case of SN1987A (ref. 10). Before the supernova was first detected in optical and ultraviolet imaging surveys, it was simultaneously observed by multiple neutrino detectors. Associations with a possible progenitor were derived from previous imaging data from other surveys. Evolution of the subsequent circumstellar material relied on observations of the associated light curve as well as time-resolved spectroscopy of the ejecta across a wide range of wavelengths, including X-rays. And searches for possible stellar remnants such as neutron stars have relied on radio data. A similarly expansive, panchromatic effort will surely be undertaken to study any new nearby supernovae that occur in the upcoming decades.
Thinking about how one might begin to combine all the diverse datasets for an object like SN1987A highlights one of the fundamental statistical challenges of data integration: handling observations of vastly different size and scope. Naive approaches routinely fail to capture the fact that a handful of points from one dataset (for example, neutrinos) might be substantially more informative than hundreds from another (for example, optical imaging). In astronomy, the traditional way of dealing with these issues has been to develop a full statistical model of the data-generating process, including observational and model uncertainties, and use Bayesian inference to marginalize over the associated ‘nuisance’ model parameters11. However, approaches in genomics, where writing down a detailed model is often not possible, show that within a frequentist setting it is also viable to learn how to ‘weight’ various components in a data-driven way12,13. Hybrid approaches, where AIML methods are embedded as components of broader statistical models, have also shown promise in dealing with data-modelling challenges in other astronomical contexts14,15. While these alternatives may not always provide the same ‘physical’ interpretation as traditional approaches, they can often be faster and more expressive while still remaining statistically robust.
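As a toy sketch of this imbalance (all numbers here are hypothetical, not drawn from any real survey), the inverse-variance weighting that falls out of a simple Gaussian model already shows how a handful of precise measurements can carry almost all of the statistical weight against hundreds of noisy ones:

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 5.0

# Dataset A: a handful of high-precision points (for example, neutrino events)
sigma_a, n_a = 0.1, 5
data_a = rng.normal(true_value, sigma_a, size=n_a)

# Dataset B: hundreds of low-precision points (for example, noisy photometry)
sigma_b, n_b = 5.0, 500
data_b = rng.normal(true_value, sigma_b, size=n_b)

# Under a Gaussian model, the optimal combination weights each point by
# its inverse variance, 1 / sigma**2.
values = np.concatenate([data_a, data_b])
weights = np.concatenate([np.full(n_a, sigma_a**-2), np.full(n_b, sigma_b**-2)])
estimate = np.sum(weights * values) / np.sum(weights)

# Fraction of the total statistical weight carried by the few precise points
frac_a = (n_a * sigma_a**-2) / np.sum(weights)
print(f"combined estimate: {estimate:.3f}")
print(f"weight carried by the {n_a} precise points: {frac_a:.1%}")
```

Here the five precise points carry the overwhelming majority of the weight; a naive unweighted average would instead be dominated by the noisy dataset.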
The frequency and complexity of multi-messenger signals will increase dramatically in the coming years. It is therefore imperative to build up tools and expertise that combine datasets effectively and take full advantage of these observations. Searching for EM counterparts to GW events, or deciding which transient events to follow up with other telescopes, for example, requires not only a rapid allocation of resources but also a detailed understanding of how best to integrate these disparate sets of observations. The stakes for doing this properly also remain higher than ever, as cosmological tensions in measurements of H0 and S8 between various messengers have only heightened in recent years16,17. Similarly, estimates of the virial mass of the Milky Way Galaxy can vary significantly depending on the type of kinematic tracer data used (for example, globular clusters versus halo stars or stellar streams)18. Perhaps there are methods of combining these datasets fruitfully that would help resolve such discrepancies.
Dealing with the types of problems brought on by both 'big data' and 'wide data' is becoming a universal challenge across the sciences, and arriving at solutions will require that astronomers actively seek collaborations both within and across disciplines, especially statistics and computer science. We see three avenues where there remains ample room for growth:
Hierarchical models

With automatic differentiation software becoming both easily accessible and widely used, it is now possible to perform inference with models involving millions of parameters19. As a result, hierarchical models that include complex dependencies — including non-parametric, AIML-driven components — can now be fit at scale, such as the simultaneous calibration of galaxy spectral templates, redshift distributions, and systematic uncertainties when estimating photometric redshifts14. Given their flexibility in incorporating numerous sources of uncertainty at many levels of analysis, we believe these models have enormous potential to improve data integration efforts going forward.
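A minimal sketch of the partial pooling at the heart of such hierarchical models, using the closed-form normal-normal case rather than an autodiff-based fit (the surveys, offsets, and noise levels below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: four surveys measure the same population-level quantity,
# each with its own true offset (drawn from the population) and noise level.
pop_mean, pop_scatter = 2.0, 0.5
noise = np.array([0.2, 0.4, 0.8, 1.5])  # heteroscedastic per-survey noise
survey_truths = rng.normal(pop_mean, pop_scatter, size=noise.size)
observed = rng.normal(survey_truths, noise)

# In the conjugate normal-normal model, each survey's posterior mean is a
# precision-weighted compromise between its own measurement and the
# population mean; noisier surveys are "shrunk" more towards the population.
shrinkage = noise**2 / (noise**2 + pop_scatter**2)
posterior_means = shrinkage * pop_mean + (1.0 - shrinkage) * observed

for obs, post, s in zip(observed, posterior_means, shrinkage):
    print(f"observed {obs:+.2f} -> posterior mean {post:+.2f} (shrinkage {s:.2f})")
```

Real applications replace this two-level toy with many more levels and non-conjugate components, which is where gradient-based samplers built on automatic differentiation (for example, ref. 19) come in.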
Likelihood-free inference

As datasets, models, and observational selection effects become more complex, it is becoming increasingly difficult to estimate model parameters entirely from first principles. Improvements in computing power and AIML methods, however, have jump-started an entirely new class of approaches that use simulations of observations to constrain models directly20, such as estimating the chemical homogeneity of open clusters through direct data simulations21. We believe these approaches will become routinely used in the coming decades.
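A minimal sketch of the idea, using rejection-based approximate Bayesian computation (one of the simplest likelihood-free methods20) on a toy Gaussian problem where, for illustration only, we pretend the likelihood is unavailable:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Observed" data whose generative parameter we pretend not to know
true_mu = 3.0
observed = rng.normal(true_mu, 1.0, size=200)
obs_summary = observed.mean()

# Rejection ABC: draw parameters from the prior, simulate full datasets,
# and keep only draws whose simulated summary statistic lies close to the
# observed one. No likelihood is ever evaluated explicitly.
n_draws, tolerance = 20_000, 0.05
prior_draws = rng.uniform(-10.0, 10.0, size=n_draws)
simulations = rng.normal(prior_draws[:, None], 1.0, size=(n_draws, observed.size))
sim_summaries = simulations.mean(axis=1)
accepted = prior_draws[np.abs(sim_summaries - obs_summary) < tolerance]

print(f"accepted {accepted.size} of {n_draws} prior draws")
print(f"approximate posterior mean: {accepted.mean():.2f} (truth: {true_mu})")
```

The accepted draws approximate the posterior; in practice the simulator is a full forward model of the instrument and survey, and more sample-efficient AIML-based variants replace brute-force rejection.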
Robustness

Most models are tested and validated in scenarios where their default assumptions hold, rather than in scenarios where they are violated (either moderately or completely). Testing and validating inference when assumptions are violated will become increasingly important as the breadth of data analysed increases. While statistical robustness has been studied for decades22, applications of similar ideas in astronomy remain rare. Investigations of robustness in AIML through, for example, adversarial learning have opened entirely new avenues for model design and algorithms23. We anticipate that similar considerations in astronomy will also be fruitful.
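A toy example of the classical robust-statistics point22: under even mild contamination, an estimator with a breakdown point of zero (the mean) fails where a robust one (the median) does not. All numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

# 95% clean measurements plus 5% gross outliers (for example, mismatched
# cross-identifications between catalogues)
clean = rng.normal(0.0, 1.0, size=950)
outliers = rng.normal(50.0, 1.0, size=50)
data = np.concatenate([clean, outliers])

# The mean has a breakdown point of zero: 5% contamination drags it far
# from the true value of 0. The median tolerates up to 50% contamination.
print(f"mean:   {data.mean():+.2f}")
print(f"median: {np.median(data):+.2f}")
```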
The upcoming era of multi-messenger, panchromatic, all-sky astronomy is poised to be an incredibly exciting time for the field. But it will inevitably also be an incredibly messy one. Only by building robust frameworks for data integration, ones that go beyond AIML to accommodate all the complexity and challenges of combining large, diverse datasets, will the astronomy community be prepared to take full advantage of them in the decades to come.
1. York, D. G. et al. Astron. J. 120, 1579–1587 (2000).
2. Gaia Collaboration et al. Astron. Astrophys. 616, A1 (2018).
3. Ivezić, Ž. et al. Astrophys. J. 873, 111 (2019).
4. Mahabal, A. et al. Publ. Astron. Soc. Pac. 131, 038002 (2019).
5. Schlafly, E. F. et al. Astrophys. J. Suppl. Ser. 234, 39 (2018).
6. Goodman, I. R., Mahler, R. P. & Nguyen, H. T. Mathematics of Data Fusion (Springer Science & Business Media, 2013).
7. Doan, A., Halevy, A. & Ives, Z. Principles of Data Integration (Elsevier, 2012).
8. O’Neil-Dunne, J. P. M., MacFaden, S. W., Royar, A. R. & Pelletier, K. C. Geocarto Int. 28, 227–242 (2013).
9. Laureijs, R. et al. Preprint at https://arxiv.org/abs/1110.3193 (2011).
10. Arnett, W. D., Bahcall, J. N., Kirshner, R. P. & Woosley, S. E. Annu. Rev. Astron. Astrophys. 27, 629–700 (1989).
11. Johnson, B. D., Leja, J., Conroy, C. & Speagle, J. S. Astrophys. J. Suppl. Ser. 254, 22 (2021).
12. Lanckriet, G. R. G., De Bie, T., Cristianini, N., Jordan, M. I. & Noble, W. S. Bioinformatics 20, 2626–2635 (2004).
13. Hwang, D. et al. Proc. Natl Acad. Sci. USA 102, 17296–17301 (2005).
14. Leistedt, B., Hogg, D. W., Wechsler, R. H. & DeRose, J. Astrophys. J. 881, 80 (2019).
15. Cranmer, M. D., Galvez, R., Anderson, L., Spergel, D. N. & Ho, S. Preprint at https://arxiv.org/abs/1908.08045v1 (2019).
16. Di Valentino, E. et al. Class. Quantum Gravity 38, 153001 (2021).
17. Asgari, M. et al. Astron. Astrophys. 645, A104 (2021).
18. Wang, W. et al. Mon. Not. R. Astron. Soc. 453, 377–400 (2015).
19. Carpenter, B. et al. J. Stat. Softw. 76, 1–32 (2017).
20. Beaumont, M. A. Annu. Rev. Stat. Appl. 6, 379–403 (2019).
21. Bovy, J. Astrophys. J. 817, 49 (2016).
22. Huber, P. J. & Ronchetti, E. M. Robust Statistics 2nd edn (Wiley, 2009).
23. Carlini, N. et al. Preprint at https://arxiv.org/abs/1902.06705 (2019).
The authors declare no competing interests.
Speagle (沈佳士), J. S. & Eadie, G. M. Making the sum greater than its parts. Nat. Astron. 5, 971–972 (2021). https://doi.org/10.1038/s41550-021-01509-7