
A Poisson binomial-based statistical testing framework for comorbidity discovery across electronic health record datasets


Discovering the concomitant occurrence of distinct medical conditions in a patient, also known as comorbidities, is a prerequisite for creating patient outcome prediction tools. Current comorbidity discovery applications are designed for small datasets and use stratification to control for confounding variables such as age, sex or ancestry. Stratification lowers false positive rates, but reduces power, as the size of the study cohort is decreased. Here we describe a Poisson binomial-based approach to comorbidity discovery (PBC) designed for big-data applications that circumvents the need for stratification. PBC adjusts for confounding demographic variables on a per-patient basis and models temporal relationships. We benchmark PBC using two datasets to compute comorbidity statistics on 4,623,841 pairs of potentially comorbid medical terms. The results of this computation are provided as a searchable web resource. Compared with current methods, the PBC approach reduces false positive associations while retaining statistical power to discover true comorbidities.
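To make the statistical core concrete, the following minimal Python sketch computes an upper-tail P-value from a Poisson binomial distribution, which is the distribution of a sum of independent Bernoulli trials with unequal success probabilities. It is an illustration under stated assumptions, not the PBC implementation: the function names are hypothetical, the per-patient probabilities are made-up numbers standing in for the output of some demographic model, and a simple dynamic-programming recursion is used in place of faster methods such as the one described in ref. 49.

```python
from typing import List, Sequence


def poisson_binomial_pmf(probs: Sequence[float]) -> List[float]:
    """Exact PMF of the number of successes among independent Bernoulli
    trials whose success probabilities may all differ."""
    pmf = [1.0]  # with zero trials, P(K = 0) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this trial fails: count stays at k
            nxt[k + 1] += mass * p      # this trial succeeds: count becomes k + 1
        pmf = nxt
    return pmf


def upper_tail_p_value(probs: Sequence[float], observed: int) -> float:
    """P(K >= observed) under the Poisson binomial null distribution."""
    pmf = poisson_binomial_pmf(probs)
    return min(1.0, sum(pmf[observed:]))


if __name__ == "__main__":
    # Hypothetical per-patient probabilities of carrying term A and term B,
    # e.g. as predicted from age, sex or ancestry by some regression model.
    p_a = [0.10, 0.30, 0.05, 0.20, 0.15]
    p_b = [0.20, 0.25, 0.10, 0.40, 0.05]
    # Assumption: under the null of independence, a patient carries both
    # terms with probability p_a * p_b.
    p_both = [a * b for a, b in zip(p_a, p_b)]
    observed_pairs = 2  # made-up count of patients observed with both terms
    print(upper_tail_p_value(p_both, observed_pairs))
```

Because each patient contributes an individual probability, the cohort does not have to be split into strata to account for demographic covariates. The recursion above costs time quadratic in the number of patients, so it is suitable only for small illustrations.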


Fig. 1: Feature selection by L1 regularization.
Fig. 2: Modeling the effects of confounding variables.
Fig. 3: Screenshots from PBC-Web.

Data availability

In this paper we calculate comorbidity statistics for all pairs of medical billing codes, including diagnoses, procedures and medications. All of these P-values are available to query and download from the following link: Furthermore, a 2.3 GB file containing comorbidity statistics for all 4,623,841 pairs of medical terms can be downloaded from the Open Science Framework (ref. 56). Source Data for Figs. 1 and 2 and Extended Data Figs. 1, 2 and 4 are available with this manuscript. The original input data include detailed medical records from University of Utah Health. Because these data include protected health information (PHI; patient demographics, birth dates and dated medical diagnosis, procedure and medication codes), we cannot make the data available with this publication.

Code availability

We provide a CodeOcean capsule (ref. 57) including code and sample input data.


References

  1. Valderas, J. M., Starfield, B., Sibbald, B., Salisbury, C. & Roland, M. Defining comorbidity: implications for understanding health and health services. Ann. Fam. Med. 7, 357–363 (2009).
  2. Lone, N. I. et al. Predicting risk of unplanned hospital readmission in survivors of critical illness: a population-level cohort study. Thorax 74, 1046–1054 (2019).
  3. Wang, H. et al. Predicting hospital readmission via cost-sensitive deep learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 15, 1968–1978 (2018).
  4. Facchinetti, G. et al. Continuity of care interventions for preventing hospital readmission of older people with chronic diseases: a meta-analysis. Int. J. Nurs. Stud. 101, 103396 (2020).
  5. Atashi, A., Sarbaz, M., Marashi, S., Hajialiasgari, F. & Eslami, S. Intensive care decision making: using prognostic models for resource allocation. Stud. Health Technol. Inform. 251, 145–148 (2018).
  6. Yurkovich, M., Avina-Zubieta, J. A., Thomas, J., Gorenchtein, M. & Lacaille, D. A systematic review identifies valid comorbidity indices derived from administrative health data. J. Clin. Epidemiol. 68, 3–14 (2015).
  7. Charlson, M. E., Pompei, P., Ales, K. L. & MacKenzie, C. R. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J. Chronic Dis. 40, 373–383 (1987).
  8. Elixhauser, A., Steiner, C., Harris, D. R. & Coffey, R. M. Comorbidity measures for use with administrative data. Med. Care 36, 8–27 (1998).
  9. Roque, F. S. et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput. Biol. 7, e1002141 (2011).
  10. Gutiérrez-Sacristán, A. et al. comoRbidity: an R package for the systematic analysis of disease comorbidities. Bioinformatics 34, 3228–3230 (2018).
  11. Moni, M. A., Xu, H. & Liò, P. CytoCom: a Cytoscape app to visualize, query and analyse disease comorbidity networks. Bioinformatics 31, 969–971 (2015).
  12. Moni, M. A. & Liò, P. comoR: a software for disease comorbidity risk assessment. J. Clin. Bioinforma. 4, 8 (2014).
  13. Ronzano, F., Gutiérrez-Sacristán, A. & Furlong, L. I. Comorbidity4j: a tool for interactive analysis of disease comorbidities over large patient datasets. Bioinformatics 35, 3530–3532 (2019).
  14. Siggaard, T. et al. Disease trajectory browser for exploring temporal, population-wide disease progression patterns in 7.2 million Danish patients. Nat. Commun. 11, 4952 (2020).
  15. Winter, A. C., Rist, P. M., Buring, J. E. & Kurth, T. Prospective comorbidity-matched study of Parkinson’s disease and risk of mortality among women. BMJ Open 6, e011888 (2016).
  16. Johnson, A. et al. MIMIC-IV (Version 1.0) (PhysioNet, 2021);
  17. ICD-9-CM - International Classification of Diseases, Ninth Revision, Clinical Modification (CDC, 2019);
  18. ICD-10-CM - International Classification of Diseases, Tenth Revision, Clinical Modification (CDC, 2020);
  19. Clinical Classifications Software Refined (CCSR) (AHRQ, 2021);
  20. CPT Codes (AAPC, 2021);
  21. Liu, S. et al. RxNorm: prescription for electronic drug information exchange. IT Prof. 7, 17–23 (2005).
  22. Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
  23. Seifert, T. The relationship of migraine and other headache disorders to concussion. Handb. Clin. Neurol. 158, 119–126 (2018).
  24. Shimanovsky, A. et al. Autoimmune manifestations in patients with multiple myeloma and monoclonal gammopathy of undetermined significance. BBA Clin. 6, 12–18 (2016).
  25. Saif, M. W., Kaley, K. & Lamb, L. Pancreatic adenocarcinoma complicated by sinistral portal hypertension. Cureus 8, e689 (2016).
  26. Han, H. et al. Hypertension and breast cancer risk: a systematic review and meta-analysis. Sci. Rep. 7, 44877 (2017).
  27. Li, X. et al. Comorbidities among patients with cancer who do and do not develop febrile neutropenia during the first chemotherapy cycle. J. Oncol. Pharm. Pract. 22, 679–689 (2016).
  28. Chia, V. M. et al. Chronic comorbid conditions associated with risk of febrile neutropenia in breast cancer patients treated with chemotherapy. Breast Cancer Res. Treat. 138, 621–631 (2013).
  29. Toma-Dasu, I., Wojcik, A. & Kjellsson Lindblom, E. Risk of second cancer following radiotherapy. Phys. Med. 42, 211–212 (2017).
  30. Donin, N. et al. Risk of second primary malignancies among cancer survivors in the United States, 1992 through 2008. Cancer 122, 3075–3086 (2016).
  31. Grantzau, T. & Overgaard, J. Risk of second non-breast cancer among patients treated with and without postoperative radiotherapy for primary breast cancer: a systematic review and meta-analysis of population-based studies including 522,739 patients. Radiother. Oncol. 121, 402–413 (2016).
  32. Rissanen, J. Modeling by shortest data description. Automatica 14, 465–471 (1978).
  33. Hassell, K. L. Population estimates of sickle cell disease in the U.S. Am. J. Prev. Med. 38, S512–S521 (2010).
  34. Ahmadi, M., Poormansouri, S., Beiranvand, S. & Sedighie, L. Predictors and correlates of fatigue in sickle cell disease patients. Int. J. Hematol.-Oncol. Stem Cell Res. 12, 69–76 (2018).
  35. Herson, J., Sharma, S., Crocker, C. L. & Jones, D. Physical complaints of patients with sickle cell trait. J. Reprod. Med. 14, 129–132 (1975).
  36. Aich, A., Jones, M. K. & Gupta, K. Pain and sickle cell disease. Curr. Opin. Hematol. 26, 131–138 (2019).
  37. Tariq, S. & Aronow, W. S. Use of inotropic agents in treatment of systolic heart failure. Int. J. Mol. Sci. 16, 29060–29068 (2015).
  38. Anders, H.-J., Huber, T. B., Isermann, B. & Schiffer, M. CKD in diabetes: diabetic kidney disease versus nondiabetic kidney disease. Nat. Rev. Nephrol. 14, 361–377 (2018).
  39. Koye, D. N., Magliano, D. J., Nelson, R. G. & Pavkov, M. E. The global epidemiology of diabetes and kidney disease. Adv. Chronic Kidney Dis. 25, 121–132 (2018).
  40. El Fane, M. et al. Pneumocystosis during HIV infection. Rev. Pneumol. Clin. 72, 248–254 (2016).
  41. Seravalle, G. & Grassi, G. Obesity and hypertension. Pharmacol. Res. 122, 1–7 (2017).
  42. Hassaine, A., Salimi-Khorshidi, G., Canoy, D. & Rahimi, K. Untangling the complexity of multimorbidity with machine learning. Mech. Ageing Dev. 190, 111325 (2020).
  43. Hassaine, A. et al. Learning multimorbidity patterns from electronic health records using non-negative matrix factorisation. J. Biomed. Inform. 112, 103606 (2020).
  44. Chandrasekaran, B. & Jain, A. K. Quantization complexity and independent measurements. IEEE Trans. Comput. C-23, 102–106 (1974).
  45. Trunk, G. V. A problem of dimensionality: a simple example. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1, 306–307 (1979).
  46. Capobianco, E. & Liò, P. Comorbidity: a multidimensional approach. Trends Mol. Med. 19, 515–521 (2013).
  47. Linear Models Section 1.1, Scikit-learn 0.24.1 Documentation (ScikitLearn, 2021);
  48. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
  49. Hong, Y. On computing the distribution function for the Poisson binomial distribution. Comput. Stat. Data Anal. 59, 41–51 (2013).
  50. Hart, J. F. Computer Approximations (Wiley, 1968).
  51. GNU Scientific Library: Reference Manual (Network Theory, 2009).
  52. Behnel, S. et al. Cython: the best of both worlds. Comput. Sci. Eng. 13, 31–39 (2011).
  53. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
  54. Grinberg, M. Flask Web Development: Developing Web Applications with Python (O’Reilly, 2018).
  55. Bostock, M., Ogievetsky, V. & Heer, J. D3: data-driven documents. IEEE Trans. Vis. Comput. Graph. 17, 2301–2309 (2011).
  56. Lemmon, G., Wesolowski, S., Henrie, A., Tristani-Firouzi, M. & Yandell, M. PBC Comorbidities (OSF, 2021);
  57. Lemmon, G., Wesolowski, S., Henrie, A., Tristani-Firouzi, M. & Yandell, M. A Poisson Binomial Based Statistical Testing Framework for Comprehensive Comorbidity Discovery Across Massive Electronic Health Record Datasets (CodeOcean, 2021);



Acknowledgements

The following collaborators provided valuable discussion, feedback and insight that guided the development of PBC: B. Bray, V. Deshmukh, K. Eilbeck, E. J. Hernandez and R. Shah. We thank members of the University of Utah EDW for facilitating access to medical records. The computational resources used were partially funded by the NIH Shared Instrumentation Grant 1S10OD021644-01A1. This research was supported by the AHA Children’s Strategically Focused Research Network grant (17SFRN33630041) and the Nora Eccles Treadwell Foundation. G. Lemmon was supported by NRSA training grant T32H757632. S. Wesolowski was supported by NRSA training grant T32DK110966-04 and the AHA Children’s Strategically Focused Research Network Fellowship award (17SFRN33630041).

Author information




Contributions

G.L. was the senior research associate leading PBC development and validation. S.W. is an applied mathematician who helped formalize our approach to statistical testing. A.H. was a software engineer on the project. M.T.-F. and M.Y. conceived the project, secured research funding and played a key role in scientific discussions regarding the development of PBC. All authors edited the manuscript.

Corresponding authors

Correspondence to Martin Tristani-Firouzi or Mark Yandell.

Ethics declarations

Competing interests

G.L. and M.Y. own shares in Backdrop Health, a University of Utah effort to commercialize Bayesian inference on health records; however, there are no financial ties regarding this research. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks Jeffrey P. Rewley and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 University of Utah medical records binned by age-decade.

Boxplots show median (black line), 25th and 75th percentile (box ends), 95th and 5th percentile (whisker caps) and outliers. Number of terms (bottom panel) is a count of distinct diagnoses, procedures and medications found in each patient’s medical history.


Extended Data Fig. 2 Comparison of score functions for logistic regression C-value optimization.

For each score function, we evaluated C-values ranging from 10^−14 to 10^14. (a) For each of 3,041 diagnosis (DX), procedure (PX) and medication (RX) terms, we use cross-validation to select the C-value that achieves the best score. Each boxplot contains these 3,041 best scores as evaluated with a different score function. (b) Distribution of C-values for the three score functions with high entropy. Jcutoff was chosen for downstream analysis because it has high entropy and a smooth C-value distribution without the large outlier at C = −14.
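The legend does not define the Jcutoff score. Its name, together with the citation of Youden's index (ref. 22), suggests a score based on Youden's J statistic evaluated at the best probability cutoff, but that reading is an assumption. Under that assumption, a minimal Python sketch of such a score could look as follows; the function names, the toy labels and the one-value-per-decade C grid are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def youden_j(labels: Sequence[int], scores: Sequence[float], cutoff: float) -> float:
    """Youden's J = sensitivity + specificity - 1 at a given probability cutoff.
    A patient is predicted positive when the score is >= cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def best_cutoff(labels: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Return (cutoff, J) for the observed score that maximizes J."""
    candidates = sorted(set(scores))
    j_values = [youden_j(labels, scores, c) for c in candidates]
    best = max(range(len(candidates)), key=j_values.__getitem__)
    return candidates[best], j_values[best]


# Assumed grid: one C-value per decade from 1e-14 to 1e14 (the spacing is a guess).
C_GRID: List[float] = [10.0 ** e for e in range(-14, 15)]

if __name__ == "__main__":
    # Toy held-out labels and predicted probabilities for a single C-value.
    y_true = [0, 0, 1, 1, 0, 1]
    y_prob = [0.10, 0.35, 0.40, 0.80, 0.20, 0.65]
    print(best_cutoff(y_true, y_prob))
```

In a C-value search of the kind the legend describes, a score like this would be computed on held-out folds for every value in the grid, and the C-value with the best average score would be kept.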


Extended Data Fig. 3 Minimum description length of the comorbidity network discovered by PBC for diagnoses in the University of Utah EDW.

Examples of significantly associated medical conditions within each cluster are displayed. Citations supporting these associations are listed in Supplementary Table 6.

Extended Data Fig. 4 Deployment of PBC on MIMIC-IV EHR data.

See Fig. 1 legend for description of (a) and Fig. 2 legend for description of (b). In (b), the x-axis ticks correspond to the cumulative addition, from left to right, of regression features (PBC) or stratification criteria: 0: no features, no stratification; 1: gender/female; 2: ancestry/African American; 3: length of medical history/at least 2 years; 4: number of visits/at least 3 visits. The MIMIC-IV results are very similar to the University of Utah results, reinforcing a key message of this paper: PBC retains the power to identify comorbid relationships that are lost by stratification.


Supplementary information

Supplementary Information

Supplementary Tables 1–6, Figs. 1 and 2, and Methods (a step-by-step explanation of the mathematics used to calculate pairwise comorbidity P-values).

Source data

Source Data for Fig. 1.

Source Data for Fig. 2.

Source Data for Extended Data Fig. 1.

Source Data for Extended Data Fig. 2.

Source Data for Extended Data Fig. 4.


Cite this article

Lemmon, G., Wesolowski, S., Henrie, A. et al. A Poisson binomial-based statistical testing framework for comorbidity discovery across electronic health record datasets. Nat Comput Sci 1, 694–702 (2021).
