Discovering the concomitant occurrence of distinct medical conditions in a patient, also known as comorbidities, is a prerequisite for creating patient outcome prediction tools. Current comorbidity discovery applications are designed for small datasets and use stratification to control for confounding variables such as age, sex or ancestry. Stratification lowers false positive rates, but reduces power, as the size of the study cohort is decreased. Here we describe a Poisson binomial-based approach to comorbidity discovery (PBC) designed for big-data applications that circumvents the need for stratification. PBC adjusts for confounding demographic variables on a per-patient basis and models temporal relationships. We benchmark PBC using two datasets to compute comorbidity statistics on 4,623,841 pairs of potentially comorbid medical terms. The results of this computation are provided as a searchable web resource. Compared with current methods, the PBC approach reduces false positive associations while retaining statistical power to discover true comorbidities.
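The core statistical idea can be illustrated with a minimal sketch. It assumes (this is an illustration, not the authors' published implementation) that each patient i has a model-derived probability of carrying term A and of carrying term B, that under the null hypothesis of no association the per-patient probability of carrying both is the product of the two, and that the P-value is the upper tail of the resulting Poisson binomial distribution at the observed co-occurrence count. The function names are hypothetical, and the exact distribution is computed with a simple O(n^2) dynamic programme rather than the faster method of Hong cited in the reference list.

```python
def poisson_binomial_pmf(probs):
    """Exact PMF of the number of successes among independent,
    non-identically distributed Bernoulli trials (O(n^2) dynamic programme)."""
    pmf = [1.0]  # with zero trials, P(0 successes) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this trial fails
            nxt[k + 1] += mass * p       # this trial succeeds
        pmf = nxt
    return pmf


def upper_tail_p_value(probs, observed):
    """P(X >= observed) for X ~ PoissonBinomial(probs)."""
    pmf = poisson_binomial_pmf(probs)
    return sum(pmf[observed:])


def comorbidity_p_value(p_a, p_b, observed_both):
    """Hypothetical helper: p_a[i] and p_b[i] are per-patient probabilities
    of terms A and B (assumed to come from a demographic model). Under the
    null of independence, patient i carries both with probability
    p_a[i] * p_b[i]."""
    joint = [a * b for a, b in zip(p_a, p_b)]
    return upper_tail_p_value(joint, observed_both)
```

Because every patient contributes an individual probability, demographic adjustment happens inside the null distribution itself, which is why no cohort needs to be split into strata.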
In this paper we calculate comorbidity statistics for all pairs of medical billing codes, including diagnoses, procedures and medications. All of these P-values are available to query and download at https://pbc.genetics.utah.edu/lemmon2021. Furthermore, a 2.3 GB file containing comorbidity statistics for all 4,623,841 pairs of medical terms can be downloaded from the Open Science Framework (ref. 56). Source Data for Figs. 1 and 2 and Extended Data Figs. 1–4 are available with this manuscript. The original input data include detailed medical records from University of Utah Health. Because these data include protected health information (PHI; patient demographics, birth dates and dated medical diagnosis, procedure and medication codes), we cannot make them available with this publication.
We provide a CodeOcean capsule (ref. 57) including code and sample input data.
Valderas, J. M., Starfield, B., Sibbald, B., Salisbury, C. & Roland, M. Defining comorbidity: implications for understanding health and health services. Ann. Fam. Med. 7, 357–363 (2009).
Lone, N. I. et al. Predicting risk of unplanned hospital readmission in survivors of critical illness: a population-level cohort study. Thorax 74, 1046–1054 (2019).
Wang, H. et al. Predicting hospital readmission via cost-sensitive deep learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 15, 1968–1978 (2018).
Facchinetti, G. et al. Continuity of care interventions for preventing hospital readmission of older people with chronic diseases: a meta-analysis. Int. J. Nurs. Stud. 101, 103396 (2020).
Atashi, A., Sarbaz, M., Marashi, S., Hajialiasgari, F. & Eslami, S. Intensive care decision making: using prognostic models for resource allocation. Stud. Health Technol. Inform. 251, 145–148 (2018).
Yurkovich, M., Avina-Zubieta, J. A., Thomas, J., Gorenchtein, M. & Lacaille, D. A systematic review identifies valid comorbidity indices derived from administrative health data. J. Clin. Epidemiol. 68, 3–14 (2015).
Charlson, M. E., Pompei, P., Ales, K. L. & MacKenzie, C. R. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J. Chronic Dis. 40, 373–383 (1987).
Elixhauser, A., Steiner, C., Harris, D. R. & Coffey, R. M. Comorbidity measures for use with administrative data. Med. Care 36, 8–27 (1998).
Roque, F. S. et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput. Biol. 7, e1002141 (2011).
Gutiérrez-Sacristán, A. et al. comoRbidity: an R package for the systematic analysis of disease comorbidities. Bioinformatics 34, 3228–3230 (2018).
Moni, M. A., Xu, H. & Liò, P. CytoCom: a Cytoscape app to visualize, query and analyse disease comorbidity networks. Bioinformatics 31, 969–971 (2015).
Moni, M. A. & Liò, P. comoR: a software for disease comorbidity risk assessment. J. Clin. Bioinforma. 4, 8 (2014).
Ronzano, F., Gutiérrez-Sacristán, A. & Furlong, L. I. Comorbidity4j: a tool for interactive analysis of disease comorbidities over large patient datasets. Bioinformatics 35, 3530–3532 (2019).
Siggaard, T. et al. Disease trajectory browser for exploring temporal, population-wide disease progression patterns in 7.2 million Danish patients. Nat. Commun. 11, 4952 (2020).
Winter, A. C., Rist, P. M., Buring, J. E. & Kurth, T. Prospective comorbidity-matched study of Parkinson’s disease and risk of mortality among women. BMJ Open 6, e011888 (2016).
Johnson, A. et al. MIMIC-IV (Version 1.0) (PhysioNet, 2021); https://doi.org/10.13026/S6N6-XD98
ICD-9-CM—International Classification of Diseases, Ninth Revision, Clinical Modification (CDC, 2019); https://www.cdc.gov/nchs/icd/icd9cm.htm
ICD-10-CM—International Classification of Diseases, Tenth Revision, Clinical Modification (CDC, 2020); https://www.cdc.gov/nchs/icd/icd10cm.htm
Clinical Classifications Software Refined (CCSR) (AHRQ, 2021); https://www.hcup-us.ahrq.gov/toolssoftware/ccsr/ccs_refined.jsp
CPT Codes (AAPC, 2021); https://www.aapc.com/resources/medical-coding/cpt.aspx
Liu, S. et al. RxNorm: prescription for electronic drug information exchange. IT Prof. 7, 17–23 (2005).
Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
Seifert, T. The relationship of migraine and other headache disorders to concussion. Handb. Clin. Neurol. 158, 119–126 (2018).
Shimanovsky, A. et al. Autoimmune manifestations in patients with multiple myeloma and monoclonal gammopathy of undetermined significance. BBA Clin. 6, 12–18 (2016).
Saif, M. W., Kaley, K. & Lamb, L. Pancreatic adenocarcinoma complicated by sinistral portal hypertension. Cureus 8, e689 (2016).
Han, H. et al. Hypertension and breast cancer risk: a systematic review and meta-analysis. Sci. Rep. 7, 44877 (2017).
Li, X. et al. Comorbidities among patients with cancer who do and do not develop febrile neutropenia during the first chemotherapy cycle. J. Oncol. Pharm. Pract. 22, 679–689 (2016).
Chia, V. M. et al. Chronic comorbid conditions associated with risk of febrile neutropenia in breast cancer patients treated with chemotherapy. Breast Cancer Res. Treat. 138, 621–631 (2013).
Toma-Dasu, I., Wojcik, A. & Kjellsson Lindblom, E. Risk of second cancer following radiotherapy. Phys. Med. 42, 211–212 (2017).
Donin, N. et al. Risk of second primary malignancies among cancer survivors in the United States, 1992 through 2008. Cancer 122, 3075–3086 (2016).
Grantzau, T. & Overgaard, J. Risk of second non-breast cancer among patients treated with and without postoperative radiotherapy for primary breast cancer: a systematic review and meta-analysis of population-based studies including 522,739 patients. Radiother. Oncol. 121, 402–413 (2016).
Rissanen, J. Modeling by shortest data description. Automatica 14, 465–471 (1978).
Hassell, K. L. Population estimates of sickle cell disease in the U.S. Am. J. Prev. Med. 38, S512–S521 (2010).
Ahmadi, M., Poormansouri, S., Beiranvand, S. & Sedighie, L. Predictors and correlates of fatigue in sickle cell disease patients. Int. J. Hematol.-Oncol. Stem Cell Res. 12, 69–76 (2018).
Herson, J., Sharma, S., Crocker, C. L. & Jones, D. Physical complaints of patients with sickle cell trait. J. Reprod. Med. 14, 129–132 (1975).
Aich, A., Jones, M. K. & Gupta, K. Pain and sickle cell disease. Curr. Opin. Hematol. 26, 131–138 (2019).
Tariq, S. & Aronow, W. S. Use of inotropic agents in treatment of systolic heart failure. Int. J. Mol. Sci. 16, 29060–29068 (2015).
Anders, H.-J., Huber, T. B., Isermann, B. & Schiffer, M. CKD in diabetes: diabetic kidney disease versus nondiabetic kidney disease. Nat. Rev. Nephrol. 14, 361–377 (2018).
Koye, D. N., Magliano, D. J., Nelson, R. G. & Pavkov, M. E. The global epidemiology of diabetes and kidney disease. Adv. Chronic Kidney Dis. 25, 121–132 (2018).
El Fane, M. et al. Pneumocystosis during HIV infection. Rev. Pneumol. Clin. 72, 248–254 (2016).
Seravalle, G. & Grassi, G. Obesity and hypertension. Pharmacol. Res. 122, 1–7 (2017).
Hassaine, A., Salimi-Khorshidi, G., Canoy, D. & Rahimi, K. Untangling the complexity of multimorbidity with machine learning. Mech. Ageing Dev. 190, 111325 (2020).
Hassaine, A. et al. Learning multimorbidity patterns from electronic health records using non-negative matrix factorisation. J. Biomed. Inform. 112, 103606 (2020).
Chandrasekaran, B. & Jain, A. K. Quantization complexity and independent measurements. IEEE Trans. Comput. C-23, 102–106 (1974).
Trunk, G. V. A problem of dimensionality: a simple example. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1, 306–307 (1979).
Capobianco, E. & Liò, P. Comorbidity: a multidimensional approach. Trends Mol. Med. 19, 515–521 (2013).
Linear Models Section 1.1, Scikit-learn 0.24.1 Documentation (ScikitLearn, 2021); https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Hong, Y. On computing the distribution function for the Poisson binomial distribution. Comput. Stat. Data Anal. 59, 41–51 (2013).
Hart, J. F. Computer Approximations (Wiley, 1968).
GNU Scientific Library: Reference Manual (Network Theory, 2009).
Behnel, S. et al. Cython: the best of both worlds. Comput. Sci. Eng. 13, 31–39 (2011).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Grinberg, M. Flask Web Development: Developing Web Applications with Python (O’Reilly, 2018).
Bostock, M., Ogievetsky, V. & Heer, J. D3: data-driven documents. IEEE Trans. Vis. Comput. Graph. 17, 2301–2309 (2011).
Lemmon, G., Wesolowski, S., Henrie, A., Tristani-Firouzi, M. & Yandell, M. PBC Comorbidities (OSF, 2021); https://doi.org/10.17605/OSF.IO/TH239
Lemmon, G., Wesolowski, S., Henrie, A., Tristani-Firouzi, M. & Yandell, M. A Poisson Binomial Based Statistical Testing Framework for Comprehensive Comorbidity Discovery Across Massive Electronic Health Record Datasets (CodeOcean, 2021); https://doi.org/10.24433/CO.2251918.v1
The following collaborators provided valuable discussion, feedback and insight that guided the development of PBC: B. Bray, V. Deshmukh, K. Eilbeck, E. J. Hernandez and R. Shah. We thank members of the University of Utah EDW for facilitating access to medical records. The computational resources used were partially funded by the NIH Shared Instrumentation Grant 1S10OD021644-01A1. This research was supported by the AHA Children’s Strategically Focused Research Network grant (17SFRN33630041) and the Nora Eccles Treadwell Foundation. G. Lemmon was supported by NRSA training grant T32H757632. S. Wesolowski was supported by NRSA training grant T32DK110966-04 and the AHA Children’s Strategically Focused Research Network Fellowship award (17SFRN33630041).
G.L. and M.Y. own shares in Backdrop Health, a University of Utah effort to commercialize Bayesian inference on health records; however, there are no financial ties regarding this research. The remaining authors declare no competing interests.
Peer review information: Nature Computational Science thanks Jeffrey P. Rewley and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Boxplots show median (black line), 25th and 75th percentile (box ends), 95th and 5th percentile (whisker caps) and outliers. Number of terms (bottom panel) is a count of distinct diagnoses, procedures and medications found in each patient’s medical history.
For each score function, we evaluated C-values ranging from 10^-14 to 10^14. (a) For each of 3,041 diagnosis (DX), procedure (PX) and medication (RX) terms, we used cross-validation to select the C-value that achieves the best score. Each boxplot contains these 3,041 best scores as evaluated with different score functions. (b) Distribution of C-values for three score functions with high entropy. Jcutoff was chosen for downstream analysis because it has high entropy and a smooth C-value distribution without the large outlier at C = 10^-14.
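The selection procedure in this caption can be sketched as follows. The sketch assumes that Jcutoff denotes Youden's J statistic (sensitivity + specificity − 1, per the Youden entry in the reference list) evaluated at its best probability cutoff; that reading is an assumption, as are all function names here. The callable `score_for_c` is a hypothetical stand-in for "fit a regularized logistic regression with this C and return its cross-validated score", which in practice could be done with scikit-learn's LogisticRegression.

```python
def youden_j(labels, scores):
    """Return (best_j, best_cutoff), where J = sensitivity + specificity - 1
    and a case is called positive when its score is >= cutoff."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_j, best_cutoff = -1.0, None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # ties keep the lowest cutoff
            best_j, best_cutoff = j, cutoff
    return best_j, best_cutoff


# Log-spaced grid of regularization strengths, 10^-14 through 10^14.
C_GRID = [10.0 ** e for e in range(-14, 15)]


def select_best_c(score_for_c, c_grid=C_GRID):
    """Pick the C whose (hypothetical) cross-validated score is highest."""
    return max(c_grid, key=score_for_c)
```

Repeating `select_best_c` once per medical term yields one best C-value and one best score per term, which is the population summarized by each boxplot.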
Extended Data Fig. 3 Minimum description length of the comorbidity network discovered by PBC for diagnoses in the University of Utah EDW.
Examples of significantly associated medical conditions within each cluster are displayed. Citations supporting these associations are listed in Supplementary Table 6.
See Fig. 1 legend for description of (a) and Fig. 2 legend for description of (b). In (b), the x-axis ticks correspond to the addition of regression features (PBC) or stratification criteria from left to right: 0, no features and no stratification; 1, gender/female; 2, ancestry/African American; 3, length of medical history/at least 2 years; 4, number of visits/at least 3 visits. The MIMIC-IV results are very similar to the University of Utah results, reinforcing a key message of this paper: PBC retains the power to identify comorbid relationships that are lost by stratification.
Cite this article
Lemmon, G., Wesolowski, S., Henrie, A. et al. A Poisson binomial-based statistical testing framework for comorbidity discovery across electronic health record datasets. Nat Comput Sci 1, 694–702 (2021). https://doi.org/10.1038/s43588-021-00141-9