Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

A rare variant analysis framework using public genotype summary counts to prioritize disease-predisposition genes

Abstract

Sequencing cases without matched healthy controls hinders prioritization of germline disease-predisposition genes. To circumvent this problem, genotype summary counts from public data sets can serve as controls. However, systematic inflation and false positives can arise if confounding factors are not controlled. We propose a framework, consistent summary counts based rare variant burden test (CoCoRV), to address these challenges. CoCoRV implements consistent variant quality control and filtering, ethnicity-stratified rare variant association test, accurate estimation of inflation factors, powerful FDR control, and detection of rare variant pairs in high linkage disequilibrium. When we applied CoCoRV to pediatric cancer cohorts, the top genes identified were cancer-predisposition genes. We also applied CoCoRV to identify disease-predisposition genes in adult brain tumors and amyotrophic lateral sclerosis. Given that potential confounding factors were well controlled after applying the framework, CoCoRV provides a cost-effective solution to prioritizing disease-risk genes enriched with rare pathogenic variants.

Introduction

To identify genetic variants, especially rare variants, that are associated with various human diseases, scientists have generated whole-exome sequencing (WES) and whole-genome sequencing (WGS) data, such as the Genome Aggregation Database (gnomAD)1, the Trans-Omics for Precision Medicine (TOPMed) program2, and the 100,000 Genomes Project3. Detecting causative rare variants typically requires a much larger sample size. Such a study should include at least thousands of cases and matched controls to ensure statistical power. However, most sequencing studies focused on a specific disease or trait include relatively few cases and very few (if any) matched healthy controls. Although many large cancer genomics studies have characterized the landscape of somatic mutations by sequencing germline and somatic samples from each patient, these studies do not include independent controls because germline samples from the same individuals are used as paired normal controls. Therefore, most cancer genomics sequencing studies cannot be used directly to perform germline-based case-control association analyses for the discovery of novel cancer predisposition genes.

Combining the cases under scrutiny with external controls for prioritizing disease-predisposition genes is one solution. The best approach to controlling for confounding batch effects caused by the heterogeneity of exome-capture protocols, sequencing platforms, and bioinformatics-processing pipelines would be to re-map all raw sequencing data and then jointly call the genotypes of cases and controls. However, this approach is expensive and requires a lot of storage and a long processing time, which might not be feasible for some research groups. Another solution would be to download jointly called control genotype matrices (if available), merge with case genotype matrices, and apply variant quality control (QC) and filtering to account for batch effects. Alternatively, when the full-genotype data are not available and assuming the pipelines have negligible effects on the association analyses after QC and filtering, one can use publicly available summary counts for case–control association analyses4. Here the summary counts refer to the allele or genotype counts of each variant within a cohort. If confounding factors are well-controlled, the summary counts–based strategy is the least expensive and can dramatically increase the sample size of the controls. For instance, gnomAD V2.1 has more than 120,000 WES samples, and gnomAD V3 has more than 70,000 WGS samples1.

Using high-quality summary counts has led to important supporting evidence in identifying pathogenic germline mutations in adult cancers5 and pediatric cancers6. However, research is lacking on developing a general framework for using high-quality summary counts and evaluating the performance of such a framework.

Several challenges remain when using public summary counts for rare variant association tests. First, because the genotypes of the cases and controls are called separately, the QC and filtering steps might not be consistent if performed separately for cases and controls. To account for such differences, Guo et al.4 relied on the variant annotation QD (quality score normalized by allele depth); they also used synonymous variants to search for threshold combinations that would remove systematic inflation. One disadvantage of this approach is that it relies on one variant QC metric, instead of more sophisticated variant QC methods, such as VQSR7.

Differences in the genetic composition of unmatched cases and controls present another challenge. When using publicly available summary counts, researchers often use Fisher’s exact test (FET)4,6 which either treats different ethnicities as a single population or selects samples from one major ethnicity, e.g., samples of European ancestry. Ignoring ethnicity when using public summary counts can result in false positives due to population structure when rare variants have relatively high minor allele frequencies (MAFs) in a specific population. On the other hand, considering a single matched ethnicity might reduce the statistical power. Although a better and more common practice is to use all samples and account for the population structure by adjusting for the principal components (PCs)8, it is not applicable when only public summary counts are available without individual genotype information.

The inflation factor estimation of the association test when focusing on very rare variants is also important. In genome-wide association studies with common variants, the null distribution of P values is assumed to be continuous and follow a uniform distribution U(0, 1). However, for rare variants, especially those with very low MAFs, assuming a uniform distribution of the P values under the null hypothesis of no association is no longer accurate. For example, in the FET under the null hypothesis of no association, the count of cases with alternate alleles in a 2 × 2 contingency table for each gene follows heterogenous hypergeometric distributions, depending on the total number of rare alleles. For the same reason, traditional FDR-control methods such as the BH procedure9 assuming a uniform distribution of the P values under the null is suboptimal for the discrete count-based test results10.

Lastly, the most critical challenge when using public summary counts is related to rare variants in high-linkage disequilibrium (LD). In most cases, the independence assumption is probably reasonable because the variants are rare and the chance of observing both variants in one sample is low, unless they are in strong LD. However, when the assumption is violated, either due to high LD or technical artifacts, that can cause false-positive findings or systematic inflation.

In this study, we develop a framework called CoCoRV to prioritize disease-predisposition genes by using public summary counts as controls for rare variant burden tests, making it possible to discover novel genes without sequencing additional healthy control samples. Our framework provides consistent variant QC and filtering, ethnicity-stratified gene-based burden test, accurate inflation factor estimation, powerful FDR control, and an approach to detect high-LD variants using gnomAD summary counts, which is essential for removing false positives. We also evaluate the concordance between using summary counts and full genotypes. In general, the concordance is good, especially for top-ranked genes with a low P value threshold. We focus on genotype calls from WES data throughout this study.

Results

Overview of the proposed framework

To address the challenges in rare variant association tests using public summary counts, we developed a new framework termed Consistent summary Counts based Rare Variant burden test (CoCoRV) (Fig. 1). The input case data can be either full-genotype-based or summary-count-based. The input control data are summary-count data such as from gnomAD. We included a fast and scalable tool to calculate the coverage summary statistics because coverage depth is important in consistent filtering between cases and control4. Cases belonging to different ethnicities can be matched to corresponding controls. CoCoRV allows a user to customize criteria for filtering variants based on annotations in the input data. We provided several built-in variant categories based on annotations from ANNOVAR11 and REVEL scores12. For putative pathogenic variants, we included variants annotated with “stopgain”, “frameshift_insertion”, “frameshift_deletion”, and “nonsynonymous” with a REVEL score ≥0.65, which showed empirically good discrimination power for potential pathogenic versus nonpathogenic variants.

CoCoRV ensures that the variant QC and filtering are consistently applied between cases and controls. Only high-quality variants are retained in the analysis by controlling the coverage depth, missingness of the variants in both case and controls. A blacklist of potentially problematic variants was created based on gnomAD’s filtering status and discrepancies in allele frequencies between the WES and WGS platforms (Supplementary Methods). In addition to the FET, CoCoRV includes ethnicity-stratified analysis using the Cochran–Mantel–Haenszel (CMH)-exact test, which mitigates systematic inflations when samples include multiple ethnicities. We computed sample counts in three models (dominant, recessive, and double-heterozygous) within each gene.

Due to the discrete nature of count data, we propose an accurate inflation factor estimation method that is based on sampling the true null distribution of the test statistics (Fig. 2a) and is essential for checking possible systematic inflations. Similarly, the commonly used BH procedure9 for FDR control assumes a continuous uniform distribution, which is not true for discrete count-based tests. We, therefore, propose to use more powerful resampling-based FDR-control methods or FDR methods directly accounting for the discrete and heterogeneous distribution for the FET and CMH test results (Fig. 2a). The sampled P values under the null hypothesis can be used for both inflation factor estimation and resampling-based FDR control.

We introduce a LD-detection method using only gnomAD summary counts to identify high-LD variants (Fig. 2b). It partitions the gnomAD data set into several independent summary-count sets, and then infers the high-LD between variants based on the generated independent summary counts for each ethnicity. Excluding high-LD variants results in a more accurate estimation of the number of samples in each model and reduces false positives when the assumption of independence between variants is violated, especially for the recessive and double-heterozygous models. Details of the framework and modeling are provided in “Methods”. We also provide a side-by-side comparison of the features of our proposed tool and other existing tools for summary-count-based analysis (Supplementary Data 1).

CoCoRV’s inflation factor estimation is unbiased and accurate

We compared our proposed inflation estimation method with two other methods described by TRAPD4. We used the rare variant analysis results of the central nervous system tumors (CNS) and acute lymphoblastic leukemia (ALL) cohorts against our in-house controls under a dominant model to compare these methods (Fig. 3). We also checked their distributions under the simulated null P values, where a well-estimated inflation factor should be unbiased and centered around 1. For the CNS cohort tested using FET (Fig. 3a), the inflation factor λemp estimated using empirical null P values was 0.96, and those of the other two methods were greater than 1. For the ALL cohort tested using FET (Fig. 3b), all three inflation factors were less than 1. The rightmost plots in Fig. 3 show that λemp was well-calibrated under the null hypothesis, while the other two methods from TRAPD showed either upward or downward biases. The QQ plots based on CoCoRV’s method showed almost straight lines along the diagonal, and the other two methods showed clear shifts from the diagonal. In addition to FET, we saw similar patterns using the CMH-exact test (Fig. 3c, d). The estimation of CoCoRV’s λemp was stable, even with as few as 100 simulated null P values. Therefore, in later estimations of λemp in QQ plots, we used 100 simulated null P values per gene, unless otherwise specified. When the null distribution of the inflation factor is of interest, as shown in the rightmost plots, it is better to use at least 1000 replications for a more stable result. Because of the better performance of the proposed estimation, we applied it to nearly all subsequent analysis results in this study. The run time of our inflation factor estimation is less than a minute for sampling 1000 replicates of ~20,000 genes, considerably fast because of the direct computing of the cumulative distribution function (CDF) of P values under the null hypothesis.

Concordance of rare variant association between using summary counts and full genotypes

We compared the concordance between the result obtained using summary counts from separately called controls and that using jointly called full genotypes. For this analysis, we generated separately called control summary counts using an in-house control cohort of 8175 WES and treated the jointly called case–control full genotypes as the ground truth (Supplementary Fig. S1). To our knowledge, this is the first direct comparison between analysis results using summary counts of separately called cases and controls and that using jointly called full genotypes. For this comparison, we focused on the commonly used dominant model. We used an AF threshold of 1e–3 for the CNS and ALL cohorts and for all comparisons.

We first compared the sample counts with defined pathogenic variants for each gene. For cases, we compared and tabulated the sample counts with defined potential pathogenic variants from using jointly called full genotypes and that from using summary counts into a cross-classification/contingency table (Fig. 4a, b). For both the CNS and ALL cohorts, the concordance between using full genotypes and CoCoRV was very high (Pearson’s correlation r > 0.98). Most of the discordances were within one count. For controls, because the number of samples with defined potential pathogenic variants varied in a large range, we used the scatter plot of those numbers to illustrate the concordance between using full genotypes and using summary counts (Fig. 4c, d). The correlation between using full genotypes and CoCoRV was also very high (r = 0.996). The high correlation of the raw counts shows that the proposed summary counts-based framework maintained a high concordance with the jointly called full-genotype-based framework. We also examined the top genes of the association tests between using jointly called full genotypes and separately called summary counts (Fig. 4e, f). In general, the percentage of top overlapped genes was about 70% or higher, especially for the more stringent P value thresholds. We also performed ethnicity-stratified analyses for both frameworks and found similar overlapping patterns (Supplementary Fig. S2). Finally, the QQ plots using jointly called full genotypes and separately called summary counts were also similar, with correlations between –log10 P values greater than 0.9 (Fig. 4g, h).

We also investigated the coverage depth cutoff in QC. The coverage depth cutoff of 10 shows the best concordance measured by correlations of qualified sample counts in cases or controls, and correlations of the association test P values (Supplementary Data 2). The best concordance of using the depth cutoff 10 could be related to the QC of genotypes: we keep high-quality genotypes with DP ≥ 10 (DP is the number of informative reads for each sample in the VCF file), which is consistent with the QC used in gnomAD for summary allele counts. As the coverage cutoff increases, the variant quality will increase too, but we might risk missing good quality variants if the coverage cutoff is too high, especially for some regions not well covered due to high GC contents or other sequencing features. We also calculated the size of qualified regions with different coverage cutoffs on gnomAD’s whole-exome coverage data (Supplementary Fig. S3). The coverage cutoff of 10 seems to be a good trade-off between variant quality and the size of qualified regions being retained.

FDR methods accounting for discrete counts show good FDR control and improved power

Two adopted resampling-based methods: RBH_P and RBH_UL, and one recently developed method designed for discrete distributions ADBH.sd10 show good FDR control in general and similar improved power for FET-based test results, even though RBH_P has slightly inflated FDR in some settings (Supplementary Fig. S4). Compared to BH, the relative power increases of RBH_UL range from 11 to 27% (Supplementary Data 3). Filtering the genes with few rare allele counts (BH_T2 and BH_T3) shows little power increase. When applied to the CNS association test results using gnomAD as the control (threshold <0.2), the three methods accounting for the discrete counts all identify the established genes NF1, SUFU, ELP1 as significant, while the BH method only identifies NF1 as significant (Supplementary Data 4). For resampling-based FDR control, the time cost is within a minute, similar as that in inflation estimation because sampling the P values under the null hypothesis is the main time-consuming part. For ADBH.sd, the time cost is even smaller because only P values’ CDF is needed without sampling P values under the null.

The proposed LD-detection method has high power in detecting high-LD rare variants and accurately identifies MNVs in gnomAD

We first evaluated the type I error and statistical power of the proposed LD test by using simulated data consisting of six independent sets of summary counts (Supplementary Fig. S5a). The type I error was well-controlled. The power of detecting LD increased as the LD strength and the AF increased. The LD test was well powered for detecting strong positive LDs but not for detecting weak LDs. When full genotypes are available, the proposed test can also be used because the full-genotype-based LD test is a special case of our proposed method, where the total haplotypes is 2. We evaluated our method in simulations under the full-genotype setting and compared it with the ld function in snpStats13 which is designed for LD detection using full genotypes. CoCoRV and snpStats have almost the same results (Supplementary Fig. S5b) and the type I error was well-controlled. The power increased remarkably compared to using summary counts, which was expected given that more information is available when using full genotypes.

As one application and a validation of the proposed LD-detection method, CoCoRV was applied to detect high-LD multinucleotide variants (MNVs)14 without access to the sequencing reads information. A MNV is defined as a cluster of two or more nearby variants on the same haplotype in an individual. Because variants within a MNV are close to each other, they are very likely transmitted together thus exhibiting high-LD in a population. We applied our proposed LD test to the whole exomes of gnomAD data to detect rare variant pairs of high LD and compare them with the reported gnomAD MNVs detected using sequence reads. Although high-LD variants and MNVs are not the same, they have similarities when restricted to rare variants with distances ≤2 BPs (Table 1). For each ethnicity and each variant, we extracted the gnomAD exome count data and generated six independent sets of summary counts using the cohort and sex information. Then we applied the LD test on pairs of variants annotated with certain functions, excluding synonymous variants.

The above analyses used the stringent threshold FDR < 0.05 to achieve high precision but could sacrifice the recall rate. We also used different P value thresholds and evaluated the recall and precision by treating the reported gnomAD MNVs (≤2 BPs) as the ground truth, though some high-LD MNVs detected were not reported in the gnomAD MNVs, as described above. In total, there were 1780 reported gnomAD MNVs among our tested variant pairs (Supplementary Data 6). If we chose 0.01 as the P value threshold, 1635 of 1976 detected were true, corresponding to precision 0.83 and recall 0.92.

To further demonstrate the accuracy of our proposed LD-detection method, we compared the LD-detection results using either pooled summary counts or individual genotypes from our constructed in-house 8175 synthetic controls. Specifically, in each ethnicity, we randomly split the control dataset into three datasets with sample size ratio 1:2:3, then we stratified each of the three dataset by sex, generating six independent datasets. For each of the six dataset, we pooled the alternate allele counts and the total allele counts, similar as that from gnomAD, and tested LD between variant pairs using CoCoRV. For individual genotypes, we used the ld function in the R package snpStats13 to test each LD pairs within each ethnicity. We focused on variants with gnomAD maximal allele frequency among ethnicities <0.1 and the selection criteria of variants based on annotations were the same as that in the LD scan for gnomAD. We consider 45,626 LD pairs detected using individual genotypes (FDR < 0.05) as the ground truth. In total, 6694 LD pairs were detected (FDR < 0.05) using summary counts, where 8 were false positives, indicating high precision (99.8%) of CoCoRV. The lower power using summary counts (14.65%) is consistent with our power simulations, suggesting the power advantage of using full genotypes when they are available. For the true positives detected by CoCoRV, the estimated LD measure r2 between using summary counts and using full genotypes were highly correlated (correlation = 0.853) (Supplementary Fig. S7a). After stratifying detected LD pairs by alternate allele frequencies within the controls and the estimated LD measure r2 from using full genotypes (Supplementary Fig. S7b), we observed that summary counts-based LD detection mainly detect those high-LD pairs with decent alternate allele counts (e.g., >10).

Next we validated the detected LD pairs from gnomAD using the 8175 controls with individual-level genotypes. Out of 10,081 LD pairs from gnomAD (FDR < 0.05 within each ethnicity), 7290 (72.3%) variant pairs can be tested with at least one alternate allele from at least one variant of the pair in the individual-level controls. We used a P value threshold 5 × 10−4, which corresponds to 3.6 expected false positives out of 7290 tests under the null hypothesis of no LD. Of the 7290 variant pairs tested, 6989 (95.9%) passed the P value threshold 5 × 10−4 (Supplementary Data 5). All pairs that passed the threshold except two have odds ratio larger than 150. These results show that our proposed method of detecting LD from gnomAD using summary counts is accurate or has high precision.

Detecting high-LD variants improves count estimation and removes false positives in the recessive and double-heterozygous models

We performed a simulation of count estimation by using summary counts from a set of variants, including independent and correlated variants (Supplementary Fig. S10). Our results showed that the count estimation results from using CoCoRV were more accurate than those from using TRAPD. TRAPD overestimated the counts for all three models when high-LD variants were present, most likely because TRAPD was designed to be conservative for a one-sided FET to mitigate the violation of the assumption of independent variants. This overestimation could result in an inflated type I error, and the potential loss of power if a two-sided test is used. CoCoRV corrected for high-LD variants and resulted in a much better estimation.

We also compared the association results under the recessive model, with or without the LD test with an AF threshold 0.01 and using gnomAD summary counts as controls for both the CNS and ALL cohorts (Fig. 5). When there was no LD check for cases with two heterozygous variants, we found several false positives with very low P values (Fig. 5a, c). When variants in LD were excluded from the count of case samples (Fig. 5b, d), those false positives were removed (Supplementary Data 7). One example is the variant pair (3-37089130-A-G, 3-37089131-A–C) in gene MLH1, which is also identified as an MNV from gnomAD. Applying the LD test also removed false positives with very low P values under the double-heterozygous model (Supplementary Fig. S11). We noticed that the QQ plots under the recessive model were not as well-calibrated as those under the dominant model, even after applying the LD test. The reason for this result could be that not all double-heterozygous variants with positive LD were detected. Another factor to consider is that the recessive and double-heterozygous models are sensitive to differences in LD structures between cases and controls. Stratifying samples into different ethnicities might still retain some LD differences due to the population substructure within each of the six ethnicities. For example, African/afr and Latino/amr are two main admixed populations, and different proportions of ancestries can change the LD structure. Although there is no perfect solution, our CoCoRV method represents an advance in using only summary counts as controls for rare variant association tests.

CoCoRV analysis of the CNS and ALL cohorts

We applied CoCoRV to two St. Jude pediatric cancer cohorts (CNS and ALL) and used gnomAD summary counts as controls under a dominant model. For a comparison, we also applied TRAPD to the two cohorts. We used the all-pooled AC and AN to generate the counts per gene because TRAPD is hard-coded to use the annotation AC and AN to generate case–control statistics. CoCoRV was flexible to specify any annotated fields, such as non-cancer subsets, or any population-specific counts from gnomAD. Both SNVs and short indels were considered. For TRAPD, using the option “--pass” shows less false positives than using the QD-based filtering (Supplementary Data 8), therefore we used the former for comparison with CoCoRV. For the CNS cohort, TRAPD reported an inflated λ2points (Fig. 6a). Two genes passed the exome-wide significance (P < 2.5 × 10−6) using TRAPD. One was a false positive and the other was the known glioma causal gene NF1. For CoCoRV, it showed no inflation in either the pooled counts from all ethnicities (Fig. 6b) or the stratified analysis using CMH (Fig. 6c). One gene passed the exome-wide significance, which was the known causal gene NF1. After manual variant checking, the false positive from TRAPD was caused by inconsistent filtering: the variant has FILTER status PASS in cases and therefore included in cases but has the failed status (failed the random forest-based QC) in gnomAD therefore not included in controls (Supplementary Data 8). For the ALL cohort, five genes passed exome-wide significance using TRAPD and four of them were false positives. The inflation factor λ2points was inflated (Fig. 6d). In contrast, CoCoRV showed no obvious systematic inflation, and identified the known causal gene ETV6 as the only exome-wide significant gene. The CMH-based analysis had better calibration at the tail than did the FET-based analysis (Fig. 6e, f). The cause of the false positives from TRAPD was either inconsistent filtering between cases and controls or low-quality variants which showed large differences in AF between gnomAD WES and WGS data (Supplementary Data 8).

NF1 was detected in the pediatric CNS cohort and had relatively similar P values across different scenarios, no matter which summary counts-based controls were used. For the ALL cohort, the much larger sample size of gnomAD dramatically increased the significance of ETV6, from 10−6 to 10−12 (Table 2). The relatively stable P values for NF1 in the CNS cohort were most likely due to the adequately large counts in the contingency table using the in-house–constructed summary counts. The benefit of increasing the number of controls also diminished after a certain level, while the number of cases was fixed (Table 2). In contrast, ETV6 had 0 or 1 qualified control sample with rare alleles in the in-house controls, had more significant P values by using gnomAD as controls, benefited from the improved quantification of the AFs when the control sample size increased substantially.

For the CNS cohort, the top four genes were NF1, ELP1, C17orf75, and SUFU (Fig. 6). NF1 and SUFU are well-established causative genes for pediatric brain tumors15. ELP1 was recently identified as a predisposition gene of medulloblastoma, one type of pediatric brain tumors6. This makes C17orf75 an interesting candidate. From the GTEx portal, two of the top three tissues with highest median expression of C17orf75 are the cerebral hemispheres and the cerebellum (Supplementary Fig. S12). Further validation is needed to confirm the association of C17orf75 with pediatric brain tumors.

ProxECAT16 is another recently proposed method for performing summary-count-based analysis. It pools the counts of functional and non-functional alleles from cases and controls to form a contingency table, which is used for an association test. We implemented this method within CoCoRV and applied it to the CNS and ALL cohort. ProxECAT showed slight inflations (inflation factor 1.05 for ALL, and 1.08 for LGG). The known gene NF1 and ETV6 reached the exome-wide significance level and there were no false positives reaching the exome-wide significance (Supplementary Data 8). For the known causal genes NF1, CoCoRV shows more significant P values (P = 7.6 × 10−10) than ProxECAT (P = 9.2 × 10−8), the trend is the same for ETV6 too (Supplementary Data 9).

We also explored the recessive models in the association test using both TRAPD and CoCoRV (Supplementary Data 10). When the LD was not accounted for, for both TRAPD and CoCoRV, all genes that passed the exome-wide significance were due to the high LD between variants, therefore considered as false positives. However, when the LD was accounted for in CoCoRV, there was no gene passing the exome-wide significance. This suggests that for the CNS and ALL, the association signal is mainly driven by the dominant model, although it is also possible that we do not have enough power for the recessive models with the current sample size of cases.

The impact of ethnicity compositions on summary-count-based analysis

Population structure is a known confounder for genetic association test if not properly addressed8. The effect of ethnic-specific allele frequencies (AFs) on the association test results depends on the differences among ethnic-specific AFs, and the differences in ancestry compositions between cases and controls. The majority population of gnomAD V2 exomes is European (nfe = 47%), and the four datasets we analyzed (ALL, CNS, GBM, LGG) all have European populations as the majority (nfe ≥70%, Supplementary Fig. S14). The dominance of European samples in the cases and controls, as well as our focus on rare variants (AF ≤ 1e−3), might explain that in general, the results using ethnic-specific AFs or the pooled AFs did not show large differences in terms of genes achieving exome-wide significance for the four datasets we analyzed. However, we do observe one difference under the recessive model. Although LD was accounted for, using the pooled AF in analyzing the CNS cohort resulted in one false positive (P value <2.5 × 10−6), while there was no false positive if ethnicity-stratified analysis was used (Supplementary Data 10). The variant causing the false positive is 19-55898120-A-AT, which has the largest AF in gnomAD V2 African/African American (AF = 0.1122), but the AF is very low in European population (AF = 0.0002196).

To further illustrate the impact of ethnicity compositions and the importance of ethnic-specific analysis, we constructed a simulated “case” cohort using samples from the 1000 Genomes Project with sample ID starting with “NA”. In total, there were 840 samples included from the five major populations (nfe: 210, afr: 250, amr: 65, eas: 213, sas: 102). The ethnicity composition in the constructed “case” was very different from that in gnomAD, e.g., the European samples were not the majority. Because the samples from the 1000 Genomes Project are not disease-specific, genes that were significant in the analysis between the constructed “case” and gnomAD were false positives due to the confounding of population structures, instead of being associated with any disease. Association results showed that the simple pooled analysis can produce many false positives (Supplementary Fig. S15a, d and Supplementary Data 11), while ethnicity-stratified analysis substantially reduced the false positives (Supplementary Fig. S15c, f). Adding the filter using the maximal allele frequencies among gnomAD ethnicities (AF_popmax) could also mitigate the false positives (Supplementary Fig. S15b, e), although it was not as effective as ethnicity-stratified analyses. The variants that drive the top false positives in the simple pooled AF analysis showed large differences in allele frequencies between different ethnicities, where the AF ratios between the highest AF and lowest AF among ethnicities were around 10 and could reach about 20 (Supplementary Data 12). The number of false positives decreases as the AF threshold decreases. For example, when AF was set to 1e−4 and the ethnicity-stratified analysis was used, there were no false positives in the analysis. The impact of different ethnicity compositions was not just limited to the African population but could happen in any other populations (e.g., eas, afr, amr, sas) less represented in the gnomAD dataset (Supplementary Data 12).

Discussion

In this study, we developed a framework to prioritize disease-predisposition genes by using public summary counts as controls for rare variant burden tests. Consistent variant QC and filtering, as well as joint AF-based filtering make sure no artificial bias is introduced in the analysis. Ethnicity-stratified burden test ameliorates the effect of population stratification. By considering the discrete and heterogeneous nature of the null distribution, we proposed a more accurate inflation factor estimation method and several more powerful FDR-control methods than assuming the null distribution of P values is uniform. We also proposed an approach to detect high-LD variants using gnomAD summary counts. We detected and stored the variants in LD from either using gnomAD summary counts or using in-house controls with full genotypes, and then used them in the framework to remove false positives, which is critical under the recessive models. In addition, we evaluated the concordance between using summary counts and full genotypes, which was high especially for the top hits with high statistical significance.

The inflation factor estimation and FDR-control method can also be used in other applications if the tests are performed using FET or CMH. Besides its application in the rare variant association test, our proposed LD test can detect high-LD MNVs, and is not limited by sequencing read length compared with traditional read-based MNV detection. Identifying rare variants in strong LD is also helpful in other analyses, e.g., distinguishing compound-heterozygous variants from those on the same haplotype or applying special treatment to MNVs14.

In principle, allele frequencies derived from an independent large non-rare disease cohort should be helpful and can provide an unbiased way for AF-based filtering. We tried to use TOPMed AF (Freeze 8) annotations for variant filtering after lifted over to GRCh37 using picard23. Unfortunately using TOPMed AF alone introduced false positives (Supplementary Data 8). One main reason is that there are likely substantial differences in ethnicity proportions between gnomAD and TOPMed. Variants that show large ethnicity-specific AFs and should be filtered out are kept if we only use TOPMed’s AF for filtering. One example is the variant 1-18809351-G–C, which has the highest AF in gnomAD V2 South Asian (AF = 0.1285), but TOPMed’s overall AF is very low (AF = 0.00082360). Even though using TOPMed AF annotation alone for AF filtering can cause false positives, TOPMed could serve as a useful resource for additional AF-based filtering on top of the joint AF-based filtering.

By applying CoCoRV framework to pediatric and adult cancers, we identified known cancer-predisposition genes and prioritized other risk genes, one of which was statistically validated. We caution that CoCoRV should be used as a prioritization tool and not a statistical validation tool, because using summary counts might not control some hidden confounding factors. For example, our proposed method estimates the sample counts in the controls under different models rather than counting the actual counts from individual-level genotypes, there is the possibility that the estimated counts might not match well with the actual counts. This is the limitation of using summary-count-based burden tests. Once interesting genes are identified, either a strict full-genotype-based association test or lab-based functional studies are needed to validate the findings.

One potential limitation of using summary counts is that adjusting for covariates is not straightforward. Besides potential batch effects introduced by the sequencing platform, one (and often the only) confounding factor in genetic studies is population structure24. We propose to use a CMH-based, ethnicity-stratified analysis to mitigate this problem. Whether the population stratification causes an obvious systematic inflation depends on three factors: (1) population composition in the cases and controls, (2) AF threshold, and (3) sample size. In practice, we cannot guarantee that the populations of cases and controls will match perfectly, but with an adequately low AF threshold (e.g., 5e–4) and a focus on potential pathogenic variants, the influence of the population structure can be substantially reduced, possibly to the point of no systematic inflation. Improving the adjustment for population structure using summary counts might be worth future investigation.

Testing LD is a well-studied problem in statistical genetics when haplotypes or genotypes can be observed25,26. We extended the LD test when only a set of summary counts of two variants was available, such as in gnomAD. We assumed the Hardy–Weinberg equilibrium (HWE) in our test, which simplified the calculation and performed well. This might be due to that only strong LD can be well detected when using a limited set of summary counts, and the influence of the Hardy–Weinberg disequilibrium (HWD) for most rare variants is negligible or inadequate. Further investigation to extend the proposed method to account for the HWD might be useful, for example using a composite measure of LD27. Our proposed LD-detection method in CoCoRV is similar to a method proposed to estimate haplotype frequencies and LD measures in pooled DNA data28. The major differences are as follows: First, CoCoRV allows for different numbers of samples in a pool, e.g., the subset of samples within each ethnicity in gnomAD, rather than a fixed number in the design of the pooled experiment. Second, instead of using an expectation-maximization algorithm and treating the haplotype frequencies as missing data, our framework uses a direct gradient-based maximization of the likelihood, which exploits many well-developed gradient-based optimization methods29; therefore, it converges faster. In addition, we use a one-sided (rather than a two-sided) test to detect strong positive LD. Finally, we use the odds ratio (rather than correlation coefficient) to characterize LD strength for rare variants. In this study, we considered only the LD between two biallelic variants; in the previous study28, multiple loci and multiple alleles were allowed, which might be an interesting topic to explore in the future.

One main difference between using the separately called summary counts and jointly called full genotypes was the variant QC step. For example, for the separately called summary counts-based analysis, VQSR was applied separately to cases and controls; however, for jointly called full-genotype-based analysis, it was applied to jointly called cases and controls. We found that the results of VQSR for a specific variant can differ between cases and controls if run separately. This is partly expected because VQSR uses a machine learning approach to model the quality of variants within a jointly called genotype matrix. Different studies result in different training samples and different trained models. Including public sequencing data, such as the 1000 Genomes Project’s WES data, with the case sequencing data will most likely improve the robustness of the variant QC and the concordance when the summary counts-based analysis is performed later.

The lack of obvious inflation in the QQ plot does not guarantee that the top hits are true positives; it indicates only that there is no large systematic inflation in the association test. However, sporadic individual false positives could be among the top hits. When using publicly available summary counts as controls, the cases and controls are processed from different pipelines using different parameters or software versions. Therefore, this approach is more prone to false positives than is the joint analysis of case and controls using full genotypes generated from the same pipeline and QC steps. Also, a more stringent check on the sequence region of the called variants would be helpful, such as alignment uniqueness, duplication segments, or repeats from the UCSC genome browser resources. Warning QC flags from gnomAD are also useful. The indel variants required more manual inspection due to challenges in the accurate alignment and QC. Another method that alleviates the false-positive problem is to process cases with a relatively smaller, publicly available sequencing dataset, such as the WES data from the 1000 Genomes Project, which is also advocated by GATK. Although this approach requires increased processing time and storage space, it helps separate the true enrichment of rare alleles in cases from the false enrichment due to pipeline differences.

Compared with the standard full-genotype-based analysis using the continuous PCs to account for population structure, the coarse assignment of discrete ethnicity groups cannot account for finer population structure within each ethnic group. Accurate inflation factor estimation and the QQ plot is thus critical for determining whether there is systematic inflation due to population structure. For ethnicity stratification, we used the 1000 Genomes Project’s samples to predict the ethnicity of cases. This procedure was similar to the one used by gnomAD, though the final trained classifier was not the same due to a different training set. Hence, some discrepancies may occur between our assignment and those of the gnomAD classifier. We anticipate that the differences would be small, considering that the 1000 Genomes Project’s samples are also part of the gnomAD dataset, and a very high probability threshold (0.9) was used to classify the samples in both our and gnomAD’s ethnicity classification. Moreover, individual small discrepancies in the assignment are unlikely to substantially influence the final rare variant-based association test. When interpreting the association results, it is always more convincing if the association signals contributing to the final significance are found in more than one ethnicity.

In this study, both our processed data and gnomAD summary counts used GATK for variant calling and joint genotype calling, though there were differences, e.g., different versions, detailed implementations, and variant QC. By retaining high-quality variants and maintaining a consistent filtering strategy, we showed that batch effects in the processing pipeline can be well-controlled. However, we have not explored the potential batch effects if two completely different variant-calling algorithms were used (e.g., if one uses GATK and the other uses FreeBayes30). We anticipate these might introduce some inconsistencies and require further investigation. Recent standardization of some key processing steps in genome-sequencing analysis pipelines31 shows promising results in minimizing these batch effects.

Methods

Data sources

The case cohort in our main analyses included pediatric high-grade glioma (HGG) samples from the St. Jude PCGP, St. Jude LIFE study [central nervous system tumors (CNS) and acute lymphocytic leukemia (ALL)]32. We constructed an in-house control cohort of samples using the WES data from the Alzheimer’s Disease Sequencing Project (ADSP)33 and the 1000 Genomes Project34, which was later used to compare the concordance between using separately called control summary counts and jointly called case–control full genotypes. The datasets included multiple ethnicities (Supplementary Fig. S14), with European ancestry being the majority (52–81%). The inclusion of these individuals in our study was reviewed and approved by Institutional Review Board at St. Jude Children’s Research Hospital. We also tested our framework in two other independent datasets, The Cancer Genome Atlas (TCGA) brain tumor cohort and an amyotrophic lateral sclerosis (ALS) study, to illustrate its power.

Consistent quality control and filtering of variants

When cases and controls are called separately, coverage summary information (e.g., the percentage of samples with coverage ≥10) ensures that regions of interest are well covered. Our tool incrementally summarizes coverage information and computes the percentage of samples with coverage no less than specified thresholds, such as {1, 5, 10, 15, 20, 25, 30, 50, 100}, for each nucleotide position. Our tool can scale well for tens of thousands of samples and is easy for parallel running (supplemental text). Following the coverage filtering in Guo et al.4, we kept the variants that have ≥10 coverage (total reads of both alleles) in at least 90% of the samples for both cases and controls.

Inconsistencies can happen when QC and filtering are applied separately, e.g., when a variant is filtered out from controls but not from cases or vice versa. To address this problem, we employed the following strategy: keeping only high-quality variants from each cohort’s QC process and performing consistent QC and filtering. For the former, all variants must pass the cohorts’ QC filter. We required that the missingness within cases and controls be ≤0.1. In addition, we used the gnomAD data to generate a blacklist of variants based on gnomAD’s filtering status and discrepancies in allele frequencies between the WES and WGS platforms (Supplemental methods). For consistent QC filtering, we excluded all variants that failed QC steps in either cohort. This consistent filtering step is absent in other tools such as TRAPD4.

Another critical filtering step is joint allele frequency (AF)-based filtering. The joint AF of a variant is estimated by pooling the counts of cases and controls together, i.e., AF = (ACcase + ACcontrol)/(ANcase + ANcontrol), where ACcase and ACcontrol are the alternate allele counts of cases and controls, respectively, and ANcase and ANcontrol are the total allele counts (including both reference alleles and alternate alleles) of cases and controls, respectively. Using joint AF for filtering avoids inconsistencies when separately filtering variants based on AF within cases and AF within controls.

The joint allele frequency (AF)-based filtering is consistent and unbiased, which is commonly applied when individual-level genotypes of cases and controls are available, such as in genome-wide association studies (GWAS). Suppose we filter cases using case-derived AF and filter controls using control-derived AF, it can cause inconsistencies because the case and control-derived AFs for the same variant can be quite different simply due to statistical variations which depend on the sample size of cases and controls. On the other hand, if only the control-derived AF is used for filtering, for controls, all variants selected will have AF less or equal to the specified AF threshold, but for cases, there are possibilities that some variants can have AF above the specified AF threshold simply due to statistical fluctuation. These inconsistencies can result in biased statistical test results and inflated type I error.

For ethnicity-stratified analysis, we first extract the AC and AN annotation from each ethnicity, and then calculate the ethnicity-specific joint AFs, then the maximal joint AF among all ethnicities is used for the AF-based filtering. This shares some similarities of using the maximal gnomAD AFs (AF_popmax) among all ethnicities for filtering, however, our filtering is based on the joint AF considering both cases and controls instead of the control-derived (gnomAD-derived) AF.

Estimation of sample counts in three models using summary genotype counts

We defined three models to count the samples for the burden test (Fig. 1). Let $${{\mbox{A}}}{{{\mbox{C}}}}_{i}$$ be the alternate allele count of each variant i in a gene for a sample. A sample belongs to one or more of the following models if the sample satisfies the corresponding conditions:

$$\left\{\begin{array}{c}{{\mbox{dominant model}}}:\mathop{\sum}\limits_{i}{{\mbox{A}}}{{{\mbox{C}}}}_{i}\ge 1\hfill\\ {{\mbox{recessive model}}}:\mathop{\sum}\limits_{i}{{\mbox{A}}}{{{\mbox{C}}}}_{i}\ge 2\hfill\\ {{\mbox{double-heterozygous model: at least two variants with}}}\,{{\mbox{A}}}{{{\mbox{C}}}}_{i}\ge 1\end{array}\right.$$
(1)

Here the recessive model can be either homozygous with alternate alleles at the same position, or at least one alternate allele at ≥2 positions, i.e., double-heterozygous. Therefore, the double-heterozygous model can be viewed as a special case of the recessive model. As noted in Guo et al.4, because the haplotype information is not directly observed, the double-heterozygous model could mean two variants on the same haplotype, thus not exactly a compound-heterozygous model. We often have the full genotypes for cases, so we can determine the count of each model directly from the genotype matrix. However, for controls, we need to estimate the number of samples qualified for each model based on the summary counts of each variant. Specifically, for a gene with m variants, let $${p}_{{iG}}$$ be the genotype frequency of variant i with genotype G, where $$i=1,\cdots m,{{{{{\rm{and}}}}}}\;{G}=0,1,2$$. Public summary counts, such as gnomAD, usually provide this information; however, if only AF is available, the genotype frequencies must be estimated. One convenient, though not optimal, approach is to assume the Hardy–Weinberg equilibrium (HWE). We estimate the probability of a sample satisfying the defined models in (1) as follows. Denote the probability of the dominant model by $${p}_{{DOM}}$$, the recessive model by $${p}_{{REC}}$$, and the double-heterozygous model by $${p}_{2{HET}}$$. If only one variant exists, we have $${p}_{{DOM}}=1-{p}_{10},{{p}}_{{REC}}={p}_{12},{{{{{\rm{and}}}}}}\;{p}_{2{HET}}=0$$. If there are at least two variants, they can be estimated as shown below, assuming independence between the rare variants:

$${p}_{{DOM}}=1-\mathop{\prod }\limits_{i=1}^{m}{p}_{i0}$$
(2)
$${p}_{2{HET}}=\mathop{\sum }\limits_{i=1}^{m-1}\mathop{\sum }\limits_{j=i+1}^{m}\mathop{\sum}\limits_{k\ne i,k\ne j}\left(1-{p}_{i0}\right)\left(1-{p}_{j0}\right){p}_{k0}$$
(3)
$${p}_{{REC}}=\mathop{\sum }\limits_{i=1}^{m}{p}_{i2}\mathop{\sum}\limits _{j{\ne}i}{p}_{j0}+{p}_{2{HET}}$$
(4)

Equation (2) calculates the probability of being dominant for at least one locus. For (3), we approximated the probability by considering all pairs of variants with $${{\mbox{AC}}}$$ ≥1 out of m loci, because the probability of having two variants with $${{\mbox{AC}}}$$ ≥1 is much higher than that of having three or more variants with $${{\mbox{AC}}}$$ ≥1. Similarly, in (4), we considered a single variant with homozygous alternate alleles and two variants with $${{\mbox{AC}}}$$ ≥1. To calculate the estimated counts, we multiplied the frequencies of each model by the total number of controls. Here we assumed all rare variants were independent; later we made some modifications when we detected variants in LD.

Burden test with samples stratified by ethnicity groups

In addition to the pooled analysis of all ethnicities, we provide an ethnicity-stratified analysis. The latter requires predicting the ethnicity of each case. We performed principal component (PC) analysis for all samples, including samples from the 1000 Genomes Project (see the supplemental text). Following the practice in gnomAD1, we trained a random forest classifier based on the top 10 PCs and used the six major ethnicity categories as training labels: non-Finnish European (nfe), African American (afr), Admixed American (amr), East Asian (eas), South Asian (sas), and Finnish from the 1000 Genomes Project. Then we applied the classifier to the rest of the samples and used probability 0.9 as a cutoff to assign samples to the six ethnicities or to “other” when all predicted probabilities were less than 0.9. We then stratified samples based on the six ethnicities and calculated/estimated the counts for each model within each stratum; samples labeled “other” were not used. We performed the CMH-exact test35 to aggregate evidence of each stratified 2 × 2 contingency table. When using gnomAD summary counts as controls, we used the stratified summary counts of each ethnicity directly.

Inflation factor estimation for discrete count-based test

For the P values of burden tests calculated by the FET or CMH, the null distribution does not follow the uniform distribution. Therefore, the expected ordered null P values could not be derived from the uniform distribution. The null distribution of the ordered P values of a set of tested genes using either FET or CMH depends on the number of rare alleles in each gene. Instead of assuming the incorrect uniform distribution, we empirically sampled from the correct null distributions for inflation factor estimation. We assumed that the genes are independent of each other, which is reasonable for rare variants with very low AFs. Instead of permuting the phenotypes, we worked on the 2 × 2 contingency table directly (Fig. 2a). For each gene, let m be the number of cases, n be the number of controls, and k be the number of samples with rare alleles among all cases and controls. These values can be obtained from the observed 2 × 2 contingency tables and are held fixed for each gene. In order to sample P values under the null hypothesis of no association, for each gene, we can randomly sample the number of cases with rare alleles (denoted by x) from a hypergeometric distribution with parameters m, n, k and form a new 2 × 2 contingency table, then we can calculate the null P values using FET. However, this strategy will need to run FET for each replicate, which can be time consuming for all the genes and all replicates. A much faster sampling of P values under the null is to sample directly from the P values’ cumulative distribution function (CDF). To generate the CDF, first we generated the P values’ support list which is based on tabulating all possible values of x and corresponding probabilities from the hypergeometric distribution. Then the P value can be computed for each value of x according to the hypothesis and CDF can be generated afterward. We used the CDF approach for fast sampling. Then we sorted the P values under null across all genes for each replicate. This process was repeated N times, and the final expected sorted P values (order statistics) were the average of N P values at each ordered rank. To estimate the inflation factor, we took the lower 95% quantile of points in the QQ plot and regressed the sorted log10-scaled observed P values to the log10-scaled expected sorted P values. The slope of the regression was used as the inflation factor, which is denoted by $${\lambda }_{{{{{{\mathrm{emp}}}}}}}$$. Similarly, we extended the inflation factor estimation to the CMH-exact test. For CMH-exact test36, assume there are K strata, the test statistic is $$S={\sum }_{i=1}^{i=K}{x}_{i}$$, where xi is the number of case samples with rare alleles in strata i. To sample the P values directly from the CDF under the null hypothesis, we first need to tabulate all possible values of S and corresponding probabilities. Direct enumeration of S would be very time consuming, fortunately, there has been a very fast network-based algorithm developed37. We use the network-based algorithm, which was already implemented as an internal function in R, to calculate the P values’ support. Then the P value can be computed for each value of S according to the null hypothesis and the CDF can be generated afterward. Once the P values under the null are sampled based on the CDF, the inflation factor could then be calculated similarly as in FET.

Guo et al.4 also noted an excess of P values that equaled 1 when the null distribution was assumed to follow a uniform distribution. Thus, they proposed modifications to estimate the inflation factor. Two versions of this estimation are included in Guo et al.4. One estimated the slope by using values between quantile 0.5 and 0.95, denoted by $${\lambda }_{0.5-0.95}$$ in our study, and the other estimated the slope from two points: the first is the point with the observed P value of 1 with the highest rank, and the second is the point at the 0.95 quantile of the P values not equal to 1, denoted by $${\lambda }_{2{{{{{\mathrm{points}}}}}}}$$ in our study. For both $${\lambda }_{0.5-0.95}$$ and $${\lambda }_{2{{{{{\mathrm{points}}}}}}}$$, the expected ordered P value of rank r in an increasing order from n P values was r/(n + 1) assuming a uniform distribution U(0, 1). Because these modifications were still based on a uniform distribution, their solutions were biased.

To illustrate the performance of different methods on inflation factor estimation, we used simulated P values under the null hypothesis to evaluate the bias. Specifically, for N replicates of sampled null P values, we applied all three methods to estimate the inflation factor. For the empirical null P value-based method, we saved on computation cost for each replicate of null P values by using the rest (N−1) replicates of null P values to estimate the expected sorted P values.

FDR control

Resampling-based FDR-control method was originally developed to address the correlation between multiple tests38. We adopt it here for FDR control of discrete count-based tests. We simulated the P values under the null hypothesis the same as in the inflation factor estimation (Fig. 2a). Then we adopted the resampling-based FDR-control method to estimate the adjusted P values. The two resampling methods used a mean point estimate (RBH_P) or an upper limit estimate (RBH_UL) for the number of true positives detected, respectively. Given the simulated P values under the null hypothesis, the adjusted P value estimation is similar as that in the R package FDR-AME39. Let $${R}^{* }(p)$$ be the number of genes with P values no greater than p in the simulated P values under the null hypothesis, $$R(p)$$ the number of genes with P values no greater than p in the observed P values, $${M}^{* }(p)$$ the mean of $${R}^{* }(p)$$, $${Q}_{\beta }^{* }(p)$$ the 1−β quantile of $${R}^{* }(p)$$. Then for RBH_P, the estimated true positives detected at threshold p was $$s\left(p\right)=R\left(p\right)-{M}^{* }(p)$$; for RBH_UL, $$s\left(p\right)=R\left(p\right)-{Q}_{\beta }^{* }(p)$$. Then the adjusted P values were calculated as

$${{{\mbox{PBH}}}}\_{{{\mbox{P}}}}\left(p\right)=\left\{\begin{array}{c}{E}_{{R}^{* }}\left(\frac{{R}^{* }\left(p\right)}{{R}^{* }\left(p\right)+s\left(p\right)}\right),\,{if}\,s\left(p\right)\ge {Q}_{\beta }^{* }(p)\\ {E}_{{R}^{* }}\left({R}^{* }\left(p\right) > \,0\right),\,{otherwise}\hfill\end{array}\right.$$
(5)

and

$${{{\mbox{PBH}}}}\_{{{\mbox{UL}}}}\left(p\right)=\left\{\begin{array}{c}{E}_{{R}^{* }}\left(\frac{{R}^{* }\left(p\right)}{{R}^{* }\left(p\right)+s\left(p\right)}\right),\,{if}\,s\left(p\right) > \,0\\ {E}_{{R}^{* }}\left({R}^{* }\left(p\right) > \,0\right),\,{otherwise}\end{array}\right.$$
(6)

where $${E}_{{R}^{* }}$$ means taking the expectation over all replicates of P values under the null hypothesis.

An alternative approach for powerful FDR control is using FDR methods which directly account for the discreteness and heterogeneity of the distribution. We chose to use the methods developed recently which show good performance for discrete distributions40, e.g., A-DBH-SD. These methods can be applied to arbitrary discrete distributions if the CDF of the P values under the null is known or can be calculated. However, the implemented R package DiscreteFDR only supports the generation of CDF of FET. To extend these powerful FDR methods to the CMH-exact test, we developed a method to generate the CDF of the P values under the null in CMH-exact test as described in the section for inflation factor estimation. Due to the fast network algorithm37 for tabulating the test statistic and corresponding probabilities, the CDF list can be generated within seconds for ~20,000 genes considered.

Detection of high-LD rare variants via gnomAD summary counts

One interesting feature of CoCoRV is the proposed LD test using only gnomAD summary counts (Fig. 2). The gnomAD dataset has several subset-based summary counts, including “controls” and “non_cancer.” Because control samples are a subset of non_cancer samples, we can partition them into three independent sets of summary counts: cancer, healthy controls within the non_cancer set, and other diseases within the non_cancer set, which are labeled “non_cancer_non_controls.” After further stratifying by sex, we can generate six independent summary counts per ethnicity (Fig. 2). We assume that the allele frequencies and LD among variants are the same among these six independent datasets per ethnicity, which is likely reasonable because most variants should not be associated with sex or cancer. Given the $${ACs}$$ (alternate allele counts in a cohort) of two variants and the total number of haplotypes in these data, we can test the hypothesis that two variants are in positive LD, i.e., the variants are more likely to lie on the same haplotype than random chance under the assumption of independence. Specifically, let the data observed be $$\{{x}_{i},{y}_{i},{n}_{i},{i}=1\cdots I\}$$, where $${x}_{i}$$ is the AC of the first variant, $${y}_{i}$$ is AC of the second variant, $${n}_{i}$$ is the total number of haplotypes, I is the number of independent sets of summary counts (e.g., six from gnomAD) (Fig. 2). Denote the four haplotypes of two variants by $${h}_{11},{h}_{10},{h}_{01},{h}_{00}$$ and their corresponding probabilities by $${p}_{11},{p}_{10},{p}_{01},{p}_{00}$$, where 1 indicates the alternate allele and 0 indicates the reference allele. The log-likelihood of the observed data is

$${\log }{{\mbox{L}}}=\mathop{\sum }\limits_{i=1}^{I}{\log }{{\mbox{P}}}({x}_{i},\,{y}_{i}{{{{{\rm{|}}}}}}{n}_{i})$$
(7)

In the following equation, we drop the subscript i to simplify the notation. Let r be the count of unobserved haplotype $${h}_{11}$$, then the counts of the four haplotypes are $$r,{x}-r,{y}-r,{{{{{\rm{and}}}}}}\;{n}-x-y+r$$, respectively. By the law of total probability, the likelihood of observed allele counts x, y of two variants given n total haplotypes is $$P\left(x,{y}\,|\,n\right)={\sum }_{r}P(x,{y},{r|n})$$. Assuming the HWE, $$P(x,{y},{r|n})$$ can be calculated as the multinomial probability with the four haplotype counts and probabilities. Specifically,

$$P\left(x,\,y,\,r\,|\,n\right)=\frac{n!}{r!\left(x-r\right)!\left(y-r\right)!(n-x-y+r)!}{p}_{11}^{r}{p}_{10}^{x-r}{p}_{01}^{y-r}{p}_{00}^{n-x-y+r}$$
(8)

Note that the range of r is $$[{\max }\left(0,{x}+y-n\right),{\min }\left(x,{y}\right)]$$. There are only three free parameters for the four haplotype probabilities because their sum is 1. For a direct test of the null hypothesis, we reparametrize the haplotype probabilities using another three parameters $$s={p}_{11}+{p}_{10},{t}={p}_{11}+{p}_{01},{{{{{\rm{and}}}}}}\;\theta =\frac{{p}_{11}{p}_{00}}{{p}_{10}{p}_{01}}$$, where s is the alternate AF of the first variant, t is the alternate allele frequency of the second variant, and θ is the odds ratio specifying the LD strength and direction between two variants. We used a likelihood ratio test, where the null hypothesis was θ ≤1 and the alternative hypothesis was θ >1, i.e., the two variants were more likely to be on the same haplotype. The parameters s and t were treated as nuisance parameters. Specifically, the test statistic is as follows:

$$T=2\,\left(\mathop{{\max }}\limits_{s,\,t,\,\theta \in \left(-{{{{{\rm{\infty }}}}}},+{{{{{\rm{\infty }}}}}}\right)}{\log }{{\mbox{L}}}\left(s,\,t,\,\theta \right)-\mathop{{\max }}\limits_{s,\,t,\,\theta \in \left(-{{{{{\rm{\infty }}}}}},\,1\right)}{\log }{{\mbox{L}}}\left(s,\,t,\,\theta \right)\right)$$

We derived the gradient of the log-likelihood function and used the R package nloptr29 to maximize the log-likelihood under the null hypothesis and full model. For robust maximation, multiple initial parameter values were used. The chi-square distribution with 1 degree of freedom was used to calculate the P values, which appeared to work well, though the asymptotic distribution of the one-sided likelihood ratio test could be better characterized. We used an odds ratio of 1 in the null hypothesis; however, other prespecified odds ratios can be used to test directly whether the LD exceeded a high odds ratio threshold. Due to QC, the $${n}_{i}$$ of each variant can slightly differ, hence in our implementation, it was calculated as the rounded average of all ($${n}_{i}$$)s of the qualified variants within a gene. To accelerate the optimization, we implemented the gradient functions in C++ with Rcpp41.

The above test can also be used to detect LD when full genotypes are observed. The additive coding of genotypes corresponds to n = 2 in the above summary-count-based test. The calculation of the likelihood can be accelerated because there are only nine combinations of genotypes between the two variants. Let $${f}_{ij}=P(x=i,y={j|n}=2)$$, then assuming HWE we have

$${f}_{00}={p}_{00}^{2}{{{{{\rm{;}}}}}}{f}_{01}=2{p}_{00}{p}_{01}{{{{{\rm{;}}}}}}{f}_{02}={p}_{01}^{2}{{{{{\rm{;}}}}}}$$
$${f}_{10}=2{p}_{10}{p}_{00}{{{{{\rm{;}}}}}}{f}_{11}=2{p}_{11}{p}_{00}+2{p}_{10}{p}_{01}{{{{{\rm{;}}}}}}{f}_{12}=2{p}_{11}{p}_{01}{{{{{\rm{;}}}}}}$$
(10)
$${f}_{20}={p}_{10}^{2}{{{{{\rm{;}}}}}}{f}_{21}=2{p}_{11}{p}_{10}{{{{{\rm{;}}}}}}{f}_{22}={p}_{11}^{2}.$$

The log-likelihood of the data is as follows:

$${\log }{{{\mbox{L}}}}={\mathop{\sum }\limits_{i=0}^{2}}{\mathop{\sum }\limits_{j=0}^{2}}{c}_{{ij}}{\log }({f}_{ij})$$
(11)

where $${c}_{{ij}}$$ is the count of genotype combination $$x=i,y=j$$. This full-genotype-based LD test resembled a recently proposed LD test26, but we used a different parametrization. We also implemented this full-genotype-based LD test in CoCoRV, in case users have full genotypes, which will result in higher power to detect LDs.

Accounting for LD between two variants

For variants in high LD, we needed to adjust the procedure of counting qualified samples with rare alleles. When estimating the counts of qualified samples in controls for each group of rare variants in LD, only the variant with the highest AF was kept representing the variants in LD; the rest were excluded. The reason that we exclude redundant variant pairs that are in LD in the controls from the counting of the recessive model is twofold: (1) to match the assumption when estimating the number of samples under the recessive models in controls: rare variants are assumed to be independent; (2) it is less interesting if the recessive pattern in the case samples can be simply explained by LD between two variants in the control population. Supplementary Fig. S16 illustrates the counting process under different models with/without considering LD between pairs of rare variants in LD.

For a quick check of high-LD variants, we precomputed the LD test results (P values and FDRs) from gnomAD and in-house controls and stored those with relatively small P values (e.g., P < 0.1). When estimating sample counts for controls, we regarded variants as LD variants if the LD test result has FDR < 0.05. To be conservative, for each double-heterozygous sample in the cases, we further required the P value of the LD test to be less than a threshold (e.g., P < 0.05 or 0.1). This practice reduced false positives due to strong LD between variants under the recessive or double-heterozygous models.

Detection of high-LD variants in gnomAD exomes

We used our proposed LD-detection method to scan the gnomAD exome-based summary counts in each gene to detect high-LD variants, which share some similarities with MNV14. MNVs are usually detected by direct examination of the sequence reads in aligned bam files. As a validation of the effectiveness of the proposed test, we detected high-LD variants by using the summary counts from the gnomAD exomes and compared them with the reported gnomAD MNVs.

Because we are interested in rare variants annotated with some functions, we focused on variant sets annotated as “stop gain”, “nonsynonymous”, “splicing”, “frameshift_insertion”, or “frameshift_deletion” by ANNOVAR. We restricted variants in the cohort to those with AF ≤0.01, at least 10 alternate allele counts (AC $$\ge$$10), and missingness ≤0.1. Note that the alternate allele count differed from the alternate read count, which is the number of sequencing reads harboring the alternate allele supporting the genotype calls. We filtered out variants in the blacklist, as described in the supplemental text. We required a coverage depth of at least 10 (total reads of both alleles) for 90% of the samples. For each pair of variants within a gene, we applied the LD test by using the summary counts for each ethnicity. In total, about 10 million tests were run. Variant pairs with FDR < 0.05 within each ethnicity were considered significant. For comparison with gnomAD MNVs, we downloaded the coding MNVs detected from gnomAD exomes, where variants’ distances were ≤2 base pairs (BPs). Therefore, we restricted our comparison to high-LD variants with distances ≤2 BPs. We also downloaded the MNV lists based on gnomAD genomes and defined the union of coding MNVs and genome MNVs with distances ≤2 BPs as the full set of MNVs reported in gnomAD with ≤2 BPs.

Comparison of CoCoRV with other methods

TRAPD4 was developed to use gnomAD summary counts, therefore we perform the comparison with CoCoRV using gnomAD summary counts. The schematic diagram of applying TRAPD is shown in Supplementary Fig. S17. TRAPD proposed to use a single annotation QD for variant QC. We followed the description in TRAPD4 to select the best QD threshold pairs for cases and controls by which TRAPD’s estimated inflation factor was closest to 1. Besides the QD-based filtering, we also tried the option “--pass” for variant filtering which is based on the FILTER status. Because the filtering is performed in cases and controls separately, it can create inconsistencies. We used the option “--minAN” to make sure the variant missingness ≤0.1, which can help reduce false positives. The option “--popmaxAF” is used for filtering based on AF_popmax in gnomAD. We use TRAPD’s “--includeinfo” to select variants based on the defined criteria of potential pathogenic variants. Because TRAPD uses the maximal allele frequency among different ethnicities (AF_popmax) to filter variants, we added AF_popmax to filter variants in CoCoRV. Specifically, we used an AF_popmax threshold of 1e–3 in TRAPD for filtering in the final test with the best QD threshold for single-nucleotide variants (SNVs) and indels. For CoCoRV, we used an AF threshold of 1e–3 and AF_popmax threshold of 1e–3. Because TRAPD uses only AC and AN to derive qualified sample counts, we included the analysis using only AC and AN from gnomAD for CoCoRV. In addition, we included the ethnicity-stratified results for CoCoRV. We changed the default one-sided FET to a two-sided FET in TRAPD to match that used in CoCoRV.

Another method ProxECAT16 which pools the counts of functional and non-functional alleles between cases and controls to form a 2 × 2 contingency table for the test of the enrichment of rare variants. However, the R package ProxECAT only provides two functions performing the statistical tests assuming the contingency tables have already been generated. There is no tool for any upstream filtering or generation of the 2 × 2 contingency tables. For a fair comparison, we implemented the pooling method proposed by ProxECAT in CoCoRV, and used the same upstream variant filtering from CoCoRV. The functional alleles were defined the same as that in CoCoRV for potential pathogenic alleles. The non-functional alleles included alleles annotated as synonymous or missense variants with REVEL < 0.2. The test in ProxECAT is essentially the likelihood ratio test given the 2 × 2 contingency table, which might not produce accurate P values when the counts are very small, therefore we replace it with Fisher’s exact test in our analysis.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability

The code for TRAPD is available at https://github.com/mhguo1/TRAPD. The code for ProxECAT is available at https://github.com/hendriau/ProxECAT. The CoCoRV42 code for summary counts-based rare variant association test is available at https://bitbucket.org/Wenan/cocorv/src/master/.

References

1. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

2. Sarnowski, C. et al. Impact of rare and common genetic variants on diabetes diagnosis by hemoglobin A1c in multi-ancestry cohorts: the Trans-Omics for Precision Medicine Program. Am. J. Hum. Genet. 105, 706–718 (2019).

3. Wheway, G., Mitchison, H. M. & Genomics England Research, C. Opportunities and challenges for molecular understanding of ciliopathies—the 100,000 Genomes Project. Front. Genet. 10, 127 (2019).

4. Guo, M. H., Plummer, L., Chan, Y. M., Hirschhorn, J. N. & Lippincott, M. F. Burden testing of rare variants identified through exome sequencing via publicly available control data. Am. J. Hum. Genet. 103, 522–534 (2018).

5. Huang, K. L. et al. Pathogenic germline variants in 10,389 adult cancers. Cell 173, 355–370 e314 (2018).

6. Waszak, S. M. et al. Germline Elongator mutations in Sonic Hedgehog medulloblastoma. Nature 580, 396–401 (2020).

7. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at https://www.biorxiv.org/content/10.1101/201178v3 (2018).

8. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

9. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).

10. Döhler, S., Durand, G. & Roquain, E. New FDR bounds for discrete and heterogeneous tests. Electron. J. Stat. 12, 1867–1900 (2018).

11. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

12. Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).

13. Clayton, D. & Leung, H. T. An R package for analysis of whole-genome association studies. Hum. Hered. 64, 45–51 (2007).

14. Wang, Q. et al. Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nat. Commun. 11, 2539 (2020).

15. Muskens, I. S. et al. Germline genetic landscape of pediatric central nervous system tumors. Neuro Oncol. 21, 1376–1388 (2019).

16. Hendricks, A. E. et al. ProxECAT: proxy external controls association test. A new case-control gene region association test using allele frequencies from public controls. PLoS Genet. 14, e1007591 (2018).

17. Zhang, J. et al. Germline mutations in predisposition genes in pediatric cancer. N. Engl. J. Med. 373, 2336–2346 (2015).

18. Kang, H. M. EPACTS: efficient and parallelizable association container toolbox. http://genome.sph.umich.edu/wiki/EPACTS (2019).

19. Guha, T. & Malkin, D. Inherited TP53 mutations and the Li-Fraumeni syndrome. Cold Spring Harb. Perspect. Med. 7, https://doi.org/10.1101/cshperspect.a026187 (2017).

20. Sontheimer, H. An unexpected role for ion channels in brain tumor metastasis. Exp. Biol. Med. 233, 779–791 (2008).

21. Turner, K. L. & Sontheimer, H. Cl- and K+ channels and their role in primary brain tumour biology. Philos. Trans. R. Soc. Lond. B Biol. Sci. 369, 20130095 (2014).

22. Cirulli, E. T. et al. Exome sequencing in amyotrophic lateral sclerosis identifies risk genes and pathways. Science 347, 1436–1441 (2015).

24. Aschard, H., Vilhjalmsson, B. J., Joshi, A. D., Price, A. L. & Kraft, P. Adjusting for heritable covariates can bias effect estimates in genome-wide association studies. Am. J. Hum. Genet. 96, 329–339 (2015).

25. Weir, B. S. Genetic Data Analysis. 2nd edn. (Sinauer Associates, 1996).

26. Hui, T. J. & Burt, A. Estimating linkage disequilibrium from genotypes under Hardy-Weinberg equilibrium. BMC Genet. 21, 21 (2020).

27. Schaid, D. J. Linkage disequilibrium testing when linkage phase is unknown. Genetics 166, 505–512 (2004).

28. Ito, T. et al. Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am. J. Hum. Genet. 72, 384–398 (2003).

29. Johnson, G. The NLopt nonlinear-optimization package. http://github.com/stevengj/nlopt.

30. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv Prepr. 1207, 3907 (2012).

31. Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).

32. Bhakta, N. et al. The cumulative burden of surviving childhood cancer: an initial report from the St Jude Lifetime Cohort Study (SJLIFE). Lancet 390, 2569–2582 (2017).

33. Beecham, G. W. et al. The Alzheimer’s Disease Sequencing Project: Study design and sample selection. Neurol. Genet. 3, e194 (2017).

34. Fairley, S., Lowy-Gallego, E., Perry, E. & Flicek, P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 48, D941–D947 (2020).

35. Agresti, A. A. Categorical Data Analysis. 3rd edn (Wiley, 2012).

36. ZELEN, M. The analysis of several 2 × 2 contingency tables. Biometrika 58, 129–137 (1971).

37. Mehta, C. R., Patel, N. R. & Gray, R. Computing an exact confidence interval for the common odds ratio in several 2 × 2 contingency tables. J. Am. Stat. Assoc. 80, 969–973 (1985).

38. Yekutieli, D. & Benjamini, Y. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J. Stat. Plan. Inference 82, 171–196 (1999).

39. Benjamini, Y., Kenigsberg, E., Reiner, A. & Yekutieli, D. fdrame: FDR adjustments of microarray experiments (FDR-AME). (R package version 1.64.0., 2021).

40. Döhler, S., Durand, G. & Roquain, E. New FDR bounds for discrete and heterogeneous tests. Electron. J. Stat. 12, 1867–1900 (2018). 1834.

41. Eddelbuettel, D. & Francois, R. Rcpp: seamless R and C plus plus Integration. J. Stat. Softw. 40, 1–18 (2011).

42. CoCoRV: a rare variant analysis framework using public genotype summary counts to prioritize disease-predisposition genes. https://doi.org/10.5281/zenodo.6416062 (2022).

Acknowledgements

This study was supported in part by the National Cancer Institute grant P30 CA021765 [G.W.] and 5U54NS092091-08 [G.W.], American Lebanese Syrian Associated Charities (ALSAC), and the U.S. Public Health Service and National Institutes of Health (contract grant number R35 GM140487) [D.J.S]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We acknowledge the permission granted to use data from the Alzheimer’s Disease Sequencing Project (dbGaP Study Accession phs000572.v8.p4) and The Cancer Genome Atlas (dbGaP Study Accession phs000178.v11.p8). We acknowledge Michael Edmonson’s help with the conversion between plain text-based matrix data format and the VCF format, Jason P. Sinnwell’s help on LD packages, and Angela J. McArthur’s helpful scientific editing.

Author information

Authors

Contributions

W.C. proposed the study design, developed methods, implemented methods, processed the data and performed data analysis, reviewed the results, and drafted the manuscript. S.W. implemented methods and processed the data. S.S.T. implemented methods and processed the data. D.W.E. provided guidance on phenotype subtype information and reviewed the results. D.J.S. reviewed the draft and provided guidance on method development, G.W. proposed the study design, processed the data, reviewed the results, and drafted the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Wenan Chen or Gang Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

Chen, W., Wang, S., Tithi, S.S. et al. A rare variant analysis framework using public genotype summary counts to prioritize disease-predisposition genes. Nat Commun 13, 2592 (2022). https://doi.org/10.1038/s41467-022-30248-0

• Accepted:

• Published:

• DOI: https://doi.org/10.1038/s41467-022-30248-0