The lack of samples for generating standardized DNA datasets for setting up a sequencing pipeline or benchmarking the performance of different algorithms limits the implementation and uptake of cancer genomics. Here, we describe reference call sets obtained from paired tumor–normal genomic DNA (gDNA) samples derived from a breast cancer cell line—which is highly heterogeneous, with an aneuploid genome, and enriched in somatic alterations—and a matched lymphoblastoid cell line. We partially validated both somatic mutations and germline variants in these call sets via whole-exome sequencing (WES) with different sequencing platforms and targeted sequencing with >2,000-fold coverage, spanning 82% of genomic regions with high confidence. Although the gDNA reference samples are not representative of primary cancer cells from a clinical sample, when setting up a sequencing pipeline, they not only minimize potential biases from technologies, assays and informatics but also provide a unique resource for benchmarking ‘tumor-only’ or ‘matched tumor–normal’ analyses.
Data availability
All raw data (FASTQ files) are available in NCBI’s SRA database (SRP162370). The call set for somatic mutations in HCC1395 cells, VCF files derived from individual WES and WGS runs, BAM files for BWA-MEM alignments and source code are available on NCBI’s FTP site (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG). Some alignment files (BAM) are also available on Seven Bridges’ CGC platform under the SEQC-II project. The gDNA tested in the current study was prepared by ATCC using cell expansions from master banks of cells for the HCC1395 (ATCC, CRL-2324) and HCC1395BL (ATCC, CRL-2325) cell lines. gDNA aliquots from these preparations were distributed to the sequencing centers to perform WGS and WES as described. For remaining gDNA aliquots, contact the corresponding authors. Contact ATCC for additional materials related to the HCC1395 and HCC1395BL cell lines.
Software and code availability
The code to create somatic reference call set v1.2 is deposited on GitHub under a BSD 2-Clause open-source license tagged at https://github.com/bioinform/somaticseq/tree/seqc2_v1.2. A snapshot can also be downloaded at https://github.com/bioinform/somaticseq/archive/seqc2_v1.2.tar.gz.
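As a convenience, the snapshot above could be fetched and inspected with a few lines of Python. This is only a minimal sketch: the URL is the one given in the text, whereas the function names `archive_name` and `list_snapshot_members` and the default download directory are hypothetical choices made for illustration.

```python
import tarfile
import urllib.request
from pathlib import Path

# Snapshot URL quoted in the code availability statement.
SNAPSHOT_URL = "https://github.com/bioinform/somaticseq/archive/seqc2_v1.2.tar.gz"


def archive_name(url: str) -> str:
    """Return the file name component of a download URL."""
    return url.rsplit("/", 1)[-1]


def list_snapshot_members(dest_dir: str = "downloads", url: str = SNAPSHOT_URL) -> list:
    """Download the tarball (if not already present) and list the paths it contains.

    dest_dir is a hypothetical local directory, not something the article specifies.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / archive_name(url)
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    # Requires network access; prints the first few paths in the archive.
    for member in list_snapshot_members()[:10]:
        print(member)
```

Listing the members before extracting anything makes it easy to check what the archive contains.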
References
Gall, J. G. Human genome sequencing. Science 233, 1367–1368 (1986).
Garraway, L. A. & Lander, E. S. Lessons from the cancer genome. Cell 153, 17–37 (2013).
Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385 (2018).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Hyman, D. M., Taylor, B. S. & Baselga, J. Implementing genome-driven oncology. Cell 168, 584–599 (2017).
Berger, M. F. & Mardis, E. R. The emerging clinical relevance of genomics in cancer medicine. Nat. Rev. Clin. Oncol. 15, 353–365 (2018).
Hofmann, A. L. et al. Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers. BMC Bioinformatics 18, 8 (2017).
Krøigård, A. B., Thomassen, M., Lænkholm, A.-V., Kruse, T. A. & Larsen, M. J. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLOS ONE 11, e0151664 (2016).
Shi, W. et al. Reliability of whole-exome sequencing for assessing intratumor genetic heterogeneity. Cell Rep. 25, 1446–1457 (2018).
Kim, S. Y. & Speed, T. P. Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics 14, 189 (2013).
Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).
Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
Xu, H., DiCarlo, J., Satya, R. V., Peng, Q. & Wang, Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 15, 244 (2014).
Chen, Z. et al. Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency. Sci. Rep. 10, 3501 (2020).
WHO Reference Panel 1st International Reference Panel for Genomic KRAS Codons 12 and 13 Mutations NIBSC code: 16/250 (National Institute for Biological Standards and Control, 2020).
Huo, Z., Tu, J., Lee, D.-F. & Zhao, R. Engineering mutation clones in mammalian cells with CRISPR/Cas9. Methods Mol. Biol. 2108, 355–369 (2020).
Ewing, A. D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat. Methods 12, 623–630 (2015).
Lee, A. Y. et al. Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection. Genome Biol. 19, 188 (2018).
Alioto, T. S. et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat. Commun. 6, 10001 (2015).
Craig, D. W. et al. A somatic reference standard for cancer genome sequencing. Sci. Rep. 6, 24607 (2016).
MDIC SRS Report: Somatic Variant Reference Samples for NGS. (Medical Device Innovation Consortium, 2019).
Stephens, P. J. et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 462, 1005–1010 (2009).
Popova, T. et al. Ploidy and large-scale genomic instability consistently identify basal-like breast carcinomas with BRCA1/2 inactivation. Cancer Res. 72, 5454–5462 (2012).
Gazdar, A. F. et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. Int. J. Cancer 78, 766–774 (1998).
Staaf, J. et al. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol. 9, R136 (2008).
Suzuki, T., Tsukumo, Y., Furihata, C., Naito, M. & Kohara, A. Preparation of the standard cell lines for reference mutations in cancer gene-panels by genome editing in HEK 293T/17 cells. Genes Environ. 42, 8 (2020).
Jia, S. et al. A novel cell line generated using the CRISPR/Cas9 technology as universal quality control material for KRAS G12V mutation testing. J. Clin. Lab. Anal. 32, e22391 (2018).
Tian, X. et al. CRISPR/Cas9—an evolving biological tool kit for cancer biology and oncology. NPJ Precis. Oncol. 3, 8 (2019).
Blackburn, J. et al. Use of synthetic DNA spike-in controls (sequins) for human genome sequencing. Nat. Protoc. 14, 2119–2151 (2019).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Fang, L. T. et al. An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol. 16, 197 (2015).
Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 44, e108 (2016).
Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 17, 178 (2016).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
Freed, D., Pan, R. & Aldana, R. TNscope: accurate detection of somatic mutations with haplotype-based variant candidate detection and machine learning filtering. Preprint at bioRxiv https://doi.org/10.1101/250647 (2018).
Sahraeian, S. M. E., Fang, L. T., Mohiyuddin, M., Hong, H. & Xiao, W. Robust cancer mutation detection with deep learning models derived from tumor–normal sequencing data. Preprint at bioRxiv https://doi.org/10.1101/667261 (2019).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
Raine, K. M. et al. ascatNgs: identifying somatically acquired copy-number alterations from whole-genome sequencing data. Curr. Protoc. Bioinformatics 56, 15.9.1–15.9.17 (2016).
Flensburg, C., Sargeant, T., Oshlack, A. & Majewski, I. SuperFreq: integrated mutation detection and clonal tracking in cancer. PLoS Comput. Biol. 16, e1007603 (2020).
Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35 (2015).
Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).
Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160 (2014).
Yates, L. R. et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat. Med. 21, 751–759 (2015).
Gerstung, M. et al. The evolutionary history of 2,658 cancers. Nature 578, 122–128 (2020).
Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).
McGranahan, N. & Swanton, C. Clonal heterogeneity and tumor evolution: past, present, and the future. Cell 168, 613–628 (2017).
Choo-Wosoba, H., Albert, P. S. & Zhu, B. A hidden Markov modeling approach for identifying tumor subclones in next-generation sequencing studies. Biostatistics https://doi.org/10.1093/biostatistics/kxaa013 (2020).
Xiao, W. & The Somatic Mutation Working Group of the SEQC-II Consortium. Towards best practice in cancer mutation detection with whole-genome and whole-exome sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-00994-5 (2021).
Zhao, Y. et al. Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study. Preprint at bioRxiv https://doi.org/10.1101/2021.02.27.433136 (2021).
Chen, W. et al. A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-00748-9 (2020).
Chen, X. et al. A multi-center cross-platform single-cell RNA sequencing reference dataset. Sci. Data 8, 39 (2021).
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
Storchova, Z. & Kuffer, C. The consequences of tetraploidy and aneuploidy. J. Cell Sci. 121, 3859–3866 (2008).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Morrissy, A. S. et al. Spatial heterogeneity in medulloblastoma. Nat. Genet. 49, 780–788 (2017).
Araf, S. et al. Genomic profiling reveals spatial intra-tumor heterogeneity in follicular lymphoma. Leukemia 32, 1261–1265 (2018).
Ben-David, U. et al. Genetic and transcriptional evolution alters cancer cell line drug response. Nature 560, 325–330 (2018).
Abraham, J. in Handbook of Transnational Economic Governance Regimes (eds. Tietje, C. & Brouder, A.) 1041–1053 (Brill Nijhoff, 2010).
Xiao, C. et al. Personalized genome assembly for accurate cancer somatic mutation discovery using cancer-normal paired reference samples. Preprint at bioRxiv https://doi.org/10.1101/2021.04.09.438252 (2021).
Ptashkin, R. N. et al. Prevalence of clonal hematopoiesis mutations in tumor-only clinical genomic profiling of solid tumors. JAMA Oncol. 4, 1589–1593 (2018).
Meisner, L. F. & Johnson, J. A. Protocols for cytogenetic studies of human embryonic stem cells. Methods 45, 133–141 (2008).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).
Acknowledgements
We thank J. Zook of the National Institute of Standards and Technology for advice in establishing reference samples and truth sets, S. Gowrisankar of Novartis and S. Chacko of the Center for Information Technology, NIH, for their assistance with data transfer and J. Ye of Sentieon for providing the Sentieon software package. We also thank D. Goldstein of the Office of Technology and Science at the National Cancer Institute (NCI), NIH, and L. Amundadottir of the Division of Cancer Epidemiology and Genetics, NCI, NIH, for the sponsorship and the use of the NIH Biowulf cluster, R. Phillip, Y. Hu, S. Liang and Y. Li of the Center for Devices and Radiological Health, US FDA, for their advice on study design and manuscript writing, J. Collins and E. Stahlberg of Biomedical Informatics and Data Science Directorate at Frederick National Laboratory for Cancer Research for reviewing the manuscript and providing suggestions and Seven Bridges Genomics for providing storage and computational support on the CGC. B.Z. was supported by the Intramural Research Program of the NIH, NCI, Division of Cancer Epidemiology and Genetics. Y. Zhao, K.T., T.S., B.T., J.S. and Y.K. were supported by the Frederick National Laboratory for Cancer Research and through the NIH fund (contract number 75N910D00024). L. Shi and Y. Zheng were supported by the National Natural Science Foundation of China (31720103909), the National Key R&D Project of China (2018YFE0201600) and Shanghai Municipal Science and Technology Major Project (2017SHZDZX01). E.R. was supported by the European Union through the European Regional Development Fund (2014-2020.4.01.15-0012). The CGC has been funded in whole, or in part, by Federal funds from the NCI, NIH (HHSN261201400008C), and ID/IQ Agreement number 17X146 under contract number HHSN261201500003I. C.X. and S. Sherry were supported by the Intramural Research Program of the National Library of Medicine, NIH.
This work also used the computational resources of the NIH Biowulf cluster (http://hpc.nih.gov). Original data were also backed up on the servers provided by Center for Biomedical Informatics and Information Technology (CBIIT), NCI. The genomic work performed at the Loma Linda University (LLU) Center for Genomics was funded in part by the NIH grant S10OD019960, the American Heart Association grant 18IPA34170301, the Ardmore Institute of Health grant 2150141 and C.A. Sims’ gift to LLU Center for Genomics. We acknowledge TopEdit for linguistic editing and proofreading during the preparation of this manuscript.
Competing interests
L.T.F., S.M.E.S., M. Mohiyuddin, Y.G., L.Y. and H.L. are employees of Roche Sequencing Solutions Inc. L.K., K.L. and M. Mars are employees of ATCC, which provides cell lines and derivative materials. E.J., G.P.S. and O.D.A. are employees of Illumina Inc. V.P. and M.S. are employees of Novartis Institutes for Biomedical Research. T.H., E.P. and R. Kalamegham are employees of Genentech (a member of the Roche group). Z.L. is an employee of Sentieon Inc. R. Kusko is an employee of Immuneering Corp. C.C., S.M. and J.S. are employees of 10x Genomics. All other authors declare no competing interests.
Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Extended Data Fig. 1 3D scatter plot shows the consistency of SomaticSeq and NeuSomatic classification of somatic variant calls.
3D scatter plot for number of PASS classifications by SomaticSeq, NeuSomatic-E, and VAF for (a) SNV (R = 0.997) and (b) indel calls (R = 0.925). (c) The subset of SNV calls that were re-sequenced by AmpliSeq. Solid markers are deemed ‘validated.’ Open markers are deemed ‘not validated.’ Stars/crosses are deemed uninterpretable. HighConf calls generally have many PASS calls and a full range of VAF. MedConf have fewer PASS calls and tend to have lower VAF. Unclassified calls have a full range of VAF, which means their somatic signals were poor-quality.
(a) Genome coverage by reads from three technologies. Inner track: PacBio. Middle track: 10X Genomics. Outer track: Illumina HiSeq. (b) Coverage of genome regions by short reads in comparison with NA12878. Outer black track: gene density plot. Middle orange track: NA12878. Inner blue track: the callable regions in HCC1395.
(a) Validation of indels by AmpliSeq. R = 0.989 for HighConf calls. (b) Validation of indels by WES with Ion Torrent. R = 0.767 for HighConf calls. (c) Validation of indels by WES with HiSeq. R = 0.990 for HighConf calls. (d) Histogram of the lengths of the somatic indels in the reference call set. The dashed lines along the diagonal in (a), (b) and (c) mark the 95% binomial confidence interval of the observed VAF given the actual VAF, calculated for depths of 2000X (AmpliSeq), 34X (Ion Torrent WES) and 100X (HiSeq WES), respectively.
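To illustrate how such a band depends on depth, the following is a minimal sketch that treats the variant-supporting read count as binomial with the actual VAF as success probability and reports the equal-tailed 95% range of the observed VAF. The caption does not state which interval construction the authors used, so the quantile-based interval and the function names here are assumptions; only the three depths are taken from the caption.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed in log space so depth 2000 does not overflow."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def binom_quantile(q: float, n: int, p: float) -> int:
    """Smallest read count k such that P(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= q:
            return k
    return n


def observed_vaf_interval(depth: int, actual_vaf: float, level: float = 0.95):
    """Equal-tailed range of the observed VAF at a given depth and actual VAF."""
    tail = (1.0 - level) / 2.0
    low = binom_quantile(tail, depth, actual_vaf)
    high = binom_quantile(1.0 - tail, depth, actual_vaf)
    return low / depth, high / depth


if __name__ == "__main__":
    # Depths quoted in the caption; the actual VAF of 0.2 is an arbitrary example value.
    for platform, depth in [("AmpliSeq", 2000), ("Ion Torrent WES", 34), ("HiSeq WES", 100)]:
        low, high = observed_vaf_interval(depth, 0.2)
        print(f"{platform:16s} depth={depth:5d}  observed VAF in [{low:.3f}, {high:.3f}]")
```

The band is far wider at 34X than at 2000X, so the same true VAF scatters much more on the Ion Torrent panel than on the AmpliSeq panel.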
Germline indel scatter plots comparing VAF in the WGS super set with VAF confirmed by validation sequencing. (a) VAF scatter plot of germline indels by WGS super set and AmpliSeq. (b) VAF scatter plot of germline indels by truth set and Ion Torrent WES.
(a) Karyotype of HCC1395. Cytogenetic analysis was performed on ten G-banded metaphase cells from HCC1395. Analysis pointed to a hypertetraploid line with chromosome counts ranging from 64 to 79 and gain of 38-63 unidentifiable marker chromosomes. (b) Karyotype of HCC1395BL. Cytogenetic analysis was performed on ten G-banded metaphase cells from HCC1395BL. All ten cells showed loss of a chrX and an unbalanced whole-arm translocation between the long arm of chr6 at band q10 and the short arm of chr16 at band p10. This resulted in a net loss of one copy of the short arm of chr6 and loss of one copy of the long arm of chr16. The abnormal chromosome could be placed in either a chr6 or chr16 locus because we were unable to determine whether the centromere belongs to chr6 or chr16 (inset figure).
Cytogenetic analysis with Affymetrix Cytoscan HD microarray. (a) Cytogenetic view of HCC1395. (b) Cytogenetic view of HCC1395BL. The losses of chr6p, chr16q, and chrX were confirmed.
(a) VAF of truth set germline SNVs in HCC1395BL. The copy numbers of HCC1395BL were predicted by Affymetrix Cytoscan HD microarray. (b) VAF of the truth set germline SNV positions (discovered in HCC1395BL) in HCC1395. (c) VAF of the truth set somatic SNVs in HCC1395. The copy numbers of HCC1395 were predicted by ascatNgs.
(a) VAFs of somatic SNVs and indels in the reference call sets. (b) VAFs of reference SNVs in different copy number states as predicted by ascatNgs.
Tumor sample HCC1395 CNV and clonality analysis. (a) Clonality analysis from WES data using SuperFreq for tumor cell line HCC1395. The clonality of each somatic SNV was calculated based on the VAF, accounting for local copy number. The SNVs and CNAs were evaluated with hierarchical clustering based on the clonality and uncertainty across replicates for HCC1395. The river plot shows the relative distribution of multiple subclones in HCC1395. The main cancer clone (blue) and the two subclones (red and green) appeared early in clonal evolution, whereas the orange subclone and its descendant (peak) appeared late in clonal evolution. (b) The main-clonal and sub-clonal somatic copy number profiles using subHMM (ref. 38) from the Illumina WGS data set. Main-clonal genotype: upper panel; sub-clonal genotype: middle panel; sub-clonal proportion: bottom bar plot. Each colored block represents the genotype of somatic copy number alterations (SCNAs) in the corresponding position of the chromosome. The chromosomes are separated by vertical dashed lines. Genotype of SCNAs: deletion (DEL), homozygous deletion (HOMD), hemizygous deletion loss of heterozygosity (DLOH), copy neutral loss of heterozygosity (NLOH), diploid heterozygous (HET), gain of one allele (GAIN), amplified loss of heterozygosity (ALOH), allele-specific copy number amplification (ASCNA), balanced copy number amplification (BCNA), and unbalanced copy number amplification (UBCNA).
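The idea of deriving clonality from VAF and local copy number can be sketched with the textbook relation between VAF and the fraction of cancer cells carrying a mutation. This is not SuperFreq's actual model, which the caption does not describe; the function name, the `multiplicity` parameter (number of mutated copies per cell) and the default purity of 1.0 (a reasonable assumption for a cell line) are illustrative assumptions.

```python
def cancer_cell_fraction(
    vaf: float,
    tumor_cn: int,
    multiplicity: int = 1,
    purity: float = 1.0,
    normal_cn: int = 2,
) -> float:
    """Estimate the fraction of cancer cells carrying a mutation.

    Rearranges the expected-VAF relation
        VAF = purity * multiplicity * CCF / (purity * tumor_cn + (1 - purity) * normal_cn)
    to solve for CCF. Values above 1 indicate that the assumed
    multiplicity or copy number is too low for the observed VAF.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1 or tumor_cn < multiplicity:
        raise ValueError("multiplicity must be between 1 and tumor_cn")
    # Average number of copies of the locus per cell in the sequenced sample.
    average_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    return vaf * average_copies / (purity * multiplicity)


if __name__ == "__main__":
    # In a pure sample, a single mutated copy at a four-copy locus seen at VAF 0.25
    # is consistent with every cell carrying the mutation.
    print(cancer_cell_fraction(0.25, tumor_cn=4))
    # The same locus at VAF 0.125 suggests the mutation is in about half of the cells.
    print(cancer_cell_fraction(0.125, tumor_cn=4))
```

The same VAF therefore implies different clonality at different copy number states, which is why the local copy number has to be accounted for before SNVs are clustered into clones.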
Extended Data Fig. 10 Number of somatic mutations detected in HCC1395 and 560 triple negative and non-triple negative breast cancers from previous literature.
Number of somatic mutations detected in HCC1395 and 560 triple negative and non-triple negative breast cancers from previous literature (ref. 59).
Supplementary Figs. 1–7, Tables 1–10 and sections on (1) somatic mutations, (2) germline variants, (3) detailed method to produce the reference call set, (4) software used and (5) statistical methods.
Tables of manually curated somatic variant calls: a record of manually inspected variant sites.
Tables of somatic coding variants and germline variants in ClinVar.
Tables of somatic variant calls where the validation platform showed discrepancy with the reference data sets.
About this article
Cite this article
Fang, L.T., Zhu, B., Zhao, Y. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat Biotechnol 39, 1151–1160 (2021). https://doi.org/10.1038/s41587-021-00993-6