Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing

Abstract

The lack of samples for generating standardized DNA datasets for setting up a sequencing pipeline or benchmarking the performance of different algorithms limits the implementation and uptake of cancer genomics. Here, we describe reference call sets obtained from paired tumor–normal genomic DNA (gDNA) samples derived from a breast cancer cell line—which is highly heterogeneous, with an aneuploid genome, and enriched in somatic alterations—and a matched lymphoblastoid cell line. We partially validated both somatic mutations and germline variants in these call sets via whole-exome sequencing (WES) with different sequencing platforms and targeted sequencing with >2,000-fold coverage, spanning 82% of genomic regions with high confidence. Although the gDNA reference samples are not representative of primary cancer cells from a clinical sample, when setting up a sequencing pipeline, they not only minimize potential biases from technologies, assays and informatics but also provide a unique resource for benchmarking ‘tumor-only’ or ‘matched tumor–normal’ analyses.

Fig. 1: Schematic of the bioinformatics pipelines used to define the confidence levels of the somatic mutation call set (see Methods for details).
Fig. 2: Definition and validation of the somatic mutation reference call set.
Fig. 3: Initial definition and validation of germline variants.
Fig. 4: Clonality analysis of the HCC1395 cell line using bulk DNA and DNA from single cells.

Data availability

All raw data (FASTQ files) are available in NCBI’s SRA database (SRP162370). The call set for somatic mutations in HCC1395 cells, VCF files derived from individual WES and WGS runs, BAM files for BWA-MEM alignments and source code are available on NCBI’s FTP site (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG). Some alignment files (BAM) are also available on Seven Bridges’ CGC platform under the SEQC-II project. The gDNA tested in the current study was prepared by ATCC using cell expansions from master banks of cells for the HCC1395 (ATCC, CRL-2324) and HCC1395BL (ATCC, CRL-2325) cell lines. gDNA aliquots from these preparations were distributed to the sequencing centers to perform WGS and WES as described. For remaining gDNA aliquots, contact the corresponding authors. Contact ATCC for additional materials related to the HCC1395 and HCC1395BL cell lines.

Software and code availability

The code to create somatic reference call set v1.2 is deposited on GitHub under a BSD 2-Clause open-source license tagged at https://github.com/bioinform/somaticseq/tree/seqc2_v1.2. A snapshot can also be downloaded at https://github.com/bioinform/somaticseq/archive/seqc2_v1.2.tar.gz.

References

  1. Gall, J. G. Human genome sequencing. Science 233, 1367–1368 (1986).

  2. Garraway, L. A. & Lander, E. S. Lessons from the cancer genome. Cell 153, 17–37 (2013).

  3. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385 (2018).

  4. ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).

  5. Hyman, D. M., Taylor, B. S. & Baselga, J. Implementing genome-driven oncology. Cell 168, 584–599 (2017).

  6. Berger, M. F. & Mardis, E. R. The emerging clinical relevance of genomics in cancer medicine. Nat. Rev. Clin. Oncol. 15, 353–365 (2018).

  7. Hofmann, A. L. et al. Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers. BMC Bioinformatics 18, 8 (2017).

  8. Krøigård, A. B., Thomassen, M., Lænkholm, A.-V., Kruse, T. A. & Larsen, M. J. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLOS ONE 11, e0151664 (2016).

  9. Shi, W. et al. Reliability of whole-exome sequencing for assessing intratumor genetic heterogeneity. Cell Rep. 25, 1446–1457 (2018).

  10. Kim, S. Y. & Speed, T. P. Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics 14, 189 (2013).

  11. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).

  12. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).

  13. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

  14. Xu, H., DiCarlo, J., Satya, R. V., Peng, Q. & Wang, Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 15, 244 (2014).

  15. Chen, Z. et al. Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency. Sci. Rep. 10, 3501 (2020).

  16. WHO Reference Panel 1st International Reference Panel for Genomic KRAS Codons 12 and 13 Mutations NIBSC code: 16/250 (National Institute for Biological Standards and Control, 2020).

  17. Huo, Z., Tu, J., Lee, D.-F. & Zhao, R. Engineering mutation clones in mammalian cells with CRISPR/Cas9. Methods Mol. Biol. 2108, 355–369 (2020).

  18. Ewing, A. D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat. Methods 12, 623–630 (2015).

  19. Lee, A. Y. et al. Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection. Genome Biol. 19, 188 (2018).

  20. Alioto, T. S. et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat. Commun. 6, 10001 (2015).

  21. Craig, D. W. et al. A somatic reference standard for cancer genome sequencing. Sci. Rep. 6, 24607 (2016).

  22. MDIC SRS Report: Somatic Variant Reference Samples for NGS. (Medical Device Innovation Consortium, 2019).

  23. Stephens, P. J. et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 462, 1005–1010 (2009).

  24. Popova, T. et al. Ploidy and large-scale genomic instability consistently identify basal-like breast carcinomas with BRCA1/2 inactivation. Cancer Res. 72, 5454–5462 (2012).

  25. Gazdar, A. F. et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. Int. J. Cancer 78, 766–774 (1998).

  26. Staaf, J. et al. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol. 9, R136 (2008).

  27. Suzuki, T., Tsukumo, Y., Furihata, C., Naito, M. & Kohara, A. Preparation of the standard cell lines for reference mutations in cancer gene-panels by genome editing in HEK 293T/17 cells. Genes Environ. 42, 8 (2020).

  28. Jia, S. et al. A novel cell line generated using the CRISPR/Cas9 technology as universal quality control material for KRAS G12V mutation testing. J. Clin. Lab. Anal. 32, e22391 (2018).

  29. Tian, X. et al. CRISPR/Cas9—an evolving biological tool kit for cancer biology and oncology. NPJ Precis. Oncol. 3, 8 (2019).

  30. Blackburn, J. et al. Use of synthetic DNA spike-in controls (sequins) for human genome sequencing. Nat. Protoc. 14, 2119–2151 (2019).

  31. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  32. Fang, L. T. et al. An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol. 16, 197 (2015).

  33. Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).

  34. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  35. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  36. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).

  37. Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 44, e108 (2016).

  38. Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 17, 178 (2016).

  39. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).

  40. Freed, D., Pan, R. & Aldana, R. TNscope: accurate detection of somatic mutations with haplotype-based variant candidate detection and machine learning filtering. Preprint at bioRxiv https://doi.org/10.1101/250647 (2018).

  41. Sahraeian, S. M. E., Fang, L. T., Mohiyuddin, M., Hong, H. & Xiao, W. Robust cancer mutation detection with deep learning models derived from tumor–normal sequencing data. Preprint at bioRxiv https://doi.org/10.1101/667261 (2019).

  42. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).

  43. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).

  44. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  45. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).

  46. Raine, K. M. et al. ascatNgs: identifying somatically acquired copy-number alterations from whole-genome sequencing data. Curr. Protoc. Bioinformatics 56, 15.9.1–15.9.17 (2016).

  47. Flensburg, C., Sargeant, T., Oshlack, A. & Majewski, I. SuperFreq: integrated mutation detection and clonal tracking in cancer. PLoS Comput. Biol. 16, e1007603 (2020).

  48. Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35 (2015).

  49. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).

  50. Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160 (2014).

  51. Yates, L. R. et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat. Med. 21, 751–759 (2015).

  52. Gerstung, M. et al. The evolutionary history of 2,658 cancers. Nature 578, 122–128 (2020).

  53. Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).

  54. McGranahan, N. & Swanton, C. Clonal heterogeneity and tumor evolution: past, present, and the future. Cell 168, 613–628 (2017).

  55. Choo-Wosoba, H., Albert, P. S. & Zhu, B. A hidden Markov modeling approach for identifying tumor subclones in next-generation sequencing studies. Biostatistics https://doi.org/10.1093/biostatistics/kxaa013 (2020).

  56. Xiao, W. & The Somatic Mutation Working Group of the SEQC-II Consortium. Towards best practice in cancer mutation detection with whole-genome and whole-exome sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-00994-5 (2021).

  57. Zhao, Y. et al. Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study. Preprint at bioRxiv https://doi.org/10.1101/2021.02.27.433136 (2021).

  58. Chen, W. et al. A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-00748-9 (2020).

  59. Chen, X. et al. A multi-center cross-platform single-cell RNA sequencing reference dataset. Sci. Data 8, 39 (2021).

  60. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).

  61. Storchova, Z. & Kuffer, C. The consequences of tetraploidy and aneuploidy. J. Cell Sci. 121, 3859–3866 (2008).

  62. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).

  63. Morrissy, A. S. et al. Spatial heterogeneity in medulloblastoma. Nat. Genet. 49, 780–788 (2017).

  64. Araf, S. et al. Genomic profiling reveals spatial intra-tumor heterogeneity in follicular lymphoma. Leukemia 32, 1261–1265 (2018).

  65. Ben-David, U. et al. Genetic and transcriptional evolution alters cancer cell line drug response. Nature 560, 325–330 (2018).

  66. Abraham, J. in Handbook of Transnational Economic Governance Regimes (eds. Tietje, C. & Brouder, A.) 1041–1053 (Brill Nijhoff, 2010).

  67. Xiao, C. et al. Personalized genome assembly for accurate cancer somatic mutation discovery using cancer-normal paired reference samples. Preprint at bioRxiv https://doi.org/10.1101/2021.04.09.438252 (2021).

  68. Ptashkin, R. N. et al. Prevalence of clonal hematopoiesis mutations in tumor-only clinical genomic profiling of solid tumors. JAMA Oncol. 4, 1589–1593 (2018).

  69. Meisner, L. F. & Johnson, J. A. Protocols for cytogenetic studies of human embryonic stem cells. Methods 45, 133–141 (2008).

  70. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).

  71. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).

Acknowledgements

We thank J. Zook of the National Institute of Standards and Technology for advice in establishing reference samples and truth sets, S. Gowrisankar of Novartis and S. Chacko of the Center for Information Technology, NIH, for their assistance with data transfer and J. Ye of Sentieon for providing the Sentieon software package. We also thank D. Goldstein of the Office of Technology and Science at the National Cancer Institute (NCI), NIH, and L. Amundadottir of the Division of Cancer Epidemiology and Genetics, NCI, NIH, for the sponsorship and the use of the NIH Biowulf cluster, R. Phillip, Y. Hu, S. Liang and Y. Li of the Center for Devices and Radiological Health, US FDA, for their advice on study design and manuscript writing, J. Collins and E. Stahlberg of Biomedical Informatics and Data Science Directorate at Frederick National Laboratory for Cancer Research for reviewing the manuscript and providing suggestions and Seven Bridges Genomics for providing storage and computational support on the CGC. B.Z. was supported by the Intramural Research Program of the NIH, NCI, Division of Cancer Epidemiology and Genetics. Y. Zhao, K.T., T.S., B.T., J.S. and Y.K. were supported by the Frederick National Laboratory for Cancer Research and through the NIH fund (contract number 75N910D00024). L. Shi and Y. Zheng were supported by the National Natural Science Foundation of China (31720103909), the National Key R&D Project of China (2018YFE0201600) and Shanghai Municipal Science and Technology Major Project (2017SHZDZX01). E.R. was supported by the European Union through the European Regional Development Fund (2014-2020.4.01.15-0012). The CGC has been funded in whole, or in part, by Federal funds from the NCI, NIH (HHSN261201400008C), and ID/IQ Agreement number 17×146 under contract number HHSN261201500003I. C.X. and S. Sherry were supported by the Intramural Research Program of the National Library of Medicine, NIH. 
This work also used the computational resources of the NIH Biowulf cluster (http://hpc.nih.gov). Original data were also backed up on the servers provided by Center for Biomedical Informatics and Information Technology (CBIIT), NCI. The genomic work performed at the Loma Linda University (LLU) Center for Genomics was funded in part by the NIH grant S10OD019960, the American Heart Association grant 18IPA34170301, the Ardmore Institute of Health grant 2150141 and C.A. Sims’ gift to LLU Center for Genomics. We acknowledge TopEdit for linguistic editing and proofreading during the preparation of this manuscript.

Author information

Contributions

The study was conceived and designed by W.X., C.W., L. Shi, H.H., E.D., Z.T., B.N., W.T. and R.J. Biosample preparation was performed by L.K., K.L., M. Mars and T.H. NGS library preparation and sequencing was performed by W.C., Z.C., S. Stanbouly, K.I., H.J., E.J., G.P.S., Y. Zheng, B.T., Y.Y., J.S., Y.K., M. Mehat, V.P., M.S., T.H., E.P., R.K., J.D., P.V., R.M., D.G., S.K., E.R., A.S., J.N., U.L., J.W., J.L., P.D.H., C.C., S.M., J.S., J.F., D.B. and C.E.M. Data analysis was performed by L.T.F., W.X., B.Z., Y. Zhao, Z.Y., L.R., C.L., O.D.A., L. Song, J.L., T.S., K.T., D.M., C.N., M.C., S.M.E.S., M. Mohiyuddin, Y.G., L.Y., H.L., M.P., Z.L., W.S.L., J.K., J.A., E.T., V.Z., T.M. and J.T. Data management was performed by W.X., C.X. and S.T.S. The manuscript was written by L.T.F., B.Z., W.X., R.K., M. Moos, C.X., S.T.S. and Y. Zhao. W.X. managed the project.

Corresponding authors

Correspondence to Huixiao Hong, Leming Shi, Charles Wang or Wenming Xiao.

Ethics declarations

Competing interests

L.T.F., S.M.E.S., M. Mohiyuddin, Y.G., L.Y. and H.L. are employees of Roche Sequencing Solutions Inc. L.K., K.L. and M. Mars are employees of ATCC, which provides cell lines and derivative materials. E.J., G.P.S. and O.D.A. are employees of Illumina Inc. V.P. and M.S. are employees of Novartis Institutes for Biomedical Research. T.H., E.P. and R. Kalamegham are employees of Genentech (a member of the Roche group). Z.L. is an employee of Sentieon Inc. R. Kusko is an employee of Immuneering Corp. C.C., S.M. and J.S. are employees of 10x Genomics. All other authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 3D scatter plot shows the consistency of SomaticSeq and NeuSomatic classification of somatic variant calls.

3D scatter plot for number of PASS classifications by SomaticSeq, NeuSomatic-E, and VAF for (a) SNV (R = 0.997) and (b) indel calls (R = 0.925). (c) The subset of SNV calls that were re-sequenced by AmpliSeq. Solid markers are deemed ‘validated.’ Open markers are deemed ‘not validated.’ Stars/crosses are deemed uninterpretable. HighConf calls generally have many PASS calls and a full range of VAF. MedConf have fewer PASS calls and tend to have lower VAF. Unclassified calls have a full range of VAF, which means their somatic signals were poor-quality.

Extended Data Fig. 2 Genome coverage and high-confidence regions.

(a) Genome coverage by reads from three technologies. Inner track: PacBio. Middle track: 10X Genomics. Outer track: Illumina HiSeq. (b) Genomic regions covered by short reads, in comparison to NA12878. Outer black track: gene density plot. Middle orange track: NA12878. Inner blue track: callable regions in HCC1395.

Extended Data Fig. 3 Validation of somatic indels.

(a) Validation of indels by AmpliSeq. R = 0.989 for HighConf calls. (b) Validation of indels by WES with Ion Torrent. R = 0.767 for HighConf calls. (c) Validation of indels by WES with HiSeq. R = 0.990 for HighConf calls. (d) Histogram of the lengths of the somatic indels in the reference call set. The dashed lines along the diagonal in (a), (b) and (c) mark the 95% binomial confidence interval of the observed VAF given the actual VAF, calculated at depths of 2,000× for AmpliSeq, 34× for Ion Torrent WES and 100× for HiSeq WES, respectively.
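
The binomial interval mentioned in this caption can be illustrated with a short sketch. It assumes that the number of variant-supporting reads follows a binomial distribution with the sequencing depth as the number of trials and the actual VAF as the success probability, and it reports a central interval. The function names are hypothetical, and this is not necessarily the exact procedure the authors used.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed in log space to avoid overflow at high depth."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def observed_vaf_interval(actual_vaf: float, depth: int, level: float = 0.95):
    """Central interval of the observed VAF when variant reads ~ Binomial(depth, actual_vaf).

    Assumption: the bounds are the binomial quantiles at (1 - level) / 2 and
    1 - (1 - level) / 2, divided by the depth.
    """
    if not 0.0 < actual_vaf < 1.0:
        raise ValueError("actual_vaf must lie strictly between 0 and 1")
    tail = (1.0 - level) / 2.0
    lower = upper = None
    cdf = 0.0
    for k in range(depth + 1):
        cdf += binom_pmf(k, depth, actual_vaf)
        if lower is None and cdf >= tail:
            lower = k  # smallest read count whose cumulative probability reaches the lower tail
        if cdf >= 1.0 - tail:
            upper = k  # smallest read count whose cumulative probability reaches the upper tail
            break
    if upper is None:  # guard against floating-point rounding in the cumulative sum
        upper = depth
    return lower / depth, upper / depth


if __name__ == "__main__":
    # Depths quoted in the caption; the platform labels are only for display.
    for platform, depth in (("AmpliSeq", 2000), ("Ion Torrent WES", 34), ("HiSeq WES", 100)):
        low, high = observed_vaf_interval(0.5, depth)
        print(f"{platform:16s} depth={depth:5d}  interval=({low:.3f}, {high:.3f})")
```

Under this assumption the interval narrows sharply as depth increases, which is consistent with the dashed lines lying much closer to the diagonal for AmpliSeq than for Ion Torrent WES.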

Extended Data Fig. 4 Validation of germline indels.

Scatter plots comparing the VAF of germline indels in the call set with the VAF confirmed by validation sequencing. (a) VAF scatter plot of germline indels in the WGS super set versus AmpliSeq. (b) VAF scatter plot of germline indels in the truth set versus Ion Torrent WES.

Extended Data Fig. 5 Karyotyping of HCC1395 and HCC1395BL.

(a) Karyotype of HCC1395. Cytogenetic analysis was performed on ten G-banded metaphase cells from HCC1395. Analysis pointed to a hypertetraploid line with chromosome counts ranging from 64 to 79 and a gain of 38 to 63 unidentifiable marker chromosomes. (b) Karyotype of HCC1395BL. Cytogenetic analysis was performed on ten G-banded metaphase cells from HCC1395BL. All ten cells showed loss of a chrX and an unbalanced whole-arm translocation between the long arm of chr6 at band q10 and the short arm of chr16 at band p10. This resulted in a net loss of one copy of the short arm of chr6 and one copy of the long arm of chr16. The abnormal chromosome could be placed at either a chr6 or a chr16 locus because we were unable to determine whether the centromere belongs to chr6 or chr16 (inset figure).

Extended Data Fig. 6 Cytogenetic analysis with Affymetrix Cytoscan HD microarray.

(a) Cytogenetic view of HCC1395. (b) Cytogenetic view of HCC1395BL. The losses of chr6p, chr16q, and chrX were confirmed.

Extended Data Fig. 7 Variant allele frequencies across the genome.

(a) VAF of truth set germline SNVs in HCC1395BL. The copy numbers of HCC1395BL were predicted by Affymetrix Cytoscan HD microarray. (b) VAF of the truth set germline SNV positions (discovered in HCC1395BL) in HCC1395. (c) VAF of the truth set somatic SNVs in HCC1395. The copy numbers of HCC1395 were predicted by ascatNgs.

Extended Data Fig. 8 Variant allele frequencies of somatic mutations.

(a) VAFs of somatic SNVs and indels in the reference call sets. (b) VAFs of reference SNVs in different copy number states as predicted by ascatNgs.

Extended Data Fig. 9 Tumor sample HCC1395 CNV and Clonality Analysis.

(a) Clonality analysis from WES data using SuperFreq for tumor cell line HCC1395. The clonality of each somatic SNV was calculated based on the VAF, accounting for local copy number. The SNVs and CNAs were evaluated with hierarchical clustering based on the clonality and uncertainty across replicates for HCC1395. The river plot shows the relative distribution of multiple subclones in HCC1395. The main cancer clone (blue) and two subclones (red and green) appeared early in clonal evolution, whereas another subclone (orange) and its descendant (pink) appeared late in clonal evolution. (b) The main-clonal and sub-clonal somatic copy number profiles inferred using subHMM38 from the Illumina WGS data set. Main-clonal genotype: upper panel; sub-clonal genotype: middle panel; sub-clonal proportion: bottom bar plot. Each colored block represents the genotype of somatic copy number alterations (SCNAs) at the corresponding position of the chromosome. The chromosomes are separated by vertical dashed lines. Genotype of SCNAs: deletion (DEL), homozygous deletion (HOMD), hemizygous deletion loss of heterozygosity (DLOH), copy neutral loss of heterozygosity (NLOH), diploid heterozygous (HET), gain of one allele (GAIN), amplified loss of heterozygosity (ALOH), allele-specific copy number amplification (ASCNA), balanced copy number amplification (BCNA), and unbalanced copy number amplification (UBCNA).
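
To make the idea of deriving clonality from VAF and local copy number concrete, the sketch below applies the commonly used cancer cell fraction relation, VAF = purity × m × CCF / (purity × CN_tumor + (1 − purity) × CN_normal). This is an illustrative assumption and not SuperFreq's algorithm. The function name, the parameter names and the default purity of 1 (plausible for a cell line with no normal-cell admixture) are all hypothetical.

```python
def cancer_cell_fraction(
    vaf: float,
    total_copies: int,
    mutated_copies: int = 1,
    purity: float = 1.0,
    normal_copies: int = 2,
) -> float:
    """Estimate the fraction of tumor cells carrying a mutation.

    Assumed model: expected VAF = purity * mutated_copies * CCF /
    (purity * total_copies + (1 - purity) * normal_copies).
    Values above 1 are not clamped; they suggest noise or a wrong
    guess for mutated_copies.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if total_copies < 1 or not 1 <= mutated_copies <= total_copies:
        raise ValueError("need 1 <= mutated_copies <= total_copies")
    # Average number of copies of the locus per cell in the sequenced mixture.
    average_copies = purity * total_copies + (1.0 - purity) * normal_copies
    return vaf * average_copies / (purity * mutated_copies)


if __name__ == "__main__":
    # A mutation on one of four copies seen at VAF 0.25 is consistent with a clonal event.
    print(cancer_cell_fraction(0.25, total_copies=4))
    # The same locus at VAF 0.125 is consistent with a subclone in about half of the cells.
    print(cancer_cell_fraction(0.125, total_copies=4))
```

The same VAF can therefore imply very different clonality in regions with different copy number, which is why an aneuploid line such as HCC1395 needs copy-number-aware estimates.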

Extended Data Fig. 10 Number of somatic mutations detected in HCC1395 and 560 triple negative and non-triple negative breast cancers from previous literature.

Number of somatic mutations detected in HCC1395 and 560 triple negative and non-triple negative breast cancers from previous literature59.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Tables 1–10 and sections on (1) somatic mutations, (2) germline variants, (3) detailed method to produce the reference call set, (4) software used and (5) statistical methods.

Reporting Summary

Supplementary Data 1

Tables of manually curated somatic variant calls: a record of manually inspected variant sites.

Supplementary Data 2

Tables of somatic coding variants and germline variants in ClinVar.

Supplementary Data 3

Tables of somatic variant calls where the validation platform showed discrepancy with the reference data sets.

About this article

Cite this article

Fang, L.T., Zhu, B., Zhao, Y. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat Biotechnol 39, 1151–1160 (2021). https://doi.org/10.1038/s41587-021-00993-6

  • DOI: https://doi.org/10.1038/s41587-021-00993-6