Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Sketching algorithms for genomic data analysis and querying in a secure enclave

Abstract

Genome-wide association studies (GWAS), especially on rare diseases, may necessitate exchange of sensitive genomic data between multiple institutions. Since genomic data sharing is often infeasible due to privacy concerns, cryptographic methods, such as secure multiparty computation (SMC) protocols, have been developed with the aim of offering privacy-preserving collaborative GWAS. Unfortunately, the computational overhead of these methods remain prohibitive for human-genome-scale data. Here we introduce SkSES (https://github.com/ndokmai/sgx-genome-variants-search), a hardware–software hybrid approach for privacy-preserving collaborative GWAS, which improves the running time of the most advanced cryptographic protocols by two orders of magnitude. The SkSES approach is based on trusted execution environments (TEEs) offered by current-generation microprocessors—in particular, Intel’s SGX. To overcome the severe memory limitation of the TEEs, SkSES employs novel ‘sketching’ algorithms that maintain essential statistical information on genomic variants in input VCF files. By additionally incorporating efficient data compression and population stratification reduction methods, SkSES identifies the top k genomic variants in a cohort quickly, accurately and in a privacy-preserving manner.

This is a preview of subscription content

Access options

Buy article

Get time limited or full article access on ReadCube.

$32.00

All prices are NET prices.

Fig. 1: Overview of the SkSES pipeline.
Fig. 2: The impact of increasing cohort size on the runtime of SkSES, with a fixed sketch size.
Fig. 3: The fraction of true top k significant SNPs (according to χ2 statistic) included in the query of top l SNPs returned by SkSES (accuracy) as a function of k.
Fig. 4: The impact of normalizing genotype matrix A and multiplying the sketching matrix (\(\hat{{\it{X}}}={\it{XT}}\)) for reducing the space needed for PCA on the left singular vectors \({\hat{{\rm{U}}}}_{L}\) as well as the ranks of Cochran–Armitage trend χ2 statistics across all unique SNPs in the iDASH2017-chr1 dataset.

Data availability

The VCF files from the original benchmarking dataset for the iDASH-2017 competition (iDASH2017-chr1) can be found at http://www.humangenomeprivacy.org/2017/competition-tasks.html. The VCF files consisting of whole-genome data (iDASH2017-wg) and the synthetic VCF files are available from the corresponding author upon request. The AMD dataset (dbGaP phs001039.v1.p1) from ref. 44 can be obtained via dbGaP authorized access.

Code availability

The source code for SkSES, under MIT license, is available for download at GitHub: https://github.com/ndokmai/sgx-genome-variants-search.

References

  1. Numanagić, I. et al. Comparison of high-throughput sequencing data compression tools. Nat. Methods 13, 1005 (2016).

    Article  Google Scholar 

  2. Alberti, C. et al. An introduction to MPEG-G, the new ISO standard for genomic information representation. Preprint at bioRxiv https://doi.org/10.1101/426353 (2018).

  3. Davies, R. GA4GH File Encryption Standard https://github.com/samtools/hts-specs/blob/master/crypt4gh.pdf (2017).

  4. Kelleher, J. et al. htsget: a protocol for securely streaming genomic data. Bioinformatics 35, 119–121 (2018).

    Article  Google Scholar 

  5. Hach, F., Numanagic, I. & Sahinalp, S. C. DeeZ: reference-based compression by local assembly. Nat. Methods 11, 1082 (2014).

    CAS  Article  Google Scholar 

  6. Anonymous. CRAM format specification (version 3.0) https://samtools.github.io/hts-specs/CRAMv3.pdf (2017).

  7. Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2014).

    Article  Google Scholar 

  8. Hach, F., Numanagić, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).

    CAS  Article  Google Scholar 

  9. Ginart, A. A. et al. Optimal compressed representation of high throughput sequence data via light assembly. Nat. Commun. 9, 566 (2018).

    Article  Google Scholar 

  10. Chandak, S., Tatwawadi, K. & Weissman, T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34, 558–567 (2017).

    Article  Google Scholar 

  11. Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71 (2013).

    CAS  Article  Google Scholar 

  12. Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462 (2014).

    CAS  Article  Google Scholar 

  13. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417 (2017).

    CAS  Article  Google Scholar 

  14. Flajolet, P. & Martin, G. N. Probabilistic counting. In 24th Annual Symposium on Foundations of Computer Science (ed. Snyder, L.) 76–82 (IEEE, 1983).

  15. Karp, R. M. On-line algorithms versus off-line algorithms: how much is it worth to know the future? IFIP Congress 1, 416–429 (1992).

  16. Zhang, Q., Pell, J., Canino-Koning, R., Howe, A. C. & Brown, C. T. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PloS ONE 9, e101271 (2014).

    Article  Google Scholar 

  17. Alon, N., Matias, Y. & Szegedy, M. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 137–147 (1999).

    Article  Google Scholar 

  18. Charikar, M., Chen, K. & Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming 693–703 (Springer, 2002).

  19. Cormode, G. & Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 58–75 (2005).

    Article  Google Scholar 

  20. McGuire, A. L. et al. Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genet. Med. 10, 495 (2008).

    Article  Google Scholar 

  21. Bloss, C. S. Does family always matter? Public genomes and their effect on relatives. Genome Med. 5, 107 (2013).

    Article  Google Scholar 

  22. Shringarpure, S. S. & Bustamante, C. D. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97, 631–646 (2015).

    CAS  Article  Google Scholar 

  23. Ayday, E., Raisaro, J. L., Hengartner, U., Molyneaux, A. & Hubaux, J.-P. Privacy-preserving processing of raw genomic data. In Data Privacy Management and Autonomous Spontaneous Security (eds García-Alfaro, J. et al.) 133–147 (Springer, 2014).

  24. He, D. et al. Identifying genetic relatives without compromising privacy. Genome Res. 24, 664–72 (2014).

    CAS  Article  Google Scholar 

  25. Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).

    CAS  Article  Google Scholar 

  26. McLaren, P. J. et al. Privacy-preserving genomic testing in the clinic: a model using HIV treatment. Genet. Med. 18, 814 (2016).

    CAS  Article  Google Scholar 

  27. Shimizu, K., Nuida, K. & Rätsch, G. Efficient privacy-preserving string search and an application in genomics. Bioinformatics 32, 1652–1661 (2016).

    CAS  Article  Google Scholar 

  28. Xie, W. et al. Securema: protecting participant privacy in genetic association meta-analysis. Bioinformatics 30, 3334–3341 (2014).

    CAS  Article  Google Scholar 

  29. Zhao, Y., Wang, X., Jiang, X., Ohno-Machado, L. & Tang, H. Choosing blindly but wisely: differentially private solicitation of DNA datasets for disease marker discovery. J. Am. Med. Inform. Assoc. 22, 100–108 (2014).

    Article  Google Scholar 

  30. Shahbazi, A., Bayatbabolghani, F. & Blanton, M. Private computation with genomic data for genome-wide association and linkage studies. In Proc. 3rd International Workshop Genome Privacy Security (2016); https://www.acsu.buffalo.edu/~mblanton/publications/genopri16.pdf.

  31. Chen, F. et al. Premix: privacy-preserving estimation of individual admixture. In AMIA Annual Symposium Proceedings Vol. 2016, 1747–1755 (American Medical Informatics Association, 2016).

  32. Lauter, K., López-Alt, A. & Naehrig, M. Private computation on encrypted genomic data. In International Conference on Cryptology and Information Security in Latin America (eds Aranha, D. F. & Menezes, A.) 3–27 (Springer, 2014).

  33. Wang, S. et al. Healer: homomorphic computation of exact logistic regression for secure rare disease variants analysis in GWAS. Bioinformatics 32, 211–218 (2015).

    PubMed  PubMed Central  Google Scholar 

  34. Zhang, Y., Blanton, M. & Almashaqbeh, G. Secure distributed genome analysis for GWAS & sequence comparison computation. BMC Med. Inform. Decis. Mak. 15, S4 (2015).

    Article  Google Scholar 

  35. Halevi, S. & Shoup, V. Algorithms in HElib. In International Cryptology Conference (Garay, J. A. & Gennaro, R.) 554–571 (Springer, 2014).

  36. Yao, A. C. Protocols for secure computations. In 23rd Annual Symposium on Foundations of Computer Science (ed. Pippenger, N.) 160–164 (IEEE, 1982).

  37. Wang, X., Chan, H. & Shi, E. Circuit ORAM: on tightness of the Goldreich–Ostrovsky lower bound. In Proc. of the 22nd ACM SIGSAC Conference on Computer and Communications Security (eds Ray, I., Li, N. & Kruegel, C.) 850–861 (ACM, 2015).

  38. Anati, I., Gueron, S., Johnson, S. P. & Scarlata, V. R. Innovative technology for CPU based attestation and sealing. https://software.intel.com/en-us/articles/innovative-technology-for-cpu-based-attestation-and-sealing (2013).

  39. Lewis, C. M. Genetic association studies: design, analysis and interpretation. Brief. Bioinformatics 3, 146–153 (2002).

    CAS  Article  Google Scholar 

  40. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).

    CAS  Article  Google Scholar 

  41. Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100 (2014).

    Article  Google Scholar 

  42. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

    CAS  Article  Google Scholar 

  43. Wang, X. et al. IDASH secure genome analysis competition 2017. BMC Med. Genomics 11, 85 (2018).

    Article  Google Scholar 

  44. Cho, H., Wu, D. J. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nat. Biotechnol. 36, 547 (2018).

    CAS  Article  Google Scholar 

  45. Celis, P. Robin Hood Hashing. PhD thesis, Univ. Waterloo (1986).

  46. Deng, F. & Rafiei, D. New estimation algorithms for streaming data: count-min can do more. http://webdocs.cs.ualberta.ca/~drafiei/papers/cmm.pdf (2007).

  47. Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386 (1955).

    Article  Google Scholar 

  48. Boutsidis, C., Woodruff, D. P. & Zhong, P. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (eds Wichs, D. & Mansour, Y.) 236–249 (ACM, 2016).

  49. Cohen, M. B., Elder, S., Musco, C., Musco, C. & Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (eds Servedio, R. A. & Rubinfeld, R.) 163–172 (ACM, 2015).

  50. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).

    Article  Google Scholar 

Download references

Acknowledgements

N.K. was partially supported by NSF CCF-1844234, NSF CCF-1525024 and IIS-1633215. M.O.K. was partially supported by TUBITAK grant 114E293. Part of the work was done while D.P.W. was visiting the Simons Institute for the Theory of Computing. S.C.S. was supported in part by NSF CCF-1619081, NIH GM108348, NIH HG010798 and the Indiana University Grand Challenges Program, Precision Health Initiative, before moving to his current post at NCI.

We thank S. Simmons and H. Cho from the Computer Science and Artificial Intelligence Laboratory at MIT (now at the Broad Institute) for useful discussions and their help in benchmarking SkSES on the AMD dataset44 against the SMC tool. We also thank L. Wang and D. Bu at Indiana University for providing the iDASH2017-wg dataset. We finally thank the Linux team and B. Shei from University Information Technology Services at Indiana University for useful instructions on software installation and preparation.

Author information

Affiliations

Authors

Contributions

C.K., N.D., M.O.K. and S.C.S. initially participated in the iDASH-2017 competition. K.Z., N.K. and S.C.S. formulated the problem with limited memory. K.Z., S.C.S. and D.P.W. further formulated the problem to correct for population stratification. C.K., K.Z. and N.D. implemented the proposed solution. C.K., K.Z., N.D., N.K. and S.C.S. co-wrote the manuscript. M.O.K., D.P.W. and S.C.S. supervised the study.

Corresponding author

Correspondence to S. Cenk Sahinalp.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Fig. 1 and Supplementary Notes 1–5

Reporting Summary

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kockan, C., Zhu, K., Dokmai, N. et al. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat Methods 17, 295–301 (2020). https://doi.org/10.1038/s41592-020-0761-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41592-020-0761-8

Further reading

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing