Most sequenced genomes are currently stored in strict access-controlled repositories1,2,3. Free access to these data could improve the power of genome-wide association studies (GWAS) to identify disease-causing genetic variants and aid the discovery of new drug targets4,5. However, concerns over genetic data privacy6,7,8,9 may deter individuals from contributing their genomes to scientific studies10 and could prevent researchers from sharing data with the scientific community11. Although cryptographic techniques for secure data analysis exist12,13,14, none scales to computationally intensive analyses, such as GWAS. Here we describe a protocol for large-scale genome-wide analysis that facilitates quality control and population stratification correction in 9K, 13K, and 23K individuals while maintaining the confidentiality of underlying genotypes and phenotypes. We show the protocol could feasibly scale to a million individuals. This approach may help to make currently restricted data available to the scientific community and could potentially enable secure genome crowdsourcing, allowing individuals to contribute their genomes to a study without compromising their privacy.
H.C. and B.B. are partially supported by the US National Institutes of Health GM108348 (to B.B.). H.C. is also partially supported by Kwanjeong Educational Foundation. D.J.W. is supported by fellowships from the Simons and National Science Foundations.
Author information
H.C., D.J.W., and B.B. developed the methods. H.C. implemented the software and performed experiments with assistance from D.J.W. and B.B. B.B. supervised the project. All authors wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Figure 1 Our secure GWAS protocol obtains accurate association statistics.
Using our protocol, we securely performed GWAS on three published case-control data sets for lung cancer (n = 9,098 after quality control), bladder cancer (n = 10,678), and age-related macular degeneration (AMD; n = 20,679). All of the tested SNPs passing quality control are shown in the figure: 378,492 loci for lung cancer, 389,868 loci for bladder cancer, and 221,295 loci for AMD. Our securely computed Cochran-Armitage trend test p-values (one-sided) accurately matched the ground truth we obtained based on plaintext data.
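For readers unfamiliar with the statistic being compared, the following is a minimal plaintext sketch of a Cochran-Armitage trend test for a single SNP. It is not the secure protocol and is not taken from the supplementary code. The function name `trend_test`, the additive weights (0, 1, 2), and the choice to report the upper-tail p-value of a chi-square distribution with one degree of freedom are assumptions made for illustration.

```python
import math


def trend_test(cases, controls, weights=(0, 1, 2)):
    """Plaintext Cochran-Armitage trend test for one SNP (illustrative only).

    cases / controls: counts of individuals carrying 0, 1, or 2 minor alleles.
    weights: per-genotype scores; (0, 1, 2) is an assumed additive model.
    Returns (chi2, p_value), with p_value taken as the upper tail of a
    chi-square distribution with 1 degree of freedom (an assumption).
    """
    r_total = sum(cases)
    s_total = sum(controls)
    n_total = r_total + s_total
    if n_total == 0:
        raise ValueError("empty contingency table")

    # Column totals: number of individuals with each genotype.
    cols = [r + s for r, s in zip(cases, controls)]

    # Trend statistic T = sum_i w_i * (r_i * S - s_i * R).
    t_stat = sum(
        w * (r * s_total - s * r_total)
        for w, r, s in zip(weights, cases, controls)
    )

    # Variance of T under the null hypothesis of no association.
    spread = sum(w * w * n * (n_total - n) for w, n in zip(weights, cols))
    cross = sum(
        weights[i] * weights[j] * cols[i] * cols[j]
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
    )
    variance = (r_total * s_total / n_total) * (spread - 2 * cross)
    if variance <= 0:
        raise ValueError("degenerate table: variance is zero")

    chi2 = t_stat ** 2 / variance
    # Upper tail of chi-square with 1 df: P(|Z| > sqrt(chi2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Calling `trend_test((10, 20, 30), (30, 20, 10))` on this made-up table returns the statistic together with its p-value. A plaintext reference of this kind is what securely computed values would be checked against.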
Supplementary Figure 2 Our secure GWAS protocol accurately estimates the effect size of associated SNPs via logistic regression.
We implemented logistic regression in our secure computation framework and applied it to a subset of 100 SNPs (randomly chosen among the top 1,000 associations) in the lung cancer data set (n = 9,098 after quality control). The odds ratio of a SNP is given by the exponential function evaluated at the estimated weight associated with the SNP's minor allele dosage in a logistic regression model. Analogous to our main GWAS protocol, we included 10 additional phenotypes (e.g., age group) and five principal components securely obtained by our GWAS protocol as covariates in the model. As shown in the scatter plot, the odds ratios securely obtained by our protocol accurately matched those computed based on a plaintext implementation of logistic regression, the latter of which also used a plaintext PCA algorithm to obtain the top principal components. Logistic regression on 100 SNPs completed in about a day using our experimental setup. Although performing logistic regression genome-wide is still prohibitively expensive, our method enables a heuristic two-step approach where the odds ratios are computed for only the SNPs passing a certain significance threshold in our main GWAS protocol. Note that our logistic regression pipeline provides the same security guarantees as our main GWAS protocol; namely, no information about the underlying genotypes and phenotypes is revealed during the process other than the final output.
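The relationship between the fitted weight and the odds ratio can be illustrated with a small plaintext sketch. It fits a logistic model with an intercept and a single predictor (minor allele dosage) by Newton-Raphson and exponentiates the dosage weight. It omits the covariates and principal components described above, is not secure, and is not the authors' implementation. The names `fit_logistic` and `odds_ratio` and the fixed iteration count are assumptions made for illustration.

```python
import math


def fit_logistic(dosages, labels, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * dosage by Newton-Raphson.

    Illustrative plaintext sketch with no covariates; returns (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of the negated Hessian
        for x, y in zip(dosages, labels):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("singular Hessian: dosage does not vary")
        # Newton step: solve the 2x2 system with the explicit inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def odds_ratio(dosages, labels):
    """Odds ratio per additional minor allele: exp of the dosage weight."""
    _, b1 = fit_logistic(dosages, labels)
    return math.exp(b1)
```

On a made-up data set in which dosage takes only the values 0 and 1, the fitted odds ratio coincides with the odds ratio of the corresponding 2×2 table, which gives a simple sanity check for the sketch.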
Supplementary Text and Figures
Supplementary Figures 1–2 (PDF 264 kb)
Supplementary Tables
Supplementary tables 1–3 (PDF 199 kb)
Supplementary Notes
Supplementary notes 1–12 (PDF 1959 kb)
Supplementary Code
An implementation of our secure GWAS protocol in C++. (ZIP 372 kb)
