Abstract
Convolutional neural networks (CNNs) have become a standard for analysis of biological sequences. Tuning of network architectures is essential for a CNN’s performance, yet it requires substantial knowledge of machine learning and commitment of time and effort. This process thus imposes a major barrier to broad and effective application of modern deep learning in genomics. Here we present Automated Modelling for Biological Evidence-based Research (AMBER), a fully automated framework to efficiently design and apply CNNs for genomic sequences. AMBER designs optimal models for user-specified biological questions through state-of-the-art neural architecture search (NAS). We applied AMBER to the task of modelling genomic regulatory features and demonstrated that the predictions of the AMBER-designed model are significantly more accurate than the equivalent baseline non-NAS models and match or even exceed published expert-designed models. Interpretation of the AMBER architecture search revealed its design principles of utilizing the full space of computational operations for accurately modelling genomic sequences. Furthermore, we illustrated the use of AMBER to accurately discover functional genomic variants in allele-specific binding and disease heritability enrichment. AMBER provides an efficient automated method for designing accurate deep learning models in genomics.
Data availability
All data used in this study are publicly available and the URLs are provided in the corresponding sections in Methods. Training data for the genomic regulatory features were downloaded from http://deepsea.princeton.edu/help/ as described in ref. 4. The ground-truth data for allele-specific binding analysis were obtained from the supplementary data of ref. 29. The UK Biobank GWAS summary statistics data are reported in ref. 40 and were downloaded from https://alkesgroup.broadinstitute.org/UKBB/.
Code availability
The AMBER package is available on GitHub at https://github.com/zj-zhang/AMBER; the analysis presented in this study is available on GitHub at https://github.com/zj-zhang/AMBER-Seq. The AMBER code is also publicly available on Zenodo at https://zenodo.org/record/4384777 (ref. 47).
References
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. https://doi.org/10.1038/s41576-019-0122-6 (2019).
Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
LeCun, Y. & Bengio, Y. in The Handbook of Brain Theory and Neural Networks (ed. Arbib, M. A.) 3361(10) (MIT Press, 1995).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).
Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Ma, J. et al. Using deep learning to model the hierarchical structure and function of a cell. Nat. Methods 15, 290–298 (2018).
Zhang, Z. et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat. Methods 16, 307–310 (2019).
Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).
Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 18, 67 (2017).
Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations 1–14 (ICLR, 2014).
Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proc. 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 1800–1807 (IEEE, 2017).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Zoph, B. & Le, Q. V. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations (ICLR, 2017).
Pham, H., Guan, M. Y., Zoph, B., Le, Q. V. & Dean, J. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning 4095–4104 (PMLR, 2018).
Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods 16, 315–318 (2019).
Real, E., Aggarwal, A., Huang, Y. & Le, Q. V. Regularized evolution for image classifier architecture search. In Proc. AAAI Conference on Artificial Intelligence Vol. 33, 4780–4789 (2019).
Liu, H., Simonyan, K. & Yang, Y. DARTS: differentiable architecture search. In International Conference on Learning Representations (ICLR, 2019).
He, X., Zhao, K. & Chu, X. AutoML: a survey of the state-of-the-art. Knowl. Based Syst. 212, 106622 (2021).
Lee, H., Grosse, R., Ranganath, R. & Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. 26th International Conference On Machine Learning, ICML 2009 609–616 (ACM, 2009); https://doi.org/10.1145/1553374.1553453
Zoph, B., Vasudevan, V., Shlens, J. & Le, Q. V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8697–8710 (IEEE, 2018).
Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. In 4th International Conference on Learning Representations (ICLR, 2016).
Wagih, O., Merico, D., Delong, A. & Frey, B. J. Allele-specific transcription factor binding as a benchmark for assessing variant impact predictors. Preprint at bioRxiv https://doi.org/10.1101/253427 (2018).
Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).
Bryne, J. C. et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 36, D102–D106 (2008).
Machanick, P. & Bailey, T. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27, 1696–1697 (2011).
Zhang, P. et al. Negative cross-talk between hematopoietic regulators: GATA proteins repress PU.1. Proc. Natl Acad. Sci. USA 96, 8705–8710 (1999).
Metcalf, D. et al. Inactivation of PU.1 in adult mice leads to the development of myeloid leukemia. Proc. Natl Acad. Sci. USA 103, 1486–1491 (2006).
Wang, F. & Tong, Q. Transcription factor PU.1 is expressed in white adipose and inhibits adipocyte differentiation. Am. J. Physiol. Cell Physiol. 295, C213–C220 (2008).
Lin, L. et al. Adipocyte expression of PU.1 transcription factor causes insulin resistance through upregulation of inflammatory cytokine gene expression and ROS production. Am. J. Physiol. Endocrinol. Metab. 302, E1550 (2012).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Loh, P. R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
Zhang, Z., Zhou, L., Gou, L. & Wu, Y. N. Neural architecture search for joint optimization of predictive power and biological knowledge. Preprint at https://arxiv.org/abs/1909.00337 (2019).
Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992).
Machiela, M. & Chanock, S. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics 31, 3555–3557 (2015).
Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Dandine-Roulland, C. & Perdry, H. Genome-wide data manipulation, association analysis and heritability estimates in R with Gaston 1.5. In 46th European Mathematical Genetics Meeting (EMGM, 2018).
Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Zhang, Z. Code for ‘An automated framework for efficiently designing deep convolutional neural networks in genomics’. Zenodo https://doi.org/10.5281/ZENODO.4384777 (2020).
Acknowledgements
We thank all members of the Troyanskaya laboratory for helpful discussions. The work in this paper was performed using the high-performance computing resources of the Simons Foundation. O.G.T. is a CIFAR fellow.
Author information
Contributions
Z.Z. and O.G.T. conceived the study. Z.Z. implemented the experiments. C.Y.P. and C.L.T. contributed research materials and analytic tools. Z.Z. and O.G.T. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–6.
Supplementary Data
Supplementary Tables 1–3.
About this article
Cite this article
Zhang, Z., Park, C.Y., Theesfeld, C.L. et al. An automated framework for efficiently designing deep convolutional neural networks in genomics. Nat Mach Intell 3, 392–400 (2021). https://doi.org/10.1038/s42256-021-00316-z
DOI: https://doi.org/10.1038/s42256-021-00316-z