Interpreting cis-regulatory interactions from large-scale deep neural networks

From Nature Genetics

Abstract

The rise of large-scale, sequence-based deep neural networks (DNNs) for predicting gene expression has introduced challenges in their evaluation and interpretation. Current evaluations align DNN predictions with orthogonal experimental data, providing insights into generalization but offering limited insights into their decision-making process. Existing model explainability tools focus mainly on motif analysis, which becomes complex when interpreting longer sequences. Here we present cis-regulatory element model explanations (CREME), an in silico perturbation toolkit that interprets the rules of gene regulation learned by a genomic DNN. Applying CREME to Enformer, a state-of-the-art DNN, we identify cis-regulatory elements that enhance or silence gene expression and characterize their complex interactions. CREME can provide interpretations across multiple scales of genomic organization, from cis-regulatory elements to fine-mapped functional sequence elements within them, offering high-resolution insights into the regulatory architecture of the genome. CREME provides a powerful toolkit for translating the predictions of genomic DNNs into mechanistic insights of gene regulation.

Fig. 1: CREME overview and results for context perturbations in K562 cells using Enformer.
Fig. 2: CRE-level analysis in K562 using Enformer.
Fig. 3: Fine-tile search results for enhancing tiles in K562.
Fig. 4: TSS–CRE distance test schematic and results.
Fig. 5: Optimal CRE sets reveal complex interactions for K562 using Enformer.
Fig. 6: Investigation of CRE interactions.

Data availability

Final and intermediate results for paper reproducibility are available via Zenodo at https://doi.org/10.5281/zenodo.12584210 (ref. 75).

Code availability

Static code for reproducing the analyses in the manuscript is available via Zenodo at https://zenodo.org/records/12594513 (ref. 76). A bleeding-edge version of CREME is available via GitHub at https://github.com/p-koo/creme-nn and https://github.com/p-koo/CREME_paper_reproducibility. A stable version of CREME is installable via pip (PyPI at https://pypi.org/project/creme-nn/). Comprehensive documentation is provided on ReadTheDocs.org (API at https://creme-nn.readthedocs.io/en/latest/index.html and tutorials at https://creme-nn.readthedocs.io/en/latest/tutorials.html).

References

  1. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  2. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

  3. Karbalayghareh, A., Sahin, M. & Leslie, C. S. Chromatin interaction–aware gene regulatory modeling with graph attention networks. Genome Res. 32, 930–944 (2022).

  4. Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Preprint at bioRxiv https://doi.org/10.1101/2023.08.30.555582 (2023).

  5. Toneyan, S., Tang, Z. & Koo, P. K. Evaluating deep learning for predicting epigenomic profiles. Nat. Mach. Intell. 4, 1–13 (2022).

  6. Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 1–29 (2023).

  7. Kircher, M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019).

  8. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).

  9. Qi, L. S. et al. Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152, 1173–1183 (2013).

  10. Sasse, A. et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat. Genet. 55, 2060–2064 (2023).

  11. Huang, C. et al. Personal transcriptome variation is poorly explained by current genomic deep learning models. Nat. Genet. 55, 2056–2059 (2023).

  12. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. In Proc. of the International Conference on Learning Representations (ICLR, 2014).

  13. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, 4765–4774 (2017).

  14. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning 3145–3153 (2017).

  15. Koo, P. K., Majdandzic, A., Ploenzke, M., Anand, P. & Paul, S. B. Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol. 17, e1008925 (2021).

  16. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  17. Koo, P. K. & Ploenzke, M. Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat. Mach. Intell. 3, 258–266 (2021).

  18. Hammelman, J. & Gifford, D. K. Discovering differential genome sequence activity with interpretable and efficient deep learning. PLoS Comput. Biol. 17, e1009282 (2021).

  19. Liu, G., Zeng, H. & Gifford, D. K. Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinform. 20, 401 (2019).

  20. Greenside, P., Shimko, T., Fordyce, P. & Kundaje, A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics 34, i629–i637 (2018).

  21. Jha, A., Aicher, J. K., Gazzara, M. R., Singh, D. & Barash, Y. Enhanced integrated gradients: improving interpretability of deep learning models using splicing codes as a case study. Genome Biol. 21, 149 (2020).

  22. Linder, J. et al. Interpreting neural networks for biological sequences by learning stochastic masks. Nat. Mach. Intell. 4, 41–54 (2022).

  23. Seitz, E. E., McCandlish, D. M., Kinney, J. B. & Koo, P. K. Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models. Nat. Mach. Intell. 6, 701–713 (2024).

  24. Fulco, C. P. et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science 354, 769–773 (2016).

  25. Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).

  26. Frankish, A. et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).

  27. Lin, X. et al. Nested epistasis enhancer networks for robust genome regulation. Science 377, 1077–1085 (2022).

  28. Goel, V. Y., Huseyin, M. K. & Hansen, A. S. Region Capture Micro-C reveals coalescence of enhancers and promoters into nested microcompartments. Nat. Genet. 55, 1048–1056 (2023).

  29. Luthra, I. et al. Regulatory activity is the default DNA state in eukaryotes. Nat. Struct. Mol. Biol. 31, 559–567 (2024).

  30. Pang, B. & Snyder, M. P. Systematic identification of silencers in human cells. Nat. Genet. 52, 254–263 (2020).

  31. Stampfel, G. et al. Transcriptional regulators form diverse groups with context-dependent regulatory functions. Nature 528, 147–151 (2015).

  32. Kulkarni, M. M. & Arnosti, D. N. cis-regulatory logic of short-range transcriptional repression in Drosophila melanogaster. Mol. Cell. Biol. 25, 3411–3420 (2005).

  33. Doni Jayavelu, N., Jajodia, A., Mishra, A. & Hawkins, R. D. Candidate silencer elements for the human and mouse genomes. Nat. Commun. 11, 1061 (2020).

  34. Martinez-Ara, M., Comoglio, F., van Arensbergen, J. & van Steensel, B. Systematic analysis of intrinsic enhancer-promoter compatibility in the mouse genome. Mol. Cell 82, 2519–2531 (2022).

  35. Bergman, D. T. et al. Compatibility rules of human enhancer and promoter sequences. Nature 607, 176–184 (2022).

  36. Narita, T. et al. The logic of native enhancer-promoter compatibility and cell-type-specific gene expression variation. Preprint at bioRxiv https://doi.org/10.1101/2022.07.18.500456 (2022).

  37. Armendariz, D. A., Sundarrajan, A. & Hon, G. C. Breaking enhancers to gain insights into developmental defects. eLife 12, e88187 (2023).

  38. Catarino, R. R. & Stark, A. Assessing sufficiency and necessity of enhancer activities for gene expression and the mechanisms of transcription activation. Genes Dev. 32, 202–223 (2018).

  39. Luo, Y. et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 48, D882–D889 (2020).

  40. Igolkina, A. A. et al. H3K4me3, H3K9ac, H3K27ac, H3K27me3 and H3K9me3 histone tags suggest distinct regulatory evolution of open and condensed chromatin landmarks. Cells 8, 1034 (2019).

  41. Monaghan, L. et al. The emerging role of H3K9me3 as a potential therapeutic target in acute myeloid leukemia. Front. Oncol. 9, 705 (2019).

  42. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl Acad. Sci. USA 117, 9451–9457 (2020).

  43. Gao, T. & Qian, J. EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species. Nucleic Acids Res. 48, D58–D64 (2020).

  44. Zhang, Y., See, Y. X., Tergaonkar, V. & Fullwood, M. J. Long-distance repression by human silencers: chromatin interactions and phase separation in silencers. Cells 11, 1560 (2022).

  45. Jin, Y. et al. Targeting methyltransferase PRMT5 eliminates leukemia stem cells in chronic myelogenous leukemia. J. Clin. Invest. 126, 3961–3980 (2016).

  46. Griffin, G. K. et al. Epigenetic silencing by SETDB1 suppresses tumour intrinsic immunogenicity. Nature 595, 309–314 (2021).

  47. Garcia-Carpizo, V. et al. CREBBP/EP300 bromodomains are critical to sustain the GATA1/MYC regulatory axis in proliferation. Epigenetics Chromatin 11, 30 (2018).

  48. Del Gaudio, N. et al. BRD9 binds cell type-specific chromatin regions regulating leukemic cell survival via STAT5 inhibition. Cell Death Dis. 10, 338 (2019).

  49. Lazar, J. E. et al. Global regulatory DNA potentiation by SMARCA4 propagates to selective gene expression programs via domain-level remodeling. Cell Rep. 31, 107676 (2020).

  50. Benton, M. L., Talipineni, S. C., Kostka, D. & Capra, J. A. Genome-wide enhancer annotations differ significantly in genomic distribution, evolution, and function. BMC Genomics 20, 511 (2019).

  51. Grant, C. E. & Bailey, T. L. XSTREME: comprehensive motif analysis of biological sequence datasets. Preprint at bioRxiv https://doi.org/10.1101/2021.09.02.458722 (2021).

  52. Zuin, J. et al. Nonlinear control of transcription through enhancer–promoter interactions. Nature 604, 571–577 (2022).

  53. Zhan, Y. et al. Reciprocal insulation analysis of Hi-C data shows that TADs represent a functionally but not structurally privileged scale in the hierarchical folding of chromosomes. Genome Res. 27, 479–490 (2017).

  54. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).

  55. Choi, J. et al. Evidence for additive and synergistic action of mammalian enhancers during cell fate determination. eLife 10, e65381 (2021).

  56. Martinez-Ara, M., Comoglio, F. & van Steensel, B. Large-scale analysis of the integration of enhancer-enhancer signals by promoters. Preprint at bioRxiv https://doi.org/10.1101/2023.08.11.552995 (2023).

  57. Kvon, E. Z., Waymack, R., Gad, M. & Wunderlich, Z. Enhancer redundancy in development and disease. Nat. Rev. Genet. 22, 324–336 (2021).

  58. Frankel, N. et al. Phenotypic robustness conferred by apparently redundant transcriptional enhancers. Nature 466, 490–493 (2010).

  59. Osterwalder, M. et al. Enhancer redundancy provides phenotypic robustness in mammalian development. Nature 554, 239–243 (2018).

  60. Perry, M. W., Boettiger, A. N. & Levine, M. Multiple enhancers ensure precision of gap gene-expression patterns in the Drosophila embryo. Proc. Natl Acad. Sci. USA 108, 13570–13575 (2011).

  61. Hong, C. K. Y. & Cohen, B. A. Genomic environments scale the activities of diverse core promoters. Genome Res. 32, 85–96 (2022).

  62. Zhou, J. L., Guruvayurappan, K., Chen, H. V., Chen, A. R. & McVicker, G. P. Genome-wide analysis of CRISPR perturbations indicates that enhancers act multiplicatively and without epistatic-like interactions. Preprint at bioRxiv https://doi.org/10.1101/2023.04.26.538501 (2023).

  63. Sanford, E. M., Emert, B. L., Coté, A. & Raj, A. Gene regulation gravitates toward either addition or multiplication when combining the effects of two signals. eLife 9, e59388 (2020).

  64. Crocker, J., Ilsley, G. R. & Stern, D. L. Quantitatively predictable control of Drosophila transcriptional enhancers in vivo with engineered transcription factors. Nat. Genet. 48, 292–298 (2016).

  65. Melen, G. J., Levy, S., Barkai, N. & Shilo, B.-Z. Threshold responses to morphogen gradients by zero-order ultrasensitivity. Mol. Syst. Biol. 1, 2005–0028 (2005).

  66. Burz, D. S., Rivera-Pomar, R., Jäckle, H. & Hanes, S. D. Cooperative DNA-binding by Bicoid provides a mechanism for threshold-dependent gene activation in the Drosophila embryo. EMBO J. 17, 5998–6009 (1998).

  67. Doughty, B. R. et al. Single-molecule chromatin configurations link transcription factor binding to expression in human cells. Preprint at bioRxiv https://doi.org/10.1101/2024.02.02.578660 (2024).

  68. Bothma, J. P. et al. Enhancer additivity and non-additivity are determined by enhancer strength in the Drosophila embryo. eLife 4, e07956 (2015).

  69. Scholes, C., Biette, K. M., Harden, T. T. & DePace, A. H. Signal integration by shadow enhancers and enhancer duplications varies across the Drosophila embryo. Cell Rep. 26, 2407–2418 (2019).

  70. Ovadia, Y. et al. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Adv. Neural Inf. Process. Syst. https://papers.nips.cc/paper_files/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf (2019).

  71. Vaswani, A. et al. Attention is all you need. In Adv. Neural Inf. Process. Syst. https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017).

  72. Chen, P. B. et al. Systematic discovery and functional dissection of enhancers needed for cancer cell fitness and proliferation. Cell Rep. 41, 111630 (2022).

  73. Crocker, J. et al. Low affinity binding site clusters confer Hox specificity and regulatory robustness. Cell 160, 191–203 (2015).

  74. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).

  75. Toneyan, S. & Koo, P. Creme-nn data and results. Zenodo https://doi.org/10.5281/zenodo.12584210 (2024).

  76. Toneyan, S. & Koo, P. Creme-nn code. Zenodo https://zenodo.org/records/12594513 (2023).

Acknowledgements

We thank S. Navlakha, J. Desmarais, J. Kinney and members of the Koo Lab for helpful comments on the manuscript. Research reported in this publication was supported in part by the National Human Genome Research Institute of the National Institutes of Health under award number R01HG012131 (P.K.K.), the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM149921 (S.T. and P.K.K.) and the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. This work was performed with assistance from the US National Institutes of Health Grant S10OD028632-01. We also thank the NVIDIA GPU Grant Program for support.

Author information

Contributions

S.T. and P.K.K. conceived of the method and designed the experiments. S.T. developed code, ran the experiments and analyzed the results. S.T. and P.K.K. interpreted the results and contributed to writing the paper.

Corresponding author

Correspondence to Peter K. Koo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Results of the Context Dependence Test and Context Swap Test for GM12878 and PC-3.

a,b, Histogram of normalized context effect from the Context Dependence Test for 10,000 sequences that contain an active, annotated gene in GM12878 and PC-3 cells. Insets show the subset of sequences for enhancing, silencing and neutral contexts. The inset in a contains 200, 78 and 183 data points in enhancing, silencing and neutral contexts, respectively. The inset in b contains 200, 90 and 110 data points in enhancing, silencing and neutral contexts, respectively. c, Pairwise comparison of normalized context effects between cell lines for matched genes. The number of data points is 7,688, 6,946 and 7,492 from left to right. d,e, Context Swap Test results. Boxplots of normalized context effect on TSS for sequences with context perturbations given by insertion of the original TSS in different context categories. Results are organized according to the original TSS category: enhancing (left), neutral (middle) and silencing (right). The number of data points in each boxplot represents an all-vs-all comparison of each respective TSS in each possible context. The number of data points in d is 40,000, 36,600 and 15,600 in boxplots for TSS from enhancing context; 36,600, 33,489 and 14,274 for TSS from neutral context; and 15,600, 14,274 and 6,084 for TSS from silencing context. The number of data points in e is 40,000, 22,000 and 18,000 in boxplots for TSS from enhancing context; 22,000, 12,100 and 9,900 for TSS from neutral context; and 18,000, 9,900 and 8,100 for TSS from silencing context. Boxplots show the first and third quartiles, the median (central line) and the range of data with outliers removed (whiskers).
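The boxplot sizes listed for d and e are consistent with an all-vs-all pairing, in which each count is the product of two category sizes from the insets. The short Python check below reproduces that arithmetic. It assumes that a and d refer to GM12878 and b and e to PC-3, which the caption implies but does not state outright, and the function name is purely illustrative.

```python
from itertools import product

def all_vs_all_counts(category_sizes):
    """Map (tss_category, context_category) to the number of TSS-context pairs."""
    return {
        (tss, ctx): category_sizes[tss] * category_sizes[ctx]
        for tss, ctx in product(category_sizes, repeat=2)
    }

# Category sizes taken from the inset counts in the caption
# (assignment of panels to cell lines is an assumption).
gm12878_sizes = {"enhancing": 200, "neutral": 183, "silencing": 78}
pc3_sizes = {"enhancing": 200, "neutral": 110, "silencing": 90}

gm12878_counts = all_vs_all_counts(gm12878_sizes)
pc3_counts = all_vs_all_counts(pc3_sizes)

for (tss, ctx), n in gm12878_counts.items():
    print(f"GM12878 TSS from {tss} context placed in {ctx} context: {n}")
```

Every product matches a count in the caption, for example 183 × 78 = 14,274 for GM12878 and 110 × 90 = 9,900 for PC-3.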

Extended Data Fig. 2 Borzoi Context Dependence Test results.

a, Scatter plot comparing the wild-type activity predicted by Enformer versus Borzoi for the matched cell types and for matched genes. b, Histogram of normalized context effect for the 10,000 highest activity, annotated genes (according to Borzoi’s predictions) for K562, GM12878 and PC-3 cells. Inset shows the subset of sequences for enhancing, silencing and neutral contexts. The number of data points is shown in inset legend.

Extended Data Fig. 3 CRE effects on TSS activity in GM12878 and PC-3 cell lines.

a,b, Boxen plot of the normalized shuffle effect for each tile in sequences from enhancing, neutral and silencing context categories (Necessity Test) for GM12878 (a) and PC-3 (b). The number of data points in a is 7,600, 6,954 and 2,964 and in b is 7,600, 4,180 and 3,420 in enhancing, neutral and silencing contexts, respectively. c,d, Boxen plot of tile effects for each tile in sequences from enhancing, neutral and silencing context categories (Sufficiency Test) for GM12878 (c) and PC-3 (d). Normalization is with predicted TSS activity for wild-type (enhancing context) and control, that is, the intrinsic TSS activity (neutral and silencing context). Boxen plots have the same number of data points as in a and b. In panels a–d, center lines of boxen plots show the median and boxes in both directions always indicate half of the remaining data. e, Scatter plot between the results from the Necessity Test (y-axis) versus the results from the Sufficiency Test (x-axis) in K562 cell line (N = 7,600 in each plot corresponding to 200 sequences with 38 tiles in each).
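As a rough illustration of the kind of tile-shuffling perturbation a Necessity Test involves, the Python sketch below shuffles each non-TSS tile in turn and reports the change in predicted activity. It is not CREME's API. The `predict` callable, the tile size, the number of shuffles and the normalization (mutant minus wild type, divided by wild type) are all assumptions made for illustration, and the predictor is a toy function rather than Enformer.

```python
import numpy as np

def shuffle_tile(seq, start, end, rng):
    """Return a copy of seq with positions [start, end) randomly permuted."""
    out = seq.copy()
    out[start:end] = rng.permutation(out[start:end])
    return out

def necessity_effects(seq, predict, tile_size, tss_tile, n_shuffles=10, seed=0):
    """Shuffle each non-TSS tile and return (mean mutant - WT) / WT per tile.

    The normalization is an assumed, illustrative choice.
    """
    rng = np.random.default_rng(seed)
    wt = predict(seq)
    effects = {}
    for tile in range(len(seq) // tile_size):
        if tile == tss_tile:
            continue  # the TSS tile itself is left intact
        start, end = tile * tile_size, (tile + 1) * tile_size
        preds = [predict(shuffle_tile(seq, start, end, rng)) for _ in range(n_shuffles)]
        effects[tile] = (np.mean(preds) - wt) / wt
    return effects

def toy_predict(seq):
    """Hypothetical stand-in for a genomic DNN: 1 + number of 'AC' dinucleotides."""
    return 1.0 + float(np.sum((seq[:-1] == 0) & (seq[1:] == 1)))

# Three tiles of 20 bases encoded as integers (A=0, C=1, G=2):
# tile 0 is an "ACAC..." repeat, tile 1 plays the TSS tile, tile 2 is all G.
seq = np.array([0, 1] * 10 + [2] * 40)
effects = necessity_effects(seq, toy_predict, tile_size=20, tss_tile=1)
print(effects)
```

In this toy case, shuffling the all-G tile leaves the prediction unchanged, whereas shuffling the "AC" repeat lowers it, so only tile 0 would be flagged as contributing to activity.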

Extended Data Fig. 4 Characterization of sufficient CREs in GM12878 and PC-3.

a, Histogram of the distance of CRE tiles from the TSS for sufficient enhancers and silencers in GM12878 and PC-3. b–d, Boxplots of mean DNase-seq coverage (b), mean ATAC-seq coverage (c) and mean histone mark coverage (d) of sufficient enhancer and silencer tiles in various cell types. The number of points in green and red boxes is 76 and 222 in K562, 41 and 57 for GM12878 and 35 and 97 for PC-3. Significance is given by the two-sided Mann-Whitney U test (*: p < 0.05; **: p < 0.01; ***: p < 0.001; ****: p < 0.0001). Boxplots show the first and third quartiles, the median (central line) and the range of data with outliers removed (whiskers).

Extended Data Fig. 5 TSS-CRE Distance Test results across cell lines.

a,b, Average plot of the fold change over max versus distance to TSS for GM12878 (a) and PC-3 (b). Max represents the maximum TSS activity across all embedded positions within each sequence using Enformer. c,d, Plot of the tile sufficiency versus distance to TSS for GM12878 (c) and PC-3 (d), respectively. Tile sufficiency is calculated according to the predicted TSS activity with a TSS-CRE pair at a given distance minus the control sequence (shuffled context with just the TSS) divided by the WT sequence for enhancers and by the control sequence for silencers. In panels a–d, shaded regions represent standard deviation of the mean.
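The tile sufficiency described in this caption can be written as a one-line calculation. The Python sketch below follows the wording of the caption directly. The function and argument names are illustrative, and the input values in the example are made-up numbers rather than results from the paper.

```python
def tile_sufficiency(pair_activity, control_activity, wt_activity, cre_type):
    """Tile sufficiency as worded in the caption.

    (TSS-CRE pair activity - control activity), divided by the wild-type
    activity for enhancers and by the control activity for silencers.
    The control is the shuffled context containing only the TSS.
    """
    if cre_type == "enhancer":
        denominator = wt_activity
    elif cre_type == "silencer":
        denominator = control_activity
    else:
        raise ValueError("cre_type must be 'enhancer' or 'silencer'")
    return (pair_activity - control_activity) / denominator

# Made-up activities, for illustration only.
enhancer_score = tile_sufficiency(8.0, 2.0, 10.0, "enhancer")
silencer_score = tile_sufficiency(1.0, 4.0, 10.0, "silencer")
print(enhancer_score, silencer_score)
```

With these inputs the enhancer recovers 0.6 of the wild-type activity above the control, and the silencer removes 0.75 of the control activity, so positive values indicate enhancing tiles and negative values indicate silencing tiles.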

Extended Data Fig. 6 Example sequences showing individual tile effect sizes from the Higher-Order Interaction Test results.

a–i, The left panels show results of the greedy search (green) and the additive model (orange) for a particular gene; the right panels show the independent tile effect size (calculated from the first iteration) sorted according to greedy search tile order. a–c show example sequences classified as superadditivity; d–f show sequences classified as subadditivity; g–i show example sequences classified as additivity.

Extended Data Fig. 7 Optimal CRE sets reveal complex interactions in GM12878 and PC-3.

a, b, Average plot of the greedy search results for enhancer tile sets (a) and silencer tile sets (b) for sequences from different context categories for various cell lines. The fold change over wild-type (WT) is the predicted TSS activity of the shuffled CRE tiles in each round of the greedy search (indicated by the number of tiles). c, d, Sufficiency of the tile sets identified in each round of greedy search. Average fold change over wild-type (c) and control (d), which represents shuffled sequences with just the TSS tile. Sufficiency places the tile sets along with the TSS tile into shuffled sequences, averaging over 10 total shuffles. Shaded region represents the standard deviation of the mean.
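The greedy search referred to in this caption can be sketched as follows, assuming that each round adds the single tile whose shuffling, together with the tiles already chosen, lowers the predicted TSS activity the most (the enhancer case). The `predict` callable and the toy additive model are hypothetical stand-ins for a genomic DNN and are not part of CREME.

```python
def greedy_enhancer_search(predict, tiles, n_rounds):
    """Greedily choose tiles whose shuffling lowers predicted TSS activity most.

    predict(shuffled) takes a frozenset of shuffled tile indices and returns
    the predicted TSS activity. Returns the chosen tiles in order and the
    fold change over wild type after each round.
    """
    wt = predict(frozenset())
    chosen, fold_changes = [], []
    remaining = set(tiles)
    for _ in range(n_rounds):
        # Evaluate each candidate on top of the tiles already shuffled.
        best = min(remaining, key=lambda t: predict(frozenset(chosen + [t])))
        chosen.append(best)
        remaining.remove(best)
        fold_changes.append(predict(frozenset(chosen)) / wt)
    return chosen, fold_changes

# Toy model (an assumption): baseline TSS activity plus a fixed contribution
# from every tile that has not been shuffled.
contributions = {0: 5.0, 1: 1.0, 2: 3.0}

def toy_predict(shuffled):
    return 1.0 + sum(v for t, v in contributions.items() if t not in shuffled)

order, fold_changes = greedy_enhancer_search(toy_predict, contributions, n_rounds=2)
print(order, fold_changes)
```

For silencer sets the same loop would presumably use `max` in place of `min`, since shuffling a silencer is expected to raise predicted activity.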

Extended Data Fig. 8 Comparison of enhancer sets identified by the Higher-Order Interaction Test and a hypothetical additive model for GM12878 and PC-3.

a,b, Comparison of the average fold change over wild-type (WT) for enhancer sets for sequences categorized as enhancing context versus a hypothetical additive effects model. The sequences from enhancing contexts are stratified according to interaction type: superadditivity, subadditivity and additivity. Sequences were classified using mean-squared-error-based thresholds of 0.1 for superadditivity and subadditivity and 0.05 for additivity (with some ambiguous cases left out of classification). Shaded region represents standard deviation of the mean. c,e, Comparison of hypothetical additive model and hypothetical multiplicative model versus greedy search outcomes at iteration 2 of the Higher-Order Interaction Test. The number of points in each box is 69, 38 and 60 in GM12878 and 93, 37 and 36 in PC-3 for additive, superadditivity and subadditivity cases. Note that some ambiguous cases were left out of the classification if they were outside of the selected thresholds. Statistical significance was given according to the two-sided Mann-Whitney U test (*: p < 0.05; **: p < 0.01; ***: p < 0.001; ****: p < 0.0001). Boxplots show the first and third quartiles, the median (central line) and the range of data with outliers removed (whiskers). d,f, Greedy search versus hypothetical additive or multiplicative models. Scatter plots show a more detailed view of the data in c,e, with the x-axis showing the Higher-Order Interaction Test outcomes and the y-axis showing the hypothetical model outputs (additive or multiplicative).
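One way to picture the hypothetical additive and multiplicative models, and the threshold-based classification, is the Python sketch below. The formulas are common textbook definitions applied to fold changes over wild type and are assumptions, not necessarily the exact definitions used in the paper. The thresholds of 0.05 and 0.1 come from the caption, whereas the rule for separating superadditivity from subadditivity by the sign of the mean difference is an illustrative guess.

```python
import numpy as np

def expected_fold_changes(single_tile_fold_changes):
    """Expected combined fold change over WT under two hypothetical models.

    additive:       individual losses (1 - fc) are summed
    multiplicative: individual fold changes are multiplied
    """
    fc = np.asarray(single_tile_fold_changes, dtype=float)
    additive = 1.0 - np.sum(1.0 - fc)
    multiplicative = float(np.prod(fc))
    return additive, multiplicative

def classify_interaction(observed, additive, low=0.05, high=0.1):
    """Classify a greedy-search curve against the additive curve by MSE.

    Assumed direction: for enhancers, an observed curve below the additive
    one (a larger combined loss) is called superadditive.
    """
    observed = np.asarray(observed, dtype=float)
    additive = np.asarray(additive, dtype=float)
    diff = observed - additive
    mse = np.mean(diff ** 2)
    if mse <= low:
        return "additive"
    if mse >= high:
        return "superadditive" if np.mean(diff) < 0 else "subadditive"
    return "ambiguous"  # between thresholds, left out of classification

additive_fc, multiplicative_fc = expected_fold_changes([0.8, 0.5])
label = classify_interaction([0.1, 0.0], [0.5, 0.4])
print(additive_fc, multiplicative_fc, label)
```

Curves whose error falls between the two thresholds are returned as ambiguous, mirroring the caption's note that some cases were left out of the classification.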

Extended Data Fig. 9 Comparison of silencer sets identified by the Higher-Order Interaction Test and a hypothetical additive model for K562, GM12878 and PC-3.

a–c, Comparison of the average fold change over wild-type (WT) for silencer sets for sequences categorized as silencing context versus a hypothetical additive effects model for K562 (a), GM12878 (b) and PC-3 (c). The sequences from silencing contexts are stratified according to interaction type: superadditivity and additivity. Shaded region represents standard deviation of the mean. Notably, we did not identify any subadditivity cases.

Extended Data Fig. 10 Saturation behavior of TSS activity predictions by Enformer in various cell lines.

a–c, The results from a CRE Multiplicity Test applied to sequences from enhancing context (left) and silencing context (right). Each line represents a particular enhancer or silencer CRE embedded into shuffled sequences at optimal positions (according to a Greedy Search) versus the copy number of the CRE in the sequence. The number of enhancers in each plot in a–c is 200; the number of silencers is 200, 78 and 90 in a–c, respectively. The normalized TSS effect represents the predicted TSS activity of the mutated sequence divided by the control, which is the shuffled sequence with the TSS tile and the CRE in their original positions. The average across all CREs is shown with a thicker line and the shaded region represents the standard deviation of the mean.

Supplementary information

Supplementary Information

Supplementary Tables 1–4, Figs. 1–10 and Note 1.

Reporting Summary

Peer Review File

Supplementary Data 1

Supplementary Data Tables 1–4.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Toneyan, S., Koo, P.K. Interpreting cis-regulatory interactions from large-scale deep neural networks. Nat Genet (2024). https://doi.org/10.1038/s41588-024-01923-3

  • DOI: https://doi.org/10.1038/s41588-024-01923-3

  • Springer Nature America, Inc.
