Article
Published: 23 May 2024

Accurate and robust protein sequence design with CarbonDesign

Milong Ren^1,2,
Chungong Yu^1,2,3,
Dongbo Bu ORCID: orcid.org/0000-0003-4119-4238^1,2,3 &
…
Haicang Zhang ORCID: orcid.org/0000-0001-6268-4258^1,2,3

Nature Machine Intelligence volume 6, pages 536–547 (2024)

A preprint version of the article is available at bioRxiv.

Abstract

Protein sequence design is critically important for protein engineering. Despite recent advancements in deep learning-based methods, achieving accurate and robust sequence design remains a challenge. Here we present CarbonDesign, an approach that draws inspiration from successful ingredients of AlphaFold and which has been developed specifically for protein sequence design. At its core, CarbonDesign introduces Inverseformer, which learns representations from backbone structures, and an amortized Markov random field model for sequence decoding. Moreover, we incorporate other essential AlphaFold concepts into CarbonDesign: an end-to-end network recycling technique to leverage evolutionary constraints from protein language models and a multitask learning technique for generating side-chain structures alongside designed sequences. CarbonDesign outperforms other methods on independent test sets including the 15th Critical Assessment of protein Structure Prediction (CASP15) dataset, the Continuous Automated Model Evaluation (CAMEO) dataset and de novo proteins from RFdiffusion. Furthermore, it supports zero-shot prediction of the functional effects of sequence variants, making it a promising tool for applications in bioengineering.
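
To make the decoding idea concrete, the following is a minimal sketch of how a pairwise Markov random field over residues can score a sequence and rank a single-site variant without any task-specific training. It is not CarbonDesign's implementation: the toy two-letter alphabet, the values in `fields` and `couplings`, and the helper names `log_potential` and `variant_score` are all hypothetical, and in a real model the parameters would be predicted by the network from the backbone structure.

```python
from itertools import combinations

# Hypothetical single-site log-potentials: fields[i][a] scores residue a at position i.
# A toy two-letter alphabet ("A", "G") is assumed; real models use 20 amino acids.
fields = [
    {"A": 0.5, "G": -0.5},
    {"A": -0.2, "G": 0.2},
    {"A": 0.0, "G": 0.0},
]

# Hypothetical pairwise log-potentials for position pairs (i, j) with i < j.
# Residue pairs that are not listed contribute 0.
couplings = {
    (0, 1): {("A", "G"): 1.0},
    (1, 2): {("G", "G"): 0.5},
}


def log_potential(seq):
    """Unnormalized log-probability of seq under the pairwise MRF."""
    total = sum(fields[i][a] for i, a in enumerate(seq))
    for i, j in combinations(range(len(seq)), 2):
        total += couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return total


def variant_score(wild_type, position, new_residue):
    """Log-odds of a single substitution relative to the wild type.

    The partition function cancels in the difference, so no normalization is needed.
    """
    mutant = wild_type[:position] + new_residue + wild_type[position + 1:]
    return log_potential(mutant) - log_potential(wild_type)
```

Under these toy parameters, a negative `variant_score` means the substitution is less compatible with the modelled backbone than the wild-type residue, which is the kind of signal a zero-shot variant-effect predictor would rank on.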

**Fig. 2: Evaluation of CarbonDesign with the CAMEO and CASP15 independent testing sets.**

**Fig. 3: Evaluation of CarbonDesign on de novo backbone structures from RFdiffusion.**

**Fig. 4: Evaluation of CarbonDesign in interpreting functional effects of variants.**

**Fig. 5: Evaluation of ablation models of CarbonDesign.**

Data availability

The training data were obtained from the PDB website (http://www.rcsb.org/). The testing sets were acquired from CASP15 (https://predictioncenter.org/casp15/) and CAMEO (https://www.cameo3d.org). Other datasets supporting the findings of this study are available in the paper and the Supplementary Information. Source data are provided with this paper.

Code availability

The CarbonDesign software is available on both GitHub (https://github.com/zhanghaicang/carbonmatrix_public) and Code Ocean (https://codeocean.com/capsule/5915382/tree)⁵⁹.

References

Cao, L. et al. De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370, 426–431 (2020).
Bryan, C. M. et al. Computational design of a synthetic PD-1 agonist. Proc. Natl Acad. Sci. USA 118, 2102164118 (2021).
Yeh, A. H.-W. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).
Dou, J. et al. De novo design of a fluorescence-activating beta-barrel. Nature 561, 485–491 (2018).
Vorobieva, A. A. et al. De novo design of transmembrane beta barrels. Science 371, 8182 (2021).
Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature https://doi.org/10.1038/s41586-023-06415-8 (2023).
Yim, J. et al. SE(3) diffusion model with application to protein backbone generation. In Proc. of the 40th International Conference on Machine Learning (eds Krause, A. et al.) 40001–40039 (PMLR, 2023).
Ingraham, J. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).
Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
Hsu, C. et al. Learning inverse folding from millions of predicted structures. In Proc. of the 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 8946–8970 (PMLR, 2022).
Anand, N. et al. Protein sequence design with a learned potential. Nat. Commun. 13, 746 (2022).
Liu, Y. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat. Comput. Sci. 2, 451–462 (2022).
Huang, B. et al. Accurate and efficient protein sequence design through learning concise local environment of residues. Bioinformatics 39, 122 (2023).
Ingraham, J. et al. Generative models for graph-based protein design. In Proc. of Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 15820–15831 (NeurIPS, 2019).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Carreira, J. et al. Human pose estimation with iterative error feedback. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (eds Bajcsy, R. et al.) 4733–4742 (IEEE, 2016).
Tu, Z. & Bai, X. Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1744–1757 (2010).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Robin, X. et al. Continuous Automated Model EvaluatiOn (CAMEO)—perspectives on the future of fully automated evaluation of structure prediction methods. Proteins 89, 1977–1986 (2021).
CASP15. Critical Assessment of Techniques for Protein Structure Prediction, 15th Round. Abstract Book (Protein Structure Prediction Center, 2022); https://predictioncenter.org/casp15/doc/CASP15_Abstracts.pdf
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1988).
Wainwright, M. J. & Jordan, M. I. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1, 1–305 (2008).
Zhang, H. et al. Predicting protein inter-residue contacts using composite likelihood maximization and deep learning. BMC Bioinform. 20, 537 (2019).
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M. & Aurell, E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E 87, 012707 (2013).
Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, 1293–1301 (2011).
Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).
Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915–10919 (1992).
Wang, W., Peng, Z. & Yang, J. Single-sequence protein structure prediction using supervised transformer protein language models. Nat. Comput. Sci. 2, 804–814 (2022).
Chowdhury, R. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
Sakuma, K., Koike, R. & Ota, M. Dual-wield NTPases: a novel protein family mined from AlphaFold DB. Protein Sci. 33, e4934 (2024).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, 439–444 (2022).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Proc. of Advances in Neural Information Processing Systems (eds Ranzato, M. et al.) 29287–29303 (NeurIPS, 2021).
Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In Proc. of the 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16990–17017 (PMLR, 2022).
Rao, R. M. et al. MSA transformer. In Proc. of the 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8844–8856 (PMLR, 2021).
Findlay, G. M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217–222 (2018).
Kotler, E. et al. A systematic p53 mutation library links differential functional impact to cancer mutation pattern and evolutionary conservation. Mol. Cell 71, 178–190.e8 (2018).
Mighell, T. L., Evans-Dutson, S. & O’Roak, B. J. A saturation mutagenesis approach to understanding PTEN lipid phosphatase activity and genotype-phenotype relationships. Am. J. Hum. Genet. 102, 943–955 (2018).
Jia, X. et al. Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. Am. J. Hum. Genet. 108, 163–175 (2021).
Pan, X. et al. Structure of the human voltage-gated sodium channel Nav1.4 in complex with beta1. Science 362, 2486 (2018).
Hennig, M., Darimont, B., Sterner, R., Kirschner, K. & Jansonius, J. N. 2.0 Å structure of indole-3-glycerol phosphate synthase from the hyperthermophile Sulfolobus solfataricus: possible determinants of protein stability. Structure 3, 1295–1306 (1995).
Banerjee, S. et al. Protonation state of an important histidine from high resolution structures of lytic polysaccharide monooxygenases. Biomolecules https://doi.org/10.3390/biom12020194 (2022).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
Leman, J. K. et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods 17, 665–680 (2020).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. https://doi.org/10.1038/s41587-022-01618-2 (2023).
Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01763-2 (2023).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, 570–578 (2020).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, 170–176 (2017).
Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175 (2012).
Johnson, L. S., Eddy, S. R. & Portugaly, E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinform. 11, 431 (2010).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proc. of the International Conference on Learning Representations (eds Bengio, Y. et al.) 210–219 (ICLR, 2015).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. of Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 8024–8035 (NeurIPS, 2019).
Ren, M., Yu, C., Bu, D. & Zhang, H. Accurate and robust protein sequence design with Carbondesign. Code Ocean https://doi.org/10.24433/CO.5915382.v2 (2024).

Acknowledgements

We acknowledge the financial support from the National Natural Science Foundation of China (grant no. 32370657) and the Project of Youth Innovation Promotion Association CAS to H.Z. We also acknowledge the financial support from the Development Program of China (grant no. 2020YFA0907000) and the National Natural Science Foundation of China (grant nos. 32271297 and 62072435). We thank Beijing Paratera Co., Ltd and the ICT Computing-X Center, Chinese Academy of Sciences, for providing computational resources.

Author information

Authors and Affiliations

SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Milong Ren, Chungong Yu, Dongbo Bu & Haicang Zhang
University of Chinese Academy of Sciences, Beijing, China
Milong Ren, Chungong Yu, Dongbo Bu & Haicang Zhang
Central China Institute of Artificial Intelligence, Zhengzhou, China
Chungong Yu, Dongbo Bu & Haicang Zhang

Contributions

H.Z. conceived the ideas and implemented the CarbonDesign model and algorithms. H.Z. and M.R. designed the experiments, and M.R. conducted the main experiments and analysis. M.R. wrote the manuscript. H.Z., D.B. and C.Y. revised the manuscript.

Corresponding authors

Correspondence to Dongbo Bu or Haicang Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Haiyan Liu and Dong Xu for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–4, Figs. 1–6 and Tables 1–17.

Reporting Summary

Supplementary Data 1

Statistical Source Data for Supplementary Fig. 3.

Supplementary Data 2

Statistical Source Data for Supplementary Fig. 4.

Supplementary Data 3

Statistical Source Data for Supplementary Fig. 5.

Supplementary Data 4

Statistical Source Data for Supplementary Fig. 6.

Source data

Source Data Fig. 2

Statistical Source Data for Fig. 2.

Source Data Fig. 3

Statistical Source Data for Fig. 3.

Source Data Fig. 4

Statistical Source Data for Fig. 4.

Source Data Fig. 5

Statistical Source Data for Fig. 5.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ren, M., Yu, C., Bu, D. et al. Accurate and robust protein sequence design with CarbonDesign. Nat Mach Intell 6, 536–547 (2024). https://doi.org/10.1038/s42256-024-00838-2

Received: 10 August 2023
Accepted: 10 April 2024
Published: 23 May 2024
Issue Date: May 2024
DOI: https://doi.org/10.1038/s42256-024-00838-2
