Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems

Diego Fernández, Álvaro Olivera-Nappa, Roberto Uribe-Paredes & David Medina-Ortiz

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13919)

Included in the following conference series:

International Work-Conference on Bioinformatics and Biomedical Engineering

Abstract

Assigning functions to uncharacterized enzymes is one of the most common tasks in bioinformatics. Functional annotation methods based on phylogenetic properties have been the gold standard in genome annotation. However, these methods succeed only when the minimum requirements for establishing similarity or homology are met. Machine learning and deep learning methods offer an alternative and have proven useful for building functional classification systems across various bioinformatics tasks. Nevertheless, a clear strategy is still needed for building predictive models and for representing amino acid sequences numerically. In this work, we address the functional classification of enzyme sequences (EC number) with machine learning methods, exploring several alternatives for training predictive models and for numerical representation. The results show that representations based on pre-trained models achieve the best performance. However, a clear strategy for training the models is still needed. Among the alternatives explored, the CNN-based architectures proposed in this work learn and extract patterns from complex systems more readily, achieving performances above 97% with a binary cross-entropy below 0.05. Finally, we discuss the strategies explored and outline future work toward integrated methods for functional classification and the discovery of new enzymes to support current bioinformatics tools.
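
To make the reported error metric concrete, the following is a minimal sketch of how a mean binary cross-entropy could be computed for predicted class probabilities. It is an illustration only, not the authors' implementation: the function name `binary_cross_entropy`, the toy predictions, and the one-hot encoding over the seven top-level EC classes are all assumptions made for this example.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over every (sample, class) position."""
    total = 0.0
    count = 0
    for targets, probs in zip(y_true, y_pred):
        for t, p in zip(targets, probs):
            # Clip probabilities so log(0) can never occur.
            p = min(max(p, eps), 1.0 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Hypothetical toy batch: two sequences, one-hot over seven top-level EC classes.
y_true = [
    [1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
]
# Hypothetical model outputs (independent per-class probabilities).
y_pred = [
    [0.97, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
    [0.02, 0.02, 0.95, 0.02, 0.02, 0.02, 0.02],
]

loss = binary_cross_entropy(y_true, y_pred)
print(f"mean binary cross-entropy: {loss:.4f}")
```

For this toy batch the loss comes out at about 0.019, which shows the kind of confident, mostly correct predictions that a value below 0.05 corresponds to under these assumptions.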

Acknowledgments

The authors acknowledge funding by the MAG-2095 project, Ministry of Education, Chile. DMO acknowledges ANID for the project “SUBVENCIÓN A INSTALACIÓN EN LA ACADEMIA CONVOCATORIA AÑO 2022”, Folio 85220004. The authors gratefully acknowledge support from the Centre for Biotechnology and Bioengineering - CeBiB (PIA project FB0001, Conicyt, Chile).

Author information

Authors and Affiliations

Departamento de Ingeniería en Computación, Universidad de Magallanes, Av. Pdte. Manuel Bulnes, 01855, Punta Arenas, Chile
Diego Fernández, Roberto Uribe-Paredes & David Medina-Ortiz
Departamento de Ingeniería Química, Biotecnología y Materiales, Universidad de Chile, Beauchef 851, Santiago, Chile
Álvaro Olivera-Nappa & David Medina-Ortiz
Centre for Biotechnology and Bioengineering, Universidad de Chile, Beauchef 851, Santiago, Chile
Álvaro Olivera-Nappa


Corresponding author

Correspondence to David Medina-Ortiz.

Editor information

Editors and Affiliations

University of Granada, Granada, Spain
Ignacio Rojas
University of Granada, Granada, Spain
Olga Valenzuela
University of Granada, Granada, Spain
Fernando Rojas Ruiz
University of Granada, Granada, Spain
Luis Javier Herrera
University of Granada, Granada, Spain
Francisco Ortuño

Ethics declarations

Conflict of Interest Statement

The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

Cite this paper

Fernández, D., Olivera-Nappa, Á., Uribe-Paredes, R., Medina-Ortiz, D. (2023). Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems. In: Rojas, I., Valenzuela, O., Rojas Ruiz, F., Herrera, L.J., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2023. Lecture Notes in Computer Science, vol 13919. Springer, Cham. https://doi.org/10.1007/978-3-031-34953-9_24

DOI: https://doi.org/10.1007/978-3-031-34953-9_24
Published: 29 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34952-2
Online ISBN: 978-3-031-34953-9
eBook Packages: Computer Science, Computer Science (R0)
