Abstract
Discovering the function of uncharacterized enzymes is one of the most common tasks in bioinformatics. Functional annotation methods based on phylogenetic properties have been the gold standard in genome annotation processes. However, these methods succeed only when the minimum requirements for establishing similarity or homology are met. As an alternative, machine learning and deep learning methods have proven helpful for this problem and have produced functional classification systems for various bioinformatics tasks. Nevertheless, a clear strategy is still needed for building predictive models and for representing amino acid sequences numerically. In this work, we address the functional classification of enzyme sequences (EC number) with machine learning methods, exploring several alternatives for training predictive models and for numerical representation. The results show that the best performance is achieved with representations based on pre-trained models. However, a clear strategy for training the models is still needed. After exploring several alternatives, we observe that the CNN-based architectures proposed in this work learn and extract patterns from complex systems more readily, achieving performance above 97% with binary cross-entropy error below 0.05. Finally, we discuss the strategies explored and outline future work on integrated methods for functional classification and the discovery of new enzymes to support current bioinformatics tools.
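To give a sense of scale for the binary cross-entropy figure quoted in the abstract, the following is a minimal sketch of how that loss can be computed for a single enzyme sequence. It is an illustration only, not the authors' implementation: the function name `binary_cross_entropy`, the one-vs-rest target over the seven top-level EC classes, and the predicted probabilities are all hypothetical values chosen for the example.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over paired 0/1 labels and probabilities."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so log() never receives exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical example: one sequence, one binary output per top-level
# EC class (EC 1 to EC 7), with the true class being EC 3.
y_true = [0, 0, 1, 0, 0, 0, 0]
y_prob = [0.02, 0.01, 0.96, 0.03, 0.02, 0.01, 0.02]

loss = binary_cross_entropy(y_true, y_prob)
print(f"binary cross-entropy: {loss:.4f}")
```

With these made-up probabilities the loss is well under 0.05, which suggests that errors in that range correspond to confident and mostly correct per-class predictions.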
Acknowledgments
The authors acknowledge funding by the MAG-2095 project, Ministry of Education, Chile. DMO acknowledges ANID for the project “SUBVENCIÓN A INSTALACIÓN EN LA ACADEMIA CONVOCATORIA AÑO 2022”, Folio 85220004. The authors gratefully acknowledge support from the Centre for Biotechnology and Bioengineering - CeBiB (PIA project FB0001, Conicyt, Chile).
Ethics declarations
Conflict of Interest Statement
The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fernández, D., Olivera-Nappa, Á., Uribe-Paredes, R., Medina-Ortiz, D. (2023). Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems. In: Rojas, I., Valenzuela, O., Rojas Ruiz, F., Herrera, L.J., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2023. Lecture Notes in Computer Science, vol 13919. Springer, Cham. https://doi.org/10.1007/978-3-031-34953-9_24
DOI: https://doi.org/10.1007/978-3-031-34953-9_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34952-2
Online ISBN: 978-3-031-34953-9
eBook Packages: Computer Science, Computer Science (R0)