Deep Learning-Based Molecular Fingerprint Prediction for Metabolite Annotation
Figure 1. Workflow of deep learning-based metabolite annotation, comprising MS/MS data processing, feature selection, model training, molecular fingerprint prediction, molecular formula prediction, candidate retrieval, and candidate ranking.
Figure 2. Architecture of a deep learning model that predicts molecular fingerprints from MS/MS spectra transformed into vectors.
Abstract
1. Introduction
2. Materials and Methods
2.1. Overview
2.2. MS/MS Data Processing
2.3. Molecular Fingerprint Calculation
2.4. Feature Selection
2.5. Deep Learning Model for Molecular Fingerprint Prediction
2.6. Molecular Formula Prediction
2.7. Performance Evaluation
3. Results and Discussions
3.1. Evaluation of Molecular Fingerprint Prediction Models via Cross-Validation
3.2. Top-k Ranking Performance Evaluation of Deep Learning Models via CASMI Benchmark Datasets
3.3. Comparison of Deep Learning Models with CSI:FingerID
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

Data | Source(s) | MS/MS Spectra (Positive) | MS/MS Spectra (Negative) | Compounds (Total) | Compounds (Positive) | Compounds (Negative) |
---|---|---|---|---|---|---|
Training data | MoNA, NIST 23, HMDB | 1,005,931 | 264,153 | 43,386 | 38,339 | 18,213 |
Testing data | CASMI 2016 | 127 | 81 | 188 | 127 | 81 |
Testing data | CASMI 2017 | 147 | 96 | 169 | 147 | 96 |
Testing data | CASMI 2022 | 304 | 196 | 500 | 304 | 196 |

Data | Sources | Original Number | After Filtering (Positive) | After Filtering (Negative) | After Supervised Selection (Positive) | After Supervised Selection (Negative) |
---|---|---|---|---|---|---|
Bins | MoNA, NIST 23, HMDB | 91,001 | 2010 | 2015 | 500 | 500 |
Fingerprints | PubChem, MACCS, Klekota-Roth, FP3, FP4 | 6269 | 4606 | 4606 | 192 | 272 |
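
The bin counts in the table above imply that each MS/MS spectrum is converted into a fixed-length vector of m/z bins before modeling. The following is a minimal sketch of one common way to do that, not the authors' stated procedure: the function name `bin_spectrum`, the m/z range, the bin width, the choice to keep the maximum intensity per bin, and the base-peak normalization are all assumptions made for illustration.

```python
def bin_spectrum(peaks, mz_min, mz_max, bin_width):
    """Convert a peak list [(mz, intensity), ...] into a fixed-length vector.

    Assumptions (hypothetical, not taken from the article):
    - bins are centered on mz_min, mz_min + bin_width, ..., mz_max
    - peaks outside [mz_min, mz_max] are dropped
    - if several peaks fall into one bin, the largest intensity is kept
    - intensities are scaled so the largest bin equals 1.0
    """
    n_bins = int(round((mz_max - mz_min) / bin_width)) + 1
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz < mz_min or mz > mz_max:
            continue  # ignore peaks outside the modeled range
        idx = int(round((mz - mz_min) / bin_width))
        vector[idx] = max(vector[idx], float(intensity))
    top = max(vector)
    if top > 0:
        vector = [v / top for v in vector]
    return vector


# Toy example with a deliberately tiny range so the result is easy to inspect.
toy_peaks = [(100.0, 50), (100.6, 200), (100.4, 100), (101.9, 20), (250.0, 999)]
toy_vector = bin_spectrum(toy_peaks, mz_min=100.0, mz_max=102.0, bin_width=0.5)
print(toy_vector)
```

With a real spectrum the same function would simply be called with a wider range and a finer bin width; the resulting vectors are what a feature-selection step would then reduce to a smaller set of bins.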

Positive Mode | DNN | CNN | RNN |
---|---|---|---|
F1 | 50% | 71% | 51% |
Tanimoto | 34% | 55% | 35% |
MCC | 51% | 71% | 51% |

Negative Mode | DNN | CNN | RNN |
---|---|---|---|
F1 | 56% | 79% | 56% |
Tanimoto | 39% | 65% | 39% |
MCC | 56% | 78% | 57% |
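
The three metrics in the tables above (F1, Tanimoto, and MCC) can all be derived from the confusion counts between a true and a predicted binary fingerprint. The sketch below uses the standard textbook definitions; the function names are illustrative, and two details are assumptions rather than facts from the article: each score is computed for a single compound's fingerprint, and a score is set to 0.0 when its denominator is zero (other conventions exist, for example treating the Tanimoto similarity of two all-zero vectors as 1).

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for two equal-length sequences of 0/1 bits."""
    if len(y_true) != len(y_pred):
        raise ValueError("fingerprints must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tanimoto(y_true, y_pred):
    """Tanimoto (Jaccard) = TP / (TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the four confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy fingerprints: TP=2, FP=1, FN=1, TN=4.
true_fp = [1, 1, 0, 0, 1, 0, 0, 0]
pred_fp = [1, 0, 0, 0, 1, 1, 0, 0]
print(f1_score(true_fp, pred_fp), tanimoto(true_fp, pred_fp), mcc(true_fp, pred_fp))
```

Because F1 and Tanimoto ignore true negatives while MCC does not, the three scores can rank models differently on sparse fingerprints, which is one reason to report all of them side by side.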

Method | Random (Formula Unknown) | DNN (Formula Unknown) | CNN (Formula Unknown) | RNN (Formula Unknown) | DNN (Formula Predicted) | CNN (Formula Predicted) | RNN (Formula Predicted) | CSI:FingerID (Formula Predicted) | Random (Formula Known) | DNN (Formula Known) | CNN (Formula Known) | RNN (Formula Known) | CSI:FingerID (Formula Known) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Top-1 | 2% | 10% | 23% | 27% | 27% | 25% | 28% | 48% | 14% | 42% | 46% | 47% | 54% |
Top-3 | 9% | 19% | 29% | 32% | 45% | 35% | 33% | 59% | 27% | 60% | 60% | 61% | 69% |
Top-5 | 15% | 27% | 34% | 36% | 52% | 40% | 41% | 60% | 32% | 65% | 67% | 67% | 72% |
Top-10 | 27% | 40% | 45% | 47% | 62% | 54% | 54% | 63% | 37% | 73% | 77% | 78% | 75% |

Method | Random (Formula Unknown) | DNN (Formula Unknown) | CNN (Formula Unknown) | RNN (Formula Unknown) | DNN (Formula Predicted) | CNN (Formula Predicted) | RNN (Formula Predicted) | CSI:FingerID (Formula Predicted) | Random (Formula Known) | DNN (Formula Known) | CNN (Formula Known) | RNN (Formula Known) | CSI:FingerID (Formula Known) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Top-1 | 3% | 8% | 16% | 28% | 9% | 33% | 26% | 14% | 6% | 20% | 38% | 33% | 37% |
Top-3 | 11% | 11% | 23% | 28% | 16% | 34% | 28% | 20% | 16% | 27% | 46% | 46% | 48% |
Top-5 | 16% | 13% | 26% | 30% | 19% | 47% | 31% | 21% | 25% | 37% | 50% | 44% | 52% |
Top-10 | 28% | 21% | 30% | 37% | 27% | 50% | 37% | 22% | 38% | 44% | 56% | 52% | 56% |

Method | Random (Formula Unknown) | DNN (Formula Unknown) | CNN (Formula Unknown) | RNN (Formula Unknown) | DNN (Formula Predicted) | CNN (Formula Predicted) | RNN (Formula Predicted) | CSI:FingerID (Formula Predicted) | Random (Formula Known) | DNN (Formula Known) | CNN (Formula Known) | RNN (Formula Known) | CSI:FingerID (Formula Known) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Top-1 | 4% | 8% | 25% | 32% | 12% | 23% | 32% | 8% | 10% | 31% | 44% | 54% | 14% |
Top-3 | 10% | 18% | 35% | 43% | 23% | 30% | 36% | 13% | 22% | 44% | 53% | 56% | 22% |
Top-5 | 16% | 25% | 38% | 47% | 30% | 35% | 41% | 15% | 28% | 50% | 58% | 62% | 25% |
Top-10 | 25% | 31% | 51% | 51% | 41% | 49% | 47% | 18% | 42% | 56% | 66% | 67% | 28% |
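
The Top-k rows in the tables above report how often the correct compound appears among the k highest-ranked candidates. The sketch below shows one way such a number can be computed, under assumptions that are not confirmed by the text here: candidates are ranked by Tanimoto similarity between the predicted fingerprint and each candidate's fingerprint, ties keep their original order, and the helper names (`rank_candidates`, `top_k_accuracy`) and toy data are purely illustrative.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def rank_candidates(predicted_fp, candidates):
    """Return candidate ids sorted from most to least similar.

    `candidates` maps a candidate id to its fingerprint. Python's sort is
    stable, so tied candidates keep their insertion order (an assumption;
    other tie-breaking rules are possible).
    """
    return sorted(
        candidates,
        key=lambda cid: tanimoto(predicted_fp, candidates[cid]),
        reverse=True,
    )


def top_k_accuracy(ranked_lists, true_ids, k):
    """Fraction of queries whose true id is within the first k ranked ids."""
    hits = sum(
        1 for ranked, true_id in zip(ranked_lists, true_ids) if true_id in ranked[:k]
    )
    return hits / len(true_ids)


# Two toy queries, each with three hypothetical candidates.
ranked_1 = rank_candidates(
    [1, 1, 0, 1, 0],
    {"A": [1, 0, 0, 1, 0], "B": [1, 1, 0, 1, 0], "C": [0, 0, 1, 0, 1]},
)
ranked_2 = rank_candidates(
    [0, 1, 1, 0, 0],
    {"X": [0, 1, 1, 0, 0], "Y": [1, 1, 0, 0, 0], "Z": [0, 0, 1, 1, 1]},
)
true_ids = ["B", "Y"]  # correct compound for each query

for k in (1, 2, 3):
    print(k, top_k_accuracy([ranked_1, ranked_2], true_ids, k))
```

In a formula-known or formula-predicted setting, the candidate dictionary would first be restricted to compounds with the matching molecular formula, which shrinks the list and generally makes a high rank easier to achieve.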
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chau, H.Y.K.; Zhang, X.; Ressom, H.W. Deep Learning-Based Molecular Fingerprint Prediction for Metabolite Annotation. Metabolites 2025, 15, 132. https://doi.org/10.3390/metabo15020132