Abstract
Molecular property prediction is crucial for early drug candidate screening and optimization. Although deep learning-based methods have advanced considerably, they often fall short of fully leveraging 3D spatial information. In particular, current molecular encoding techniques tend to extract spatial information inadequately, leading to ambiguous representations in which a single embedding may correspond to multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformation, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs each molecule into three hierarchical graphs to better extract geometric information. In addition, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled molecules, treating conformations of the same topological structure as weighted positive pairs and conformations of different molecules as negatives, with weights derived from the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate its outstanding performance.
Data availability
The unlabeled datasets ZINC20 and PubChem, used in the pretraining stage, can be accessed at https://zinc20.docking.org/tranches/home/ and https://pubchem.ncbi.nlm.nih.gov/docs/downloads. The downstream benchmarks can be downloaded from MoleculeNet (https://moleculenet.org/datasets-1). The data are available for non-commercial use.
Code availability
The software can be accessed at https://github.com/AI-HPC-Research-Team/3D-Mol.
Acknowledgements
The research was supported by the Peng Cheng Cloud-Brain.
Funding
This work is supported by Peng Cheng Laboratory and by the Major Key Project of PCL (PCL2021A13).
Contributions
Taojie Kuang: Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft, Writing - review & editing. Yiming Ren: Validation, Writing - review & editing. Zhixiang Ren: Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing - original draft, Writing - review & editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendices
Appendix A: 3D conformation descriptor and fingerprint
A.1 Fingerprint
In our study, we use molecular fingerprints, specifically Morgan fingerprints, to compute weights for negative pairs in our model. These fingerprints provide a compact numerical representation of molecular structure and are widely used in computational chemistry. The Morgan algorithm iteratively updates each atom’s representation based on its chemical surroundings and hashes the result into a binary vector describing the molecule. By evaluating the similarity between Morgan fingerprints, we derive a weighting mechanism for negative pairs, which improves the model’s ability to distinguish molecular structures and, in turn, its predictive performance.
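As an illustration, the snippet below computes Morgan fingerprints with RDKit (the toolkit listed in Appendix D) and measures their Tanimoto similarity. This is a minimal sketch: the radius, bit-vector length, and the way a similarity is turned into a negative-pair weight are illustrative assumptions, not the exact settings used in 3D-Mol.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Two example molecules (aspirin and paracetamol); any pair of SMILES would do.
mol_a = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol_b = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")

# Morgan (circular) fingerprints; radius and bit length are illustrative choices.
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, radius=2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, radius=2, nBits=2048)

# Tanimoto similarity between the two bit vectors.
sim = DataStructs.TanimotoSimilarity(fp_a, fp_b)

# Hypothetical weighting: the more dissimilar two molecules are,
# the larger the weight assigned to them as a negative pair.
negative_pair_weight = 1.0 - sim
print(f"Tanimoto similarity: {sim:.3f}, negative-pair weight: {negative_pair_weight:.3f}")
```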
A.2 3D conformation descriptor
Molecular 3D conformation descriptors are computational representations of the three-dimensional arrangement of atoms within a molecule, capturing key aspects of its spatial geometry. They are important for understanding how molecular shape influences chemical and biological properties, and they play a significant role in fields such as drug design and materials science. The 3D-MoRSE descriptor, in particular, encodes the spatial distribution of atoms using the electron diffraction scattering function, providing a detailed, fixed-length representation of a conformation that is well suited to computational chemistry and cheminformatics. In our research, we employ 3D-MoRSE descriptors to measure the similarity of molecular 3D conformations, enabling us to compare and analyze molecular structures and identify potential similarities in their biological or chemical behavior. Such similarity information is especially valuable in drug discovery, where it can support the identification of new therapeutic compounds and the prediction of their activities.
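To make the comparison concrete, the sketch below embeds two conformers of the same molecule with RDKit, computes their 3D-MoRSE descriptors, and compares them with cosine similarity. The embedding settings and the choice of cosine similarity are assumptions for illustration and are not necessarily the exact procedure used in 3D-Mol.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors

# Example molecule (ibuprofen); hydrogens are added before 3D embedding.
mol = Chem.AddHs(Chem.MolFromSmiles("CC(C)Cc1ccc(cc1)C(C)C(=O)O"))

# Generate two conformers of the same topology and relax them with MMFF.
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=2, randomSeed=42)
AllChem.MMFFOptimizeMoleculeConfs(mol)

# 3D-MoRSE descriptor for each conformer (one fixed-length vector per conformer).
morse = [np.array(rdMolDescriptors.CalcMORSE(mol, confId=cid)) for cid in conf_ids]

# Cosine similarity between the two descriptor vectors (illustrative metric).
cos_sim = float(np.dot(morse[0], morse[1]) /
                (np.linalg.norm(morse[0]) * np.linalg.norm(morse[1])))
print(f"3D-MoRSE cosine similarity between conformers: {cos_sim:.3f}")
```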
Appendix B: The contribution of pretraining method
In this section, we discuss the contributions of contrastive learning and supervised pretraining to our pretraining approach. We pretrained our model in three ways: with contrastive learning only, with supervised pretraining only, and with the complete pretraining method, and compared their performance on the 7 benchmark datasets. As shown in Table 4, each component alone yields smaller gains than the complete method. These findings indicate that while both contrastive learning and supervised pretraining contribute positively to the model’s performance, their combination is crucial for achieving the best results.
Appendix C: Finetuning details
During finetuning on each downstream task, we perform a random search over hyper-parameters to find the best-performing setting on the validation set and report the results of that setting on the test set. Table 5 lists the hyper-parameter combinations searched.
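A minimal sketch of this random-search loop is shown below; the search space, the trial budget, and the `finetune_and_evaluate` stub are hypothetical placeholders, with the actual hyper-parameter combinations being those listed in Table 5.

```python
import random

# Hypothetical search space; the real candidate values are those in Table 5.
SEARCH_SPACE = {
    "learning_rate": [1e-3, 5e-4, 1e-4],
    "batch_size": [32, 64, 128],
    "dropout": [0.0, 0.1, 0.2],
}

def finetune_and_evaluate(config):
    # Placeholder: finetune the pretrained encoder with `config` on the training
    # split and return the metric on the validation split (e.g., ROC-AUC for
    # classification tasks). Here it just returns a random score.
    return random.random()

rng = random.Random(0)
best_config, best_score = None, float("-inf")
for _ in range(20):  # the number of random trials is also an assumption
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    score = finetune_and_evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print("Best validation score:", best_score, "with config:", best_config)
# The configuration selected on the validation set is then evaluated once on the test set.
```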
Appendix D: Environment
CPU:
- Architecture: x86_64
- Number of CPUs: 96
- Model: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
GPU:
- Type: Tesla V100-SXM2-32GB
- Count: 8
- Driver Version: 450.80.02
- CUDA Version: 11.7
Software Environment:
- Operating System: Ubuntu 20.04.6 LTS
- Python Version: 3.10.9
- Paddle Version: 2.4.2
- PGL Version: 2.2.5
- RDKit Version: 2023.3.2