CKG-IMC: an inductive matrix completion method enhanced by CKG and GNN for Alzheimer’s disease compound-protein interactions prediction
Alzheimer’s disease (AD) is a prevalent neurodegenerative disorder without effective therapeutic interventions. To address this challenge, we present CKG-IMC, a deep learning model for predicting compound-protein interactions (CPIs) relevant to AD. CKG-IMC integrates three modules: a Collaborative Knowledge Graph (CKG), a Principal Neighborhood Aggregation Graph Neural Network (PNA), and Inductive Matrix Completion (IMC).
The code has been tested under Python 3.8 and 3.9.18. The required packages are as follows:
numpy>=1.25.0
pandas>=2.0.3
scikit-learn>=1.3.0
torch>=2.0.1
torch_geometric>=2.3.1
tqdm>=4.65.0
To perform 10-fold cross-validation, run the following command:
python main.py -adv
To display the help message, use the following command:
python main.py -h
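The 10-fold cross-validation run above partitions the labeled compound-protein pairs into ten folds. How main.py builds its folds is not described here, so the following is only an illustrative sketch of a stratified 10-fold split, written with NumPy alone. The function name `stratified_kfold_indices` and the toy label vector are hypothetical and are not taken from the repository.

```python
import numpy as np

def stratified_kfold_indices(labels, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class balance in every fold.

    Hypothetical helper for illustration; not the repository's own splitter.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal the shuffled samples of this class round-robin across the folds.
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for k in range(n_splits):
        test_idx = np.flatnonzero(fold_of == k)
        train_idx = np.flatnonzero(fold_of != k)
        yield train_idx, test_idx

# Toy stand-in for a binary label column: 20 positives and 80 negatives.
toy_labels = np.array([1] * 20 + [0] * 80)
folds = list(stratified_kfold_indices(toy_labels, n_splits=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), len(test_idx), int(toy_labels[test_idx].sum()))
```

Each pair appears in exactly one test fold, and the positive-to-negative ratio is kept roughly constant across folds. scikit-learn, which is already listed in the requirements, provides `StratifiedKFold` for the same purpose.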
This study aims to predict CPIs for Alzheimer’s disease. To this end, we integrate six widely used public databases (CTD, UniProt, PubChem, SIDER, DrugBank, and ChemSpider) along with four commonly used DTI datasets (Luo’s dataset, Hetionet, Yamanishi_08, and BioKG) to obtain information on compounds, drug side effects, and proteins relevant to Alzheimer’s disease.
The files under the ./data folder are described below.
- CCS.npy
- Format: NumPy Binary file
- Purpose: Represents the Compound-Compound Similarity matrix.
- Size: 8360 rows x 8360 cols
- Description: The CCS.npy file is a NumPy binary file storing the Compound-Compound Similarity matrix with dimensions 8360 x 8360, where each element represents the similarity between compounds. The values in the matrix fall within the range of 0 to 1, providing a normalized measure of similarity. A value closer to 1 indicates a higher similarity, while a value closer to 0 suggests dissimilarity.
- PPS.npy
- Format: NumPy Binary file
- Purpose: Represents the Protein-Protein Similarity matrix.
- Size: 1975 rows x 1975 cols
- Description: The PPS.npy file is a NumPy binary file storing the Protein-Protein Similarity matrix with dimensions 1975 x 1975, where each element represents the similarity between proteins. The values in the matrix fall within the range of 0 to 1, providing a normalized measure of similarity. A value closer to 1 indicates a higher similarity, while a value closer to 0 suggests dissimilarity.
- CPI.npy
- Format: NumPy Binary file
- Purpose: Represents the Compound-Protein Interaction matrix.
- Size: 8360 rows x 1975 cols
- Description: Each element in the matrix is either 0 or 1, indicating the absence or presence of the interaction relationship.
- compound_se.npy
- Format: NumPy Binary file
- Purpose: Represents the drug-side effect relationship matrix.
- Size: 8360 rows x 5854 cols
- Description: Each element in the matrix is either 0 or 1, indicating the absence or presence of the drug-side effect relationship.
- CPI_with_negative_sample.csv
- Format: CSV file
- Size: 1,875,504 rows x 3 cols
- Purpose: Contains compound-protein interaction information with negative samples.
- Description: Each row represents a compound-protein interaction. The columns are as follows:
  - compound_idx: Index representing the compound.
  - protein_idx: Index representing the protein.
  - label: Binary label indicating the interaction (1) or a negative sample (0).
- compounds.list
- Format: Text file
- Purpose: Contains a list of compound CIDs.
- Size: 8360 rows
- Description: Each row represents the CID of a compound from the PubChem database.
- proteins.list
- Format: Text file
- Purpose: Contains a list of protein IDs.
- Size: 1975 rows
- Description: Each row represents the UID of a protein from the UniProt database.
- se.list
- Format: Text file
- Purpose: Contains a list of side-effect IDs.
- Size: 5854 rows
- Description: Each row represents the ID of a side effect from the SIDER database.
- entities.dict
- Format: Text file
- Purpose: Contains a dictionary mapping entity row numbers to their IDs in the corresponding database.
- Description: The first column contains each entity's row number, and the second column contains its ID in the corresponding database. Entries typically follow the order of compounds, proteins, and side effects. The two columns are separated by tabs.
- relations.dict
- Format: Text file
- Purpose: Contains a dictionary mapping relation names to IDs.
- Size: 4 rows x 2 cols
- Description: The relations.dict file maps relation IDs to relation names. The first column contains the relation ID, and the second column contains the corresponding relation name. The two columns are separated by tabs.
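The file descriptions above imply that the matrices must agree on the number of compounds and proteins. A minimal sketch of loading them and checking that consistency is shown below. The helper names `load_matrices`, `check_shapes`, and `read_tab_dict` are hypothetical and are not part of the repository. The sketch also assumes the .dict files contain exactly two tab-separated columns per line, as described above.

```python
from pathlib import Path
import numpy as np

MATRIX_FILES = ("CCS", "PPS", "CPI", "compound_se")

def load_matrices(data_dir):
    """Load the four .npy matrices from data_dir into a dict keyed by file stem."""
    data_dir = Path(data_dir)
    return {name: np.load(data_dir / f"{name}.npy") for name in MATRIX_FILES}

def check_shapes(mats):
    """Check that all matrices agree on the compound and protein counts.

    Returns (n_compounds, n_proteins, n_side_effects).
    """
    n_compounds, n_proteins = mats["CPI"].shape
    assert mats["CCS"].shape == (n_compounds, n_compounds)
    assert mats["PPS"].shape == (n_proteins, n_proteins)
    assert mats["compound_se"].shape[0] == n_compounds
    return n_compounds, n_proteins, mats["compound_se"].shape[1]

def read_tab_dict(path):
    """Parse a two-column, tab-separated file into {first column: second column}.

    Keys are kept as strings because the exact ID format is not assumed here.
    """
    mapping = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, value = line.split("\t", 1)
            mapping[key] = value
    return mapping

# Runs only when the repository data is present in the working directory.
data_dir = Path("./data")
if (data_dir / "CPI.npy").exists():
    mats = load_matrices(data_dir)
    print(check_shapes(mats))
    print(read_tab_dict(data_dir / "relations.dict"))
```

Given the sizes listed above, `check_shapes` would be expected to report 8360 compounds, 1975 proteins, and 5854 side effects. A failed assertion would indicate a mismatch between the downloaded files.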
If you find this work useful, please consider citing our paper.
@article{yuan2024ckg,
title={CKG-IMC: An inductive matrix completion method enhanced by CKG and GNN for Alzheimer’s disease compound-protein interactions prediction},
author={Yuan, Yongna and Hu, Rizhen and Chen, Siming and Zhang, Xiaopeng and Liu, Zhenyu and Zhou, Gonghai},
journal={Computers in Biology and Medicine},
volume={177},
pages={108612},
year={2024},
publisher={Elsevier}
}