CN114613428B - Metabolite-protein interaction prediction method based on two-dimensional heterogeneous network
- Publication number
- CN114613428B (CN202011428394.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a metabolite-protein interaction prediction method based on a two-dimensional heterogeneous network, which comprises the following steps: constructing a metabolite-protein two-dimensional heterogeneous network; calculating multidimensional correlation features for metabolite-protein pairs; constructing a metabolite-protein interaction prediction model based on the random forest algorithm; and predicting interactions between metabolites and proteins. Based on the established prediction model, the invention can reliably predict potential metabolite-protein interactions.
Description
Technical Field
The present invention relates to the field of bioinformatics, and in particular to the prediction of interactions between macromolecules and small molecules using computer technology.
Background
The human body is a complex system composed of elements of multiple types and layers, such as genes, proteins and metabolites, together with the interactions among them; it maintains its balance and steady state through complex interaction and compensation mechanisms. In-depth exploration of the complex interactions between elements of different types and layers in the human body helps to describe living systems systematically and promotes the understanding and study of the mechanisms of complex diseases such as cancer.
While the functions of biological macromolecules such as DNA and proteins have long attracted great interest, metabolites have been regarded as end products passively produced by enzyme catalysis, and few studies have focused on the regulatory functions of metabolites in other biological processes such as immunity and signaling. Recently, a growing number of studies have found that metabolites regulate and affect a variety of biological processes other than metabolism through interactions with key proteins in the human body. It is therefore important to understand the interactions between metabolites and proteins. However, the currently known metabolite-protein interactions are mainly limited to those between metabolic enzymes and their reactants and products, and a large number of metabolite-protein interactions remain undiscovered.
Although high-throughput mass spectrometry techniques have been used to identify proteins that interact with specific metabolites, such methods are experimentally costly and are generally limited to interactions that occur through covalent binding. To predict metabolite-protein interactions more systematically and comprehensively, the invention provides a metabolite-protein interaction prediction method based on a two-dimensional heterogeneous network, which aims to predict diverse metabolite-protein interactions efficiently by computational means.
Disclosure of Invention
The invention aims to provide a metabolite-protein interaction prediction method based on a two-dimensional heterogeneous network. The method constructs a two-dimensional heterogeneous biological network of metabolites and proteins, calculates the multidimensional correlation between any metabolite and any protein based on that network, constructs a random forest classification model, and predicts potential metabolite-protein interactions.
The invention solves the above problems by adopting the following technical scheme:
A method for predicting metabolite-protein interactions based on a two-dimensional heterogeneous network, the method comprising:
Modeling process: constructing a random forest model for metabolite-protein interaction prediction based on a two-dimensional heterogeneous network;
Step 1: constructing a metabolite-protein two-dimensional heterogeneous network;
Step 2: collecting a positive sample set and a negative sample set of metabolite-protein interactions;
Step 3: for each sample in the positive and negative sample sets, calculating the multidimensional correlation between the metabolite and the protein of that pair;
Step 4: training a metabolite-protein interaction prediction model based on the random forest algorithm, using the multidimensional correlation results;
Actual prediction: predicting whether any metabolite and protein interact, based on the constructed random forest model;
For any metabolite and protein that are not directly connected in the metabolite-protein two-dimensional heterogeneous network, the multidimensional correlation between the two is calculated by the same method as in step 3; the resulting multidimensional correlation is fed into the prediction model of step 4 to obtain a probability value of interaction between the metabolite and the protein, and the metabolite and the protein are judged to interact when the probability value is greater than a preset threshold.
The construction of the metabolite-protein two-dimensional heterogeneous network comprises the following three steps:
Step 11: constructing a protein-protein interaction network;
reading human protein-protein physical interaction data from the BioGRID database, labeling each protein with its gene name, and constructing a protein-protein interaction network;
Step 12: constructing a metabolite-protein interaction network;
reading the KGML files of human metabolic pathways from the KEGG database and obtaining the reaction information; obtaining the enzymes, reactants and products of each reaction, labeling the enzymes with gene names and labeling the reactants and products with KEGG compound IDs;
an interaction is defined between any enzyme and any reactant, or between any enzyme and any product, that participate in the same reaction, and the non-repeated interactions from all reactions are integrated to construct the metabolite-protein interaction network;
Step 13: constructing a metabolite-metabolite interaction network;
reading the two-dimensional structure file (SDF) of each metabolite in the KEGG metabolic pathways from the PubChem database; calculating a molecular descriptor for each metabolite; calculating the Tanimoto correlation coefficient between the molecular descriptors of any two metabolites;
constructing a metabolite-metabolite interaction network from all metabolite-metabolite associations with a correlation greater than 0;
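The enzyme-metabolite pairing rule of step 12 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: it assumes the KGML files have already been parsed into dictionaries with the hypothetical keys `enzymes`, `reactants` and `products`, and the function name `metabolite_protein_edges` and the toy reactions are illustrative only.

```python
from itertools import product


def metabolite_protein_edges(reactions):
    """Pair every enzyme with every reactant and product of the same reaction.

    Returns a set of (metabolite, enzyme) tuples, so interactions that
    appear in several reactions are kept only once.
    """
    edges = set()
    for rxn in reactions:
        metabolites = set(rxn["reactants"]) | set(rxn["products"])
        for enzyme, metabolite in product(rxn["enzymes"], metabolites):
            edges.add((metabolite, enzyme))
    return edges


# Toy input (assumed format): enzymes as gene names, metabolites as KEGG compound IDs.
reactions = [
    {"enzymes": ["HK1", "GCK"], "reactants": ["C00031"], "products": ["C00092"]},
    {"enzymes": ["GPI"], "reactants": ["C00092"], "products": ["C00085"]},
]
edges = metabolite_protein_edges(reactions)
print(sorted(edges))
```

Using a set keeps the integration of non-repeated interactions implicit: a pair contributed by several reactions is stored once.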
The sample collection in step 2 comprises the following steps:
Step 21: deleting from the metabolite-protein interaction network of step 12 the 10 metabolites with the largest numbers of neighbor nodes, and taking all remaining interacting metabolite-protein pairs as the positive sample set;
Step 22: randomly pairing the metabolites and proteins of the positive set, ensuring that no generated pair duplicates a pair in the positive set, to generate a negative set of the same size as the positive set;
The multidimensional correlation calculation in step 3 comprises the following steps:
Step 31: constructing a protein adjacency matrix P from the protein-protein interactions;
Step 32: constructing a metabolite-protein adjacency matrix I from the metabolite-protein interactions;
Step 33: constructing a metabolite-metabolite adjacency matrix M from the metabolite-metabolite interactions;
Step 34: calculating the 4-dimensional metabolite-protein correlation based on the adjacency matrices P, I and M.
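The protein adjacency matrix P in step 31 satisfies the following condition:
$P_{i,j} = 1$ if protein $i$ and protein $j$ interact, and $P_{i,j} = 0$ otherwise,
where $P_{i,j}$ represents the value in the $i$-th row and $j$-th column of matrix P.
The metabolite-protein adjacency matrix I in step 32 satisfies the following condition:
$I_{i,j} = 1$ if metabolite $i$ and protein $j$ interact, and $I_{i,j} = 0$ otherwise,
where $I_{i,j}$ represents the value in the $i$-th row and $j$-th column of matrix I.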
The metabolite adjacency matrix M in step 33 satisfies the following condition:
$M_{i,j}$ = Tanimoto correlation coefficient between the molecular descriptors of metabolite $i$ and metabolite $j$,
where $M_{i,j}$ represents the value in the $i$-th row and $j$-th column of matrix M.
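As an illustration of steps 31 to 33, the three matrices can be assembled from edge lists as sketched below. This is a hedged example rather than the patent's code: the helper names `binary_adjacency` and `similarity_matrix`, the node labels and the similarity value are hypothetical, the binary matrices assume a 0/1 encoding of interactions, and the diagonal of M is set to 1 on the assumption that each metabolite has Tanimoto similarity 1 with itself.

```python
def binary_adjacency(rows, cols, edges, symmetric=False):
    """Return a 0/1 matrix with a 1 wherever (row item, column item) is an edge."""
    r = {name: i for i, name in enumerate(rows)}
    c = {name: j for j, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for a, b in edges:
        matrix[r[a]][c[b]] = 1
        if symmetric:  # undirected network on one node set: mirror the entry
            matrix[r[b]][c[a]] = 1
    return matrix


def similarity_matrix(names, similarities):
    """Return a symmetric matrix of pairwise similarities (diagonal assumed 1)."""
    idx = {name: i for i, name in enumerate(names)}
    n = len(names)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (a, b), s in similarities.items():
        matrix[idx[a]][idx[b]] = matrix[idx[b]][idx[a]] = s
    return matrix


# Toy networks with hypothetical node labels.
proteins = ["p1", "p2", "p3"]
metabolites = ["m1", "m2"]
P = binary_adjacency(proteins, proteins, [("p1", "p2"), ("p2", "p3")], symmetric=True)
I = binary_adjacency(metabolites, proteins, [("m1", "p1"), ("m2", "p3")])
M = similarity_matrix(metabolites, {("m1", "m2"): 0.5})
```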
Step 34 includes the following steps:
Step 341: for any sample in the positive and negative sample sets, namely one metabolite $m$ and one protein $p$, calculating the first-dimension correlation NS1:
$\mathrm{NS1}(m,p) = (IP)_{m,p}$
Step 342: calculating the second-dimension correlation NS2:
$\mathrm{NS2}(m,p) = (IP^2)_{m,p} / \sum_k I_{m,k}$
Step 343: calculating the third-dimension correlation NS3:
$\mathrm{NS3}(m,p) = (MI)_{m,p} / \sum_k I_{m,k}$
Step 344: calculating the fourth-dimension correlation NS4:
$\mathrm{NS4}(m,p) = \mathrm{edgeConnectivity}(m, p, G.H)$,
where G.H is the two-dimensional heterogeneous network that simultaneously integrates protein-protein interactions, metabolite-metabolite interactions and protein-metabolite interactions, and edgeConnectivity(m, p, G.H) calculates the minimum number of network edges that must be deleted from G.H to break all paths linking nodes m and p.
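The first three correlations reduce to matrix products, which the following minimal Python sketch works through on toy matrices. It is an illustration under assumptions, not the patented implementation: the helper names `matmul` and `ns_features`, the use of row and column indices for m and p, and the choice to return 0 when metabolite m has no known protein partner (zero row sum) are all assumptions of this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ns_features(I, P, M, m, p):
    """Return (NS1, NS2, NS3) for metabolite row index m and protein column index p."""
    degree = sum(I[m])          # sum over k of I[m][k]
    IP = matmul(I, P)           # metabolite -> protein -> protein
    IP2 = matmul(IP, P)         # one further protein-protein step
    MI = matmul(M, I)           # metabolite -> similar metabolite -> protein
    ns1 = IP[m][p]
    ns2 = IP2[m][p] / degree if degree else 0.0  # zero-degree handling is an assumption
    ns3 = MI[m][p] / degree if degree else 0.0
    return ns1, ns2, ns3


# Toy matrices: 2 metabolites, 3 proteins.
I = [[1, 0, 0],
     [0, 0, 1]]
P = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
M = [[1.0, 0.5],
     [0.5, 1.0]]
features = ns_features(I, P, M, m=0, p=2)
print(features)  # (0, 1.0, 0.5)
```

In this toy case metabolite 0 has no protein partner adjacent to protein 2, so NS1 is 0; it reaches protein 2 in two protein-protein steps, so NS2 is 1; and its similar metabolite (similarity 0.5) interacts with protein 2, so NS3 is 0.5.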
Step 4 comprises the following steps:
Step 41: randomly selecting samples from the negative and positive sets, in a given proportion, as a training set and a test set;
Step 42: for the training set, taking the 4 dimensions of correlation between the metabolite and the protein of each sample, calculated in step 3, as the feature description of the sample, taking the set (positive or negative) from which the sample comes as its class label, and training a random forest classification model;
Step 43: for the test set, taking the 4 dimensions of correlation between the metabolite and the protein of each sample, calculated in step 3, as the feature description of the sample and inputting it into the classification model trained in step 42 to obtain the probability that each test sample belongs to the positive set and to the negative set; a sample whose positive probability is greater than a preset threshold is predicted as positive, and otherwise as negative, and the accuracy of the prediction results is then evaluated.
Positive indicates that the test sample is judged to have an interaction, and negative indicates that it is judged to have no interaction.
The invention has the following beneficial effects and significance:
The metabolite-protein interaction prediction method based on a two-dimensional heterogeneous network provided by the invention constructs a metabolite-protein two-dimensional heterogeneous network by integrating known protein-protein interactions, metabolite-metabolite structural correlations and metabolite-protein interactions; describes, in multiple dimensions, the complex network association between any metabolite and any protein based on that network; constructs a network-based metabolite-protein interaction prediction model using the random forest algorithm; and can reliably predict potential metabolite-protein interactions.
Drawings
FIG. 1 is a schematic diagram of a method of predicting metabolite-protein interactions based on a two-dimensional heterogeneous network according to the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit or scope of the invention, which is therefore not limited to the specific embodiments disclosed below.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
1. Constructing a random forest model for metabolite-protein interaction prediction based on a two-dimensional heterogeneous network;
FIG. 1 is a schematic diagram of the two-dimensional heterogeneous network-based metabolite-protein interaction prediction method provided by the invention. As shown in FIG. 1, the present invention provides a method for predicting whether a metabolite and a protein interact. FIG. 1 gives four steps in order from top to bottom, with the following specific contents:
Step 1: constructing a metabolite-protein two-dimensional heterogeneous network;
The metabolite-protein two-dimensional heterogeneous network comprises a metabolite-layer network and a protein-layer network; the two layers are connected through the known interactions between the metabolite layer and the protein layer, forming a two-dimensional heterogeneous network system that provides the foundation for subsequently describing the association between metabolites and proteins.
Step 2: collecting positive and negative sample sets of metabolite-protein interactions;
The positive sample set consists of metabolite-protein pairs with known interactions, and the negative set consists of non-interacting metabolite-protein pairs; equal numbers of negative and positive samples are collected as the data basis for subsequently training the random forest prediction model.
Step 3: calculating the multidimensional correlation between each metabolite-protein pair;
This step integrates the various types of topological correlation in the metabolite-protein two-dimensional heterogeneous network and calculates the network correlation between any metabolite-protein pair from several angles, yielding multidimensional correlation features.
Step 4: training a metabolite-protein interaction prediction model based on the random forest algorithm;
Based on the multidimensional correlation features obtained in step 3, a metabolite-protein interaction prediction model is trained with the random forest algorithm; given the multidimensional correlation features of any metabolite-protein pair, the model estimates the probability that the pair interacts.
2. Predicting whether any metabolite and protein interact, based on the constructed random forest model;
For any metabolite and any protein in the two-dimensional heterogeneous network of step 1, the multidimensional correlation between the two is calculated by the same method as in step 3; the random forest model obtained in step 4 then gives the probabilities that the metabolite-protein pair belongs to the positive set and to the negative set, where the negative and positive probabilities sum to 1. When the positive probability is greater than 0.5, the result is considered positive, that is, a potential interaction exists between the metabolite and the protein; when the positive probability is greater than 0.9, the result is considered highly credible.
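The two decision thresholds described above (0.5 for a positive call, 0.9 for high credibility) can be applied as in the sketch below. The function name `call_interactions`, the pair labels and the probability values are hypothetical and serve only to illustrate the rule; in practice the probabilities would come from the trained random forest model.

```python
def call_interactions(probabilities, threshold=0.5, high_confidence=0.9):
    """Map each (metabolite, protein) pair to a call based on its positive probability."""
    calls = {}
    for pair, p_positive in probabilities.items():
        if p_positive > high_confidence:
            calls[pair] = "positive (high confidence)"
        elif p_positive > threshold:
            calls[pair] = "positive"
        else:
            calls[pair] = "negative"
    return calls


# Hypothetical positive-class probabilities for three candidate pairs.
probabilities = {("m1", "p1"): 0.95, ("m1", "p2"): 0.62, ("m2", "p1"): 0.5}
calls = call_interactions(probabilities)
```

Because the comparison is strict, a pair with a positive probability of exactly 0.5 is called negative, matching the "greater than 0.5" wording.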
The steps involved in the above are explained in detail below.
Step 1: constructing a metabolite-protein two-dimensional heterogeneous network;
Step 1 comprises the following steps:
Step 11: constructing a protein-protein interaction network;
downloading human protein-protein physical interaction data from the BioGRID database, labeling each protein with its gene name (Gene Symbol), and constructing a protein-protein interaction network;
Step 12: constructing a metabolite-protein interaction network;
Step 12 specifically includes the following steps:
Step 121: downloading the KGML files of all human metabolic pathways from the KEGG database;
Step 122: obtaining the reaction information in each KGML file;
Step 123: obtaining the enzymes, reactants and products of each reaction, labeling the enzymes with gene names and labeling the reactants and products with KEGG compound IDs;
Step 124: adding an interaction between any enzyme and any reactant, or between any enzyme and any product, that participate in the same reaction;
Step 125: integrating the non-repeated interactions from all reactions to construct the metabolite-protein interaction network;
Step 13: constructing a metabolite-metabolite interaction network;
Step 13 specifically includes the following steps:
Step 131: downloading the two-dimensional structure file (SDF) of each metabolite in the KEGG metabolic pathways from the PubChem database;
Step 132: calculating a molecular descriptor for each metabolite from its two-dimensional SDF structure file, using the sdf2ap function of the R package ChemmineR;
Step 133: calculating the Tanimoto correlation coefficient between the molecular descriptors of any two metabolites, using the cmp.similarity function of the R package ChemmineR;
Step 134: constructing a metabolite-metabolite interaction network from all metabolite-metabolite associations with a correlation greater than 0;
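For illustration, steps 133 and 134 can be sketched in Python on set-valued descriptors. This is not the ChemmineR computation itself: it assumes each molecular descriptor can be represented as a set of feature identifiers, uses the standard set form of the Tanimoto coefficient (intersection size over union size), and the function names and toy descriptor values are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_edges(descriptors):
    """Keep every metabolite pair whose Tanimoto coefficient is greater than 0."""
    edges = {}
    for m1, m2 in combinations(sorted(descriptors), 2):
        score = tanimoto(descriptors[m1], descriptors[m2])
        if score > 0:
            edges[(m1, m2)] = score
    return edges


# Toy descriptors: integers stand in for structural features.
descriptors = {
    "C00031": {1, 2, 3, 4},
    "C00092": {2, 3, 4, 5, 6},
    "C00219": {7, 8},
}
edges = similarity_edges(descriptors)
print(edges)  # {('C00031', 'C00092'): 0.5}
```

The third metabolite shares no features with the other two, so it contributes no edge to the metabolite-metabolite network.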
Step 2 comprises the following steps:
Step 21: deleting from the metabolite-protein interaction network of step 12 the 10 metabolites with the largest numbers of neighbor nodes, and taking all remaining interacting metabolite-protein pairs as the positive sample set;
Step 22: randomly pairing the metabolites and proteins of the positive set using the R function sample, ensuring that no generated pair duplicates a pair in the positive set, to generate a negative set of the same size as the positive set;
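The sampling scheme of steps 21 and 22 can be sketched as follows. This Python version is only an illustration (the source uses the R function sample): the function names, the `seed` parameter, the guard against an impossible request and the toy network are assumptions of this example, and the toy call removes a single hub instead of 10 so that it fits a small network.

```python
import random
from collections import Counter


def positive_pairs(edges, n_hubs=10):
    """Drop the n_hubs metabolites with the most protein neighbors; keep the rest."""
    degree = Counter(metabolite for metabolite, _ in edges)
    hubs = {metabolite for metabolite, _ in degree.most_common(n_hubs)}
    return {(m, p) for m, p in edges if m not in hubs}


def negative_pairs(positives, seed=0):
    """Randomly pair metabolites and proteins of the positive set, avoiding known pairs."""
    rng = random.Random(seed)
    metabolites = sorted({m for m, _ in positives})
    proteins = sorted({p for _, p in positives})
    if len(metabolites) * len(proteins) < 2 * len(positives):
        raise ValueError("not enough unobserved pairs to match the positive set size")
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.choice(metabolites), rng.choice(proteins))
        if pair not in positives:
            negatives.add(pair)
    return negatives


# Toy network: m1 is the hub and is removed with n_hubs=1.
edges = {("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
         ("m2", "p1"), ("m3", "p2"), ("m4", "p3")}
positives = positive_pairs(edges, n_hubs=1)
negatives = negative_pairs(positives)
```

Note that pairs drawn this way are only unobserved, not experimentally confirmed as non-interacting.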
Step 3 comprises the following steps:
Step 31: constructing the protein adjacency matrix P from the protein-protein interactions, satisfying the following condition:
$P_{i,j} = 1$ if protein $i$ and protein $j$ interact, and $P_{i,j} = 0$ otherwise,
where $P_{i,j}$ represents the value in the $i$-th row and $j$-th column of matrix P;
Step 32: constructing the metabolite-protein adjacency matrix I from the metabolite-protein interactions, satisfying the following condition:
$I_{i,j} = 1$ if metabolite $i$ and protein $j$ interact, and $I_{i,j} = 0$ otherwise,
where $I_{i,j}$ represents the value in the $i$-th row and $j$-th column of matrix I;
Step 33: constructing the metabolite-metabolite adjacency matrix M from the metabolite-metabolite interactions, satisfying the following condition:
$M_{i,j}$ = Tanimoto correlation coefficient between the molecular descriptors of metabolite $i$ and metabolite $j$,
where $M_{i,j}$ represents the value in the $i$-th row and $j$-th column of matrix M;
Step 34: calculating the 4-dimensional metabolite-protein correlation based on the adjacency matrices P, I and M;
Step 34 specifically includes the following steps:
Step 341: for any sample in the positive and negative sample sets, namely one metabolite $m$ and one protein $p$, calculating the first-dimension correlation NS1:
$\mathrm{NS1}(m,p) = (IP)_{m,p}$
where $(IP)_{m,p}$ is the entry in the row of metabolite $m$ and the column of protein $p$ of the matrix obtained by multiplying matrix I by matrix P;
Step 342: calculating the second-dimension correlation NS2:
$\mathrm{NS2}(m,p) = (IP^2)_{m,p} / \sum_k I_{m,k}$
where $(IP^2)_{m,p}$ is the entry in the row of metabolite $m$ and the column of protein $p$ of the matrix obtained by multiplying matrix I by matrix $P^2$, and $\sum_k I_{m,k}$ is the sum of the values in the row of metabolite $m$ in matrix I;
Step 343: calculating the third-dimension correlation NS3:
$\mathrm{NS3}(m,p) = (MI)_{m,p} / \sum_k I_{m,k}$
where $(MI)_{m,p}$ is the entry in the row of metabolite $m$ and the column of protein $p$ of the matrix obtained by multiplying matrix M by matrix I;
Step 344: calculating the fourth-dimension correlation NS4:
$\mathrm{NS4}(m,p) = \mathrm{edgeConnectivity}(m, p, G.H)$,
where G.H is the two-dimensional heterogeneous network that simultaneously integrates protein-protein interactions, metabolite-metabolite interactions and protein-metabolite interactions, and edgeConnectivity(m, p, G.H) calculates the minimum number of network edges that must be deleted from G.H to break all paths linking nodes m and p.
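The edge connectivity used for NS4 equals the maximum number of edge-disjoint paths between the two nodes, so it can be computed as a unit-capacity maximum flow. The sketch below is one standard way to do this (breadth-first augmenting paths), not necessarily the routine used in the source; the function name `edge_connectivity`, the toy network and the treatment of every edge as unweighted and unique are assumptions of this example.

```python
from collections import defaultdict, deque


def edge_connectivity(edges, s, t):
    """Minimum number of edges whose removal disconnects s from t.

    Computed as a maximum flow with capacity 1 per undirected edge.
    The edge list is assumed to contain each undirected edge once.
    """
    if s == t:
        raise ValueError("s and t must be different nodes")
    capacity = defaultdict(int)
    neighbors = defaultdict(set)
    for u, v in edges:
        capacity[(u, v)] += 1
        capacity[(v, u)] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    flow = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and capacity[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        # Push one unit of flow back along the path found.
        v = t
        while parent[v] is not None:
            u = parent[v]
            capacity[(u, v)] -= 1
            capacity[(v, u)] += 1
            v = u
        flow += 1


# Toy heterogeneous network mixing metabolite (m*) and protein (p*) nodes.
network = [("m1", "p1"), ("m1", "m2"), ("m2", "p2"),
           ("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
ns4 = edge_connectivity(network, "m1", "p3")
print(ns4)  # 2
```

Here m1 reaches p3 along two edge-disjoint routes (m1-p1-p3 and m1-m2-p2-p3), so two edges must be removed to separate them.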
Step 4 specifically comprises the following steps:
Step 41: randomly selecting 90% of the samples from the negative and positive sets as the training set, with the remaining 10% as the test set;
Step 42: for the training set, taking the 4 dimensions of correlation between the metabolite and the protein of each sample, calculated in step 3, as the feature description of the sample, taking the set (positive or negative) from which the sample comes as its class label, and training a random forest classification model;
Step 43: for the test set, taking the 4 dimensions of correlation between the metabolite and the protein of each sample, calculated in step 3, as the feature description of the sample and inputting it into the classification model trained in step 42 to obtain the probability that each test sample belongs to the positive set and to the negative set; a sample with a positive probability > 0.5 is predicted as positive and otherwise as negative, and the accuracy of the prediction results is then evaluated.
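The data handling around the classifier in steps 41 and 43 can be sketched as below; the random forest itself is deliberately left out, and any standard implementation could supply the probabilities. The function names, the `seed` parameter, the choice to shuffle the pooled positive and negative samples together (rather than splitting each class separately) and the toy probabilities are assumptions of this example.

```python
import random


def split_samples(samples, train_fraction=0.9, seed=0):
    """Shuffle the pooled samples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(positive_probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the class label."""
    predicted = [1 if p > threshold else 0 for p in positive_probabilities]
    correct = sum(int(a == b) for a, b in zip(predicted, labels))
    return correct / len(labels)


# 20 placeholder sample IDs split 90% / 10%.
samples = list(range(20))
train, test = split_samples(samples)

# Hypothetical classifier outputs for four test samples (1 = positive set).
score = accuracy([0.8, 0.3, 0.6, 0.1], [1, 0, 0, 0])
print(len(train), len(test), score)  # 18 2 0.75
```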
Examples
In this example, the two-dimensional heterogeneous network-based metabolite-protein interaction prediction method is applied to the analysis of interactions between unsaturated fatty acids and immune-system-related proteins. There are 4 candidate fatty acids (linoleic acid, arachidonic acid, palmitic acid and behenic acid) and 42 immune-system-related proteins (including PDCD1, CTLA4, SRC, TGFB1, MMP1 and MMP9), giving a total of 4×42=168 candidate metabolite-protein pairs. The results are as follows:
The metabolite-protein interaction prediction model based on the two-dimensional heterogeneous network constructed by the invention was used to predict the probability of interaction between the 4 fatty acids and the 42 immune-related proteins. The results show that, of the 7 high-confidence predictions, 4 (arachidonic acid-TGFB1, linoleic acid-TGFB1, arachidonic acid-SRC and linoleic acid-SRC) have interactions supported by the literature (Table 1). This example demonstrates the effectiveness of the two-dimensional heterogeneous network-based metabolite-protein interaction prediction method proposed by the invention.
Table 1. Predicted interactions between the 4 fatty acids and the 42 immune-related proteins (high-confidence results with positive probability > 0.9)
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the invention thereto; any modifications, equivalent substitutions, and improvements may be made without departing from the spirit and principles of the invention.
Claims (10)
1. A metabolite-protein interaction prediction method based on a two-dimensional heterogeneous network, characterized by comprising the following steps:
Modeling process: constructing a random forest model for metabolite-protein interaction prediction based on a two-dimensional heterogeneous network;
Step 1: constructing a metabolite-protein two-dimensional heterogeneous network;
Step 2: collecting a positive sample set and a negative sample set of metabolite-protein interactions;
Step 3: for each sample in the positive and negative sample sets, calculating the multidimensional correlation between its metabolite and its protein;
Step 4: training a metabolite-protein interaction prediction model based on a random forest algorithm, using the multidimensional correlation calculation results;
Actual prediction: predicting whether any pair of a metabolite and a protein interact, based on the constructed random forest model;
For any pair of a metabolite and a protein that are not directly connected in the metabolite-protein two-dimensional heterogeneous network, calculating the multidimensional correlation between the two using the same method as in step 3; inputting the obtained multidimensional correlation into the prediction model of step 4 to obtain the probability that the metabolite and the protein interact; and judging that the metabolite and the protein interact when the probability value is greater than a preset threshold.
2. The prediction method according to claim 1, wherein:
the construction of the metabolite-protein two-dimensional heterogeneous network comprises the following three steps:
Step 11: constructing a protein-protein interaction network;
reading human protein-protein physical interaction data from the BioGRID database, labeling each protein with its gene name, and constructing a protein-protein interaction network;
Step 12: constructing a metabolite-protein interaction network;
reading the KGML files of human metabolic pathways from the KEGG database to obtain reaction information; obtaining the enzymes, reactants, and products of each reaction, labeling the enzymes with gene names, and labeling the reactants and products with KEGG compound IDs;
an interaction is recorded between any enzyme and any reactant, or between any enzyme and any product, that participate in the same reaction; the non-redundant interactions from all reactions are integrated to construct the metabolite-protein interaction network;
Step 13: constructing a metabolite-metabolite interaction network;
reading the SDF two-dimensional structure file of each metabolite in the KEGG metabolic pathways from the PubChem database; calculating a molecular descriptor for each metabolite; calculating the Tanimoto correlation coefficient between the molecular descriptors of any two metabolites;
constructing the metabolite-metabolite interaction network from all metabolite-metabolite associations with a correlation greater than 0.
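Step 13 can be sketched in a few lines. The sketch below assumes that each molecular descriptor is a binary fingerprint, represented as the set of its "on" bit positions; the claim does not say which descriptor is used, and the metabolite IDs and fingerprints here are hypothetical toy values rather than real KEGG or PubChem data.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits:
    |a AND b| / |a OR b|. Two empty fingerprints are scored 0.0."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def metabolite_network(fingerprints):
    """Return {(id1, id2): similarity} for every metabolite pair
    whose Tanimoto coefficient is greater than 0."""
    edges = {}
    for m1, m2 in combinations(sorted(fingerprints), 2):
        s = tanimoto(fingerprints[m1], fingerprints[m2])
        if s > 0:
            edges[(m1, m2)] = s
    return edges


# Hypothetical fingerprints for three metabolites (illustration only).
toy = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {9}}
print(metabolite_network(toy))
```

Because every pair with a nonzero coefficient becomes an edge, the resulting network is weighted, and the weights are what populate the matrix M used in the later correlation calculations.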
3. The prediction method according to claim 2, wherein the sample collection in step 2 comprises the steps of:
Step 21: deleting from the metabolite-protein interaction network of step 12 the metabolites whose number of neighbor nodes ranks in the top 10, and taking all remaining interacting metabolite-protein pairs as the positive sample set;
Step 22: randomly pairing the metabolites and proteins that appear in the positive set, ensuring that no pair duplicates a pair in the positive set, so as to randomly generate a negative set of the same size as the positive set.
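The negative sampling of step 22 can be illustrated as below. This is a sketch under stated assumptions: the function name `sample_negatives`, the fixed seed, and the toy pairs are hypothetical, and enumerating every candidate pair is only practical for small networks (a real implementation would more likely use rejection sampling).

```python
import random


def sample_negatives(positives, seed=0):
    """Randomly draw as many metabolite-protein pairs as there are positives,
    using only metabolites and proteins that occur in the positive set and
    excluding every pair that is itself a positive."""
    positive_set = set(positives)
    metabolites = sorted({m for m, _ in positive_set})
    proteins = sorted({p for _, p in positive_set})
    candidates = [
        (m, p) for m in metabolites for p in proteins
        if (m, p) not in positive_set
    ]
    if len(candidates) < len(positive_set):
        raise ValueError("not enough non-interacting pairs to match the positive set")
    # sample() draws without replacement, so the negatives are distinct pairs.
    return random.Random(seed).sample(candidates, len(positive_set))


# Hypothetical positive pairs (illustration only).
positives = [("m1", "p1"), ("m2", "p2"), ("m3", "p3")]
print(sample_negatives(positives))
```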
4. The prediction method according to claim 1, wherein the step 3 multi-dimensional correlation calculation includes the steps of:
Step 31: constructing a protein adjacency matrix P from the protein-protein interactions;
Step 32: constructing a metabolite-protein adjacency matrix I from the metabolite-protein interactions;
Step 33: constructing a metabolite-metabolite adjacency matrix M from the metabolite-metabolite interactions;
Step 34: calculating the 4-dimensional metabolite-protein correlation based on the adjacency matrices P, I, and M.
5. The method according to claim 4, wherein the protein adjacency matrix P in step 31 satisfies the following condition:
where $P_{i,j}$ denotes the value in the i-th row and j-th column of the matrix P.
6. The method according to claim 4, wherein the metabolite-protein adjacency matrix I in step 32 satisfies the following condition:
where $I_{i,j}$ denotes the value in the i-th row and j-th column of the matrix I.
7. The method according to claim 4, wherein the metabolite-metabolite adjacency matrix M in step 33 satisfies the following condition:
$M_{i,j}$ = Tanimoto correlation coefficient between the molecular descriptors of metabolite i and metabolite j
where $M_{i,j}$ denotes the value in the i-th row and j-th column of the matrix M.
8. The prediction method according to claim 4, wherein the step 34 includes the steps of:
Step 341: for any sample in the positive and negative sample sets, i.e. a metabolite m and a protein p, calculating the first-dimension correlation NS1:
$$\mathrm{NS1}(m,p) = (IP)_{m,p}$$
Step 342: calculating the second-dimension correlation NS2:
$$\mathrm{NS2}(m,p) = \frac{(IP^{2})_{m,p}}{\sum_{k} I_{m,k}}$$
Step 343: calculating the third-dimension correlation NS3:
$$\mathrm{NS3}(m,p) = \frac{(MI)_{m,p}}{\sum_{k} I_{m,k}}$$
Step 344: calculating the fourth-dimension correlation NS4:
$$\mathrm{NS4}(m,p) = \mathrm{edgeConnectivity}(m, p, G.H)$$
where G.H is the two-dimensional heterogeneous network that integrates the protein-protein, metabolite-metabolite, and protein-metabolite interactions, and edgeConnectivity(m, p, G.H) is the minimum number of edges that must be deleted from G.H to break every path connecting nodes m and p.
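The four correlations can be sketched in plain Python. Reading IP, IP2, and MI in the formulas as matrix products is an interpretation of the claim text, and the helper names, the unit-capacity max-flow routine used for edge connectivity, and the toy network (2 metabolites, 3 proteins) are all hypothetical choices made for this illustration.

```python
from collections import deque


def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def ns_scores(I, P, M, m, p):
    """NS1-NS3 for metabolite index m and protein index p."""
    degree = sum(I[m])                      # sum over k of I[m][k]
    IP = matmul(I, P)
    IP2 = matmul(IP, P)
    MI = matmul(M, I)
    ns1 = IP[m][p]
    ns2 = IP2[m][p] / degree if degree else 0.0
    ns3 = MI[m][p] / degree if degree else 0.0
    return ns1, ns2, ns3


def edge_connectivity(edges, s, t):
    """NS4: minimum number of undirected edges whose removal disconnects s
    from t, computed as a unit-capacity max flow with BFS augmenting paths."""
    if s == t:
        raise ValueError("s and t must be different nodes")
    cap = {}
    for u, v in edges:
        cap.setdefault(u, {})[v] = 1
        cap.setdefault(v, {})[u] = 1
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow                     # no augmenting path left
        v = t
        while parent[v] is not None:        # push one unit along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1


# Toy network: m0-p0 and m1-p1 interact; p0-p1 and p0-p2 interact;
# m0 and m1 have Tanimoto similarity 0.5.
I = [[1, 0, 0], [0, 1, 0]]
P = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
M = [[1.0, 0.5], [0.5, 1.0]]
edges = [("m0", "p0"), ("m1", "p1"), ("p0", "p1"), ("p0", "p2"), ("m0", "m1")]

print(ns_scores(I, P, M, 0, 1), edge_connectivity(edges, "m0", "p1"))
```

The four values for a pair form the feature vector that is passed to the random forest. For larger networks a graph library would normally replace the hand-written max-flow routine.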
9. The prediction method according to claim 1, wherein the step 4 includes the steps of:
Step 41: randomly dividing the samples of the positive set and the negative set, in proportion, into a training set and a test set;
Step 42: for the training set, using the 4-dimensional correlations between the metabolite and the protein of each sample, calculated in step 3, as the feature description of the sample, and using the set the sample comes from (positive or negative) as its class label, to train a random forest classification model;
Step 43: for the test set, using the 4-dimensional correlations between the metabolite and the protein of each sample, calculated in step 3, as the feature description of the sample and inputting them into the classification model trained in step 42, so as to obtain the probability that each test sample belongs to the positive or the negative set; judging a sample positive when its positive probability value is greater than a preset threshold, and otherwise negative; and evaluating the accuracy of the prediction results.
10. The prediction method according to claim 9, wherein positive indicates that the current test sample is judged to have an interaction, and negative indicates that the current test sample is judged to have no interaction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011428394.2A CN114613428B (en) | 2020-12-07 | 2020-12-07 | Metabolite-protein interaction prediction method based on two-dimensional heterogeneous network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114613428A (en) | 2022-06-10 |
CN114613428B (en) | 2024-09-03 |
Family
ID=81855988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011428394.2A Active CN114613428B (en) | 2020-12-07 | 2020-12-07 | Metabolite-protein interaction prediction method based on two-dimensional heterogeneous network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114613428B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103116713A (en) * | 2013-02-25 | 2013-05-22 | 浙江大学 | Method of predicting interaction between chemical compounds and proteins based on random forest |
CN109870516A (en) * | 2017-12-05 | 2019-06-11 | 中国科学院大连化学物理研究所 | A kind of screening of metabolite-protein Interaction System and characterizing method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101963331B1 (en) * | 2017-06-22 | 2019-03-28 | 한국과학기술원 | Method and system for predicting drug repositioning candidate based on similarity between drug and metabolite |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |