CN116049769B - Discrete object data relevance prediction method and system and storage medium - Google Patents
- Publication number
- CN116049769B (application number CN202310339869.8A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- matrix
- disease
- data
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Finance (AREA)
- Strategic Management (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Entrepreneurship & Innovation (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Public Health (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Primary Health Care (AREA)
- Epidemiology (AREA)
- Pathology (AREA)
- Databases & Information Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Game Theory and Decision Science (AREA)
- Economics (AREA)
- Marketing (AREA)
- General Business, Economics & Management (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a discrete object data relevance prediction method and system and a storage medium. The prediction method comprises the following steps: calculating the similarity of each kind of discrete object, and constructing a heterogeneous network from these similarities and the known associations between different discrete objects; combining the node similarity and node association information of the discrete objects on the heterogeneous network using a graph convolutional neural network encoder, thereby encoding the plurality of discrete objects; learning the feature embeddings of the discrete object nodes at each convolution layer using a multi-head attention mechanism to obtain the final embeddings of the discrete objects; decoding the obtained features of the discrete objects using a linear decoder to obtain association prediction scores between discrete objects; and learning the parameters by minimizing a weighted binary cross entropy loss function, which reduces the decision bias caused by the sparsity of the data set. The method can be applied to association prediction for many kinds of discrete object data, generalizes well and achieves high prediction accuracy.
Description
Technical Field
The invention relates to the field of discrete object association prediction, and in particular to a discrete object data relevance prediction method and system based on graph convolution with multi-head attention, and a storage medium.
Background
With the development of computer technology and the Internet, more and more discrete data have accumulated, laying a solid foundation and providing a broad platform for predicting associations among discrete objects. Potential associations between discrete objects can be discovered by searching multiple sources of discrete information. For example, if commodity A and commodity B are commodities that a customer has purchased together, then for a commodity C with a function similar to that of commodity A, a discrete object association prediction method can predict whether the customer will purchase commodity C and commodity B together. As another example, if drug A can treat disease B, then for a drug C with a chemical structure similar to that of drug A, such a method can predict whether drug C has a therapeutic effect on disease B. As a further example, if word A and word B are used in the same sentence many times, then for a word C with a meaning similar to that of word A, such a method can predict whether word C and word B will appear in the same sentence.
In existing discrete object association prediction methods, a heterogeneous information network among the discrete objects is constructed and then analyzed to obtain predicted associations among a plurality of discrete objects. Each analysis approach has drawbacks: matrix decomposition methods tend to ignore nonlinear associations; random walk methods, which construct a transition matrix on the heterogeneous information network and iterate until the probability distribution converges, tend to fall into local optima; and neural network methods tend to ignore the topological information of the graphs in the heterogeneous network.
Specifically, for the prediction of associations between microorganisms and diseases, the method and system for predicting microorganism-disease associations disclosed in patent document CN112151191A mainly obtain multi-source information representations of diseases and microorganisms through meta-path random walks, so as to fuse multi-source data and predict microorganism-disease associations from multiple aspects. The meta-path random walk algorithm can effectively extract information on microorganisms and diseases from different data sources, and in particular can effectively acquire heterogeneous network information. However, the random walk algorithm focuses only on adjacent nodes and tends to fall into local optima, so the final prediction result is inaccurate.
Therefore, the prediction accuracy of existing discrete object association prediction methods still leaves considerable room for improvement.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a discrete object data relevance prediction method and system and a storage medium that can fully capture the potential associations between discrete objects, capture the heterogeneous network topology information of different convolution layers, reduce the decision bias caused by the sparsity of the known association data between discrete objects, and improve the accuracy of discrete object association prediction.
In order to solve the above technical problem, the invention adopts the following technical solution:
In a first aspect, the present invention provides a discrete object data relevance prediction method, which specifically includes the following steps:
S1, calculating the similarity of each kind of discrete object, and constructing a heterogeneous network from these similarities and the known associations between different discrete objects;
S2, combining the node similarity and node association information of the discrete objects on the heterogeneous network using a graph convolutional neural network encoder, and encoding the plurality of discrete objects contained in the heterogeneous network;
S3, learning the feature embeddings of the discrete object nodes at each convolution layer using a multi-head attention mechanism to obtain the final embeddings of the discrete objects;
S4, decoding the obtained features of the discrete objects using a linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain the association prediction scores between discrete objects;
S5, learning the parameters by minimizing a weighted binary cross entropy loss function, thereby reducing the decision bias caused by the sparsity of the data set.
Further, the step S1 specifically includes:
according to the data characteristics of the discrete objects, the similarities of the discrete objects m and of the discrete objects d are calculated using a similarity calculation model; the similarity of the discrete objects m is represented by a matrix $S_m \in \mathbb{R}^{M \times M}$, and the similarity of the discrete objects d by a matrix $S_d \in \mathbb{R}^{N \times N}$;
the known associations between the discrete objects m and the discrete objects d are described as a binary matrix $A \in \{0,1\}^{M \times N}$, wherein M and N represent the numbers of discrete objects m and d, respectively; when there is a known association between discrete object data $m_i$ and discrete object data $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$; i is an integer between 1 and M (inclusive), and j is an integer between 1 and N (inclusive);
based on the association matrix $A$ of the discrete objects m and d, the similarity matrix $S_m$ of the discrete objects m and the similarity matrix $S_d$ of the discrete objects d, a heterogeneous network is constructed and represented by the adjacency matrix $A_H$ of the following formula (1):

$$A_H = \begin{bmatrix} \tilde{S}_m & A \\ A^{T} & \tilde{S}_d \end{bmatrix} \tag{1}$$

wherein $\tilde{S}_m$ and $\tilde{S}_d$ are the normalized similarity matrices of the discrete objects m and d, respectively: $\tilde{S}_m = D_m^{-1/2} S_m D_m^{-1/2}$ and $\tilde{S}_d = D_d^{-1/2} S_d D_d^{-1/2}$, with $D_m = \mathrm{diag}\big(\sum_j S_m(i,j)\big)$ and $D_d = \mathrm{diag}\big(\sum_j S_d(i,j)\big)$; $S_m(i,j)$ represents the similarity of data $m_i$ and $m_j$, and $S_d(i,j)$ represents the similarity of data $d_i$ and $d_j$; diag is the matrix operation on the main diagonal elements of a matrix, here placing the row sums on the main diagonal.
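The construction in step S1 can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: the function names `normalize_similarity` and `build_heterogeneous_adjacency` are hypothetical, the toy matrices are invented, and the symmetric normalization by row sums is an assumption about how the similarity matrices are normalized.

```python
import numpy as np

def normalize_similarity(S):
    """Symmetric normalisation D^{-1/2} S D^{-1/2}, D = diag(row sums of S) (assumed form)."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[d > 0] = d[d > 0] ** -0.5  # leave isolated rows at zero
    return S * inv_sqrt[:, None] * inv_sqrt[None, :]

def build_heterogeneous_adjacency(S_m, S_d, A):
    """Stack the blocks [[S_m~, A], [A^T, S_d~]] into one (M+N) x (M+N) matrix."""
    return np.block([[normalize_similarity(S_m), A],
                     [A.T, normalize_similarity(S_d)]])

# Toy data: M = 2 objects of type m, N = 3 objects of type d.
S_m = np.array([[1.0, 0.5], [0.5, 1.0]])
S_d = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.4], [0.0, 0.4, 1.0]])
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])

A_H = build_heterogeneous_adjacency(S_m, S_d, A)
print(A_H.shape)
```

The resulting matrix is symmetric because both similarity blocks are symmetric and the off-diagonal blocks are transposes of each other.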
Further, step S2 specifically includes:
a graph convolutional neural network (GCN) encoder is deployed on the heterogeneous network $A_H$ to combine node similarity and node association information, with the input set according to the following formula (2):

$$G = \begin{bmatrix} \mu \tilde{S}_m & A \\ A^{T} & \mu \tilde{S}_d \end{bmatrix} \tag{2}$$

wherein $\mu$ is a penalty factor that controls the contribution of the similarities in the GCN propagation process, and $A^{T}$ represents the transpose of matrix $A$; the propagation formula of the graph convolutional neural network is the following formula (3):

$$H^{(l+1)} = \sigma\!\left(D^{-1/2}\, G\, D^{-1/2}\, H^{(l)}\, W^{(l)}\right) \tag{3}$$

wherein $H^{(l)}$ and $H^{(l+1)}$ are the node features of layers $l$ and $l+1$, respectively; $D = \mathrm{diag}\big(\sum_j G_{ij}\big)$ is the degree matrix of $G$, and $G_{ij}$ represents the element in row i and column j of $G$; $W^{(l)}$ is the weight matrix used in training from layer $l$ to layer $l+1$; $\sigma$ is a nonlinear activation function; $D^{-1/2} G D^{-1/2}$ is the normalized adjacency matrix $G$; the propagation is initialized with the input feature matrix $H^{(0)}$ (the initialization formula is missing from this text);
according to the above settings, the first layer of the GCN encoder is further described by the following formula (4):

$$H^{(1)} = \sigma\!\left(D^{-1/2}\, G\, D^{-1/2}\, H^{(0)}\, W^{(0)}\right) \tag{4}$$

wherein $W^{(0)}$ is the training weight matrix from the input layer to the hidden layer; $H^{(1)} \in \mathbb{R}^{(M+N) \times k}$ is the feature matrix of the hidden layer, and $k$ is the number of feature dimensions; $G$ is the adjacency matrix defined in formula (2).
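One GCN propagation step of the kind described for formulas (3) and (4) can be sketched as below. This is illustrative only: the function name `gcn_layer`, the ReLU activation, the penalty factor value 0.5, the identity similarity blocks, and the choice of the association blocks as initial features `H0` are all assumptions, since the source does not spell them out here.

```python
import numpy as np

def gcn_layer(G, H, W):
    """One propagation step: ReLU(D^{-1/2} G D^{-1/2} H W), D = diag(row sums of G)."""
    deg = G.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    G_norm = G * inv_sqrt[:, None] * inv_sqrt[None, :]
    return np.maximum(G_norm @ H @ W, 0.0)  # ReLU is an assumed activation

M, N, k, mu = 2, 3, 4, 0.5                 # mu: assumed penalty factor value
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
S_m, S_d = np.eye(M), np.eye(N)            # placeholder normalised similarities

# Input graph with similarities scaled by the penalty factor.
G = np.block([[mu * S_m, A], [A.T, mu * S_d]])
# Assumed initial features: association blocks only, zeros elsewhere.
H0 = np.block([[np.zeros((M, M)), A], [A.T, np.zeros((N, N))]])

rng = np.random.default_rng(0)
W0 = rng.normal(scale=0.1, size=(M + N, k))  # input-to-hidden weights
H1 = gcn_layer(G, H0, W0)
print(H1.shape)
```

Stacking further layers amounts to calling `gcn_layer` again with the previous output and a new weight matrix of shape `(k, k)`.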
Further, the step S3 specifically includes: capturing specific representations of the discrete objects m and d by adding multi-head attention scores in each graph convolution layer, the attention score of each layer being represented by the following formula (5):

$$e_{ij}^{(l)} = a\!\left(W^{(l)} h_i^{(l)},\; W^{(l)} h_j^{(l)}\right) \tag{5}$$

wherein $a$ is a parametric function, $W^{(l)}$ is the training weight matrix of layer $l$, and $h_i^{(l)}$ and $h_j^{(l)}$ respectively represent the layer-$l$ node outputs of the discrete objects m and d; all attention scores are normalized using a softmax function, which is the following formula (6):

$$\alpha_{ij}^{(l)} = \frac{\exp\!\big(e_{ij}^{(l)}\big)}{\sum_{k \in \mathcal{N}_i} \exp\!\big(e_{ik}^{(l)}\big)} \tag{6}$$

wherein $\mathcal{N}_i$ and $\mathcal{N}_j$ respectively represent the neighbor node sets of nodes i and j (the normalization for node j runs over $\mathcal{N}_j$ in the same way), and exp is the exponential function; the structural information of the heterogeneous network is captured by combining the embeddings of the different convolution layers, and the final embedding of the graph convolutional neural network encoder with the attention mechanism is represented by formula (7) (the formula is missing from this text):
wherein $H_m$ is the encoded feature of the discrete objects m; $H_d$ is the encoded feature of the discrete objects d; $\beta$ denotes parameters learned automatically by the neural network, $\beta^{(l)}$ being the parameter learned automatically by the layer-$l$ network; the initial value of these parameters is missing from this text; $L$ is the number of iterations.
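The softmax normalization of attention scores over each node's neighbors can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `neighbor_softmax` is hypothetical, the raw scores are arbitrary numbers rather than the output of the parametric function in the text, and nodes without neighbors are simply given zero weights.

```python
import numpy as np

def neighbor_softmax(scores, adjacency):
    """Row-wise softmax of raw scores, restricted to each node's neighbours."""
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(adjacency) > 0
    # Subtract the row maximum for numerical stability before exponentiating.
    exp_scores = np.where(mask, np.exp(scores - scores.max(axis=1, keepdims=True)), 0.0)
    denom = exp_scores.sum(axis=1, keepdims=True)
    # Rows without neighbours keep all-zero weights instead of dividing by zero.
    return np.divide(exp_scores, denom, out=np.zeros_like(exp_scores), where=denom > 0)

scores = np.array([[1.0, 2.0, 3.0],
                   [0.0, 0.0, 0.0],
                   [5.0, 1.0, 1.0]])
adjacency = np.array([[0, 1, 1],
                      [1, 0, 1],
                      [0, 0, 0]])   # node 2 has no neighbours in this toy graph

alpha = neighbor_softmax(scores, adjacency)
print(alpha.round(3))
```

Each row with at least one neighbor sums to 1, so the weights can be used directly to average neighbor features.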
Further, the step S4 specifically includes: decoding the result using a linear decoder, the association prediction score $P$ between the discrete objects m and the discrete objects d being represented by the following formula (8):

$$P = \mathrm{sigmoid}\!\left(H_m\, W'\, H_d^{T}\right) \tag{8}$$

wherein $W'$ is the training weight matrix from the hidden layer to the output layer; the sigmoid function is a nonlinear activation function that keeps all prediction results within the range 0 to 1; $H_d^{T}$ represents the transpose of $H_d$.
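The decoding step can be sketched as a bilinear product followed by a sigmoid. The function name `decode`, the embedding size and the random inputs below are illustrative assumptions; only the overall shape of the computation follows the description of step S4.

```python
import numpy as np

def decode(H_m, H_d, W_out):
    """Association scores sigmoid(H_m W_out H_d^T), one score per (m, d) pair."""
    logits = H_m @ W_out @ H_d.T
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
M, N, k = 2, 3, 4
H_m = rng.normal(size=(M, k))                # embeddings of the m objects
H_d = rng.normal(size=(N, k))                # embeddings of the d objects
W_out = rng.normal(scale=0.1, size=(k, k))   # hidden-to-output weights

P = decode(H_m, H_d, W_out)
print(P.shape)
```

The output has the same M x N shape as the association matrix, which is what allows it to be compared entry by entry with the known associations during training.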
Further, the step S5 specifically includes: the weighted binary cross entropy to be minimized as the loss function is calculated by the following formula (9):

$$\mathrm{Loss} = -\frac{1}{M \times N}\left(\lambda \sum_{(i,j) \in p^{+}} \log P(i,j) + \sum_{(i,j) \in p^{-}} \log\big(1 - P(i,j)\big)\right) \tag{9}$$

wherein $(i,j)$ represents the pair of discrete object data $m_i$ and discrete object data $d_j$; $P(i,j)$ represents the predicted association score between discrete object $m_i$ and discrete object $d_j$; the influence factor $\lambda = |p^{-}| / |p^{+}|$ is used to reduce the influence of the imbalance between $p^{+}$ and $p^{-}$; $|p^{+}|$ represents the number of known association pairs between the discrete objects m and d, and $|p^{-}|$ represents the number of pairs of discrete objects m and d for which no association has been found ($p^{+}$ represents the set of positive examples and $p^{-}$ the set of negative examples).
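A weighted binary cross entropy of the kind described in step S5 can be sketched as below. The function name `weighted_bce`, the clipping constant `eps`, the weight defined as the ratio of negative to positive pairs and the normalization by the number of matrix entries are assumptions made for this illustration; the sketch also assumes at least one known association.

```python
import numpy as np

def weighted_bce(P, A, eps=1e-12):
    """Weighted binary cross entropy; positives are up-weighted by |neg| / |pos|."""
    pos = A == 1
    neg = ~pos
    lam = neg.sum() / pos.sum()        # assumed influence factor, needs >= 1 positive
    P = np.clip(P, eps, 1.0 - eps)     # avoid log(0)
    total = lam * np.log(P[pos]).sum() + np.log(1.0 - P[neg]).sum()
    return -total / A.size, lam

A = np.array([[1, 0, 0], [0, 0, 1]])   # 2 known associations, 4 unknown pairs
loss_uniform, lam = weighted_bce(np.full(A.shape, 0.5), A)
loss_good, _ = weighted_bce(np.where(A == 1, 0.9, 0.1), A)
print(lam, loss_uniform, loss_good)
```

With this weighting the two positive pairs contribute as much to the loss as the four negative pairs, which is the balancing effect the text attributes to the influence factor.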
Further, the discrete object data associations include: microorganism-human disease associations, known drug-disease associations, associations between different commodities, and the like.
Further, the similarity calculation model comprises a directed acyclic graph similarity calculation model and a cosine similarity calculation model.
For example: the semantic description of a disease has a hierarchical structure, so its similarity can be calculated using a directed acyclic graph, although the calculation is not limited to the directed acyclic graph; a drug has multiple features such as its structure and its targets of action, so cosine similarity can be chosen, although the calculation is not limited to the cosine similarity calculation model.
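As a minimal sketch of the cosine similarity option mentioned above, the following computes a pairwise similarity matrix from feature vectors. The function name `cosine_similarity_matrix` and the toy feature rows are hypothetical; real drug features such as structure fingerprints or target profiles would take their place.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a feature matrix X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # all-zero rows get similarity 0 to everything
    Xn = X / norms
    return Xn @ Xn.T

# Toy binary feature vectors for three objects.
X = np.array([[1, 0, 1],
              [2, 0, 2],
              [0, 1, 0]])
S = cosine_similarity_matrix(X)
print(S.round(3))
```

Rows 0 and 1 point in the same direction and so have similarity 1, while row 2 is orthogonal to both and has similarity 0.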
In a second aspect, the present invention further provides a discrete object data relevance prediction system that uses the above discrete object data relevance prediction method and specifically includes:
a discrete object data similarity calculation module, used for calculating the similarity of each kind of discrete object using the similarity calculation model;
a heterogeneous network construction module, used for constructing the heterogeneous network from the similarities of the discrete objects and the known associations between the discrete objects;
a multi-head attention model building module, comprising a graph convolutional neural network encoder module, a multi-head attention mechanism module and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding the discrete objects m and the discrete objects d on the heterogeneous network using the graph convolutional neural network encoder, so as to combine the node similarity and the node association information; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer using multi-head attention, calculating the attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of the discrete objects m and d; the linear decoder module is used for decoding the obtained features of the discrete objects m and d using the linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain the association prediction scores between discrete objects;
and an optimization module, used for learning the parameters by minimizing the weighted binary cross entropy loss function, thereby reducing the decision bias caused by the sparsity of the data set.
The invention also provides a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the discrete object data relevance prediction method described above.
The invention provides a discrete object data relevance prediction method and system based on a graph convolution multi-head attention mechanism, designed with the strengths and weaknesses of existing discrete object association prediction methods in mind. A heterogeneous network is constructed from the similarities of a plurality of discrete objects and the known associations among them, and using the similarity data makes the method more widely applicable to discovering potential associations between discrete objects. The graph convolutional neural network effectively captures nonlinear associations between discrete objects. Multi-head attention captures the node features of the discrete objects at each graph convolution layer, and the multi-head attention scores of each layer are calculated and combined, so that more node features are mined for the embeddings and the effect of sparse associations on the graph convolutional neural network is effectively compensated. Using the minimized weighted binary cross entropy as the loss function effectively compensates for the decision bias caused by the sparsity of the known association data between discrete objects. Evaluation of the association predictions obtained with the graph convolution multi-head attention method shows high prediction accuracy. The prediction method of the invention can be applied to association prediction for many kinds of discrete object data and generalizes well.
Compared with the existing method, the discrete object data relevance prediction method and system provided by the invention have the following advantages:
(1) According to the invention, the heterogeneous network is constructed by using the similarity information of various discrete objects and the known association relation between the discrete objects, so that the data characteristics of each discrete object can be fully utilized.
(2) The present invention uses a graph convolutional neural network encoder and a linear decoder to perform association prediction between discrete objects. The graph convolutional neural network can capture nonlinear associations and, trained in a semi-supervised manner, performs well on training data that contain a small number of known associations between discrete objects and a large number of unknown ones.
(3) A multi-head attention mechanism is provided to capture more discrete object information. Multi-head attention captures the node features of the discrete objects at each convolution layer and obtains an enhanced feature representation of the current node from the neighbor node weights of each layer. The mechanism captures different structural information of the heterogeneous network, effectively alleviates the inconsistent contributions of node feature embeddings from different convolution layers, and reduces the effect of sparse associations between discrete objects on propagation in the graph convolutional neural network.
(4) The invention uses the minimized weighted binary cross entropy as the loss function to reduce the decision bias caused by the sparsity of the known association data between discrete objects, thereby strengthening the influence of positive samples.
(5) The prediction method is suitable for the association relation prediction of the discrete objects, and has stronger generalization.
Experiments show that the method markedly improves the accuracy of discrete object association prediction and effectively reduces the decision bias caused by the sparsity of the known associations between discrete objects in the data sets.
Drawings
FIG. 1 is a flowchart of a discrete object data relevance prediction method according to embodiment 1 of the present invention;
FIG. 2 is a schematic structural diagram of a discrete object data relevance prediction system according to embodiment 2 of the present invention;
FIG. 3 is a statistical chart of the microorganism-disease associations predicted using the prediction method and system of the present invention in embodiment 3;
FIG. 4 is a statistical chart of the drug-disease associations predicted using the prediction method and system of the present invention in embodiment 4;
FIG. 5 is a schematic diagram of the associations between different commodities predicted using the prediction method and system of the present invention in embodiment 5.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
Example 1
As shown in fig. 1, the present embodiment provides a discrete object data relevance prediction method, which is based on a graph convolution multi-head attention mechanism, and includes the following steps:
S1, calculating the similarity of each kind of discrete object, and constructing a heterogeneous network from these similarities and the known associations between the discrete objects; the step specifically includes:
according to the data characteristics of the discrete objects, the similarities of the discrete objects m and of the discrete objects d are calculated using a similarity calculation model; the similarity of the discrete objects m is represented by a matrix $S_m \in \mathbb{R}^{M \times M}$, and the similarity of the discrete objects d by a matrix $S_d \in \mathbb{R}^{N \times N}$. The discrete object data associations include: microorganism-human disease associations, known drug-disease associations, associations between different commodities, and the like. The similarity calculation model comprises a directed acyclic graph similarity calculation model and a cosine similarity calculation model.
For example: the semantic description of a disease has a hierarchical structure, so its similarity can be calculated using a directed acyclic graph, although the calculation is not limited to the directed acyclic graph; a drug has multiple features such as its structure and its targets of action, so cosine similarity can be chosen, although the calculation is not limited to the cosine similarity calculation model.
The known associations between the discrete objects m and the discrete objects d are described as a binary matrix $A \in \{0,1\}^{M \times N}$, wherein N and M represent the numbers of discrete objects d and m, respectively; when there is a known association between discrete object data $m_i$ and discrete object data $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$; i is an integer between 1 and M (inclusive), and j is an integer between 1 and N (inclusive);
based on the association matrix $A$ of the discrete objects m and d, the similarity matrix $S_m$ of the discrete objects m and the similarity matrix $S_d$ of the discrete objects d, a heterogeneous network is constructed and represented by the adjacency matrix $A_H$ of the following formula (1):

$$A_H = \begin{bmatrix} \tilde{S}_m & A \\ A^{T} & \tilde{S}_d \end{bmatrix} \tag{1}$$

wherein $\tilde{S}_m$ and $\tilde{S}_d$ are the normalized similarity matrices of the discrete objects m and d, respectively: $\tilde{S}_m = D_m^{-1/2} S_m D_m^{-1/2}$ and $\tilde{S}_d = D_d^{-1/2} S_d D_d^{-1/2}$, with $D_m = \mathrm{diag}\big(\sum_j S_m(i,j)\big)$ and $D_d = \mathrm{diag}\big(\sum_j S_d(i,j)\big)$; $S_m(i,j)$ represents the similarity of data $m_i$ and $m_j$, and $S_d(i,j)$ represents the similarity of data $d_i$ and $d_j$; diag is the matrix operation on the main diagonal elements of a matrix, here placing the row sums on the main diagonal.
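The binary association matrix A described in step S1 can be assembled from a list of known pairs, as in the short sketch below. The function name `association_matrix` and the example pairs are hypothetical, and the code uses 0-based indices where the text counts from 1.

```python
import numpy as np

def association_matrix(pairs, M, N):
    """Binary M x N matrix with a 1 for every known (m_i, d_j) association."""
    A = np.zeros((M, N))
    for i, j in pairs:
        A[i, j] = 1.0    # 0-based indices; the text counts from 1
    return A

# Toy example: object m_0 is linked to d_0, and m_1 to d_1 and d_2.
known_pairs = [(0, 0), (1, 1), (1, 2)]
A = association_matrix(known_pairs, M=2, N=3)
print(A)
```

All pairs not listed stay at 0, which covers both truly unassociated pairs and associations that have not yet been discovered; this ambiguity is the sparsity that the weighted loss in step S5 is meant to address.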
S2, combining node similarity and node association information of discrete objects on a heterogeneous network by using a graph convolution neural network encoder, and encoding a plurality of discrete objects contained in the heterogeneous network; the method specifically comprises the following steps:
a graph convolutional neural network (GCN) encoder is deployed on the heterogeneous network $A_H$ to combine node similarity and node association information, and the input is set according to the following formula (2):

$$G = \begin{bmatrix} \mu \tilde{S}_m & A \\ A^T & \mu \tilde{S}_d \end{bmatrix} \quad (2)$$

where $\mu$ is a penalty factor that controls the contribution of the similarities during GCN propagation, and $A^T$ denotes the transpose of matrix A; the graph convolutional neural network propagation rule is the following formula (3):

$$H^{(l+1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(l)} W^{(l)}\right) \quad (3)$$

where $H^{(l)}$ and $H^{(l+1)}$ are the node features of layer $l$ and layer $l+1$, respectively; $D = \mathrm{diag}(\sum_j G_{ij})$ is the degree matrix of G, and $G_{ij}$ denotes the element in row i and column j of G; $W^{(l)}$ is the weight matrix trained from layer $l$ to layer $l+1$; $\sigma$ is a nonlinear activation function; $D^{-1/2} G D^{-1/2}$ is the normalized adjacency matrix G; and the propagation is initialized as follows:

$$H^{(0)} = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$$

according to the above settings, the first-layer GCN encoder is further described by the following formula (4):

$$H^{(1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(0)} W^{(0)}\right) \quad (4)$$

where $W^{(0)} \in \mathbb{R}^{(M+N) \times k}$ is the trained weight matrix from the input layer to the hidden layer; $H^{(1)} \in \mathbb{R}^{(M+N) \times k}$ is the feature matrix of the hidden layer; $k$ is the feature dimension; and G is the adjacency matrix defined in formula (2).
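A single propagation step of the encoder in step S2 might look as follows. This is only a sketch: it assumes the propagation rule $\sigma(D^{-1/2} G D^{-1/2} H W)$ and uses ReLU as the activation, which the text does not specify. The helper names and the toy matrices are hypothetical, and the weights are fixed by hand rather than trained.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def normalize_adjacency(G):
    """D^{-1/2} G D^{-1/2}, where D holds the row sums (degrees) of G."""
    inv = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in G]
    n = len(G)
    return [[inv[i] * G[i][j] * inv[j] for j in range(n)] for i in range(n)]

def relu(X):
    """Assumed activation; the source only says 'a nonlinear activation function'."""
    return [[max(0.0, v) for v in row] for row in X]

def gcn_layer(G, H, W, activation=relu):
    """One propagation step: activation(D^{-1/2} G D^{-1/2} H W)."""
    return activation(matmul(matmul(normalize_adjacency(G), H), W))

# Toy example (hypothetical values): two nodes, one hidden feature.
G = [[1.0, 1.0], [1.0, 1.0]]
H0 = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[2.0], [1.0]]
H1 = gcn_layer(G, H0, W0)
```

Stacking several calls to `gcn_layer`, each with its own weight matrix, gives the multi-layer encoder; the per-layer outputs are what the attention step then combines.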
S3, learning the feature embeddings of the discrete object nodes of each convolution layer by using a multi-head attention mechanism to obtain the final embedding of the plurality of discrete objects; specifically: a multi-head attention score is added to each graph convolution layer to capture the specific representations of discrete object m and discrete object d, and the attention score of each layer is given by formula (5),
in which $a$ is a parametric function, $W^{(l)}$ is the trained weight matrix of layer $l$, and $h_m^{(l)}$ and $h_d^{(l)}$ denote the node outputs of discrete objects m and d at layer $l$, respectively. All attention scores are normalized using a softmax function, given by formula (6),
in which $N_i$ and $N_j$ denote the neighbor node sets of nodes i and j, respectively, and exp is the exponential function. The graph convolutional neural network with the attention mechanism captures the structural information of the heterogeneous network by combining the embeddings of the different convolution layers, and the final embedding is given by formula (7),
in which $H_m$ is the encoded feature of discrete object m; $H_d$ is the encoded feature of discrete object d; $a^{(l)}$ is the parameter learned automatically by the layer-$l$ network, with its initial value set as specified for formula (7); and L is the number of iterations.
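The idea of combining per-layer embeddings with normalized attention weights can be illustrated with a simplified sketch. It does not reproduce formulas (5) to (7): it assumes one scalar score per layer, normalizes the scores with softmax, and takes the weighted sum of the layer embeddings, then splits the rows into the two object types. All names and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_embeddings, layer_scores):
    """Weighted sum of per-layer embeddings, weights = softmax(layer_scores)."""
    weights = softmax(layer_scores)
    rows, cols = len(layer_embeddings[0]), len(layer_embeddings[0][0])
    return [[sum(w * H[i][j] for w, H in zip(weights, layer_embeddings))
             for j in range(cols)] for i in range(rows)]

def split_embedding(H, M):
    """First M rows belong to objects of type m, the remaining rows to type d."""
    return H[:M], H[M:]

# Toy example (hypothetical): two layers, M = 1 object m and N = 2 objects d.
H_layer1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
H_layer2 = [[3.0, 2.0], [1.0, 0.0], [5.0, 6.0]]
H_final = combine_layers([H_layer1, H_layer2], [0.0, 0.0])
H_m, H_d = split_embedding(H_final, M=1)
```

With equal scores the combination reduces to a plain average of the layers; in a trained model the scores would be learned parameters.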
S4, decoding the obtained features of the plurality of discrete objects by using a linear decoder so that the output matrix has the same dimensions as the input matrix, and obtaining the association prediction scores of the discrete objects; specifically: the result is decoded using a linear decoder, and the association prediction score P between discrete object m and discrete object d is represented by the following formula (8):

$$P = \mathrm{sigmoid}\left(H_m W' H_d^T\right) \quad (8)$$

where $W'$ is the trained weight matrix from the hidden layer to the output layer; sigmoid is a nonlinear activation function that keeps all prediction results in the range 0 to 1; and $H_d^T$ denotes the transpose of $H_d$.
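The decoding step can be sketched as a bilinear product followed by an element-wise sigmoid. This assumes the decoder has the form sigmoid($H_m W' H_d^T$), so that an M x k embedding of objects m and an N x k embedding of objects d produce an M x N score matrix, the same shape as the association matrix. The function names and the toy embeddings are hypothetical.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def sigmoid(x):
    """Numerically stable logistic function, output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def decode(H_m, W, H_d):
    """Score matrix sigmoid(H_m W H_d^T) of shape M x N (assumed decoder form)."""
    H_d_T = [list(col) for col in zip(*H_d)]
    Z = matmul(matmul(H_m, W), H_d_T)
    return [[sigmoid(z) for z in row] for row in Z]

# Toy example (hypothetical): M = 2, N = 3, k = 2, identity decoder weights.
H_m = [[0.0, 0.0], [1.0, 0.0]]
H_d = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
P = decode(H_m, W, H_d)
```

Each entry `P[i][j]` can be read as the predicted association score between object $m_i$ and object $d_j$.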
S5, using the minimized weighted binary cross entropy as the loss function to learn the parameters, reducing the decision bias caused by the sparsity of the data set. Further, step S5 specifically includes: the weighted binary cross entropy to be minimized as the loss function is calculated by the following formula (9):

$$Loss = -\frac{1}{M \times N}\left(\lambda \sum_{(i,j) \in p^+} \log P(i,j) + \sum_{(i,j) \in p^-} \log\left(1 - P(i,j)\right)\right) \quad (9)$$

where $(i,j)$ denotes the pair of discrete object data $m_i$ and discrete object data $d_j$; $P(i,j)$ denotes the predicted association score between discrete object $m_i$ and discrete object $d_j$; the influence factor $\lambda = |p^-| / |p^+|$ reduces the influence of the imbalance between $p^+$ and $p^-$; $|p^+|$ denotes the number of all known association pairs of discrete objects m and d; and $|p^-|$ denotes the number of pairs of discrete object m and discrete object d for which no association has been found ($p^+$ denotes the positive example set and $p^-$ the negative example set).
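A weighted binary cross entropy of the kind used in step S5 can be sketched as follows. The sketch assumes that positive pairs are weighted by the ratio of negative to positive pairs and that the sum is averaged over all M x N entries; the small constant `eps` is an added numerical safeguard, and the function name and toy matrices are hypothetical.

```python
import math

def weighted_bce(P, A, eps=1e-12):
    """Weighted binary cross entropy between scores P and binary labels A.

    Positive pairs are up-weighted by lam = |negatives| / |positives| (assumed form).
    """
    pos = [(i, j) for i, row in enumerate(A) for j, a in enumerate(row) if a == 1]
    neg = [(i, j) for i, row in enumerate(A) for j, a in enumerate(row) if a == 0]
    if not pos:
        raise ValueError("at least one known association is required")
    lam = len(neg) / len(pos)
    pos_term = sum(math.log(P[i][j] + eps) for i, j in pos)
    neg_term = sum(math.log(1.0 - P[i][j] + eps) for i, j in neg)
    return -(lam * pos_term + neg_term) / (len(A) * len(A[0]))

# Toy example (hypothetical): one known association among four pairs.
A = [[1, 0], [0, 0]]
loss_uninformed = weighted_bce([[0.5, 0.5], [0.5, 0.5]], A)
loss_better = weighted_bce([[0.9, 0.1], [0.1, 0.1]], A)
```

Because known associations are rare, the weight on the positive term keeps the model from minimizing the loss by predicting "no association" everywhere.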
Example 2
As shown in fig. 2, the present embodiment provides a discrete object data relevance prediction system 20 adopting the method described in embodiment 1, specifically including:
a discrete object data similarity calculation module 21, configured to calculate the respective similarities of different discrete objects using the similarity calculation model;
a heterogeneous network construction module 22, configured to construct a heterogeneous network using the similarity information of different discrete objects and the known associations between the discrete objects;
a multi-head attention model building module 23, comprising a graph convolutional neural network encoder module 231, a multi-head attention mechanism module 232, and a linear decoder module 233, wherein: the graph convolutional neural network encoder module 231 is configured to encode discrete object m and discrete object d on the heterogeneous network using the graph convolutional neural network encoder, combining node similarity and node association information; the multi-head attention mechanism module 232 is configured to capture the node features of each graph convolution layer using multi-head attention, calculate the attention scores, and combine the multi-head attention of each layer to obtain the final embeddings of discrete object m and discrete object d; the linear decoder module 233 is configured to decode the obtained features of discrete object m and discrete object d using a linear decoder so that the output matrix has the same dimensions as the input matrix, obtaining the association prediction scores between the discrete objects;
and an optimization module 24, configured to use the minimized weighted binary cross entropy as the loss function to learn the parameters, reducing the decision bias caused by the sparsity of the data set.
An embodiment of the invention also provides a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method shown in fig. 1.
The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disk-read only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read only memories), EEPROMs (electrically erasable programmable read only memories), magnetic or optical cards, flash memory, or other types of media or machine-readable media suitable for storing machine-executable instructions. The computer-readable storage medium may be a product that is not connected to a computer device, or a component that is connected to and used by a computer device.
Example 3: application example 1, discrete object microorganism-human disease association prediction
In order to verify the effectiveness of the discrete object relevance prediction method based on the graph convolution multi-head attention, in the embodiment, microorganisms are taken as discrete objects m, human diseases are taken as discrete objects d, the known microorganism-human disease relevance is the known relevance among the discrete objects, and the potential relevance between the microorganisms and the human diseases is predicted.
Unlike the prior art, which uses a Gaussian kernel, the microorganism similarity in this embodiment is calculated from gene sequences by a gene sequence similarity calculation model. Specifically, the similarity is calculated with the basic local alignment search tool BLAST (Basic Local Alignment Search Tool, http://www.ncbi.nim.nih.gov/BLAST), using Identity (ID) to measure the degree of fixed alignment between two sequences. The similarity of microorganisms A' and B' is calculated by formula (10),
whose terms are the alignment of the gene sequences of microorganism A' and microorganism B' and ID, the fixed alignment of the microbial sequences; the result is stored in matrix form.
The similarity of human diseases is obtained from a directed acyclic graph similarity calculation model, specifically: the disease data in the MeSH database are organized into a directed acyclic graph (DAG) of diseases, in which nodes represent diseases and edges represent the hierarchical relationships between diseases. The directed acyclic graph in this embodiment is represented by the following formula (11):

$$DAG_A = (T_A, E_A) \quad (11)$$

where $T_A$ denotes the set of all disease nodes in the subgraph of disease A, and $E_A$ denotes the set of edges, that is, the semantic relationships, between disease A and the other nodes. The semantic value $DV(A)$ of disease A can then be expressed by formula (12):

$$DV(A) = \sum_{n \in T_A} D_A(n) \quad (12)$$

$D_A(n)$ denotes the semantic contribution to disease A of each ancestor node n in the DAG of disease A, and is calculated according to the following formula (13):

$$D_A(n) = \begin{cases} 1, & n = A \\ \max\{\Delta \cdot D_A(n') \mid n' \in \mathrm{children}(n)\}, & n \neq A \end{cases} \quad (13)$$

where $\Delta$ is the semantic contribution factor, with value range (0, 1), and children(n) denotes the child nodes of n; the farther an ancestor node is from disease A in the DAG, the less it contributes to the semantics of A. The semantic similarity between diseases A and B is therefore calculated by the following formula (14):

$$S(A, B) = \frac{\sum_{n \in T_A \cap T_B} \left(D_A(n) + D_B(n)\right)}{DV(A) + DV(B)} \quad (14)$$
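The DAG-based semantic similarity described above can be sketched as follows. The sketch assumes that each disease contributes 1 to itself, that an ancestor receives the largest decayed contribution over all paths leading to it, and that similarity is the shared contribution divided by the sum of the two semantic values. The `parents` mapping, the default `delta=0.5`, and the toy hierarchy are hypothetical stand-ins for a real MeSH extract.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of the disease itself and of every ancestor.

    parents maps each node to the list of its direct parents in the DAG.
    The disease contributes 1; an ancestor gets the maximum decayed value over all paths.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)  # re-propagate the improved value upward
    return contrib

def semantic_similarity(a, b, parents, delta=0.5):
    """Shared contributions divided by the sum of both semantic values."""
    da = semantic_contributions(a, parents, delta)
    db = semantic_contributions(b, parents, delta)
    shared = set(da) & set(db)
    numerator = sum(da[n] + db[n] for n in shared)
    return numerator / (sum(da.values()) + sum(db.values()))

# Toy hierarchy (hypothetical): A and B share parent C, whose parent is root.
parents = {"A": ["C"], "B": ["C"], "C": ["root"]}
sim_ab = semantic_similarity("A", "B", parents)
sim_aa = semantic_similarity("A", "A", parents)
```

Two diseases that share many close ancestors score near 1, while diseases whose only common ancestor is a distant root score near 0.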
A heterogeneous network is constructed using the microorganism similarity, the disease similarity, and the known microorganism-disease associations. The microorganism-disease associations are expressed as a binary matrix $A \in \{0,1\}^{M \times N}$, where N and M denote the numbers of diseases and microorganisms, respectively. When a known association exists between microorganism $m_i$ and disease $d_j$, $A_{ij}=1$; otherwise, $A_{ij}=0$. The similarity of microorganisms is expressed as the matrix $S_m$, where $S_m^{ij}$ denotes the similarity of microorganisms $m_i$ and $m_j$; the similarity of diseases is expressed as the matrix $S_d$, where $S_d^{ij}$ denotes the similarity of diseases $d_i$ and $d_j$. Based on the microorganism-disease association matrix A, the microorganism similarity matrix $S_m$, and the disease similarity matrix $S_d$, the adjacency matrix $A_H$ is established. For the specific construction method, see example 1.
The node similarity and node association information of the discrete objects are combined on the heterogeneous network by using a graph convolution neural network encoder, and microorganisms and diseases contained in the heterogeneous network are encoded; learning the characteristic embedding of the microorganisms and disease nodes of each convolution layer by using a multi-head attention mechanism to obtain the final embedding containing the microorganisms and the disease; decoding the obtained characteristics containing microorganisms and diseases by using a linear decoder so that the dimensions of an output matrix and an input matrix are the same, and obtaining a microorganism-disease associated prediction score; and the minimized weighted binary cross entropy is adopted as a loss function learning parameter, so that decision deviation caused by sparse characteristics of the data set is reduced. Specific methods are described in example 1.
In this example, the method was applied to HMDAD dataset (comprising 483 experimentally confirmed microbe-disease associations between 39 diseases and 292 microbes) with BRWMDA (predicting microbe-disease association based on random walk method) and WMGHDMA (predicting microbe-disease association based on network-based calculation method) as controls.
The statistics of the HMDAD dataset are shown in table 1.
Table 1 data set statistics
Many performance evaluation methods exist for association prediction, among which the area under the ROC curve (AUC) and the area under the PR curve (AUPR) are the most widely used. AUC and AUPR are chosen as the evaluation indicators in this example. TP denotes the number of known microorganism-disease pairs predicted to be associated, FP the number of unknown microorganism-disease pairs predicted to be associated, FN the number of known microorganism-disease pairs predicted not to be associated, and TN the number of unknown microorganism-disease pairs predicted not to be associated. The true positive rate (TPR, True Positive Rate), false positive rate (FPR, False Positive Rate), Precision, and Recall can then be expressed as:

$$TPR = \frac{TP}{TP + FN} \quad (15)$$

$$FPR = \frac{FP}{FP + TN} \quad (16)$$

$$Precision = \frac{TP}{TP + FP} \quad (17)$$

$$Recall = \frac{TP}{TP + FN} \quad (18)$$

Taking FPR as the abscissa and TPR as the ordinate gives the ROC curve, and the area under the ROC curve is the AUC; taking Recall as the abscissa and Precision as the ordinate gives the PR curve, and the area under the PR curve is the AUPR.
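The evaluation quantities can be computed directly from the four counts, and the area under the ROC curve follows from sweeping a threshold over the predicted scores. The sketch below is illustrative only: the function names and the toy scores are hypothetical, and the area is obtained with the trapezoidal rule over (FPR, TPR) points.

```python
def rates(tp, fp, fn, tn):
    """TPR, FPR, Precision and Recall from the four confusion counts."""
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
    }

def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the decision threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example (hypothetical scores): label 1 = known association.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = area_under(roc_points(scores, labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a ranking that places every known association above every unknown pair.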
To verify that the method provided by the invention is superior to other baseline methods in prediction performance, BRWMDA and WMGHDMA are compared with the discrete object data association prediction method provided by this embodiment (named MAGMDA) in terms of AUC and AUPR. In the experiment of this embodiment, 5-fold cross-validation is used to evaluate the performance of the model: the known microorganism-human disease association data set is randomly divided into 5 groups, and each time one group is used as the test set and the other four groups as the training set. The model parameters of this embodiment are set as follows: the graph convolutional neural network has a 3-layer structure with 16 hidden nodes per layer, a learning rate of 0.001, a node dropout probability of 0.7, an edge dropout probability of 0.3, and a heterogeneous network penalty factor of 6. Tables 2 and 3 show the AUC, AUPR, Accuracy, and Precision of each method on the HMDAD data set. On the HMDAD data set, the MAGMDA method improves the AUC by 3.23%, the Accuracy by 1.22%, and the Precision by 0.41% over the next-best method. Overall, the MAGMDA method improves prediction performance over the other baseline methods.
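The 5-fold split of the known association pairs can be sketched as below. The function name, the fixed random seed, and the toy list of pairs are hypothetical; the sketch only shows the partitioning, not the training loop.

```python
import random

def k_fold_splits(pairs, k=5, seed=0):
    """Yield (train, test) lists; each of the k folds is the test set exactly once."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

# Toy example (hypothetical): 23 known (object m, object d) index pairs.
known_pairs = [(i, i % 3) for i in range(23)]
splits = list(k_fold_splits(known_pairs, k=5))
```

In each round the associations in the test fold would be hidden from the model, and the scores predicted for them would be compared against the held-out labels.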
TABLE 2 statistical evaluation of the prediction of the relationship between microbial and human disease
Taken together, it is shown that: the discrete object correlation prediction method based on graph convolution multi-head attention provided by the invention uses a gene sequence to calculate microorganism similarity and uses a directed acyclic graph to calculate disease similarity by utilizing microorganism data and disease data, so that more information characteristics of discrete objects are captured, a heterogeneous network is constructed by utilizing the microorganism similarity, the disease similarity and known microorganism-disease correlation, topological information of the heterogeneous network is effectively captured by utilizing a graph convolution neural network of a multi-head attention mechanism, and minimized weighted binary cross entropy is adopted as a loss function learning parameter. The method can fully utilize bioinformatics multisource data, capture nonlinear microorganism-human disease association relation, capture heterogeneous network topology information of different convolution layers, reduce decision bias caused by sparsity of known microorganism-human disease association data and improve microorganism-human disease association prediction precision.
Example 4: application example 2, discrete object drug-disease association prediction
In order to verify the effectiveness of the prediction method provided by the invention, in the embodiment, the medicine is taken as a discrete object m, the disease is taken as a discrete object d, the known medicine-disease association relationship is the known association relationship between the discrete objects, and the potential association relationship between the medicine and the disease is predicted.
The features of a drug include its chemical molecular structure, drug interactions, and drug action targets, which are represented by the matrices $X^1$, $X^2$, and $X^3$, respectively, where $n_i$ denotes the number of features of the i-th kind and M denotes the number of drugs. Drug similarity is calculated using a cosine similarity calculation model, as shown in formula (19):

$$S_{ij}^{t} = \frac{x_i^{t} \cdot x_j^{t}}{\lVert x_i^{t} \rVert \, \lVert x_j^{t} \rVert} \quad (19)$$

where $S_{ij}^{t}$ denotes the similarity between the i-th and j-th drugs obtained from the t-th feature, and $x_i^{t}$ and $x_j^{t}$ denote the feature vectors of the i-th and j-th drugs for the t-th feature, respectively, with $t = 1, 2, 3$.
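Cosine similarity over one kind of drug feature can be sketched as follows. The binary feature rows are hypothetical, and returning 0 for an all-zero vector is an added convention that the text does not address.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a feature matrix."""
    return [[cosine_similarity(xi, xj) for xj in X] for xi in X]

# Toy example (hypothetical): three drugs described by three binary features.
X_t = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
S_t = similarity_matrix(X_t)
```

One such matrix is obtained per feature kind; how the per-feature matrices are merged into a single drug similarity matrix is not shown here.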
The similarity of the diseases is calculated from the directed acyclic graph calculation model, see example 3, and in particular from formulas (11) - (14).
A heterogeneous network is constructed using the drug similarity, the disease similarity, and the known drug-disease associations. The drug-disease associations are expressed as a binary matrix $A \in \{0,1\}^{M \times N}$, where N and M denote the numbers of diseases and drugs, respectively. When a known association exists between drug $m_i$ and disease $d_j$, $A_{ij}=1$; otherwise, $A_{ij}=0$.
The similarity of drugs is expressed as the matrix $S_m$, where $S_m^{ij}$ denotes the similarity of drugs $m_i$ and $m_j$; the similarity of diseases is expressed as the matrix $S_d$, where $S_d^{ij}$ denotes the similarity of diseases $d_i$ and $d_j$. Based on the drug-disease association matrix A, the drug similarity matrix $S_m$, and the disease similarity matrix $S_d$, the adjacency matrix $A_H$ is established. For the specific construction method, see example 1.
The node similarity and node association information of the discrete objects are combined on the heterogeneous network by using a graph convolution neural network encoder, and medicines and diseases contained in the heterogeneous network are encoded; learning the characteristic embedding of the medicine and disease nodes of each convolution layer by using a multi-head attention mechanism to obtain the final embedding containing the medicine and the disease; decoding the obtained characteristics containing the medicine and the disease by using a linear decoder so that the dimension of an output matrix is the same as that of an input matrix to obtain a medicine-disease associated prediction score; and the minimized weighted binary cross entropy is adopted as a loss function learning parameter, so that decision deviation caused by sparse characteristics of the data set is reduced. Specific methods are described in example 1.
This example uses SCMFDD (binary network based drug-disease association prediction) and nimgcn (drug-disease association prediction based on neural inductive matrix completion with a graph convolutional network) as reference methods, and applies the method to the Ldataset dataset (comprising 18416 experimentally verified drug-disease associations between 269 drugs and 598 diseases). Statistical information of the Ldataset dataset is shown in table 3.
Table 3 Ldataset dataset statistics
To verify that the method provided by the invention is superior to other baseline methods, the reference methods are compared with the discrete object data association prediction method provided by this example (named MAGGCN) in terms of AUC and AUPR. In this embodiment, 5-fold cross-validation is used to evaluate the performance of the model: the known drug-human disease association data set is randomly and evenly divided into 5 groups, and each time one group is used as the test set and the other four groups as the training set. The model parameters of this embodiment are set as follows: the graph convolutional neural network has a 2-layer structure with 64 hidden nodes per layer, a learning rate of 0.01, a node dropout probability of 0.7, an edge dropout probability of 0.3, and a heterogeneous network penalty factor of 6. Table 4 and fig. 4 show the AUC, AUPR, Accuracy, and Precision of each method on the Ldataset dataset. On the Ldataset dataset, the MAGGCN method improves the AUC and AUPR by 3.9% and 3%, respectively, and the Precision by 1.4% over the next-best method. Overall, the MAGGCN method improves prediction performance over the other baseline methods.
TABLE 4 statistics of predicted evaluation results of drug-disease associations
Taken together, this shows that the discrete object data association prediction method provided by the invention uses drug data and disease data, calculating drug similarity with cosine similarity and disease similarity with a directed acyclic graph, and thereby captures more informative features of the discrete objects; constructs a heterogeneous network from the drug similarity, the disease similarity, and the known drug-disease associations; effectively captures the topological information of the heterogeneous network with a graph convolutional neural network using a multi-head attention mechanism; and uses the minimized weighted binary cross entropy as the loss function to learn the parameters. The method can make full use of the various kinds of feature information of drugs and the semantic information of diseases, fully capture nonlinear drug-human disease associations, capture the heterogeneous network topology information of different convolution layers, reduce the decision bias caused by the sparsity of known drug-disease association data, and improve the accuracy of drug-human disease association prediction.
Example 5: application example 3, association prediction of different commodities
To verify the effectiveness of the prediction method provided by the invention, this embodiment takes commodities as the discrete objects and treats different commodities purchased in the same order as associated, and predicts the potential associations between different commodities. Commodity similarity is calculated from commodity characteristics such as attributes, types, and functions, and is represented as a similarity matrix. A heterogeneous network is constructed from the commodity similarity and the known associations between commodities in orders, with the commodity associations represented by a binary matrix. The similarity and association information of the commodities is combined by deploying a graph convolutional network on the heterogeneous network. The specific representations of the commodities are captured by adding a multi-head attention mechanism to each graph convolution layer, and the final feature embeddings of the commodities are obtained by combining the attention scores of the convolution layers. The feature embeddings are decoded with a linear decoder to obtain the association prediction scores of the commodities. The parameters are learned using the minimized weighted binary cross entropy as the loss function. For the specific methods, see the prediction method provided in example 1 and the discrete object data similarity calculation models provided in examples 3 and 4.
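The binary commodity association matrix can be derived from order data as sketched below, assuming that two commodities are associated whenever they appear together in at least one order. The function name, the commodity names, and the orders are hypothetical.

```python
from itertools import combinations

def co_purchase_matrix(orders, items):
    """Symmetric binary matrix: entry is 1 if two distinct items share at least one order."""
    index = {item: k for k, item in enumerate(items)}
    A = [[0] * len(items) for _ in items]
    for order in orders:
        for a, b in combinations(sorted(set(order)), 2):
            i, j = index[a], index[b]
            A[i][j] = A[j][i] = 1
    return A

# Toy example (hypothetical orders).
items = ["phone", "case", "charger", "book"]
orders = [["phone", "case"], ["phone", "charger", "case"], ["book"]]
A = co_purchase_matrix(orders, items)
```

Because both object types are commodities here, the association matrix is square and symmetric, unlike the rectangular matrices in the microorganism-disease and drug-disease examples.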
The method of this embodiment is applied to the Jingdong (JD) e-commerce data set JDdataset (comprising 11212 commodities, 141 commodity categories, and 240332 orders). The commodity association prediction results of the method are shown in fig. 5, in which nodes represent commodities and an edge between two nodes indicates that the commodities are predicted to be associated.
Taken together, it is shown that: the discrete object relevance prediction method based on the graph convolution multi-head attention provided by the invention can fully utilize various characteristic information of commodities to capture the association relation between nonlinear commodities, capture heterogeneous network topology information of different convolution layers and provide reasonable recommended commodities for an electronic commerce platform recommended page so as to improve sales of the electronic commerce.
The above description is only of a few preferred embodiments of the present invention and should not be taken as limiting the invention, but all modifications, equivalents, improvements and modifications within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (27)
1. A method for predicting the association of a microorganism-human disease, comprising the steps of:
S1, respectively calculating the similarity between microorganisms and the similarity between human diseases, the microorganisms and human diseases being the discrete objects, and constructing a heterogeneous network together with the known associations between microorganisms and human diseases;
S2, combining node similarity and node association information of microorganisms and human diseases on the heterogeneous network by using a graph convolutional neural network encoder, and encoding the microorganisms and human diseases contained in the heterogeneous network;
S3, learning the feature embeddings of the microorganism and human disease nodes of each convolution layer by using a multi-head attention mechanism to obtain the final embeddings of the microorganisms and human diseases;
S4, decoding the obtained features of the microorganisms and human diseases by using a linear decoder so that the output matrix has the same dimensions as the input matrix, and obtaining the microorganism-human disease association prediction scores;
S5, using the minimized weighted binary cross entropy as the loss function to learn the parameters, reducing the decision bias caused by the sparsity of the data set.
2. The method for predicting microbial-human disease association according to claim 1, wherein step S1 specifically comprises:
according to the data characteristics of microorganisms and human diseases, using similarity calculation models to calculate the similarity of microorganisms and the similarity of human diseases, respectively; the similarity of microorganisms is represented by the matrix $S_m$, and the similarity of human diseases is represented by the matrix $S_d$;
describing the known associations between microorganisms and human diseases as a binary matrix $A \in \{0,1\}^{M \times N}$, where M and N denote the numbers of microorganisms and human diseases, respectively; when a known association exists between microorganism data $m_i$ and human disease data $d_j$, $A_{ij}=1$; otherwise, $A_{ij}=0$;
based on the association matrix A of microorganisms and human diseases, the similarity matrix $S_m$ of microorganisms, and the similarity matrix $S_d$ of human diseases, constructing the heterogeneous network, which is represented by the adjacency matrix $A_H$ of the following formula (1):

$$A_H = \begin{bmatrix} \tilde{S}_m & A \\ A^T & \tilde{S}_d \end{bmatrix} \quad (1)$$

where $\tilde{S}_m = D_m^{-1/2} S_m D_m^{-1/2}$ and $\tilde{S}_d = D_d^{-1/2} S_d D_d^{-1/2}$ are the normalized forms of the similarity matrix $S_m$ of microorganisms and the similarity matrix $S_d$ of human diseases, respectively; $D_m = \mathrm{diag}(\sum_j S_m^{ij})$ and $D_d = \mathrm{diag}(\sum_j S_d^{ij})$, where $S_m^{ij}$ denotes the similarity of data $m_i$ and $m_j$, and $S_d^{ij}$ denotes the similarity of data $d_i$ and $d_j$; diag is a matrix operation on the main diagonal elements of a matrix.
3. The method for predicting microbial-human disease association according to claim 2, wherein step S2 specifically comprises:
deploying a graph convolutional neural network encoder on the heterogeneous network $A_H$ to combine node similarity and node association information, the input being set according to the following formula (2):

$$G = \begin{bmatrix} \mu \tilde{S}_m & A \\ A^T & \mu \tilde{S}_d \end{bmatrix} \quad (2)$$

where $\mu$ is a penalty factor that controls the contribution of the similarities during GCN propagation, and $A^T$ denotes the transpose of matrix A; the graph convolutional neural network propagation rule is the following formula (3):

$$H^{(l+1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(l)} W^{(l)}\right) \quad (3)$$

where $H^{(l)}$ and $H^{(l+1)}$ are the node features of layer $l$ and layer $l+1$, respectively; $D = \mathrm{diag}(\sum_j G_{ij})$ is the degree matrix of G, and $G_{ij}$ denotes the element in row i and column j of G; $W^{(l)}$ is the weight matrix trained from layer $l$ to layer $l+1$; $\sigma$ is a nonlinear activation function; $D^{-1/2} G D^{-1/2}$ is the normalized adjacency matrix G; and the propagation is initialized as follows:

$$H^{(0)} = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$$

according to the above propagation formula (3) and the initialization of the propagation, the first-layer GCN encoder is further described by the following formula (4):

$$H^{(1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(0)} W^{(0)}\right) \quad (4)$$
4. The method for predicting microbial-human disease association according to claim 3, wherein step S3 specifically comprises: capturing the specific representations of microorganisms and human diseases by adding a multi-head attention score to each graph convolution layer, the attention score of each layer being given by formula (5),
in which $a$ is a parametric function, $W^{(l)}$ is the trained weight matrix of layer $l$, and $h_m^{(l)}$ and $h_d^{(l)}$ denote the node outputs of microorganisms and human diseases at layer $l$, respectively; all attention scores are normalized using a softmax function, given by formula (6),
in which $N_i$ and $N_j$ denote the neighbor node sets of nodes i and j, respectively, and exp is the exponential function; the graph convolutional neural network with the attention mechanism captures the structural information of the heterogeneous network by combining the embeddings of the different convolution layers, and the final embedding is given by formula (7).
5. The method for predicting microbial-human disease association according to claim 4, wherein step S4 specifically comprises: decoding the result using a linear decoder, the association prediction score P between microorganisms and human diseases being represented by the following formula (8):

$$P = \mathrm{sigmoid}\left(H_m W' H_d^T\right) \quad (8)$$
6. The method for predicting microbial-human disease association according to claim 5, wherein step S5 specifically comprises: the weighted binary cross entropy to be minimized as the loss function is calculated by the following formula (9):

$$Loss = -\frac{1}{M \times N}\left(\lambda \sum_{(i,j) \in p^+} \log P(i,j) + \sum_{(i,j) \in p^-} \log\left(1 - P(i,j)\right)\right) \quad (9)$$

where $(i,j)$ denotes the pair of microorganism data $m_i$ and human disease data $d_j$; $P(i,j)$ denotes the predicted association score between microorganism data $m_i$ and human disease data $d_j$; the influence factor $\lambda = |p^-| / |p^+|$ reduces the influence of the imbalance between $p^+$ and $p^-$; $|p^+|$ denotes the number of all known microorganism-human disease association pairs; and $|p^-|$ denotes the number of microorganism-human disease pairs for which no association has been found.
7. The method for predicting microbial-human disease association according to claim 2, wherein,
the similarity calculation model comprises a directed acyclic graph similarity calculation model and a cosine similarity calculation model.
8. A system for predicting the association of a microorganism-human disease, comprising:
a data similarity calculation module for calculating the similarity between microorganisms and the similarity between human diseases using the similarity calculation model;
the heterogeneous network construction module is used for constructing a heterogeneous network by utilizing the similarity between microorganisms and the similarity between human diseases and the known association relationship between microorganisms and human diseases;
the multi-head attention model building module comprises a graph convolutional neural network encoder module, a multi-head attention mechanism module, and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding microorganisms and human diseases on the heterogeneous network using the graph convolutional neural network encoder, combining node similarity and node association information; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer using multi-head attention, calculating the attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of the microorganisms and human diseases; the linear decoder module is used for decoding the obtained features of the microorganisms and human diseases using the linear decoder so that the output matrix has the same dimensions as the input matrix, and obtaining the association prediction scores between microorganisms and human diseases;
and the optimization module is used for adopting the minimized weighted binary cross entropy as a loss function learning parameter and reducing decision deviation caused by the sparse characteristic of the data set.
9. A computer storage medium, characterized in that a computer program is stored thereon, wherein the computer program, when executed by a processor, implements the method for predicting the association of a microorganism-human disease according to any one of claims 1-7.
10. The medicine-disease association prediction method is characterized by comprising the following steps:
S1, respectively calculating the similarity between medicines and the similarity between diseases, the medicines and diseases being the discrete objects, and constructing a heterogeneous network together with the known associations between medicines and diseases;
S2, combining node similarity and node association information of medicines and diseases on the heterogeneous network by using a graph convolutional neural network encoder, and encoding the medicines and diseases contained in the heterogeneous network;
S3, learning the feature embeddings of the medicine and disease nodes of each convolution layer by using a multi-head attention mechanism to obtain the final embeddings of the medicines and diseases;
S4, decoding the obtained features of the medicines and diseases by using a linear decoder so that the output matrix has the same dimensions as the input matrix, and obtaining the medicine-disease association prediction scores;
S5, using the minimized weighted binary cross entropy as the loss function to learn the parameters, reducing the decision bias caused by the sparsity of the data set.
11. The method of claim 10, wherein step S1 specifically comprises:
according to the data characteristics of the drugs and the diseases, calculating the drug similarity and the disease similarity, respectively, with a similarity calculation model; the drug similarity is represented by a matrix $S_m$ and the disease similarity by a matrix $S_d$;
describing the known associations between drugs and diseases as a binary matrix $A$, where $M$ and $N$ are the numbers of drugs and diseases, respectively; when there is a known association between drug data $m_i$ and disease data $d_j$ of the discrete objects, $A_{ij} = 1$; otherwise $A_{ij} = 0$;
based on the drug-disease association matrix $A$, the drug similarity matrix $S_m$ and the disease similarity matrix $S_d$, constructing the heterogeneous network, whose adjacency matrix $G$ is given by the following formula (1):
where the two normalized matrices in formula (1) are obtained by normalizing the drug similarity matrix $S_m$ and the disease similarity matrix $S_d$, respectively; the entries of $S_m$ are the similarities between pairs of drug data items, and the entries of $S_d$ are the similarities between pairs of disease data items; diag is a matrix operator that takes the main diagonal elements of a matrix.
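The construction described in claim 11 can be sketched as follows. This is only an illustration under stated assumptions: the text of formula (1) is not reproduced here, so the block layout `[[S_m~, A], [A^T, S_d~]]` and the symmetric normalization `D^{-1/2} S D^{-1/2}` are assumptions borrowed from common graph-convolution practice, and the function names and toy matrices are hypothetical.

```python
import math

def sym_normalize(S):
    """Return D^{-1/2} S D^{-1/2}, where D = diag(row sums of S) (assumed normalization)."""
    deg = [sum(row) for row in S]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(S)
    return [[inv_sqrt[i] * S[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def build_heterogeneous_adjacency(S_m, S_d, A):
    """Assemble the (M+N) x (M+N) block matrix [[S_m~, A], [A^T, S_d~]] (assumed layout)."""
    M, N = len(S_m), len(S_d)
    Sm_n, Sd_n = sym_normalize(S_m), sym_normalize(S_d)
    # Top block rows: normalized drug similarity followed by the association row.
    top = [Sm_n[i] + [float(A[i][j]) for j in range(N)] for i in range(M)]
    # Bottom block rows: transposed association column followed by disease similarity.
    bottom = [[float(A[i][j]) for i in range(M)] + Sd_n[j] for j in range(N)]
    return top + bottom

# Toy data: 2 drugs, 3 diseases (values are made up for illustration).
S_m = [[1.0, 0.5], [0.5, 1.0]]
S_d = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.4], [0.0, 0.4, 1.0]]
A = [[1, 0, 0], [0, 1, 1]]
G = build_heterogeneous_adjacency(S_m, S_d, A)
```

With these toy inputs `G` is a symmetric 5 x 5 matrix whose off-diagonal blocks carry the known associations and whose diagonal blocks carry the normalized similarities.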
12. The method of claim 11, wherein step S2 specifically comprises:
deploying a graph convolutional neural network encoder on the heterogeneous network $G$ to combine node similarity and node association information, with the input set according to the following formula (2):
where a penalty factor controls the contribution of the similarities during GCN propagation, and $A^T$ denotes the transpose of matrix $A$; the graph convolutional neural network propagation rule is the following formula (3):
where $H^{(l)}$ and $H^{(l+1)}$ are the node features of layer $l$ and layer $l+1$, respectively; the degree matrix of $G$ is used to normalize the adjacency matrix $G$, and $G_{ij}$ denotes the element in row $i$, column $j$ of $G$; $W^{(l)}$ is the weight matrix trained between layer $l$ and layer $l+1$, applied together with a nonlinear activation function; the propagation formula is initialized as follows:
according to propagation formula (3) and the above initialization, the first-layer GCN encoder is further described by the following formula (4):
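The exact formulas (3) and (4) are not reproduced in this text. As a rough sketch only, one propagation step can be illustrated with the widely used rule `H_next = relu(D^{-1/2} G D^{-1/2} H W)`, which matches the quantities named in claim 12 (degree matrix, normalized adjacency, layer weights, nonlinear activation). The choice of ReLU, the initial features `H0` built from the association matrix, and all names and numbers below are assumptions.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def normalize_adjacency(G):
    """Symmetric normalization D^{-1/2} G D^{-1/2}, with D = diag(row sums of G)."""
    deg = [sum(row) for row in G]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(G)
    return [[inv_sqrt[i] * G[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def relu(X):
    """Element-wise ReLU, used here as the (assumed) nonlinear activation."""
    return [[max(0.0, v) for v in row] for row in X]

def gcn_layer(G, H, W):
    """One graph-convolution step: activation(normalized G @ H @ W)."""
    return relu(matmul(matmul(normalize_adjacency(G), H), W))

# Toy network with 1 drug and 2 diseases, i.e. 3 nodes (made-up values).
G = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.5],
     [0.0, 0.5, 1.0]]
# Assumed initial features: the block matrix [[0, A], [A^T, 0]] with A = [[1, 0]].
H0 = [[0.0, 1.0, 0.0],
      [1.0, 0.0, 0.0],
      [0.0, 0.0, 0.0]]
W0 = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.4]]  # 3 input features -> 2 hidden features
H1 = gcn_layer(G, H0, W0)
```

Here `H1` holds one 2-dimensional embedding per node; stacking further calls to `gcn_layer` with new weight matrices would give the deeper layers.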
13. The method of claim 12, wherein step S3 specifically comprises: capturing specific representations of the drugs and diseases by adding multi-head attention scores to each graph convolution layer, the attention score of each layer being given by the following formula (5):
where the score is computed by a parametric function with the trained weight matrix of layer $l$ from the drug and disease node outputs of layer $l$; all attention scores are normalized with a softmax function, given by the following formula (6):
where the two neighbor sets in formula (6) are the sets of neighbor nodes of nodes $i$ and $j$, respectively, and exp is the exponential function; the final embedding of the attention-based graph convolutional neural network encoder, which combines the embeddings of the different convolution layers to capture the structural information of the heterogeneous network, is given by the following formula (7):
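Formulas (5) to (7) are not reproduced in this text, so the following is a deliberately simplified sketch of the two operations claim 13 names: softmax normalization of attention scores and a weighted combination of the per-layer embeddings. It assumes one scalar score per layer, whereas the claim computes scores from node outputs and neighbor sets; the function names and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_embeddings, raw_scores):
    """Weighted sum of per-layer embedding matrices, weights = softmax(raw_scores)."""
    weights = softmax(raw_scores)
    rows = len(layer_embeddings[0])
    cols = len(layer_embeddings[0][0])
    return [[sum(w * H[i][j] for w, H in zip(weights, layer_embeddings))
             for j in range(cols)]
            for i in range(rows)]

# Two layers of embeddings for two nodes (made-up values).
H1 = [[1.0, 0.0], [0.0, 1.0]]
H2 = [[0.0, 2.0], [2.0, 0.0]]
final = combine_layers([H1, H2], raw_scores=[0.0, 0.0])
```

With equal raw scores both layers receive weight 0.5, so `final` is the plain average of `H1` and `H2`; unequal scores shift the final embedding toward the higher-scoring layer.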
14. The method of claim 13, wherein step S4 specifically comprises: decoding the result with a linear decoder, the association prediction score $P$ between the drugs and the diseases of the discrete objects being given by the following formula (8):
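Formula (8) is not reproduced in this text. One possible decoder that satisfies the stated requirement (an output with the same M x N shape as the association matrix) is a bilinear form over the drug and disease embeddings, sketched below. The bilinear form `H_m W H_d^T`, the sigmoid that squashes scores into (0, 1), and every name and number here are assumptions, not the patent's decoder.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode(H, M, W):
    """Split the final embedding into drug rows (first M) and disease rows (the rest),
    then score every drug-disease pair with H_m @ W @ H_d^T."""
    H_m, H_d = H[:M], H[M:]
    logits = matmul(matmul(H_m, W), transpose(H_d))
    return [[sigmoid(v) for v in row] for row in logits]

# Final embeddings for 2 drugs followed by 3 diseases, 2 features each (made up).
H = [[0.2, 0.1], [0.4, 0.3], [0.5, 0.0], [0.0, 0.6], [0.1, 0.1]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only
P = decode(H, M=2, W=W)
```

`P` has 2 rows and 3 columns, one score per drug-disease pair, which lines it up entry by entry with the association matrix for the loss computation.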
15. The method of claim 14, wherein step S5 specifically comprises: minimizing a weighted binary cross-entropy loss function, calculated by the following formula (9):
where $(i, j)$ denotes the pair of drug data $m_i$ and disease data $d_j$; $P(i, j)$ is the predicted association score between drug data $m_i$ and disease data $d_j$; an influence factor reduces the effect of the imbalance between the set of known association pairs and the set of undiscovered pairs; and the two set sizes in formula (9) are the number of known drug-disease association pairs and the number of drug-disease pairs whose association has not been discovered, respectively.
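A weighted binary cross-entropy of the kind claim 15 describes can be sketched as below. Formula (9) is not reproduced in this text, so the specific weighting is an assumption: the positive (known) pairs are multiplied by `gamma = (number of unknown pairs) / (number of known pairs)`, a common way to offset sparse positives. The function name, `eps` guard and example matrices are hypothetical.

```python
import math

def weighted_bce(P, A, eps=1e-12):
    """Weighted binary cross-entropy between scores P and binary associations A.

    Known pairs (A == 1) are up-weighted by gamma = |unknown| / |known| (assumed)."""
    pos = [(i, j) for i, row in enumerate(A) for j, a in enumerate(row) if a == 1]
    neg = [(i, j) for i, row in enumerate(A) for j, a in enumerate(row) if a == 0]
    if not pos:
        raise ValueError("at least one known association is required")
    gamma = len(neg) / len(pos)
    loss_pos = sum(math.log(P[i][j] + eps) for i, j in pos)
    loss_neg = sum(math.log(1.0 - P[i][j] + eps) for i, j in neg)
    return -(gamma * loss_pos + loss_neg) / (len(pos) + len(neg))

# One known association out of four pairs (made-up example).
A = [[1, 0], [0, 0]]
P_uniform = [[0.5, 0.5], [0.5, 0.5]]
P_good = [[0.9, 0.1], [0.1, 0.1]]
loss_uniform = weighted_bce(P_uniform, A)
loss_good = weighted_bce(P_good, A)
```

Because the single known pair carries the same total weight as the three unknown pairs, a model cannot reach a low loss simply by predicting "no association" everywhere, which is the decision bias the claim refers to.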
16. The method of claim 15, wherein the similarity calculation model comprises a directed acyclic graph similarity calculation model and a cosine similarity calculation model.
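The cosine similarity model named in claim 16 can be illustrated as follows. The claim does not say which feature vectors are compared; using each drug's row of the association matrix as its profile is an assumption made here for the sake of a small example, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(profiles):
    """Pairwise cosine similarity between all rows of `profiles`."""
    return [[cosine_similarity(p, q) for q in profiles] for p in profiles]

# Each row is one drug's association profile over 3 diseases (made-up values).
A = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
S_m = similarity_matrix(A)
```

Drugs 0 and 1 share one associated disease, so their similarity is positive, while drug 2 shares none with either and gets a similarity of zero.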
17. A drug-disease association prediction system, characterized in that it employs the drug-disease association prediction method according to any one of claims 10 to 16, and specifically comprises:
the data similarity calculation module is used for calculating the similarity between drugs and the similarity between diseases with the similarity calculation model;
the heterogeneous network construction module is used for constructing a heterogeneous network from the similarity between drugs, the similarity between diseases and the known association relationships between drugs and diseases;
the multi-head attention model building module comprises a graph convolutional neural network encoder module, a multi-head attention mechanism module and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding the drugs and diseases on the heterogeneous network with a graph convolutional neural network encoder that combines node similarity and node association information; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer with multi-head attention, calculating attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of the drugs and diseases; the linear decoder module is used for decoding the obtained drug and disease features with a linear decoder, so that the output matrix has the same dimensions as the input matrix, and obtaining the drug-disease association prediction scores;
and the optimization module is used for learning the parameters by minimizing a weighted binary cross-entropy loss function, thereby reducing the decision bias caused by the sparsity of the data set.
18. A computer storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the drug-disease association prediction method according to any one of claims 10-16.
19. A method for predicting the relevance of different types of commodities, characterized by comprising the following steps:
S1, calculating the similarity between first-class commodities and the similarity between second-class commodities, respectively, and constructing a heterogeneous network together with the association relationships between the first-class commodities and the second-class commodities in known orders;
S2, encoding the first-class commodities and second-class commodities contained in the heterogeneous network with a graph convolutional neural network encoder that combines the node similarity and node association information of the first-class and second-class commodities on the heterogeneous network;
S3, learning the feature embeddings of the first-class and second-class commodity nodes at each convolution layer with a multi-head attention mechanism to obtain the final embeddings of the first-class and second-class commodities;
S4, decoding the obtained first-class and second-class commodity features with a linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain first-class commodity-second-class commodity association prediction scores;
S5, learning the parameters by minimizing a weighted binary cross-entropy loss function, thereby reducing the decision bias caused by the sparsity of the data set.
20. The method for predicting relevance of different types of commodities according to claim 19, wherein step S1 specifically includes:
according to the data characteristics of the first-class commodities and the second-class commodities, calculating the similarity of the first-class commodities and the similarity of the second-class commodities, respectively, with a similarity calculation model; the first-class commodity similarity is represented by a matrix $S_m$ and the second-class commodity similarity by a matrix $S_d$;
describing the known associations between first-class commodities and second-class commodities as a binary matrix $A$, where $M$ and $N$ are the numbers of first-class and second-class commodities, respectively; when there is a known association between first-class commodity data $m_i$ and second-class commodity data $d_j$ of the discrete objects, $A_{ij} = 1$; otherwise $A_{ij} = 0$;
based on the association matrix $A$ of the first-class and second-class commodities, the first-class commodity similarity matrix $S_m$ and the second-class commodity similarity matrix $S_d$, constructing the heterogeneous network, whose adjacency matrix $G$ is given by the following formula (1):
where the two normalized matrices in formula (1) are obtained by normalizing the first-class commodity similarity matrix $S_m$ and the second-class commodity similarity matrix $S_d$, respectively; the entries of $S_m$ are the similarities between pairs of first-class commodity data items, and the entries of $S_d$ are the similarities between pairs of second-class commodity data items; diag is a matrix operator that takes the main diagonal elements of a matrix.
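The binary association matrix $A$ of claim 20 can be built from a list of known pairs, for instance pairs of commodities that appeared together in past orders. The sketch below is illustrative only; the item names, the order list and the function name are hypothetical and not taken from the patent.

```python
def build_association_matrix(first_ids, second_ids, known_pairs):
    """Return the M x N binary matrix A with A[i][j] = 1 for every known pair."""
    row = {item: i for i, item in enumerate(first_ids)}
    col = {item: j for j, item in enumerate(second_ids)}
    A = [[0] * len(second_ids) for _ in first_ids]
    for m, d in known_pairs:
        A[row[m]][col[d]] = 1  # repeated pairs simply set the same entry again
    return A

# Hypothetical catalog and order history.
first_class = ["phone", "laptop"]
second_class = ["case", "charger", "mouse"]
orders = [("phone", "case"), ("phone", "charger"),
          ("laptop", "mouse"), ("laptop", "charger"), ("phone", "case")]
A = build_association_matrix(first_class, second_class, orders)
```

Every pair not seen in the orders stays at 0, which is why the resulting matrix is typically sparse and why the later loss function treats known and unknown pairs differently.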
21. The method for predicting relevance of different types of commodities according to claim 20, wherein step S2 specifically includes:
deploying a graph convolutional neural network encoder on the heterogeneous network $G$ to combine node similarity and node association information, with the input set according to the following formula (2):
where a penalty factor controls the contribution of the similarities during GCN propagation, and $A^T$ denotes the transpose of matrix $A$; the graph convolutional neural network propagation rule is the following formula (3):
where $H^{(l)}$ and $H^{(l+1)}$ are the node features of layer $l$ and layer $l+1$, respectively; the degree matrix of $G$ is used to normalize the adjacency matrix $G$, and $G_{ij}$ denotes the element in row $i$, column $j$ of $G$; $W^{(l)}$ is the weight matrix trained between layer $l$ and layer $l+1$, applied together with a nonlinear activation function; the propagation formula is initialized as follows:
according to propagation formula (3) and the above initialization, the first-layer GCN encoder is further described by the following formula (4):
22. The method for predicting relevance of different types of commodities according to claim 21, wherein step S3 specifically includes: capturing specific representations of the first-class and second-class commodities by adding multi-head attention scores to each graph convolution layer, the attention score of each layer being given by the following formula (5):
where the score is computed by a parametric function with the trained weight matrix of layer $l$ from the first-class and second-class commodity node outputs of layer $l$; all attention scores are normalized with a softmax function, given by the following formula (6):
where the two neighbor sets in formula (6) are the sets of neighbor nodes of nodes $i$ and $j$, respectively, and exp is the exponential function; the final embedding of the attention-based graph convolutional neural network encoder, which combines the embeddings of the different convolution layers to capture the structural information of the heterogeneous network, is given by the following formula (7):
where the two blocks of the final embedding are the encoded features of the first-class commodity data and of the second-class commodity data, respectively; the combination weights are parameters learned automatically by the neural network, the weight of layer $l$ being learned by the layer-$l$ network and set to an initial value before training; and $L$ is the number of iterations.
23. The method for predicting relevance of different types of commodities according to claim 22, wherein step S4 specifically includes: decoding the result with a linear decoder, the association prediction score $P$ between the first-class commodities and the second-class commodities being given by the following formula (8):
24. The method for predicting relevance of different types of commodities according to claim 23, wherein step S5 specifically includes: minimizing a weighted binary cross-entropy loss function, calculated by the following formula (9):
where $(i, j)$ denotes the pair of first-class commodity data $m_i$ and second-class commodity data $d_j$; $P(i, j)$ is the predicted association score between first-class commodity data $m_i$ and second-class commodity data $d_j$; an influence factor reduces the effect of the imbalance between the set of known association pairs and the set of undiscovered pairs; and the two set sizes in formula (9) are the number of known association pairs of first-class and second-class commodities and the number of first-class and second-class commodity pairs whose association has not been discovered, respectively.
25. The method for predicting relevance of different types of commodities according to claim 24, wherein the similarity calculation model comprises a directed acyclic graph similarity calculation model and a cosine similarity calculation model.
26. A system for predicting relevance of different types of commodities, which is characterized by adopting the method for predicting relevance of different types of commodities according to any one of claims 19 to 25, and specifically comprising:
the data similarity calculation module is used for calculating the similarity between the first type of commodities and the similarity between the second type of commodities by using a similarity calculation model;
the heterogeneous network construction module is used for constructing a heterogeneous network by utilizing the similarity between the first type commodities, the similarity between the second type commodities and the known association relationship between the first type commodities and the second type commodities;
the multi-head attention model building module comprises a graph convolutional neural network encoder module, a multi-head attention mechanism module and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding the first-class and second-class commodities on the heterogeneous network with a graph convolutional neural network encoder that combines node similarity and node association information; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer with multi-head attention, calculating attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of the first-class and second-class commodities; the linear decoder module is used for decoding the obtained features of the first-class and second-class commodities with a linear decoder, so that the output matrix has the same dimensions as the input matrix, and obtaining the association prediction scores between the first-class and second-class commodities;
and the optimization module is used for learning the parameters by minimizing a weighted binary cross-entropy loss function, thereby reducing the decision bias caused by the sparsity of the data set.
27. A computer storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for predicting relevance of different types of commodities according to any one of claims 19-25.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310339869.8A CN116049769B (en) | 2023-04-03 | 2023-04-03 | Discrete object data relevance prediction method and system and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310339869.8A CN116049769B (en) | 2023-04-03 | 2023-04-03 | Discrete object data relevance prediction method and system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116049769A (en) | 2023-05-02 |
CN116049769B (en) | 2023-06-20 |
Family
ID=86122132
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310339869.8A Active CN116049769B (en) | Discrete object data relevance prediction method and system and storage medium | 2023-04-03 | 2023-04-03 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116049769B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117075756B (en) * | 2023-10-12 | 2024-03-19 | 深圳市麦沃宝科技有限公司 | Real-time induction data processing method for intelligent touch keyboard |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2605218A (en) * | 2021-03-23 | 2022-09-28 | Adobe Inc | Graph Neural Networks for datasets with heterophily |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11481418B2 (en) * | 2020-01-02 | 2022-10-25 | International Business Machines Corporation | Natural question generation via reinforcement learning based graph-to-sequence model |
US20210374499A1 (en) * | 2020-05-26 | 2021-12-02 | International Business Machines Corporation | Iterative deep graph learning for graph neural networks |
US20220092413A1 (en) * | 2020-09-23 | 2022-03-24 | Beijing Wodong Tianjun Information Technology Co., Ltd. | Method and system for relation learning by multi-hop attention graph neural network |
CN113362160B (en) * | 2021-06-08 | 2023-08-22 | 南京信息工程大学 | Federated learning method and device for credit card anti-fraud |
CN113807616B (en) * | 2021-10-22 | 2022-11-04 | 重庆理工大学 | Information diffusion prediction system based on space-time attention and heterogeneous graph convolution network |
CN114496092B (en) * | 2022-02-09 | 2024-05-03 | 中南林业科技大学 | MiRNA and disease association relation prediction method based on graph convolutional network |
CN115527627A (en) * | 2022-10-08 | 2022-12-27 | 湖州师范学院 | Drug relocation method and system based on hypergraph convolutional neural network |
CN115732079A (en) * | 2022-11-17 | 2023-03-03 | 湖南电子科技职业学院 | Microorganism and disease association relation prediction method and system based on graph convolution network |
CN115798730A (en) * | 2022-11-18 | 2023-03-14 | 中南大学 | Method, apparatus and medium for circular RNA-disease association prediction based on weighted graph attention and heterogeneous graph neural networks |
CN115828143A (en) * | 2022-12-20 | 2023-03-21 | 南通大学 | Node classification method for realizing heterogeneous meta-path aggregation based on graph convolution and self-attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN116049769A (en) | 2023-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Hierarchical graph pooling with structure learning | |
Ma et al. | Deep learning on graphs | |
Li et al. | Deep convolutional computation model for feature learning on big data in internet of things | |
Jia et al. | Feature dimensionality reduction: a review | |
Dong et al. | A survey on deep learning and its applications | |
Law et al. | Multi-label classification using a cascade of stacked autoencoder and extreme learning machines | |
Salaken et al. | Seeded transfer learning for regression problems with deep learning | |
Hu et al. | Transformation-gated LSTM: efficient capture of short-term mutation dependencies for multivariate time series prediction tasks | |
Song et al. | Multi-layer discriminative dictionary learning with locality constraint for image classification | |
Tian et al. | A neural architecture search based framework for liquid state machine design | |
Chen et al. | AGNN: Alternating graph-regularized neural networks to alleviate over-smoothing | |
Ma et al. | MIDIA: exploring denoising autoencoders for missing data imputation | |
Zhang et al. | Application of convolutional neural network to traditional data | |
Fu et al. | Adaptive graph convolutional collaboration networks for semi-supervised classification | |
Kinderkhedia | Learning Representations of Graph Data--A Survey | |
Yuan et al. | SRLF: a stance-aware reinforcement learning framework for content-based rumor detection on social media | |
CN116049769B (en) | Discrete object data relevance prediction method and system and storage medium | |
CN117349494A (en) | Graph classification method, system, medium and equipment for space graph convolution neural network | |
Jiang et al. | An intelligent recommendation approach for online advertising based on hybrid deep neural network and parallel computing | |
Palmucci et al. | Where is your field going? A machine learning approach to study the relative motion of the domains of physics | |
Zhang et al. | Deep compression of probabilistic graphical networks | |
Bi et al. | Improved network intrusion classification with attention-assisted bidirectional LSTM and optimized sparse contractive autoencoders | |
Zhao et al. | Graph pooling via Dual-view Multi-level Infomax | |
Cao et al. | Implicit user relationships across sessions enhanced graph for session-based recommendation | |
Zhang et al. | Deep heterogeneous network embedding based on Siamese Neural Networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||