CN116049769B

CN116049769B - Discrete object data relevance prediction method and system and storage medium

Info

Publication number: CN116049769B
Application number: CN202310339869.8A
Authority: CN
Inventors: 金敏; 张寒雪
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2023-04-03
Filing date: 2023-04-03
Publication date: 2023-06-20
Anticipated expiration: 2043-04-03
Also published as: CN116049769A

Abstract

The invention discloses a discrete object data association prediction method and system and a storage medium. The prediction method comprises the following steps: calculating the similarity of each kind of discrete object, and constructing a heterogeneous network from these similarities and the known associations between different discrete objects; using a graph convolutional neural network encoder on the heterogeneous network to combine the node similarity and node association information of the discrete objects and to encode the plurality of discrete objects; using a multi-head attention mechanism to learn the feature embeddings of the discrete object nodes at each convolution layer, obtaining the final embeddings of the discrete objects; decoding the obtained features with a linear decoder to obtain association prediction scores between discrete objects; and learning the parameters by minimizing a weighted binary cross entropy loss function, which reduces the decision bias caused by the sparsity of the data set. The method can be applied to association prediction for many kinds of discrete object data, generalizes well, and achieves high prediction accuracy.

Description

Discrete object data relevance prediction method and system and storage medium

Technical Field

The invention relates to the field of predicting association relations between discrete objects, and in particular to a discrete object data association prediction method and system based on graph convolution and multi-head attention, and to a storage medium.

Background

With the development of computer technology and the internet, more and more discrete data have accumulated, laying a solid foundation and providing a broad platform for predicting association relations among discrete objects. Potential associations between discrete objects can be discovered by combining multiple sources of discrete information. For example, if commodity A and commodity B have been purchased together by a customer, then for a commodity C whose function is similar to that of commodity A, a discrete object association prediction method can predict whether the customer will purchase commodity C together with commodity B. As another example, if drug A can treat disease B, then for a drug C whose chemical structure is similar to that of drug A, such a method can predict whether drug C has a therapeutic effect on disease B. Similarly, if word A and word B appear in the same sentence many times, then for a word C whose meaning is similar to that of word A, such a method can predict whether word C and word B will appear in the same sentence.

Existing methods for predicting associations between discrete objects construct a heterogeneous information network among the discrete objects and analyze it to obtain the predicted associations. Each family of methods has a weakness. Matrix decomposition methods tend to ignore nonlinear associations. Random walk methods construct a transition matrix from the heterogeneous information network and iterate until the probability distribution converges, but they easily fall into local optima. Neural network methods tend to ignore the topological information of the graph in the heterogeneous network.

Specifically, for the prediction of associations between microorganisms and diseases, patent document CN112151191A discloses a microorganism-disease association prediction method and system that uses meta-path-based random walks to obtain multi-source information representations of diseases and microorganisms, thereby fusing multi-source data and predicting microorganism-disease associations from several kinds of information. The meta-path-based random walk algorithm can effectively extract information about microorganisms and diseases from different data sources, and in particular can effectively capture heterogeneous network information. However, the random walk algorithm attends only to adjacent nodes and easily falls into local optima, so the final prediction results are inaccurate.

Therefore, the accuracy of the predictions obtained with existing discrete object association prediction methods still leaves considerable room for improvement.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a discrete object data association prediction method and system and a storage medium that can fully capture the potential associations between discrete objects, capture the heterogeneous network topology information of different convolution layers, reduce the decision bias caused by the sparsity of the known association data between discrete objects, and improve the accuracy of discrete object association prediction.

In order to solve the technical problems, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a discrete object data association prediction method, comprising the following steps:

S1, calculating the similarity of each kind of discrete object, and constructing a heterogeneous network from these similarities and the known associations between different discrete objects;

S2, using a graph convolutional neural network encoder on the heterogeneous network to combine the node similarity and node association information of the discrete objects, and encoding the plurality of discrete objects contained in the heterogeneous network;

S3, using a multi-head attention mechanism to learn the feature embeddings of the discrete object nodes at each convolution layer, obtaining final embeddings that cover the several kinds of discrete objects;

S4, decoding the obtained features of the plurality of discrete objects with a linear decoder, so that the output matrix has the same dimensions as the input matrix, and obtaining the association prediction scores between discrete objects;

S5, learning the parameters by minimizing a weighted binary cross entropy loss function, which reduces the decision bias caused by the sparsity of the data set.

Further, step S1 specifically includes:

according to the data characteristics of the discrete objects, the similarities of discrete objects m and of discrete objects d are calculated with a similarity calculation model; the similarity of discrete objects m is represented by the matrix $S^m \in \mathbb{R}^{M \times M}$, and the similarity of discrete objects d by the matrix $S^d \in \mathbb{R}^{N \times N}$;

the known associations between discrete objects m and discrete objects d are described as a binary matrix $A \in \{0,1\}^{M \times N}$, where M and N denote the numbers of discrete objects m and d, respectively; when there is a known association between discrete object data $m_i$ and discrete object data $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$; i is an integer from 1 to M inclusive, and j is an integer from 1 to N inclusive;

based on the association matrix A of discrete objects m and d, the similarity matrix $S^m$ of discrete objects m and the similarity matrix $S^d$ of discrete objects d, the heterogeneous network is constructed, and its adjacency matrix $A_H$ is given by formula (1):

$$A_H = \begin{bmatrix} \tilde{S}^m & A \\ A^T & \tilde{S}^d \end{bmatrix} \tag{1}$$

where $\tilde{S}^m = D_m^{-1/2} S^m D_m^{-1/2}$ and $\tilde{S}^d = D_d^{-1/2} S^d D_d^{-1/2}$ are the normalized similarity matrices of discrete objects m and d, respectively; $D_m = \mathrm{diag}\left(\sum_j S^m_{ij}\right)$ and $D_d = \mathrm{diag}\left(\sum_j S^d_{ij}\right)$, where $S^m_{ij}$ is the similarity of data $m_i$ and $m_j$, and $S^d_{ij}$ is the similarity of data $d_i$ and $d_j$; diag is a matrix operation that refers to the main-diagonal elements of the matrix.
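
The construction in step S1 can be illustrated with a minimal NumPy sketch. It assumes the symmetric normalization $D^{-1/2} S D^{-1/2}$ with $D$ built from the row sums of each similarity matrix, and a block layout with the similarities on the diagonal and the association matrix off the diagonal. The function names and the toy matrices are hypothetical and are not taken from the patent.

```python
import numpy as np

def normalize_similarity(S):
    """Symmetric normalization D^{-1/2} S D^{-1/2}, with D = diag(row sums of S)."""
    d = S.sum(axis=1)
    safe = np.where(d > 0, d, 1.0)                 # avoid division by zero
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(safe), 0.0)
    return S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def build_heterogeneous_adjacency(S_m, S_d, A):
    """Stack the blocks [[S_m~, A], [A^T, S_d~]] into one (M+N) x (M+N) matrix."""
    return np.block([[normalize_similarity(S_m), A],
                     [A.T, normalize_similarity(S_d)]])

# Toy data (hypothetical): M = 2 objects of type m, N = 3 objects of type d.
S_m = np.array([[1.0, 0.5], [0.5, 1.0]])
S_d = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.4], [0.0, 0.4, 1.0]])
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
A_H = build_heterogeneous_adjacency(S_m, S_d, A)
print(A_H.shape)  # (5, 5)
```

Because both similarity blocks stay symmetric after normalization, the resulting adjacency matrix is symmetric, which is what an undirected heterogeneous network requires.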

Further, step S2 specifically includes:

a graph convolutional neural network (GCN) encoder is deployed on the heterogeneous network to combine node similarity and node association information, and its input is set according to formula (2):

$$G = \begin{bmatrix} \mu \tilde{S}^m & A \\ A^T & \mu \tilde{S}^d \end{bmatrix} \tag{2}$$

where $\tilde{S}^m$ and $\tilde{S}^d$ are the normalized similarity matrices of formula (1); $\mu$ is a penalty factor that controls the contribution of the similarities during GCN propagation; and $A^T$ denotes the transpose of matrix A; the propagation rule of the graph convolutional neural network is formula (3):

$$H^{(l+1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(l)} W^{(l)}\right) \tag{3}$$

where $H^{(l)}$ and $H^{(l+1)}$ are the node features of layer $l$ and layer $l+1$, respectively; $D = \mathrm{diag}\left(\sum_j G_{ij}\right)$ is the degree matrix of G, and $G_{ij}$ is the element in row i and column j of G; $W^{(l)}$ is the weight matrix trained between layer $l$ and layer $l+1$; and $\sigma$ is a nonlinear activation function;

the adjacency matrix G is normalized in this way, and the propagation is initialized with the input feature matrix $H^{(0)}$;

with this setting, the first GCN encoder layer is further described by formula (4):

$$H^{(1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(0)} W^{(0)}\right) \tag{4}$$

where $W^{(0)}$ is the training weight matrix from the input layer to the hidden layer; $H^{(1)}$ is the feature matrix of the hidden layer, and k is the number of feature dimensions; G is the adjacency matrix defined in formula (2).
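
A single propagation step of the encoder in step S2 can be sketched as follows. This is only an illustration under stated assumptions: ReLU stands in for the unspecified nonlinear activation, the input features are taken to be the association blocks with zero diagonal blocks, identity matrices stand in for the normalized similarities, and the weights are random rather than trained. The name `gcn_layer` is hypothetical.

```python
import numpy as np

def gcn_layer(G, H, W):
    """One propagation step: ReLU(D^{-1/2} G D^{-1/2} H W), D = diag(row sums of G)."""
    deg = G.sum(axis=1)
    safe = np.where(deg > 0, deg, 1.0)             # avoid division by zero
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(safe), 0.0)
    G_norm = G * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(G_norm @ H @ W, 0.0)         # ReLU is an assumption

rng = np.random.default_rng(0)
M, N, k, mu = 2, 3, 4, 6.0                         # mu: penalty factor
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
S_m, S_d = np.eye(M), np.eye(N)                    # placeholder similarities (assumption)
G = np.block([[mu * S_m, A], [A.T, mu * S_d]])
H0 = np.block([[np.zeros((M, M)), A],              # assumed input features
               [A.T, np.zeros((N, N))]])
W0 = rng.normal(scale=0.1, size=(M + N, k))        # untrained weights
H1 = gcn_layer(G, H0, W0)
print(H1.shape)  # (5, 4)
```

Stacking further layers only requires feeding `H1` back into `gcn_layer` with a weight matrix of shape `(k, k)`.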

Further, step S3 specifically includes: a multi-head attention score is added at each graph convolution layer to capture the specific representations of discrete objects m and discrete objects d, and the attention score of each layer is given by formula (5):

(5)

where a parametric function is applied, together with the training weight matrix of layer $l$, to the layer-$l$ node outputs of discrete objects m and d. All attention scores are normalized with a softmax function, which is formula (6):

(6)

where the normalization runs over the neighbor node sets of nodes i and j, and exp is the exponential function. The embeddings of the different convolution layers are combined to capture the structural information of the heterogeneous network, and the final embedding of the graph convolutional encoder with the attention mechanism is given by formula (7):

(7)

where $H_m$ is the encoded feature of discrete objects m; $H_d$ is the encoded feature of discrete objects d; the combination parameters, including the parameter of each layer $l$, are learned automatically by the neural network from their initial values; and L is the number of iterations.

Further, step S4 specifically includes: the result is decoded with a linear decoder, and the association prediction score P between discrete objects m and discrete objects d is given by formula (8):

$$P = \mathrm{sigmoid}\left(H_m W' H_d^T\right) \tag{8}$$

where $W'$ is the training weight matrix from the hidden layer to the output layer; sigmoid is a nonlinear activation function that keeps every predicted score in the range 0 to 1; and $H_d^T$ denotes the transpose of $H_d$.
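
The decoding step can be sketched in a few lines. The sketch assumes the decoder multiplies the embeddings of the m objects, a square weight matrix, and the transposed embeddings of the d objects before applying the sigmoid; the embeddings and weights below are random stand-ins, and `decode` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(H_m, H_d, W_dec):
    """Score every (m_i, d_j) pair: sigmoid(H_m[i] @ W_dec @ H_d[j])."""
    return sigmoid(H_m @ W_dec @ H_d.T)

rng = np.random.default_rng(0)
M, N, k = 2, 3, 4
H_m = rng.normal(size=(M, k))      # stand-in embeddings of the m objects
H_d = rng.normal(size=(N, k))      # stand-in embeddings of the d objects
W_dec = rng.normal(size=(k, k))    # untrained decoder weights
P = decode(H_m, H_d, W_dec)
print(P.shape)  # (2, 3)
```

The score matrix has one row per m object and one column per d object, so it has the same shape as the association matrix A that the network was built from.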

Further, step S5 specifically includes: the weighted binary cross entropy that is minimized as the loss function is calculated by formula (9):

$$Loss = -\frac{1}{M \times N}\left(\lambda \sum_{(i,j) \in P^+} \log P(i,j) + \sum_{(i,j) \in P^-} \log\left(1 - P(i,j)\right)\right) \tag{9}$$

where $(i, j)$ denotes the pair of discrete object data $m_i$ and discrete object data $d_j$; $P(i,j)$ is the predicted association score between discrete object $m_i$ and discrete object $d_j$; the influence factor $\lambda = |P^-| / |P^+|$ reduces the influence of the data imbalance between $P^+$ and $P^-$; $|P^+|$ is the number of known association pairs between all discrete objects m and d, and $|P^-|$ is the number of pairs of discrete objects m and d for which no association has been found ($P^+$ denotes the positive instance set and $P^-$ the negative instance set).
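
The effect of the weighting can be checked with a small numeric sketch. It assumes the positive term is weighted by the ratio of negative to positive pairs and that the sum is averaged over all pairs; the clipping constant `eps` is an added numerical safeguard, and the matrices are hypothetical.

```python
import numpy as np

def weighted_bce(P, A, eps=1e-12):
    """Weighted binary cross entropy; positives are up-weighted by |P-| / |P+|."""
    pos = A == 1
    neg = ~pos
    lam = neg.sum() / pos.sum()            # influence factor (assumed form)
    P = np.clip(P, eps, 1.0 - eps)         # guard against log(0)
    total = lam * np.log(P[pos]).sum() + np.log(1.0 - P[neg]).sum()
    return -total / A.size

A = np.array([[1, 0, 0], [0, 0, 0]])       # one known association out of six pairs
good = np.array([[0.9, 0.1, 0.1], [0.1, 0.1, 0.1]])
bad = np.array([[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]])   # misses the known pair
print(weighted_bce(good, A) < weighted_bce(bad, A))  # True
```

With one positive and five negatives, the influence factor is 5, so missing the single known association costs as much as misjudging all five unknown pairs to the same degree.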

Further, the discrete object data associations include microorganism-human disease associations, drug-disease associations, associations between different commodities, and the like.

Further, the similarity calculation model comprises a directed acyclic graph similarity calculation model and a cosine similarity calculation model.

For example, the semantic descriptions of diseases have a hierarchical structure, so their similarity can be calculated with a directed acyclic graph, although the calculation is not limited to directed acyclic graphs; a drug has several kinds of features, such as its structure and action targets, so cosine similarity can be chosen, although the calculation is not limited to the cosine similarity model.

In a second aspect, the present invention further provides a discrete object data relevance prediction system, which adopts the discrete object data relevance prediction method, and specifically includes:

the discrete object data similarity calculation module is used for calculating the similarity of each discrete object by using the similarity calculation model;

The heterogeneous network construction module is used for constructing a heterogeneous network by utilizing the similarity of the discrete objects and the known association relationship between the discrete objects;

the multi-head attention model building module comprises a graph convolutional neural network encoder module, a multi-head attention mechanism module and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding discrete objects m and discrete objects d on the heterogeneous network with the graph convolutional neural network encoder, combining node similarity and node association information; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer with multi-head attention, calculating the attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of discrete objects m and discrete objects d; the linear decoder module is used for decoding the obtained features of discrete objects m and discrete objects d with the linear decoder, so that the output matrix has the same dimensions as the input matrix, and obtaining the association prediction scores between the discrete objects;

and the optimization module is used for adopting the minimized weighted binary cross entropy as a loss function learning parameter and reducing decision deviation caused by the sparse characteristic of the data set.

The invention also provides a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the discrete object data association prediction method described above.

The invention provides a discrete object data association prediction method and system based on a graph convolution multi-head attention mechanism, designed with the strengths and weaknesses of existing discrete object association prediction methods in mind. A heterogeneous network is constructed from the similarities of several kinds of discrete objects and the known associations among them, and the use of similarity data effectively broadens the applicability of discovering potential associations between discrete objects. The graph convolutional neural network effectively captures nonlinear associations between discrete objects. Multi-head attention captures the node features of the discrete objects at each graph convolution layer, and the multi-head attention scores of the layers are calculated and combined, so that more node features are mined for the embedding and the effect of sparse associations on the graph convolutional neural network is effectively compensated. Minimizing the weighted binary cross entropy as the loss function effectively compensates for the decision bias caused by the sparsity of the known association data between discrete objects. Evaluation of the association predictions obtained with the graph convolution multi-head attention method shows high prediction accuracy. The prediction method of the invention can be applied to many kinds of discrete object data and generalizes well.

Compared with the existing method, the discrete object data relevance prediction method and system provided by the invention have the following advantages:

(1) According to the invention, the heterogeneous network is constructed by using the similarity information of various discrete objects and the known association relation between the discrete objects, so that the data characteristics of each discrete object can be fully utilized.

(2) The present invention uses a graph convolutional neural network encoder and a linear decoder to perform association prediction between discrete objects. The graph convolutional neural network can capture nonlinear associations and, because it is trained in a semi-supervised manner, performs well when the training data contain a small number of known associations and a large number of unknown associations between discrete objects.

(3) A multi-head attention mechanism is introduced to capture more discrete object information. Multi-head attention captures the node features of the discrete objects at each convolution layer and obtains an enhanced feature representation of the current node from the weights of its neighbor nodes at each layer. The mechanism captures different structural information of the heterogeneous network, effectively alleviates the inconsistent contributions of the node feature embeddings from different convolution layers, and reduces the effect that sparse associations between discrete objects have on propagation in the graph convolutional neural network.

(4) The invention uses the minimized weighted binary cross entropy as a loss function to reduce decision bias caused by the sparse characteristic of the known association relationship data between discrete objects, thereby strengthening the influence of positive samples.

(5) The prediction method is suitable for the association relation prediction of the discrete objects, and has stronger generalization.

Experiments show that the method markedly improves the accuracy of discrete object association prediction and effectively reduces the decision bias caused by the sparsity of the known associations between discrete objects in the data set.

Drawings

FIG. 1 is a flowchart of a discrete object data relevance prediction method according to embodiment 1 of the present invention;

FIG. 2 is a schematic structural diagram of a discrete object data relevance prediction system according to embodiment 2 of the present invention;

FIG. 3 is a statistical chart of the microorganism-disease associations predicted with the prediction method and system of the present invention in embodiment 3;

FIG. 4 is a statistical chart of the predicted drug-disease association relationship using the prediction method and system of the present invention in example 4;

fig. 5 is a schematic diagram of the result of predicting the association relationship between different commodities by using the prediction method and the prediction system of the present invention in embodiment 5.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

Example 1

As shown in fig. 1, the present embodiment provides a discrete object data relevance prediction method, which is based on a graph convolution multi-head attention mechanism, and includes the following steps:

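S1, calculating the similarity of each kind of discrete object, and constructing a heterogeneous network from these similarities and the known associations between the discrete objects. This specifically includes the following:

According to the data characteristics of the discrete objects, the similarities of discrete objects m and of discrete objects d are calculated with a similarity calculation model; the similarity of discrete objects m is represented by the matrix $S^m \in \mathbb{R}^{M \times M}$, and the similarity of discrete objects d by the matrix $S^d \in \mathbb{R}^{N \times N}$. The discrete object data associations include microorganism-human disease associations, drug-disease associations, associations between different commodities, and the like. The similarity calculation model includes a directed acyclic graph similarity calculation model and a cosine similarity calculation model.

The known associations between discrete objects m and discrete objects d are described as a binary matrix $A \in \{0,1\}^{M \times N}$, where N and M denote the numbers of discrete objects d and m, respectively; when there is a known association between discrete object data $m_i$ and discrete object data $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$; i is an integer from 1 to M inclusive, and j is an integer from 1 to N inclusive.

Based on the association matrix A and the similarity matrices $S^m$ and $S^d$, the heterogeneous network is constructed, and its adjacency matrix $A_H$ is given by formula (1):

$$A_H = \begin{bmatrix} \tilde{S}^m & A \\ A^T & \tilde{S}^d \end{bmatrix} \tag{1}$$

where $\tilde{S}^m = D_m^{-1/2} S^m D_m^{-1/2}$ and $\tilde{S}^d = D_d^{-1/2} S^d D_d^{-1/2}$ are the normalized similarity matrices; $D_m = \mathrm{diag}\left(\sum_j S^m_{ij}\right)$ and $D_d = \mathrm{diag}\left(\sum_j S^d_{ij}\right)$, where $S^m_{ij}$ is the similarity of data $m_i$ and $m_j$, and $S^d_{ij}$ is the similarity of data $d_i$ and $d_j$.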

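S2, using a graph convolutional neural network encoder on the heterogeneous network to combine the node similarity and node association information of the discrete objects, and encoding the plurality of discrete objects contained in the heterogeneous network. This specifically includes the following:

A graph convolutional neural network (GCN) encoder is deployed on the heterogeneous network, and its input is set according to formula (2):

$$G = \begin{bmatrix} \mu \tilde{S}^m & A \\ A^T & \mu \tilde{S}^d \end{bmatrix} \tag{2}$$

where $\tilde{S}^m$ and $\tilde{S}^d$ are the normalized similarity matrices of formula (1), $\mu$ is a penalty factor that controls the contribution of the similarities during GCN propagation, and $A^T$ denotes the transpose of matrix A. The propagation rule is formula (3):

$$H^{(l+1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(l)} W^{(l)}\right) \tag{3}$$

where $H^{(l)}$ and $H^{(l+1)}$ are the node features of layer $l$ and layer $l+1$, respectively; $D = \mathrm{diag}\left(\sum_j G_{ij}\right)$ is the degree matrix of G; $W^{(l)}$ is the weight matrix trained between layer $l$ and layer $l+1$; and $\sigma$ is a nonlinear activation function. With the propagation initialized by the input feature matrix $H^{(0)}$, the first GCN encoder layer is formula (4):

$$H^{(1)} = \sigma\left(D^{-1/2} G D^{-1/2} H^{(0)} W^{(0)}\right) \tag{4}$$

where $W^{(0)}$ is the training weight matrix from the input layer to the hidden layer; $H^{(1)}$ is the feature matrix of the hidden layer, and k is the number of feature dimensions.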

S3, using a multi-head attention mechanism to learn the feature embeddings of the discrete object nodes at each convolution layer, obtaining final embeddings that cover the several kinds of discrete objects. This specifically includes the following: a multi-head attention score is added at each graph convolution layer to capture the specific representations of discrete objects m and discrete objects d, and the attention score of each layer is given by formula (5):

(5)

where a parametric function is applied, together with the training weight matrix of layer $l$, to the layer-$l$ node outputs of discrete objects m and d. All attention scores are normalized with a softmax function, which is formula (6):

(6)

where the normalization runs over the neighbor node sets of nodes i and j, and exp is the exponential function. The embeddings of the different convolution layers are combined to capture the structural information of the heterogeneous network, and the final embedding of the graph convolutional encoder with the attention mechanism is given by formula (7):

(7)

where $H_m$ is the encoded feature of discrete objects m; $H_d$ is the encoded feature of discrete objects d; the combination parameters, including the parameter of each layer $l$, are learned automatically by the neural network from their initial values; and L is the number of iterations.

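S4, decoding the obtained features of the plurality of discrete objects with a linear decoder, so that the output matrix has the same dimensions as the input matrix, and obtaining the association prediction scores between discrete objects. This specifically includes the following: the result is decoded with a linear decoder, and the association prediction score P between discrete objects m and discrete objects d is given by formula (8):

$$P = \mathrm{sigmoid}\left(H_m W' H_d^T\right) \tag{8}$$

where $W'$ is the training weight matrix from the hidden layer to the output layer; sigmoid is a nonlinear activation function that keeps every predicted score in the range 0 to 1; and $H_d^T$ denotes the transpose of $H_d$.

S5, learning the parameters by minimizing a weighted binary cross entropy loss function, which reduces the decision bias caused by the sparsity of the data set. The loss function is calculated by formula (9):

$$Loss = -\frac{1}{M \times N}\left(\lambda \sum_{(i,j) \in P^+} \log P(i,j) + \sum_{(i,j) \in P^-} \log\left(1 - P(i,j)\right)\right) \tag{9}$$

where $(i, j)$ denotes the pair of discrete object data $m_i$ and discrete object data $d_j$; $P(i,j)$ is the predicted association score between discrete object $m_i$ and discrete object $d_j$; the influence factor $\lambda = |P^-| / |P^+|$ reduces the influence of the data imbalance between $P^+$ and $P^-$; $|P^+|$ is the number of known association pairs between all discrete objects m and d, and $|P^-|$ is the number of pairs for which no association has been found.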

Example 2

As shown in fig. 2, the present embodiment provides a discrete object data relevance prediction system 20 adopting the method described in embodiment 1, specifically including:

a discrete object data similarity calculation module 21, configured to calculate respective similarities of different discrete objects using the similarity calculation model;

A heterogeneous network construction module 22, configured to construct a heterogeneous network using similarity information of different discrete objects and known association relationships between the discrete objects;

a multi-head attention model building module 23, comprising a graph convolutional neural network encoder module 231, a multi-head attention mechanism module 232 and a linear decoder module 233, wherein: the graph convolutional neural network encoder module 231 is configured to encode discrete objects m and discrete objects d on the heterogeneous network with the graph convolutional neural network encoder, combining node similarity and node association information; the multi-head attention mechanism module 232 is configured to capture the node features of each graph convolution layer with multi-head attention, calculate the attention scores, and combine the multi-head attention of each layer to obtain the final embeddings of discrete objects m and discrete objects d; the linear decoder module 233 is configured to decode the obtained features of discrete objects m and discrete objects d with a linear decoder, so that the output matrix has the same dimensions as the input matrix, and to obtain the association prediction scores between the discrete objects;

and the optimization module 24 is used for taking the minimized weighted binary cross entropy as a loss function learning parameter and reducing decision deviation caused by sparse characteristics of the data set.

The embodiment of the invention also provides a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method shown in FIG. 1.

The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disk-read only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read only memories), EEPROMs (electrically erasable programmable read only memories), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing machine-executable instructions. The computer readable storage medium may be an article of manufacture that is not accessed by a computer device or may be a component used by an accessed computer device.

Example 3: application example 1, discrete object microorganism-human disease association prediction

In order to verify the effectiveness of the discrete object relevance prediction method based on the graph convolution multi-head attention, in the embodiment, microorganisms are taken as discrete objects m, human diseases are taken as discrete objects d, the known microorganism-human disease relevance is the known relevance among the discrete objects, and the potential relevance between the microorganisms and the human diseases is predicted.

In this embodiment, unlike prior art that uses a Gaussian kernel, the microorganism similarity is calculated from gene sequences with a gene sequence similarity calculation model. Specifically, the basic local alignment search tool BLAST (Basic Local Alignment Search Tool, http://www.ncbi.nim.nih.gov/BLAST) is used to measure the degree of identical alignment between two sequences by Identity (ID), and the similarity of microorganisms A' and B' is calculated by formula (10):

(10)

where the Identity term denotes the alignment of the gene sequences of microorganism A' and microorganism B'; ID denotes the degree of identical alignment of the microorganism sequences, and the results are stored in matrix form.

The similarity of human diseases is obtained with a directed acyclic graph similarity calculation model. Specifically, the disease data in the MeSH database are organized into a directed acyclic graph (DAG) of diseases, in which nodes represent diseases and edges represent the hierarchical relations between diseases. The directed acyclic graph of a disease A in this embodiment is given by formula (11):

$$DAG(A) = \left(T(A), E(A)\right) \tag{11}$$

where $T(A)$ is the set of all disease nodes in the subgraph, and $E(A)$ is the set of edges, that is, the semantic relations between disease A and the other nodes. The semantic value $DV(A)$ of disease A is then given by formula (12):

$$DV(A) = \sum_{n \in T(A)} D_A(n) \tag{12}$$

$D_A(n)$ is the semantic contribution of each ancestor node n in the DAG of disease A to disease A, and is calculated by formula (13):

$$D_A(n) = \begin{cases} 1, & n = A \\ \max\left\{\Delta \cdot D_A(n') \mid n' \in children(n)\right\}, & n \neq A \end{cases} \tag{13}$$

where $\Delta$ is the semantic contribution factor, with values in the range (0, 1), and $children(n)$ denotes the child nodes of n; the farther an ancestor node is from disease A in the DAG, the smaller its semantic contribution to A. The semantic similarity between diseases A and B is therefore calculated by formula (14):

$$S(A, B) = \frac{\sum_{n \in T(A) \cap T(B)} \left(D_A(n) + D_B(n)\right)}{DV(A) + DV(B)} \tag{14}$$
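
The DAG-based semantic similarity can be sketched in plain Python. The sketch assumes each disease contributes 1 to itself, each step up the hierarchy multiplies the contribution by the semantic contribution factor, and a node reachable along several paths keeps its largest contribution. The hierarchy, the factor value 0.5, and the function names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Return {node: contribution} for `disease` and all of its ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths down to `disease`.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a, b, parents, delta=0.5):
    """Shared-ancestor contributions divided by the two semantic values."""
    da = semantic_contributions(a, parents, delta)
    db = semantic_contributions(b, parents, delta)
    shared = set(da) & set(db)
    numerator = sum(da[n] + db[n] for n in shared)
    return numerator / (sum(da.values()) + sum(db.values()))

# Hypothetical hierarchy: each term maps to its list of parent terms.
parents = {"A": ["X"], "B": ["X"], "X": ["root"]}
print(semantic_similarity("A", "B", parents))  # about 0.4286
```

In this toy hierarchy both diseases share the ancestors X and root, so the numerator is 1.5 and the denominator is 3.5; a disease compared with itself scores exactly 1.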

A heterogeneous network is constructed from the microorganism similarity, the disease similarity and the known microorganism-disease associations. The microorganism-disease associations are expressed as a binary matrix $A \in \{0,1\}^{M \times N}$, where N and M denote the numbers of diseases and microorganisms, respectively. When there is a known association between microorganism $m_i$ and disease $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$. The microorganism similarity is expressed as the matrix $S^m \in \mathbb{R}^{M \times M}$, where $S^m_{ij}$ is the similarity of microorganisms $m_i$ and $m_j$; the disease similarity is expressed as the matrix $S^d \in \mathbb{R}^{N \times N}$, where $S^d_{ij}$ is the similarity of diseases $d_i$ and $d_j$. From the microorganism-disease association matrix A, the microorganism similarity matrix $S^m$ and the disease similarity matrix $S^d$, the adjacency matrix of the heterogeneous network is established. The specific construction method is described in Example 1.

The node similarity and node association information of the discrete objects are combined on the heterogeneous network by using a graph convolution neural network encoder, and microorganisms and diseases contained in the heterogeneous network are encoded; learning the characteristic embedding of the microorganisms and disease nodes of each convolution layer by using a multi-head attention mechanism to obtain the final embedding containing the microorganisms and the disease; decoding the obtained characteristics containing microorganisms and diseases by using a linear decoder so that the dimensions of an output matrix and an input matrix are the same, and obtaining a microorganism-disease associated prediction score; and the minimized weighted binary cross entropy is adopted as a loss function learning parameter, so that decision deviation caused by sparse characteristics of the data set is reduced. Specific methods are described in example 1.

In this example, the method was applied to the HMDAD dataset (comprising 483 experimentally confirmed microorganism-disease associations between 39 diseases and 292 microorganisms), with BRWMDA (microorganism-disease association prediction based on a random walk method) and WMGHDMA (microorganism-disease association prediction based on a network-based calculation method) as baseline methods.

The statistics of the HMDAD dataset are shown in Table 1.

Table 1 Dataset statistics

Many performance evaluation methods exist for association prediction, among which the area under the ROC curve (AUC) and the area under the PR curve (AUPR) are the most widely used. AUC and AUPR were chosen as the evaluation indicators in this example. TP denotes the number of known microorganism-disease pairs predicted to be associated, FP denotes the number of unknown microorganism-disease pairs predicted to be associated, FN denotes the number of known microorganism-disease pairs predicted not to be associated, and TN denotes the number of unknown microorganism-disease pairs predicted not to be associated. The true positive rate (TPR), false positive rate (FPR), Precision, and Recall may then be expressed as:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{15}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{16}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{17}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{18}$$

The ROC curve is obtained by taking FPR as the abscissa and TPR as the ordinate, and the area under the ROC curve is the AUC. The PR curve is obtained by taking Recall as the abscissa and Precision as the ordinate, and the area under the PR curve is the AUPR.
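For illustration, the four rates of formulas (15)-(18) can be computed at a single decision threshold as sketched below. The function name `rates_at_threshold`, the threshold value, and the toy labels and scores are hypothetical; the sketch assumes that both classes are present and that at least one pair is predicted positive, so that no denominator is zero.

```python
def rates_at_threshold(y_true, scores, threshold=0.5):
    """Count TP, FP, FN, TN at one threshold and derive the four rates."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
    }

# Toy labels (1 = known association) and prediction scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.1, 0.8, 0.2]
print(rates_at_threshold(y_true, scores))
```

Sweeping the threshold over all score values and plotting the resulting (FPR, TPR) and (Recall, Precision) points traces out the ROC and PR curves whose areas give AUC and AUPR.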

To verify that the method provided by the invention is superior to other baseline methods in prediction performance, BRWMDA and WMGHDMA are compared with the discrete object data association prediction method provided by this embodiment (named MAGMDA) in terms of AUC and AUPR. In the experiment of this embodiment, 5-fold cross-validation is used to evaluate the performance of the model: the known microorganism-human disease association dataset is randomly divided into 5 groups, and each group in turn serves as the test set while the other four groups serve as the training set. The model parameters of this embodiment are set as follows: the graph convolutional neural network adopts a 3-layer structure with 16 hidden nodes per layer, a learning rate of 0.001, a node dropout probability of 0.7, an edge dropout probability of 0.3, and a heterogeneous network penalty factor of 6. Tables 2 and 3 show the AUC, AUPR, Accuracy, and Precision of each method on the HMDAD dataset. On the HMDAD dataset, the MAGMDA method improves the AUC by 3.23%, the Accuracy by 1.22%, and the Precision by 0.41% over the next-best method. Overall, the MAGMDA method improves prediction performance compared with the other baseline methods.
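The 5-fold cross-validation procedure can be sketched as follows. This is a minimal illustration and not the experimental code of the embodiment: the function name `five_fold_splits` is hypothetical, and hiding the held-out associations by resetting them to 0 in the training matrix is an assumption about how the test pairs are withheld.

```python
import numpy as np

def five_fold_splits(A, n_splits=5, seed=0):
    """Yield (A_train, test_pairs) for each fold over the known associations in A."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(A == 1)              # (row, col) index of every known pair
    order = rng.permutation(len(known))      # random shuffle of the known pairs
    folds = np.array_split(order, n_splits)  # near-equal groups
    for fold in folds:
        test_pairs = known[fold]
        A_train = A.copy()
        # Assumption: held-out pairs are hidden by resetting them to 0.
        A_train[test_pairs[:, 0], test_pairs[:, 1]] = 0
        yield A_train, test_pairs

# Toy association matrix with 10 known pairs.
A = np.zeros((4, 5), dtype=int)
A.flat[[0, 3, 5, 7, 8, 11, 13, 16, 18, 19]] = 1
for A_train, test_pairs in five_fold_splits(A):
    print(A_train.sum(), len(test_pairs))
```

Each known association appears in exactly one test fold, so every pair is scored once by a model that did not see it during training.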

Table 2 Statistics of the prediction evaluation results for microorganism-human disease association

In summary, the discrete object association prediction method based on graph convolution and multi-head attention provided by the invention uses microorganism data and disease data, calculating microorganism similarity from gene sequences and disease similarity from a directed acyclic graph, so that more informative features of the discrete objects are captured. A heterogeneous network is constructed from the microorganism similarity, the disease similarity, and the known microorganism-disease associations; a graph convolutional neural network with a multi-head attention mechanism effectively captures the topological information of the heterogeneous network; and the parameters are learned by minimizing a weighted binary cross-entropy loss. The method can make full use of multi-source bioinformatics data, capture nonlinear microorganism-human disease associations, capture the heterogeneous network topology information of different convolution layers, reduce the decision bias caused by the sparsity of known microorganism-human disease association data, and improve the accuracy of microorganism-human disease association prediction.

Example 4: application example 2, discrete object drug-disease association prediction

In order to verify the effectiveness of the prediction method provided by the invention, in the embodiment, the medicine is taken as a discrete object m, the disease is taken as a discrete object d, the known medicine-disease association relationship is the known association relationship between the discrete objects, and the potential association relationship between the medicine and the disease is predicted.

The drug features comprise chemical molecular structure, drug-drug interactions, and drug targets. Each feature type $t$ is represented by its own feature matrix, whose dimensions are the number of features of that type and the number of drugs. Drug similarity is calculated using a cosine similarity calculation model, as shown in formula (19):

$$S^{t}_{ij} = \frac{x^{t}_{i} \cdot x^{t}_{j}}{\lVert x^{t}_{i} \rVert \, \lVert x^{t}_{j} \rVert} \tag{19}$$

where $S^{t}_{ij}$ denotes the similarity between the $i$-th and $j$-th drugs obtained from the $t$-th feature, and $x^{t}_{i}$ and $x^{t}_{j}$ denote the feature vectors of the $i$-th and $j$-th drugs for the $t$-th feature, respectively, with $t = 1, 2, 3$.
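A cosine similarity calculation model of this kind can be sketched in a few lines of NumPy. The sketch assumes that each row of the feature matrix holds the feature vector of one drug for a single feature type; the function name `cosine_similarity_matrix` and the toy feature matrix are hypothetical, and all-zero feature vectors are given similarity 0 to every other drug as a convenience.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (one row per drug)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty feature vectors
    X_unit = X / norms               # scale every row to unit length
    return X_unit @ X_unit.T         # dot products of unit vectors = cosines

# Toy binary feature matrix: 3 drugs, 2 features of one feature type.
X = np.array([[1, 0],
              [0, 1],
              [1, 1]])
S = cosine_similarity_matrix(X)
print(np.round(S, 3))
```

Applying the same function to the feature matrix of each feature type yields one similarity matrix per type.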

The similarity of the diseases is calculated from the directed acyclic graph calculation model, see example 3, and in particular from formulas (11) - (14).

A heterogeneous network is constructed from the drug similarity, the disease similarity, and the known drug-disease associations. The drug-disease associations are expressed as a binary matrix $A \in \{0,1\}^{M \times N}$, where $M$ and $N$ denote the number of drugs and diseases, respectively. When a known association exists between drug $m_i$ and disease $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$.

The drug similarity is expressed as a matrix $S^m \in \mathbb{R}^{M \times M}$, where $S^m_{ij}$ represents the similarity between drugs $m_i$ and $m_j$; the disease similarity is expressed as a matrix $S^d \in \mathbb{R}^{N \times N}$, where $S^d_{ij}$ represents the similarity between diseases $d_i$ and $d_j$. Based on the drug-disease association matrix $A$, the drug similarity matrix $S^m$, and the disease similarity matrix $S^d$, an adjacency matrix $G$ is established. The specific construction method is shown in Example 1.

A graph convolutional neural network encoder is deployed on the heterogeneous network to combine the node similarity and node association information of the discrete objects and to encode the drugs and diseases contained in the network. A multi-head attention mechanism learns the feature embeddings of the drug and disease nodes at each convolution layer to obtain the final drug and disease embeddings. A linear decoder then decodes these embeddings, so that the output matrix has the same dimensions as the input matrix, to obtain the drug-disease association prediction scores. The parameters are learned by minimizing a weighted binary cross-entropy loss, which reduces the decision bias caused by the sparsity of the dataset. The specific methods are described in Example 1.

This example uses SCMFDD (drug-disease association prediction based on a binary network) and nimgcn (drug-disease association prediction based on a graph convolutional network and a neural inductive matrix) as baseline methods, and applies the method to the Ldataset dataset (comprising 18416 experimentally verified drug-disease associations between 269 drugs and 598 diseases). The statistics of the Ldataset dataset are shown in Table 3.

Table 3 Ldataset dataset statistics

To verify that the method provided by the invention is superior to other baseline methods, SCMFDD and nimgcn are compared with the discrete object data association prediction method provided by this example (named MAGGCN) in terms of AUC and AUPR. In this embodiment, 5-fold cross-validation is used to evaluate the performance of the model: the known drug-human disease association dataset is randomly divided into 5 equal groups, and each group in turn serves as the test set while the other four groups serve as the training set. The model parameters of this embodiment are set as follows: the graph convolutional neural network adopts a 2-layer structure with 64 hidden nodes per layer, a learning rate of 0.01, a node dropout probability of 0.7, an edge dropout probability of 0.3, and a heterogeneous network penalty factor of 6. Table 4 and Fig. 4 show the AUC, AUPR, Accuracy, and Precision of each method on the Ldataset dataset. On the Ldataset dataset, the MAGGCN method improves the AUC and AUPR by 3.9% and 3%, respectively, and the Precision by 1.4% over the next-best method. Overall, the MAGGCN method improves prediction performance compared with the other baseline methods.

Table 4 Statistics of the prediction evaluation results for drug-disease association

In summary, the discrete object data association prediction method provided by the invention uses drug data and disease data, calculating drug similarity with cosine similarity and disease similarity with a directed acyclic graph, so that more informative features of the discrete objects are captured. A heterogeneous network is constructed from the drug similarity, the disease similarity, and the known drug-disease associations; a graph convolutional neural network with a multi-head attention mechanism effectively captures the topological information of the heterogeneous network; and the parameters are learned by minimizing a weighted binary cross-entropy loss. The method can make full use of the various feature information of drugs and the semantic information of diseases, capture nonlinear drug-human disease associations, capture the heterogeneous network topology information of different convolution layers, reduce the decision bias caused by the sparsity of known drug-disease association data, and improve the accuracy of drug-human disease association prediction.

Example 5: application example 3, association prediction of different commodities

To verify the effectiveness of the prediction method provided by the invention, this embodiment takes commodities as the discrete objects and the purchase of different commodities in the same order as the association relationship, and predicts the potential associations between different commodities. Commodity similarity is calculated from commodity features such as attributes, types, and functions, and is represented as a similarity matrix. A heterogeneous network is constructed from the commodity similarity and the known associations between commodities in orders, with the commodity associations represented by a binary matrix. A graph convolutional network is deployed on the heterogeneous network to combine the similarity and association information of the commodities. A multi-head attention mechanism is added to each graph convolution layer to capture the specific representation of the commodities, and the final feature embedding of the commodities is obtained by combining the attention scores of the convolution layers. A linear decoder decodes the feature embedding to obtain the association prediction scores of the commodities. The parameters are learned by minimizing a weighted binary cross-entropy loss. For the specific methods, refer to the prediction method provided in Example 1 and the discrete object data similarity calculation models provided in Examples 3 and 4.
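The binary commodity association matrix can be derived from order data as sketched below. The representation of an order as a list of integer commodity indices, the function name `co_purchase_matrix`, and the toy orders are assumptions made for illustration; the sketch marks two different commodities as associated whenever they appear together in at least one order.

```python
from itertools import combinations

import numpy as np

def co_purchase_matrix(orders, n_items):
    """Binary matrix with A[i, j] = 1 if commodities i and j share at least one order."""
    A = np.zeros((n_items, n_items), dtype=int)
    for order in orders:
        # Deduplicate within an order, then mark every unordered pair.
        for i, j in combinations(sorted(set(order)), 2):
            A[i, j] = 1
            A[j, i] = 1
    return A

# Toy data: 5 commodities, 3 orders (each order lists commodity indices).
orders = [[0, 1, 2], [2, 3], [4]]
A = co_purchase_matrix(orders, n_items=5)
print(A)
```

A commodity that only ever appears alone in an order, such as commodity 4 in the toy data, has no known associations and is connected to the rest of the network only through the similarity information.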

The method of this embodiment is applied to the Jingdong (JD.com) e-commerce dataset JDdataset (comprising 11212 commodities, 141 commodity categories, and 240332 orders). The commodity association prediction results of the method are shown in Fig. 5, where nodes represent commodities and an edge between two nodes indicates that the two commodities are predicted to be associated.

In summary, the discrete object association prediction method based on graph convolution and multi-head attention provided by the invention can make full use of the various feature information of commodities, capture nonlinear associations between commodities, capture the heterogeneous network topology information of different convolution layers, and provide reasonable commodity recommendations for the recommendation pages of an e-commerce platform, thereby improving e-commerce sales.

The above describes only a few preferred embodiments of the invention and should not be taken as limiting it; all modifications, equivalents, and improvements made within the spirit and principles of the invention are intended to be included within its scope.

Claims

1. A method for predicting the association of a microorganism-human disease, comprising the steps of:

S1, calculating the similarity between microorganisms and the similarity between human diseases, respectively, the microorganisms and human diseases being the discrete objects, and constructing a heterogeneous network together with the known association relationships between microorganisms and human diseases;

S2, combining the node similarity and node association information of microorganisms and human diseases on the heterogeneous network by using a graph convolutional neural network encoder, and encoding the microorganisms and human diseases contained in the heterogeneous network;

S3, learning the feature embeddings of the microorganism and human disease nodes at each convolution layer by using a multi-head attention mechanism to obtain the final embeddings of the microorganisms and human diseases;

S4, decoding the obtained features of the microorganisms and human diseases by using a linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain microorganism-human disease association prediction scores;

2. The method for predicting microbial-human disease association according to claim 1, wherein step S1 specifically comprises:

calculating the similarity of microorganisms and the similarity of human diseases, respectively, by using similarity calculation models according to the data characteristics of the microorganisms and human diseases; the similarity of microorganisms is represented by a matrix $S^m$, and the similarity of human diseases is represented by a matrix $S^d$;

describing the known associations between microorganisms and human diseases as a binary matrix $A \in \{0,1\}^{M \times N}$, wherein $M$ and $N$ represent the number of microorganisms and human diseases, respectively; when a known association exists between discrete object microorganism data $m_i$ and human disease data $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$;

constructing the heterogeneous network based on the association matrix $A$ of microorganisms and human diseases, the microorganism similarity matrix $S^m$, and the human disease similarity matrix $S^d$, the heterogeneous network being represented by an adjacency matrix $G$ given by the following formula (1):

(1)

wherein the normalized similarity matrices in formula (1) are obtained by normalizing the microorganism similarity matrix $S^m$ and the human disease similarity matrix $S^d$, respectively;

$S^m \in \mathbb{R}^{M \times M}$ and $S^d \in \mathbb{R}^{N \times N}$, wherein $S^m_{ij}$ represents the similarity between microorganism data $m_i$ and $m_j$, and $S^d_{ij}$ represents the similarity between human disease data $d_i$ and $d_j$.

3. The method for predicting microbial-human disease association according to claim 2, wherein step S2 specifically comprises:

deploying a graph convolutional neural network encoder on the heterogeneous network $G$ to combine node similarity and node association information, with the input set according to the following formula (2):

(2)

wherein ,

(3)

wherein: $H^{(l)}$ and $H^{(l+1)}$ are the features of the nodes at layer $l$ and layer $l+1$, respectively; $D$ is the degree matrix of $G$, and $G_{ij}$ represents the element in the $i$-th row and $j$-th column of $G$; $W^{(l)}$ is the weight matrix used in training from layer $l$ to layer $l+1$; and $\sigma$ is a nonlinear activation function;

The adjacency matrix G is normalized, and the propagation formula is initialized as follows:

；

according to the above propagation equation (3) and the setup for propagation equation initialization, the first layer GCN encoder is further described as the following equation (4):

(4)

wherein ：

is a training weight matrix from the input layer to the hidden layer;

Is the feature matrix of the hidden layer, +.>

Is the number of dimensions of the feature; g is an adjacency matrix.

4. The method for predicting microbial-human disease association according to claim 3, wherein step S3 specifically comprises: capturing a specific representation of microbial and human disease by adding a multi-headed attention score to each of the layers of the graph, the attention score of each layer being represented by the following formula (5):

(5)

wherein ：

is a parametric function->

Is->

Training weight matrix of layer, +.>

and

Respectively represent +.>

The nodal output of the microbial, human disease of the layer, normalized to all attention scores using a softmax function, which is given by the following formula (6):

(6)

wherein ：

、

(7)

wherein ：

is the characteristic of microorganism data after being encoded;

Is the characteristic of encoded human disease data +.>

Parameters for automatic learning of neural networks, +.>

Is a parameter for the layer 1 network to automatically learn; initializing to

L is the number of iterations.

5. The method for predicting microbial-human disease association according to claim 4, wherein step S4 specifically comprises: decoding the result using a linear decoder, the associated predictive score P between the discrete object microorganism and the human disease is represented by the following formula (8):

(8)

wherein ：

is the training weight matrix from the hidden layer to the output layer; the sigmoid function is a nonlinear activation function that keeps all prediction results within the range 0 to 1; $H_d^{\top}$ represents the transpose of $H_d$.

6. The method for predicting microbial-human disease association according to claim 5, wherein step S5 specifically comprises: the calculation formula for minimizing weighted binary cross entropy as a loss function is as follows (9):

(9)

wherein: $(i, j)$ represents the pair formed by microorganism data $m_i$ and human disease data $d_j$; $P(i, j)$ represents the predicted association score between microorganism data $m_i$ and human disease data $d_j$; and an influence factor is used to reduce the effect of the data imbalance between the set of all known microorganism-human disease association pairs and the set of undiscovered microorganism-human disease association pairs.

7. The method for predicting microbial-human disease association according to claim 2, wherein,

the similarity calculation model comprises a directed acyclic graph similarity calculation model and a cosine similarity calculation model.

8. A system for predicting the association of a microorganism-human disease, comprising:

a data similarity calculation module for calculating the similarity between microorganisms and the similarity between human diseases using the similarity calculation model;

the heterogeneous network construction module is used for constructing a heterogeneous network by utilizing the similarity between microorganisms and the similarity between human diseases and the known association relationship between microorganisms and human diseases;

the multi-head attention model building module, which comprises a graph convolutional neural network encoder module, a multi-head attention mechanism module, and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding microorganisms and human diseases by using the graph convolutional neural network encoder on the heterogeneous network to combine node similarity and node association information; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer by using multi-head attention, calculating attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of the microorganisms and human diseases; the linear decoder module is used for decoding the obtained features of the microorganisms and human diseases by using the linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain the association prediction scores between microorganisms and human diseases;

9. A computer storage medium, characterized in that a computer program is stored thereon, wherein the computer program, when executed by a processor, implements the method for predicting the association of a microorganism-human disease according to any one of claims 1-7.

10. The medicine-disease association prediction method is characterized by comprising the following steps:

S1, calculating the similarity between drugs and the similarity between diseases, respectively, the drugs and diseases being the discrete objects, and constructing a heterogeneous network together with the known association relationships between drugs and diseases;

S2, combining the node similarity and node association information of drugs and diseases on the heterogeneous network by using a graph convolutional neural network encoder, and encoding the drugs and diseases contained in the heterogeneous network;

S3, learning the feature embeddings of the drug and disease nodes at each convolution layer by using a multi-head attention mechanism to obtain the final embeddings of the drugs and diseases;

S4, decoding the obtained features of the drugs and diseases by using a linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain drug-disease association prediction scores;

11. The method of claim 10, wherein step S1 specifically comprises:

calculating the similarity of drugs and the similarity of diseases, respectively, by using similarity calculation models according to the data characteristics of the drugs and diseases; the similarity of drugs is represented by a matrix $S^m$, and the similarity of diseases is represented by a matrix $S^d$;

describing the known associations between drugs and diseases as a binary matrix $A \in \{0,1\}^{M \times N}$, wherein $M$ and $N$ represent the number of drugs and diseases, respectively; when a known association exists between discrete object drug data $m_i$ and disease data $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$;

constructing the heterogeneous network based on the association matrix $A$ of drugs and diseases, the drug similarity matrix $S^m$, and the disease similarity matrix $S^d$, the heterogeneous network being represented by an adjacency matrix $G$ given by the following formula (1):

(1)

wherein the normalized similarity matrices in formula (1) are obtained by normalizing the drug similarity matrix $S^m$ and the disease similarity matrix $S^d$, respectively;

$S^m \in \mathbb{R}^{M \times M}$ and $S^d \in \mathbb{R}^{N \times N}$, wherein $S^m_{ij}$ represents the similarity between drug data $m_i$ and $m_j$, and $S^d_{ij}$ represents the similarity between disease data $d_i$ and $d_j$.

12. The method of claim 11, wherein step S2 specifically comprises:

by being in heterogeneous networks

(2)

wherein ,

as penalty factors, the similarity contribution in the GCN propagation process can be controlled >

(3)

wherein ：H^(l) ，H ^(l+1) Respectively the first

、

Features of the tier nodes;

Layer to->

Weight matrix used in layer training, +.>

A nonlinear activation function;

；

(4)

wherein ：

is a training weight matrix from the input layer to the hidden layer;

Is the feature matrix of the hidden layer, +.>

Is the number of dimensions of the feature; g is an adjacency matrix.

13. The method of claim 12, wherein step S3 specifically comprises: capturing a specific representation of the drug and disease by adding multiple attention scores to each of the graph convolution layers, the attention score of each layer being represented by the following formula (5):

(5)

wherein ：

is a parametric function->

Is->

Training weight matrix of layer, +.>

and

Respectively represent +.>

The nodal output of drug, disease of the layer, normalized all attention scores using a softmax function, which is given by the following formula (6):

(6)

wherein ：

、

(7)

wherein ：

is the characteristic of the coded drug data;

Is the characteristic of encoded disease data +.>

Parameters for automatic learning of neural networks, +.>

Is a parameter for the layer 1 network to automatically learn; initializing to

L is the number of iterations.

14. The method of claim 13, wherein step S4 specifically comprises: decoding the result using a linear decoder, the associated predictive score P between the discrete subject drug and the disease is represented by the following formula (8):

(8)

wherein ：

Represents H _d Is a transposed matrix of (a).

15. The method of claim 14, wherein step S5 specifically comprises: the calculation formula for minimizing weighted binary cross entropy as a loss function is as follows (9):

(9)

wherein: $(i, j)$ represents the pair formed by drug data $m_i$ and disease data $d_j$; $P(i, j)$ represents the predicted association score between drug data $m_i$ and disease data $d_j$; and an influence factor is used to reduce the effect of the data imbalance between the set of all known drug-disease association pairs and the set of undiscovered drug-disease association pairs.

16. The method of claim 15, wherein the method comprises the steps of,

17. A drug-disease association prediction system, characterized in that it employs the drug-disease association prediction method according to any one of claims 10 to 16, and specifically comprises:

the data similarity calculation module is used for calculating the similarity between medicines and the similarity between diseases by using the similarity calculation model;

the heterogeneous network construction module is used for constructing a heterogeneous network by utilizing the similarity between medicines and the similarity between diseases and the known association relationship between medicines and diseases;

the multi-head attention model building module, which comprises a graph convolutional neural network encoder module, a multi-head attention mechanism module, and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding drugs and diseases by using the graph convolutional neural network encoder on the heterogeneous network to combine node similarity and node association information; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer by using multi-head attention, calculating attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of the drugs and diseases; the linear decoder module is used for decoding the obtained features of the drugs and diseases by using a linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain the drug-disease association prediction scores;

18. A computer storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of predicting drug-disease association according to any one of claims 10-16.

19. The method for predicting the relevance of different commodities is characterized by comprising the following steps of:

S1, calculating the similarity between first-class commodities and the similarity between second-class commodities, respectively, and constructing a heterogeneous network together with the known association relationships between first-class commodities and second-class commodities in orders;

S2, combining the node similarity and node association information of the first-class commodities and second-class commodities on the heterogeneous network by using a graph convolutional neural network encoder, and encoding the first-class commodities and second-class commodities contained in the heterogeneous network;

S3, learning the feature embeddings of the first-class commodity and second-class commodity nodes at each convolution layer by using a multi-head attention mechanism to obtain the final embeddings of the first-class commodities and second-class commodities;

S4, decoding the obtained features of the first-class commodities and second-class commodities by using a linear decoder, so that the output matrix has the same dimensions as the input matrix, to obtain first-class commodity-second-class commodity association prediction scores;

20. The method for predicting relevance of different types of commodities according to claim 19, wherein step S1 specifically includes:

calculating the similarity of first-class commodities and the similarity of second-class commodities, respectively, by using similarity calculation models according to the data characteristics of the first-class commodities and second-class commodities; the similarity of first-class commodities is represented by a matrix $S^m$, and the similarity of second-class commodities is represented by a matrix $S^d$;

describing the known associations between first-class commodities and second-class commodities as a binary matrix $A \in \{0,1\}^{M \times N}$, wherein $M$ and $N$ represent the number of first-class commodities and second-class commodities, respectively; when a known association exists between discrete object first-class commodity data $m_i$ and second-class commodity data $d_j$, $A_{ij} = 1$; otherwise, $A_{ij} = 0$;

constructing the heterogeneous network based on the association matrix $A$ of first-class commodities and second-class commodities, the first-class commodity similarity matrix $S^m$, and the second-class commodity similarity matrix $S^d$, the heterogeneous network being represented by an adjacency matrix $G$ given by the following formula (1):

(1)

wherein the normalized similarity matrices in formula (1) are obtained by normalizing the first-class commodity similarity matrix $S^m$ and the second-class commodity similarity matrix $S^d$, respectively;

$S^m \in \mathbb{R}^{M \times M}$ and $S^d \in \mathbb{R}^{N \times N}$, wherein $S^m_{ij}$ represents the similarity between first-class commodity data $m_i$ and $m_j$, and $S^d_{ij}$ represents the similarity between second-class commodity data $d_i$ and $d_j$.

21. The method for predicting relevance of different types of commodities according to claim 20, wherein step S2 specifically includes:

by being in heterogeneous networks

(2)

wherein ,

(3)

wherein ：H^(l) ，H ^(l+1) Respectively the first

、

Features of the tier nodes;

Layer to->

Weight matrix used in layer training, +.>

A nonlinear activation function; / >

；

(4)

wherein ：

is a training weight matrix from the input layer to the hidden layer;

Is the feature matrix of the hidden layer, +.>

Is the number of dimensions of the feature; g is an adjacency matrix.

22. The method for predicting relevance of different types of commodities according to claim 21, wherein step S3 specifically includes: capturing a specific representation of the first type of commodity and the second type of commodity by adding a multi-headed attention score to each of the layers of the graph, the attention score of each layer being represented by the following formula (5):

(5)

wherein ：

is a parametric function->

Is->

Training weight matrix of layer, +.>

and

Respectively represent +.>

The node outputs of the first class commodity and the second class commodity of the layer normalize all attention scores by using a softmax function, wherein the softmax function is as follows (6):

(6)

wherein ：

、

(7)

wherein ：

is the characteristic of the first commodity data after being encoded;

Is the characteristic of the second commodity data after coding>

Parameters for automatic learning of neural networks, +.>

Is a parameter for the layer 1 network to automatically learn; initializing to

L is the number of iterations.

23. The method for predicting relevance of different types of commodities according to claim 22, wherein step S4 specifically includes: decoding the result using a linear decoder, the associated prediction score P between the first type of commodity and the second type of commodity being represented by the following formula (8):

(8)

wherein: $H_d^{T}$ represents the transposed matrix of $H_d$.
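A linear decoding step can be sketched as follows. The bilinear form $P = H_m W H_d^{T}$ is an assumption made for illustration: the text confirms only that the decoder is linear and uses the transpose of $H_d$. The weight matrix `W` and the function names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def linear_decode(H_m, W, H_d):
    """Score every (first-class, second-class) pair as H_m @ W @ H_d^T."""
    return matmul(matmul(H_m, W), transpose(H_d))


# Toy example: three first-class and two second-class commodities, two features.
H_m = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_d = [[2.0, 0.0], [0.0, 3.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so P reduces to H_m @ H_d^T
P = linear_decode(H_m, W, H_d)
```

The score matrix has one row per first-class commodity and one column per second-class commodity, so it has the same shape as the known-association matrix.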

24. The method for predicting relevance of different types of commodities according to claim 23, wherein step S5 specifically includes: the calculation formula for minimizing weighted binary cross entropy as a loss function is as follows (9):

(9)

wherein: $(i, j)$ denotes a pair consisting of first-class commodity data $i$ and second-class commodity data $j$; $P(i, j)$ denotes the predicted relevance score between first-class commodity data $i$ and second-class commodity data $j$; the influence factor is used to reduce the influence of data imbalance between the two sets of pairs; and the number of known association pairs is counted over all first-class and second-class commodities.
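A weighted binary cross entropy of this kind can be sketched as follows. The sketch assumes that the influence factor is the ratio of the number of unknown pairs to the number of known association pairs and that it multiplies the terms of the known pairs; formula (9) itself is not reproduced here, so this weighting, the averaging over all pairs, and the names `weighted_bce`, `lam` and `eps` are hypothetical. Scores are assumed to lie in (0, 1).

```python
import math


def weighted_bce(P, Y, eps=1e-12):
    """Weighted binary cross entropy over all (i, j) pairs.

    P: predicted scores in (0, 1); Y: 0/1 known-association matrix.
    Assumes Y contains at least one known association.
    """
    n_rows, n_cols = len(Y), len(Y[0])
    n_pos = sum(sum(row) for row in Y)
    n_neg = n_rows * n_cols - n_pos
    lam = n_neg / n_pos  # assumed influence factor
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            p = min(max(P[i][j], eps), 1.0 - eps)  # clamp to avoid log(0)
            if Y[i][j] == 1:
                total += lam * math.log(p)
            else:
                total += math.log(1.0 - p)
    return -total / (n_rows * n_cols)


# Toy example: one known association among four pairs.
Y = [[1, 0], [0, 0]]
loss_uninformed = weighted_bce([[0.5, 0.5], [0.5, 0.5]], Y)
loss_good = weighted_bce([[0.9, 0.1], [0.1, 0.1]], Y)
```

Under this weighting, the single known pair contributes as much to the loss as the three unknown pairs combined, which is how the sparsity of known associations is compensated.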

25. The method of claim 24, wherein the step of predicting relevance of different types of merchandise,

26. A system for predicting relevance of different types of commodities, which is characterized by adopting the method for predicting relevance of different types of commodities according to any one of claims 19 to 25, and specifically comprising:

the data similarity calculation module is used for calculating the similarity between the first type of commodities and the similarity between the second type of commodities by using a similarity calculation model;

the heterogeneous network construction module is used for constructing a heterogeneous network by utilizing the similarity between the first type commodities, the similarity between the second type commodities and the known association relationship between the first type commodities and the second type commodities;

the multi-head attention model building module comprises a graph convolutional neural network encoder module, a multi-head attention mechanism module and a linear decoder module, wherein: the graph convolutional neural network encoder module is used for encoding the first-class commodities and the second-class commodities by using the graph convolutional neural network encoder to combine node similarity and node association information on the heterogeneous network; the multi-head attention mechanism module is used for capturing the node features of each graph convolution layer by using multi-head attention, calculating attention scores, and combining the multi-head attention of each layer to obtain the final embeddings of the first-class commodities and the second-class commodities; the linear decoder module is used for decoding the obtained features of the first-class commodities and the second-class commodities by using the linear decoder, so that the output matrix and the input matrix have the same dimension, and obtaining the relevance prediction score between the first-class commodities and the second-class commodities.

27. A computer storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for predicting relevance of different types of commodities according to any one of claims 19 to 25.