CN109101783B - Cancer network marker determination method and system based on probability model

Info

Publication number: CN109101783B
Application number: CN201810920673.7A
Authority: CN
Inventors: 杜玉改; 刘文斌
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2018-08-14
Filing date: 2018-08-14
Publication date: 2020-09-04
Anticipated expiration: 2038-08-14
Also published as: CN109101783A

Abstract

The invention discloses a cancer network marker determination method and a system based on a probability model, wherein the method comprises the following steps: converting all the obtained gene expression data matrixes of the normal samples and the disease samples into likelihood matrixes by using a probability density function, and constructing a normal sample distribution function according to all the likelihood matrixes of the normal samples; and then each element in the likelihood matrix of each disease sample is brought into a normal sample distribution function, a significant difference gene set of each disease sample is determined, the significant difference gene set of each disease sample is mapped into a protein-protein interaction network, and a network marker of each disease sample is determined. By applying the method or the system provided by the invention, the cancer network markers can be accurately and effectively obtained, and the cancer network markers are utilized to classify the subtype of the disease, so that the accurate diagnosis and treatment of the disease are realized.

Description

Cancer network marker determination method and system based on probability model

Technical Field

The invention relates to the technical field of gene detection, in particular to a cancer network marker determination method and system based on a probability model.

Background

Research has shown that cancer develops through the combined action of multiple genes. Traditional gene expression profile data suffer from high noise, small sample sizes, and an imbalance between positive and negative samples, so combining expression profile data with a biological network to determine cancer network markers has become a promising approach. In addition, network markers are more efficient and more stable than traditional single-gene markers.

Disclosure of Invention

The invention provides a cancer network marker determination method and system based on a probability model on the basis of considering heterogeneity among samples and difference of diseases among different patients due to different factors such as pathogenesis and the like. The invention can accurately and effectively obtain the cancer network markers, and classify diseases by using the cancer network markers so as to realize accurate diagnosis and treatment of the diseases.

In order to achieve the purpose, the invention provides the following scheme:

a cancer network marker determination method based on a probabilistic model, the cancer network marker determination method comprising:

acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; elements in the gene expression data matrix are gene expression amount;

converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods;

constructing a normal sample distribution function according to all the normal sample likelihood matrixes;

sequentially substituting each element in each disease sample likelihood matrix into the normal sample distribution function, and determining a significant difference gene set of each disease sample;

and mapping the significant difference gene set of each disease sample into a protein-protein interaction network in turn, and determining the network marker of each disease sample.

Optionally, the cancer network marker determination method further comprises:

classifying the disease samples into different subtypes according to the network markers of each disease sample and the known cancer subtype prior data.

Optionally, the transforming, by using a probability density function, the gene expression data matrices of all the normal samples into normal sample likelihood matrices, and transforming the gene expression data matrices of all the disease samples into disease sample likelihood matrices specifically includes:

constructing a gene likelihood calculation model by using a probability density function; the expression of the gene likelihood calculation model is

wherein $\lambda_i$ represents the likelihood of gene i; the model takes as input the expression level of the i-th gene in the j-th sample; $f_i^1$ represents the normal distribution curve of gene i under the disease samples; $f_i^2$ represents the normal distribution curve of gene i under the normal samples;

and converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes according to the gene likelihood calculation model.

Optionally, the constructing a normal sample distribution function according to all the normal sample likelihood matrices specifically includes:

calculating the mean value and the variance of each gene likelihood according to all the normal sample likelihood matrixes;

and constructing a normal distribution function of each gene likelihood under a normal sample according to the mean value and the variance of the gene likelihood.

Optionally, the step of sequentially substituting each element in each disease sample likelihood matrix into the normal sample distribution function to determine a significant difference gene set of each disease sample includes:

sequentially bringing each element in the disease sample likelihood matrix into the normal sample distribution function, and calculating the probability value of each gene in each disease sample;

judging whether the probability value is less than or equal to a set threshold value or not;

if yes, determining the genes corresponding to the probability values smaller than or equal to the set threshold value as the significant difference genes of the disease sample.

Optionally, the mapping the significantly different gene sets of each disease sample to a protein-protein interaction network in sequence to determine a network marker of each disease sample specifically includes:

and mapping the significant difference gene set of each disease sample to a protein-protein interaction network in sequence, and, according to the interaction relationships among the genes, selecting the five genes with the largest number of connected genes, together with the first-order neighbor nodes of those five genes, as the network marker of the disease sample.

The present invention also provides a cancer network marker determination system based on a probabilistic model, the cancer network marker determination system comprising:

the gene expression data matrix acquisition module is used for acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; elements in the gene expression data matrix are gene expression amount;

the gene expression data matrix conversion module is used for converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods;

the normal sample distribution function building module is used for building a normal sample distribution function according to all the normal sample likelihood matrixes;

the significant difference gene set determining module is used for sequentially substituting each element in the likelihood matrix of each disease sample into the normal sample distribution function to determine a significant difference gene set of each disease sample;

and the network marker determining module is used for mapping the significant difference gene set of each disease sample into a protein-protein interaction network in sequence and determining the network marker of each disease sample.

Optionally, the cancer network marker determination system further comprises:

and the disease subtype classification module is used for classifying different subtypes of the disease samples according to the network markers of each disease sample and known cancer subtype prior data.

Optionally, the gene expression data matrix transformation module specifically includes:

the gene likelihood calculation model building unit is used for building a gene likelihood calculation model by utilizing a probability density function; the expression of the gene likelihood calculation model is

wherein $\lambda_i$ represents the likelihood of gene i;

and the transformation unit is used for transforming the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and transforming the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes according to the gene likelihood calculation model.

Optionally, the significantly different gene set determining module specifically includes:

the probability value calculating unit is used for sequentially substituting each element in the disease sample likelihood matrix into the normal sample distribution function and calculating the probability value of each gene in each disease sample;

the judging unit is used for judging whether the probability value is less than or equal to a set threshold value or not;

and the significant difference gene set determining unit is used for determining the genes corresponding to the probability values which are less than or equal to the set threshold value as the significant difference genes of the disease sample.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention provides a cancer network marker determining method and a system based on a probability model, wherein the cancer network marker determining method comprises the following steps: acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples, and converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; then, according to all the normal sample likelihood matrixes, a normal sample distribution function is constructed, each element in each disease sample likelihood matrix is sequentially brought into the normal sample distribution function, and a significant difference gene set of each disease sample is determined; and finally, mapping the significant difference gene set of each disease sample to a protein-protein interaction network in sequence, and determining the network marker of each disease sample. By applying the method or the system provided by the invention, the cancer network markers can be accurately and effectively obtained, and the cancer network markers are utilized to classify the subtype of the disease, so that the accurate diagnosis and treatment of the disease are realized.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a cancer network marker determination method based on a probabilistic model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the present invention based on probabilistic model for determining cancer network markers;

FIG. 3 is a schematic diagram of a network marker selected by the present invention;

FIG. 4 is a graph of the relationship of individual subtype partial markers obtained for cancer UCEC;

FIG. 5 is a graph of the result of subtype classification of cancer UCEC;

FIG. 6 is a sample number distribution graph of individual subtypes of cancer UCEC;

FIG. 7 is a graph of survival for individual subtypes of cancer UCEC;

FIG. 8 is a schematic structural diagram of a cancer network marker determination system based on a probabilistic model according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

To overcome data noise, the present invention assumes that the expression profile data for each gene in a particular population or phenotype follows a normal distribution. Based on this assumption, the original gene expression profile data matrix can be converted into a likelihood matrix. The invention determines the significant difference genes in each disease sample through the likelihood matrix, and projects the significant difference genes into a protein-protein interaction (PPI) network to obtain the network marker of each disease sample.

Because of differing factors such as pathogenesis, the same disease presents differently in different patients, and traditional disease classification cannot represent all disease samples well. A finer sub-classification of these classical diseases is therefore of great biological importance for diagnosis and treatment. The markers of all the disease samples are integrated to obtain an integrated likelihood matrix of the cancer markers, and the disease samples are classified into different subtypes by using the ConsensusClusterPlus method of the R language in combination with the existing cancer subtype information.

Based on the above, the main idea of the present invention is to introduce probability density function and combine with the idea of single sample, to screen the network markers of each disease sample, and to classify different subtypes of cancer by using the markers specific to these samples and the clinical information of the samples.

Fig. 1 is a schematic flowchart of a cancer network marker determining method based on a probabilistic model according to an embodiment of the present invention, and as shown in fig. 1, the cancer network marker determining method based on the probabilistic model according to an embodiment of the present invention includes the following steps.

Step 101: acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; the elements in the gene expression data matrix are gene expression levels.

Step 102: converting the gene expression data matrix of each normal sample into a normal sample likelihood matrix by using a probability density function, and converting the gene expression data matrix of each disease sample into a disease sample likelihood matrix; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods.

Step 103: constructing a normal sample distribution function according to all the normal sample likelihood matrixes.

Step 104: and substituting each element in each disease sample likelihood matrix into the normal sample distribution function to determine a significant difference gene set of each disease sample.

Step 105: mapping the significant difference gene set of each disease sample into a protein-protein interaction network, and determining network markers of each disease sample.

Step 106: classifying the disease samples into different subtypes according to the network markers of each disease sample and the known cancer subtype prior data.

The data in the gene expression data matrix for the disease samples in step 101 are obtained from The Cancer Genome Atlas (TCGA) database.

Step 102 specifically includes:

where $\lambda_i$ represents the likelihood of gene i; the model takes as input the expression level of the i-th gene in the j-th sample, where i is the gene index and j is the sample index; $f_i^1$ represents the normal distribution curve of gene i under the disease samples; $f_i^2$ represents the normal distribution curve of gene i under the normal samples; and the superscripts 1 and 2 denote disease and normal, respectively.

The method specifically comprises the following steps: computing the mean and the variance of the expression level of each gene over the normal samples and over the disease samples, and constructing the normal distribution curves $f_i^2$ and $f_i^1$ of each gene under the normal samples and the disease samples, wherein the normal distribution function is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where x is the expression level, $\mu$ is the mean, and $\sigma$ is the standard deviation; then constructing the gene likelihood calculation model based on the normal distribution curves $f_i^2$ and $f_i^1$ of each gene under the normal samples and the disease samples.

And converting the gene expression data matrix of each normal sample into a normal sample likelihood matrix according to the gene likelihood calculation model, and converting the gene expression data matrix of each disease sample into a disease sample likelihood matrix.
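The conversion described in step 102 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the expression of the gene likelihood calculation model is not reproduced in this text, so the combination rule `normalised_ratio` below is a hypothetical placeholder that can be swapped for the actual model. All function and parameter names are illustrative, and the curves are fitted with the sample standard deviation, which the text does not specify.

```python
from statistics import NormalDist


def fit_gene_curves(expr, normal_ids, disease_ids):
    """Fit two normal curves per gene: f1 on disease samples, f2 on normal samples.

    expr maps gene -> {sample_id: expression level}.
    Assumes each group has at least two samples and nonzero variance.
    """
    curves = {}
    for gene, row in expr.items():
        # from_samples estimates the mean and the sample standard deviation.
        f1 = NormalDist.from_samples([row[s] for s in disease_ids])
        f2 = NormalDist.from_samples([row[s] for s in normal_ids])
        curves[gene] = (f1, f2)
    return curves


def normalised_ratio(p_disease, p_normal):
    """ASSUMPTION: placeholder combination rule, not the patent's formula."""
    total = p_disease + p_normal
    return p_disease / total if total > 0 else 0.5


def to_likelihood_matrix(expr, normal_ids, disease_ids, combine=normalised_ratio):
    """Return gene -> {sample_id: likelihood} for every sample in expr."""
    curves = fit_gene_curves(expr, normal_ids, disease_ids)
    lik = {}
    for gene, row in expr.items():
        f1, f2 = curves[gene]
        # Evaluate both density curves at the observed expression level.
        lik[gene] = {s: combine(f1.pdf(x), f2.pdf(x)) for s, x in row.items()}
    return lik
```

Passing the combination rule as a parameter keeps the curve fitting, which the text does specify, separate from the likelihood formula, which it does not.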

Step 103 specifically comprises:

and calculating the mean value and the variance of each gene likelihood according to all the normal sample likelihood matrixes.

And constructing a distribution function of each gene likelihood under a normal sample according to the mean value and the variance of the gene likelihood. The distribution function here is a normal distribution function.

Step 104 calculates a significantly different gene set for each disease sample based on the single-sample concept. A probability density function is constructed from the normal samples in the likelihood matrix, and for each disease sample each gene is tested for a significant difference from the normal samples, thereby screening out the significantly different genes.

Step 104 specifically includes:

and substituting each element in the disease sample likelihood matrix into the normal sample distribution function, and calculating the probability value p of each gene in each disease sample.

Judging whether the probability value p is less than or equal to a set threshold value or not; the threshold value here is set to 0.05.

If yes, determining the genes corresponding to the probability value p smaller than or equal to the set threshold value as the significant difference genes of the disease sample.
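Steps 103 and 104 can be sketched together as below. The text does not state how the probability value p is derived from the normal sample distribution function, so this sketch assumes a two-sided tail probability; that choice, the sample standard deviation used in the fit, and all names are assumptions made for illustration.

```python
from statistics import NormalDist


def significant_genes(normal_lik, disease_lik, threshold=0.05):
    """Return sample -> set of significantly different genes.

    normal_lik maps gene -> list of likelihoods over the normal samples.
    disease_lik maps gene -> {sample_id: likelihood}.
    """
    result = {}
    for gene, per_sample in disease_lik.items():
        # Step 103: normal distribution of this gene's likelihood in normal samples.
        ref = NormalDist.from_samples(normal_lik[gene])
        for sample, value in per_sample.items():
            cdf = ref.cdf(value)
            # ASSUMPTION: two-sided tail probability as the "probability value".
            p = 2.0 * min(cdf, 1.0 - cdf)
            genes = result.setdefault(sample, set())
            # Step 104: keep the gene when p is at or below the threshold.
            if p <= threshold:
                genes.add(gene)
    return result
```

Each disease sample is tested on its own against the normal reference, which is what lets the method produce a sample-specific gene set.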

Protein-protein interaction (PPI) network information is obtained from the STRING database. STRING is a widely used database of protein-protein interactions; it includes experimentally verified direct physical interactions between proteins as well as predicted interactions mined from PubMed abstracts and derived by other bioinformatics methods.

Step 105 specifically includes:

mapping the significant difference gene set of the disease sample to a protein-protein interaction network, and, according to the interaction relationships among the genes, selecting the five genes with the largest number of connected genes, together with the first-order neighbor nodes of those five genes, as the network marker of the disease sample. This removes false positives from the differential genes and so avoids false-positive markers caused by noisy gene expression data, small sample sizes, and the imbalance between positive and negative samples.
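The screening in step 105 can be sketched as a small graph routine. The sketch assumes that connections are counted only between differential genes, that first-order neighbors are likewise restricted to differential genes, and that a gene with no differential neighbor cannot serve as a hub; the text does not settle these details. The PPI network is represented as a plain edge list, and the names are illustrative.

```python
def network_marker(diff_genes, ppi_edges, top_k=5):
    """Return the network marker (a set of genes) for one disease sample.

    diff_genes: significantly different genes of the sample.
    ppi_edges: iterable of (gene_a, gene_b) interaction pairs.
    """
    diff = set(diff_genes)
    neighbours = {g: set() for g in diff}
    for a, b in ppi_edges:
        # ASSUMPTION: only edges between two differential genes are counted.
        if a in diff and b in diff and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Rank by number of connected differential genes; ties broken by name.
    ranked = sorted(diff, key=lambda g: (-len(neighbours[g]), g))
    # ASSUMPTION: isolated genes are not accepted as hubs.
    hubs = [g for g in ranked[:top_k] if neighbours[g]]
    marker = set(hubs)
    for g in hubs:
        marker |= neighbours[g]  # add first-order neighbours of each hub
    return marker
```

For example, `network_marker({"a", "b", "c", "d"}, [("a", "b"), ("a", "c")])` keeps the connected genes a, b and c and drops the isolated gene d.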

Step 106 specifically comprises: classifying the disease samples into different subtypes by using the ConsensusClusterPlus method of the R language, based on the cancer network marker of each disease sample and prior knowledge of cancer subtypes, and performing survival analysis on each resulting subtype by using the clinical data of the disease samples. The clinical data of the disease samples are also obtained from the TCGA database.
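The integration of per-sample markers that precedes clustering can be sketched as follows. The sketch assumes the integrated matrix is the disease likelihood matrix restricted to the union of all marker genes; the text does not say how entries for genes outside a given sample's marker are treated, so that reading and the names used here are assumptions. The clustering itself, performed with ConsensusClusterPlus in R, is not shown.

```python
def integrate_markers(disease_lik, markers):
    """Build the integrated likelihood matrix over all marker genes.

    disease_lik maps gene -> {sample_id: likelihood}.
    markers maps sample_id -> set of marker genes for that sample.
    Returns (genes, samples, matrix) with matrix[r][c] the likelihood
    of genes[r] in samples[c].
    """
    # Union of marker genes across samples, in a fixed order.
    genes = sorted(set().union(*markers.values()))
    samples = sorted(markers)
    # ASSUMPTION: every cell keeps its likelihood, marker or not.
    matrix = [[disease_lik[g][s] for s in samples] for g in genes]
    return genes, samples, matrix
```

The returned row and column labels make it straightforward to export the matrix to a file for the clustering step.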

On the basis, researchers can carry out more intensive research on the acquisition of cancer markers and the classification of cancer subtypes by means of the concept, and realize accurate diagnosis and treatment of diseases on the basis.

The invention herein provides a specific data embodiment to exemplarily illustrate the present invention.

Fig. 2 is a schematic diagram of determining cancer network markers based on a probabilistic model according to the present invention, as shown in fig. 2, the details are as follows:

conversion of a calculated gene expression matrix to a likelihood matrix

TABLE 1 mRNA Gene expression matrix

Table 1 shows an mRNA gene expression matrix containing information on 8 samples, where (n1, n2, n3, n4) denote normal tissue samples and (d1, d2, d3, d4) denote diseased tissue samples. g1, g2, g3, g4 and g5 are the names of the mRNAs, and the data in the table are gene expression data. The transformed likelihood matrix is then:

TABLE 2 likelihood matrix

Table 2 shows the likelihood matrix: the likelihood of each of the 5 genes is computed separately for each of these 8 samples.

This yields the likelihood matrix, and the data in the table are the transformed likelihood data.

Obtaining differentially expressed genes for each disease sample

Using the likelihood matrix obtained from the mRNA gene expression matrix, and assuming that the normal samples still follow a normal distribution after the transformation, each gene in each disease sample is tested for a significant difference from the normal samples, thereby obtaining a set of differentially expressed genes (p < 0.05) for each disease sample, as shown in Table 3:

TABLE 3 Selected differential genes

As shown in table 3, for the four disease samples (d1, d2, d3, d4), it was examined whether each gene was significantly different in the normal sample (p <0.05), and the bolded data in the table indicates that the genes were significantly different in the corresponding samples.

Network marker acquisition

Since the differential genes obtained from gene expression levels may include false positives, the interaction relationships between genes in the PPI network are used to remove the false-positive portion. In the network, if a gene is significantly different and many of the genes directly connected to it are also differential genes, the gene is considered relatively stable and is used as a cancer marker of the sample. The screening criterion is that the five genes with the most connected genes in the differential gene network, together with the first-order nodes connected to them, are taken as the network marker; the dark squares shown in FIG. 3 are the screened network markers.

Classification of different subtypes of cancer

The endometrial cancer (UCEC) data are classified into different subtypes according to the network marker information obtained for each disease sample (shown in part in FIG. 4), combined with existing cancer subtype knowledge and the clinical data of the disease samples. This yields the subtype classification result for UCEC (FIG. 5) and the distribution of sample counts across the subtypes (FIG. 6), and further the survival curve of each subtype (FIG. 7). The survival difference between subtypes is characterized by a p value, with p < 0.05 indicating a significant difference between the cancer subtypes.
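A survival curve for one subtype can be sketched with the Kaplan-Meier estimator. The text does not name the estimator or the test behind the reported p value, so the use of Kaplan-Meier here is an assumption, and the function name and input layout are illustrative; the between-subtype significance test is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time.

    times: follow-up time of each patient in one subtype.
    events: 1 if the death was observed at that time, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk patients surviving past t.
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        # Both deaths and censored patients leave the risk set.
        at_risk -= removed
    return curve
```

Censored patients lower the number at risk without lowering the survival estimate, which is why they are tracked separately from deaths.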

In order to achieve the above object, the present invention also provides a cancer network marker determination system based on a probabilistic model.

Fig. 8 is a schematic structural diagram of a cancer network marker determining system based on a probabilistic model according to an embodiment of the present invention, and as shown in fig. 8, the cancer network marker determining system according to the embodiment of the present invention includes:

a gene expression data matrix obtaining module 100, configured to obtain gene expression data matrices of multiple normal samples and multiple disease samples; the elements in the gene expression data matrix are gene expression levels.

A gene expression data matrix transformation module 200, configured to transform the gene expression data matrix of each normal sample into a normal sample likelihood matrix and transform the gene expression data matrix of each disease sample into a disease sample likelihood matrix by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods.

A normal sample distribution function constructing module 300, configured to construct a normal sample distribution function according to all the normal sample likelihood matrices.

A significant difference gene set determining module 400, configured to bring each element in the likelihood matrix of each disease sample into the normal sample distribution function, and determine a significant difference gene set of each disease sample.

A network marker determination module 500 for mapping the significantly different gene sets of each of the disease samples into a protein-protein interaction network, determining network markers for each disease sample.

A disease subtype classification module 600 for classifying the disease samples into different subtypes according to the network markers of each disease sample and the known cancer subtype prior data.

The gene expression data matrix transformation module 200 specifically includes:

wherein $\lambda_i$ represents the likelihood of gene i; the model takes as input the expression level of the i-th gene in the j-th sample; $f_i^1$ represents the normal distribution curve of gene i under the disease samples; $f_i^2$ represents the normal distribution curve of gene i under the normal samples.

And the transformation unit is used for transforming the gene expression data matrix of each normal sample into a normal sample likelihood matrix according to the gene likelihood calculation model, and transforming the gene expression data matrix of each disease sample into a disease sample likelihood matrix.

The significantly different gene set determining module 400 specifically includes:

and the probability value calculating unit is used for substituting each element in the disease sample likelihood matrix into the normal sample distribution function and calculating the probability value of each gene in each disease sample.

And the judging unit is used for judging whether the probability value is less than or equal to a set threshold value or not.

The invention provides a cancer network marker determination method and system based on a probability model. Recognizing that a disease differs among patients because of factors such as pathogenesis, the invention classifies the disease to obtain the corresponding disease subtypes and thereby helps improve diagnosis and treatment. This plays an important role in cancer network marker acquisition and cancer subtype classification. Compared with traditional cancer markers shared by all disease samples, the invention can obtain a network marker specific to each disease sample and can identify the cancer subtype to which each disease sample belongs, which better supports accurate diagnosis and treatment. It is also of great significance for screening mRNAs that play a key role in the occurrence and development of the disease and for improving the diagnosis and treatment of cancer.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method for determining cancer network markers based on a probabilistic model, the method comprising:

2. The method for determining cancer network markers according to claim 1, wherein the transforming the gene expression data matrix of all the normal samples into a normal sample likelihood matrix and the transforming the gene expression data matrix of all the disease samples into a disease sample likelihood matrix by using a probability density function specifically comprises:

wherein $\lambda_i$ represents the likelihood of gene i;

3. The method for determining cancer network markers according to claim 1, wherein the constructing a normal sample distribution function according to all the normal sample likelihood matrices specifically comprises:

4. The method for determining cancer network markers according to claim 1, wherein the step of sequentially substituting each element in the likelihood matrix of each disease sample into the distribution function of the normal sample to determine the significantly different gene set of each disease sample comprises:

5. The method for determining cancer network markers according to claim 1, wherein the step of sequentially mapping the significantly different gene sets of each disease sample into a protein-protein interaction network to determine the network markers of each disease sample comprises:

6. A cancer network marker determination system based on a probabilistic model, the cancer network marker determination system comprising:

7. The cancer network marker determination system of claim 6, further comprising:

8. The cancer network marker determination system of claim 6, wherein the gene expression data matrix transformation module specifically comprises:

wherein $\lambda_i$ represents the likelihood of gene i;

9. The cancer network marker determination system of claim 6, wherein the significantly different gene set determination module specifically comprises: