
CN109101783B - Cancer network marker determination method and system based on probability model - Google Patents

Publication number
CN109101783B
Authority
CN
China
Prior art keywords
sample
gene
disease
likelihood
normal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810920673.7A
Other languages
Chinese (zh)
Other versions
CN109101783A (en)
Inventor
杜玉改
刘文斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University filed Critical Wenzhou University
Priority to CN201810920673.7A priority Critical patent/CN109101783B/en
Publication of CN109101783A publication Critical patent/CN109101783A/en
Application granted granted Critical
Publication of CN109101783B publication Critical patent/CN109101783B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a cancer network marker determination method and system based on a probability model. The method comprises the following steps: converting the obtained gene expression data matrixes of all normal samples and all disease samples into likelihood matrixes by using a probability density function; constructing a normal sample distribution function from the likelihood matrixes of all the normal samples; substituting each element of the likelihood matrix of each disease sample into the normal sample distribution function to determine a significant difference gene set for each disease sample; and mapping the significant difference gene set of each disease sample into a protein-protein interaction network to determine the network marker of each disease sample. With the method or system provided by the invention, cancer network markers can be obtained accurately and effectively and used to classify disease subtypes, so as to achieve accurate diagnosis and treatment of the disease.

Description

Cancer network marker determination method and system based on probability model
Technical Field
The invention relates to the technical field of gene detection, in particular to a cancer network marker determination method and system based on a probability model.
Background
Research has shown that cancer develops through the joint action of multiple genes. Because traditional gene expression profile data suffer from high noise, small sample sizes, and imbalance between positive and negative samples, combining expression profile data with biological networks to determine cancer network markers has become a promising approach. Compared with traditional single-gene markers, network markers are also more efficient and more stable.
Disclosure of Invention
Taking into account the heterogeneity among samples and the fact that the same disease differs among patients owing to factors such as pathogenesis, the invention provides a cancer network marker determination method and system based on a probability model. The invention can obtain cancer network markers accurately and effectively and classify diseases using these markers, so as to achieve accurate diagnosis and treatment of the disease.
In order to achieve the purpose, the invention provides the following scheme:
a cancer network marker determination method based on a probabilistic model, the cancer network marker determination method comprising:
acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; elements in the gene expression data matrix are gene expression amount;
converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods;
constructing a normal sample distribution function according to all the normal sample likelihood matrixes;
sequentially substituting each element in each disease sample likelihood matrix into the normal sample distribution function, and determining a significant difference gene set of each disease sample;
and mapping the significant difference gene set of each disease sample into a protein-protein interaction network in turn, and determining the network marker of each disease sample.
Optionally, the cancer network marker determination method further comprises:
classifying the disease samples into different subtypes according to the network markers of each disease sample and the known cancer subtype prior data.
Optionally, the transforming, by using a probability density function, the gene expression data matrices of all the normal samples into normal sample likelihood matrices, and transforming the gene expression data matrices of all the disease samples into disease sample likelihood matrices specifically includes:
constructing a gene likelihood calculation model by using a probability density function; the expression of the gene likelihood calculation model is
Figure BDA0001764135760000021
where $\lambda_i$ represents the likelihood of gene i; the symbol shown in Figure BDA0001764135760000022 denotes the expression level of the i-th gene in the j-th sample; $f_i^1$ represents the normal distribution curve of gene i under the disease samples; and $f_i^2$ represents the normal distribution curve of gene i under the normal samples;
and converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes according to the gene likelihood calculation model.
Optionally, the constructing a normal sample distribution function according to all the normal sample likelihood matrices specifically includes:
calculating the mean value and the variance of each gene likelihood according to all the normal sample likelihood matrixes;
and constructing a normal distribution function of each gene likelihood under a normal sample according to the mean value and the variance of the gene likelihood.
Optionally, the step of sequentially substituting each element in each disease sample likelihood matrix into the normal sample distribution function to determine a significant difference gene set of each disease sample includes:
sequentially bringing each element in the disease sample likelihood matrix into the normal sample distribution function, and calculating the probability value of each gene in each disease sample;
judging whether the probability value is less than or equal to a set threshold value or not;
if yes, determining the genes corresponding to the probability values smaller than or equal to the set threshold value as the significant difference genes of the disease sample.
Optionally, the mapping the significantly different gene sets of each disease sample to a protein-protein interaction network in sequence to determine a network marker of each disease sample specifically includes:
mapping the significant difference gene set of each disease sample to the protein-protein interaction network in turn, and, according to the interaction relationships among the genes, determining the five genes with the largest number of connected genes, together with the first-order neighbor nodes of these five genes, as the network markers of the disease sample.
The present invention also provides a cancer network marker determination system based on a probabilistic model, the cancer network marker determination system comprising:
the gene expression data matrix acquisition module is used for acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; elements in the gene expression data matrix are gene expression amount;
the gene expression data matrix conversion module is used for converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods;
the normal sample distribution function building module is used for building a normal sample distribution function according to all the normal sample likelihood matrixes;
the significant difference gene set determining module is used for sequentially substituting each element in the likelihood matrix of each disease sample into the normal sample distribution function to determine a significant difference gene set of each disease sample;
and the network marker determining module is used for mapping the significant difference gene set of each disease sample into a protein-protein interaction network in sequence and determining the network marker of each disease sample.
Optionally, the cancer network marker determination system further comprises:
and the disease subtype classification module is used for classifying different subtypes of the disease samples according to the network markers of each disease sample and known cancer subtype prior data.
Optionally, the gene expression data matrix transformation module specifically includes:
the gene likelihood calculation model building unit is used for building a gene likelihood calculation model by utilizing a probability density function; the expression of the gene likelihood calculation model is
Figure BDA0001764135760000041
where $\lambda_i$ represents the likelihood of gene i; the symbol shown in Figure BDA0001764135760000042 denotes the expression level of the i-th gene in the j-th sample; $f_i^1$ represents the normal distribution curve of gene i under the disease samples; and $f_i^2$ represents the normal distribution curve of gene i under the normal samples;
and the transformation unit is used for transforming the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and transforming the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes according to the gene likelihood calculation model.
Optionally, the significantly different gene set determining module specifically includes:
the probability value calculating unit is used for sequentially substituting each element in the disease sample likelihood matrix into the normal sample distribution function and calculating the probability value of each gene in each disease sample;
the judging unit is used for judging whether the probability value is less than or equal to a set threshold value or not;
and the significant difference gene set determining unit is used for determining the genes corresponding to the probability values which are less than or equal to the set threshold value as the significant difference genes of the disease sample.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a cancer network marker determining method and a system based on a probability model, wherein the cancer network marker determining method comprises the following steps: acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples, and converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; then, according to all the normal sample likelihood matrixes, a normal sample distribution function is constructed, each element in each disease sample likelihood matrix is sequentially brought into the normal sample distribution function, and a significant difference gene set of each disease sample is determined; and finally, mapping the significant difference gene set of each disease sample to a protein-protein interaction network in sequence, and determining the network marker of each disease sample. By applying the method or the system provided by the invention, the cancer network markers can be accurately and effectively obtained, and the cancer network markers are utilized to classify the subtype of the disease, so that the accurate diagnosis and treatment of the disease are realized.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a cancer network marker determination method based on a probabilistic model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of determining cancer network markers based on a probabilistic model according to the present invention;
FIG. 3 is a schematic diagram of a network marker selected by the present invention;
FIG. 4 is a graph of the relationship of individual subtype partial markers obtained for cancer UCEC;
FIG. 5 is a graph of the result of subtype classification of cancer UCEC;
FIG. 6 is a sample number distribution graph of individual subtypes of cancer UCEC;
FIG. 7 is a graph of survival for individual subtypes of cancer UCEC;
FIG. 8 is a schematic structural diagram of a cancer network marker determination system based on a probabilistic model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Taking into account the heterogeneity among samples and the fact that the same disease differs among patients owing to factors such as pathogenesis, the invention provides a cancer network marker determination method and system based on a probability model. The invention can obtain cancer network markers accurately and effectively and classify diseases using these markers, so as to achieve accurate diagnosis and treatment of the disease.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
To overcome data noise, the present invention assumes that the expression profile data for each gene in a particular population or phenotype follows a normal distribution. Based on this assumption, the original gene expression profile data matrix can be converted into a likelihood matrix. The invention determines the significant difference genes in each disease sample through the likelihood matrix, and projects the significant difference genes into a protein-protein interaction (PPI) network to obtain the network marker of each disease sample.
Because of differing factors such as pathogenesis, the same disease differs among patients, and the traditional disease classification cannot represent all disease samples well. A finer subtype classification of these classical diseases is therefore of great biological importance for disease diagnosis and treatment. The markers of all disease samples are integrated to obtain an integrated likelihood matrix of the cancer markers, and the disease samples are classified into different subtypes with the ConsensusClusterPlus package of the R language in combination with existing cancer subtype information.
Based on the above, the main idea of the present invention is to introduce a probability density function and combine it with the single-sample concept to screen the network markers of each disease sample, and then to classify the cancer into different subtypes using these sample-specific markers and the clinical information of the samples.
Fig. 1 is a schematic flowchart of a cancer network marker determining method based on a probabilistic model according to an embodiment of the present invention, and as shown in fig. 1, the cancer network marker determining method based on the probabilistic model according to an embodiment of the present invention includes the following steps.
Step 101: acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; the elements in the gene expression data matrix are gene expression levels.
Step 102: converting the gene expression data matrix of each normal sample into a normal sample likelihood matrix by using a probability density function, and converting the gene expression data matrix of each disease sample into a disease sample likelihood matrix; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods.
Step 103: constructing a normal sample distribution function according to all the normal sample likelihood matrixes.
Step 104: and substituting each element in each disease sample likelihood matrix into the normal sample distribution function to determine a significant difference gene set of each disease sample.
Step 105: mapping the significant difference gene set of each disease sample into a protein-protein interaction network, and determining network markers of each disease sample.
Step 106: classifying the disease samples into different subtypes according to the network markers of each disease sample and the known cancer subtype prior data.
The data in the gene expression data matrix of the disease samples in step 101 are obtained from The Cancer Genome Atlas (TCGA) database.
Step 102 specifically includes:
constructing a gene likelihood calculation model by using a probability density function; the expression of the gene likelihood calculation model is
Figure BDA0001764135760000071
where $\lambda_i$ represents the likelihood of gene i; the symbol shown in Figure BDA0001764135760000072 denotes the expression level of the i-th gene in the j-th sample, i being the gene number and j the sample number; $f_i^1$ represents the normal distribution curve of gene i under the disease samples; $f_i^2$ represents the normal distribution curve of gene i under the normal samples; and the superscripts 1 and 2 denote disease and normal, respectively.
Specifically, the mean and variance of the expression level of each gene are computed separately for the normal samples and for the disease samples, and the normal distribution curves $f_i^2$ and $f_i^1$ of each gene under the normal samples and the disease samples are constructed, where the normal distribution function is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
in which x is the expression level, μ is the mean, and σ is the standard deviation. The gene likelihood calculation model is then constructed from the normal distribution curves $f_i^2$ and $f_i^1$ of each gene under the normal samples and the disease samples.
And converting the gene expression data matrix of each normal sample into a normal sample likelihood matrix according to the gene likelihood calculation model, and converting the gene expression data matrix of each disease sample into a disease sample likelihood matrix.
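The curve-fitting part of step 102 can be sketched as follows. This is a minimal illustration only: the function names and the toy data are hypothetical, the use of the population standard deviation is an assumption, and the expression that combines the two densities into the gene likelihood is given only in the formula figure above, so it is deliberately not reproduced here.

```python
import math
from statistics import mean, pstdev


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_gene_normals(expr):
    """Fit one normal curve per gene.

    expr maps gene name -> list of expression values over the samples of
    ONE group (all normal samples, or all disease samples).
    Assumption: population standard deviation; the text does not state
    which estimator is used.
    """
    return {gene: (mean(values), pstdev(values)) for gene, values in expr.items()}


def gene_densities(gene, x, disease_fit, normal_fit):
    """Return (f1(x), f2(x)): densities of x under the disease and normal curves."""
    mu1, sigma1 = disease_fit[gene]
    mu2, sigma2 = normal_fit[gene]
    return normal_pdf(x, mu1, sigma1), normal_pdf(x, mu2, sigma2)


# Hypothetical toy data, not taken from Table 1.
normal_expr = {"g1": [5.0, 5.2, 4.8, 5.0]}
disease_expr = {"g1": [8.0, 8.4, 7.6, 8.0]}
normal_fit = fit_gene_normals(normal_expr)
disease_fit = fit_gene_normals(disease_expr)

# An expression value near the disease mean is far more plausible
# under the disease curve f1 than under the normal curve f2.
f1, f2 = gene_densities("g1", 8.1, disease_fit, normal_fit)
```

A gene whose values are identical across a group has zero standard deviation and would need separate handling before the density is evaluated.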
Step 103 specifically comprises:
calculating the mean value and the variance of each gene likelihood according to all the normal sample likelihood matrixes.
And constructing a distribution function of each gene likelihood under a normal sample according to the mean value and the variance of the gene likelihood. The distribution function here is a normal distribution function.
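Step 103 amounts to estimating, for every gene, the mean and variance of its likelihood values over the normal samples. The sketch below assumes the likelihood matrix is held as a mapping from gene name to a list of values; the names `build_normal_reference` and `normal_cdf` and the choice of population variance are illustrative assumptions, not taken from the text.

```python
import math
from statistics import mean, pvariance


def build_normal_reference(normal_lik):
    """Per-gene (mean, variance) of the likelihood over the normal samples.

    normal_lik maps gene name -> list of likelihood values, one per normal sample.
    Assumption: population variance.
    """
    return {gene: (mean(values), pvariance(values)) for gene, values in normal_lik.items()}


def normal_cdf(x, mu, var):
    """Cumulative distribution function of N(mu, var) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))


# Hypothetical likelihood values for one gene over four normal samples.
reference = build_normal_reference({"g1": [0.1, 0.2, 0.3, 0.2]})
mu, var = reference["g1"]
```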
Step 104 calculates the significant difference gene set of each disease sample based on the single-sample concept. A probability density function is constructed from the normal samples in the likelihood matrix, and for each disease sample every gene is tested for a significant difference relative to the normal samples, thereby screening out the significant difference genes.
Step 104 specifically includes:
substituting each element in the disease sample likelihood matrix into the normal sample distribution function, and calculating the probability value p of each gene in each disease sample.
Judging whether the probability value p is less than or equal to a set threshold value or not; the threshold value here is set to 0.05.
If yes, determining the genes corresponding to the probability value p smaller than or equal to the set threshold value as the significant difference genes of the disease sample.
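The screening in step 104 can be sketched as below. The text does not state how the probability value p is derived from the normal sample distribution function; this sketch assumes a two-sided tail probability under the per-gene normal distribution, and the function names and example values are hypothetical.

```python
import math


def two_sided_p(value, mu, var):
    """Two-sided tail probability of `value` under N(mu, var).

    Assumption: the patent's "probability value" is treated here as a
    two-sided p-value; the source does not specify this.
    """
    z = abs(value - mu) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2.0))


def significant_genes(disease_lik, reference, threshold=0.05):
    """Genes of ONE disease sample whose p-value is <= threshold.

    disease_lik: gene name -> likelihood value in this disease sample.
    reference:   gene name -> (mean, variance) estimated from normal samples.
    """
    selected = set()
    for gene, value in disease_lik.items():
        mu, var = reference[gene]
        if two_sided_p(value, mu, var) <= threshold:
            selected.add(gene)
    return selected


# Hypothetical reference and sample values.
reference = {"g1": (0.0, 1.0), "g2": (0.0, 1.0)}
sample = {"g1": 3.0, "g2": 0.5}
selected = significant_genes(sample, reference)
```

With the threshold of 0.05 stated above, a value three standard deviations from the normal mean is selected, while a value half a standard deviation away is not.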
Protein-protein interaction (PPI) network information is obtained from the STRING database. STRING is a widely used database for searching interactions between proteins; it includes experimentally verified direct physical interactions between proteins as well as predicted protein interactions mined from PubMed abstracts and derived by other bioinformatics methods.
Step 105 specifically includes:
The significant difference gene set of the disease sample is mapped to the protein-protein interaction network, and, according to the interaction relationships among the genes, the five genes with the largest number of connected genes, together with the first-order neighbor nodes of these five genes, are determined as the network markers of the disease sample. This removes false positives from the difference genes and avoids false-positive markers caused by noisy gene expression data, small sample sizes, and imbalance between positive and negative samples.
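The hub selection in step 105 can be sketched as follows. Two points are assumptions rather than statements from the text: connections are counted only between difference genes (the induced difference-gene subnetwork), and ties are broken alphabetically for reproducibility. The function name, edge list, and gene names are hypothetical.

```python
def network_markers(diff_genes, ppi_edges, top_k=5):
    """Select hub genes and their first-order neighbors as network markers.

    diff_genes: iterable of significant difference genes of ONE disease sample.
    ppi_edges:  iterable of (gene_a, gene_b) pairs from a PPI network.
    Returns (hubs, markers): the top_k most connected difference genes and
    the union of those hubs with their neighbors in the subnetwork.
    """
    diff = set(diff_genes)
    # Keep only edges whose two ends are both difference genes.
    neighbors = {gene: set() for gene in diff}
    for a, b in ppi_edges:
        if a in diff and b in diff and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Rank by number of connected difference genes, ties broken by name.
    ranked = sorted(diff, key=lambda g: (-len(neighbors[g]), g))
    hubs = ranked[:top_k]
    markers = set(hubs)
    for hub in hubs:
        markers |= neighbors[hub]
    return hubs, markers


# Hypothetical example: X and Y are not difference genes, so their edges are ignored.
diff = {"A", "B", "C", "D", "E"}
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "X"), ("E", "Y")]
hubs, markers = network_markers(diff, edges, top_k=2)
```

In this toy network gene E has no connected difference gene, so it is excluded from the markers, which mirrors the false-positive filtering described above.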
Step 106 specifically comprises: classifying the disease samples into different subtypes with the ConsensusClusterPlus package of the R language, using the cancer network markers of each disease sample and prior knowledge of cancer subtypes, and performing survival analysis on each resulting subtype using the clinical data of the disease samples. The clinical data of the disease samples are also obtained from the TCGA database.
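Before clustering, the per-sample markers have to be combined into one matrix. The text does not specify how this integrated matrix is assembled, so the sketch below rests on assumptions: it takes the union of the marker genes of all disease samples as rows, the disease samples as columns, and the likelihood values as entries. The clustering itself, performed with ConsensusClusterPlus in R, is not shown, and all names and values are hypothetical.

```python
def integrate_markers(markers_by_sample, disease_lik):
    """Assemble a genes-by-samples matrix restricted to marker genes.

    markers_by_sample: sample name -> set of marker genes of that sample.
    disease_lik:       sample name -> {gene name -> likelihood value}.
    Returns (genes, samples, matrix) with matrix[r][c] the likelihood of
    genes[r] in samples[c].
    """
    genes = sorted(set().union(*markers_by_sample.values()))
    samples = sorted(markers_by_sample)
    matrix = [[disease_lik[s][g] for s in samples] for g in genes]
    return genes, samples, matrix


# Hypothetical markers and likelihood values for two disease samples.
markers_by_sample = {"d1": {"g1", "g2"}, "d2": {"g2", "g3"}}
disease_lik = {
    "d1": {"g1": 0.9, "g2": 0.8, "g3": 0.1},
    "d2": {"g1": 0.2, "g2": 0.7, "g3": 0.6},
}
genes, samples, matrix = integrate_markers(markers_by_sample, disease_lik)
```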
On the basis, researchers can carry out more intensive research on the acquisition of cancer markers and the classification of cancer subtypes by means of the concept, and realize accurate diagnosis and treatment of diseases on the basis.
The invention herein provides a specific data embodiment to exemplarily illustrate the present invention.
Fig. 2 is a schematic diagram of determining cancer network markers based on a probabilistic model according to the present invention, as shown in fig. 2, the details are as follows:
Converting the gene expression matrix into a likelihood matrix
TABLE 1 mRNA Gene expression matrix
Figure BDA0001764135760000091
Table 1 shows an mRNA gene expression matrix containing 8 samples: n1, n2, n3 and n4 are normal tissue samples, and d1, d2, d3 and d4 are diseased tissue samples. g1, g2, g3, g4 and g5 are the names of the mRNAs, and the entries in the table are gene expression data. The transformed likelihood matrix is then:
TABLE 2 likelihood matrix
Figure BDA0001764135760000092
Table 2 is the likelihood matrix. For each of the 5 genes in these 8 samples, the quantity shown in Figure BDA0001764135760000093 is calculated, which yields the likelihood matrix; the data in the table are the transformed likelihood values.
Obtaining differentially expressed genes for each disease sample
Using the likelihood matrix obtained by transforming the mRNA gene expression matrix, and assuming that the normal samples still follow a normal distribution, each gene in each disease sample is tested for a significant difference relative to the normal samples, which yields a differentially expressed gene set (p < 0.05) for each disease sample, as shown in Table 3:
TABLE 3 Selected differential genes
Figure BDA0001764135760000101
As shown in Table 3, each gene in the four disease samples (d1, d2, d3, d4) is tested for a significant difference relative to the normal samples (p < 0.05); bold entries in the table indicate genes that are significantly different in the corresponding sample.
Network marker acquisition
Since the difference genes obtained from gene expression levels may include false positives, the interaction relationships between genes in the PPI network are used to remove the false-positive portion. In the network, if a gene is significantly different and many of the genes directly connected to it are also difference genes, that difference gene is considered relatively stable and is used as a cancer marker of the sample. The screening criterion is that the five genes with the largest number of connected genes in the difference-gene network, together with the first-order nodes connected to them, are taken as the network markers. The dark squares in FIG. 3 are the screened network markers.
Classification of different subtypes of cancer
Based on the network marker information obtained for each disease sample (FIG. 4), combined with existing knowledge of cancer subtypes and the clinical data of the disease samples, the endometrial cancer (UCEC) data are classified into different subtypes. This yields the subtype classification result for UCEC (FIG. 5) and the sample number distribution of each UCEC subtype (FIG. 6), and further the survival curves of the UCEC subtypes (FIG. 7). The survival difference between subtypes is characterized by a p value, where p < 0.05 indicates a significant difference between the cancer subtypes.
In order to achieve the above object, the present invention also provides a cancer network marker determination system based on a probabilistic model.
Fig. 8 is a schematic structural diagram of a cancer network marker determining system based on a probabilistic model according to an embodiment of the present invention, and as shown in fig. 8, the cancer network marker determining system according to the embodiment of the present invention includes:
a gene expression data matrix obtaining module 100, configured to obtain gene expression data matrices of multiple normal samples and multiple disease samples; the elements in the gene expression data matrix are gene expression levels.
A gene expression data matrix transformation module 200, configured to transform the gene expression data matrix of each normal sample into a normal sample likelihood matrix and transform the gene expression data matrix of each disease sample into a disease sample likelihood matrix by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods.
A normal sample distribution function constructing module 300, configured to construct a normal sample distribution function according to all the normal sample likelihood matrices.
A significant difference gene set determining module 400, configured to bring each element in the likelihood matrix of each disease sample into the normal sample distribution function, and determine a significant difference gene set of each disease sample.
A network marker determination module 500 for mapping the significantly different gene sets of each of the disease samples into a protein-protein interaction network, determining network markers for each disease sample.
A disease subtype classification module 600 for classifying the disease samples into different subtypes according to the network markers of each disease sample and the known cancer subtype prior data.
The gene expression data matrix transformation module 200 specifically includes:
the gene likelihood calculation model building unit is used for building a gene likelihood calculation model by utilizing a probability density function; the expression of the gene likelihood calculation model is
Figure BDA0001764135760000111
where $\lambda_i$ represents the likelihood of gene i; the symbol shown in Figure BDA0001764135760000112 denotes the expression level of the i-th gene in the j-th sample; $f_i^1$ represents the normal distribution curve of gene i under the disease samples; and $f_i^2$ represents the normal distribution curve of gene i under the normal samples.
And the transformation unit is used for transforming the gene expression data matrix of each normal sample into a normal sample likelihood matrix according to the gene likelihood calculation model, and transforming the gene expression data matrix of each disease sample into a disease sample likelihood matrix.
The significantly different gene set determining module 400 specifically includes:
and the probability value calculating unit is used for substituting each element in the disease sample likelihood matrix into the normal sample distribution function and calculating the probability value of each gene in each disease sample.
And the judging unit is used for judging whether the probability value is less than or equal to a set threshold value or not.
And the significant difference gene set determining unit is used for determining the genes corresponding to the probability values which are less than or equal to the set threshold value as the significant difference genes of the disease sample.
The invention provides a cancer network marker determination method and system based on a probability model. Because the same disease differs among patients owing to factors such as pathogenesis, the invention classifies the disease into subtypes, which helps improve diagnosis and treatment and plays an important role in cancer network marker acquisition and cancer subtype classification. Compared with traditional cancer markers shared by all disease samples, the invention obtains a network marker specific to each disease sample and can identify the cancer subtype to which each disease sample belongs. This better supports accurate diagnosis and treatment of the disease and is of great significance for screening mRNAs that play a key role in the occurrence and development of the disease.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (9)

1. A method for determining cancer network markers based on a probabilistic model, the method comprising:
acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; the elements of each gene expression data matrix are gene expression levels;
converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods;
constructing a normal sample distribution function according to all the normal sample likelihood matrixes;
sequentially substituting each element in each disease sample likelihood matrix into the normal sample distribution function, and determining a significant difference gene set of each disease sample;
and mapping the significant difference gene set of each disease sample into a protein-protein interaction network in turn, and determining the network marker of each disease sample.
2. The method for determining cancer network markers according to claim 1, wherein the converting of the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and of the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function specifically comprises:
constructing a gene likelihood calculation model by using a probability density function; the expression of the gene likelihood calculation model is
[formula image FDA0002567877500000011, not reproduced in this text]
wherein $\lambda_i$ represents the likelihood of gene $i$; the symbol shown in image FDA0002567877500000012 represents the expression level of the $i$-th gene in the $j$-th sample; $f_i^1$ represents the normal distribution curve of gene $i$ under the disease samples; and $f_i^2$ represents the normal distribution curve of gene $i$ under the normal samples;
and converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes according to the gene likelihood calculation model.
3. The method for determining cancer network markers according to claim 1, wherein the constructing a normal sample distribution function according to all the normal sample likelihood matrices specifically comprises:
calculating the mean value and the variance of each gene likelihood according to all the normal sample likelihood matrixes;
and constructing a normal distribution function of each gene likelihood under a normal sample according to the mean value and the variance of the gene likelihood.
4. The method for determining cancer network markers according to claim 1, wherein the sequentially substituting of each element in each disease sample likelihood matrix into the normal sample distribution function to determine the significant difference gene set of each disease sample specifically comprises:
sequentially substituting each element in the disease sample likelihood matrix into the normal sample distribution function, and calculating the probability value of each gene in each disease sample;
judging whether the probability value is less than or equal to a set threshold;
if yes, determining the gene corresponding to that probability value as a significant difference gene of the disease sample.
5. The method for determining cancer network markers according to claim 1, wherein the sequentially mapping of the significant difference gene set of each disease sample into a protein-protein interaction network to determine the network marker of each disease sample specifically comprises:
mapping the significant difference gene set of each disease sample into a protein-protein interaction network in turn, and, according to the interaction relationships among the genes, screening out the five genes connected to the largest number of genes and determining these five genes together with their first-order neighbor nodes as the network marker of the disease sample.
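The hub-selection step recited in claim 5 can be sketched in Python as follows. This is only an illustrative reading, not the claimed implementation. It assumes that the protein-protein interaction network is given as a list of gene pairs, that only interactions between two significant difference genes are counted, that first-order neighbors are taken within that same sub-network, and that ties in connection count are broken alphabetically. The function name `network_marker` and the parameter `top_k` are hypothetical.

```python
from collections import defaultdict


def network_marker(sig_genes, ppi_edges, top_k=5):
    """Return (hubs, marker) for one disease sample.

    sig_genes: iterable of significant difference genes of the sample
    ppi_edges: iterable of (gene_a, gene_b) interaction pairs
    hubs:      the top_k genes with the most connections
    marker:    the hubs plus their first-order neighbors
    """
    sig = set(sig_genes)
    adjacency = defaultdict(set)
    for a, b in ppi_edges:
        # Assumption: keep only interactions between two significant genes.
        if a in sig and b in sig and a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Rank by number of connected genes (descending); ties by gene name.
    ranked = sorted(adjacency, key=lambda g: (-len(adjacency[g]), g))
    hubs = ranked[:top_k]

    marker = set(hubs)
    for hub in hubs:
        marker |= adjacency[hub]
    return hubs, marker


# Hypothetical toy network; gene X is not significant and is ignored.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "F"), ("A", "X")]
hubs, marker = network_marker({"A", "B", "C", "D", "E", "F"}, edges, top_k=2)
print(hubs)            # ['A', 'B']
print(sorted(marker))  # ['A', 'B', 'C', 'D']
```

With `top_k=5`, as in the claim, the same function returns the five most connected genes and their neighbors. Whether neighbors outside the significant difference gene set should also be included is not settled by the text, so that choice is left as an explicit assumption in the code.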
6. A cancer network marker determination system based on a probabilistic model, the cancer network marker determination system comprising:
the gene expression data matrix acquisition module is used for acquiring gene expression data matrixes of a plurality of normal samples and a plurality of disease samples; the elements of each gene expression data matrix are gene expression levels;
the gene expression data matrix conversion module is used for converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes by using a probability density function; elements in the normal sample likelihood matrix and the disease sample likelihood matrix are both gene likelihoods;
the normal sample distribution function building module is used for building a normal sample distribution function according to all the normal sample likelihood matrixes;
the significant difference gene set determining module is used for sequentially substituting each element in the likelihood matrix of each disease sample into the normal sample distribution function to determine a significant difference gene set of each disease sample;
and the network marker determining module is used for mapping the significant difference gene set of each disease sample into a protein-protein interaction network in sequence and determining the network marker of each disease sample.
7. The cancer network marker determination system of claim 6, further comprising:
and the disease subtype classification module is used for classifying different subtypes of the disease samples according to the network markers of each disease sample and known cancer subtype prior data.
8. The cancer network marker determination system of claim 6, wherein the gene expression data matrix conversion module specifically comprises:
the gene likelihood calculation model building unit is used for building a gene likelihood calculation model by using a probability density function; the expression of the gene likelihood calculation model is
[formula image FDA0002567877500000041, not reproduced in this text]
wherein $\lambda_i$ represents the likelihood of gene $i$; the symbol shown in image FDA0002567877500000042 represents the expression level of the $i$-th gene in the $j$-th sample; $f_i^1$ represents the normal distribution curve of gene $i$ under the disease samples; and $f_i^2$ represents the normal distribution curve of gene $i$ under the normal samples;
and the conversion unit is used for converting the gene expression data matrixes of all the normal samples into normal sample likelihood matrixes and converting the gene expression data matrixes of all the disease samples into disease sample likelihood matrixes according to the gene likelihood calculation model.
9. The cancer network marker determination system of claim 6, wherein the significant difference gene set determining module specifically comprises:
the probability value calculating unit is used for sequentially substituting each element in the disease sample likelihood matrix into the normal sample distribution function and calculating the probability value of each gene in each disease sample;
the judging unit is used for judging whether the probability value is less than or equal to a set threshold value or not;
and the significant difference gene set determining unit is used for determining the genes corresponding to the probability values which are less than or equal to the set threshold value as the significant difference genes of the disease sample.
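The gene likelihood calculation model of claims 2 and 8 is given only as a formula image that is not reproduced in this text, so its exact expression cannot be confirmed here. The following Python sketch therefore assumes, purely for illustration, that the likelihood of gene i is the log-ratio of the two normal densities named in the claims, log(f_i^1(x) / f_i^2(x)), with each density fitted from the disease or normal samples respectively. The log-ratio form, the function names, and the dictionary layout are all assumptions, and the sketch requires a non-zero variance for every gene.

```python
import math
from statistics import mean, pvariance


def log_normal_pdf(x, mu, var):
    """Log of the normal density N(mu, var) at x (var must be > 0)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def likelihood_matrix(expr, disease_expr, normal_expr):
    """Convert expression levels into gene likelihoods.

    expr, disease_expr, normal_expr: {gene: [expression per sample]}
    Returns {gene: [likelihood per sample in expr]}.
    ASSUMPTION: likelihood = log f1(x) - log f2(x), where f1 is fitted
    on the disease samples and f2 on the normal samples.
    """
    result = {}
    for gene, values in expr.items():
        mu1, var1 = mean(disease_expr[gene]), pvariance(disease_expr[gene])
        mu2, var2 = mean(normal_expr[gene]), pvariance(normal_expr[gene])
        result[gene] = [
            log_normal_pdf(x, mu1, var1) - log_normal_pdf(x, mu2, var2)
            for x in values
        ]
    return result


# Hypothetical toy data: one gene, disease ~ N(5, 1), normal ~ N(0, 1).
disease_expr = {"G": [4.0, 6.0]}
normal_expr = {"G": [-1.0, 1.0]}
lik = likelihood_matrix({"G": [2.5, 5.0, 0.0]}, disease_expr, normal_expr)
print([round(v, 6) for v in lik["G"]])  # [0.0, 12.5, -12.5]
```

Under this assumed form, a value midway between the two fitted means scores zero, values typical of the disease samples score positive, and values typical of the normal samples score negative. Applying the same function to the normal and the disease expression matrixes would yield the two likelihood matrixes that the claims refer to.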

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810920673.7A CN109101783B (en) 2018-08-14 2018-08-14 Cancer network marker determination method and system based on probability model

Publications (2)

Publication Number Publication Date
CN109101783A CN109101783A (en) 2018-12-28
CN109101783B true CN109101783B (en) 2020-09-04

Family

ID=64849535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810920673.7A Expired - Fee Related CN109101783B (en) 2018-08-14 2018-08-14 Cancer network marker determination method and system based on probability model

Country Status (1)

Country Link
CN (1) CN109101783B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010204B (en) * 2019-04-04 2022-12-02 中南大学 Fusion network and multi-scoring strategy based prognostic biomarker identification method
CN110444248B (en) * 2019-07-22 2021-09-24 山东大学 Cancer biomolecule marker screening method and system based on network topology parameters
CN110797083B (en) * 2019-09-18 2023-04-18 中南大学 Biomarker identification method based on multiple networks

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268431A (en) * 2013-05-21 2013-08-28 中山大学 Cancer hypotype biomarker detecting system based on student t distribution
CN103473416A (en) * 2013-09-13 2013-12-25 中国人民解放军国防科学技术大学 Protein-protein interaction model building method and device
WO2013192504A1 (en) * 2012-06-22 2013-12-27 The Trustees Of Dartmouth College Novel vista-ig constructs and the use of vista-ig for treatment of autoimmune, allergic and inflammatory disorders
CN105117617A (en) * 2015-08-26 2015-12-02 大连海事大学 Method for screening environmentally sensitive biomolecules
CN106295246A (en) * 2016-08-07 2017-01-04 吉林大学 Find the lncRNA relevant to tumor and predict its function
CN107025387A (en) * 2017-03-29 2017-08-08 电子科技大学 One kind is used for biomarker for cancer and knows method for distinguishing
CN108181471A (en) * 2017-12-15 2018-06-19 新疆医科大学第附属医院 A kind of detection marker of dissection of aorta and marker appraisal procedure
CN108345768A (en) * 2017-01-20 2018-07-31 深圳华大生命科学研究院 A kind of method and marker combination of determining infant's intestinal flora maturity

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180211013A1 (en) * 2017-01-25 2018-07-26 International Business Machines Corporation Patient Communication Priority By Compliance Dates, Risk Scores, and Organizational Goals

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Accurate and Reliable Cancer Classification Based on Probabilistic Inference of Pathway Activity; Junjie Su et al.; PLoS ONE; 2009-12-07; vol. 4, no. 12; pp. 1-10 *
Learning Gaussian Graphical Models of Gene Networks with False Discovery Rate Control; Jose M. Pena et al.; European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics; 2008-12-31; vol. 4973; pp. 165-176 *
Personalized characterization of diseases using sample-specific networks; Xiaoping Liu et al.; Nucleic Acids Research; 2016-09-04; vol. 44, no. 22; pp. 1-18 *
Selection of serum tumor markers in the diagnosis of pancreatic cancer; 高云朝; 《上海医学》 (Shanghai Medical Journal); 2005-12-31; vol. 28, no. 04; pp. 330-331 *

Also Published As

Publication number Publication date
CN109101783A (en) 2018-12-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 2020-09-04; termination date: 2021-08-14)