
CN104504305B - Supervised gene expression data classification method - Google Patents

Supervised gene expression data classification method

Info

Publication number
CN104504305B
CN104504305B
Authority
CN
China
Prior art keywords
matrix
training sample
classification
gene expression
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410817036.9A
Other languages
Chinese (zh)
Other versions
CN104504305A (en)
Inventor
王文俊 (Wang Wenjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201410817036.9A priority Critical patent/CN104504305B/en
Publication of CN104504305A publication Critical patent/CN104504305A/en
Application granted granted Critical
Publication of CN104504305B publication Critical patent/CN104504305B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a supervised gene expression data classification method, which mainly addresses the curse of dimensionality, loss of information, and complex classifier design that arise when the prior art classifies gene expression data. The technical scheme is: 1. obtain the discriminative feature vectors of the training samples by the class-preserving projection method; 2. use the discriminative feature vectors of the training samples to obtain a projection matrix by a regression optimization method; 3. obtain the training sample feature set and the test sample feature set through the projection matrix; 4. using the training sample feature set and the test sample feature set, classify the test samples with a nearest-neighbor classifier. The invention overcomes the matrix singularity and overfitting problems of the class-preserving projection method and improves the accuracy of gene expression data classification; it can be used for tumor identification and cancer classification in bioinformatics.

Description

Supervised gene expression data classification method
Technical Field
The invention belongs to the technical field of data processing, and relates to a supervised gene expression data classification method which can be used for tumor identification and tumor subtype classification in bioinformatics.
Background
With the development of gene chip technology, a huge amount of gene expression data has been generated, and how to obtain useful information from this massive data has become a hot topic in bioinformatics research. Classification is one of the important means of mining biological information from gene expression data, but the high-dimension, small-sample nature of gene expression data brings the curse of dimensionality to its classification. To overcome this problem, gene expression data are usually first subjected to gene selection or feature extraction and then classified with a conventional classifier. Many gene selection methods exist, but for different tumor classification tasks the various gene selection algorithms have no uniform standard, and a poorly designed gene selection algorithm may discard informative genes that are useful for classification, degrading classification performance. Feature extraction methods for gene expression data classification fall mainly into two categories:
(1) Unsupervised feature extraction methods, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization (NMF), and Locality Preserving Projections (LPP). These methods do not use the class information of the samples; effective classification features usually have to be extracted with the help of some discriminative feature extraction method, or a more complex classifier such as a Support Vector Machine (SVM) has to be adopted to improve classification performance, which increases the complexity of classification and recognition.
(2) Supervised feature extraction methods. The classical supervised feature extraction method is Linear Discriminant Analysis (LDA), but for the high-dimension, small-sample characteristics of gene expression data, LDA suffers from singular scatter matrices, overfitting, and an optimal subspace dimension limited by the number of sample classes, which restricts its application. Class-preserving projection (CPP) is a supervised feature extraction method proposed in 2012; see Wang Wenjun, "A new feature extraction method for gene expression data based on class-preserving projection", Acta Electronica Sinica, 40(2): 358-364, 2012. CPP effectively removes the limitation of the optimal subspace dimension by the number of sample classes, but for the high-dimension, small-sample characteristics of gene expression data it still suffers from matrix singularity and overfitting.
Disclosure of Invention
The purpose of the invention is to overcome the matrix singularity and overfitting problems of the class-preserving projection method, and to provide a new supervised gene expression data classification method that improves the accuracy of gene expression data classification.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) Let the training sample gene expression data set be X = {x_i | i = 1, 2, …, m}, where x_i is an n-dimensional column vector representing the expression level vector of the i-th training sample on n genes, and m is the number of training samples; denote the class of the i-th training sample as c_i;
(2) Obtain the discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples by the class-preserving projection method, where d is the number of discriminative feature vectors and 1 ≤ d < n;
(3) Using the discriminative feature vectors y'_l, obtain an n × d projection matrix A by a regression optimization method;
(4) Project the gene expression level vector x_i of the i-th training sample onto the projection matrix A to obtain the feature vector of the i-th training sample, y_i = A^T x_i, where A^T denotes the transpose of the projection matrix A; the training sample feature set is Y = {y_i | i = 1, 2, …, m};
(5) Let the test sample gene expression data set be U = {u_j | j = 1, 2, …, p}, where u_j is an n-dimensional column vector representing the expression level vector of the j-th test sample on n genes, and p is the number of test samples;
(6) Project the gene expression level vector u_j of the j-th test sample onto the projection matrix A to obtain the feature vector of the j-th test sample, q_j = A^T u_j, where A^T denotes the transpose of the projection matrix A; the test sample feature set is Q = {q_j | j = 1, 2, …, p};
(7) Classify the test samples with a nearest-neighbor classifier: compute the Euclidean distance from the feature vector q_j of the j-th test sample to each training sample feature vector y_i, and take the class of the training sample with the smallest Euclidean distance as the class of the j-th test sample.
Compared with the prior art, the invention has the following advantages:
1) By recasting the class-preserving projection method in a regression framework, the invention overcomes the matrix singularity and overfitting problems of the class-preserving projection method;
2) The invention extracts the classification features of the samples by making use of the sample class information, which reduces the burden of classifier design and improves the accuracy of gene expression data classification.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a graph of the classification correct recognition rate for a first set of gene expression data used in the simulation of the present invention;
FIG. 3 is a graph of the classification correct identification rate of a second set of gene expression data used in the simulation of the present invention.
Detailed Description
Referring to fig. 1, the specific implementation steps of the present invention are as follows:
Step 1, give the training sample gene expression data and the class information of the training samples.
Given the expression data of m training samples on n genes, represented by an n × m matrix X, the rows of X correspond to genes and the columns to training samples: the element X_ki of X is the expression level of the i-th training sample on the k-th gene, and the i-th column x_i of X is the expression level vector of the i-th training sample on the n genes, i = 1, 2, …, m, k = 1, 2, …, n;
class information C = { C } for a given m training samples i I =1,2, \8230;, m }, wherein c i Representing the class of the ith training sample.
Step 2, obtain the discriminative feature vectors y'_l of the training samples by the class-preserving projection method, from the class information C and the gene expression data X of the training samples.
(2.1) Define the value W_1(i,t) of the m × m homogeneous relation matrix W_1 and the value W_2(i,t) of the m × m heterogeneous relation matrix W_2 as follows:
where c_t denotes the class of the t-th training sample;
(2.2) Compute the diagonal element values of the m × m homogeneous diagonal matrix D_1 and of the m × m heterogeneous diagonal matrix D_2;
The off-diagonal elements of the homogeneous diagonal matrix D_1 and of the heterogeneous diagonal matrix D_2 are all 0;
(2.3) Compute the m × m within-class scatter matrix L_1 and the m × m between-class scatter matrix L_2:
L_1 = D_1 - W_1
L_2 = D_2 - W_2
(2.4) Define the generalized eigen-equation L_1 y' = λ L_2 y', where λ is an eigenvalue and y' is an m-dimensional eigenvector;
(2.5) Solve for the eigenvectors corresponding to the d smallest eigenvalues of the generalized eigen-equation and take them as the d discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples, where d is the number of discriminative feature vectors and 1 ≤ d < n.
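Sub-steps (2.1)-(2.5) can be sketched in a few lines of NumPy/SciPy. The entry formulas for W_1, W_2, D_1 and D_2 are not reproduced in the text above, so this sketch assumes the usual construction for such relation matrices: W_1 marks same-class sample pairs with 1, W_2 marks different-class pairs with 1, and D_1, D_2 carry the corresponding row sums on their diagonals. The function name and the small ridge added to L_2 for numerical stability are illustrative choices, not part of the patent.

```python
import numpy as np
from scipy.linalg import eigh

def cpp_discriminative_vectors(classes, d):
    """Steps (2.1)-(2.5): class-preserving projection (CPP) vectors.

    The 0/1 construction of W1/W2 and the row-sum diagonals of D1/D2 are
    assumptions; the entry formulas are not reproduced in the text."""
    c = np.asarray(classes)
    m = len(c)

    # (2.1) homogeneous / heterogeneous relation matrices (assumed 0/1)
    W1 = (c[:, None] == c[None, :]).astype(float)
    W2 = 1.0 - W1

    # (2.2) diagonal matrices holding the row sums of W1 and W2
    D1 = np.diag(W1.sum(axis=1))
    D2 = np.diag(W2.sum(axis=1))

    # (2.3) within-class and between-class scatter matrices
    L1 = D1 - W1
    L2 = D2 - W2

    # (2.4)-(2.5) generalized eigenproblem L1 y' = lambda L2 y'; keep the
    # eigenvectors of the d smallest eigenvalues.  A tiny ridge keeps the
    # solver stable when L2 is only positive semi-definite.
    vals, vecs = eigh(L1, L2 + 1e-8 * np.eye(m))
    order = np.argsort(vals)[:d]
    return vecs[:, order]          # m x d matrix; columns are the y'_l
```

The returned columns play the role of the discriminative feature vectors y'_l consumed by step 3.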
Step 3, using the discriminative feature vectors y'_l of the training samples and the training sample gene expression data X, obtain an n × d projection matrix A by a regression optimization method.
(3.1) Let a_l be the n-dimensional optimal projection vector, and define the regression optimization formula as follows:
where α and β are two regression coefficients with different values, satisfying 0 < α < 1, 0 < β < 1, and α ≠ β;
(3.2) Solve the regression optimization formula to obtain d n-dimensional projection vectors a_l, l = 1, 2, …, d, which form the n × d projection matrix A = [a_1, a_2, …, a_d].
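The regression optimization formula of sub-step (3.1) is likewise not reproduced in the text, so the following sketch only illustrates the general idea of recasting the projection step as a regression problem: each column a_l of A is fit so that X^T a_l approximates the CPP vector y'_l, with a regularization term (here a simple ridge penalty built from α and β) standing in for whatever the actual formula prescribes. All names and the ridge form are assumptions.

```python
import numpy as np

def regression_projection_matrix(X, Yp, alpha=0.1, beta=0.01):
    """Step 3 (sketch): fit an n x d projection matrix A by regularized
    least squares so that X.T @ A approximates the CPP vectors Yp.

    X  : n x m training matrix (genes x samples)
    Yp : m x d matrix whose columns are the vectors y'_l
    alpha, beta : stand-ins for the two regression coefficients; summing
    them into one ridge weight is an assumption, not the patent formula."""
    n, _ = X.shape
    ridge = (alpha + beta) * np.eye(n)
    # Normal equations of min ||X.T a_l - y'_l||^2 + (alpha+beta)||a_l||^2.
    # For very large n one would solve the equivalent m-dimensional dual
    # system instead; the direct form keeps the sketch close to the stated
    # n x d projection matrix A.
    A = np.linalg.solve(X @ X.T + ridge, X @ Yp)   # n x d
    return A
```

Fitting A through a regression of this kind avoids inverting the singular scatter matrices that plain CPP or LDA would need on n >> m gene expression data, which is the point of step 3.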
Step 4, compute the training sample feature set Y from the training sample gene expression data X and the projection matrix A.
Project the gene expression level vector x_i of the i-th training sample onto the projection matrix A to obtain the d-dimensional feature vector of the i-th training sample, y_i = A^T x_i, and the training sample feature set Y = A^T X = [y_i | i = 1, 2, …, m], where A^T denotes the transpose of the projection matrix A.
Step 5, give the test sample gene expression data set.
Given the expression data of p test samples on n genes, represented by an n × p matrix U, the rows of U correspond to genes and the columns to test samples: the element U_kj of U is the expression level of the j-th test sample on the k-th gene, and the j-th column u_j of U is the expression level vector of the j-th test sample on the n genes, j = 1, 2, …, p.
Step 6, compute the test sample feature set Q from the test sample gene expression data U and the projection matrix A.
Project the gene expression level vector u_j of the j-th test sample onto the projection matrix A to obtain the d-dimensional feature vector of the j-th test sample, q_j = A^T u_j, and the test sample feature set Q = A^T U = [q_j | j = 1, 2, …, p].
Step 7, classify the test samples with a nearest-neighbor classifier, using the training sample feature set Y and the test sample feature set Q.
(7.1) Compute the Euclidean distance L(q_j, y_i) from the feature vector of test sample j to the feature vector of each training sample i, i = 1, 2, …, m, j = 1, 2, …, p;
(7.2) Take the class of the training sample with the smallest Euclidean distance L(q_j, y_i) as the class of test sample j.
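Steps 4 through 7 amount to two matrix products and a nearest-neighbor search. A minimal sketch, assuming the projection matrix A from step 3 and the plain Euclidean distance of (7.1)-(7.2); function and variable names are illustrative:

```python
import numpy as np

def nearest_neighbor_classify(A, X, train_classes, U):
    """Steps 4-7 (sketch): project both sample sets with A and give every
    test sample the class of its nearest training sample."""
    Y = A.T @ X                                 # d x m training feature set
    Q = A.T @ U                                 # d x p test feature set

    # (7.1) Euclidean distances between all test / training feature vectors
    diff = Q[:, :, None] - Y[:, None, :]        # d x p x m
    dist = np.sqrt((diff ** 2).sum(axis=0))     # p x m

    # (7.2) class of the closest training sample for each test sample
    nearest = dist.argmin(axis=1)
    return np.asarray(train_classes)[nearest]
```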
The scheme and effects of the present invention are described in more detail through the following experimental examples. These examples are illustrative and are not intended to limit the scope of the invention.
Example 1: a classification experiment was performed on the first set of gene expression data.
This data set is gene expression data provided by the National Cancer Institute (NCI), with a gene space dimension of 2308, 64 samples, and 4 sample classes. Details of the data are given in Table 1.
TABLE 1 First set of experimental data
Data set name | Number of genes | Number of samples | Number of sample classes
First set of gene expression data | 2308 | 64 | 4
The steps for classifying the data in table 1 are as follows:
First, divide the 64 samples into training and test samples by 5-fold cross-validation, i.e., split the samples into 5 parts, take one part in turn as the test samples and the rest as the training samples; if there are p test samples, the number of training samples is m = 64 - p, the training sample gene expression data matrix X is 2308 × m, and the test sample gene expression data matrix U is 2308 × p.
Second, obtain d discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples from the class information of the m training samples and the 2308 × m gene expression data matrix X, using the class-preserving projection method; the feature dimension d is taken up to a maximum of 49, i.e. d = 1, 2, …, 49.
Third, obtain the 2308 × d projection matrix A from the d discriminative feature vectors of the training samples and the training sample gene expression data matrix X, by the regression optimization method.
Fourth, obtain the d × m training sample feature set Y = A^T X from the training sample gene expression data matrix X and the projection matrix A.
Fifth, obtain the d × p test sample feature set Q = A^T U from the test sample gene expression data matrix U and the projection matrix A.
Sixth, using the training sample feature set Y and the test sample feature set Q, compute the Euclidean distance from the d-dimensional feature vector of each test sample to the d-dimensional feature vectors of all training samples, and take the class of the training sample with the smallest Euclidean distance as the class of the test sample.
When d = 3, the experimental results show that the classification results of 63 of the 64 samples agree with their true classes, and only 1 sample is misclassified.
Seventh, compare the classification results of all samples on the d-dimensional features with their true classes and count the number of correctly classified samples, for d = 1, 2, …, 49. For d = 3, 63 of the 64 samples are correctly classified, and the classification recognition rate reaches 98.44%.
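The 5-fold protocol of the first through seventh steps above can be sketched as follows, reusing the earlier sketches (cpp_discriminative_vectors, regression_projection_matrix, nearest_neighbor_classify); loading of the actual NCI data is omitted, and the stratified splitting and fixed random seed are conveniences not specified in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def recognition_rate(data, labels, d, alpha=0.1, beta=0.01):
    """5-fold cross-validation sketch for one feature dimension d.

    data   : n x N matrix (genes x samples), e.g. 2308 x 64 for Table 1
    labels : length-N array of sample classes"""
    labels = np.asarray(labels)
    correct = 0
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(data.T, labels):
        X, U = data[:, train_idx], data[:, test_idx]
        Yp = cpp_discriminative_vectors(labels[train_idx], d)          # step 2
        A = regression_projection_matrix(X, Yp, alpha, beta)           # step 3
        pred = nearest_neighbor_classify(A, X, labels[train_idx], U)   # steps 4-7
        correct += int((pred == labels[test_idx]).sum())
    return correct / len(labels)
```

Running such a loop for each feature dimension d = 1, …, 49 and recording the fraction of correctly classified samples yields the kind of recognition-rate curve plotted in Fig. 2.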
The result of classifying the data in Table 1 with the method of the invention is recorded as classification result 1. The result obtained by extracting features with the CPP method and then classifying the samples with a nearest-neighbor classifier is recorded as classification result 2, denoted CPP + nearest neighbor. The classification recognition rates of classification results 1 and 2 are compared in Fig. 2, where the abscissa is the feature dimension of the samples and the ordinate is the classification recognition rate.
Example 2: a classification experiment was performed on the second set of gene expression data.
This data set consists of the expression profiles of 16063 gene fragments for 190 tissue samples of different cancer types, covering 14 cancer types. Details of the data are given in Table 2.
TABLE 2 Second set of experimental data
Data set name | Number of genes | Number of samples | Number of sample classes
Second set of gene expression data | 16063 | 190 | 14
The procedure for classifying the data in Table 2 is the same as in Example 1. The feature dimension is taken up to a maximum of 150.
The result of classifying the data in Table 2 with the method of the invention is recorded as classification result 3. The result obtained by extracting features with the CPP method and then classifying the samples with a nearest-neighbor classifier is recorded as classification result 4, denoted CPP + nearest neighbor. The classification recognition rates of classification results 3 and 4 are compared in Fig. 3, where the abscissa is the feature dimension of the samples and the ordinate is the classification recognition rate.
From Figs. 2 and 3, the following conclusions can be drawn:
a) The classification recognition rate of the method of the invention is clearly better than that of the nearest-neighbor classifier applied after CPP feature extraction; it is relatively stable and non-decreasing as the feature dimension increases, whereas the recognition rate of CPP + nearest neighbor drops markedly as the feature dimension grows beyond the point of its highest recognition rate.
b) For the first set of gene expression data, the recognition rate of the proposed method reaches its maximum of 98.44% at feature dimension 3, while the highest recognition rate of nearest-neighbor classification after CPP feature extraction is 96.88%.
c) For the second set of gene expression data, the maximum recognition rate of the nearest-neighbor classifier after CPP feature extraction is 62.63%, reached at feature dimension 16, while the recognition rate of the proposed method reaches 65.79%, with a maximum of 66.32% at feature dimension 13.

Claims (1)

1. A supervised gene expression data classification method is characterized by comprising the following steps:
(1) Let the training sample gene expression data set be X = {x_i | i = 1, 2, …, m}, where x_i is an n-dimensional column vector representing the expression level vector of the i-th training sample on n genes, and m is the number of training samples; denote the class of the i-th training sample as c_i;
(2) Obtain the discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples by the class-preserving projection method, where d is the number of discriminative feature vectors and 1 ≤ d < n:
(2.1) Define the value W_1(i,t) of the m × m homogeneous relation matrix W_1 and the value W_2(i,t) of the m × m heterogeneous relation matrix W_2 as follows:
where c_t denotes the class of the t-th training sample;
(2.2) Compute the diagonal element values of the m × m homogeneous diagonal matrix D_1 and of the m × m heterogeneous diagonal matrix D_2;
The off-diagonal elements of the homogeneous diagonal matrix D_1 and of the heterogeneous diagonal matrix D_2 are all 0;
(2.3) Compute the m × m within-class scatter matrix L_1 and the m × m between-class scatter matrix L_2:
L_1 = D_1 - W_1
L_2 = D_2 - W_2
(2.4) Define the generalized eigen-equation L_1 y' = λ L_2 y', where λ is an eigenvalue and y' is an m-dimensional eigenvector;
(2.5) Solve for the eigenvectors corresponding to the d smallest eigenvalues of the generalized eigen-equation and take them as the d discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples;
(3) Using the discriminative feature vectors y'_l, obtain an n × d projection matrix A by a regression optimization method:
(3.1) Let a_l be the n-dimensional optimal projection vector, and define the regression optimization formula as follows:
where α and β are two regression coefficients with different values, satisfying 0 < α < 1, 0 < β < 1, and α ≠ β;
(3.2) Solve the regression optimization formula to obtain d n-dimensional projection vectors a_l, l = 1, 2, …, d, which form the n × d projection matrix A = [a_1, a_2, …, a_d];
(4) Project the gene expression level vector x_i of the i-th training sample onto the projection matrix A to obtain the feature vector of the i-th training sample, y_i = A^T x_i, where A^T denotes the transpose of the projection matrix A; the training sample feature set is Y = {y_i | i = 1, 2, …, m};
(5) Let the test sample gene expression data set be U = {u_j | j = 1, 2, …, p}, where u_j is an n-dimensional column vector representing the expression level vector of the j-th test sample on n genes, and p is the number of test samples;
(6) Project the gene expression level vector u_j of the j-th test sample onto the projection matrix A to obtain the feature vector of the j-th test sample, q_j = A^T u_j, where A^T denotes the transpose of the projection matrix A; the test sample feature set is Q = {q_j | j = 1, 2, …, p};
(7) Classify the test samples with a nearest-neighbor classifier: compute the Euclidean distance from the feature vector q_j of the j-th test sample to each training sample feature vector y_i, and take the class of the training sample with the smallest Euclidean distance as the class of the j-th test sample.
CN201410817036.9A 2014-12-24 2014-12-24 Supervised gene expression data classification method Active CN104504305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410817036.9A CN104504305B (en) 2014-12-24 2014-12-24 Supervised gene expression data classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410817036.9A CN104504305B (en) 2014-12-24 2014-12-24 Supervised gene expression data classification method

Publications (2)

Publication Number Publication Date
CN104504305A CN104504305A (en) 2015-04-08
CN104504305B true CN104504305B (en) 2018-03-06

Family

ID=52945702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410817036.9A Active CN104504305B (en) 2014-12-24 2014-12-24 Supervised gene expression data classification method

Country Status (1)

Country Link
CN (1) CN104504305B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537279A (en) * 2014-12-22 2015-04-22 中国科学院深圳先进技术研究院 Sequence clustering method and device
CN106407664B (en) * 2016-08-31 2018-11-23 深圳市中识健康科技有限公司 The domain-adaptive device of breath diagnosis system
CN113223613A (en) * 2021-05-14 2021-08-06 西安电子科技大学 Cancer detection method based on multi-dimensional single nucleotide variation characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079103A (en) * 2007-06-14 2007-11-28 上海交通大学 Human face posture identification method based on sparse Bayesian regression
CN101916376A (en) * 2010-07-06 2010-12-15 浙江大学 Local spline embedding-based orthogonal semi-monitoring subspace image classification method
CN102289685A (en) * 2011-08-04 2011-12-21 中山大学 Behavior identification method for rank-1 tensor projection based on canonical return

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079103A (en) * 2007-06-14 2007-11-28 上海交通大学 Human face posture identification method based on sparse Bayesian regression
CN101916376A (en) * 2010-07-06 2010-12-15 浙江大学 Local spline embedding-based orthogonal semi-monitoring subspace image classification method
CN102289685A (en) * 2011-08-04 2011-12-21 中山大学 Behavior identification method for rank-1 tensor projection based on canonical return

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Tensor-based sparsity preserving projection method for dimensionality reduction; Qiu Xintao et al.; China Sciencepaper (中国科技论文); 2013-10-31; Vol. 8, No. 10; pp. 1007-1010 *
Kernel-based class non-local preserving projection; Wang Wenjun et al.; Pattern Recognition and Artificial Intelligence (模式识别与人工智能); 2009-10-31; Vol. 22, No. 5; pp. 769-773 *
A new feature extraction method for gene expression data based on class-preserving projection; Wang Wenjun; Acta Electronica Sinica (电子学报); 2012-02-29; Vol. 40, No. 2; Section 3 of the text *
Dimensionality reduction method for gene expression data based on class-preserving projection; Wang Wenjun et al.; Journal of Sichuan University (Engineering Science Edition) (四川大学学报(工程科学版)); 2009-11-30; Vol. 41, No. 6; Section 1 of the text *
Supervised sparsity preserving projection; Xiang Wennan et al.; Computer Engineering and Applications (计算机工程与应用); 2011-10-31; Vol. 47, No. 29; pp. 186-188 *

Also Published As

Publication number Publication date
CN104504305A (en) 2015-04-08

Similar Documents

Publication Publication Date Title
Schwartz et al. Face identification using large feature sets
Bi et al. Efficient multi-label classification with many labels
CN103093235B (en) A kind of Handwritten Numeral Recognition Method based on improving distance core principle component analysis
Reza et al. ICA and PCA integrated feature extraction for classification
CN104616000B (en) A kind of face identification method and device
CN105631433B (en) A kind of face identification method of bidimensional linear discriminant analysis
Zhao et al. Semantic parts based top-down pyramid for action recognition
CN103745205A (en) Gait recognition method based on multi-linear mean component analysis
Diaf et al. Non-parametric Fisher’s discriminant analysis with kernels for data classification
CN107451545A (en) The face identification method of Non-negative Matrix Factorization is differentiated based on multichannel under soft label
Guo et al. Deep embedded K-means clustering
Zhang et al. A sparse and discriminative tensor to vector projection for human gait feature representation
Ouyed et al. Feature weighting for multinomial kernel logistic regression and application to action recognition
CN104504305B (en) Supervise Classification of Gene Expression Data method
CN103793600B (en) Classifier model generating method for gene microarray data
Bolagh et al. Subject selection on a Riemannian manifold for unsupervised cross-subject seizure detection
Rustam et al. Correlated based SVM-RFE as feature selection for cancer classification using microarray databases
CN102930258B (en) A kind of facial image recognition method
Zhao et al. A block coordinate descent approach for sparse principal component analysis
Wu et al. Handwritten digit classification using the mnist data set
Su et al. Order-preserving wasserstein discriminant analysis
Kim A pre-clustering technique for optimizing subclass discriminant analysis
Deng et al. A minimax probabilistic approach to feature transformation for multi-class data
CN114300049A (en) Gene expression data classification method based on similarity sequence maintenance
Zhang et al. A linear discriminant analysis method based on mutual information maximization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant