
CN104504305B - Supervised gene expression data classification method - Google Patents

Supervised gene expression data classification method

Info

Publication number
CN104504305B
CN104504305B
Authority
CN
China
Prior art keywords
matrix
training sample
classification
gene expression
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410817036.9A
Other languages
Chinese (zh)
Other versions
CN104504305A (en)
Inventor
王文俊 (Wang Wenjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201410817036.9A priority Critical patent/CN104504305B/en
Publication of CN104504305A publication Critical patent/CN104504305A/en
Application granted granted Critical
Publication of CN104504305B publication Critical patent/CN104504305B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a supervised gene expression data classification method, which mainly addresses the curse of dimensionality, loss of information, and complex classifier design that arise when the prior art classifies gene expression data. The technical scheme is: 1. obtain the discriminative feature vectors of the training samples by the class-preserving projection method; 2. use the discriminative feature vectors of the training samples to obtain a projection matrix by a regression optimization method; 3. obtain the training sample feature set and the test sample feature set through the projection matrix; 4. using the training sample feature set and the test sample feature set, classify the test samples with a nearest-neighbor classifier. The invention overcomes the matrix singularity and overfitting problems of the class-preserving projection method and improves the accuracy of gene expression data classification; it can be used for tumor identification and cancer classification in bioinformatics.

Description

Supervised gene expression data classification method
Technical Field
The invention belongs to the technical field of data processing, and relates to a supervised gene expression data classification method which can be used for tumor identification and tumor subtype classification in bioinformatics.
Background
With the development of gene chip technology, a huge amount of gene expression data has been generated, and how to obtain useful information from this massive data has become a hot topic in bioinformatics research. Classification is one of the important means of mining biological information from gene expression data, but the high-dimension, small-sample nature of gene expression data brings the curse of dimensionality to its classification. To overcome this problem, gene expression data are usually first subjected to gene selection or feature extraction and then classified with a conventional classifier. Many gene selection methods exist, but for different tumor classification tasks the various gene selection algorithms have no uniform standard, and a poorly designed gene selection algorithm may discard informative genes that are useful for classification, degrading classification performance. Feature extraction methods for gene expression data classification fall mainly into two categories:
(1) Unsupervised feature extraction methods, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization (NMF), and Locality Preserving Projections (LPP). These methods do not use the class information of the samples; effective classification features usually have to be extracted with the help of some discriminative feature extraction method, or a more complex classifier such as a Support Vector Machine (SVM) has to be adopted to improve classification performance, which increases the complexity of classification and recognition.
(2) Supervised feature extraction methods. The classical supervised feature extraction method is Linear Discriminant Analysis (LDA), but for the high-dimension, small-sample characteristics of gene expression data, LDA suffers from singular scatter matrices, overfitting, and an optimal subspace dimension limited by the number of sample classes, which restricts its application. Class-preserving projection (CPP) is a supervised feature extraction method proposed in 2012; see Wang Wenjun, "A new feature extraction method for gene expression data based on class-preserving projection", Acta Electronica Sinica, 40(2): 358-364, 2012. CPP effectively removes the limitation of the optimal subspace dimension by the number of sample classes, but for the high-dimension, small-sample characteristics of gene expression data it still suffers from matrix singularity and overfitting.
Disclosure of Invention
The purpose of the invention is to overcome the matrix singularity and overfitting problems of the class-preserving projection method, and to provide a new supervised gene expression data classification method that improves the accuracy of gene expression data classification.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) Let the training sample gene expression data set be X = {x_i | i = 1, 2, …, m}, where x_i is an n-dimensional column vector representing the expression level vector of the i-th training sample on n genes, and m is the number of training samples; denote the class of the i-th training sample as c_i;
(2) Obtain the discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples by the class-preserving projection method, where d is the number of discriminative feature vectors and 1 ≤ d < n;
(3) Using the discriminative feature vectors y'_l, obtain an n × d projection matrix A by a regression optimization method;
(4) Project the gene expression level vector x_i of the i-th training sample onto the projection matrix A to obtain the feature vector of the i-th training sample, y_i = A^T x_i, where A^T denotes the transpose of the projection matrix A; the training sample feature set is Y = {y_i | i = 1, 2, …, m};
(5) Let the test sample gene expression data set be U = {u_j | j = 1, 2, …, p}, where u_j is an n-dimensional column vector representing the expression level vector of the j-th test sample on n genes, and p is the number of test samples;
(6) Project the gene expression level vector u_j of the j-th test sample onto the projection matrix A to obtain the feature vector of the j-th test sample, q_j = A^T u_j, where A^T denotes the transpose of the projection matrix A; the test sample feature set is Q = {q_j | j = 1, 2, …, p};
(7) Classify the test samples with a nearest-neighbor classifier: compute the Euclidean distance from the feature vector q_j of the j-th test sample to each training sample feature vector y_i, and take the class of the training sample with the smallest Euclidean distance as the class of the j-th test sample.
Compared with the prior art, the invention has the following advantages:
1) By recasting the class-preserving projection method in a regression framework, the invention overcomes the matrix singularity and overfitting problems of the class-preserving projection method;
2) The invention extracts the classification features of the samples by making use of the sample class information, which reduces the burden of classifier design and improves the accuracy of gene expression data classification.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a graph of the classification correct recognition rate for a first set of gene expression data used in the simulation of the present invention;
FIG. 3 is a graph of the classification correct identification rate of a second set of gene expression data used in the simulation of the present invention.
Detailed Description
Referring to fig. 1, the specific implementation steps of the present invention are as follows:
Step 1, give the training sample gene expression data and the class information of the training samples.
Given the expression data of m training samples on n genes, represented by an n × m matrix X, the rows of X correspond to genes and the columns to training samples: the element X_ki of X is the expression level of the i-th training sample on the k-th gene, and the i-th column x_i of X is the expression level vector of the i-th training sample on the n genes, i = 1, 2, …, m, k = 1, 2, …, n;
class information C = { C } for a given m training samples i I =1,2, \8230;, m }, wherein c i Representing the class of the ith training sample.
Step 2, obtain the discriminative feature vectors y'_l of the training samples by the class-preserving projection method, from the class information C and the gene expression data X of the training samples.
(2.1) Define the value W_1(i,t) of the m × m homogeneous relation matrix W_1 and the value W_2(i,t) of the m × m heterogeneous relation matrix W_2 as follows:
where c_t denotes the class of the t-th training sample;
(2.2) Compute the diagonal element values of the m × m homogeneous diagonal matrix D_1 and of the m × m heterogeneous diagonal matrix D_2;
The off-diagonal elements of the homogeneous diagonal matrix D_1 and of the heterogeneous diagonal matrix D_2 are all 0;
(2.3) Compute the m × m within-class scatter matrix L_1 and the m × m between-class scatter matrix L_2:
L_1 = D_1 - W_1
L_2 = D_2 - W_2
(2.4) Define the generalized eigen-equation L_1 y' = λ L_2 y', where λ is an eigenvalue and y' is an m-dimensional eigenvector;
(2.5) Solve for the eigenvectors corresponding to the d smallest eigenvalues of the generalized eigen-equation and take them as the d discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples, where d is the number of discriminative feature vectors and 1 ≤ d < n.
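Sub-steps (2.1)-(2.5) can be sketched in a few lines of NumPy/SciPy. The entry formulas for W_1, W_2, D_1 and D_2 are not reproduced in the text above, so this sketch assumes the usual construction for such relation matrices: W_1 marks same-class sample pairs with 1, W_2 marks different-class pairs with 1, and D_1, D_2 carry the corresponding row sums on their diagonals. The function name and the small ridge added to L_2 for numerical stability are illustrative choices, not part of the patent.

```python
import numpy as np
from scipy.linalg import eigh

def cpp_discriminative_vectors(classes, d):
    """Steps (2.1)-(2.5): class-preserving projection (CPP) vectors.

    The 0/1 construction of W1/W2 and the row-sum diagonals of D1/D2 are
    assumptions; the entry formulas are not reproduced in the text."""
    c = np.asarray(classes)
    m = len(c)

    # (2.1) homogeneous / heterogeneous relation matrices (assumed 0/1)
    W1 = (c[:, None] == c[None, :]).astype(float)
    W2 = 1.0 - W1

    # (2.2) diagonal matrices holding the row sums of W1 and W2
    D1 = np.diag(W1.sum(axis=1))
    D2 = np.diag(W2.sum(axis=1))

    # (2.3) within-class and between-class scatter matrices
    L1 = D1 - W1
    L2 = D2 - W2

    # (2.4)-(2.5) generalized eigenproblem L1 y' = lambda L2 y'; keep the
    # eigenvectors of the d smallest eigenvalues.  A tiny ridge keeps the
    # solver stable when L2 is only positive semi-definite.
    vals, vecs = eigh(L1, L2 + 1e-8 * np.eye(m))
    order = np.argsort(vals)[:d]
    return vecs[:, order]          # m x d matrix; columns are the y'_l
```

The returned columns play the role of the discriminative feature vectors y'_l consumed by step 3.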
Step 3, using the discriminative feature vectors y'_l of the training samples and the training sample gene expression data X, obtain an n × d projection matrix A by a regression optimization method.
(3.1) Let a_l be the n-dimensional optimal projection vector, and define the regression optimization formula as follows:
where α and β are two regression coefficients with different values, satisfying 0 < α < 1, 0 < β < 1, and α ≠ β;
(3.2) Solve the regression optimization formula to obtain d n-dimensional projection vectors a_l, l = 1, 2, …, d, which form the n × d projection matrix A = [a_1, a_2, …, a_d].
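The regression optimization formula of sub-step (3.1) is likewise not reproduced in the text, so the following sketch only illustrates the general idea of recasting the projection step as a regression problem: each column a_l of A is fit so that X^T a_l approximates the CPP vector y'_l, with a regularization term (here a simple ridge penalty built from α and β) standing in for whatever the actual formula prescribes. All names and the ridge form are assumptions.

```python
import numpy as np

def regression_projection_matrix(X, Yp, alpha=0.1, beta=0.01):
    """Step 3 (sketch): fit an n x d projection matrix A by regularized
    least squares so that X.T @ A approximates the CPP vectors Yp.

    X  : n x m training matrix (genes x samples)
    Yp : m x d matrix whose columns are the vectors y'_l
    alpha, beta : stand-ins for the two regression coefficients; summing
    them into one ridge weight is an assumption, not the patent formula."""
    n, _ = X.shape
    ridge = (alpha + beta) * np.eye(n)
    # Normal equations of min ||X.T a_l - y'_l||^2 + (alpha+beta)||a_l||^2.
    # For very large n one would solve the equivalent m-dimensional dual
    # system instead; the direct form keeps the sketch close to the stated
    # n x d projection matrix A.
    A = np.linalg.solve(X @ X.T + ridge, X @ Yp)   # n x d
    return A
```

Fitting A through a regression of this kind avoids inverting the singular scatter matrices that plain CPP or LDA would need on n >> m gene expression data, which is the point of step 3.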
Step 4, compute the training sample feature set Y from the training sample gene expression data X and the projection matrix A.
Project the gene expression level vector x_i of the i-th training sample onto the projection matrix A to obtain the d-dimensional feature vector of the i-th training sample, y_i = A^T x_i, and the training sample feature set Y = A^T X = [y_i | i = 1, 2, …, m], where A^T denotes the transpose of the projection matrix A.
Step 5, give the test sample gene expression data set.
Given the expression data of p test samples on n genes, represented by an n × p matrix U, the rows of U correspond to genes and the columns to test samples: the element U_kj of U is the expression level of the j-th test sample on the k-th gene, and the j-th column u_j of U is the expression level vector of the j-th test sample on the n genes, j = 1, 2, …, p.
Step 6, compute the test sample feature set Q from the test sample gene expression data U and the projection matrix A.
Project the gene expression level vector u_j of the j-th test sample onto the projection matrix A to obtain the d-dimensional feature vector of the j-th test sample, q_j = A^T u_j, and the test sample feature set Q = A^T U = [q_j | j = 1, 2, …, p].
Step 7, classify the test samples with a nearest-neighbor classifier, using the training sample feature set Y and the test sample feature set Q.
(7.1) Compute the Euclidean distance L(q_j, y_i) from the feature vector of test sample j to the feature vector of each training sample i, i = 1, 2, …, m, j = 1, 2, …, p;
(7.2) Take the class of the training sample with the smallest Euclidean distance L(q_j, y_i) as the class of test sample j.
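Steps 4 through 7 amount to two matrix products and a nearest-neighbor search. A minimal sketch, assuming the projection matrix A from step 3 and the plain Euclidean distance of (7.1)-(7.2); function and variable names are illustrative:

```python
import numpy as np

def nearest_neighbor_classify(A, X, train_classes, U):
    """Steps 4-7 (sketch): project both sample sets with A and give every
    test sample the class of its nearest training sample."""
    Y = A.T @ X                                 # d x m training feature set
    Q = A.T @ U                                 # d x p test feature set

    # (7.1) Euclidean distances between all test / training feature vectors
    diff = Q[:, :, None] - Y[:, None, :]        # d x p x m
    dist = np.sqrt((diff ** 2).sum(axis=0))     # p x m

    # (7.2) class of the closest training sample for each test sample
    nearest = dist.argmin(axis=1)
    return np.asarray(train_classes)[nearest]
```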
The scheme and effects of the present invention are described in more detail through the following experimental examples. These examples are illustrative and are not intended to limit the scope of the invention.
Example 1: a classification experiment was performed on the first set of gene expression data.
This data set is gene expression data provided by the National Cancer Institute (NCI), with a gene space dimension of 2308, 64 samples, and 4 sample classes. Details of the data are given in Table 1.
TABLE 1 First set of experimental data
Data set name | Number of genes | Number of samples | Number of sample classes
First set of gene expression data | 2308 | 64 | 4
The steps for classifying the data in table 1 are as follows:
First, divide the 64 samples into training and test samples by 5-fold cross-validation, i.e., split the samples into 5 parts, take one part in turn as the test samples and the rest as the training samples; if there are p test samples, the number of training samples is m = 64 - p, the training sample gene expression data matrix X is 2308 × m, and the test sample gene expression data matrix U is 2308 × p.
Second, obtain d discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples from the class information of the m training samples and the 2308 × m gene expression data matrix X, using the class-preserving projection method; the feature dimension d is taken up to a maximum of 49, i.e. d = 1, 2, …, 49.
Third, obtain the 2308 × d projection matrix A from the d discriminative feature vectors of the training samples and the training sample gene expression data matrix X, by the regression optimization method.
Fourth, obtain the d × m training sample feature set Y = A^T X from the training sample gene expression data matrix X and the projection matrix A.
Fifth, obtain the d × p test sample feature set Q = A^T U from the test sample gene expression data matrix U and the projection matrix A.
Sixth, using the training sample feature set Y and the test sample feature set Q, compute the Euclidean distance from the d-dimensional feature vector of each test sample to the d-dimensional feature vectors of all training samples, and take the class of the training sample with the smallest Euclidean distance as the class of the test sample.
When d = 3, the experimental results show that the classification results of 63 of the 64 samples agree with their true classes, and only 1 sample is misclassified.
Seventh, compare the classification results of all samples on the d-dimensional features with their true classes and count the number of correctly classified samples, for d = 1, 2, …, 49. For d = 3, 63 of the 64 samples are correctly classified, and the classification recognition rate reaches 98.44%.
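The 5-fold protocol of the first through seventh steps above can be sketched as follows, reusing the earlier sketches (cpp_discriminative_vectors, regression_projection_matrix, nearest_neighbor_classify); loading of the actual NCI data is omitted, and the stratified splitting and fixed random seed are conveniences not specified in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def recognition_rate(data, labels, d, alpha=0.1, beta=0.01):
    """5-fold cross-validation sketch for one feature dimension d.

    data   : n x N matrix (genes x samples), e.g. 2308 x 64 for Table 1
    labels : length-N array of sample classes"""
    labels = np.asarray(labels)
    correct = 0
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(data.T, labels):
        X, U = data[:, train_idx], data[:, test_idx]
        Yp = cpp_discriminative_vectors(labels[train_idx], d)          # step 2
        A = regression_projection_matrix(X, Yp, alpha, beta)           # step 3
        pred = nearest_neighbor_classify(A, X, labels[train_idx], U)   # steps 4-7
        correct += int((pred == labels[test_idx]).sum())
    return correct / len(labels)
```

Running such a loop for each feature dimension d = 1, …, 49 and recording the fraction of correctly classified samples yields the kind of recognition-rate curve plotted in Fig. 2.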
The result of classifying the data in Table 1 with the method of the invention is recorded as classification result 1. The result obtained by extracting features with the CPP method and then classifying the samples with a nearest-neighbor classifier is recorded as classification result 2, denoted CPP + nearest neighbor. The classification recognition rates of classification results 1 and 2 are compared in Fig. 2, where the abscissa is the feature dimension of the samples and the ordinate is the classification recognition rate.
Example 2: a classification experiment was performed on the second set of gene expression data.
This data set consists of the expression profiles of 16063 gene fragments for 190 tissue samples of different cancer types, covering 14 cancer types. Details of the data are given in Table 2.
TABLE 2 Second set of experimental data
Data set name | Number of genes | Number of samples | Number of sample classes
Second set of gene expression data | 16063 | 190 | 14
The procedure for classifying the data in Table 2 is the same as in Example 1. The feature dimension is taken up to a maximum of 150.
The result of classifying the data in Table 2 with the method of the invention is recorded as classification result 3. The result obtained by extracting features with the CPP method and then classifying the samples with a nearest-neighbor classifier is recorded as classification result 4, denoted CPP + nearest neighbor. The classification recognition rates of classification results 3 and 4 are compared in Fig. 3, where the abscissa is the feature dimension of the samples and the ordinate is the classification recognition rate.
From Figs. 2 and 3, the following conclusions can be drawn:
a) The classification recognition rate of the method of the invention is clearly better than that of the nearest-neighbor classifier applied after CPP feature extraction; it is relatively stable and non-decreasing as the feature dimension increases, whereas the recognition rate of CPP + nearest neighbor drops markedly as the feature dimension grows beyond the point of its highest recognition rate.
b) For the first set of gene expression data, the recognition rate of the proposed method reaches its maximum of 98.44% at feature dimension 3, while the highest recognition rate of nearest-neighbor classification after CPP feature extraction is 96.88%.
c) For the second set of gene expression data, the maximum recognition rate of the nearest-neighbor classifier after CPP feature extraction is 62.63%, reached at feature dimension 16, while the recognition rate of the proposed method reaches 65.79%, with a maximum of 66.32% at feature dimension 13.

Claims (1)

1. A supervised gene expression data classification method is characterized by comprising the following steps:
(1) Let the training sample gene expression data set be X = {x_i | i = 1, 2, …, m}, where x_i is an n-dimensional column vector representing the expression level vector of the i-th training sample on n genes, and m is the number of training samples; denote the class of the i-th training sample as c_i;
(2) Obtain the discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples by the class-preserving projection method, where d is the number of discriminative feature vectors and 1 ≤ d < n:
(2.1) Define the value W_1(i,t) of the m × m homogeneous relation matrix W_1 and the value W_2(i,t) of the m × m heterogeneous relation matrix W_2 as follows:
where c_t denotes the class of the t-th training sample;
(2.2) Compute the diagonal element values of the m × m homogeneous diagonal matrix D_1 and of the m × m heterogeneous diagonal matrix D_2;
The off-diagonal elements of the homogeneous diagonal matrix D_1 and of the heterogeneous diagonal matrix D_2 are all 0;
(2.3) Compute the m × m within-class scatter matrix L_1 and the m × m between-class scatter matrix L_2:
L_1 = D_1 - W_1
L_2 = D_2 - W_2
(2.4) Define the generalized eigen-equation L_1 y' = λ L_2 y', where λ is an eigenvalue and y' is an m-dimensional eigenvector;
(2.5) Solve for the eigenvectors corresponding to the d smallest eigenvalues of the generalized eigen-equation and take them as the d discriminative feature vectors y'_l, l = 1, 2, …, d, of the training samples;
(3) Using the discriminative feature vectors y'_l, obtain an n × d projection matrix A by a regression optimization method:
(3.1) Let a_l be the n-dimensional optimal projection vector, and define the regression optimization formula as follows:
where α and β are two regression coefficients with different values, satisfying 0 < α < 1, 0 < β < 1, and α ≠ β;
(3.2) Solve the regression optimization formula to obtain d n-dimensional projection vectors a_l, l = 1, 2, …, d, which form the n × d projection matrix A = [a_1, a_2, …, a_d];
(4) Project the gene expression level vector x_i of the i-th training sample onto the projection matrix A to obtain the feature vector of the i-th training sample, y_i = A^T x_i, where A^T denotes the transpose of the projection matrix A; the training sample feature set is Y = {y_i | i = 1, 2, …, m};
(5) Let the test sample gene expression data set be U = {u_j | j = 1, 2, …, p}, where u_j is an n-dimensional column vector representing the expression level vector of the j-th test sample on n genes, and p is the number of test samples;
(6) Project the gene expression level vector u_j of the j-th test sample onto the projection matrix A to obtain the feature vector of the j-th test sample, q_j = A^T u_j, where A^T denotes the transpose of the projection matrix A; the test sample feature set is Q = {q_j | j = 1, 2, …, p};
(7) Classify the test samples with a nearest-neighbor classifier: compute the Euclidean distance from the feature vector q_j of the j-th test sample to each training sample feature vector y_i, and take the class of the training sample with the smallest Euclidean distance as the class of the j-th test sample.
CN201410817036.9A 2014-12-24 2014-12-24 Supervised gene expression data classification method Active CN104504305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410817036.9A CN104504305B (en) 2014-12-24 2014-12-24 Supervised gene expression data classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410817036.9A CN104504305B (en) 2014-12-24 2014-12-24 Supervised gene expression data classification method

Publications (2)

Publication Number Publication Date
CN104504305A CN104504305A (en) 2015-04-08
CN104504305B true CN104504305B (en) 2018-03-06

Family

ID=52945702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410817036.9A Active CN104504305B (en) 2014-12-24 2014-12-24 Supervised gene expression data classification method

Country Status (1)

Country Link
CN (1) CN104504305B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537279A (en) * 2014-12-22 2015-04-22 中国科学院深圳先进技术研究院 Sequence clustering method and device
CN106407664B (en) * 2016-08-31 2018-11-23 深圳市中识健康科技有限公司 The domain-adaptive device of breath diagnosis system
CN113223613A (en) * 2021-05-14 2021-08-06 西安电子科技大学 Cancer detection method based on multi-dimensional single nucleotide variation characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079103A (en) * 2007-06-14 2007-11-28 上海交通大学 Human face posture identification method based on sparse Bayesian regression
CN101916376A (en) * 2010-07-06 2010-12-15 浙江大学 Local spline embedding-based orthogonal semi-monitoring subspace image classification method
CN102289685A (en) * 2011-08-04 2011-12-21 中山大学 Behavior identification method for rank-1 tensor projection based on canonical return

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079103A (en) * 2007-06-14 2007-11-28 上海交通大学 Human face posture identification method based on sparse Bayesian regression
CN101916376A (en) * 2010-07-06 2010-12-15 浙江大学 Local spline embedding-based orthogonal semi-monitoring subspace image classification method
CN102289685A (en) * 2011-08-04 2011-12-21 中山大学 Behavior identification method for rank-1 tensor projection based on canonical return

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Tensor-based sparsity preserving projection method for dimensionality reduction; Qiu Xintao et al.; China Sciencepaper (中国科技论文); 2013-10-31; Vol. 8, No. 10; pp. 1007-1010 *
Kernel-based class non-local preserving projection; Wang Wenjun et al.; Pattern Recognition and Artificial Intelligence (模式识别与人工智能); 2009-10-31; Vol. 22, No. 5; pp. 769-773 *
A new feature extraction method for gene expression data based on class-preserving projection; Wang Wenjun; Acta Electronica Sinica (电子学报); 2012-02-29; Vol. 40, No. 2; Section 3 of the text *
Dimensionality reduction method for gene expression data based on class-preserving projection; Wang Wenjun et al.; Journal of Sichuan University (Engineering Science Edition) (四川大学学报(工程科学版)); 2009-11-30; Vol. 41, No. 6; Section 1 of the text *
Supervised sparsity preserving projection; Xiang Wennan et al.; Computer Engineering and Applications (计算机工程与应用); 2011-10-31; Vol. 47, No. 29; pp. 186-188 *

Also Published As

Publication number Publication date
CN104504305A (en) 2015-04-08

Similar Documents

Publication Publication Date Title
Schwartz et al. Face identification using large feature sets
Bi et al. Efficient multi-label classification with many labels
CN103093235B (en) A kind of Handwritten Numeral Recognition Method based on improving distance core principle component analysis
Reza et al. ICA and PCA integrated feature extraction for classification
CN104616000B (en) A kind of face identification method and device
CN105631433B (en) A kind of face identification method of bidimensional linear discriminant analysis
Zhao et al. Semantic parts based top-down pyramid for action recognition
CN103745205A (en) Gait recognition method based on multi-linear mean component analysis
Diaf et al. Non-parametric Fisher’s discriminant analysis with kernels for data classification
CN107451545A (en) The face identification method of Non-negative Matrix Factorization is differentiated based on multichannel under soft label
Guo et al. Deep embedded K-means clustering
Zhang et al. A sparse and discriminative tensor to vector projection for human gait feature representation
Ouyed et al. Feature weighting for multinomial kernel logistic regression and application to action recognition
CN104504305B (en) Supervise Classification of Gene Expression Data method
CN103793600B (en) Classifier model generating method for gene microarray data
Bolagh et al. Subject selection on a Riemannian manifold for unsupervised cross-subject seizure detection
Rustam et al. Correlated based SVM-RFE as feature selection for cancer classification using microarray databases
CN102930258B (en) A kind of facial image recognition method
Zhao et al. A block coordinate descent approach for sparse principal component analysis
Wu et al. Handwritten digit classification using the mnist data set
Su et al. Order-preserving wasserstein discriminant analysis
Kim A pre-clustering technique for optimizing subclass discriminant analysis
Deng et al. A minimax probabilistic approach to feature transformation for multi-class data
CN114300049A (en) Gene expression data classification method based on similarity sequence maintenance
Zhang et al. A linear discriminant analysis method based on mutual information maximization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant