CN116779044A - Gene classification method, system and equipment based on multi-tag feature selection

Info

Publication number: CN116779044A
Application number: CN202310810555.1A
Authority: CN
Inventors: 吴全旺; 李秀先; 张智勇; 曾洁; 周鹏
Original assignee: Chongqing University; China Merchants Testing Vehicle Technology Research Institute Co Ltd
Current assignee: Chongqing University; China Merchants Testing Vehicle Technology Research Institute Co Ltd
Priority date: 2023-07-04
Filing date: 2023-07-04
Publication date: 2023-09-19

Abstract

The application provides a gene classification method, system and device based on multi-tag feature selection. The method comprises the following steps: acquiring a gene expression data set, and performing feature selection on the gene expression data set by using a multi-tag feature selection model based on missing gene data to obtain a feature subset of the gene expression data set; training a gene classification model by using the feature subset of the gene expression data set to obtain a trained gene classification model; and classifying the gene data to be classified by using the trained gene classification model based on the feature subset of the gene data to be classified, to obtain the labels of the gene data to be classified. The method completes the feature selection of multi-tag gene data in the presence of missing data, considers all the dependency relationships between the gene tags and the features, and completes the gene classification based on the feature subset obtained by feature selection, thereby addressing the problem of low accuracy of existing multi-tag gene classification.

Description

Gene classification method, system and equipment based on multi-tag feature selection

Technical Field

The application belongs to the field of bioinformatics, and particularly relates to a gene classification method, system and equipment based on multi-tag feature selection.

Background

Along with the development of DNA microarray technology, a huge amount of gene expression data is generated, the gene expression data contains rich gene activity information, and analysis of hidden modes in the gene expression data has important significance for understanding and deducing biological gene functions, researching gene regulation mechanism and the like. How to analyze massive amounts of gene expression data effectively has become an important challenge in the field of bioinformatics.

Currently, gene classification is a very active area of research. Gene expression data is generally in matrix form and is characterized by high dimensionality, small sample sizes and multiple labels; the main characteristic of multi-tag gene data is that one sample can be associated with several labels simultaneously. Multi-tag data contains three kinds of variable relationships, namely tag-to-tag, feature-to-feature and tag-to-feature correlations. High-dimensional gene data with a large number of redundant features significantly increases the computational burden of multi-tag gene data classification and also leads to overfitting and performance degradation, so that the accuracy of the gene classification result is greatly reduced. Meanwhile, existing multi-tag gene feature selection methods pay insufficient attention to the role of the gene tags: they ignore the relationships among the tags and rarely reveal the potential causal mechanism of the gene tags. They study the correlation between features and tags, between tags, or between features in isolation, can rarely handle all three kinds of correlation at the same time, and ignore the mutual influence between features. As a result, the accuracy of gene classification remains low after feature selection with the prior-art multi-tag feature selection methods.

Therefore, how to perform multi-tag feature selection of gene data and improve the accuracy of gene classification is a problem to be solved in the art.

Disclosure of Invention

The application aims to overcome the defects of the prior art and provides a gene classification method, system and equipment based on multi-tag feature selection. The gene classification method based on multi-tag feature selection disclosed by the application completes the feature selection of multi-tag gene data in the presence of missing data, takes the causal relationships between gene tags into account, and alternates the missing-value interpolation process and the multi-tag Markov blanket (MB) learning process so that the two promote each other.

In order to achieve the above purpose, the present application adopts the following technical scheme:

The application provides a gene classification method based on multi-tag feature selection, which is characterized by comprising the following steps:

S1, constructing a multi-tag feature selection model based on missing gene data;

S2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;

The step S2 specifically comprises the following steps:

S21, acquiring a gene expression data set, wherein the gene expression data set contains m-dimensional features and q-dimensional labels;

S22, gene data interpolation;

S23, searching for the multi-tag Markov blanket (MB), and learning the MB of the class variables by adopting the multi-tag causal feature selection learning method MLMB;

S24, acquiring the extended-MB; the extended-MB consists of the union of the MB of the class variables and the MB of each variable in the MB of the class variables;

S25, updating the data set; the data set is updated to a new data set containing only the extended-MB;

S26, judging whether the iteration termination condition is met; stopping if it is met, and repeating steps S22-S25 if it is not;

S27, returning the feature subset of the gene expression data set after the iteration ends;

S3, training the gene classification model by using the feature subset of the gene expression data set to obtain a trained gene classification model;

S4, obtaining gene data to be classified, and classifying the gene data to be classified by using the trained gene classification model based on the feature subset of the gene data to be classified, to obtain the labels of the gene data to be classified.

Further, in the gene expression data set, $F = \{F_1, F_2, \ldots, F_m\}$ denotes the m-dimensional feature set, $Y = \{Y_1, Y_2, \ldots, Y_q\}$ denotes the q-dimensional tag set, $V = F \cup Y$, and $S$ is any set of variables within $V$; $V_i \perp V_j \mid S$ (where $i \neq j$) denotes that $V_i$ is conditionally independent of $V_j$ given $S$; $V \setminus Y_i$ and $V / Y_i$ are equivalent and denote all variables in $V$ except $Y_i$.

Further, the gene data interpolation in step S22 specifically includes: estimating the missing values by using the observed information of both the incomplete and the complete instances in the gene expression data set, and performing the interpolation of the gene data by the KNN or Lagrange interpolation method.

Further, step S23 includes:

(1) Mining the causal mechanism of each gene tag; the MB of each class tag $Y_i$ is learned from all features and all tags except $Y_i$, i.e., the MB of $Y_i$ is found from $V \setminus Y_i$, and the result is denoted as $MB(Y_i)$;

(2) Detecting features that are ignored due to tag correlation; specifically, features that are ignored due to strong tag correlation are detected from $F \setminus MB(Y_i)$;

(3) Correcting the obtained false features; specifically, the features in $MB(Y_i)$ are arranged in ascending order of their degree of association with $Y_i$, and the first k2% of features with the weakest association are selected and stored in SelFea($Y_i$); the q tags are traversed, and by detecting whether the MB of each feature in SelFea($Y_i$) contains $Y_i$, the false MB features are removed from SelFea($Y_i$).

Further, the gene classification model adopts MLKNN.

The application also provides a gene classification system based on multi-tag feature selection, which is characterized in that the gene classification system executes the gene classification method based on multi-tag feature selection described above, and the system comprises: a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module;

the multi-tag feature selection model construction module is used for constructing a multi-tag feature selection model based on the missing gene data;

the feature selection module is used for acquiring a gene expression data set, and performing feature selection on the gene expression data set by utilizing the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;

the gene classification model training module is used for training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;

the gene classification module is used for acquiring gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.

The application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the method when executing the computer program.

Compared with the prior art, the application has the following beneficial effects: the gene classification method based on multi-tag feature selection completes the feature selection of multi-tag gene data in the presence of missing data, takes the causal relationships among gene tags into account, alternates the mutually promoting missing-value interpolation process and multi-tag MB learning process, and at the same time considers all dependency relationships among the gene tags and the gene features; the gene classification is then completed based on the feature subset obtained by feature selection, which improves the accuracy of the gene classification and reduces the computational load of high-dimensional gene classification.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a gene classification method based on multi-tag feature selection according to an embodiment of the present application.

Fig. 2 is a schematic diagram of a multi-tag feature selection model based on missing gene data according to an embodiment of the present application.

Fig. 3 is a schematic flow chart of acquiring a multi-tag MB according to an embodiment of the present disclosure.

Fig. 4 is a schematic diagram of a gene classification system based on multi-tag feature selection according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The application is further described below with reference to the drawings and specific examples, which are not intended to be limiting.

It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

The application discloses a gene classification method based on multi-tag feature selection. As shown in fig. 1, the gene classification method based on the multi-tag feature selection includes the following steps S1 to S4.

S2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set; the flow of step S2 is shown in fig. 2;

The step S2 specifically comprises the following steps:

In the gene expression data set, $F = \{F_1, F_2, \ldots, F_m\}$ denotes the m-dimensional feature set, $Y = \{Y_1, Y_2, \ldots, Y_q\}$ denotes the q-dimensional tag set, $V = F \cup Y$, and $S$ is any set of variables within $V$; $V_i \perp V_j \mid S$ (where $i \neq j$) denotes that $V_i$ is conditionally independent of $V_j$ given $S$; $V \setminus Y_i$ and $V / Y_i$ are equivalent and denote all variables in $V$ except $Y_i$.
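The notation above relies on deciding whether $V_i$ is conditionally independent of $V_j$ given $S$, but the source does not say which statistical test is used. As a minimal sketch only, assuming continuous expression values and a conditioning set containing at most one variable, a partial-correlation test with Fisher's z-transform could look as follows. The function names and the critical value 1.96 (two-sided 5% level) are illustrative assumptions, not part of the application.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def cond_independent(x, y, z=None, critical=1.96):
    """Return True if x and y look independent, optionally given one variable z.

    Assumption: linear dependence and roughly Gaussian data, so that a
    (partial) correlation near zero is read as (conditional) independence.
    """
    n = len(x)
    r = pearson(x, y)
    cond_size = 0
    if z is not None:
        r_xz, r_yz = pearson(x, z), pearson(y, z)
        # First-order partial correlation of x and y given z.
        r = (r - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
        cond_size = 1
    # Clamp to keep atanh finite when the correlation is numerically +/-1.
    r = max(min(r, 0.999999), -0.999999)
    statistic = abs(math.atanh(r)) * math.sqrt(n - cond_size - 3)
    return statistic < critical
```

With discrete expression levels, a different test (for example a G-test or chi-square test) would be the more natural choice; the sketch only illustrates the role the independence relation plays in MB learning.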

S22, gene data interpolation;

Specifically, data interpolation uses the observed information of both the incomplete and the complete instances in the data set to estimate the missing values, which enables the multi-tag causal learning method to handle most of the missing values in the data set, especially when there are more incomplete instances than complete ones. Data interpolation thus provides an accurately filled data set that supports reliable MB learning.

In one embodiment, the interpolation of the gene data is performed using the KNN or Lagrange interpolation method.
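The KNN variant of step S22 can be sketched as follows. This is an illustrative sketch, not the implementation of the application: it assumes that missing entries are represented by `None`, that distances are Euclidean over the features observed in both rows, and that a missing entry is filled with the mean of its k nearest donors. The function name `knn_impute` and the default `k=3` are assumptions.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows observing that feature."""

    def distance(a, b):
        # Root-mean-square difference over the features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for t, other in enumerate(rows)
                if t != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:  # leave the gap if no comparable donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

Both complete and incomplete rows can act as donors here, which matches the statement that the observed information of incomplete and complete instances is used.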

S23, searching for a multi-label Markov blanket MB, and learning the MB of the class variable by adopting a multi-label causal feature selection learning method MLMB; the specific flow of step S23 is shown in fig. 3;

Specifically, the MB of each class variable is learned; the MB search is performed by the established multi-tag causal feature selection algorithm MLMB.

Step S23 includes:

(1) Mining the causal mechanism of each gene tag; the MB of each class tag $Y_i$ is learned from all features and all tags except $Y_i$, i.e., the MB of $Y_i$ is found from $V \setminus Y_i$, and the result is denoted as $MB(Y_i)$;

(2) Detecting features that are ignored due to tag correlation; specifically, features that are ignored due to strong tag correlation are detected from $F \setminus MB(Y_i)$.

Checking directly whether every feature has been ignored would be computationally expensive. To handle the high dimensionality, the features are arranged in descending order of their degree of association with $Y_i$, the top k1% of features with the highest correlation are selected and stored in sel($Y_i$), and the missing true MB features are then recovered selectively from sel($Y_i$).

(3) Correcting the obtained false features; specifically, the features in $MB(Y_i)$ are arranged in ascending order of their degree of association with $Y_i$, and the first k2% of features in $MB(Y_i)$ with the weakest association with $Y_i$ are stored in SelFea($Y_i$); the q tags are traversed, and by detecting whether the MB of each feature in SelFea($Y_i$) contains $Y_i$, the false MB features are removed from SelFea($Y_i$).

For each tag $Y_i \in Y$, $MB(Y_i)$ always contains a small fraction of erroneous features. Therefore, the features in $MB(Y_i)$ are arranged in ascending order of their association with $Y_i$, and the top k2% of features in $MB(Y_i)$ with the weakest correlation with $Y_i$ are stored in SelFea($Y_i$) for detection.
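The ranking used in sub-steps (2) and (3) and the symmetry check of sub-step (3) can be sketched as follows. The source does not name the measure of association, so absolute Pearson correlation is used here purely as an assumption; the helper names `abs_pearson`, `rank_slice` and `drop_false_members` are hypothetical. The sketch also assumes that a suspect feature whose own MB does not contain $Y_i$ is dropped from $MB(Y_i)$.

```python
import math

def abs_pearson(a, b):
    """Absolute Pearson correlation; 0.0 if either sequence is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)

def rank_slice(columns, target, percent, weakest=False):
    """Names of the top `percent`% of features by association with `target`.

    weakest=False gives the strongest features (the k1% candidates of sel);
    weakest=True gives the weakest ones (the k2% suspects of SelFea).
    """
    scores = {name: abs_pearson(col, target) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=not weakest)
    count = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:count]

def drop_false_members(label, mb_of_label, suspects, mb_of_feature):
    """Remove suspects whose own MB does not contain `label` (symmetry check)."""
    false_members = {
        f for f in suspects if label not in mb_of_feature.get(f, set())
    }
    return set(mb_of_label) - false_members
```

Restricting the expensive MB checks to a small percentage of ranked features is what keeps both sub-steps tractable on high-dimensional expression data.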

S24, acquiring extended-MB; the extended-MB consists of a MB of class variables and a union of MB of each variable in the class variable MB.

The rationale for this strategy is that, according to MB feature selection theory, for a noisy data set (here, one whose missing values have been filled with estimated values by the data interpolation method), the extended-MB of a class variable is more likely than the MB of the class variable alone to contain the causally informative features.
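The construction of step S24 amounts to a set union. A minimal sketch, assuming the learned Markov blankets are already available as a dictionary from variable name to a set of variable names (the name `extended_mb` and this data layout are assumptions):

```python
def extended_mb(target, mb):
    """Union of MB(target) and the MB of every variable inside MB(target).

    `mb` maps a variable name to the set of names in its Markov blanket.
    The target itself is excluded from the result.
    """
    extended = set(mb[target])
    for member in mb[target]:
        extended |= mb.get(member, set())
    extended.discard(target)
    return extended
```

For several class variables, the per-tag results can simply be united; whether other tags that appear in the result are kept as features is left to the caller in this sketch.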

S25, updating a data set; the dataset is updated to a new dataset that contains only extended-MBs.

Specifically, the new data set consists of the extended-MB found in step S24; this allows the data interpolation to fill only the missing values of the causally informative features rather than those of all features in the data set.

In a specific embodiment, the iteration termination condition of step S26 is that the data in the extended-MB no longer change.

S27, returning the feature subset of the gene expression data set after the iteration ends.
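The alternation of steps S22 to S27 can be sketched as a loop around two pluggable routines. This is a hedged outline rather than the application's implementation: `impute` and `learn_extended_mb` are hypothetical callables standing in for the interpolation of S22 and the MB search of S23 and S24, `max_iter` is an added safety bound, and the sketch assumes that the reduced data set of S25 keeps the original (still incomplete) columns so that they are interpolated again in the next round.

```python
def select_features(data, labels, impute, learn_extended_mb, max_iter=20):
    """Alternate interpolation and extended-MB learning until the subset is stable.

    data: dict mapping feature name -> column (None marks a missing value).
    impute: callable(columns) -> columns without missing values (S22).
    learn_extended_mb: callable(columns, labels) -> set of feature names (S23, S24).
    """
    current = dict(data)
    previous = None
    for _ in range(max_iter):
        completed = impute(current)                       # S22
        selected = learn_extended_mb(completed, labels)   # S23 and S24
        # S25: keep only the extended-MB columns, with their original gaps.
        current = {name: data[name] for name in selected}
        if selected == previous:                          # S26
            break
        previous = selected
    return sorted(current)                                # S27
```

Because each round interpolates only the columns that survived the previous round, the interpolation effort shrinks together with the feature subset.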

In a specific embodiment, the returned optimal feature subset is evaluated on a test set, and the resulting classification accuracy is used to verify the quality of the feature subset of the gene expression data set.

S3, training the gene classification model by using the feature subset of the gene expression data set to obtain a trained gene classification model.

The gene classification model adopts MLKNN. The feature subset of the gene expression data set is input into the MLKNN model for training, where the parameter k (the number of nearest neighbors) of the MLKNN model is set to 10 and the other parameters keep their default values, to obtain the MLKNN model trained on the optimized data set.
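For readers unfamiliar with MLKNN, the following is a compact from-scratch sketch of the commonly described ML-kNN procedure: per-tag priors, neighbor-count likelihoods estimated on the training set, and a maximum a posteriori decision. It is not the implementation used in the application; the smoothing value `s=1.0`, the Euclidean distance and the function names are assumptions, and only `k=10` comes from the text.

```python
import math

def _nearest(x, X, k, skip=None):
    """Indices of the k training rows closest to x (Euclidean distance)."""
    order = sorted(
        (math.dist(x, row), idx) for idx, row in enumerate(X) if idx != skip
    )
    return [idx for _, idx in order[:k]]

def mlknn_fit(X, Y, k=10, s=1.0):
    """X: list of feature vectors; Y: list of 0/1 tag vectors of length q."""
    n, q = len(X), len(Y[0])
    # Smoothed prior probability that a sample carries each tag.
    prior = [(s + sum(row[l] for row in Y)) / (2 * s + n) for l in range(q)]
    # counts[l][j]: training samples whose k neighbors hold tag l exactly j times,
    # kept separately for samples with and without tag l.
    with_tag = [[0] * (k + 1) for _ in range(q)]
    without_tag = [[0] * (k + 1) for _ in range(q)]
    for i in range(n):
        neighbors = _nearest(X[i], X, k, skip=i)
        for l in range(q):
            votes = sum(Y[t][l] for t in neighbors)
            if Y[i][l]:
                with_tag[l][votes] += 1
            else:
                without_tag[l][votes] += 1
    return {"X": X, "Y": Y, "k": k, "s": s, "prior": prior,
            "with": with_tag, "without": without_tag}

def mlknn_predict(model, x):
    """Return the predicted 0/1 tag vector for one feature vector x."""
    k, s = model["k"], model["s"]
    neighbors = _nearest(x, model["X"], k)
    predicted = []
    for l, p in enumerate(model["prior"]):
        votes = sum(model["Y"][t][l] for t in neighbors)
        like_with = (s + model["with"][l][votes]) / (s * (k + 1) + sum(model["with"][l]))
        like_without = (s + model["without"][l][votes]) / (s * (k + 1) + sum(model["without"][l]))
        predicted.append(1 if p * like_with > (1 - p) * like_without else 0)
    return predicted
```

In practice the columns of `X` would be restricted to the feature subset returned by step S27 before calling `mlknn_fit`.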

In bioinformatics research, yeast gene expression data is often used to verify the practical effect of theoretical models and algorithms.

In one embodiment, the gene expression data set is the yeast data set, a typical multi-tag gene expression data set comprising microarray expression data and phylogenetic profiles of 2417 yeast genes. Each gene is annotated with a subset of the 14 top-level functional categories (e.g., metabolism, energy, etc.) of the functional catalogue. In order to test the performance of the proposed method, four different levels of missing values are set in each feature (the class attributes of the data set are excluded): 5%, 10%, 15% and 20%, generating data sets with missing values.

Hamming Loss, Average Precision, Coverage and Ranking Loss are used as evaluation criteria for the classification model:

1) Hamming Loss:

2) Average Precision:

3) Coverage:

4) Ranking Loss:
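The formulas of the four criteria are not reproduced in this text. For orientation only, the definitions commonly used in the multi-label literature are given below; they are stated under assumed notation and are not confirmed to be identical to the formulas of the application. Here $p$ is the number of test instances $x_1, \ldots, x_p$, $L_i \subseteq Y$ is the true tag set of $x_i$ and $\bar{L}_i$ its complement in $Y$, $h(x_i)$ is the predicted tag set, $f(x_i, y)$ is the predicted score of tag $y$, $\operatorname{rank}_f(x_i, y)$ is the rank of $y$ when the tags are sorted by descending score, and $\triangle$ is the symmetric difference.

```latex
\begin{align}
\text{Hamming Loss} &= \frac{1}{p}\sum_{i=1}^{p}\frac{1}{q}\,
  \bigl|h(x_i)\,\triangle\,L_i\bigr| \\
\text{Average Precision} &= \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|L_i|}
  \sum_{y\in L_i}\frac{\bigl|\{y'\in L_i :
  \operatorname{rank}_f(x_i,y')\le\operatorname{rank}_f(x_i,y)\}\bigr|}
  {\operatorname{rank}_f(x_i,y)} \\
\text{Coverage} &= \frac{1}{p}\sum_{i=1}^{p}
  \max_{y\in L_i}\operatorname{rank}_f(x_i,y) - 1 \\
\text{Ranking Loss} &= \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|L_i|\,|\bar{L}_i|}\,
  \bigl|\{(y',y'')\in L_i\times\bar{L}_i :
  f(x_i,y')\le f(x_i,y'')\}\bigr|
\end{align}
```

Under these definitions, Average Precision lies in $(0, 1]$ with larger values being better, while the other three are losses for which smaller values are better.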

The steps of the multi-tag feature selection process are carried out according to the overall flow chart; after the process ends, the feature subset MB is returned, and an MLKNN classifier is then trained on the feature subset MB to obtain the model MLKNN_MB. In the comparison experiment, the MLKNN model is trained directly on the training set Train using the original data without feature selection, to obtain the model MLKNN_train. The test set Test is then substituted to obtain the four indexes of the MLKNN_train model. The above data are summarized in Table 1 below:

Table 1: Comparison of the four indexes between the feature subset MB and the data set with all features

In the table, a larger Average Precision is better, while smaller Hamming Loss, Coverage and Ranking Loss are better. The experimental results show that the MLKNN_MB classifier is better than the MLKNN_train classifier on every index. This shows that the proposed gene classification method based on multi-tag feature selection can effectively improve classification accuracy.

The application provides a gene classification method based on multi-tag feature selection. The method initializes the data, performs causal feature selection on multi-tag data with missing values by learning the causal structure of each class tag, searches for the multi-tag MB, obtains the extended-MB, and integrates data interpolation and multi-tag MB learning into a unified framework so that the two modules cooperate with each other. MB learning helps to interpolate the missing data in the potentially causal features, while data interpolation provides an accurately filled data set that supports reliable MB learning. The gene classification is then completed based on the feature subset obtained by feature selection, which improves the accuracy of the gene classification and reduces the computational load of high-dimensional gene classification.

FIG. 4 is a diagram of a gene classification system based on multi-tag feature selection according to an embodiment of the present application. As shown in fig. 4, the gene classification system based on multi-tag feature selection comprises a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module.

The gene classification system based on multi-tag feature selection described above may be implemented in the form of a computer program that is executable on a computer device.

The computer device may be a server, where the server may be a stand-alone server, or may be a server cluster formed by a plurality of servers.

The computer device includes a processor, memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.

The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform a method of gene classification based on multi-tag feature selection.

The processor is used to provide computing and control capabilities to support the operation of the entire computer device.

The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by a processor, causes the processor to perform a method of gene classification based on multi-tag feature selection.

The network interface is for network communication with other devices. It will be appreciated by persons skilled in the art that the computer device structures described above are merely partial structures relevant to the present inventive arrangements and do not constitute a limitation of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.

Wherein the processor is configured to run a computer program stored in a memory, the program implementing the gene classification method based on multi-tag feature selection as described in embodiment one.

It should be appreciated that in embodiments of the present application, the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.

The application also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program, wherein the computer program when executed by a processor causes the processor to perform a method of gene classification based on multi-tag feature selection as described in embodiment one.

The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.

The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The integrated unit may be stored in a storage medium if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application.

Note that the above is only a preferred embodiment of the present application and the technical principle applied. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, while the application has been described in connection with the above embodiments, the application is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the application, which is set forth in the following claims.

Claims

1. A gene classification method based on multi-tag feature selection is characterized by comprising the following steps:

S1, constructing a multi-label feature selection model based on missing data;

The step S2 specifically comprises the following steps:

S22, gene data interpolation;

2. The method of claim 1, wherein in the gene expression data set, $F = \{F_1, F_2, \ldots, F_m\}$ denotes the m-dimensional feature set, $Y = \{Y_1, Y_2, \ldots, Y_q\}$ denotes the q-dimensional tag set, $V = F \cup Y$, and $S$ is any set of variables within $V$; $V_i \perp V_j \mid S$ (where $i \neq j$) denotes that $V_i$ is conditionally independent of $V_j$ given $S$; $V \setminus Y_i$ and $V / Y_i$ are equivalent and denote all variables in $V$ except $Y_i$.

3. The method according to claim 1, wherein the gene data interpolation of step S22 specifically comprises: estimating the missing values by using the observed information of both the incomplete and the complete instances in the gene expression data set, and performing the interpolation of the gene data by the KNN or Lagrange interpolation method.

4. The method according to claim 1, wherein step S23 comprises:

(3) Correcting the obtained false features; specifically, the features in $MB(Y_i)$ are arranged in ascending order of their degree of association with $Y_i$, and the first k2% of features with the weakest association are selected and stored in SelFea($Y_i$); the q tags are traversed, and by detecting whether the MB of each feature in SelFea($Y_i$) contains $Y_i$, the false MB features are removed from SelFea($Y_i$).

5. The method of claim 1, wherein the gene classification model employs MLKNN.

6. A gene classification system based on multi-tag feature selection, wherein the gene classification system performs the gene classification method based on multi-tag feature selection of claim 1, comprising: the system comprises a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module;

7. A computer device, characterized in that the device comprises a memory and a processor, the memory having stored thereon a computer program, which when executed by the processor implements the method according to any of claims 1 to 5.