CN116779044A - Gene classification method, system and equipment based on multi-tag feature selection - Google Patents
- Publication number: CN116779044A (application CN202310810555.1A)
- Authority: CN (China)
- Prior art keywords: gene, tag, feature selection, data set, data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G16B40/20 — Supervised data analysis (under G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding)
- G06F18/15 — Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
- G06F18/211 — Selection of the most significant subset of features
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24147 — Distances to closest patterns, e.g. nearest neighbour classification
- G06F18/2431 — Classification techniques relating to the number of classes: multiple classes
- G06F18/295 — Markov models or related models, e.g. semi-Markov models; Markov random fields; networks embedding Markov models
- G16B25/10 — Gene or protein expression profiling; expression-ratio estimation or normalisation
Abstract
The application provides a gene classification method, system and device based on multi-tag feature selection, wherein the method comprises the following steps: acquiring a gene expression data set, and performing feature selection on it with a multi-tag feature selection model for missing gene data to obtain a feature subset of the gene expression data set; training a gene classification model with the feature subset of the gene expression data set to obtain a trained gene classification model; and classifying genes to be classified with the trained gene classification model, based on the feature subset of the gene data to be classified, to obtain the labels of the gene data to be classified. The gene classification method based on multi-tag feature selection completes the feature selection of multi-tag gene data in the presence of missing data, considers all dependency relations between gene tags and features, and completes the gene classification on the selected feature subset, thereby fundamentally addressing the low accuracy of existing multi-tag gene classification.
Description
Technical Field
The application belongs to the field of bioinformatics, and particularly relates to a gene classification method, system and equipment based on multi-tag feature selection.
Background
With the development of DNA microarray technology, a huge amount of gene expression data has been generated. These data contain rich gene activity information, and analyzing the hidden patterns in them is of great significance for understanding and inferring biological gene functions, studying gene regulatory mechanisms, and so on. How to effectively analyze massive amounts of gene expression data has become an important challenge in the field of bioinformatics.
Currently, gene classification is a very active research area. Gene expression data are generally in matrix form and are characterized by high dimensionality, small sample sizes and multiple labels; the defining property of multi-tag gene data is that one sample can be associated with several tags simultaneously. Multi-tag data contains three kinds of variable relationships, namely tag-to-tag, feature-to-feature, and tag-to-feature correlations. High-dimensional gene data with many redundant features significantly increases the computational burden of multi-tag gene data classification and also leads to overfitting and performance degradation of the gene classification, greatly reducing the accuracy of the classification results. Meanwhile, existing multi-tag gene feature selection methods pay insufficient attention to the role of the gene tags: they ignore the relations inside the tags, rarely reveal the potential causal mechanism of the gene tags, study the feature–tag, tag–tag or feature–feature correlations in isolation, can rarely process all three kinds of correlations at the same time, and ignore the mutual influence between features. As a result, even after feature selection with prior-art multi-tag methods, the accuracy of gene classification remains low.
Therefore, how to perform multi-tag feature selection of gene data and improve the accuracy of gene classification is a problem to be solved in the art.
Disclosure of Invention
The application aims to overcome the defects of the prior art and provides a gene classification method, system and equipment based on multi-tag feature selection. The gene classification method disclosed by the application completes the feature selection of multi-tag gene data in the presence of missing data, takes the causal relationships between gene tags into account, and alternates the missing-value interpolation process and the multi-tag MB (Markov blanket) learning process so that the two mutually reinforce each other.
In order to achieve the above purpose, the present application adopts the following technical scheme:
the application provides a gene classification method based on multi-tag feature selection, which is characterized by comprising the following steps:
s1, constructing a multi-tag feature selection model based on missing gene data;
s2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the step S2 specifically comprises the following steps:
s21, acquiring a gene expression data set, wherein the gene expression data set contains m-dimensional characteristics and q-dimensional labels;
s22, gene data interpolation;
s23, searching for a multi-label Markov blanket MB, and learning the MB of the class variable by adopting a multi-label causal feature selection learning method MLMB;
S24, acquiring the extended-MB; the extended-MB consists of the MB of the class variables and the union of the MBs of each variable in that MB;
s25, updating a data set; updating the data set to a new data set containing only extended-MBs;
s26, judging whether iteration termination conditions are met, stopping if the iteration termination conditions are met, and continuously repeating the steps S22-S25 if the iteration termination conditions are not met;
s27, returning the feature subset of the gene expression dataset after iteration;
s3, training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
s4, obtaining gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining a label of the gene data to be classified.
Further, in the gene expression dataset, F = {F1, F2, ..., Fm} is the m-dimensional feature set and Y = {Y1, Y2, ..., Yq} is the q-dimensional tag set; V = F ∪ Y, and S is any subset of variables within V. The notation Vi ⊥ Vj | S (where i ≠ j) indicates that Vi is conditionally independent of Vj given S; V\Yi and V/Yi are equivalent and denote all variables in V except Yi.
Further, the gene data interpolation in step S22 specifically includes: estimating the missing values by using the observed information of the incomplete and complete instances in the gene expression data set, and performing the interpolation of the gene data by a KNN or Lagrange interpolation method.
Further, step S23 includes:
(1) Mining the causal mechanism of each gene tag: for each tag Yi, learn its MB from all features and all tags except Yi, i.e. find the MB of Yi from V\Yi; the result is denoted MB(Yi);
(2) Detecting features that are ignored due to tag relevance: specifically, from F\MB(Yi), detect features that are ignored because of strong tag correlation;
(3) Correcting the obtained false features: specifically, the features in MB(Yi) are arranged in ascending order of their degree of association with Yi, and the first k2% features with the weakest association are selected and stored in SelFea(Yi); the q tags are traversed, and by detecting whether the MB of each feature in SelFea(Yi) contains Yi, the false MB features are removed from SelFea(Yi).
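The conditional-independence test that underlies MB learning is not specified in detail here. As an illustration only, a Fisher-z partial-correlation CI test — a common primitive in constraint-based causal discovery, not necessarily the one used by MLMB — could be sketched as follows (all function names are illustrative assumptions):

```python
import numpy as np
from math import sqrt, log, erfc

def partial_corr(x, y, Z):
    """Partial correlation of x and y given the columns of Z (Z may have 0 columns)."""
    x = x - x.mean()
    y = y - y.mean()
    if Z.shape[1] == 0:
        return np.corrcoef(x, y)[0, 1]
    Z = Z - Z.mean(axis=0)
    # Regress x and y on Z, then correlate the residuals.
    coef_x, *_ = np.linalg.lstsq(Z, x, rcond=None)
    coef_y, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.corrcoef(x - Z @ coef_x, y - Z @ coef_y)[0, 1]

def ci_test(x, y, Z, alpha=0.05):
    """Fisher-z test: True if x is judged conditionally independent of y given Z."""
    n, k = len(x), Z.shape[1]
    r = float(np.clip(partial_corr(x, y, Z), -0.999999, 0.999999))
    stat = 0.5 * log((1 + r) / (1 - r)) * sqrt(n - k - 3)
    p_value = erfc(abs(stat) / sqrt(2))  # two-sided tail of the standard normal
    return p_value > alpha
```

In a chain x → z → y, this test finds x and y dependent marginally but independent given z, which is exactly the kind of judgement MB learning relies on.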
Further, the gene classification model adopts MLKNN.
The application also provides a gene classification system based on multi-tag feature selection, which is characterized in that the gene classification system executes the gene classification method based on multi-tag feature selection, and the method comprises the following steps: the system comprises a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module;
the multi-tag feature selection model construction module is used for constructing a multi-tag feature selection model based on the missing gene data;
the feature selection module is used for acquiring a gene expression data set, and performing feature selection on the gene expression data set by utilizing the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the gene classification model training module is used for training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
the gene classification module is used for acquiring gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.
The application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the method when executing the computer program.
Compared with the prior art, the method has the following beneficial effects: according to the gene classification method based on multi-tag feature selection, feature selection of multi-tag gene data is finished based on missing data, causal relations among gene tags are included, a mutually promoted missing value interpolation process and a multi-tag MB learning process are used for being alternately carried out, and meanwhile all dependency relations among the gene tags and the gene features are considered; and then, the gene classification is finished based on the feature subset after feature selection, so that the accuracy of the gene classification is improved, and the calculation load of the high-dimensional gene classification is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a gene classification method based on multi-tag feature selection according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a multi-tag feature selection model based on missing gene data according to an embodiment of the present application.
Fig. 3 is a schematic flow chart of acquiring a multi-tag MB according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a gene classification system based on multi-tag feature selection according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The application is further described below with reference to the drawings and specific examples, which are not intended to be limiting.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The application discloses a gene classification method based on multi-tag feature selection. As shown in fig. 1, the gene classification method based on the multi-tag feature selection includes the following steps S1 to S4.
S1, constructing a multi-tag feature selection model based on missing gene data;
s2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set; the flow of step S2 is shown in fig. 2;
the step S2 specifically comprises the following steps:
s21, acquiring a gene expression data set, wherein the gene expression data set contains m-dimensional characteristics and q-dimensional labels;
In the gene expression dataset, F = {F1, F2, ..., Fm} is the m-dimensional feature set and Y = {Y1, Y2, ..., Yq} is the q-dimensional tag set; V = F ∪ Y, and S is any subset of variables within V. The notation Vi ⊥ Vj | S (where i ≠ j) indicates that Vi is conditionally independent of Vj given S; V\Yi and V/Yi are equivalent and denote all variables in V except Yi.
S22, gene data interpolation;
Specifically, data interpolation uses the observed information of the incomplete and complete instances in the data set to estimate the missing values, so it enables the multi-tag causal learning method to handle most of the missing values in the data set, especially when the incomplete instances outnumber the complete ones. Data interpolation provides an accurate data set, which underpins the reliability of MB learning.
In one embodiment, the interpolation of the genetic data is performed using KNN or Lagrange interpolation.
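As an illustration, the KNN variant of the interpolation in step S22 could be sketched with a minimal numpy-only imputer (the function name `knn_impute` and the averaging rule are illustrative assumptions, not the patent's implementation; in practice a library imputer such as scikit-learn's `KNNImputer` would typically be used):

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaN entries: for each incomplete row, average the k nearest
    complete rows, with distance computed over the observed columns only."""
    X = np.asarray(X, dtype=float).copy()
    complete = X[~np.isnan(X).any(axis=1)]  # rows with no missing values
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # Euclidean distance to each complete row over the observed coordinates.
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        X[i, miss] = nearest[:, miss].mean(axis=0)
    return X
```

Complete rows are left untouched; only the missing entries of incomplete rows are replaced by neighbour averages.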
S23, searching for a multi-label Markov blanket MB, and learning the MB of the class variable by adopting a multi-label causal feature selection learning method MLMB; the specific flow of step S23 is shown in fig. 3;
Specifically, the MB of each class variable is learned; the MB search is accomplished by the well-established multi-tag causal feature selection algorithm MLMB.
Step S23 includes:
(1) Mining the causal mechanism of each gene tag: for each tag Yi, learn its MB from all features and all tags except Yi, i.e. find the MB of Yi from V\Yi; the result is denoted MB(Yi).
(2) Detecting features that are ignored due to tag relevance: specifically, from F\MB(Yi), detect features that are ignored because of strong tag correlation.
Directly testing every feature for having been ignored would be computationally expensive. To cope with the high dimensionality, the features are arranged in descending order of their degree of association with Yi; the top k1% features with the highest correlation are selected and stored in Sel(Yi), and the missing true MB features are recovered from Sel(Yi).
(3) Correcting the obtained false features: specifically, the features in MB(Yi) are arranged in ascending order of their degree of association with Yi, and the first k2% features, i.e. those with the weakest association with Yi, are stored in SelFea(Yi); the q tags are traversed, and by detecting whether the MB of each feature in SelFea(Yi) contains Yi, the false MB features are removed from SelFea(Yi).
For any tag Yi ∈ Y, MB(Yi) always contains a small fraction of erroneous features; this is why the features in MB(Yi) are sorted in ascending order of association with Yi and the first k2% features with the weakest correlation are stored in SelFea(Yi) for detection.
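Steps (2) and (3) both rank features by their degree of association with Yi and cut at a percentage threshold. A sketch of such a ranking, using empirical mutual information as the association measure (the measure and the helper names `mutual_info` / `top_k_percent` are illustrative assumptions; the patent does not fix a particular association measure here):

```python
import numpy as np

def mutual_info(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                px, py = np.mean(x == xv), np.mean(y == yv)
                mi += pxy * np.log(pxy / (px * py))
    return mi

def top_k_percent(features, label, k_percent):
    """Indices of the k% of feature columns most associated with the label,
    in descending order of mutual information (as in the descending sort of
    step (2); an ascending sort would serve step (3))."""
    scores = [mutual_info(col, label) for col in features.T]
    order = np.argsort(scores)[::-1]          # descending association
    keep = max(1, int(round(len(order) * k_percent / 100)))
    return order[:keep].tolist()
```

A column identical to the label scores ln 2 ≈ 0.693 for a balanced binary label, while an independent or constant column scores 0, so the cut keeps only the informative column.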
S24, acquiring the extended-MB; the extended-MB consists of the MB of the class variables and the union of the MBs of each variable in that MB.
The rationale behind this strategy is that, according to MB feature selection theory, on noisy datasets the extended-MB of the class variables has a higher probability of containing causally informative features than the MB alone, because the missing values have been filled with reasonable values by the data interpolation method.
S25, updating the data set; the data set is updated to a new data set that contains only the extended-MB.
Specifically, the new data set contains only the extended-MB found in step S24; this allows the data interpolation to fill only the missing values of causally informative features rather than of all the features in the data set.
S26, judging whether iteration termination conditions are met, stopping if the iteration termination conditions are met, and continuously repeating the steps S22-S25 if the iteration termination conditions are not met;
in a specific embodiment, the iteration termination condition is that no more changes occur in the data in the extended-MB.
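The alternation of interpolation (S22) and MB learning (S23–S25) under this termination condition can be sketched schematically. In the sketch below, `mean_impute` and the correlation-threshold `select_features` are deliberately simplified stand-ins for the patent's actual interpolation and MLMB procedures; only the loop structure mirrors steps S22–S26:

```python
import numpy as np

def mean_impute(X):
    """Stand-in for step S22: fill NaNs with per-column observed means."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X[idx] = np.take(col_means, idx[1])
    return X

def select_features(X, y, thresh=0.5):
    """Stand-in for steps S23-S24: keep columns whose absolute
    correlation with the label exceeds `thresh`."""
    return [j for j in range(X.shape[1])
            if abs(np.corrcoef(X[:, j], y)[0, 1]) > thresh]

def iterate_until_stable(X, y, max_iter=10):
    """Alternate imputation and selection until the selected column set
    stops changing (the termination condition of step S26)."""
    cols = list(range(X.shape[1]))
    for _ in range(max_iter):
        filled = mean_impute(X[:, cols])   # S22 on the current (shrunk) data set
        keep = select_features(filled, y)  # S23-S24
        new_cols = [cols[j] for j in keep]
        if new_cols == cols:               # S26: no more change -> stop
            break
        cols = new_cols                    # S25: data set now contains only these
    return cols
```

The shrinking of `cols` each round reflects step S25: later imputation rounds only have to fill missing values in the retained, causally informative columns.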
And S27, returning the feature subset of the gene expression data set after the iteration is ended.
In a specific embodiment, the returned optimal feature subset is applied to the verification test set, and the resulting classification accuracy is used to verify the quality of the obtained feature subset of the gene expression dataset.
And S3, training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model.
The gene classification model adopts MLKNN. The feature subset of the gene expression data set is input into an MLKNN model for training; the parameter k of the MLKNN model is set to 10, and the other parameters keep their default values, yielding the MLKNN model optimized for the data set.
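For illustration, a simplified binary-relevance k-nearest-neighbour classifier in the spirit of MLKNN can be sketched as follows. Note this is a deliberately reduced stand-in: real ML-kNN additionally applies a per-label MAP estimate based on neighbourhood label counts, while this sketch only takes a per-label majority vote among the k neighbours:

```python
import numpy as np

class SimpleMLKNN:
    """Minimal binary-relevance kNN multi-label classifier
    (a simplified stand-in for ML-kNN, not the real algorithm)."""
    def __init__(self, k=10):
        self.k = k

    def fit(self, X, Y):
        self.X = np.asarray(X, dtype=float)
        self.Y = np.asarray(Y, dtype=int)   # shape (n_samples, q) of 0/1 labels
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = np.zeros((len(X), self.Y.shape[1]), dtype=int)
        for i, x in enumerate(X):
            d = np.sqrt(((self.X - x) ** 2).sum(axis=1))
            nn = np.argsort(d)[: self.k]
            # Predict each label by majority vote among the k neighbours.
            preds[i] = (self.Y[nn].mean(axis=0) >= 0.5).astype(int)
        return preds
```

Trained on the selected feature subset, such a classifier plays the role of the MLKNN model in steps S3–S4.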
In bioinformatics research, yeast gene expression data are often used to verify the effectiveness of theoretical models and algorithms.
In one embodiment, the gene expression dataset is the yeast dataset, a typical multi-tag gene expression dataset comprising microarray expression data and phylogenetic profiles of 2417 yeast genes. Each gene is annotated with a subset of the 14 top-level functional categories (e.g., Metabolism, Energy, etc.) of the functional catalogue. To test the performance of the proposed method, in addition to the class attributes in the dataset, four different levels of missing values are set in each feature — 5%, 10%, 15% and 20% — generating datasets with missing values.
We use Hamming Loss, Average Precision, Coverage and Ranking Loss as the evaluation criteria for the classification models:
1) Hamming Loss: the fraction of label assignments that are predicted incorrectly;
2) Average Precision: the average fraction of relevant labels ranked above each relevant label;
3) Coverage: how far, on average, one must go down the ranked label list to cover all relevant labels of an example;
4) Ranking Loss: the fraction of label pairs in which an irrelevant label is ranked above a relevant label.
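Three of these criteria can be computed in a few lines of numpy, shown here as a sketch (Average Precision is omitted for brevity, and note that the Coverage convention — 0-based here — varies between authors):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots predicted incorrectly."""
    return float(np.mean(np.asarray(Y_true) != np.asarray(Y_pred)))

def ranking_loss(Y_true, scores):
    """Fraction of (relevant, irrelevant) label pairs ordered wrongly
    by the score vector, averaged over examples."""
    total = 0.0
    for y, s in zip(np.asarray(Y_true), np.asarray(scores)):
        rel, irr = s[y == 1], s[y == 0]
        if len(rel) == 0 or len(irr) == 0:
            continue
        bad = sum((r < i) + 0.5 * (r == i) for r in rel for i in irr)
        total += bad / (len(rel) * len(irr))
    return total / len(Y_true)

def coverage(Y_true, scores):
    """Average depth (0-based rank) of the ranked label list needed to
    cover all relevant labels of an example."""
    covs = []
    for y, s in zip(np.asarray(Y_true), np.asarray(scores)):
        order = np.argsort(-s)                       # labels by descending score
        ranks = np.empty_like(order)
        ranks[order] = np.arange(len(s))             # rank of each label
        covs.append(ranks[y == 1].max())
    return float(np.mean(covs))
```

For a perfectly ranked example all three losses are 0; a fully inverted ranking drives Ranking Loss to 1 and Coverage to the bottom of the list.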
and (3) carrying out steps of a multi-label feature selection process diagram according to the whole flow chart, returning to the feature subset MB required by the user after the flow is finished, and finally training an MLKNN classifier model by the feature subset MB to obtain a model MLKNN_MB. In a comparison experiment, the MLKNN model is directly trained by using the training set Train by using the original data without feature selection, and the model MKNN_train is obtained. Substituting the Test set Test to obtain four indexes of the MLKNN_train model. The above data are aggregated as shown in table 1 below:
table 1: feature subset MB is compared with four indexes of all feature data sets scene
The larger the index Average Precision in the table, the better, and the smaller the index Hamming Los, coverage and Ranking Loss. From the experimental results, the MLKNN_MB classifier is better than the MLKNN_train classifier in various indexes. This shows that we propose a gene classification method based on multi-tag feature selection, which can effectively improve classification accuracy.
S4, obtaining gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining a label of the gene data to be classified.
The application provides a gene classification method based on multi-tag feature selection. After the data are initialized, causal feature selection is performed on multi-tag missing data by learning the causal structure of each class tag, searching for the multi-tag MB and obtaining the extended-MB; data interpolation and multi-tag MB learning are integrated into a unified framework so that the two modules cooperate: MB learning helps to interpolate the missing data in potentially causal features, while data interpolation provides an accurately interpolated data set that makes MB learning reliable. Gene classification is then completed on the feature subset after feature selection, which improves the accuracy of gene classification and reduces the computational burden of high-dimensional gene classification.
FIG. 4 is a diagram of a gene classification system based on multi-tag feature selection according to an embodiment of the present application. As shown in fig. 4, the gene classification system based on multi-tag feature selection comprises a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module;
the multi-tag feature selection model construction module is used for constructing a multi-tag feature selection model based on the missing gene data;
the feature selection module is used for acquiring a gene expression data set, and performing feature selection on the gene expression data set by utilizing the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the gene classification model training module is used for training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
the gene classification module is used for acquiring gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.
The gene classification system based on multi-tag feature selection described above may be implemented in the form of a computer program that is executable on a computer device.
The computer device may be a server, where the server may be a stand-alone server, or may be a server cluster formed by a plurality of servers.
The computer device includes a processor, memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform a method of gene classification based on multi-tag feature selection.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by a processor, causes the processor to perform a method of gene classification based on multi-tag feature selection.
The network interface is for network communication with other devices. It will be appreciated by persons skilled in the art that the computer device structures described above are merely partial structures relevant to the present inventive arrangements and do not constitute a limitation of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
Wherein the processor is configured to run a computer program stored in a memory, the program implementing the gene classification method based on multi-tag feature selection as described in embodiment one.
It should be appreciated that in embodiments of the present application, the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
Those skilled in the art will appreciate that all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program comprises program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in a computer system to implement the steps of the method embodiments described above.
The application also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program, wherein the computer program when executed by a processor causes the processor to perform a method of gene classification based on multi-tag feature selection as described in embodiment one.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality differently for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The steps in the methods of the embodiments of the present application may be reordered, combined, and deleted according to actual needs. The units in the devices of the embodiments of the present application may likewise be combined, divided, and deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on this understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application.
It should be noted that the above is only a preferred embodiment of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the application. Therefore, although the application has been described in detail through the above embodiments, it is not limited to them and may encompass other equivalent embodiments without departing from its concept; the scope of the application is determined by the appended claims.
Claims (7)
1. A gene classification method based on multi-tag feature selection is characterized by comprising the following steps:
s1, constructing a multi-tag feature selection model based on missing gene data;
s2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the step S2 specifically comprises the following steps:
s21, acquiring a gene expression data set, wherein the gene expression data set contains m-dimensional characteristics and q-dimensional labels;
s22, gene data interpolation;
s23, searching for a multi-label Markov blanket MB, and learning the MB of the class variable by adopting a multi-label causal feature selection learning method MLMB;
s24, acquiring extended-MB; the extended-MB consists of MB of class variables and the union of MB of each variable in the class variables;
s25, updating a data set; updating the data set to a new data set containing only extended-MBs;
s26, judging whether the iteration termination conditions are met; if they are met, stopping; if they are not met, repeating the steps S22-S25;
s27, returning the feature subset of the gene expression dataset after iteration;
s3, training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
s4, obtaining gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining a label of the gene data to be classified.
2. The method of claim 1, wherein in the gene expression dataset, F = {F_1, F_2, ..., F_m} is the m-dimensional feature set, Y = {Y_1, Y_2, ..., Y_q} is the q-dimensional tag set, V = F ∪ Y, and S is any set of variables within V; V_i ⫫ V_j | S (where i ≠ j) denotes that V_i is conditionally independent of V_j given S; V\Y_i and V/Y_i are equivalent, denoting all variables in V except Y_i.
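The conditional-independence relation V_i ⫫ V_j | S underlying the Markov-blanket search can be tested, for continuous expression data, with a partial-correlation Fisher-z test. This is a minimal sketch under a Gaussian assumption; the claim does not prescribe a particular test, so the function below is illustrative.

```python
import numpy as np
from scipy import stats

def cond_indep_test(data, i, j, cond, alpha=0.05):
    """Fisher-z test of V_i independent of V_j given S on Gaussian data.

    data: (n, p) array; i, j: column indices; cond: list of
    conditioning column indices (the set S).
    Returns True when independence cannot be rejected at level alpha.
    """
    n = data.shape[0]
    sub = data[:, [i, j] + list(cond)]
    corr = np.corrcoef(sub, rowvar=False)
    prec = np.linalg.pinv(corr)          # precision matrix
    # Partial correlation of the first two variables given the rest.
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r))  # Fisher z-transform
    se = 1.0 / np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z) / se))
    return p_value > alpha
```

For example, with x and y both driven by a common cause z, the test should reject marginal independence of x and y but accept their independence given z.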
3. The method according to claim 1, wherein the gene data interpolation of step S22 specifically comprises: estimating the missing values by using the observation information of the incomplete and complete examples in the gene expression data set, and performing interpolation of the gene data by a KNN or Lagrange interpolation method.
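The KNN variant of this interpolation step can be sketched with scikit-learn's `KNNImputer`, which fills each missing entry from the nearest samples under a NaN-aware Euclidean distance. The toy matrix below is an assumption for illustration, not data from the disclosure.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy gene-expression matrix: rows are samples, columns are genes;
# np.nan marks missing measurements.
X = np.array([[1.0, 2.0, np.nan],
              [1.2, np.nan, 3.1],
              [0.9, 2.1, 3.0],
              [1.1, 1.9, 2.9]])

# Estimate each missing value from its 2 nearest samples, as in the
# KNN branch of the interpolation step.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Observed entries pass through unchanged; only the NaN positions are replaced by neighbour averages.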
4. The method according to claim 1, wherein step S23 comprises:
(1) Mining the causal mechanism of each gene tag: learn the MB of each class tag Y_i from all features and all tags except Y_i, i.e., find the MB of Y_i from V\Y_i; the result is denoted MB(Y_i);
(2) Detecting features that are ignored due to tag relevance: specifically, detect from F\MB(Y_i) the features that are ignored due to strong tag correlation;
(3) Correcting the obtained false features: specifically, sort the features of MB(Y_i) in ascending order of their degree of association with Y_i, select the first k2% of features with the weakest association, and store them in SelFea(Y_i); traverse the q tags, and remove false MB features from SelFea(Y_i) by detecting whether the MB of each feature in SelFea(Y_i) contains Y_i.
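The ranking-and-selection part of step (3) reduces to an ascending sort by association strength followed by taking the weakest k2 percent. The sketch below uses mutual information as the association measure, which is an illustrative choice; the claim does not fix the measure, and `weakest_features` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def weakest_features(X_mb, y_i, k2=20, random_state=0):
    """Rank the candidate MB features of one tag Y_i by association
    (mutual information here) and return the indices of the weakest
    k2 percent, mirroring the sort in step (3)."""
    assoc = mutual_info_classif(X_mb, y_i, random_state=random_state)
    order = np.argsort(assoc)                  # ascending: weakest first
    n_sel = max(1, int(np.ceil(len(order) * k2 / 100)))
    return order[:n_sel]
```

A strongly associated feature should never land in the weakest-k2% set, which is what the subsequent MB-containment check then prunes.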
5. The method of claim 1, wherein the gene classification model employs MLKNN.
6. A gene classification system based on multi-tag feature selection, wherein the gene classification system performs the gene classification method based on multi-tag feature selection of claim 1, comprising: the system comprises a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module;
the multi-tag feature selection model construction module is used for constructing a multi-tag feature selection model based on the missing gene data;
the feature selection module is used for acquiring a gene expression data set, and performing feature selection on the gene expression data set by utilizing the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the gene classification model training module is used for training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
the gene classification module is used for acquiring gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.
7. A computer device, characterized in that the device comprises a memory and a processor, the memory having stored thereon a computer program, which when executed by the processor implements the method according to any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310810555.1A CN116779044A (en) | 2023-07-04 | 2023-07-04 | Gene classification method, system and equipment based on multi-tag feature selection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116779044A true CN116779044A (en) | 2023-09-19 |
Family
ID=87989388
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310810555.1A Pending CN116779044A (en) | 2023-07-04 | 2023-07-04 | Gene classification method, system and equipment based on multi-tag feature selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116779044A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118517383A (en) * | 2024-07-22 | 2024-08-20 | 国网上海市电力公司 | Deep learning-based intelligent detection method and equipment for running risk of wind turbine generator |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114265943A (en) * | 2021-12-24 | 2022-04-01 | 吉林大学 | Causal relationship event pair extraction method and system |
CN116364274A (en) * | 2023-03-16 | 2023-06-30 | 山西医科大学 | Disease prediction method and system based on causal inference and dynamic integration of multiple labels |
Non-Patent Citations (2)
Title |
---|
KUI YU ET AL.: "Causal Feature Selection with Missing Data", ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, vol. 16, no. 4, 31 January 2022 (2022-01-31), pages 1 - 9 * |
XINGYU WU ET AL.: "Multi-Label Causal Feature Selection", THE THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-20), 31 December 2020 (2020-12-31), pages 6430 - 6435 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862173B (en) | Virtual screening method and device for lead compound | |
CN108334574B (en) | Cross-modal retrieval method based on collaborative matrix decomposition | |
Liu et al. | Incdet: In defense of elastic weight consolidation for incremental object detection | |
Yu et al. | Protein function prediction using multilabel ensemble classification | |
US20210020266A1 (en) | Phase-aware determination of identity-by-descent dna segments | |
AU2019231255A1 (en) | Systems and methods for spatial graph convolutions with applications to drug discovery and molecular simulation | |
US11615324B2 (en) | System and method for de novo drug discovery | |
US11256995B1 (en) | System and method for prediction of protein-ligand bioactivity using point-cloud machine learning | |
WO2013067461A2 (en) | Identifying associations in data | |
CN102214302A (en) | Recognition device, recognition method, and program | |
CN109637579B (en) | Tensor random walk-based key protein identification method | |
Bicego et al. | A bioinformatics approach to 2D shape classification | |
Liu et al. | EACP: An effective automatic channel pruning for neural networks | |
Bi et al. | High-dimensional supervised feature selection via optimized kernel mutual information | |
Brinda | Novel computational techniques for mapping and classification of Next-Generation Sequencing data | |
JP2022548960A (en) | Single-cell RNA-SEQ data processing | |
Zeng et al. | A novel HMM-based clustering algorithm for the analysis of gene expression time-course data | |
CN116779044A (en) | Gene classification method, system and equipment based on multi-tag feature selection | |
Liu et al. | Todynet: temporal dynamic graph neural network for multivariate time series classification | |
Shiga et al. | A variational bayesian framework for clustering with multiple graphs | |
Mestres et al. | Selection of the regularization parameter in graphical models using network characteristics | |
CN117349494A (en) | Graph classification method, system, medium and equipment for space graph convolution neural network | |
CN116383441A (en) | Community detection method, device, computer equipment and storage medium | |
US11367006B1 (en) | Toxic substructure extraction using clustering and scaffold extraction | |
Xu et al. | A structure-induced framework for multi-label feature selection with highly incomplete labels |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||