CN116779044A - Gene classification method, system and equipment based on multi-tag feature selection - Google Patents

Gene classification method, system and equipment based on multi-tag feature selection

Info

Publication number
CN116779044A
Authority
CN
China
Prior art keywords
gene
tag
feature selection
data set
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310810555.1A
Other languages
Chinese (zh)
Inventor
吴全旺
李秀先
张智勇
曾洁
周鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
China Merchants Testing Vehicle Technology Research Institute Co Ltd
Original Assignee
Chongqing University
China Merchants Testing Vehicle Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University, China Merchants Testing Vehicle Technology Research Institute Co Ltd
Priority to CN202310810555.1A
Publication of CN116779044A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G06F18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Genetics & Genomics (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a gene classification method, system and device based on multi-tag feature selection. The method comprises the following steps: acquiring a gene expression data set, and performing feature selection on the gene expression data set by using a multi-tag feature selection model based on missing gene data to obtain a feature subset of the gene expression data set; training a gene classification model by using the feature subset of the gene expression data set to obtain a trained gene classification model; and classifying the gene data to be classified by using the trained gene classification model, based on the feature subset of the gene data to be classified, to obtain the labels of the gene data to be classified. The method completes the feature selection of multi-tag gene data in the presence of missing data, considers all the dependency relations between the gene tags and the features, and completes the gene classification based on the feature subset obtained by feature selection, thereby addressing the low accuracy of existing multi-tag gene classification.

Description

Gene classification method, system and equipment based on multi-tag feature selection
Technical Field
The application belongs to the field of bioinformatics, and particularly relates to a gene classification method, system and equipment based on multi-tag feature selection.
Background
With the development of DNA microarray technology, a huge amount of gene expression data has been generated. Gene expression data contains rich information on gene activity, and analyzing the hidden patterns in it is of great significance for understanding and inferring gene functions, studying gene regulation mechanisms, and the like. How to analyze massive amounts of gene expression data effectively has become an important challenge in the field of bioinformatics.
Currently, gene classification is a very active area of research. Gene expression data is generally in matrix form and is characterized by high dimensionality, small sample sizes and multiple labels; the main characteristic of multi-tag gene data is that one sample can be associated with several labels at the same time. Multi-tag data involves three kinds of variable relationships, namely tag-to-tag, feature-to-feature, and tag-to-feature correlation. High-dimensional gene data with a large number of redundant features significantly increases the computational burden of multi-tag gene data classification, and also leads to overfitting and performance degradation, so that the accuracy of the gene classification result is greatly reduced. Meanwhile, existing multi-tag gene feature selection methods pay insufficient attention to the role of gene tags: they ignore the relationships among the tags and rarely reveal the underlying causal mechanism of the gene tags. They study the correlation between features and tags, between tags, or between features in isolation, can rarely handle all three kinds of correlation at the same time, and ignore the mutual influence between features. As a result, the accuracy of gene classification remains low even after existing multi-tag feature selection methods are applied.
Therefore, how to perform multi-tag feature selection of gene data and improve the accuracy of gene classification is a problem to be solved in the art.
Disclosure of Invention
The application aims to overcome the defects of the prior art and provides a gene classification method, system and equipment based on multi-tag feature selection. The disclosed gene classification method completes the feature selection of multi-tag gene data in the presence of missing data, takes the causal relationships between gene tags into account, and alternates the missing-value interpolation process with the multi-tag MB learning process so that the two promote each other.
In order to achieve the above purpose, the present application adopts the following technical scheme:
The application provides a gene classification method based on multi-tag feature selection, characterized by comprising the following steps:
S1, constructing a multi-tag feature selection model based on missing gene data;
S2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the step S2 specifically comprises the following steps:
S21, acquiring a gene expression data set, wherein the gene expression data set contains m-dimensional features and q-dimensional labels;
S22, gene data interpolation;
S23, searching for the multi-label Markov blanket (MB), and learning the MB of the class variables by adopting the multi-label causal feature selection learning method MLMB;
S24, acquiring the extended-MB; the extended-MB consists of the MB of the class variables and the union of the MB of each variable in the class variables;
S25, updating the data set; updating the data set to a new data set containing only the extended-MB;
S26, judging whether the iteration termination condition is met, stopping if it is met, and repeating steps S22-S25 if it is not met;
S27, returning the feature subset of the gene expression data set after the iteration ends;
S3, training the gene classification model by using the feature subset of the gene expression data set to obtain a trained gene classification model;
S4, obtaining gene data to be classified, classifying the gene data to be classified by using the trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.
Further, in the gene expression dataset, $F = \{F_1, F_2, \ldots, F_m\}$ is the m-dimensional feature set, $Y = \{Y_1, Y_2, \ldots, Y_q\}$ is the q-dimensional tag set, $V = F \cup Y$, and $S$ is any set of variables within $V$; $V_i \perp V_j \mid S$ (where $i \neq j$) denotes that $V_i$ is conditionally independent of $V_j$ given $S$; $V \setminus Y_i$ and $V / Y_i$ are equivalent, and denote all variables in $V$ except $Y_i$.
Further, the gene data interpolation in step S22 specifically includes: estimating the missing values by using the observed information of the incomplete instances and the complete instances in the gene expression data set, and performing interpolation of the gene data by adopting the KNN or Lagrange interpolation method.
Further, step S23 includes:
(1) Mining the causal mechanism of each gene tag; the MB of each class tag $Y_i$ is learned from all features and all tags except $Y_i$, i.e., the MB of $Y_i$ is found in $V \setminus Y_i$, and the result is denoted $MB(Y_i)$;
(2) Detecting features that are ignored due to tag correlation; specifically, features that are ignored due to strong tag correlation are detected from $F \setminus MB(Y_i)$;
(3) Correcting the false features obtained; specifically, the features in $MB(Y_i)$ are arranged in ascending order of their degree of association with $Y_i$, and the first k2% of features, i.e., those with the weakest association, are selected and stored in $SelFea(Y_i)$; the q tags are traversed, and the false MB features are removed from $SelFea(Y_i)$ by checking whether the MB of each feature in $SelFea(Y_i)$ contains $Y_i$.
Further, the gene classification model adopts MLKNN.
The application also provides a gene classification system based on multi-tag feature selection, characterized in that the gene classification system executes the above gene classification method based on multi-tag feature selection, and the system comprises: a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module;
the multi-tag feature selection model construction module is used for constructing a multi-tag feature selection model based on the missing gene data;
the feature selection module is used for acquiring a gene expression data set, and performing feature selection on the gene expression data set by utilizing the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the gene classification model training module is used for training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
the gene classification module is used for acquiring gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.
The application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the method when executing the computer program.
Compared with the prior art, the application has the following beneficial effects: the gene classification method based on multi-tag feature selection completes the feature selection of multi-tag gene data in the presence of missing data, takes the causal relations among gene tags into account, alternates a missing-value interpolation process and a multi-tag MB learning process that promote each other, and considers all dependency relations between the gene tags and the gene features; the gene classification is then completed based on the feature subset obtained by feature selection, which improves the accuracy of the gene classification and reduces the computational load of high-dimensional gene classification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a gene classification method based on multi-tag feature selection according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a multi-tag feature selection model based on missing gene data according to an embodiment of the present application.
Fig. 3 is a schematic flow chart of acquiring a multi-tag MB according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a gene classification system based on multi-tag feature selection according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The application is further described below with reference to the drawings and specific examples, which are not intended to be limiting.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The application discloses a gene classification method based on multi-tag feature selection. As shown in fig. 1, the gene classification method based on the multi-tag feature selection includes the following steps S1 to S4.
S1, constructing a multi-tag feature selection model based on missing gene data;
S2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set; the flow of step S2 is shown in fig. 2;
The step S2 specifically comprises the following steps:
S21, acquiring a gene expression data set, wherein the gene expression data set contains m-dimensional features and q-dimensional labels;
In the gene expression dataset, $F = \{F_1, F_2, \ldots, F_m\}$ is the m-dimensional feature set, $Y = \{Y_1, Y_2, \ldots, Y_q\}$ is the q-dimensional tag set, $V = F \cup Y$, and $S$ is any set of variables within $V$; $V_i \perp V_j \mid S$ (where $i \neq j$) denotes that $V_i$ is conditionally independent of $V_j$ given $S$; $V \setminus Y_i$ and $V / Y_i$ are equivalent, and denote all variables in $V$ except $Y_i$.
S22, gene data interpolation;
Specifically, data interpolation estimates the missing values from the observed information of both the incomplete and the complete instances in the data set, which enables the multi-tag causal learning method to handle most of the missing values in the data set, especially when there are more incomplete instances than complete ones. Data interpolation thus provides an accurate data set on which MB learning can rely.
In one embodiment, the interpolation of the gene data is performed using KNN or Lagrange interpolation.
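As an illustration of the KNN option, the following is a minimal sketch and not the applicant's implementation. The function name `knn_impute`, the use of `None` to mark a missing value, the default `k`, and the distance convention (Euclidean distance over the features observed in both rows, normalised by the number of shared features) are all assumptions made for this example.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows that observe that feature.

    Hypothetical helper: distances use only the features observed in both rows.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common observed feature, cannot compare
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for t, other in enumerate(rows)
                      if t != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap in place if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

Because both complete and incomplete rows can act as donors, this sketch uses the observed information of all instances, in line with the description above.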
S23, searching for a multi-label Markov blanket MB, and learning the MB of the class variable by adopting a multi-label causal feature selection learning method MLMB; the specific flow of step S23 is shown in fig. 3;
Specifically, the MB of each class variable is learned; the MB search is carried out by the sophisticated multi-tag causal feature selection algorithm MLMB.
Step S23 includes:
(1) Mining the causal mechanism of each gene tag; the MB of each class tag $Y_i$ is learned from all features and all tags except $Y_i$, i.e., the MB of $Y_i$ is found in $V \setminus Y_i$, and the result is denoted $MB(Y_i)$;
(2) Detecting features that are ignored due to tag correlation; specifically, features that are ignored due to strong tag correlation are detected from $F \setminus MB(Y_i)$.
Directly testing every feature for whether it has been ignored would be computationally expensive. To cope with the high dimensionality, the features are arranged in descending order of their degree of association with $Y_i$, the top k1% of features with the highest correlation are selected and stored in $sel(Y_i)$, and the missing true MB features are then selectively recovered from $sel(Y_i)$.
(3) Correcting the false features obtained; specifically, the features in $MB(Y_i)$ are arranged in ascending order of their degree of association with $Y_i$, and the top k2% of features in $MB(Y_i)$, i.e., those most weakly correlated with $Y_i$, are stored in $SelFea(Y_i)$; the q tags are traversed, and the false MB features are removed from $SelFea(Y_i)$ by checking whether the MB of each feature in $SelFea(Y_i)$ contains $Y_i$.
For a label $Y_i \in Y$, $MB(Y_i)$ always contains a small fraction of erroneous features. Therefore, the features in $MB(Y_i)$ are arranged in ascending order of their degree of association with $Y_i$, and the top k2% of features in $MB(Y_i)$ most weakly correlated with $Y_i$ are stored in $SelFea(Y_i)$ for checking.
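The ranking in sub-steps (2) and (3) and the check in sub-step (3) can be sketched as follows. This is only an illustrative sketch: the association scores are assumed to be precomputed by some measure the text does not specify, `mb_lookup` is a hypothetical callable that returns the learned MB of a variable, and the check relies on the symmetry property of Markov blankets (if a feature truly belongs to the MB of a tag, the tag belongs to the MB of that feature).

```python
import math

def top_percent(scores, percent, strongest=True):
    """Return the top `percent`% of features ranked by association score.

    scores: dict mapping feature name -> association with the tag (assumed given).
    strongest=True ranks in descending order (sel), False in ascending order (SelFea).
    """
    ranked = sorted(scores, key=scores.get, reverse=strongest)
    count = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:count]

def prune_false_features(label, mb_of_label, assoc, mb_lookup, k2):
    """Drop weakly associated MB members whose own MB does not contain the label."""
    scores = {f: assoc[f] for f in mb_of_label}
    suspects = top_percent(scores, k2, strongest=False)  # plays the role of SelFea(Y_i)
    false_features = {f for f in suspects if label not in mb_lookup(f)}
    return [f for f in mb_of_label if f not in false_features]
```

Restricting the symmetry check to the weakest k2% keeps the number of additional MB searches small, which appears to be the motivation for the ranking step.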
S24, acquiring extended-MB; the extended-MB consists of a MB of class variables and a union of MB of each variable in the class variable MB.
The rationale behind this strategy is that, according to MB feature selection theory, for noisy data sets the extended-MB of a class variable has a higher probability of containing causally informative features than the MB of the class variable alone, since the missing values have been filled with reasonable values by the data interpolation method.
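The construction of the extended-MB in step S24 amounts to a set union, which can be sketched as below. The callable `mb_lookup` is a hypothetical stand-in for whatever MB learner is used, and removing the target itself from the result is an assumption of this sketch.

```python
def extended_mb(target, mb_lookup):
    """Union of MB(target) with the MB of every variable inside MB(target)."""
    mb_target = set(mb_lookup(target))
    result = set(mb_target)
    for variable in mb_target:
        result |= set(mb_lookup(variable))
    result.discard(target)  # assumption: the target is not kept as one of its own features
    return result
```

For several class variables, the same function can be applied to each tag and the results merged with a further union.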
S25, updating a data set; the dataset is updated to a new dataset that contains only extended-MBs.
Specifically, the new data set is the one restricted to the extended-MB found in step S24; this allows the data interpolation to fill only the missing values of the causally informative features, rather than those of all features in the data set.
S26, judging whether iteration termination conditions are met, stopping if the iteration termination conditions are met, and continuously repeating the steps S22-S25 if the iteration termination conditions are not met;
in a specific embodiment, the iteration termination condition is that no more changes occur in the data in the extended-MB.
S27, returning the feature subset of the gene expression data set after the iteration ends.
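The alternation of steps S22 to S27 can be summarised by the following sketch. The callables `impute` and `learn_extended_mb` are hypothetical placeholders for the interpolation and MB-learning components, the termination test is read here as "the selected columns no longer change", and `max_iter` is an added safeguard that the text does not mention.

```python
def select_features(data, columns, labels, impute, learn_extended_mb, max_iter=20):
    """Alternate imputation and extended-MB learning until the selection is stable."""
    current = list(columns)
    for _ in range(max_iter):
        filled = impute(data, current)                          # S22: fill missing values
        selected = learn_extended_mb(filled, current, labels)   # S23-S24: extended-MB
        updated = [c for c in current if c in selected]         # S25: shrink the data set
        if updated == current:                                  # S26: nothing changed
            break
        current = updated
    return current                                              # S27: final feature subset
```

Because each pass keeps only the extended-MB columns, later imputation passes operate on fewer features than earlier ones.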
In a specific embodiment, the returned optimal feature subset is evaluated on a test set, and the resulting classification accuracy is used to verify the quality of the feature subset obtained from the gene expression data set.
S3, training the gene classification model by using the feature subset of the gene expression data set to obtain a trained gene classification model.
The gene classification model adopts MLKNN. The feature subset of the gene expression data set is input into the MLKNN model for training, with the parameter k (the number of nearest neighbours) of the MLKNN model set to 10 and the other parameters left at their defaults, to obtain the MLKNN model trained on the optimized data set.
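For readers unfamiliar with MLKNN, the following is a compact sketch of the commonly described ML-KNN procedure: per-label prior probabilities, neighbour-count likelihoods with Laplace smoothing, and a maximum a posteriori decision. It is not the implementation used in the application; the function names, the Euclidean distance and the smoothing value `s=1.0` are assumptions, and the defaults of whichever library the applicants used are not stated in the text.

```python
import math

def _neighbours(x, train_x, k, skip=None):
    """Indices of the k training points closest to x (Euclidean distance)."""
    order = sorted((i for i in range(len(train_x)) if i != skip),
                   key=lambda i: math.dist(x, train_x[i]))
    return order[:k]

def mlknn_fit(train_x, train_y, k=10, s=1.0):
    """Estimate label priors and neighbour-count statistics with smoothing s."""
    n, q = len(train_x), len(train_y[0])
    prior = [(s + sum(y[l] for y in train_y)) / (2 * s + n) for l in range(q)]
    # counts[l][h][c]: training points whose label l equals h and that have
    # exactly c neighbours carrying label l.
    counts = [[[0] * (k + 1) for _ in range(2)] for _ in range(q)]
    for i, x in enumerate(train_x):
        near = _neighbours(x, train_x, k, skip=i)
        for l in range(q):
            c = sum(train_y[j][l] for j in near)
            counts[l][train_y[i][l]][c] += 1
    return {"x": train_x, "y": train_y, "k": k, "s": s,
            "prior": prior, "counts": counts}

def mlknn_predict(model, x):
    """Return a 0/1 label vector for x using the MAP rule."""
    k, s = model["k"], model["s"]
    near = _neighbours(x, model["x"], k)
    labels = []
    for l, p1 in enumerate(model["prior"]):
        c = sum(model["y"][j][l] for j in near)
        has, not_has = model["counts"][l][1], model["counts"][l][0]
        like1 = (s + has[c]) / (s * (k + 1) + sum(has))
        like0 = (s + not_has[c]) / (s * (k + 1) + sum(not_has))
        labels.append(1 if p1 * like1 > (1 - p1) * like0 else 0)
    return labels
```

The default `k=10` mirrors the setting stated above; on very small data sets a smaller `k` is needed for the neighbour counts to be informative.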
In bioinformatics research, yeast gene expression data is often used to verify how well theoretical models and algorithms work in practice.
In one embodiment, the gene expression data set is the yeast data set, a typical multi-tag gene expression data set comprising microarray expression data and phylogenetic profiles of 2417 yeast genes. Each gene is annotated with a subset of the 14 top-level functional categories (e.g., metabolism, energy, etc.) of the functional catalogue. In order to test the performance of the proposed method, missing values are set in each feature, but not in the class attributes, at four different levels: 5%, 10%, 15% and 20%, generating data sets with missing values.
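One way to generate such data sets is sketched below. The text does not say how the missing entries were chosen, so this example assumes they are removed completely at random and independently per feature; the function name `inject_missing`, the `None` marker and the fixed `seed` are likewise assumptions.

```python
import random

def inject_missing(rows, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of each feature set to None.

    Only feature columns are passed in; class attributes are kept separately
    and therefore stay complete.
    """
    rng = random.Random(seed)
    n = len(rows)
    n_features = len(rows[0])
    corrupted = [list(r) for r in rows]
    n_missing = round(n * rate)
    for j in range(n_features):
        # Pick distinct row indices for this feature.
        for i in rng.sample(range(n), n_missing):
            corrupted[i][j] = None
    return corrupted
```

Calling the function once per level (0.05, 0.10, 0.15, 0.20) on the same complete feature matrix yields four comparable data sets.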
Hamming Loss, Average Precision, Coverage and Ranking Loss are used as evaluation criteria for the classification models:
1) Hamming Loss:
2) Average Precision:
3) Coverage:
4) Ranking Loss:
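The formulas for these four criteria are not reproduced in this text. As a point of reference only, the sketch below implements the definitions commonly used in the multi-label literature; whether they match the application's formulas exactly is an assumption. Here `y_true` and `y_pred` are lists of 0/1 label vectors, `scores` holds one real-valued score per label, Coverage follows the convention of subtracting 1 from the deepest rank, and every instance is assumed to have at least one relevant and one irrelevant label.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    q = len(y_true[0])
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (len(y_true) * q)

def _ranks(scores):
    """Rank labels by descending score; the best label has rank 1."""
    order = sorted(range(len(scores)), key=lambda l: -scores[l])
    ranks = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        ranks[label] = position
    return ranks

def average_precision(y_true, scores):
    """Mean, over relevant labels, of the share of relevant labels ranked at or above them."""
    total = 0.0
    for yt, sc in zip(y_true, scores):
        ranks = _ranks(sc)
        relevant = [l for l, t in enumerate(yt) if t]
        total += sum(
            sum(1 for m in relevant if ranks[m] <= ranks[l]) / ranks[l]
            for l in relevant
        ) / len(relevant)
    return total / len(y_true)

def coverage(y_true, scores):
    """Average depth in the ranking needed to cover all relevant labels, minus 1."""
    total = 0
    for yt, sc in zip(y_true, scores):
        ranks = _ranks(sc)
        total += max(ranks[l] for l, t in enumerate(yt) if t) - 1
    return total / len(y_true)

def ranking_loss(y_true, scores):
    """Average fraction of (relevant, irrelevant) label pairs that are mis-ordered."""
    total = 0.0
    for yt, sc in zip(y_true, scores):
        pos = [l for l, t in enumerate(yt) if t]
        neg = [l for l, t in enumerate(yt) if not t]
        bad = sum(1 for a in pos for b in neg if sc[a] <= sc[b])
        total += bad / (len(pos) * len(neg))
    return total / len(y_true)
```

Under these definitions a larger Average Precision is better, while smaller values are better for the other three criteria.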
The steps of the multi-label feature selection process are carried out according to the overall flow chart. After the flow finishes, the required feature subset MB is returned, and an MLKNN classifier model is finally trained on the feature subset MB to obtain the model MLKNN_MB. In the comparison experiment, the MLKNN model is trained directly on the training set Train using the original data without feature selection, to obtain the model MLKNN_train. The test set Test is then applied to obtain the four indexes of the MLKNN_train model. The above data are aggregated as shown in Table 1 below:
Table 1: Comparison of the four indexes between the feature subset MB and the all-features data set (scene)
In the table, the larger the index Average Precision the better, and the smaller the indexes Hamming Loss, Coverage and Ranking Loss the better. The experimental results show that the MLKNN_MB classifier is better than the MLKNN_train classifier on the indexes considered. This indicates that the proposed gene classification method based on multi-tag feature selection can effectively improve classification accuracy.
S4, obtaining gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining a label of the gene data to be classified.
The application provides a gene classification method based on multi-tag feature selection, which comprises the steps of initializing data, performing causal feature selection on multi-tag missing data by learning a causal structure of each class tag, searching multi-tag MB, obtaining extended-MB, and integrating data interpolation and multi-tag MB learning into a unified framework so that two modules can be matched with each other. MB learning helps to interpolate missing data in potentially causal features, while data interpolation provides an accurate interpolated data set for the reliability of MB learning. And then, the gene classification is finished based on the feature subset after feature selection, so that the accuracy of the gene classification is improved, and the calculation load of the high-dimensional gene classification is reduced.
FIG. 4 is a diagram of a gene classification system based on multi-tag feature selection according to an embodiment of the present application. As shown in fig. 4, the gene classification system based on multi-tag feature selection comprises a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module and a gene classification module;
the multi-tag feature selection model construction module is used for constructing a multi-tag feature selection model based on the missing gene data;
the feature selection module is used for acquiring a gene expression data set, and performing feature selection on the gene expression data set by utilizing the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the gene classification model training module is used for training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
the gene classification module is used for acquiring gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.
The gene classification system based on multi-tag feature selection described above may be implemented in the form of a computer program that is executable on a computer device.
The computer device may be a server, where the server may be a stand-alone server, or may be a server cluster formed by a plurality of servers.
The computer device includes a processor, memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform a method of gene classification based on multi-tag feature selection.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by a processor, causes the processor to perform a method of gene classification based on multi-tag feature selection.
The network interface is for network communication with other devices. It will be appreciated by persons skilled in the art that the computer device structures described above are merely partial structures relevant to the present inventive arrangements and do not constitute a limitation of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
Wherein the processor is configured to run a computer program stored in a memory, the program implementing the gene classification method based on multi-tag feature selection as described in embodiment one.
It should be appreciated that in embodiments of the present application, the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
The application also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program, wherein the computer program when executed by a processor causes the processor to perform a method of gene classification based on multi-tag feature selection as described in embodiment one.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium capable of storing program code.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The steps of the methods in the embodiments of the present application may be reordered, combined, or deleted according to actual needs. The units of the devices in the embodiments of the present application may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the integrated unit is implemented as a software functional unit and sold or used as a stand-alone product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present application.
Note that the above describes only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the application. Therefore, although the application has been described in connection with the above embodiments, it is not limited to them and may be embodied in many other equivalent forms without departing from its spirit or scope, which is set forth in the following claims.

Claims (7)

1. A gene classification method based on multi-tag feature selection, characterized by comprising the following steps:
S1, constructing a multi-tag feature selection model based on missing gene data;
S2, acquiring a gene expression data set, and performing feature selection on the gene expression data set by using the multi-tag feature selection model based on missing gene data to obtain a feature subset of the gene expression data set;
wherein step S2 specifically comprises the following steps:
S21, acquiring a gene expression data set, wherein the gene expression data set contains m-dimensional features and q-dimensional labels;
S22, interpolating the gene data;
S23, searching for the multi-label Markov blanket (MB), wherein the MB of the class variables is learned by the multi-label causal feature selection learning method MLMB;
S24, acquiring the extended-MB, wherein the extended-MB is the union of the MB of the class variables and the MB of each variable in the class variables;
S25, updating the data set to a new data set containing only the extended-MB;
S26, judging whether the iteration termination condition is met; stopping if it is met, and otherwise repeating steps S22 to S25;
S27, returning the feature subset of the gene expression data set obtained after the iteration;
S3, training a gene classification model by using the feature subset of the gene expression data set to obtain a trained gene classification model;
S4, acquiring gene data to be classified, and classifying the gene data to be classified by using the trained gene classification model based on the feature subset of the gene data to be classified, to obtain the labels of the gene data to be classified.
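The iteration of steps S22 to S26 can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: `select_features`, `impute`, `learn_extended_mb`, and `max_iter` are hypothetical names, the two callables stand in for steps S22 and S23/S24, and the termination rule (stop when the feature subset no longer shrinks, or after `max_iter` rounds) is assumed because the claim does not fix one.

```python
from typing import Callable, Dict, List, Optional, Set

# One sample: variable name -> value, with None marking a missing entry.
Row = Dict[str, Optional[float]]


def select_features(
    data: List[Row],
    labels: Set[str],
    impute: Callable[[List[Row]], List[Row]],
    learn_extended_mb: Callable[[List[Row], Set[str]], Set[str]],
    max_iter: int = 10,
) -> Set[str]:
    """Iterate S22-S26 on a non-empty data set and return the feature subset (S27)."""
    features = set(data[0]) - labels
    for _ in range(max_iter):
        completed = impute(data)  # S22: fill missing values
        # S23-S24: hypothetical learner returning the extended-MB variables
        ext_mb = learn_extended_mb(completed, labels) & features
        # S25: keep only extended-MB features and the labels; the original
        # (possibly missing) entries are kept so the next round re-imputes them
        keep = ext_mb | labels
        data = [{k: v for k, v in row.items() if k in keep} for row in data]
        if ext_mb == features:  # S26: assumed termination rule
            break
        features = ext_mb
    return features
```

Passing the imputer and the MB learner as arguments keeps the loop independent of any particular interpolation or causal-discovery method.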
2. The method of claim 1, wherein, in the gene expression data set, $F = \{F_1, F_2, \ldots, F_m\}$ denotes the m-dimensional feature set, $Y = \{Y_1, Y_2, \ldots, Y_q\}$ denotes the q-dimensional label set, $V = F \cup Y$, and $S$ is any set of variables within $V$; $V_i \perp V_j \mid S$ (where $i \neq j$; $S \subseteq V \setminus \{V_i, V_j\}$) denotes that $V_i$ is conditionally independent of $V_j$ given $S$; $V \setminus Y_i$ and $V / Y_i$ are equivalent, and both denote all variables in $V$ except $Y_i$.
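To make the conditional independence statement concrete, the sketch below tests whether one variable is independent of another given a conditioning set. The claim does not name a test; a partial-correlation test with Fisher's z-transform is assumed here purely for illustration (it suits roughly Gaussian, continuous expression values), and `pearson`, `partial_corr`, `cond_independent`, and `alpha` are hypothetical names.

```python
import math
from statistics import NormalDist
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(data: Dict[str, List[float]], i: str, j: str, cond: List[str]) -> float:
    """Partial correlation of i and j given cond, via the recursive formula."""
    if not cond:
        return pearson(data[i], data[j])
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(data, i, j, rest)
    r_ik = partial_corr(data, i, k, rest)
    r_jk = partial_corr(data, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def cond_independent(
    data: Dict[str, List[float]],
    i: str,
    j: str,
    cond: Sequence[str] = (),
    alpha: float = 0.05,
) -> bool:
    """Return True when independence of i and j given cond is not rejected."""
    n = len(data[i])  # requires n > len(cond) + 3
    r = partial_corr(data, i, j, list(cond))
    r = max(min(r, 0.999999), -0.999999)  # keep the log below finite
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value > alpha
```

A Markov blanket learner would call such a test repeatedly with different conditioning sets; discrete data would need a different test (for example a G-test), which is outside this sketch.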
3. The method according to claim 1, wherein the gene data interpolation of step S22 specifically comprises: estimating the missing values by using the observed information of the incomplete instances and the complete instances in the gene expression data set, and interpolating the gene data by using a KNN or Lagrange interpolation method.
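The two interpolation options named in this claim can be sketched as follows. This is a minimal illustration, not the claimed implementation: `knn_impute` and `lagrange_interpolate` are hypothetical helpers, the distance (root mean squared difference over the features both rows have observed) and the neighbor count `k` are assumptions, and how the Lagrange nodes are chosen for gene data is left open by the claim.

```python
import math
from typing import List, Optional, Sequence

Matrix = List[List[Optional[float]]]  # rows = samples, None = missing value


def knn_impute(rows: Matrix, k: int = 3) -> Matrix:
    """Fill each missing entry with the mean of that column over the k nearest rows."""
    filled = [row[:] for row in rows]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for o, other in enumerate(rows):
                if o == r or other[c] is None:
                    continue
                # compare only on features observed in both rows
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[c]))
            candidates.sort(key=lambda t: t[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no usable neighbor exists
                filled[r][c] = sum(nearest) / len(nearest)
    return filled


def lagrange_interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Evaluate at x the Lagrange polynomial through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total
```

The KNN variant uses both complete and incomplete instances as donors, as long as the donor has the needed column observed, which matches the wording of the claim. High-degree Lagrange polynomials can oscillate, so in practice only a few nearby nodes would normally be used.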
4. The method according to claim 1, wherein step S23 comprises:
(1) Mining the causal mechanism of each gene label: the MB of each class label $Y_i$ is learned from all features and all labels other than $Y_i$, i.e., the MB of $Y_i$ is found within $V \setminus Y_i$, and the result is denoted $\mathrm{MB}(Y_i)$;
(2) Detecting features that are ignored due to label correlation: specifically, features that were ignored because of strong label correlation are detected from $F \setminus \mathrm{MB}(Y_i)$;
(3) Correcting the false features obtained: specifically, the features in $\mathrm{MB}(Y_i)$ are arranged in ascending order of their degree of association with $Y_i$, and the first k2% of features, i.e., those with the weakest association, are selected and stored in $\mathrm{SelFea}(Y_i)$; the q labels are traversed, and the false MB features are removed from $\mathrm{SelFea}(Y_i)$ by checking whether the MB of each feature in $\mathrm{SelFea}(Y_i)$ contains $Y_i$.
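Step (3) can be sketched as below, assuming the Markov blankets and an association score are already available. The names `correct_false_features`, `mb_of`, and `assoc` are hypothetical, and rounding the k2% cut-off up with `ceil` is an assumption; the claim does not say how the percentage is rounded or which association measure is used.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple


def correct_false_features(
    labels: Sequence[str],
    features: Set[str],
    mb_of: Dict[str, Set[str]],
    assoc: Callable[[str, str], float],
    k2: float,
) -> Tuple[Dict[str, List[str]], Dict[str, Set[str]]]:
    """For each label, return the cleaned SelFea list and the false features found.

    mb_of maps every variable (label or feature) to its learned MB;
    assoc(feature, label) returns the strength of association.
    """
    sel_fea: Dict[str, List[str]] = {}
    false_fea: Dict[str, Set[str]] = {}
    for y in labels:  # traverse the q labels
        # ascending order: weakest association first (name breaks ties)
        ranked = sorted((f for f in mb_of[y] if f in features),
                        key=lambda f: (assoc(f, y), f))
        n_weak = math.ceil(len(ranked) * k2 / 100)  # assumed rounding
        candidates = ranked[:n_weak]  # SelFea(Y_i)
        # a true MB member should have Y_i in its own MB (symmetry check)
        false_fea[y] = {f for f in candidates if y not in mb_of.get(f, set())}
        sel_fea[y] = [f for f in candidates if f not in false_fea[y]]
    return sel_fea, false_fea
```

Only the weakest k2% of each blanket is re-checked, which limits the number of additional MB searches on features; whether the detected false features are also dropped from the label's MB is left to the caller in this sketch.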
5. The method of claim 1, wherein the gene classification model is MLKNN.
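For readers unfamiliar with MLKNN, the following compact sketch follows the usual ML-KNN formulation: for each label, a prior and neighbor-count likelihoods are estimated with Laplace smoothing, and a label is predicted when its posterior for "present" exceeds that for "absent". The class layout, the Euclidean distance, and the defaults `k=3` and `s=1.0` are assumptions of this sketch, not details taken from the application.

```python
import math
from typing import List, Optional, Sequence


def _neighbors(x: Sequence[float], X: Sequence[Sequence[float]], k: int,
               skip: Optional[int] = None) -> List[int]:
    """Indices of the k training points nearest to x (optionally skipping one)."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: math.dist(x, X[i]))
    return order[:k]


class MLKNN:
    def __init__(self, k: int = 3, s: float = 1.0) -> None:
        self.k, self.s = k, s  # neighbor count and smoothing constant

    def fit(self, X: List[List[float]], Y: List[List[int]]) -> "MLKNN":
        self.X, self.Y = X, Y
        n, q, k, s = len(X), len(Y[0]), self.k, self.s
        # prior probability that each label is present
        self.prior = [(s + sum(y[l] for y in Y)) / (2 * s + n) for l in range(q)]
        # c1[l][d]: training points WITH label l whose neighbors carry it d times
        c1 = [[0] * (k + 1) for _ in range(q)]
        c0 = [[0] * (k + 1) for _ in range(q)]
        for i in range(n):
            nb = _neighbors(X[i], X, k, skip=i)
            for l in range(q):
                delta = sum(Y[j][l] for j in nb)
                (c1 if Y[i][l] else c0)[l][delta] += 1
        self.like1 = [[(s + c) / (s * (k + 1) + sum(c1[l])) for c in c1[l]]
                      for l in range(q)]
        self.like0 = [[(s + c) / (s * (k + 1) + sum(c0[l])) for c in c0[l]]
                      for l in range(q)]
        return self

    def predict(self, x: Sequence[float]) -> List[int]:
        nb = _neighbors(x, self.X, self.k)
        out = []
        for l, p1 in enumerate(self.prior):
            delta = sum(self.Y[j][l] for j in nb)
            present = p1 * self.like1[l][delta]
            absent = (1 - p1) * self.like0[l][delta]
            out.append(1 if present > absent else 0)
        return out
```

In the method of claim 1, `X` would hold only the columns of the selected feature subset, both at training time (S3) and at classification time (S4).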
6. A gene classification system based on multi-tag feature selection, wherein the gene classification system performs the gene classification method based on multi-tag feature selection of claim 1 and comprises: a multi-tag feature selection model construction module, a feature selection module, a gene classification model training module, and a gene classification module;
the multi-tag feature selection model construction module is used for constructing a multi-tag feature selection model based on the missing gene data;
the feature selection module is used for acquiring a gene expression data set, and performing feature selection on the gene expression data set by utilizing the multi-tag feature selection model based on the missing gene data to obtain a feature subset of the gene expression data set;
the gene classification model training module is used for training the gene classification model by utilizing the feature subset of the gene expression data set to obtain a trained gene classification model;
the gene classification module is used for acquiring gene data to be classified, classifying the genes to be classified by using a trained gene classification model based on the feature subset of the gene data to be classified, and obtaining the labels of the gene data to be classified.
7. A computer device, characterized in that the device comprises a memory and a processor, the memory having stored thereon a computer program, which when executed by the processor implements the method according to any of claims 1 to 5.
CN202310810555.1A 2023-07-04 2023-07-04 Gene classification method, system and equipment based on multi-tag feature selection Pending CN116779044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310810555.1A CN116779044A (en) 2023-07-04 2023-07-04 Gene classification method, system and equipment based on multi-tag feature selection

Publications (1)

Publication Number Publication Date
CN116779044A true CN116779044A (en) 2023-09-19

Family

ID=87989388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310810555.1A Pending CN116779044A (en) 2023-07-04 2023-07-04 Gene classification method, system and equipment based on multi-tag feature selection

Country Status (1)

Country Link
CN (1) CN116779044A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118517383A (en) * 2024-07-22 2024-08-20 国网上海市电力公司 Deep learning-based intelligent detection method and equipment for running risk of wind turbine generator

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114265943A (en) * 2021-12-24 2022-04-01 吉林大学 Causal relationship event pair extraction method and system
CN116364274A (en) * 2023-03-16 2023-06-30 山西医科大学 Disease prediction method and system based on causal inference and dynamic integration of multiple labels

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KUI YU ET AL.: "Causal Feature Selection with Missing Data", ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, vol. 16, no. 4, 31 January 2022 (2022-01-31), pages 1 - 9 *
XINGYU WU ET AL.: "Multi-Label Causal Feature Selection", THE THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-20), 31 December 2020 (2020-12-31), pages 6430 - 6435 *

Similar Documents

Publication Publication Date Title
CN107862173B (en) Virtual screening method and device for lead compound
CN108334574B (en) Cross-modal retrieval method based on collaborative matrix decomposition
Liu et al. Incdet: In defense of elastic weight consolidation for incremental object detection
Yu et al. Protein function prediction using multilabel ensemble classification
US20210020266A1 (en) Phase-aware determination of identity-by-descent dna segments
AU2019231255A1 (en) Systems and methods for spatial graph convolutions with applications to drug discovery and molecular simulation
US11615324B2 (en) System and method for de novo drug discovery
US11256995B1 (en) System and method for prediction of protein-ligand bioactivity using point-cloud machine learning
WO2013067461A2 (en) Identifying associations in data
CN102214302A (en) Recognition device, recognition method, and program
CN109637579B (en) Tensor random walk-based key protein identification method
Bicego et al. A bioinformatics approach to 2D shape classification
Liu et al. EACP: An effective automatic channel pruning for neural networks
Bi et al. High-dimensional supervised feature selection via optimized kernel mutual information
Brinda Novel computational techniques for mapping and classification of Next-Generation Sequencing data
JP2022548960A (en) Single-cell RNA-SEQ data processing
Zeng et al. A novel HMM-based clustering algorithm for the analysis of gene expression time-course data
CN116779044A (en) Gene classification method, system and equipment based on multi-tag feature selection
Liu et al. Todynet: temporal dynamic graph neural network for multivariate time series classification
Shiga et al. A variational bayesian framework for clustering with multiple graphs
Mestres et al. Selection of the regularization parameter in graphical models using network characteristics
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
CN116383441A (en) Community detection method, device, computer equipment and storage medium
US11367006B1 (en) Toxic substructure extraction using clustering and scaffold extraction
Xu et al. A structure-induced framework for multi-label feature selection with highly incomplete labels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination