CN109448787B

CN109448787B - Protein subnuclear localization method for feature extraction and fusion based on improved PSSM

Info

Publication number: CN109448787B
Application number: CN201811187766.XA
Authority: CN
Inventors: 聂仁灿; 阮小利; 周冬明; 贺康建; 李华光
Original assignee: Yunnan University YNU
Current assignee: Yunnan University YNU
Priority date: 2018-10-12
Filing date: 2018-10-12
Publication date: 2021-10-08
Anticipated expiration: 2038-10-12
Also published as: CN109448787A

Abstract

The invention discloses a protein subnuclear localization method based on improved PSSM feature extraction and fusion, and relates to the technical field of biology and information. The method first uses a Z-SoftMax function to standardize the position-specific scoring matrix (PSSM), which carries the evolutionary information of a protein sequence; second, it applies the proposed SC-PSSM-C and SC-PSSM-R to extract features of the PSSM in different directions and at different jump intervals, giving the PSSM representation a fixed length; finally, it performs the final classification prediction with a parameter-optimized W-SVM classifier. The method compensates for the limitations and narrowness of traditional feature extraction and improves protein subnuclear localization performance.

Description

Protein subnuclear localization method for feature extraction and fusion based on improved PSSM

Technical Field

The invention relates to the technical field of biology and information, and in particular to a protein subnuclear localization method based on improved PSSM feature extraction and fusion.

Background

With the spread and improvement of human genome sequencing technology, protein sequences are being produced in large quantities. Over the last 20 years, understanding the function of newly detected protein sequences has become one of the hot spots of bioinformatics research. The function of a protein depends on its location in the cell, and determining the subcellular localization of a protein is considered an important step in understanding its function. Protein subnuclear localization information can provide important clues for the prevention, diagnosis and treatment of diseases. In recent years, with the rapid development of computer science, studying protein subnuclear localization with machine learning methods has become a hotspot of bioinformatics research, and it can overcome the high research and development cost and low prediction speed of traditional methods.

At present, the key parts of protein subcellular localization prediction research are the extraction of feature information and the construction of a classification algorithm model. Experiments in a large number of published papers show that evolutionary information plays an important role in subnuclear localization prediction when it is used to extract protein features, and that converting the evolutionary information extracted from an ordered sequence into an effective feature vector of fixed dimension is a difficulty of current research. The most effective improved algorithms based on evolutionary information currently include PSSM-CC, proposed by Dong Q and Zhou S in 2009; "A multiple information fusion method for predicting subcellular locations of two different types of bacterial proteins"; and the k-separated-bigrams-PSSM algorithm, proposed jointly in 2015 by the University of Tokyo, Griffith University (Australia) and the University of the South Pacific (Jin Cheng).

In summary, the technical problems of the prior art are as follows. First, although these models provide more information about amino acid interactions in the protein sequence, they are still limited to the discriminative information in a single column or row, or in two columns or rows with variable spacing; the extracted features are too narrow to express the overall characteristics of the protein sequence. The extraction of effective features influences the classification result of the classifier; samples in proteomic data generally have high-dimensional features, and challenges remain in how to select features effectively, remove irrelevant features and relieve the "curse of dimensionality". Second, data sets in proteomics suffer from class imbalance, for example the Multipass membrane protein data set; the imbalance leads to low prediction accuracy for classes with few samples, and has become a difficulty and a key research topic in proteomics. Building on a summary of previous work, the invention further studies these problems and provides a new machine learning method, so that in the final result the prediction accuracy of the minority classes approaches that of the majority classes and the overall recognition effect is improved.

Disclosure of Invention

In view of the above problems in the prior art, the invention provides a protein subnuclear localization method that performs feature extraction and fusion based on an improved position-specific scoring matrix (PSSM); the new feature extraction and fusion method improves the prediction recognition rate for subnuclear proteins.

In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:

the protein sub-nucleus positioning method for performing feature extraction and fusion based on the improved PSSM comprises the following steps:

step 1: acquiring a protein data set, determining whether the acquired data set is a single-label problem or a multi-label problem, converting the data set into the standard FASTA format for the single-label case, and labeling the categories of all samples;

step 2: setting the iteration parameter to 3, setting the E-value of each protein in the alignment search to 0.001, and calculating the PSSM matrix of each piece of data;

step 3: constructing a feature set by applying different feature expressions to the features obtained in step 2, so as to extract richer complementary information;

step 4: selecting features from those acquired in step 3 by adopting an improved maximum information coefficient;

step 5: judging whether the feature set obtained in step 4 is a balanced data set; if so, skipping this step; if not, performing sampling processing;

whether the data set is balanced is judged by setting a threshold on the difference between classes;

step 6: constructing a classification model for the data set obtained in step 4.

Further, in step 1, a corresponding threshold is set for the acquired data set according to the length of each piece of data, the threshold being a length greater than 50.

Further, when the PSSM matrix of each piece of data is calculated, each protein is denoted by P, where P = [P1, P2, ..., P20], Pj = [P1j, P2j, ..., PLj] (j = 1, 2, ..., 20), and L represents the length of each protein.

Further, the step of constructing the feature set by respectively adopting different feature expressions for the features obtained in the step 2 includes the following steps:

performing dimension unification on the PSSM processed in the step 2, wherein the formula is as follows:

wherein c represents the number of classes, and x represents the value of the original PSSM matrix;

standardizing the dimension-unified data set by the formula z = (x - μ)/σ, wherein x is the corresponding value after the dimension unification, μ is the mean, and σ is the standard deviation;

and (3) carrying out feature extraction of an SC-PSSM-R algorithm on the processed data set, wherein the formula is as follows:

wherein

When r is 0, it represents two adjacent peptides, when r is 1, it represents two peptides at a distance of 1, and so on;

extracting column direction characteristics of the data set subjected to dimension unification and standardized processing, wherein the formula is as follows:

the above formula can be extended to the formula:

wherein

Representing the difference value of the corresponding values of the position specificity scoring matrix corresponding to the two peptides;

and, with the weight step length set to 0.01, traversing the score-specific evolution information in different directions and at different jump intervals, seeking the best feature set and analyzing the effect of the primary feature fusion under different weights.

Further, the selecting of the features by using the improved maximum information coefficient for the obtained features includes the following steps:

the obtained maximum information coefficients are sorted by score, the score distributions of different data sets are analyzed, different thresholds are set, and the corresponding features are selected;

and the maximum information coefficient operation is performed again on the selected features; unlike the preceding step, the resulting scores are used as the weights of the corresponding features to form a new feature set.

Further, the constructing a classification model for the data set obtained in step 4 includes the following steps:

training classification models with different parameters according to the characteristics of different data sets, and optimizing the parameters by a global and local parameter optimization method;

and putting the processed protein test set data into a corresponding trained classification model for final classification prediction.

The invention has the following beneficial effects. The invention relates to a protein subnuclear localization method based on improved PSSM feature extraction and fusion. First, the obtained protein data set is preprocessed and its position-specific scoring matrix (PSSM) is calculated. Second, the PSSM is standardized with the Z-Softmax function, which avoids the null data generated during processing by the traditional method. Then, local and global features of the rows and columns of the processed PSSM are extracted by setting different interval jump values r, namely by the SC-PSSM-R and SC-PSSM-L algorithms. Next, the weighted and fused SC-PSSM-R and SC-PSSM-L feature matrices undergo feature selection and score weighting through two passes of the improved maximum information coefficient. Finally, the final prediction and evaluation are performed by the parameter-optimized classifier. The proposed method can not only extract effective features of the position-specific scoring matrix in different directions and at different jump intervals, enhancing the complementarity of the effective information, but also remove redundancy through the improved feature selection method. Feature extraction is the prerequisite of classification, and effective feature extraction can improve the recognition rate of the classifier. Compared with traditional methods based on the PSSM, the method can extract richer and more effective protein features.

Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a general flowchart of a protein sub-nucleus localization method for feature extraction and fusion based on modified PSSM according to an embodiment of the present invention;

FIG. 2 is an implementation flowchart of the protein sub-nucleus localization method for feature extraction and fusion based on improved PSSM according to an embodiment of the present invention;

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in FIGS. 1-2, a protein sub-nucleus localization method for feature extraction and fusion based on improved PSSM comprises the following steps:

Step 1: acquiring a protein data set, determining whether the acquired data set is a single-label problem or a multi-label problem (the invention mainly addresses the single-label problem), converting the data set into the standard FASTA format, and labeling the category of all samples.

In step 1, a threshold value (generally, the length is greater than 50) is set according to the length of each piece of data for data screening of the acquired data set.
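The screening in step 1 can be illustrated with a minimal sketch. It assumes the data set is plain FASTA text and that "length greater than 50" means strictly more than 50 residues; the function names `parse_fasta` and `filter_by_length` are hypothetical and do not come from the patent.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_by_length(records, min_length=50):
    """Keep sequences strictly longer than min_length (assumed reading of the threshold)."""
    return [(h, s) for h, s in records if len(s) > min_length]
```

Whether a sequence of exactly 50 residues should be kept is not stated in the text; the sketch drops it.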

Step 2: setting the iteration parameter to 3, setting the E-value of each protein in the alignment search to 0.001, and calculating the PSSM matrix of each piece of data. Each protein is denoted by P, where P = [P1, P2, ..., P20], Pj = [P1j, P2j, ..., PLj] (j = 1, 2, ..., 20), and L represents the length of each protein.

Step 3: converting the position-specific scoring matrices obtained in step 2 and extracting the corresponding features from each to construct a feature set.
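The text does not name the tool used to compute the PSSM. A minimal sketch, assuming NCBI BLAST+ `psiblast` is used, shows how the two stated parameters could be passed on the command line. The helper name `build_psiblast_command` and all file and database names are hypothetical. It is also an assumption that the stated E-value maps to `-evalue`; PSI-BLAST has a separate `-inclusion_ethresh` option that controls which hits enter the PSSM, and the text does not say which one is meant.

```python
def build_psiblast_command(query_fasta, database, pssm_out,
                           iterations=3, evalue=0.001):
    """Build (but do not run) a psiblast command line for one protein."""
    return [
        "psiblast",
        "-query", query_fasta,              # single-sequence FASTA file
        "-db", database,                    # protein database to search (assumed)
        "-num_iterations", str(iterations), # iteration parameter from step 2
        "-evalue", str(evalue),             # E-value from step 2
        "-out_ascii_pssm", pssm_out,        # text file holding the L x 20 PSSM
    ]
```

The resulting list can be handed to `subprocess.run` on a machine where BLAST+ and a formatted database are installed.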

The first step of step 3 is: processing the PSSM obtained in the step 2 to enable the dimension of the PSSM to be unified, wherein the formula is as follows:

where c represents the number of classes and x represents the value of the original PSSM matrix.

The second step is: standardizing the data set whose dimensions were unified in the first step, using the formula z = (x - μ)/σ, where x is the value obtained after the first step, μ is the mean and σ is the standard deviation.
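The standardization z = (x - μ)/σ can be sketched as follows. The text does not say whether μ and σ are taken per matrix, per row or per column, nor whether σ is the population or sample standard deviation; the sketch assumes one mean and one population standard deviation over the whole matrix, and the function name `z_score` is hypothetical.

```python
from statistics import fmean, pstdev


def z_score(matrix):
    """Standardize every entry of a matrix (list of rows) as (x - mu) / sigma."""
    values = [v for row in matrix for v in row]
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant matrix carries no spread; map it to zeros to avoid dividing by zero.
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - mu) / sigma for v in row] for row in matrix]
```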

The third step is: and performing feature extraction of the SC-PSSM-R algorithm on the data set processed in the second step. The formula is as follows:

wherein (m, n = 1, 2, ..., 20), wherein

When r is 0, it indicates two adjacent peptides, when r is 1, it indicates two peptides at a distance of 1, and so on.
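The SC-PSSM-R formula itself is not reproduced in this text, so the following is only a sketch of one plausible reading: a gapped bigram over PSSM rows, in the style of the k-separated-bigrams approach cited in the Background, with indices m, n running over the 20 columns and r = 0 pairing adjacent positions. The function name `gapped_bigram_features` and the exact form of the sum are assumptions, not the patented formula.

```python
def gapped_bigram_features(pssm, r):
    """Sum pssm[i][m] * pssm[i + r + 1][n] over positions i, for every column pair (m, n).

    pssm is a list of L rows with one score per amino acid column.
    The output length is n_cols * n_cols (400 for 20 columns), independent of L.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    step = r + 1  # r = 0 pairs adjacent positions, r = 1 skips one, and so on
    feats = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(length - step):
        first, second = pssm[i], pssm[i + step]
        for m in range(n_cols):
            for n in range(n_cols):
                feats[m][n] += first[m] * second[n]
    # Flatten row by row so that every protein yields a vector of the same length.
    return [v for row in feats for v in row]
```

The point of the sketch is the fixed output dimension: proteins of different lengths L all map to the same number of features for a given r.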

The fourth step: extracting column direction characteristics of the data set processed in the second step of the step 3, wherein the formula is as follows:

the formula can be expanded to be:

wherein

Represents the difference between the values of the position-specific score matrix corresponding to the two peptides. Wherein r is the same as the above steps.
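The column-direction formula is also missing from this text. The description mentions a difference between the PSSM values of two positions and a formula that "can be expanded", which is consistent with a squared difference (x - y)^2 = x^2 - 2xy + y^2. The sketch below assumes that form, averaged over the position pairs in each column; the function name `column_difference_features` and the averaging are assumptions.

```python
def column_difference_features(pssm, r):
    """For each column j, average (pssm[i][j] - pssm[i + r + 1][j]) ** 2 over positions i."""
    length = len(pssm)
    n_cols = len(pssm[0])
    step = r + 1  # same convention as the row direction: r = 0 means adjacent positions
    if length <= step:
        # Sequence too short for this jump interval; return zeros of the right size.
        return [0.0] * n_cols
    feats = []
    for j in range(n_cols):
        total = sum((pssm[i][j] - pssm[i + step][j]) ** 2
                    for i in range(length - step))
        feats.append(total / (length - step))
    return feats
```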

The fifth step of step 3: with the weight step length set to 0.01, traversing the fused score-specific evolution information in different directions and at different jump intervals, and searching for the best feature set. As shown in fig. 2, the weights are continuously updated, the effect of the primary feature fusion under different weights is analyzed, and the optimal CRC-PSSM feature set is selected by comparison.
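The weight traversal can be sketched as a simple grid search. The text fixes only the step length of 0.01; the weight range [0, 1], the fusion rule (row features scaled by w, column features by 1 - w, then concatenated) and the scoring callback `evaluate` are assumptions made for illustration.

```python
def weight_grid(step=0.01):
    """Return the weights 0, step, 2*step, ..., 1."""
    n = round(1 / step)
    return [round(k * step, 10) for k in range(n + 1)]


def fuse(row_feats, col_feats, w):
    """Weighted concatenation of row- and column-direction features (assumed rule)."""
    return [w * v for v in row_feats] + [(1 - w) * v for v in col_feats]


def best_weight(row_feats_list, col_feats_list, evaluate, step=0.01):
    """Try every weight on the grid and keep the one with the highest score.

    evaluate is a caller-supplied function that maps a list of fused feature
    vectors to a score, for example a cross-validated accuracy.
    """
    best_w, best_score = None, float("-inf")
    for w in weight_grid(step):
        fused = [fuse(r, c, w) for r, c in zip(row_feats_list, col_feats_list)]
        score = evaluate(fused)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```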

Step 4: selecting features, by adopting the improved maximum information coefficient, from the features selected in the fifth step of step 3;

The first step: sorting the maximum information coefficients obtained in step 4 by score, analyzing the score distribution of each feature, setting different thresholds for different data sets, and selecting the corresponding features.

The second step: performing the maximum information coefficient operation again on the features obtained in the first step; unlike the first step, the resulting scores are used as the weights of the corresponding features, and the weighted features are taken as the new features.
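The two passes of step 4 can be sketched as follows. Computing the maximum information coefficient itself needs a dedicated library, so the sketch takes the scoring function as a parameter; `abs_pearson` is only a stand-in used to make the example runnable and is not MIC. The names `select_and_weight` and `abs_pearson` and the ">= threshold" rule are assumptions.

```python
from math import sqrt


def abs_pearson(xs, ys):
    """Stand-in relevance score (absolute Pearson correlation); NOT the MIC."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / sqrt(vx * vy)


def select_and_weight(samples, labels, score_fn, threshold):
    """Pass 1: rank features by score and keep those at or above the threshold.
    Pass 2: score the kept features again and use the scores as feature weights."""
    n_features = len(samples[0])
    columns = [[s[j] for s in samples] for j in range(n_features)]
    scores = [score_fn(col, labels) for col in columns]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    kept = [j for j in ranked if scores[j] >= threshold]
    weights = [score_fn(columns[j], labels) for j in kept]
    weighted = [[w * s[j] for j, w in zip(kept, weights)] for s in samples]
    return kept, weighted
```

In practice `score_fn` would be replaced by an actual MIC implementation, and the threshold would be chosen per data set from the score distribution, as the first step describes.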

Step 5: judging whether the feature set obtained in the second step of step 4 is a balanced data set (by setting a class difference threshold and judging whether the difference between classes exceeds it); if so, skipping this step; if not, performing sampling processing.

Step 6: training classification models with different parameters for the features of different data sets, and performing parameter optimization through a global and local parameter optimization method.
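A minimal sketch of the balance check in step 5 follows. It assumes the "difference value of each class" is the gap between the largest and smallest class counts. The text does not specify the sampling method, so plain random oversampling of the smaller classes is used here purely as an illustration; the function names are hypothetical.

```python
import random
from collections import Counter


def is_balanced(labels, max_difference):
    """True if the largest and smallest class sizes differ by at most max_difference."""
    counts = Counter(labels)
    return max(counts.values()) - min(counts.values()) <= max_difference


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y
```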
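One common way to combine global and local parameter optimization is a coarse grid search followed by a finer search around the best coarse point. The sketch below assumes two SVM-style parameters, C and gamma, searched on a log2 scale; the exponent ranges, the step sizes and the callback `score_fn` (for example, cross-validated accuracy of the classifier) are assumptions and are not given in the text.

```python
from itertools import product


def grid_best(score_fn, c_exps, g_exps):
    """Return the (log2 C, log2 gamma) pair with the highest score on the grid."""
    return max(product(c_exps, g_exps),
               key=lambda p: score_fn(2.0 ** p[0], 2.0 ** p[1]))


def coarse_to_fine(score_fn, c_range=(-5, 15), g_range=(-15, 3),
                   coarse_step=2, fine_step=0.25):
    """Global pass on a coarse grid, then a local pass around the coarse optimum."""
    c_coarse = list(range(c_range[0], c_range[1] + 1, coarse_step))
    g_coarse = list(range(g_range[0], g_range[1] + 1, coarse_step))
    c0, g0 = grid_best(score_fn, c_coarse, g_coarse)
    # The local grid spans one coarse step on each side of the coarse optimum.
    n = int(coarse_step / fine_step)
    c_fine = [c0 + k * fine_step for k in range(-n, n + 1)]
    g_fine = [g0 + k * fine_step for k in range(-n, n + 1)]
    c1, g1 = grid_best(score_fn, c_fine, g_fine)
    return 2.0 ** c1, 2.0 ** g1
```

The scoring callback keeps the sketch independent of any particular classifier library; it would wrap the training and validation of the classifier for a given parameter pair.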

The classification model constructed in the above steps is applied to protein subcellular localization.

Example 2

The invention was experimentally verified on the public apoptosis protein data set ZD98. ZD98 was established by Zhou and Doolittle in 2003 and contains apoptosis protein sequences from 4 subcellular locations: cytoplasmic proteins (CY), plasma membrane-bound proteins (ME), mitochondrial proteins (MI) and other proteins (OTHER). In Table 1, OA denotes the overall accuracy. The results in the table were obtained by fusing the features strictly according to the feature extraction method and fusion strategy described above, with only the traditional linear discriminant analysis algorithm used for dimensionality reduction in the feature selection stage; even so, the results are superior to those of traditional feature extraction methods. As can be seen from Table 1, the proposed algorithm achieves better values on these objective evaluation indices than the other algorithms.

TABLE 1 Fusion results based on different fusion methods

In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A protein sub-nucleus positioning method for feature extraction and fusion based on improved PSSM is characterized in that: the method comprises the following steps:

step 4: selecting the features by adopting an improved maximum information coefficient aiming at the features acquired in the step 3 to obtain a feature set;

step 5: judging the difference value of each class by setting a class difference threshold value, judging whether the feature set obtained in the step 4 is a balanced data set, if so, skipping the step, and if not, performing sampling processing;

step 6: constructing a classification model aiming at the data set obtained in the step 4;

the method for constructing the feature set by respectively adopting different feature expressions for the features obtained in the step 2 comprises the following steps:

z = (x - μ)/σ, wherein x is the corresponding value after the dimension unification, μ is the mean, and σ is the standard deviation;

carrying out the feature extraction of the SC-PSSM-R algorithm on the data set after the standardization treatment, wherein the formula is as follows:

wherein

the above formula can be extended to the formula:

wherein

2. The method for protein sub-nuclear localization based on improved PSSM for feature extraction and fusion of claim 1, wherein: in step 1, a corresponding threshold is set for the acquired data set according to the length of each piece of data to carry out data screening, the threshold being a length greater than 50.

3. The method for protein sub-nuclear localization based on improved PSSM for feature extraction and fusion of claim 1, wherein: the PSSM matrix of each piece of data is calculated, each protein being denoted by P, where P = [P1, P2, ..., P20], Pj = [P1j, P2j, ..., PLj] (j = 1, 2, ..., 20), and L represents the length of each protein.

4. The method for protein sub-nuclear localization based on improved PSSM for feature extraction and fusion of claim 1, wherein: the selecting of the features by adopting the improved maximum information coefficient for the features acquired in step 3 comprises the following steps:

the first step: the obtained maximum information coefficients are sorted by score, the score distributions of different data sets are analyzed, different thresholds are set, and the corresponding features are selected;

the second step: the maximum information coefficient operation is performed again on the obtained features; unlike the first step, the resulting scores are used as the weights of the corresponding features to form a new feature set.

5. The method for protein sub-nuclear localization based on improved PSSM for feature extraction and fusion of claim 1, wherein: the method for constructing the classification model of the data set obtained in the step 4 comprises the following steps: