CN110689920B - Protein-ligand binding site prediction method based on deep learning - Google Patents

Protein-ligand binding site prediction method based on deep learning

Info

Publication number
CN110689920B
Authority
CN
China
Prior art keywords
protein
neural network
residue
residual
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910879922.7A
Other languages
Chinese (zh)
Other versions
CN110689920A (en)
Inventor
夏春秋
杨旸
沈红斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201910879922.7A
Publication of CN110689920A
Application granted
Publication of CN110689920B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30: Drug targeting using structural data; Docking or binding prediction
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a deep-learning-based algorithm for predicting protein-ligand binding sites. For a protein to be predicted, the sequence features and the distance matrix of the protein are first extracted, and the sequence features are assigned to each residue by a sliding window method. The features of each residue are then fed, one residue at a time, into residual neural networks and hybrid neural networks, whose outputs are passed to a logistic regression classifier; the final result is the binding probability of each residue in the protein. The invention fuses a classical bidirectional long short-term memory network (BiLSTM) with a residual neural network; the fused network can process heterogeneous protein sequence data and structural data simultaneously and exploits the complementarity between sequence features and structural features. Compared with existing methods, the method achieves higher prediction accuracy and generalizes well across data sets of different ligands.

Description

Protein-ligand binding site prediction method based on deep learning
Technical Field
The invention relates to the field of protein biology and pattern recognition, in particular to a protein-ligand binding site prediction method based on deep learning.
Background
Interactions between proteins and ligands play important roles in biological processes such as signal transduction, post-translational modification, and antigen-antibody interaction. Drug discovery and design also rely heavily on analysis of the mechanisms of protein-ligand interaction, and identifying the binding site is a critical step in exploring those mechanisms. With the emergence of protein design techniques, more and more new proteins are being created whose properties and functions have not yet been explored, so the need for fast, accurate binding site recognition tools has become more urgent. Current wet-lab methods for identifying protein binding sites are time-consuming and costly.
Protein-ligand interactions can be classified into protein-protein interactions, protein-DNA/RNA interactions, and protein-small molecule interactions, depending on the type of ligand. At this stage, there are many computational methods based on sequence information (protein primary structure) or structural information (protein tertiary structure) that can predict protein-ligand binding sites.
Sequence-based methods can make site predictions for proteins with unknown three-dimensional structures using some purely sequence-based features such as evolutionary information and predicted secondary structures. However, since the position of the binding site is mainly determined by the tertiary structure of the protein, the prediction accuracy of the sequence-based method is relatively low.
Structure-based methods all require the three-dimensional coordinates of every atom in the protein as input, but they rely on different criteria: POCKET assumes that the binding site is more likely to lie in a concave region of the protein surface, SITEHOUND uses an energy function to calculate the force field between the protein and the ligand, and TM-SITE is a template-based matching method.
Disclosure of Invention
The invention aims to provide a deep-learning-based protein-ligand binding site prediction method that addresses the low accuracy of prediction algorithms in the prior art.
For the application scenario of protein-ligand binding site recognition, the invention provides a more accurate prediction method by fusing deep learning with domain knowledge of protein structure, and also offers effective solutions to related problems such as data imbalance and the difficulty of registering three-dimensional structures.
The technical problem is solved by the following technical scheme:
A deep-learning-based protein-ligand binding site prediction method comprises the following steps:
step 1) first, extracting sequence features from a protein structure data set; then calculating the Euclidean distance between each pair of residues from the three-dimensional coordinates of the residues of the protein, and constructing a distance matrix; finally, extracting a feature tensor for each residue using a sliding window method;
step 2) taking each binding site as a positive sample and each non-binding site as a negative sample, drawing a subset of the negative samples by random down-sampling and combining it with all the positive samples to form a training subset, and repeating this several times to obtain multiple training subsets; randomly up-sampling the positive samples when constructing each mini-batch;
step 3) constructing a residual neural network from residual blocks, and training it on the distance matrices;
step 4) integrating the residual neural network and a bidirectional long short-term memory network (BiLSTM) through a fully connected layer to construct a hybrid neural network, and training it on the sequence features and the distance matrices;
step 5) training a logistic regression classifier on the outputs of the residual neural network and the hybrid neural network;
step 6) for a protein to be predicted, first extracting its sequence features and distance matrix, then assigning the sequence features to each residue by the sliding window method, then feeding the residues one at a time into the residual neural network and the hybrid neural network, and passing the outputs of both networks to the logistic regression classifier; the final result is the binding probability of each residue in the protein.
Further, the sequence features and the distance matrix in step 1) are extracted as follows:
step 1.1) for a protein of length L, its position-specific scoring matrix (PSSM) is obtained by the PSI-BLAST algorithm; the PSSM has size L × 20, where the element p_ij in row i and column j indicates the likelihood of the i-th residue mutating into the j-th amino acid, there being 20 amino acids in total;
each p_ij is then normalized as follows:
Figure GDA0003371415790000031
step 1.2) for a protein of length L, a scoring matrix HHM is obtained by the HHblits algorithm, where the HHM represents the evolutionary information of the protein sequence; the HHM has size L × 30, where the first 20 columns are the emission probabilities of the 20 amino acids, columns 21 to 27 are transition probabilities, and columns 28 to 30 are local diversities;
each element h_ij of the HHM is normalized as follows:
Figure GDA0003371415790000032
step 1.3) for a protein of length L, its secondary structure and relative solvent accessibility are predicted by the SCRATCH algorithm; the secondary structure is represented as an L × 3 matrix in which each row s_i is a one-hot vector indicating whether the i-th residue is in a helix, a strand, or another conformation; the solvent accessibility is represented as an L × 2 matrix in which each row r_i is a one-hot vector indicating whether the i-th residue is exposed or buried;
step 1.4) for a protein of length L, the binding propensity of each residue is predicted by the S-SITE algorithm, and the result is represented as an L × 2 matrix, where the elements q_i0 and q_i1 are the probabilities that the i-th residue is binding and non-binding, respectively, and q_i0 + q_i1 = 1;
step 1.5) for a protein of length L whose atomic coordinates in space are known, the Euclidean distance between the Cα atoms of the i-th and j-th residues is calculated and denoted d_ij; a distance matrix D = {d_ij}_{L×L} is constructed following the sequence order and is then scaled to size L × 400 by interpolation;
step 1.6) the feature matrices obtained in steps 1.1) to 1.4) are concatenated residue by residue into an L × 57 sequence feature matrix (20 + 30 + 3 + 2 + 2 = 57 columns), and a sliding window of size W is applied at each residue to obtain a sequence feature matrix of size W × 57; the distance matrix is likewise cut with a sliding window of size W to obtain a W × 400 distance matrix for each residue.
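The structural part of step 1 (distance matrix, rescaling to 400 columns, and per-residue windows) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the text does not say which interpolation is used or how windows are padded at the chain ends, so linear interpolation and zero padding are assumptions, and the function names and random stand-in data are hypothetical.

```python
import math
import random

def distance_matrix(ca_coords):
    """Pairwise Euclidean distances d_ij between C-alpha atoms (L x L)."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

def rescale_row(row, n_cols=400):
    """Resample one row to n_cols values by linear interpolation (assumed)."""
    n = len(row)
    if n == 1:
        return [row[0]] * n_cols
    out = []
    for k in range(n_cols):
        pos = k * (n - 1) / (n_cols - 1)   # fractional index into the original row
        lo = min(int(pos), n - 2)
        frac = pos - lo
        out.append(row[lo] * (1.0 - frac) + row[lo + 1] * frac)
    return out

def sliding_windows(rows, window):
    """One block of `window` rows per residue, centred on it, zero-padded (assumed)."""
    half = window // 2
    zero = [0.0] * len(rows[0])
    padded = [zero] * half + rows + [zero] * half
    return [padded[i:i + window] for i in range(len(rows))]

rng = random.Random(0)
L, W = 50, 37
coords = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(L)]    # stand-in C-alpha coordinates
seq_feats = [[rng.random() for _ in range(57)] for _ in range(L)]   # stand-in L x 57 sequence features

D = [rescale_row(r) for r in distance_matrix(coords)]   # L x 400
seq_windows = sliding_windows(seq_feats, W)             # L blocks of W x 57
dist_windows = sliding_windows(D, W)                    # L blocks of W x 400
```

With an odd W, row W // 2 of each window is the residue being classified, and the rows around it are its sequence neighbours.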
Further, the random down-sampling in step 2) and the up-sampling within each mini-batch satisfy the following conditions:
1) in random down-sampling, each negative sample is selected from the original data set with a probability of 20%, and the selected negative samples are combined with all the positive samples to form a training subset; N_set training subsets are obtained in the same manner;
2) in up-sampling within the mini-batch, N_p positive samples and N_n negative samples are drawn cyclically from the set of all positive samples and the set of all negative samples, respectively, where N_p is obtained by the following formula:
N_p = [0.3 × N_b]
where N_b is the size of the mini-batch, [·] denotes rounding, and N_n = N_b − N_p.
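The two sampling rules can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, Python's built-in `round` stands in for the unspecified rounding operator, and one pass is taken to end once every negative sample has been visited. Samples are represented by integer IDs only.

```python
import itertools
import math
import random

def make_training_subsets(positives, negatives, n_sets, rate=0.2, seed=0):
    """Build n_sets subsets: all positives plus each negative kept with probability `rate`."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_sets):
        kept = [s for s in negatives if rng.random() < rate]
        subsets.append((list(positives), kept))
    return subsets

def mini_batches(positives, negatives, batch_size=32, pos_fraction=0.3, seed=0):
    """Yield batches holding N_p = round(0.3 * N_b) positives and N_b - N_p negatives.

    Positives are drawn cyclically, so they are reused (up-sampled) within a pass.
    """
    rng = random.Random(seed)
    n_pos = round(pos_fraction * batch_size)   # rounding rule is an assumption
    n_neg = batch_size - n_pos
    pos_cycle = itertools.cycle(positives)
    neg_cycle = itertools.cycle(negatives)
    for _ in range(math.ceil(len(negatives) / n_neg)):
        batch = [(x, 1) for x in itertools.islice(pos_cycle, n_pos)]
        batch += [(x, 0) for x in itertools.islice(neg_cycle, n_neg)]
        rng.shuffle(batch)
        yield batch

positives = list(range(100))           # hypothetical sample IDs
negatives = list(range(100, 5100))
subsets = make_training_subsets(positives, negatives, n_sets=13)
pos, neg = subsets[0]
batches = list(mini_batches(pos, neg))
```

For a mini-batch of 32 this gives 10 positives and 22 negatives per batch, which is close to a 3:7 ratio.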
Further, the residual block is defined and the residual neural network is constructed as follows:
in a neural network, a convolutional layer can be represented as Conv(X, W, H, D), where X is the input, W and H are the width and height of the convolution kernels, respectively, and D is the number of convolution kernels; the residual block is formed by stacking three convolutional layers, as shown in the following formula:
Res(X)=σ(Conv(σ(Conv(σ(Conv(X,1,1,D)),3,3,D)),1,1,4×D)+X)
where σ is an activation function; the residual neural network is formed by stacking several residual blocks and is optimized by the Adam algorithm, and its input is the distance matrix of each residue;
on the N_set subsets, N_res independent residual neural networks can be trained for the residues of the protein, where N_res ≤ N_set.
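The residual block formula can be traced step by step in a small pure-Python sketch. It is an illustration only: σ is assumed to be ReLU (the text does not name the activation), biases and normalization layers are omitted, convolutions use stride 1 with zero padding, and the helper names and random weights are hypothetical. The identity term requires the input to have 4 × D channels.

```python
import random

def relu(x):
    """Element-wise ReLU on a [channel][row][col] nested list."""
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]

def conv2d(x, kernels):
    """Stride-1, zero-padded convolution.

    x:       [c_in][h][w]
    kernels: [c_out][c_in][k][k] with odd k
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0][0])
    pad = k // 2
    out = []
    for kern in kernels:
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for c in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < h and 0 <= jj < w:
                                acc += kern[c][di][dj] * x[c][ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out

def random_kernels(c_out, c_in, k, rng):
    """Small random weights, for demonstration only."""
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]

def bottleneck_block(x, w1, w2, w3):
    """Res(X) = relu(Conv1x1(relu(Conv3x3(relu(Conv1x1(X))))) + X)."""
    y = relu(conv2d(x, w1))    # 1x1 convolution, 4D -> D channels
    y = relu(conv2d(y, w2))    # 3x3 convolution, D -> D channels
    y = conv2d(y, w3)          # 1x1 convolution, D -> 4D channels
    summed = [[[a + b for a, b in zip(ry, rx)] for ry, rx in zip(cy, cx)]
              for cy, cx in zip(y, x)]
    return relu(summed)        # identity shortcut, then activation

rng = random.Random(0)
D = 2
x = [[[rng.uniform(-1, 1) for _ in range(6)] for _ in range(5)] for _ in range(4 * D)]
w1 = random_kernels(D, 4 * D, 1, rng)
w2 = random_kernels(D, D, 3, rng)
w3 = random_kernels(4 * D, D, 1, rng)
out = bottleneck_block(x, w1, w2, w3)
```

The 1 × 1 layers shrink and restore the channel count so that the 3 × 3 layer works on only D channels. A practical implementation would use a deep learning framework rather than nested lists.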
Further, the hybrid neural network in step 4) integrates a residual neural network and the BiLSTM, and is optimized by the Adam algorithm; the input to the BiLSTM is the sequence features of each residue;
on the N_set subsets, N_hybrid independent hybrid neural networks can be trained for the residues of the protein, where N_hybrid = N_set − N_res.
Further, in step 5), the outputs of the N_res residual neural networks and the N_hybrid hybrid neural networks for each residue are concatenated into a vector of length N_set; with this vector as input, a logistic regression classifier is trained by cross-validation; an ℓ1 regularization term is added to the loss function of the logistic regression classifier to prevent overfitting.
Further, in step 6), for a protein to be predicted that has length L and known Cα spatial coordinates, the sequence features and the distance matrix are first extracted; the sequence features are then assigned to each residue by a sliding window of size W; the features of each residue are fed one residue at a time into the residual neural networks and hybrid neural networks, whose outputs are passed to the logistic regression classifier; the final result is the binding probability of each residue in the protein.
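The stacking step can be sketched as an ℓ1-regularized logistic regression over the per-network outputs. The optimizer is not specified in the text, so proximal gradient descent (soft-thresholding) is used here as one standard choice. The cross-validation loop is omitted, and the synthetic scores, hyperparameters, and function names are all hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of t * |v|; it is what drives small weights to zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1_logistic(X, y, lam=0.01, lr=0.5, epochs=300):
    """Minimise mean log-loss + lam * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b    # the bias is not regularised
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def predict_proba(X, w, b):
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

# Synthetic stand-in for the outputs of N_set = 13 networks on 200 residues.
rng = random.Random(0)
n_networks = 13
labels = [i % 2 for i in range(200)]
scores = [[min(1.0, max(0.0, 0.5 + 0.3 * (2 * y - 1) + rng.gauss(0, 0.1)))
           for _ in range(n_networks)] for y in labels]

w, b = fit_l1_logistic(scores, labels)
probs = predict_proba(scores, w, b)
```

Each row of `scores` plays the role of the length-N_set vector for one residue, and `probs` holds the fused binding probabilities.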
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a novel hybrid neural network that fuses a classical bidirectional long short-term memory network with a residual neural network; the fused network can process heterogeneous protein sequence data and structural data simultaneously and exploits the complementarity between sequence features and structural features.
2. The invention uses random down-sampling combined with an ensemble method to address the imbalance between positive and negative samples, and up-samples positive samples batch by batch to further reduce the effect of this imbalance when data are fed to the neural network as mini-batches.
3. Compared with existing methods, the method achieves higher prediction accuracy and generalizes well across data sets of different ligands.
Drawings
FIG. 1 is a flow chart of the deep learning-based protein-ligand binding site prediction method of the present invention.
FIG. 2 is an architecture diagram of the hybrid neural network of the present invention, comprising (a) the sequence feature and distance matrix extraction module, (b) the residual network module, and (c) the bidirectional long short-term memory network module.
FIG. 3 is a schematic diagram of the random sampling and ensemble method of the present invention.
FIG. 4 is a schematic diagram of an implementation of a residual block in the residual neural network of the present invention.
Detailed Description
In order to make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
Referring to FIG. 1, the deep-learning-based protein-ligand binding site prediction method of the present invention comprises the following steps:
step 1) for a given protein structure data set, first, the evolutionary information, secondary structure, relative solvent accessibility, and binding probability are extracted using the PSI-BLAST, HHblits, SCRATCH, and S-SITE algorithms, and the evolutionary information is normalized; second, the Euclidean distance between each pair of residues is calculated from the three-dimensional coordinates of the residues of the protein, and a distance matrix is constructed; finally, a feature tensor is extracted for each residue using a sliding window strategy;
step 2) each binding site is taken as a positive sample and each non-binding site as a negative sample; a subset of the negative samples is drawn by random down-sampling and combined with all the positive samples to form a training subset, and this is repeated several times to obtain multiple training subsets; the positive samples are then randomly up-sampled when constructing each mini-batch;
step 3) a residual neural network (ResNet) is constructed from residual blocks and trained on the distance matrices obtained in step 1);
step 4) the residual network of step 3) is integrated with a bidirectional long short-term memory network (BiLSTM) through a fully connected layer to construct a hybrid neural network, which is trained on the sequence features and the distance matrices obtained in step 1);
step 5) a logistic regression classifier is trained on the outputs of the residual neural network of step 3) and the hybrid network of step 4);
step 6) for a protein to be predicted, its sequence features and distance matrix are first extracted, the features are assigned to each residue by the sliding window method and fed one residue at a time into the residual network and the hybrid neural network, and the outputs are passed to the logistic regression classifier; the final result is the binding probability of each residue in the protein.
The specific process of step 1) is as follows:
step 1.1) for a protein of length L, its position-specific scoring matrix (PSSM) is obtained by the PSI-BLAST algorithm; the PSSM has size L × 20, where the element p_ij in row i and column j indicates the likelihood of the i-th residue mutating into the j-th amino acid, there being 20 amino acids in total;
each p_ij is then normalized as follows:
Figure GDA0003371415790000081
step 1.2) for a protein of length L, a scoring matrix HHM is obtained by the HHblits algorithm, where the HHM represents the evolutionary information of the protein sequence; the HHM has size L × 30, where the first 20 columns are the emission probabilities of the 20 amino acids, columns 21 to 27 are transition probabilities, and columns 28 to 30 are local diversities;
each element h_ij of the HHM is normalized as follows:
Figure GDA0003371415790000082
step 1.3) for a protein of length L, its secondary structure and relative solvent accessibility are predicted by the SCRATCH algorithm; the secondary structure is represented as an L × 3 matrix in which each row s_i is a one-hot vector indicating whether the i-th residue is in a helix, a strand, or another conformation; the solvent accessibility is represented as an L × 2 matrix in which each row r_i is a one-hot vector indicating whether the i-th residue is exposed or buried;
step 1.4) for a protein of length L, the binding propensity of each residue is predicted by the S-SITE algorithm, and the result is represented as an L × 2 matrix, where the elements q_i0 and q_i1 are the probabilities that the i-th residue is binding and non-binding, respectively, and q_i0 + q_i1 = 1;
step 1.5) for a protein of length L whose atomic coordinates in space are known, the Euclidean distance between the Cα atoms of the i-th and j-th residues is calculated and denoted d_ij; a distance matrix D = {d_ij}_{L×L} is constructed following the sequence order and is then scaled to size L × 400 by interpolation;
step 1.6) the feature matrices obtained in steps 1.1) to 1.4) are concatenated residue by residue into an L × 57 sequence feature matrix, and a sliding window of size W is applied at each residue to obtain a sequence feature matrix of size W × 57; as shown in part (a) of FIG. 2, the features are divided into two groups that are fed to two BiLSTMs: one group contains only PSSM, SS (secondary structure predicted by SCRATCH), RSA (relative solvent accessibility predicted by SCRATCH), and SST (binding propensity predicted by S-SITE), and the other contains only HHM, SS, RSA, and SST. The distance matrix is cut with a sliding window of the same size W to obtain a W × 400 distance matrix for each residue.
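The split into two BiLSTM inputs amounts to column slicing of each W × 57 window. The sketch below assumes the columns are ordered PSSM, HHM, SS, RSA, SST, following the order of steps 1.1) to 1.4); that ordering, the slice constants, and the function name are assumptions made for illustration.

```python
# Assumed column layout of the 57 sequence features.
PSSM = slice(0, 20)    # 20 columns
HHM = slice(20, 50)    # 30 columns
SS = slice(50, 53)     # 3 columns
RSA = slice(53, 55)    # 2 columns
SST = slice(55, 57)    # 2 columns

def split_feature_groups(window):
    """Split a W x 57 window into the two BiLSTM inputs (W x 27 and W x 37)."""
    group_pssm, group_hhm = [], []
    for row in window:
        shared = row[SS] + row[RSA] + row[SST]   # features present in both groups
        group_pssm.append(row[PSSM] + shared)    # PSSM + SS + RSA + SST
        group_hhm.append(row[HHM] + shared)      # HHM + SS + RSA + SST
    return group_pssm, group_hhm

# Dummy window whose values equal their column index, to make the split visible.
window = [[float(c) for c in range(57)] for _ in range(37)]
group_pssm, group_hhm = split_feature_groups(window)
```

Under this layout the first group has 20 + 3 + 2 + 2 = 27 columns and the second has 30 + 3 + 2 + 2 = 37 columns.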
The random down-sampling in step 2) and the up-sampling within each mini-batch are shown in FIG. 3 and satisfy the following conditions:
1) in random down-sampling, each negative sample is selected from the original data set with a probability of 20%, and the selected negative samples are combined with all the positive samples to form a training subset; N_set training subsets are obtained in the same manner;
2) in up-sampling within the mini-batch, N_p positive samples and N_n negative samples are drawn cyclically from the set of all positive samples and the set of all negative samples, respectively, where N_p is obtained by the following formula:
N_p = [0.3 × N_b]
where N_b is the size of the mini-batch, [·] denotes rounding, and N_n = N_b − N_p.
Further, in step 3), the residual block is defined and the residual network is constructed as follows:
as shown in FIG. 4, a residual block generally consists of several convolutional layers and an identity mapping, with an activation function providing the nonlinearity between convolutional layers. The left side of FIG. 4 shows a basic residual block and the right side shows a bottleneck residual block, whose advantage is that it reduces the number of parameters while preserving performance. The present invention employs the bottleneck residual block, which is described as follows:
Res(X)=σ(Conv(σ(Conv(σ(Conv(X,1,1,D)),3,3,D)),1,1,4×D)+X)
where σ is an activation function, Conv(X, W, H, D) is a convolution function, X is the input, W and H are the width and height of the convolution kernel, respectively, and D is the number of convolution kernels;
in the invention, the residual network is formed by stacking several residual blocks, as shown in FIG. 2(b), and is optimized by the Adam algorithm; the input of the network is the distance matrix of each residue. The specific network architecture is summarized in Table 1.
Figure GDA0003371415790000101
a: the settings of each convolutional layer give, in order, the kernel size, the number of kernels, and the stride;
b: the stride of the bottleneck residual blocks is 1.
Table 1. Architecture of the residual neural network module
On the N_set subsets, N_res independent residual neural networks can be trained for the residues of the protein, where N_res ≤ N_set.
Further, in step 4), the hybrid neural network integrates the residual network of step 3) and the BiLSTM through a fully connected layer and is optimized by the Adam algorithm; its overall architecture is shown in FIG. 2. As described in step 1.6), the inputs of the two BiLSTMs are the two groups of sequence features, respectively.
On the N_set subsets, N_hybrid independent hybrid networks can be trained for the residues of the protein, where N_hybrid = N_set − N_res.
In step 5), the outputs of the N_res residual networks and the N_hybrid hybrid networks for each residue are concatenated into a vector of length N_set; with this vector as input, a logistic regression classifier is trained by cross-validation, as shown in FIG. 3; an ℓ1 regularization term is added to the loss function of the logistic regression classifier to prevent overfitting.
In step 6), for a protein to be predicted that has length L and known Cα spatial coordinates, the sequence features and the distance matrix are first extracted; the sequence features are then assigned to each residue by a sliding window of size W; the residues are fed one at a time into the residual neural networks and hybrid neural networks, whose outputs are passed to the logistic regression classifier; the final result is the binding probability of each residue in the protein.
The binding probability is then compared with an optimal threshold T ∈ (0, 1) learned on the training set: if the binding probability is greater than T, the residue is considered a binding site; otherwise, it is considered a non-binding site.
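One way to learn the threshold T is a grid search on training-set predictions. The sketch below scores each candidate by MCC, in line with the remark in the experimental results that the threshold is chosen to maximize MCC. The grid resolution, function names, and toy probabilities are assumptions for illustration.

```python
import math

def classify(probs, threshold):
    """A residue is predicted as binding when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]

def mcc(labels, preds):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(probs, labels, steps=200):
    """Scan T over an evenly spaced grid in (0, 1) and keep the MCC-maximising value."""
    candidates = [k / steps for k in range(1, steps)]
    return max(candidates, key=lambda t: mcc(labels, classify(probs, t)))

# Toy training-set predictions (hypothetical values).
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
train_labels = [0, 0, 0, 0, 1, 0, 1, 1]

T = best_threshold(train_probs, train_labels)
predicted = classify(train_probs, T)
```

When several thresholds give the same MCC, `max` returns the smallest of them; another tie-breaking rule would be equally valid.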
Examples
A data set of protein-Mn2+ binding sites is used as the training set and the test set. The training set contains 440 proteins with 1931 binding residues and 150229 non-binding residues; the test set contains 144 proteins with 612 binding residues and 50838 non-binding residues.
First, the evolutionary information, secondary structure, relative solvent accessibility, and binding probability of all proteins in the training set and the test set are extracted using the PSI-BLAST, HHblits, SCRATCH, and S-SITE algorithms, and the evolutionary information (PSSM and HHM) is normalized; second, the Euclidean distance between each pair of residues is calculated from the three-dimensional coordinates of all residues of the proteins in the training set and the test set, a distance matrix is constructed, and its number of columns is scaled to 400; finally, a feature tensor is extracted for each residue using a sliding window of size 37, so that every residue has a sequence feature matrix of size 37 × 57 and a distance matrix of size 37 × 400.
Because the data are extremely imbalanced, with far fewer binding sites (positive samples) than non-binding sites (negative samples), the negative samples are randomly down-sampled at a sampling rate of 20%. The sampled negative samples are then combined with all the positive samples to form a training subset, and the process is repeated until 13 training subsets are obtained. For each training subset, either a hybrid neural network or a residual neural network can be trained. In this example, 10 hybrid networks and 3 residual networks are trained in total.
The data in each training subset are fed into an independent hybrid neural network or residual neural network in mini-batches of size 32, with the ratio of positive to negative samples kept at 3:7. The network parameters are then optimized by the Adam algorithm until performance on the validation set no longer improves.
The outputs of all 13 networks are concatenated into a vector of length 13, and a logistic regression classifier is trained by cross-validation. At this point all the models of the invention have been trained.
The test data are then featurized in the same way and fed into the networks; because the binding sites of the test data are unknown, the network weights are not updated by the optimization algorithm. Finally, the outputs of the networks are fed into the trained logistic regression classifier to obtain the binding probability of each residue, and the residues are classified according to a predetermined threshold, which is 0.345 in this example.
The evaluation indexes adopted by the invention are as follows:
REC=TP/(TP+FN)
PRE=TP/(TP+FP)
MCC=(TP×TN−FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN))
where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.
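The metrics can be computed directly from the four confusion counts, as in the sketch below. MCC is taken to be the third metric because it is the one reported in the results table, and the standard definition of the Matthews correlation coefficient is assumed. The counts in the example are hypothetical and are not taken from the experiments.

```python
import math

def recall(tp, fn):
    """REC = TP / (TP + FN); assumes at least one true binding residue."""
    return tp / (tp + fn)

def precision(tp, fp):
    """PRE = TP / (TP + FP); assumes at least one predicted binding residue."""
    return tp / (tp + fp)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion counts for an imbalanced test set of 1000 residues.
tp, fp, tn, fn = 50, 25, 875, 50

rec = recall(tp, fn)
pre = precision(tp, fp)
score = mcc(tp, fp, tn, fn)
print(f"REC={rec:.3f} PRE={pre:.3f} MCC={score:.3f}")
```

Because MCC uses all four counts, it is less sensitive than accuracy to the large excess of non-binding residues, which is presumably why it serves as the composite metric here.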
The predicted results of the experiment are as follows:
In the experimental phase, the invention was compared with other representative protein-ligand binding site prediction methods; the results are shown in the table below. The invention achieves the best result on the composite metric MCC, an improvement of 4.9 percent over the second-best method, IonCom. Although the method scores somewhat lower on REC because a higher threshold is selected to maximize MCC, it is clearly better than the other existing methods overall.
Method                             REC      PRE      MCC
COACH                              0.562    0.272    0.381
IonCom                             0.531    0.495    0.506
TargetS                            0.395    0.499    0.438
Method of the present invention    0.513    0.632    0.565
The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A protein-ligand binding site prediction method based on deep learning, characterized by comprising the following steps:
step 1) first extracting sequence features from a protein structure data set, then calculating the Euclidean distance between each pair of residues from the three-dimensional spatial coordinates of each residue of the protein and constructing a distance matrix, and finally extracting a feature tensor for each residue by a sliding window method;
step 2) taking each binding site as a positive sample and each non-binding site as a negative sample, extracting a subset of the negative samples by random down-sampling and combining it with all the positive samples to construct a training subset, and repeating this multiple times to obtain multiple training subsets; randomly up-sampling the positive samples when constructing each mini-batch;
step 3) constructing a residual neural network from residual blocks, and training it on the distance matrix;
step 4) integrating the residual neural network and a bidirectional long short-term memory (BiLSTM) network through a fully connected layer to construct a hybrid neural network, and training it on the sequence features and the distance matrix;
step 5) training a Logistic regression classifier on the outputs of the residual neural network and the hybrid neural network;
step 6) for a protein to be predicted, first extracting its sequence features, then assigning the sequence features to each residue by the sliding window method, then inputting the features corresponding to the residues one by one into the residual neural network and the hybrid neural network, and inputting the outputs of the residual neural network and the hybrid neural network into the Logistic regression classifier, the final result being the binding probability of each residue in the protein.
2. The deep learning-based protein-ligand binding site prediction method according to claim 1, wherein the sequence features and the distance matrix in step 1) are extracted as follows:
step 1.1) for a protein of length L, obtaining its position-specific scoring matrix (PSSM) through the PSI-BLAST algorithm; the PSSM has a size of L × 20, wherein the element p_ij in the i-th row and j-th column indicates the likelihood of the i-th residue mutating into the j-th amino acid, there being 20 amino acids in total;
each p_ij is then normalized as follows:
Figure FDA0003371415780000021
step 1.2) for the protein of length L, obtaining a scoring matrix HHM through the HHblits algorithm, wherein the HHM encodes the evolutionary information of the protein sequence; the HHM has a size of L × 30, wherein the first 20 columns are the emission probabilities of the 20 amino acids, columns 21 to 27 are transition probabilities, and columns 28 to 30 are local diversities;
each element h_ij of the HHM is normalized as follows:
Figure FDA0003371415780000022
step 1.3) predicting the secondary structure information and relative solvent accessibility of the protein of length L with the SCRATCH algorithm; the secondary structure information is represented as an L × 3 matrix in which each row s_i is a one-hot vector indicating whether the secondary structure of the i-th residue is helix, strand or other; the solvent accessibility is represented as an L × 2 matrix in which each row r_i is a one-hot vector indicating whether the i-th residue is exposed or buried;
step 1.4) for the protein of length L, predicting the binding propensity of each residue with the S-SITE algorithm and expressing the result as an L × 2 matrix, in which the elements q_i0 and q_i1 represent the probability that the i-th residue binds and the probability that it does not bind, respectively, and the sum of q_i0 and q_i1 is 1;
step 1.5) for the protein of length L, if the spatial coordinates of each atom are known, calculating the Euclidean distance between the Cα atoms of the i-th and j-th residues, denoted d_ij; constructing a distance matrix D = {d_ij}_{L×L} in sequence order, and then scaling it, as an image, to a size of L × 400 by interpolation;
step 1.6) concatenating the feature matrices obtained in steps 1.1) to 1.4) row by row into an L × 57 sequence feature matrix, and applying a sliding window of size W to each residue to obtain a feature matrix of size W × 57 for that residue; likewise applying a sliding window of size W to the distance matrix to obtain a distance matrix of size W × 400 for each residue.
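Steps 1.5) and 1.6) can be illustrated with a short sketch. Several details are assumptions that the claim does not state: linear interpolation along each row as the scaling scheme, a window of odd size W centred on the residue, and zero padding at the ends of the chain. All function names are hypothetical.

```python
import math

def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between C-alpha atoms (L x L)."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

def resize_row(row, width):
    """Linearly interpolate one row to `width` samples (assumed scheme, width > 1)."""
    n = len(row)
    if n == 1:
        return [row[0]] * width
    out = []
    for k in range(width):
        pos = k * (n - 1) / (width - 1)      # fractional source index
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(row[lo] * (1 - frac) + row[hi] * frac)
    return out

def window(rows, center, w):
    """Rows in a window of odd size w centred on `center`, zero-padded at the ends."""
    half = w // 2
    width = len(rows[0])
    return [rows[i] if 0 <= i < len(rows) else [0.0] * width
            for i in range(center - half, center + half + 1)]

# Example: three residues on a line, window size W = 3.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
scaled = [resize_row(r, 400) for r in distance_matrix(coords)]   # L x 400
first_residue_input = window(scaled, 0, 3)                        # W x 400
```

The same `window` helper applies unchanged to the L × 57 sequence feature matrix, giving a W × 57 matrix per residue.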
3. The deep learning-based protein-ligand binding site prediction method according to claim 1 or 2, wherein the random down-sampling in step 2) and the up-sampling in the mini-batch satisfy the following conditions:
1) in random down-sampling, each negative sample is randomly selected from the original data set with a probability of 20%, and the selected negative samples and all positive samples are combined into a training subset; N_set training subsets are obtained in the same manner;
2) in up-sampling in the mini-batch, N_p positive samples and N_n negative samples are cyclically selected from the set of all positive samples and the set of all negative samples, respectively, N_p being obtained according to the following formula:
N_p=[0.3×N_b]
where N_b is the size of the mini-batch, [·] denotes rounding, and N_n=N_b-N_p.
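The sampling scheme of this claim can be sketched as below. The function names and the seed are hypothetical, rounding half up is an assumed reading of the rounding symbol, and "cyclically selected" is interpreted here as walking through each sample set in order and wrapping around.

```python
import itertools
import math
import random

def make_training_subsets(positives, negatives, n_set, keep_prob=0.2, seed=0):
    """Build n_set subsets: all positives plus each negative kept with keep_prob."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_set):
        kept = [s for s in negatives if rng.random() < keep_prob]
        subsets.append(list(positives) + kept)
    return subsets

def mini_batches(positives, negatives, batch_size, n_batches, pos_fraction=0.3):
    """Yield mini-batches with N_p = [0.3 x N_b] positives and N_b - N_p negatives."""
    n_pos = math.floor(pos_fraction * batch_size + 0.5)   # assumed: round half up
    n_neg = batch_size - n_pos
    pos_cycle = itertools.cycle(positives)   # wraps around, so positives repeat
    neg_cycle = itertools.cycle(negatives)
    for _ in range(n_batches):
        yield (list(itertools.islice(pos_cycle, n_pos))
               + list(itertools.islice(neg_cycle, n_neg)))
```

Because the positive set is usually much smaller than the negative set, cycling through it repeats positive samples across mini-batches, which is what makes this an up-sampling of the positive class.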
4. The deep learning-based protein-ligand binding site prediction method according to claim 3, wherein the residual block is defined and the residual neural network is constructed as follows:
in a neural network, a convolutional layer can be represented as Conv(X, W, H, D), where X is the input variable, W and H are the width and height of the convolution kernels, respectively, and D is the number of convolution kernels; the residual block is formed by stacking three convolutional layers, as shown in the following formula:
Res(X)=σ(Conv(σ(Conv(σ(Conv(X,1,1,D)),3,3,D)),1,1,4×D)+X)
where σ is an activation function; the residual neural network is formed by stacking a plurality of residual blocks and is optimized by the Adam algorithm, and its input is the distance matrix of each residue;
on the N_set training subsets, N_res independent residual neural networks can be trained for each residue in the protein, where N_res≤N_set.
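To make the residual block formula concrete, here is a deliberately simple pure-Python sketch. It assumes ReLU for the activation σ (the claim only says "an activation function"), zero "same" padding, and explicitly passed kernels and biases with hypothetical names; it is meant for reading, not for training. Note that the last convolution produces 4×D channels, so the addition with X is only defined when X itself has 4×D channels.

```python
def conv(x, kernel, bias):
    """'Same'-padded 2-D convolution.

    x: H x W x C_in nested lists; kernel: kh x kw x C_in x C_out; bias: C_out.
    """
    h, w, c_in = len(x), len(x[0]), len(x[0][0])
    kh, kw, c_out = len(kernel), len(kernel[0]), len(bias)
    out = [[list(bias) for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(kh):
                for dj in range(kw):
                    ii, jj = i + di - kh // 2, j + dj - kw // 2
                    if 0 <= ii < h and 0 <= jj < w:   # zero padding outside the map
                        for c in range(c_in):
                            for o in range(c_out):
                                out[i][j][o] += x[ii][jj][c] * kernel[di][dj][c][o]
    return out

def relu(x):
    """Assumed activation function (sigma in the formula)."""
    return [[[max(0.0, v) for v in px] for px in row] for row in x]

def residual_block(x, k1, b1, k2, b2, k3, b3):
    """Res(X) = s(Conv(s(Conv(s(Conv(X,1,1,D)),3,3,D)),1,1,4D) + X)."""
    y = relu(conv(x, k1, b1))    # 1x1 kernels, D output channels
    y = relu(conv(y, k2, b2))    # 3x3 kernels, D output channels
    y = conv(y, k3, b3)          # 1x1 kernels, 4*D output channels
    summed = [[[a + b for a, b in zip(py, px)]
               for py, px in zip(ry, rx)] for ry, rx in zip(y, x)]
    return relu(summed)          # skip connection, then activation
```

In practice a deep learning framework would supply the convolution; the nested loops here only spell out the order of operations in the formula.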
5. The deep learning-based protein-ligand binding site prediction method according to claim 4, wherein the hybrid neural network in step 4) integrates a residual neural network and a BiLSTM and is optimized by the Adam algorithm; the input to the BiLSTM is the sequence features of each residue;
on the N_set training subsets, N_hybrid independent hybrid neural networks can be trained for each residue in the protein, where N_hybrid=N_set-N_res.
6. The deep learning-based protein-ligand binding site prediction method according to claim 5, wherein in step 5), for each residue, the outputs of the N_res residual neural networks and the N_hybrid hybrid neural networks are concatenated into a vector of length N_set; this vector is taken as input, and a Logistic regression classifier is trained by cross-validation; an l1 regularization term is added to the loss function of the Logistic regression classifier to prevent overfitting.
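The objective that such a classifier would minimize can be sketched as follows. This shows only the stacking of network outputs and the l1-regularized loss, not the cross-validation or the optimizer; the function names and the regularization strength `lam` are hypothetical, and the bias is left unpenalized as an assumed convention.

```python
import math

def stack_outputs(res_outputs, hybrid_outputs):
    """Concatenate N_res and N_hybrid network outputs into one length-N_set vector."""
    return list(res_outputs) + list(hybrid_outputs)

def l1_logistic_loss(features, labels, weights, bias, lam):
    """Mean logistic loss plus lam * ||weights||_1 (bias is not penalized)."""
    total = 0.0
    for x, y in zip(features, labels):
        z = bias + sum(w * v for w, v in zip(weights, x))
        margin = z if y == 1 else -z
        # log(1 + exp(-margin)), written in a numerically stable form
        total += max(0.0, -margin) + math.log1p(math.exp(-abs(margin)))
    return total / len(features) + lam * sum(abs(w) for w in weights)
```

The l1 term tends to drive the weights of redundant networks to exactly zero, so the classifier effectively selects a subset of the N_set networks.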
7. The deep learning-based protein-ligand binding site prediction method according to claim 6, wherein in step 6), for a protein to be predicted that has length L and known Cα spatial coordinates, its sequence features are first extracted and then assigned to each residue by a sliding window method with window size W; the features corresponding to the residues are then input one by one into the plurality of residual neural networks and hybrid neural networks, the outputs of the residual neural networks and the hybrid neural networks are input into the Logistic regression classifier, and the final result is the binding probability of each residue in the protein.
CN201910879922.7A 2019-09-18 2019-09-18 Protein-ligand binding site prediction method based on deep learning Active CN110689920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910879922.7A CN110689920B (en) 2019-09-18 2019-09-18 Protein-ligand binding site prediction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910879922.7A CN110689920B (en) 2019-09-18 2019-09-18 Protein-ligand binding site prediction method based on deep learning

Publications (2)

Publication Number Publication Date
CN110689920A CN110689920A (en) 2020-01-14
CN110689920B true CN110689920B (en) 2022-02-11

Family

ID=69109310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910879922.7A Active CN110689920B (en) 2019-09-18 2019-09-18 Protein-ligand binding site prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN110689920B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785321B (en) * 2020-06-12 2022-04-05 浙江工业大学 DNA binding residue prediction method based on deep convolutional neural network
CN112085245B (en) * 2020-07-21 2024-06-18 浙江工业大学 Protein residue contact prediction method based on depth residual neural network
CN112085247B (en) * 2020-07-22 2024-06-21 浙江工业大学 Protein residue contact prediction method based on deep learning
CN112216345B (en) * 2020-09-27 2021-12-17 浙江工业大学 Protein solvent accessibility prediction method based on iterative search strategy
CN112185458B (en) * 2020-10-23 2024-04-26 深圳晶泰科技有限公司 Method for predicting binding free energy of protein and ligand molecule based on convolutional neural network
CN112397139B (en) * 2020-11-13 2023-08-01 中山大学 Deep learning method for predicting binding site on antibody through sequence
CN112382338B (en) * 2020-11-16 2022-09-06 南京理工大学 DNA-protein binding site prediction method based on self-attention residual error network
CN112464804B (en) * 2020-11-26 2022-05-24 北京航空航天大学 Peptide fragment signal matching method based on neural network framework
CN112562790A (en) * 2020-12-09 2021-03-26 中国石油大学(华东) Traditional Chinese medicine molecule recommendation system, computer equipment and storage medium for regulating and controlling disease target based on deep learning
CN112599186B (en) * 2020-12-30 2022-09-27 兰州大学 Compound target protein binding prediction method based on multi-deep learning model consensus
CN112837740B (en) * 2021-01-21 2024-03-26 浙江工业大学 DNA binding residue prediction method based on structural characteristics
CN112837742B (en) * 2021-01-22 2024-03-26 浙江工业大学 Protein-protein interaction prediction method based on circulation network
CN112837741B (en) * 2021-01-25 2024-04-16 浙江工业大学 Protein secondary structure prediction method based on cyclic neural network
CN113192559B (en) * 2021-05-08 2023-09-26 中山大学 Protein-protein interaction site prediction method based on deep graph convolution network
CN113313167B (en) * 2021-05-28 2022-05-31 湖南工业大学 Method for predicting lncRNA-protein interaction based on deep learning dual neural network structure
CN113539354B (en) * 2021-07-19 2023-10-27 浙江理工大学 Method for efficiently predicting type III and type IV effector proteins of gram-negative bacteria
CN113643756B (en) * 2021-08-09 2024-08-16 安徽工业大学 Protein interaction site prediction method based on deep learning
CN113593631B (en) * 2021-08-09 2022-11-29 山东大学 Method and system for predicting protein-polypeptide binding site
CN113707213B (en) * 2021-09-08 2024-03-08 上海交通大学 Protein structure rapid classification method based on contrast graph neural network
TWI804229B (en) * 2021-09-27 2023-06-01 美商圖策智能科技有限公司 Method and system for estimating protein binding free energy based on protein mutation prediction
CN114300035A (en) * 2021-12-21 2022-04-08 上海交通大学 Personalized parameter generation method for protein force field simulation
CN114882945A (en) * 2022-07-11 2022-08-09 鲁东大学 Ensemble learning-based RNA-protein binding site prediction method
CN116844646B (en) * 2023-09-04 2023-11-24 鲁东大学 Enzyme function prediction method based on deep contrast learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105431853A (en) * 2013-05-23 2016-03-23 艾弗诺泰普有限责任公司 Phenotypic integrated social search database and method
CN107111691A (en) * 2014-10-27 2017-08-29 阿卜杜拉国王科技大学 The method and system of recognition ligand protein binding site
CN107194207A (en) * 2017-06-26 2017-09-22 南京理工大学 Protein ligands binding site estimation method based on granularity support vector machine ensembles
CN109147866A (en) * 2018-06-28 2019-01-04 南京理工大学 Residue prediction technique is bound based on sampling and the protein-DNA of integrated study
CN109326329A (en) * 2018-11-14 2019-02-12 金陵科技学院 Zinc-binding protein matter action site prediction technique based on integrated study under a kind of unbalanced mode

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190214107A1 (en) * 2015-04-21 2019-07-11 The Trustees Of Colombia University In The City Of New York Engineering surface epitopes to improve protein crystallization
AU2017268399B2 (en) * 2016-05-18 2023-01-12 Modernatx, Inc. mRNA combination therapy for the treatment of cancer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Recent methodology progress of deep learning for RNA-protein interaction prediction; Xiaoyong Pan et al.; 《WIREs RNA》; 20190508; Vol. 10 (No. 6); p. e1544 *
Research on prediction of protein classes and protein-ligand interactions based on machine learning; Zhang Lina (张丽娜); 《中国优秀博硕士学位论文全文数据库(博士)基础科学辑》; 20170815 (No. 8); pp. A006-59 *

Also Published As

Publication number Publication date
CN110689920A (en) 2020-01-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant