Bioinformatics, 35, 2019, i305–i314
doi: 10.1093/bioinformatics/btz328
ISMB/ECCB 2019

Multifaceted protein–protein interaction prediction based on Siamese residual RCNN

Muhao Chen1,*,†, Chelsea J.-T. Ju1,†, Guangyu Zhou1, Xuelu Chen1, Tianran Zhang2, Kai-Wei Chang1, Carlo Zaniolo1 and Wei Wang1

1Department of Computer Science and 2Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA 90095, USA

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

© The Author(s) 2019. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Abstract
Motivation: Sequence-based protein–protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences, on which statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage of the PPI information.
Results: We present an end-to-end framework, PIPR (Protein–Protein Interaction Prediction Based on Siamese Residual RCNN), for PPI prediction using only the protein sequences. PIPR incorporates a deep residual recurrent convolutional neural network in a Siamese architecture, which leverages both robust local features and contextualized information that are significant for capturing the mutual influence of protein sequences. PIPR relieves the data pre-processing efforts required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that PIPR outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows promising performance on the more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short.
Availability and implementation: The implementation is available at https://github.com/muhaochen/seq_ppi.git.
Contact: muhaochen@ucla.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction
Detecting protein–protein interactions (PPIs) and characterizing the interaction types are essential toward understanding cellular biological processes in normal and disease states. Knowledge from these studies potentially facilitates therapeutic target identification (Petta et al., 2016) and novel drug design (Skrabanek et al., 2008). High-throughput experimental technologies have been rapidly developed to discover and validate PPIs on a large scale. These technologies include yeast two-hybrid screens (Fields and Song, 1989), tandem affinity purification (Gavin et al., 2002) and mass spectrometric protein complex identification (Ho et al., 2002). However, experiment-based methods remain expensive, labor-intensive and time-consuming. Most importantly, they often suffer from high levels of false-positive predictions (Sun et al., 2017; You et al., 2015). Evidently, there is an immense need for reliable computational approaches to identify and characterize PPIs.

The amino acid sequence represents the primary structure of a protein, which is the simplest type of information either obtained through direct sequencing or translated from DNA sequences. Many research efforts address the PPI problem based on predefined features extracted from protein sequences, such as ontological features of amino acids (Jansen et al., 2003), autocovariance (AC) (Guo et al., 2008), conjoint triads (CT) (Shen et al., 2007) and composition-transition-distribution (CTD) descriptors (Yang et al., 2010). These features generally summarize specific aspects of protein sequences such as physicochemical properties, frequencies of local patterns and the positional distribution of amino acids. On top of these features, several statistical learning algorithms (Guo et al., 2008; Huang et al., 2015; You et al., 2014, 2015) are applied to predict PPIs in the form of binary classification.


These approaches provide feasible solutions to the problem. However, the extracted features used in these approaches only have limited coverage of the interaction information, as they are dedicated to specific facets of the protein profiles.

To mitigate the inadequacy of statistical learning methods, deep learning algorithms provide the powerful functionality to process large-scale data and automatically extract useful features for objective tasks (LeCun et al., 2015). Recently, deep learning architectures have produced powerful systems for several bioinformatics problems related to single nucleotide sequences, such as genetic variant detection (Anderson, 2018), DNA function classification (Quang and Xie, 2016), RNA-binding site prediction (Zhang et al., 2016) and chromatin accessibility prediction (Min et al., 2017). These works typically use convolutional neural networks (CNN) (Anderson, 2018; Zhang et al., 2016) for automatically selecting local features, or recurrent neural networks (RNN) (Quang and Xie, 2016) that aim at preserving the contextualized and long-term ordering information. In contrast, fewer efforts (discussed in Section 2) have been made to capture the pairwise interactions of proteins with deep learning, which remains a non-trivial problem with the following challenges: (i) characterization of the proteins requires a model to effectively filter and aggregate their local features, while preserving significant contextualized and sequential information of the amino acids; (ii) extending a deep neural architecture often leads to inefficient learning processes, and suffers from the notorious vanishing gradient problem (Pascanu et al., 2013); (iii) an effective mechanism is also needed to apprehend the mutual influence of protein pairs in PPI prediction. Moreover, it is essential for the framework to be scalable to large data, and to generalize to different prediction tasks.

In this paper, we introduce PIPR (Protein–Protein Interaction Prediction Based on Siamese Residual RCNN), a deep learning framework for PPI prediction using only the sequences of a protein pair. PIPR employs a Siamese architecture to capture the mutual influence of a protein sequence pair. The learning architecture is based on a residual recurrent convolutional neural network (RCNN), which integrates multiple occurrences of convolution layers and residual gated recurrent units. To represent each amino acid in this architecture, PIPR applies an efficient property-aware lexicon embedding approach to better capture the contextual and physicochemical relatedness of amino acids. This comprehensive encoding architecture provides a multi-granular feature aggregation process to effectively leverage both sequential and robust local information of the protein sequences. It is important to note that the scope of this work focuses only on the primary sequence, as it is the fundamental information describing a protein.

Our contributions are 3-fold. First, we construct an end-to-end framework for PPI prediction that relieves the data pre-processing efforts for users. PIPR requires only the primary protein sequences as input, and is trained to automatically preserve the critical features from the sequences. Second, we emphasize and demonstrate the need to consider the contextualized and sequential information when modeling PPIs. Third, the architecture of PIPR can be flexibly used to address different PPI tasks. Besides the binary prediction that is widely attempted in previous works, our framework extends its use to two additional challenging problems: multi-class interaction type prediction and binding affinity estimation. We use five datasets to evaluate the performance of our framework on these tasks. PIPR outperforms various state-of-the-art approaches on the binary prediction task, which confirms its effectiveness in integrating both local features and sequential information. The promising performance on the other two tasks demonstrates the wide usability of our approach. Especially on the binding affinity estimation of mutated proteins, PIPR is able to respond to the subtle changes of point mutations and provides the best estimation with the smallest errors.

2 Related works
Sequence-based approaches provide a critical solution to the binary PPI prediction task. Homology-based methods (Philipp et al., 2016) rely on BLAST to map a pair of sequences to known interacting proteins. Alternatively, other works address the task with statistical learning models, including SVM (Guo et al., 2008; You et al., 2014), kNN (Yang et al., 2010), Random Forest (Wong et al., 2015), multi-layer perceptron (MLP) (Du et al., 2017) and ensemble ELM (EELM) (You et al., 2013). These approaches rely on several feature extraction processes for the protein sequences, such as CT (Sun et al., 2017; You et al., 2013), AC (Guo et al., 2008; Sun et al., 2017; You et al., 2013), CTD (Du et al., 2017; Yang et al., 2010), multi-scale continuous and discontinuous (MCD) descriptors (You et al., 2013) and local phase quantization (LPQ) (Wong et al., 2015). These features measure physicochemical properties of the 20 canonical amino acids, and aim at summarizing the full sequence information relevant to PPIs. More recent works (Sun et al., 2017; Wang et al., 2017) propose the use of stacked autoencoders (SAE) to refine these heterogeneous features in low-dimensional spaces, which improves the aforementioned models on the binary prediction task. On the contrary, much less effort has been made toward multi-class prediction to infer the interaction types (Silberberg et al., 2014; Zhu et al., 2006) and the regression task to estimate binding affinity (Srinivasulu et al., 2015; Yugandhar and Gromiha, 2014). These methods have largely relied on their capability of extracting and selecting better features, while the extracted features are far from fully exploiting the interaction information.

By nature, the PPI prediction task is comparable to the neural sentence pair modeling tasks in natural language processing (NLP) research, as both seek to characterize the mutual influence of two sequences based on their latent features. In NLP, neural sentence pair models typically focus on capturing the discourse relations of lexicon sequences, such as textual entailment (Hu et al., 2014; Yin et al., 2016), paraphrases (He et al., 2015; Yin and Schütze, 2015) and sub-topic relations (Chen et al., 2018). Many recent efforts adopt a Siamese encoding architecture, where encoders based on CNN (Hu et al., 2014; Yin and Schütze, 2015) and RNN (Mueller and Thyagarajan, 2016) are widely used. A binary classifier is then stacked on the sequence pair encoder for the detection of a discourse relation.

In contrast to sentences, proteins are profiled as sequences with more intractable patterns, as well as with a drastically larger range of lengths. Precisely capturing the PPI requires much more comprehensive learning architectures to distill the latent information from the entire sequences, and to preserve the long-term ordering information. One recent work, DPPI (Hashemifar et al., 2018), uses a deep CNN-based architecture which focuses on capturing local features from protein profiles. DPPI represents the first work to deploy deep learning for PPI prediction, and has achieved state-of-the-art performance on the binary prediction task. However, it requires excessive efforts for data pre-processing, such as constructing protein profiles with PSI-BLAST (Altschul et al., 1997), and does not incorporate a neural learning architecture that captures the important contextualized and sequential features. DNN-PPI (Li et al., 2018) represents another relevant work of this line, which deploys a different learning structure with two separate CNN encoders. However, DNN-PPI does not incorporate physicochemical properties into the amino acid representations, and does not employ a Siamese learning architecture to fully characterize the pairwise relations of sequences.

3 Materials and methods
We introduce an end-to-end deep learning framework, PIPR, for sequence-based PPI prediction tasks. The overall learning architecture is illustrated in Figure 1. PIPR employs a Siamese architecture of residual RCNN encoders to better apprehend and utilize the mutual influence of two sequences. To capture the features of the protein sequences from scratch, PIPR pre-trains embeddings of the canonical amino acids that reflect their contextual similarity and physicochemical properties. The latent representation of each protein in a protein pair is obtained by feeding the corresponding amino acid embeddings into the sequence encoder. The embeddings of these two sequences are then combined to form a sequence pair vector. Finally, this sequence pair vector is fed into an MLP with an appropriate loss function for the specific prediction task. In this section, we describe the details of each model component. We begin with the notation and problem specifications.

3.1 Preliminary
We use $\mathcal{A}$ to denote the vocabulary of the 20 canonical amino acids. A protein is profiled as a sequence of amino acids $S = [a_1, a_2, \ldots, a_l]$ such that each $a_i \in \mathcal{A}$. For each amino acid $a_i$, we use the bold-faced $\mathbf{a}_i$ to denote its embedding representation, which we specify in Section 3.2.3. We use $I$ to denote the set of protein pairs, and $p = (S_1, S_2) \in I$ denotes a pair of proteins whose interaction our framework captures.

We address three challenging PPI prediction tasks based only on the primary sequence information: (i) Binary prediction seeks to provide a binary classifier that indicates whether the corresponding protein pair interacts; this is the simplest and most widely considered problem setting in previous works (Hashemifar et al., 2018; Skrabanek et al., 2008; Sun et al., 2017). (ii) Interaction type prediction is a multi-class classification problem, which seeks to identify the interaction type of two proteins. (iii) Binding affinity estimation aims at producing a regression model to estimate the strength of the binding interaction.

3.2 RCNN-based protein sequence encoder
We employ a deep Siamese architecture of residual RCNN to capture the latent semantic features of protein sequence pairs.

3.2.1 Residual RCNN
The RCNN seeks to leverage both the global sequential information and the local features that are significant for characterizing PPIs from the protein sequences. This deep neural encoder stacks multiple instances of two computational modules, i.e. convolution layers with pooling and bidirectional residual gated recurrent units. The architecture of an RCNN unit is shown on the left of Figure 2.

The convolution layer with pooling. We use $X = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_l]$ to denote an input vector sequence that corresponds to the embedded amino acids or the outputs of a previous neural layer. A convolution layer applies a weight-sharing kernel $M_c \in \mathbb{R}^{h \times k}$ to generate a $k$-dimensional latent vector $\mathbf{h}_t^{(1)}$ from a window $\mathbf{v}_{t:t+h-1}$ of the input vector sequence $X$:

$\mathbf{h}_t^{(1)} = \mathrm{Conv}(\mathbf{v}_{t:t+h-1}) = M_c \mathbf{v}_{t:t+h-1} + \mathbf{b}_c$

for which $h$ is the kernel size, and $\mathbf{b}_c$ is a bias vector. The convolution layer applies the kernel as a sliding window to produce a sequence of latent vectors $H^{(1)} = [\mathbf{h}_1^{(1)}, \mathbf{h}_2^{(1)}, \ldots, \mathbf{h}_{l-h+1}^{(1)}]$, where each latent vector combines the local features of one $h$-gram of the input sequence. The $n$-max-pooling mechanism is applied to every consecutive $n$-length subsequence (i.e. non-overlapping $n$-strides) of the convolution outputs, taking the maximum value along each dimension $j$:

$h_{i,j}^{(2)} = \max\left(h_{i:n+i-1,j}^{(1)}\right)$

The purpose of this mechanism is to discretize the convolution results, and to preserve the most significant features within each $n$-stride (Chen et al., 2018; Hashemifar et al., 2018; Kim, 2014). By definition, this mechanism divides the size of the processed features by $n$. The outputs from the max-pooling are fed into the bidirectional gated recurrent units in our RCNN encoder.
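To make the convolution-pooling computation concrete, the following is a minimal PyTorch sketch of one such block (the module and variable names are ours, not from the PIPR release), assuming the settings reported in Section 4.2: kernel size $h = 3$ and 3-max-pooling.

```python
import torch
import torch.nn as nn

class ConvPool(nn.Module):
    """One convolution layer with n-max-pooling (Section 3.2.1): the kernel
    slides over the embedded sequence, and non-overlapping n-strides of the
    outputs are reduced to their per-dimension maxima."""

    def __init__(self, in_dim, out_dim, kernel=3, n=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=kernel)  # weight-sharing kernel M_c
        self.pool = nn.MaxPool1d(kernel_size=n, stride=n)           # divides the length by n

    def forward(self, x):
        # x: (batch, seq_len, in_dim); Conv1d expects (batch, channels, seq_len)
        h = self.conv(x.transpose(1, 2))
        return self.pool(h).transpose(1, 2)

# toy check on a batch of two embedded sequences of length 90 and dimension 12
out = ConvPool(12, 50)(torch.randn(2, 90, 12))
print(out.shape)  # torch.Size([2, 29, 50]); (90 - 3 + 1) // 3 = 29
```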

Fig. 1. The overall learning architecture of our framework. (The diagram shows two protein sequences mapped to pre-trained embeddings, encoded by residual RCNNs with shared parameters, combined by element-wise multiplication into a sequence pair vector, and fed to a multi-layer perceptron trained with binary cross-entropy, categorical cross-entropy or MSE for the interaction prediction, interaction type prediction and binding affinity estimation tasks, respectively.)


Fig. 2. The structure of our residual RCNN encoder is shown on the right, and the RCNN unit is shown on the left. Each RCNN unit contains a convolution-pooling layer followed by a bidirectional GRU with residual shortcuts; the encoder stacks several RCNN units and a final convolution layer on top of the pre-trained embeddings, followed by global average pooling that produces the sequence embedding vector.

The residual gated recurrent units. The gated recurrent unit (GRU) model represents an alternative to the long short-term memory (LSTM) network (Cho et al., 2014), which consecutively characterizes the sequential information without using separated memory cells (Dhingra et al., 2017). Each unit consists of two types of gates to track the state of the sequence, i.e. the reset gate $\mathbf{r}_t$ and the update gate $\mathbf{z}_t$. Given the embedding $\mathbf{v}_t$ of an incoming item (either a pre-trained amino acid embedding, or an output of the previous layer), GRU updates the hidden state $\mathbf{h}_t^{(3)}$ of the sequence as a linear combination of the previous state $\mathbf{h}_{t-1}^{(3)}$ and the candidate state $\tilde{\mathbf{h}}_t^{(3)}$ of the new item $\mathbf{v}_t$, which is calculated as below:

$\mathbf{h}_t^{(3)} = \mathrm{GRU}(\mathbf{v}_t) = \mathbf{z}_t \circ \tilde{\mathbf{h}}_t^{(3)} + (1 - \mathbf{z}_t) \circ \mathbf{h}_{t-1}^{(3)}$

$\mathbf{z}_t = \sigma\left(M_z \mathbf{v}_t + N_z \mathbf{h}_{t-1}^{(3)} + \mathbf{b}_z\right)$

$\tilde{\mathbf{h}}_t^{(3)} = \tanh\left(M_s \mathbf{v}_t + \mathbf{r}_t \circ (N_s \mathbf{h}_{t-1}^{(3)}) + \mathbf{b}_s\right)$

$\mathbf{r}_t = \sigma\left(M_r \mathbf{v}_t + N_r \mathbf{h}_{t-1}^{(3)} + \mathbf{b}_r\right)$

Notation $\circ$ thereof denotes the element-wise multiplication. The update gate $\mathbf{z}_t$ balances the information of the previous sequence and the new item, where the capitalized $M$ and $N$ denote different weight matrices, $\mathbf{b}$ denotes bias vectors and $\sigma$ is the sigmoid function. The candidate state $\tilde{\mathbf{h}}_t^{(3)}$ is calculated similarly to that in a traditional recurrent unit, and the reset gate $\mathbf{r}_t$ controls how much information of the past sequence contributes to $\tilde{\mathbf{h}}_t^{(3)}$. Note that GRU generally performs comparably to LSTM in sequence encoding tasks, but is less complex and requires much fewer computational resources for training.

A bidirectional GRU layer characterizes the sequential information in two directions. It contains a forward encoding process $\overrightarrow{\mathrm{GRU}}$ that reads the input vector sequence $X = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_l]$ from $\mathbf{v}_1$ to $\mathbf{v}_l$, and a backward encoding process $\overleftarrow{\mathrm{GRU}}$ that reads in the opposite direction. The encoding results of the two processes are concatenated for each input item $\mathbf{v}_t$, i.e.

$\mathbf{h}_t^{(4)} = \mathrm{BiGRU}(\mathbf{v}_t) = \left[\overrightarrow{\mathrm{GRU}}(\mathbf{v}_t); \overleftarrow{\mathrm{GRU}}(\mathbf{v}_t)\right]$

The residual mechanism passes on an identity mapping of the GRU inputs to its output side through a residual shortcut (He et al., 2016). By adding the forwarded input values to the outputs, the corresponding neural layer is only required to capture the difference between the input and output values. This mechanism aims at improving the learning process of non-linear neural layers by increasing the sensitivity of the optimization gradients (He et al., 2016; Kim et al., 2016), as well as protecting the model from the vanishing gradient problem. It has been widely deployed in deep learning architectures for various tasks of image recognition (He et al., 2016), document classification (Conneau et al., 2017) and speech recognition (Zhang et al., 2017). In our deep RCNN, the bidirectional GRU is incorporated with the residual mechanism, and passes on the following outputs to its subsequent neural network layer:

$\mathbf{h}_t^{(5)} = \mathrm{ResGRU}(\mathbf{v}_t) = \left[\overrightarrow{\mathrm{GRU}}(\mathbf{v}_t) + \mathbf{v}_t; \overleftarrow{\mathrm{GRU}}(\mathbf{v}_t) + \mathbf{v}_t\right]$

In our development, we have found that the residual mechanism is able to drastically simplify the training process, and largely decreases the number of parameter-update epochs needed for the model to converge.

3.2.2 Protein sequence encoding
Figure 2 shows the entire structure of our RCNN encoder. The RCNN encoder $E_{\mathrm{RCNN}}(S)$ alternately stacks multiple occurrences of the above two intermediary neural network components. A convolution layer serves as the first encoding layer to extract local features from the input sequence. On top of that, a residual GRU layer takes in the preserved local features, whose outputs are passed to another convolution layer. Repeating these two components in the network structure conducts an automatic multi-granular feature aggregation process on the protein sequence, while preserving the sequential and contextualized information at each granularity of the selected features. The last residual GRU layer is followed by another convolution layer for a final round of local feature selection, which produces the last hidden states $H' = [\mathbf{h}'_1, \mathbf{h}'_2, \ldots, \mathbf{h}'_{|H'|}]$. Note that the dimensionality of the last hidden states does not need to equal that of the previous hidden states. A high-level embedding of the entire protein sequence is obtained from the global average-pooling (Lin et al., 2013) of $H'$, i.e.

$E_{\mathrm{RCNN}}(S) = \frac{1}{|H'|} \sum_{i=1}^{|H'|} \mathbf{h}'_i$
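The encoder can be summarized in a compact PyTorch sketch as below. This is an illustration under our own naming, not the released PIPR implementation: nn.GRU realizes the gating equations above, the residual shortcut is the per-direction addition of the unit input, and only two RCNN units are stacked here for brevity (Section 4.2 uses five, with hidden size 50 and output size 100).

```python
import torch
import torch.nn as nn

class ResGRU(nn.Module):
    """Bidirectional GRU with residual shortcuts: each direction's output is
    summed with the unit input, so the layer only learns the residual between
    its input and output values (Section 3.2.1)."""

    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, len, dim)
        out, _ = self.gru(x)                   # (batch, len, 2 * dim)
        fwd, bwd = out.chunk(2, dim=-1)
        # h_t^(5) = [GRU_fwd(v_t) + v_t ; GRU_bwd(v_t) + v_t]
        return torch.cat([fwd + x, bwd + x], dim=-1)

class RCNNEncoder(nn.Module):
    """E_RCNN(S): alternating convolution-pooling and residual BiGRU units,
    a final convolution, then global average pooling (Section 3.2.2)."""

    def __init__(self, embed_dim=12, hidden=50, out_dim=100, units=2):
        super().__init__()
        self.convs, self.grus = nn.ModuleList(), nn.ModuleList()
        dim = embed_dim
        for _ in range(units):
            self.convs.append(nn.Conv1d(dim, hidden, kernel_size=3))
            self.grus.append(ResGRU(hidden))
            dim = 2 * hidden                   # ResGRU concatenates both directions
        self.pool = nn.MaxPool1d(3, 3)
        self.last_conv = nn.Conv1d(dim, out_dim, kernel_size=3)

    def forward(self, x):                      # x: (batch, len, embed_dim)
        for conv, gru in zip(self.convs, self.grus):
            x = self.pool(conv(x.transpose(1, 2))).transpose(1, 2)
            x = gru(x)
        x = self.last_conv(x.transpose(1, 2))  # final local feature selection
        return x.mean(dim=-1)                  # global average pooling -> (batch, out_dim)

z = RCNNEncoder()(torch.randn(4, 120, 12))
print(z.shape)  # torch.Size([4, 100])
```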
3.2.3 Pre-trained amino acid embeddings
To support inputting the non-numerical sequence information, we provide a useful embedding method to represent each amino acid $a \in \mathcal{A}$ as a semi-latent vector $\mathbf{a}$. Each embedding vector is a concatenation of two sub-embeddings, i.e. $\mathbf{a} = [\mathbf{a}_c; \mathbf{a}_{ph}]$.

The first part $\mathbf{a}_c$ measures the co-occurrence similarity of the amino acids, which is obtained by pre-training the Skip-Gram model (Mikolov et al., 2013) on protein sequences. The learning objective of Skip-Gram is to minimize the following negative log likelihood loss:

$J_{SG} = -\frac{1}{|S|} \sum_{a_t \in S}\ \sum_{-C < j < C} \log p\left(\mathbf{a}_{c,t+j} \mid \mathbf{a}_{c,t}\right)$

$\mathbf{a}_{c,t}$ thereof is the first-part embedding of the $t$-th amino acid $a_t \in S$; $\mathbf{a}_{c,t+j}$ is that of a neighboring amino acid, and $C$ is the size of the half context. (The context of Skip-Gram means a subsequence of a given protein sequence $S$, such that the subsequence is of length $2C+1$.) The probability $p$ is defined as the following softmax:

$p\left(\mathbf{a}_{c,t+j} \mid \mathbf{a}_{c,t}\right) = \frac{\exp(\mathbf{a}_{c,t+j} \cdot \mathbf{a}_{c,t})}{\sum_{k=1}^{n} \exp(\mathbf{a}'_{c,k} \cdot \mathbf{a}_{c,t})}$

where $n$ is the negative sampling size, and $\mathbf{a}'_{c,k}$ is a negative sample that does not co-occur with $\mathbf{a}_{c,t}$ in the same context.

The second part $\mathbf{a}_{ph}$ represents the similarity of electrostaticity and hydrophobicity among amino acids. The 20 amino acids can be clustered into seven classes based on the dipoles and volumes of their side chains to reflect this property. Thus, $\mathbf{a}_{ph}$ is a one-hot encoding based on the classification defined by Shen et al. (2007).
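As an illustration of this two-part embedding, the sketch below pre-trains the co-occurrence part with gensim's Skip-Gram implementation using the settings reported in Section 4.5 (5-dimensional vectors, context window 7, negative sampling 5), and appends a 7-dimensional one-hot class indicator. The two toy sequences are stand-ins, the seven-class grouping follows the commonly cited Shen et al. (2007) classification but should be verified against the original table, and gensim 4.x uses vector_size where older releases used size.

```python
import numpy as np
from gensim.models import Word2Vec

# each protein sequence is treated as a "sentence" of amino acid tokens
corpus = [list("MQSPYPMTQVSNVDDGSLLK"), list("MLERIQQLVNAVNDPRSDVAT")]

# Skip-Gram (sg=1) with negative sampling, minimizing the J_SG objective above
w2v = Word2Vec(sentences=corpus, vector_size=5, window=7, negative=5,
               sg=1, min_count=1, epochs=50)

# seven classes by dipole and side-chain volume (after Shen et al., 2007)
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

def embed(aa):
    """12-dim amino acid vector a = [a_c; a_ph]."""
    a_c = w2v.wv[aa]                        # 5-dim co-occurrence part
    a_ph = np.zeros(7, dtype=np.float32)    # 7-dim one-hot physicochemical part
    a_ph[next(i for i, c in enumerate(CLASSES) if aa in c)] = 1.0
    return np.concatenate([a_c, a_ph])

print(embed("M").shape)  # (12,)
```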
11 188 cases of PPIs, with half of them representing the positive
3.3 Learning architecture and learning objectives
Our framework characterizes the interactions in the following two stages.

3.3.1 Siamese architecture
Given a pair of proteins $p = (S_1, S_2) \in I$, the same RCNN encoder is used to obtain the sequence embeddings $E_{\mathrm{RCNN}}(S_1)$ and $E_{\mathrm{RCNN}}(S_2)$ of both proteins. Both sequence embeddings are combined using element-wise multiplication, i.e. $E_{\mathrm{RCNN}}(S_1) \circ E_{\mathrm{RCNN}}(S_2)$. This is a commonly used operation to infer the relation of sequence embeddings (Hashemifar et al., 2018; Jiang et al., 2018; Rocktäschel et al., 2016; Tai et al., 2015). Note that some works use the concatenation of sequence embeddings (Sun et al., 2017; Yin and Schütze, 2015) instead of multiplication, which we find to be less effective in modeling the symmetric relations of proteins.
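The choice of element-wise multiplication makes the pair representation symmetric, so the prediction does not depend on the order in which the two proteins are presented. A minimal, self-contained sketch (the stand-in encoder below replaces the residual RCNN purely for brevity):

```python
import torch
import torch.nn as nn

class MeanEncoder(nn.Module):
    """Stand-in sequence encoder; in PIPR this role is played by the shared
    residual RCNN from the earlier sketch."""
    def __init__(self, dim=12, out=100):
        super().__init__()
        self.proj = nn.Linear(dim, out)

    def forward(self, x):               # x: (batch, len, dim)
        return self.proj(x.mean(dim=1))

def pair_vector(encoder, s1, s2):
    """E_RCNN(S1) * E_RCNN(S2): element-wise product, commutative by construction."""
    return encoder(s1) * encoder(s2)

enc = MeanEncoder()                     # one shared (Siamese) encoder for both inputs
s1, s2 = torch.randn(1, 120, 12), torch.randn(1, 150, 12)
assert torch.allclose(pair_vector(enc, s1, s2), pair_vector(enc, s2, s1))
```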
3.3.2 Learning objectives
An MLP with leaky ReLU (Maas et al., 2013) is applied to the previous sequence pair representation, whose output $\hat{\mathbf{s}}^p$ is either a vector or a scalar, depending on whether the model solves a classification or a regression task for the protein pair $p$. The entire learning architecture is trained to optimize the following two types of losses according to different PPI prediction problems.

i. Cross-entropy loss is optimized for the two classification problems, i.e. binary prediction and interaction type prediction. In this case, the MLP output $\hat{\mathbf{s}}^p$ is a vector whose dimensionality equals the number of classes $m$. $\hat{\mathbf{s}}^p$ is normalized by a softmax function, where the $i$-th dimension $s_i^p = \frac{\exp(\hat{s}_i^p)}{\sum_j \exp(\hat{s}_j^p)}$ corresponds to the confidence score for the $i$-th class. The learning objective is to minimize the following cross-entropy loss, where $\mathbf{c}^p$ is a one-hot indicator for the class label of protein pair $p$:

$L^{(1)} = -\frac{1}{|I|} \sum_{p \in I} \sum_{i=1}^{m} c_i^p \log s_i^p$

ii. Mean squared loss is optimized for the binding affinity estimation task. In this case, $\hat{s}^p$ is a scalar output that is normalized by a sigmoid function $s^p = \frac{1}{1+\exp(-\hat{s}^p)}$, which is trained to approach the normalized ground truth score $c^p \in [0, 1]$ by minimizing the following objective function:

$L^{(2)} = \frac{1}{|I|} \sum_{p \in I} |s^p - c^p|^2$
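Both objectives map onto standard PyTorch losses: nn.CrossEntropyLoss fuses the softmax normalization with the cross-entropy of $L^{(1)}$, and a sigmoid followed by nn.MSELoss realizes $L^{(2)}$. The head below is a hedged sketch with illustrative layer sizes, not the exact released configuration.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """MLP with leaky ReLU on the sequence pair vector (Section 3.3.2);
    out_dim = m for classification, out_dim = 1 for affinity regression."""

    def __init__(self, in_dim=100, hidden=100, out_dim=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, pair_vec):
        return self.mlp(pair_vec)

pair_vec = torch.randn(8, 100)          # a batch of sequence pair vectors

# (i) classification: CrossEntropyLoss applies softmax + cross-entropy (L^(1))
labels = torch.randint(0, 7, (8,))
loss_cls = nn.CrossEntropyLoss()(PredictionHead(out_dim=7)(pair_vec), labels)

# (ii) regression: sigmoid-squashed scalar against the min-max normalized
#      affinity c^p in [0, 1], trained with mean squared loss (L^(2))
target = torch.rand(8, 1)
loss_reg = nn.MSELoss()(torch.sigmoid(PredictionHead(out_dim=1)(pair_vec)), target)
```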
4 Experiments
We present the experimental evaluation of the proposed framework on three PPI prediction tasks, i.e. binary prediction, multi-class interaction type prediction and binding affinity estimation. The experiments are conducted on the following datasets.

4.1 Datasets
Guo's datasets. Guo et al. (2008) generate several datasets from different species for the binary prediction of PPIs. Each dataset contains a balanced number of positive and negative samples. Among these resources, the Yeast dataset is a widely used benchmark by most state-of-the-art methods (Hashemifar et al., 2018; Wong et al., 2015; You et al., 2013, 2014). There are 2497 proteins forming 11 188 cases of PPIs, with half of them representing the positive cases, and the other half the negative cases. The positive cases are selected from the database of interacting proteins DIP_20070219 (Salwinski et al., 2004), where proteins with fewer than 50 amino acids or ≥40% sequence identity are excluded. We use the full protein sequences in our model, which are obtained from UniProt (Consortium et al., 2018). The negative cases are generated by randomly pairing the proteins without evidence of interaction, and filtered by their sub-cellular locations. In other words, non-interactive pairs residing in the same location are excluded.

In addition, we combine the data for Caenorhabditis elegans, Escherichia coli and Drosophila melanogaster as the multi-species dataset. We use the cluster analysis of the CD-HIT (Li and Godzik, 2006) program to generate non-redundant subsets. Proteins with fewer than 50 amino acids or high sequence identity (40, 25, 10 or 1%) are removed.

STRING datasets. The STRING database (Szklarczyk et al., 2016) annotates PPIs with their types. There are seven types of interactions: activation, binding, catalysis, expression, inhibition, post-translational modification (ptmod) and reaction. We download all interaction pairs for Homo sapiens from database version 10.5 (Szklarczyk et al., 2016), along with their full protein sequences. Among the corresponding proteins, we randomly select 3000 proteins and 8000 proteins that share <40% sequence identity to generate two subsets. In this process, we randomly sample instances of different interaction types to ensure a balanced class distribution. Eventually, the two generated datasets, denoted by SHS27k and SHS148k, contain 26 945 and 148 051 cases of interactions, respectively. We use these two datasets for the PPI type prediction task.

SKEMPI dataset. We obtain the protein binding affinity data from SKEMPI (the structural database of kinetics and energetics of mutant protein interactions) (Moal and Fernández-Recio, 2012) for the affinity estimation task. It contains 3047 binding affinity changes upon mutation of protein sub-units within a protein complex. The binding affinity is measured by the equilibrium dissociation constant (Kd), reflecting the strength of biomolecular interactions. A smaller Kd value means a higher binding affinity. Each protein complex contains single or multiple amino acid substitutions. The sequence of the protein complex is retrieved from the protein data bank (PDB) (Berman et al., 2000). We manually replace the mutated amino acids. For duplicate entries, we take the average Kd. The final dataset results in the binding affinities of 2792 mutant protein complexes, along with 158 wild-types.

4.2 Binary PPI prediction
Binary PPI prediction is the primary task targeted by a handful of previous works (Hashemifar et al., 2018; Shen et al., 2007; Sun et al., 2017; Yang et al., 2010; You et al., 2015). The objective of these works is to identify whether a given pair of proteins interacts or not based on their sequences. We evaluate PIPR based on Guo's datasets. The Yeast benchmark dataset thereof is used to compare PIPR with various baseline approaches, and the multi-species dataset is used to demonstrate PIPR's capability of predicting interactions for proteins of different species that share very low sequence identity with those in training.

The baseline approaches include SVM-AC (Guo et al., 2008), kNN-CTD (Yang et al., 2010), EELM-PCA (You et al., 2013), SVM-MCD (You et al., 2014), MLP (Du et al., 2017), Random Forest LPQ (RF-LPQ) (Wong et al., 2015), SAE (Sun et al., 2017), DNN-PPI (Li et al., 2018) and DPPI (Hashemifar et al., 2018). In addition, we report the results of a Siamese Residual GRU (SRGRU) architecture, which is a simplification of PIPR, where we discard all intermediary convolution layers and keep only the bidirectional residual GRU. The purpose of SRGRU is to show the significance of the contextualized and sequential information of protein profiles in characterizing PPIs. We also report the results of a Siamese CNN (SCNN) obtained by removing the residual GRU from PIPR. This degenerates our framework to an architecture similar to DPPI, but differs in that SCNN directly conducts end-to-end training on raw sequences instead of requiring the protein profiles constructed by PSI-BLAST.

We use AMSGrad (Reddi et al., 2018) to optimize the cross-entropy loss, for which we set the learning rate $\alpha$ to 0.001, the exponential decay rates $\beta_1$ and $\beta_2$ to 0.9 and 0.999, and the batch size to 256 on both datasets. The number of occurrences of the RCNN units (i.e. one convolution-pooling layer followed by one bidirectional residual GRU layer) is set to five, where we adopt three-max-pooling and a convolution kernel of size three. We set the hidden state size to 50, and the RCNN output size to 100. We set this configuration to ensure that the RCNN compresses the selected features into a reasonably small vector sequence before the features are aggregated by the last global average-pooling. We zero-pad short sequences to the longest sequence length in the dataset. This is a widely adopted technique for sequence modeling in NLP (Chen et al., 2018; He et al., 2015; Hu et al., 2014; Yin et al., 2016; Zhou et al., 2017) as well as in bioinformatics (Min et al., 2017; Müller et al., 2018; Pan and Shen, 2018) for efficient training. Note that the configuration of embedding pre-training is discussed in Section 4.5, and the model configuration study of different hyperparameter values is provided in the Supplementary Material. All model variants are trained until convergence at each fold of the cross-validation (CV).
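In PyTorch, for instance, AMSGrad is exposed as a flag on the Adam optimizer, so the reported optimization setting reduces to a few lines (the linear layer is only a stand-in for the full PIPR network):

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 2)  # stand-in for the full PIPR network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), amsgrad=True)
```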
Evaluation protocol. Following the settings in previous works (Hashemifar et al., 2018; Shen et al., 2007; Sun et al., 2017; You et al., 2014, 2015), we conduct 5-fold CV on the Yeast dataset. Under the k-fold CV setting, the data are equally divided into k non-overlapping subsets, and each subset has a chance to train and to test the model so as to ensure an unbiased evaluation. We aggregate six metrics on the test cases of each fold, i.e. the overall accuracy, precision, sensitivity, specificity, F1-score and Matthews correlation coefficient (MCC) on positive cases. All these metrics are preferred to be higher to indicate better performance. Based on the reported accuracy over the 5 folds, we also conduct two-tailed Welch's t-tests (Welch, 1947) to evaluate the significance of the improvement between different pairs of approaches. The P-values are adjusted by the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995) to control the false discovery rate for multiple hypothesis testing.
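This protocol can be reproduced with standard tooling; the sketch below computes the six metrics with scikit-learn (specificity is derived manually, as it has no dedicated helper) and runs Welch's t-test with Benjamini–Hochberg adjustment via SciPy and statsmodels. All labels and fold accuracies are placeholders.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)
from statsmodels.stats.multitest import multipletests

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])     # placeholder labels of one fold
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn = np.sum((y_true == 0) & (y_pred == 0))
metrics = {
    "accuracy":    accuracy_score(y_true, y_pred),
    "precision":   precision_score(y_true, y_pred),
    "sensitivity": recall_score(y_true, y_pred),   # sensitivity == recall
    "specificity": tn / np.sum(y_true == 0),
    "f1":          f1_score(y_true, y_pred),
    "mcc":         matthews_corrcoef(y_true, y_pred),
}

# Welch's two-tailed t-test on per-fold accuracies, then BH adjustment
pipr_folds = [97.1, 96.8, 97.3, 97.0, 97.2]      # placeholder fold accuracies
base_folds = [94.4, 94.1, 94.8, 94.3, 94.6]
_, p = ttest_ind(pipr_folds, base_folds, equal_var=False)
reject, p_adj, _, _ = multipletests([p], method="fdr_bh")
```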
Results. As shown in Table 1, the CNN-based architecture, DPPI, demonstrates state-of-the-art performance over the other baselines that employ statistical learning algorithms or a densely connected MLP. (We are unable to obtain the source codes of two deep-learning methods, SAE and DNN-PPI. We implement these two models following the descriptions in their papers. Our implementations are verified by achieving comparable performance on Pan's dataset (Pan et al., 2010) as reported in the papers. However, these two implementations can only achieve 67.17 and 76.61% overall accuracy, respectively, on the Yeast dataset.) This shows the superiority of deep-learning-based techniques in encapsulating various types of information of a protein pair, such as amino acid composition and their co-occurrences, and automatically extracting the robust ones for the learning objectives. That said, DPPI requires an extensive effort in data pre-processing, specifically in constructing the protein profile for each sequence. On average, each PSI-BLAST search of a protein against the NCBI non-redundant protein database (184 243 125 sequences) requires around 90 min of computation on our server. Even with eight cores, each search finishes in 15 min. We estimate that processing the 2497 sequences of the Yeast dataset from scratch can take about 26 days. It is worth mentioning that PIPR only requires 8 s to pre-train the amino acid embeddings, and 2.5 min to train on the Yeast dataset (Table 7). We implement SCNN to evaluate the performance of a simplified CNN architecture, which produces results comparable to DPPI. These two frameworks show that CNN can already leverage the significant features from primary protein sequences.

In addition, the SRGRU architecture offers comparable performance to SCNN. This indicates that preserving the sequential and contextualized features of the protein sequences is as crucial as incorporating the local features. By integrating both significant local features and sequential information, PIPR outperforms DPPI by 2.54% in accuracy, 4.93% in sensitivity and 2.68% in F1-score. Next, we evaluate whether the improved accuracy of PIPR is statistically significant. Table 2 reports the P-values of SRGRU, SCNN and PIPR compared to the other baseline approaches, where the statistically significant comparisons (P-values <0.01) are highlighted in red. Since the SD of DPPI is unavailable, we are not able to include DPPI in this analysis. The evaluation shows that PIPR performs statistically significantly better than all other approaches, including SCNN and SRGRU. On the other hand, SCNN is not statistically significantly better than SRGRU. Thus, the residual RCNN is very promising for modeling binary PPIs.

We also report the 5-fold CV performance of PIPR on variants of the multi-species dataset, where proteins are excluded based on different thresholds of sequence identity. The results in Table 3 show that PIPR performs consistently well under lenient and stringent criteria of sequence identity between training and testing. More importantly, PIPR is able to train and test on multiple species, and is robust against extremely low sequence identity of <1%.

Table 1. Evaluation of binary PPI prediction on the Yeast dataset based on 5-fold cross-validation. We report the mean and SD for the test sets

Methods     Accuracy (%)   Precision (%)   Sensitivity (%)   Specificity (%)   F1-score (%)   MCC (%)
SVM-AC      87.35 ± 1.38   87.82 ± 4.84    87.30 ± 5.23      87.41 ± 6.33      87.34 ± 1.33   75.09 ± 2.51
kNN-CTD     86.15 ± 1.17   90.24 ± 1.34    81.03 ± 1.74      NA                85.39 ± 1.51   NA
EELM-PCA    86.99 ± 0.29   87.59 ± 0.32    86.15 ± 0.43      NA                86.86 ± 0.37   77.36 ± 0.44
SVM-MCD     91.36 ± 0.40   91.94 ± 0.69    90.67 ± 0.77      NA                91.30 ± 0.73   84.21 ± 0.66
MLP         94.43 ± 0.30   96.65 ± 0.59    92.06 ± 0.36      NA                94.30 ± 0.45   88.97 ± 0.62
RF-LPQ      93.92 ± 0.36   96.45 ± 0.45    91.10 ± 0.31      NA                93.70 ± 0.37   88.56 ± 0.63
SAE         67.17 ± 0.62   66.90 ± 1.42    68.06 ± 2.50      66.30 ± 2.27      67.44 ± 1.08   34.39 ± 1.25
DNN-PPI     76.61 ± 0.51   75.10 ± 0.66    79.63 ± 1.34      73.59 ± 1.28      77.29 ± 0.66   53.32 ± 1.05
DPPI        94.55          96.68           92.24             NA                94.41          NA
SRGRU       93.77 ± 0.84   94.60 ± 0.64    92.85 ± 1.58      94.69 ± 0.81      93.71 ± 0.85   87.56 ± 1.67
SCNN        95.03 ± 0.47   95.51 ± 0.77    94.51 ± 1.27      95.55 ± 0.77      95.00 ± 0.50   90.08 ± 0.93
PIPR        97.09 ± 0.24   97.00 ± 0.65    97.17 ± 0.44      97.00 ± 0.67      97.09 ± 0.23   94.17 ± 0.48

Each boldfaced number indicates the best of the corresponding metric. NA, not available from the original paper.

Table 2. Statistical assessment (t-test; two-tailed) on the accuracy of binary PPI prediction

P-value     SRGRU      SCNN       PIPR
SVM-AC      9.69E-05   1.22E-04   9.69E-05
kNN-CTD     1.03E-05   2.23E-05   2.84E-05
EELM-PCA    2.33E-05   3.94E-08   2.43E-10
SVM-MCD     1.67E-03   2.60E-06   1.35E-07
MLP         1.71E-01   5.29E-02   1.12E-06
RF-LPQ      7.28E-01   4.10E-03   1.75E-06
SAE         4.27E-10   1.78E-10   4.19E-09
DNN-PPI     1.62E-08   2.27E-10   2.70E-09
SRGRU       NA         2.87E-02   6.60E-04
SCNN        2.87E-02   NA         1.80E-04

Note: The statistically significant differences are highlighted in red. NA, not available.

Table 3. Evaluation of binary PPI prediction on variants of the multi-species (C. elegans, D. melanogaster and E. coli) dataset

Seq. identity   # of proteins   Pos. pairs   Neg. pairs   Accuracy (%)   F1-score (%)
Any             11 529          32 959       32 959       98.19          98.17
<0.40           9739            25 916       22 012       98.29          98.28
<0.25           7790            19 458       15 827       97.91          98.08
<0.10           5769            12 641       9819         97.54          97.79
<0.01           5171            10 747       8065         97.51          97.80

4.3 Interaction type prediction
The objective of this task is to predict the interaction type of two interacting proteins. We evaluate this task based on the SHS27k and SHS148k datasets. To the best of our knowledge, much less effort has been devoted to multi-class PPI prediction in contrast to binary prediction. Zhu et al. (2006) train a two-stage SVM classifier to distinguish obligate, non-obligate and crystal packing interactions; Silberberg et al. (2014) use logistic regression to predict several types of enzymatic actions. However, none of their implementations is publicly available. Different from the categories of interaction types used above, we aim at predicting the interaction types annotated by the STRING database.

We train several statistical learning algorithms on the widely employed AC and CTD features for protein characterization as our baselines. These algorithms include SVM, Random Forest, AdaBoost [SAMME.R algorithm (Zhu et al., 2009)], kNN classifier and logistic regression. For deep-learning-based approaches, we deploy the SCNN architecture with an output MLP with categorical cross-entropy loss, and bring a similar SRGRU architecture into the comparison. Results of two naïve baselines, random guessing and zero rule (i.e. simply predicting the majority class), are also reported for reference.

Evaluation protocol. All approaches are evaluated on the two datasets by 10-fold CV, using the same partition scheme for a more unbiased evaluation (James et al., 2013; McLachlan et al., 2005). We carry forward the model configurations from the last experiment to evaluate the performance of the frameworks under controlled variables. For the baseline models, we examine three different ways of combining the feature vectors of the two input proteins, i.e. element-wise multiplication, the Manhattan difference [i.e. the absolute differences of corresponding features (Mueller and Thyagarajan, 2016)] and concatenation. The Manhattan difference consistently obtains better performance, considering the small values of the input features and the asymmetry of the captured protein relations.

Results. The prediction accuracy and fold changes over the zero rule baseline are reported in Table 4. Note that since the multi-class prediction task is much more challenging than the binary prediction task, it is expected to observe lower accuracy and longer training time (Table 7) than reported in the previous experiment. Among all the baselines using explicit features, the CTD-based models perform better than the AC-based ones. CTD descriptors seek to cover both continuous and discontinuous interaction information (Yang et al., 2010), which potentially better discriminates among PPI types.

The best baseline using Random Forest thereof achieves satisfactory results by more than doubling the accuracy of zero rule on the smaller SHS27k dataset. However, on the larger SHS148k dataset, the accuracy of these explicit-feature-based models is notably impaired. We hypothesize that such predefined explicit features are not representative enough to distinguish the PPI types. On the other hand, the deep-learning-based approaches do not need to explicitly utilize these features, and perform consistently well in both settings. The raw sequence information is sufficient for these approaches to drastically outperform the Random Forest by at least 5.30% in accuracy on SHS27k and 17.40% in accuracy on SHS148k. SCNN thereof outperforms SRGRU by 4.48 and 1.24% in accuracy on SHS27k and SHS148k, respectively. This implies that the local interacting features are relatively more deterministic than the contextualized and sequential features on this task.
Table 4. Accuracy (%) and fold changes over zero rule for PPI interaction type prediction on two STRING datasets based on 10-fold cross-validation

Features    N/A               AC                                       CTD                                      Embedded raw seqs
Methods     Rand   Zero rule  SVM    RF     AdaBoost  kNN    Logistic  SVM    RF     AdaBoost  kNN    Logistic  SCNN   SRGRU  PIPR
SHS27k      14.28  16.70      33.17  44.82  28.67     35.44  25.47     35.56  45.76  31.81     35.56  30.57     55.54  51.06  59.56
(fold)      —      1.00       1.99   2.68   1.72      2.12   1.52      2.13   2.74   1.90      2.13   1.83      3.33   3.06   3.57
SHS148k     14.28  16.21      28.17  36.01  27.87     33.81  24.96     31.37  36.65  29.67     33.13  26.96     55.29  54.05  61.91
(fold)      —      1.00       1.74   2.22   1.72      2.09   1.54      1.94   2.26   1.83      2.04   1.66      3.41   3.33   3.82

Each boldfaced number indicates the best of the corresponding metric.

The results of the residual RCNN-based framework are very promising, as it outperforms SCNN by 4.02 and 6.62% in accuracy on SHS27k and SHS148k, respectively. It also remarkably outperforms the best explicit-feature-based baselines on the two datasets by 13.80 and 25.26% in accuracy, achieving more than 3.5-fold changes over the zero rule on both datasets.

4.4 Binding affinity estimation
Lastly, we evaluate PIPR for binding affinity estimation using the SKEMPI dataset. We employ the mean squared loss variant of PIPR to address this regression task. Since the lengths of the protein sequences in SKEMPI are much shorter than those in the other datasets, we accordingly reduce the number of RCNN units to three, while the other configurations remain unchanged. For baselines, we compare against several regression models based on the AC and CTD features, which include the Bayesian Ridge regressor (BR), SVM, AdaBoost with decision tree regressors and the Random Forest regressor. The corresponding features of the two sequences are again combined via the Manhattan difference. We also modify SCNN and SRGRU to their mean squared loss variants, in which we reduce the layers in the same way as for the RCNN.

Evaluation protocol. We aggregate three metrics through 10-fold CV, i.e. mean squared error (MSE), mean absolute error (MAE) and Pearson's correlation coefficient (Corr). These are three commonly reported metrics for regression tasks, for which lower MSE and MAE as well as higher Corr indicate better performance. In the CV process, we normalize the affinity values of the SKEMPI dataset to [0, 1] via min–max re-scaling. (This is because we use a sigmoid function to smooth the output of the regressor. Note that this process does not affect correlation, while MSE, MAE and the original affinity scores can easily be re-scaled back.)

Results. Table 5 reports the results of this experiment. It is noteworthy that one single change of amino acid can lead to a drastic effect on binding affinity. While such subtle changes are difficult to reflect with explicit features, the deep-learning-based methods can competently capture them from the raw sequences. Our RCNN-based framework again offers the best performance among the deep-learning-based approaches, and significantly outperforms the best baseline (CTD-based Random Forest) by offering a 0.233 increase in Corr, as well as remarkably lower MSE and MAE. Figure 3 demonstrates an example of the effect of changing an amino acid in a protein complex. Tyrosine at position 61 of Chymotrypsin inhibitor 2 (Chain I) is substituted with Alanine, causing the neighboring region of Subtilisin BPN' precursor (Chain E) to relax. The binding affinity (Kd) changes from 2.24E-12 to 2.70E-10, which is validly captured by PIPR. While our experiment is conducted on a relatively small dataset, we seek to extend our PIPR framework to a more generalized solution for binding affinity estimation, once a larger and more heterogeneous corpus is available.

Fig. 3. Mutation effects on structure and binding affinity. The blue entity is Subtilisin BPN' precursor (Chain E), and the red entity is Chymotrypsin inhibitor (Chain I). The mutation (on Chain I, Y61 → A) is highlighted in yellow. The wild-type (1TM1) and mutant (1TO1) complexes are retrieved from PDB
tures, which include Bayesian Ridge regressor (BR), SVM, AdaBoost
4.5 Amino acid embeddings
We further investigate the settings of the amino acid embeddings in this subsection. Each amino acid is represented by a vector of numerical values that describe its relative physicochemical properties. The first part of the embedding vector, $\mathbf{a}_c$, which measures the co-occurrence similarity of the amino acids in protein sequences, is empirically set as a five-dimensional vector. $\mathbf{a}_c$ is obtained by pre-training the Skip-Gram model on all 8000 sequences from our largest STRING dataset, SHS148k, using a context window size of seven and a negative sampling size of five. The second part is a seven-dimensional vector, $\mathbf{a}_{ph}$, which describes the categorization of electrostaticity and hydrophobicity for the amino acid. We examine the performance of using each part individually, as well as the performance of combining them as used in our framework. In addition, we include a naïve one-hot vector representation, which does not consider the relatedness of amino acids and treats each of them independently. Table 6 shows that, once we remove either of the two parts of the proposed embedding, the performance of the model slightly drops. Meanwhile, the proposed pre-trained embeddings lead to noticeably better performance of the model than adopting the naïve one-hot encodings of the canonical amino acids. This pre-training process completes in 8 s on a commodity workstation, as shown in Table 7. This is a one-time effort that can be reused for different tasks and datasets.

4.6 Run-time analysis
All of the experiments are conducted on one NVIDIA GeForce GTX 1080 Ti GPU. We report the training time for each experiment, as well as for the amino acid embedding, in Table 7. For each experiment, we calculate the average training time over either 5-fold (Yeast dataset) or 10-fold (others) CV. In both binary and multi-class prediction, the training time increases along with the number of training cases. The regression estimation generally requires more iterations per training case to converge than the classification tasks. Thus, with much fewer cases, the training time on SKEMPI for affinity estimation is longer than that on the Yeast dataset for binary prediction.
Table 5. Results for binding affinity prediction on the SKEMPI dataset

Features      AC                                  CTD                                 Embedded raw seqs
Methods       BR     SVM    RF     AdaBoost       BR     SVM    RF     AdaBoost       SCNN   SRGRU  PIPR
MSE (×10⁻²)   1.70   2.20   1.77   1.98           1.86   1.84   1.49   1.84           0.87   0.95   0.63
MAE (×10⁻²)   9.56   11.81  9.81   11.15          10.20  11.04  9.06   10.69          6.49   7.08   5.48
Corr          0.564  0.353  0.546  0.451          0.501  0.501  0.640  0.508          0.831  0.812  0.873

Note: Each measurement is an average over the test sets of 10-fold cross-validation. Each boldfaced number indicates the best of the corresponding metric.

Table 6. Comparison of amino acid representations based on binary prediction

            [a_c; a_ph]   a_c only   a_ph only   One-hot
Dimension   12            5          7           20
Accuracy    97.09         96.67      96.03       96.11
Precision   97.00         96.35      95.91       96.34
F1-score    97.09         96.51      96.08       96.10

Table 7. Run-time of training embeddings and different prediction tasks

Task            Embeddings   Binary    Multi-class   Multi-class   Regression
Dataset         SHS148k      Yeast     SHS27k        SHS148k       SKEMPI
Sample size     8000         11 188    26 945        148 051       2950
Training time   8 s          2.5 min   15.8 min      138.3 min     12.5 min

5 Conclusion
In this paper, we propose a novel end-to-end framework for PPI prediction based on the amino acid sequences. Our proposed framework, PIPR, employs a residual RCNN, which provides an automatic multi-granular feature selection mechanism to capture both significant local features and sequential features from the primary protein sequences. By incorporating the RCNN in a Siamese-based learning architecture, the framework effectively captures the mutual influence of protein pairs, and generalizes well to address different PPI prediction tasks without the need for predefined features. Extensive experimental evaluations on five datasets show the promising performance of our framework on three challenging PPI prediction tasks, with significant improvements over various baselines. Experiments on datasets of different sizes also demonstrate the satisfactory scalability of the framework. For future work, one important direction is to apply the PIPR framework to other sequence-based inference tasks in bioinformatics, such as modeling RNA and protein interactions. We also seek to incorporate attention mechanisms (Vaswani et al., 2017) to help pinpoint interaction sites on protein sequences, and to apply PIPR to predict the confidence of interactions in the form of ordinal regression. Since PIPR has alleviated any costly domain-invariant feature engineering process, how to extend PIPR with transfer-learning-based domain adaptation for different species is another meaningful direction.

Acknowledgements
We thank all of the reviewers for their valuable comments and suggestions.

Funding
This work was partially supported by the National Institutes of Health [R01GM115833, U54 GM114833]; and the National Science Foundation [DBI-1565137, DGE-1829071].

Conflict of Interest: none declared.

References
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Anderson,C. (2018) Google's AI tool deepvariant promises significantly fewer genome errors. Clinical OMICs, 5, 33–33.
Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B (Methodol.), 57, 289–300.
Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
Chen,M. et al. (2018) Neural article pair modeling for Wikipedia sub-article matching. In: ECML-PKDD, pp. 3–19. Springer, Cham.
Cho,K. et al. (2014) Learning phrase representations using RNN Encoder–Decoder for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734. ACL, Doha, Qatar.
Conneau,A. et al. (2017) Very deep convolutional networks for text classification. In: Proceedings of the European Chapter of the Association for Computational Linguistics, pp. 1107–1116. ACL.
Consortium,U. et al. (2018) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 46, 2699.
Dhingra,B. et al. (2017) Gated-attention readers for text comprehension. In: Proceedings of ACL, pp. 1832–1846. ACL, Vancouver, Canada.
Du,X. et al. (2017) DeepPPI: boosting prediction of protein–protein interactions with deep neural networks. J. Chem. Inf. Model., 57, 1499–1510.
Fields,S. and Song,O.-K. (1989) A novel genetic system to detect protein–protein interactions. Nature, 340, 245–246.
Gavin,A.-C. et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.
Guo,Y. et al. (2008) Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res., 36, 3025–3030.
Hashemifar,S. et al. (2018) Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics, 34, i802–i810.
He,H. et al. (2015) Multi-perspective sentence similarity modeling with convolutional neural networks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1576–1586. ACL, Lisbon, Portugal.
He,K. et al. (2016) Deep residual learning for image recognition. In: CVPR, pp. 770–778.
Ho,Y. et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.
Hu,B. et al. (2014) Convolutional neural network architectures for matching natural language sentences. In: Ghahramani,Z. et al. (eds) Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp. 2042–2050.
Huang,Y.-A. et al. (2015) Using weighted sparse representation model combined with discrete cosine transformation to predict protein–protein interactions from protein sequence. BioMed Res. Int., 2015, 902198.
James,G. et al. (2013) An Introduction to Statistical Learning. Vol. 112. Springer, New York.
Jansen,R. et al. (2003) A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science, 302, 449–453.
Jiang,J.-Y. et al. (2018) Learning to disentangle interleaved conversational threads with a Siamese hierarchical network and similarity ranking. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1812–1822. ACL, New Orleans, Louisiana.
Kim,J. et al. (2016) Accurate image super-resolution using very deep convolutional networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654.
Kim,Y. (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. ACL, Doha, Qatar.
LeCun,Y. et al. (2015) Deep learning. Nature, 521, 436–444.
Li,H. et al. (2018) Deep neural network based predictions of protein interactions using primary sequences. Molecules, 23, 1923.
Li,W. and Godzik,A. (2006) CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Lin,M. et al. (2013) Network in network. In: International Conference on Learning Representation, Scottsdale, Arizona.
Maas,A.L. et al. (2013) Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Vol. 30, p. 3.
McLachlan,G. et al. (2005) Analyzing Microarray Gene Expression Data. Vol. 422. John Wiley & Sons, Hoboken, New Jersey.
Mikolov,T. et al. (2013) Distributed representations of words and phrases and their compositionality. In: Burges,C.J.C. et al. (eds) Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 3111–3119.
Min,X. et al. (2017) Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics, 33, i92–i101.
Moal,I.H. and Fernández-Recio,J. (2012) SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics, 28, 2600–2607.
Mueller,J. and Thyagarajan,A. (2016) Siamese recurrent architectures for learning sentence similarity. In: Thirtieth AAAI Conference on Artificial Intelligence, Vol. 16, pp. 2786–2792. AAAI Press, Menlo Park, CA.
Müller,A.T. et al. (2018) Recurrent neural network model for constructive peptide design. J. Chem. Inf. Model., 58, 472–479.
Pan,X. and Shen,H.-B. (2018) Predicting RNA–protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics, 34, 3427–3436.
Pan,X.-Y. et al. (2010) Large-scale prediction of human protein–protein interactions from amino acid sequence based on latent topic features. J. Proteome Res., 9, 4992–5001.
Pascanu,R. et al. (2013) On the difficulty of training recurrent neural networks. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, pp. 1310–1318.
Petta,I. et al. (2016) Modulation of protein–protein interactions for the development of novel therapeutics. Mol. Ther., 24, 707–718.
Philipp,O. et al. (2016) Path2PPI: an R package to predict protein–protein interaction networks for a set of proteins. Bioinformatics, 32, 1427–1429.
Quang,D. and Xie,X. (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res., 44, e107.
Reddi,S.J. et al. (2018) On the convergence of Adam and Beyond. In: International Conference on Learning Representations, pp. 1–23. OpenReview, Amherst, MA.
Rocktäschel,T. et al. (2016) Reasoning about entailment with neural attention. In: International Conference on Learning Representations (ICLR), pp. 1–9. OpenReview, Amherst, MA.
Salwinski,L. et al. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res., 32, D449–D451.
Shen,J. et al. (2007) Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA, 104, 4337–4341.
Silberberg,Y. et al. (2014) A method for predicting protein–protein interaction types. PLoS One, 9, e90904.
Skrabanek,L. et al. (2008) Computational prediction of protein–protein interactions. Mol. Biotechnol., 38, 1–17.
Srinivasulu,Y.S. et al. (2015) Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes. BMC Bioinformatics, 16, S14.
Sun,T. et al. (2017) Sequence-based prediction of protein–protein interaction using a deep-learning algorithm. BMC Bioinformatics, 18, 277.
Szklarczyk,D. et al. (2016) The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res., 45, D362–D368.
Tai,K.S. et al. (2015) Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pp. 1556–1566. ACL, Beijing, China.
Vaswani,A. et al. (2017) Attention is all you need. In: Guyon,I. et al. (eds) Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 5998–6008.
Wang,Y.-B. et al. (2017) Predicting protein–protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol. Biosyst., 13, 1336–1344.
Welch,B.L. (1947) The generalization of Student's problem when several different population variances are involved. Biometrika, 34, 28–35.
Wong,L. et al. (2015) Detection of protein–protein interactions from amino acid sequences using a rotation forest model with a novel PR-LPQ descriptor. In: Advanced Intelligent Computing Theories and Applications. Springer, Cham, pp. 713–720.
Yang,L. et al. (2010) Prediction of protein–protein interactions from protein sequence using local descriptors. Protein Pept. Lett., 17, 1085–1090.
Yin,W. and Schütze,H. (2015) Convolutional neural network for paraphrase identification. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 901–911. ACL, Denver, Colorado.
Yin,W. et al. (2016) ABCNN: attention-based convolutional neural network for modeling sentence pairs. TACL, 4, 259–272.
You,Z.-H. et al. (2013) Prediction of protein–protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics, 14, S10.
You,Z.-H. et al. (2014) Prediction of protein–protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics, 15, S9.
You,Z.-H. et al. (2015) Predicting protein–protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the Random Forest. PLoS One, 10, e0125811.
Yugandhar,K. and Gromiha,M.M. (2014) Protein–protein binding affinity prediction from amino acid sequence. Bioinformatics, 30, 3583–3589.
Zhang,S. et al. (2016) A deep learning framework for modeling structural features of RNA-binding protein targets. Nucleic Acids Res., 44, e32.
Zhang,Y. et al. (2017) Very deep convolutional networks for end-to-end speech recognition. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4845–4849.
Zhou,T. et al. (2017) Attention-based natural language person retrieval. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 27–34.
Zhu,H. et al. (2006) NOXclass: prediction of protein–protein interaction types. BMC Bioinformatics, 7, 27.
Zhu,J. et al. (2009) Multi-class AdaBoost. Stat. Interface, 2, 349–360.
