MONN: A Multi-Objective Neural Network For Predicting Compound-Protein Interactions and Affinities
In Brief
Identifying compound-protein interactions is one of the essential challenges in drug discovery. We developed MONN, a multi-objective neural network, which not only accurately predicts the binding affinities but also successfully captures the non-covalent interactions between compounds and proteins. MONN can prove to be a useful tool in exploring compound-protein interactions.
[Graphical abstract: the prediction modules of MONN connect a database of complex structures (pairwise non-covalent interactions) and a database of measured binding affinities to potential applications, including understanding mechanisms of action and virtual high-throughput screening of drugs.]
Highlights
- MONN models compound-protein interactions from structure-free information
Shuya Li,1,5 Fangping Wan,1,5 Hantao Shu,1 Tao Jiang,2,3 Dan Zhao,1,* and Jianyang Zeng1,4,6,*
1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
2Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
3Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
4MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
5These authors contributed equally
6Lead Contact
SUMMARY
Computational approaches for understanding compound-protein interactions (CPIs) can greatly facilitate drug development. Recently, a number of deep-learning-based methods have been proposed to predict binding affinities and attempt to capture local interaction sites in compounds and proteins through neural attentions (i.e., neural network architectures that enable the interpretation of feature importance). Here, we compiled a benchmark dataset containing the inter-molecular non-covalent interactions for more than 10,000 compound-protein pairs and systematically evaluated the interpretability of neural attentions in existing models. We also developed a multi-objective neural network, called MONN, to predict both non-covalent interactions and binding affinities between compounds and proteins. Comprehensive evaluation demonstrated that MONN can successfully predict the non-covalent interactions between compounds and proteins that cannot be effectively captured by neural attentions in previous prediction methods. Moreover, MONN outperforms other state-of-the-art methods in predicting binding affinities. Source code for MONN is freely available for download at https://github.com/lishuya17/MONN.
308 Cell Systems 10, 308–322, April 22, 2020 ª 2020 The Author(s). Published by Elsevier Inc.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
To overcome the current limitations of the structure-based computational methods, a number of structure-free models (Cichonska et al., 2017; Airola and Pahikkala, 2018; Tsubaki et al., 2019; Gao et al., 2018; Karimi et al., 2019; Wan et al., 2019; Öztürk et al., 2018) have been developed for CPI prediction. An example is the similarity-based methods that take similarity matrices as descriptors of both compounds and proteins (Cichonska et al., 2017; Airola and Pahikkala, 2018). These methods mainly focus on the global similarities of entire compounds or proteins, while ignoring the detailed compositions of each molecule.

Conversely, deep-learning-based methods (Tsubaki et al., 2019; Gao et al., 2018; Karimi et al., 2019; Öztürk et al., 2018) fully exploit the local features of both input compound structures and protein sequences to predict their binding affinities. DeepDTA (Öztürk et al., 2018) and DeepAffinity (Karimi et al., 2019) are representatives of deep-learning-based models that require only simplified molecular-input line-entry system (SMILES) strings of compounds and primary sequences of proteins as input. They employ widely used deep neural network architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to process and extract contextual features from the input sequence data. Another group of methods rely on graph-based representations to encode the molecular features of compounds (Tsubaki et al., 2019; Gao et al., 2018), in which vertices represent atoms and edges represent chemical bonds. Graph convolution algorithms (Lei et al., 2017) are then applied to extract useful molecular features from such graph representations of compounds. Although these structure-free methods can successfully predict the binding affinity between each pair of compound and protein, their interpretability is still limited due to the lack of structural information.

A fraction of these structure-free methods make use of neural attentions, which have been widely used in the deep-learning community to guide models to focus on "important" features and thus increase the interpretability of the prediction results (Vaswani et al., 2017; Santos et al., 2016). For the CPI prediction tasks (Tsubaki et al., 2019; Gao et al., 2018; Karimi et al., 2019), attentions are expected to be able to capture the local binding sites mediated by non-covalent interactions (e.g., hydrogen bonds and hydrophobic effects) between compounds and proteins. Although these methods demonstrated in a few examples that real binding sites of compounds or proteins were enriched in their attention-highlighted regions, systematic comparison and evaluation of this learning capacity are still lacking, probably due to the absence of benchmark datasets and evaluation standards. In this work, we constructed a benchmark dataset containing pairwise non-covalent interactions between atoms of compounds and residues of proteins for more than 10,000 compound-protein pairs and comprehensively evaluated the interpretability of different neural-attention-based frameworks. Tests on our constructed benchmark dataset showed that current neural-attention-based approaches have difficulty in automatically capturing the accurate local non-covalent interactions between compounds and proteins without extra supervised guidance. Based on this observation, we developed MONN, a multi-objective neural network, to learn both pairwise non-covalent interactions and binding affinities between compounds and proteins. MONN is a structure-free model that takes only graph representations of compounds and primary sequences of proteins as input, with the capacity to handle large-scale datasets with relatively low computational complexity. The input information is processed by graph convolution networks and CNNs, but MONN differs from previous CPI prediction methods in the following aspects: (1) MONN uses a graph warp module (Ishiguro et al., 2019) in addition to a traditional graph convolution module (Lei et al., 2017) to learn both a global feature for the whole compound and local features for individual atoms of the compound, to better capture the molecular features of compounds; (2) MONN contains a pairwise interaction prediction module, which can capture the non-covalent interactions between atoms of a compound and residues of a protein with extra supervision from the labels extracted from available high-quality 3D compound-protein complex structures; and (3) in MONN, the pairwise non-covalent interaction prediction results are further utilized to benefit the prediction of binding affinities, by effectively incorporating the shared information between compound and protein features into the downstream affinity prediction module.

Comprehensive cross-validation tests on our constructed benchmark dataset demonstrated that MONN can successfully learn the pairwise non-covalent interactions derived from high-quality structural data, even using the 3D structure-free information as input. We also used an additional test dataset constructed from the Protein Data Bank (PDB) (Berman et al., 2000) to further validate the generalization ability of MONN. Moreover, extensive tests showed that MONN can achieve superior performance in predicting CPI-binding affinities over other state-of-the-art structure-free models. In addition, although chemical rules, such as the correlation of hydrophobicity scores between compounds and proteins and the preference of atom and residue types for hydrogen bonds and π-stacking interactions, are not explicitly incorporated into the prediction framework, such features can still be effectively captured by MONN. All these results suggested that MONN can provide a useful tool for effectively modeling CPIs both locally and globally, and thus greatly facilitate the drug discovery process.

RESULTS

The Network Architecture of MONN Is Designed for Solving a Multi-objective Machine Learning Problem
MONN is an end-to-end neural network model (Figures 1 and 2) with two training objectives, whose main concept and key methodological terms are explained in Primer (Box 1) and Glossary (Box 2). One objective of MONN is to predict the non-covalent interactions between the atoms of a compound and the residues of its protein partner. We first define a pairwise interaction matrix to describe the non-covalent interactions between the input compound and protein pair. More specifically, for a compound with Na non-hydrogen atoms and a protein with Nr residues, their pairwise interaction matrix P is defined as an Na × Nr binary matrix, in which each element Pij (i = 1, 2, …, Na and j = 1, 2, …, Nr) indicates whether there exists a non-covalent interaction (1 for existence, and 0 otherwise) between the i-th atom of the compound and the j-th residue of the protein when forming a complex structure. The interaction sites of the compound or protein can then be derived from this pairwise interaction matrix by maximizing over rows or columns (Figure 1A). The other objective of MONN is to predict the binding affinities (e.g., Ki, Kd, or
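The pairwise interaction matrix and the row/column maximization that yields interaction sites can be sketched as follows (a minimal NumPy illustration; the matrix values and sizes are toy assumptions, not data from the paper):

```python
import numpy as np

# Toy pairwise interaction matrix P for a compound with Na = 4
# non-hydrogen atoms and a protein with Nr = 6 residues.
# P[i, j] = 1 if atom i and residue j form a non-covalent interaction.
P = np.array([
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
])

# Interaction sites of the compound: maximize over residues (columns),
# giving one binary label per atom.
atom_sites = P.max(axis=1)      # [1, 1, 0, 1]

# Interaction sites of the protein: maximize over atoms (rows),
# giving one binary label per residue.
residue_sites = P.max(axis=0)   # [0, 1, 1, 0, 1, 0]
```

The same marginalization applies to a real-valued predicted matrix, where the maxima become per-atom and per-residue interaction scores.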
[Figure residue. Recoverable content: panels showing the predicted pairwise non-covalent interactions and the predicted binding affinity; the graph convolution module, in which initial atom features (atom type, degree, valence, aromaticity) are updated by a message passing unit and a graph warp unit that exchanges local information (atoms) with global information (super node) to output updated atom and super node features; and the CNN module, in which the input protein sequence is Blosum62-encoded into initial residue features and passed through convolutional layers to output updated residue features.]
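The Blosum62 encoding sketched in the figure maps each residue to a fixed feature vector (in MONN, presumably the corresponding row of the BLOSUM62 substitution matrix) before the convolutional layers. A minimal sketch; the three-letter alphabet and matrix values below are illustrative placeholders, not the real 20×20 BLOSUM62 matrix:

```python
import numpy as np

# Toy substitution-matrix rows: each residue type maps to a fixed
# feature vector (placeholder values for a 3-letter alphabet).
BLOSUM_ROWS = {
    "M": np.array([5.0, -1.0, -2.0]),
    "H": np.array([-1.0, 8.0, 0.0]),
    "S": np.array([-2.0, 0.0, 4.0]),
}

def encode_sequence(seq):
    """Encode a protein sequence as an (Nr x d) matrix, one row per residue."""
    return np.stack([BLOSUM_ROWS[r] for r in seq])

# The resulting matrix would be the input to 1D convolutional layers.
X = encode_sequence("MHS")  # shape (3, 3)
```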
Box 1. Primer
One of the key steps in drug discovery is to characterize the small-molecule ligands of protein targets. However, it is still difficult to experimentally measure large-scale CPIs in an efficient way. Computational methods have been developed in this field to facilitate the discovery of hit or lead ligands for protein targets. Currently, there are still several challenges in establishing computational models for compound-protein interaction prediction:
- The first challenge is the accuracy of predictions, which is also the most important goal of computational methods. Machine learning methods rely heavily on the amount and quality of training data to learn the regularities of the chemical or biological objects. However, there exists certain bias in many datasets for drug discovery. For example, similar compounds or proteins may be over-represented in the datasets, resulting from characterizing analogs of lead compounds and investigating important target proteins identified earlier. Such bias may lead to overfitting of computational models and thus inaccurate reports of model performance.
- The second challenge is the limited accessibility of structural data at atomic resolution. Most structure-based machine learning methods for CPI prediction rely heavily on compound-protein complex structures as their input, thus limiting their applications. In addition, these structure-based CPI prediction methods, including traditional CADD methods, generally require immense computational resources when processing the enormous numbers of atom coordinates in 3D structures.
- The third challenge is the interpretability of current deep-learning-based models, especially the structure-independent ones. Although deep neural networks are powerful, they are also well known for their black-box nature. Although some of them attempted to seek explanations through the attention mechanism (which reveals the "focuses" of computational models), their performance is still quite limited, according to our systematic tests on a benchmark dataset.
In this paper, we developed a computational framework, called MONN, to address these challenges (Figure 1):
- First, through comprehensive evaluation of our model, we demonstrated that MONN can achieve superior performance over existing state-of-the-art CPI prediction methods. To avoid the bias introduced by similar compounds and proteins in training data, clustering-based cross-validation schemes were used to evaluate the ability of our model to make predictions for compounds or proteins that are dissimilar to the training data.
- Second, our model takes only protein sequences and chemical structures of compounds as input. Although during the training process we can incorporate non-covalent interaction labels derived from structural data to provide additional support for the affinity prediction task, during the application phase our model does not require structural information as its input.
- Third, the network architecture of our model allows the extraction of contextual features from individual molecular components (i.e., atoms of compounds and residues of proteins), followed by the prediction of local non-covalent interactions between compounds and proteins. In addition to the successful prediction of binding affinities by MONN, the predicted local interactions can help provide useful mechanistic insights underlying CPI events.
IC50), which can also be regarded as a global measurement of the binding strength between a protein and its ligand. The (predicted) binding affinity can be denoted by a real number a ∈ ℝ.

An input chemical compound with Na atoms can be represented by a graph G = {V, E}, where each node vi ∈ V, i = 1, 2, …, Na, corresponds to the i-th non-hydrogen atom in the compound, and each edge e(i1,i2) ∈ E, i1, i2 ∈ {1, 2, …, Na}, corresponds to a chemical bond between the i1-th and the i2-th atoms. An input protein with Nr residues can be represented by a string of its primary sequence, denoted by S = (r1, r2, …, rNr), where each rj, j = 1, 2, …, Nr, is either one of the 20 standard amino acids or a letter "X" for any non-standard amino acid. Given a graph representation of a compound and a string representation of a protein sequence, our model is expected to output a predicted pairwise non-covalent interaction matrix P ∈ ℝ^(Na×Nr) and an estimated binding affinity value a ∈ ℝ.

MONN consists of four modules: (1) a graph convolution module for extracting the features of both individual atoms and the whole compound from a given molecular graph (Figure 2B); (2) a CNN module for extracting the features of individual residues from a given protein sequence (Figure 2C); (3) a pairwise interaction prediction module for predicting the probability of a non-covalent interaction between any atom-residue pair from the previously learned atom and residue features (Figure 3A); and (4) an affinity prediction module for predicting the binding affinity between the given pair of compound and protein, using the previously extracted molecular features, as well as the derived pairwise interaction matrix (Figure 3B). The graph convolution module and the CNN module effectively extract information from the local contexts for atoms of compounds and residues of proteins, and the pairwise interaction prediction module infers the potential non-covalent interactions from the previously learned local features. The basic idea of the affinity prediction module is to integrate information from both compounds and proteins to benefit the prediction of their binding affinities. During this process, the predicted non-covalent interactions are used to enable information sharing between the components of

[Figure 1 legend, continued] …interaction prediction module to derive the predicted pairwise interaction matrix, which also enables one to construct the links between atoms of the compound and residues of the protein. Finally, an affinity prediction module is used to integrate information from atom features, residue features, and the previously derived pairwise interactions to predict the binding affinity.
(B) The graph convolution module for encoding the molecular features of an input compound.
(C) The CNN module for encoding the features of an input protein sequence. More details can be found in STAR Methods.
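The graph representation G = {V, E} described above can be sketched as a simple adjacency-list construction with one neighbor-aggregation step, the basic operation underlying graph convolution; the atom labels and feature values below are toy assumptions, not MONN's actual featurization:

```python
import numpy as np

# Toy compound: 4 non-hydrogen atoms (nodes V) and 3 bonds (edges E).
# Edges are index pairs (i1, i2) into the atom list, as in G = {V, E}.
atoms = ["N", "C", "C", "S"]
edges = [(0, 1), (1, 2), (2, 3)]

# Initial per-atom feature vectors (placeholders; MONN derives these
# from atom type, degree, valence, and aromaticity).
h = np.eye(4)

# Build adjacency lists from the undirected edge set.
neighbors = {i: [] for i in range(len(atoms))}
for i1, i2 in edges:
    neighbors[i1].append(i2)
    neighbors[i2].append(i1)

# One message-passing step: each atom adds its neighbors' features to
# its own, so local features come to reflect the chemical context.
h_new = np.stack([h[i] + sum(h[j] for j in neighbors[i])
                  for i in range(len(atoms))])
```

A graph warp unit would additionally maintain a "super node" feature that exchanges information with all atoms, yielding the global compound feature mentioned above.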
Box 2. Glossary
Pairwise Interactions
To computationally describe the non-covalent interactions between a compound-protein pair, we first regard the compound as a
list of atoms and the protein as a list of residues. A pairwise interaction matrix is represented by a [number of atoms]-by-[number of
residues] matrix in which each element is a binary value indicating whether the corresponding atom-residue pair has an interaction
or not.
Clustering Threshold
We use hierarchical clustering to split all the compounds (proteins) into groups (i.e., clusters) based on their similarities. A clustering threshold determines the minimal distance between clusters. For example, a threshold of 0.3 for compound clusters means that any two compounds from different clusters have at least 30% difference in their chemical structures.
Neural Attentions
Neural attentions are generally designed to capture the importance of different input positions to the final prediction in deep-
learning models. They are often realized by calculating a ‘‘weight’’ for each input position, which thus can provide certain interpret-
ability about the contributions of individual input positions to the final prediction results (Vaswani et al., 2017; Santos et al., 2016).
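The clustering-based splitting described in the Glossary can be sketched with SciPy's hierarchical clustering; the distance matrix below is a toy stand-in for the pairwise compound dissimilarities (e.g., 1 minus chemical similarity), and single linkage is an illustrative choice rather than the paper's documented setting:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy pairwise dissimilarity matrix for 4 compounds
# (1 - similarity; values are illustrative).
D = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])

# Hierarchical clustering; cutting the dendrogram at distance 0.3 means
# any two compounds in different clusters differ by at least 30%.
Z = linkage(squareform(D), method="single")
labels = fcluster(Z, t=0.3, criterion="distance")
```

Cross-validation folds are then formed from whole clusters, so that test compounds (or proteins) stay dissimilar to the training data.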
compounds and proteins. Details about each module of MONN and the training process can be found in STAR Methods.

Systematic Evaluation Indicates the Limited Interpretability of Neural Attentions in CPI Prediction Models
A number of deep-learning-based methods (Tsubaki et al., 2019; Gao et al., 2018; Karimi et al., 2019; Wan et al., 2019; Öztürk et al., 2018) have been developed previously for modeling CPIs from 3D structure-free inputs. Despite their success in predicting binding affinities with relatively low computational complexity, interpretability is still considered a challenge for these structure-independent methods. Several recent studies (Tsubaki et al., 2019; Gao et al., 2018; Karimi et al., 2019) sought interpretability by incorporating neural attentions (i.e., weighing the contributions of individual elements in the given input to the final predictions) into their model architectures. The attention weight can be regarded as a measure of importance of the feature at each position (e.g., an atom or a residue), and thus such an attention mechanism is expected to be able to explain the interaction sites between compounds and proteins. For example, Tsubaki et al. developed an end-to-end neural network with attentions for protein sequences (Tsubaki et al., 2019). They showed two examples in which the attention-highlighted regions were able to capture the real interaction sites in proteins. The method developed by Gao et al. involved both compound and protein attentions (Gao et al., 2018). By visualizing the attention weights, the authors demonstrated that the attention-highlighted regions derived from their model can successfully identify the interaction interface in a compound-protein complex. DeepAffinity reported an enrichment of true interaction sites in regions with high attention scores in protein sequences for several examples (Karimi et al., 2019).
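Reduced to its core, the attention mechanism described above computes a softmax-normalized weight per input position; the sketch below (with toy relevance scores, not any of the cited models' architectures) shows how such weights are obtained and then read out as per-position importances:

```python
import numpy as np

def attention_weights(scores):
    """Softmax over per-position relevance scores: positive weights summing to 1."""
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

# Toy relevance scores for 5 residue positions; the highest-scoring
# position receives the largest attention weight.
w = attention_weights(np.array([0.1, 2.0, 0.3, -1.0, 0.5]))

# In a model, position features are combined as a w-weighted sum, and w
# itself is interpreted as the "importance" of each position.
```

The evaluation question in this section is precisely whether these weights, trained only against affinity labels, align with the true interaction sites.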
Figure 4. Performance Evaluation on the Interpretability of Different Neural Attentions and MONN for Predicting Non-covalent Interactions between Compounds and Proteins
(A) Evaluation on the PDBbind-derived benchmark dataset. Average AUC scores and average enrichment scores were used for evaluating the prediction of interaction sites (atoms) in compounds under the new-compound setting, interaction sites (residues) in proteins under the new-protein setting, and pairwise non-covalent interactions between compounds and proteins under the both-new setting. The mean values and standard deviations over 10 repeats of cross-validation with clustering threshold 0.3 are plotted. The ratios of positive and negative labels are about 1:1.44, 1:46.5, and 1:605 under these three cross-validation settings, respectively.
(B–E) Validating MONN on an additional test set derived from the PDB (Wang et al., 2005).
(B) The distribution of AUC scores for all the compound-protein pairs.
Under different prediction tasks and cross-validation settings, all four types of neural attentions achieved average AUC and enrichment scores around 0.5 and 1, respectively, which were close to the scores of random predictions (Figures 4 and S3). These results suggested that, although the attention-highlighted regions and the real binding sites displayed accordance in some cases (Tsubaki et al., 2019; Gao et al., 2018; Karimi et al., 2019), they only showed poor correlation in a comprehensive test on a large-scale dataset. Thus, it seems not possible to derive accurate predictions of non-covalent interactions between compounds and proteins from attention-based models trained using only binding affinity labels (i.e., without pairwise interaction labels).

MONN Successfully Predicts Pairwise Non-covalent Interactions with Extra Supervision
Based on the above observation that neural attentions cannot automatically capture the non-covalent interactions between compounds and proteins, we speculated that extra supervision information can be used to guide our model to capture such local interactions. Instead of using attention mechanisms, MONN uses an individual module (i.e., the pairwise interaction prediction module) to learn the pairwise non-covalent interactions from given labels (STAR Methods). Meanwhile, by marginalizing the predicted pairwise interaction matrix, the predicted interaction sites in either compounds or proteins can also be derived.

The cross-validation settings and the metrics for evaluating the pairwise non-covalent interaction prediction results of our model were the same as described in the previous section and STAR Methods. As shown in Figure 4A, our model achieved average AUC scores of 0.837, 0.763, and 0.821 and average enrichment scores of 1.63, 10.8, and 11.3 under the three application settings (i.e., new-compound, new-protein, and both-new settings), respectively. Note that the values of the enrichment scores were not comparable among these three settings, due to the different ratios of positive-negative labels (STAR Methods). A more comprehensive comparison test (Figure S1) on our model and different neural attentions was performed for different prediction goals, cross-validation settings, and clustering thresholds, which showed that the predictions of MONN are effective and robust (average AUC scores decreased less than 5% with the clustering threshold increasing from 0.3 to 0.6). These results suggested that while the neural attentions have difficulty in interpreting the non-covalent interactions, MONN is able to accurately predict such interactions between compounds and proteins under different cross-validation settings.

The distributions of AUC scores of compound-protein pairs achieved by MONN for predicting the pairwise interactions under the three cross-validation settings are shown in Figure S2A. The results indicated that different compound-protein pairs indeed can have distinct performance. To further explore the potential factors affecting the performance of MONN on individual samples, we also examined the relationships between the achieved AUC scores and various properties of the test compound-protein pairs. It seemed that the occurrence of the same compound in training data did not affect the prediction performance much (Figure S2B), but the prediction performance was obviously affected by whether the same protein occurred in training data or not (Figure S2C). Also, molecular weights, logP values of compounds, and sequence lengths of proteins may slightly influence the prediction performance (Figures S2D–S2F). In addition, MONN may perform better on certain protein families (e.g., kinases) than on others (Figures S2G–S2I).

To further examine the generalization ability of our model, we also validated MONN on an additional independent dataset containing pairwise non-covalent interactions between compounds and proteins. As our training data (i.e., the benchmark dataset derived from the PDBbind v2018 [Wang et al., 2005, 2004]) included all the high-quality structures of compound-protein complexes released in the PDB (Berman et al., 2000) before 2018, we also constructed an additional test dataset by collecting all the compound-protein complexes from the PDB with release dates from Jan 1st, 2018 to March 31st, 2019 (STAR Methods). In this extra test, MONN achieved an average AUC of 0.859 and an average enrichment score of 112.47 in predicting pairwise interactions of compound-protein pairs on this additional dataset (Figure 4B).

To visualize the prediction results of our model, we selected three representative compound-protein pairs ranked around 10%, 50%, and 90% in terms of the AUC scores and plotted the corresponding true labels and the predicted interaction sites in the compound structures and protein sequences (Figures 4C–4E). The example pair ranked around top 10% was a tyrosine kinase inhibitor binding to TYK2 (Figure 4C, PDB ID: 6DBK) (Fensome et al., 2018). In this example pair, the top 40% of the predicted interaction sites (atoms) in the compound covered all the true interaction sites, and high prediction scores also appeared around the true interaction sites along the protein sequence. The example pair ranked around the median prediction score contained a compound binding to KRAS (Figure 4D, PDB ID: 6FA1) (Quevedo et al., 2018). The predicted interaction sites of the compound had several overlaps with true interaction sites (5/8 recall) but also several false positives. For example, the positively charged group in the compound was predicted as an interaction site, although it is actually located outside the binding pocket. The predicted interaction sites (residues) of the protein had several overlaps with the true labels, but also a number of false positives. The example pair ranked around 90% was a ligand binding to rhodopsin (Figure 4E, PDB ID: 6FK7) (Mattle et al., 2018). The deviation of the predicted interaction sites from true labels in this example was probably due to the scarcity of training data to support the prediction. All these visualization results demonstrated that the accuracies of MONN predictions were consistent with their corresponding rankings in AUC scores. Overall, the above comprehensive validation tests supported the strong predictive power of MONN.
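The per-pair evaluation above can be sketched as follows. The AUC is standard (rank-based, ignoring ties); the enrichment-score formula here, the mean prediction score over true interacting positions divided by the mean over all positions, is one plausible reading, since the exact definition is given in STAR Methods:

```python
import numpy as np

def auc_score(y_true, y_pred):
    """Rank-based (Mann-Whitney) AUC for binary labels, ignoring ties."""
    order = np.argsort(y_pred)
    ranks = np.empty(len(y_pred))
    ranks[order] = np.arange(1, len(y_pred) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def enrichment_score(y_true, y_pred):
    """Mean predicted score over true interacting positions vs. overall mean."""
    return y_pred[y_true == 1].mean() / y_pred.mean()

# Toy labels and predictions for 5 atom-residue pairs.
y_true = np.array([0, 0, 1, 0, 1])
y_pred = np.array([0.1, 0.2, 0.9, 0.3, 0.8])
```

Here both positives outrank every negative, so the AUC is 1.0; an enrichment score near 1 (as the attentions achieved) means predicted scores at true sites are no higher than average.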
(C–E) Three example pairs ranked around 10% (C), 50% (D), and 90% (E) in terms of AUC scores for the pairwise interaction prediction. We show the numbers of
occurrences of the same pair, the same compound, and the same protein in training data, as well as the AUC scores and the corresponding ranks for the
predicted pairwise interactions, the interaction sites (atoms) in compounds, and interaction sites (residues) in proteins. In the compound structures, true labels
and top 40% predicted interaction sites are marked in red using RDKit (Landrum, 2006). In the protein sequences, the true labels and the MONN-predicted scores
for individual positions are plotted.
Figure 5. Evaluating MONN Using Binding Affinity Data and Known Chemical Rules
(A) Performance evaluation for MONN and baseline methods on binding affinity prediction, on both the IC50 and KIKD datasets. Pearson correlations achieved by MONN with single (denoted as MONNsingle) or multiple (denoted as MONNmulti) training objectives and four baseline methods, under three different cross-validation settings and four different clustering thresholds, are shown. The mean values and standard deviations over 10 repeats of cross-validation are plotted.
(B–E) Correlations between the hydrophobicity scores of the compounds and the corresponding interaction sites (residues) in the proteins.
(B) The interaction residues were derived from the pairwise interaction labels of the benchmark dataset.
(C) The interaction residues of the proteins were randomly selected from the protein sequences. Here, the number of selected residues was the same as the number of true interaction sites in each protein sequence.
(D) The interaction residues of the proteins were predicted by MONN. Using a 9-fold cross-validation on the benchmark dataset under the both-new setting with a clustering threshold of 0.3, the interaction residues were derived from the predicted pairwise interaction matrices of the test samples for each fold.
(E) The interaction residues were predicted for the compound-protein pairs in the additional test dataset, while the model was trained using the benchmark dataset.
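The hydrophobicity-correlation analysis in Figure 5B–5E can be sketched as follows. The Kyte-Doolittle hydropathy values below are standard, but pairing each compound's logP with the mean hydropathy of its interaction residues is our illustrative reading of the analysis, and the data points are toy values:

```python
import numpy as np

# Kyte-Doolittle hydropathy values for a few residue types.
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "A": 1.8, "S": -0.8, "D": -3.5, "K": -3.9}

def mean_hydropathy(residues):
    """Average hydropathy over a compound's interaction residues."""
    return np.mean([KD[r] for r in residues])

# Toy data: per-compound logP and interaction residues (illustrative).
logp = np.array([3.1, 1.2, -0.5, 4.0])
sites = [["I", "L", "V"], ["S", "A"], ["D", "K"], ["L", "I"]]

scores = np.array([mean_hydropathy(s) for s in sites])
r = np.corrcoef(logp, scores)[0, 1]  # Pearson correlation
```

A positive r, as with the true labels in Figure 5B, indicates that hydrophobic compounds tend to contact hydrophobic residues, the chemical rule the model is shown to recover without explicit supervision.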
Table 1. Performance Evaluation of Different Prediction Approaches on the BindingDB Dataset

Method                                    RMSE    Pearson Correlation
DeepAffinity (Single Model)               0.74    0.84
DeepAffinity (Parameter Ensemble)         0.73    0.84
DeepAffinity (Parameter + NN Ensemble)    0.71    0.86
DeepDTA (Single Model)                    0.782   0.848
DeepDTA (Ensemble of 30 Models)           0.686   0.886
MONN (Single Model)                       0.764   0.858
MONN (Ensemble of 30 Models)              0.658   0.895

The RMSE and Pearson correlation of DeepAffinity are adopted from the original paper (Karimi et al., 2019), in which "parameter ensemble" means averaging the predictions over the last 10 epochs, and "parameter + NN ensemble" means averaging predictions over the last 10 epochs of three networks with different hyper-parameter settings (i.e., averaging over 30 predictions).

MONN Successfully Predicts Binding Affinities with Single- and Multi-objective Learning
In this section, we examined the affinity prediction performance of MONN and compared it to that of other state-of-the-art models. For the binding affinity prediction task, we separated our PDBbind-derived dataset into two subsets, named IC50 (which contained IC50 values) and KIKD (which contained both Ki and Kd values). The main reason for this separation was that IC50 values generally depend on experimental conditions and are thus often considered noisier than the measured Ki and Kd values. Here, the IC50 dataset with the new-compound setting and clustering threshold 0.3 was used for hyper-parameter calibration. More details about training and hyper-parameter selection can be found in STAR Methods.

We considered the following state-of-the-art baseline methods for comparison: the similarity-based kernel method CGKronRLS (Cichonska et al., 2017) and the deep-learning-based methods DeepDTA (Öztürk et al., 2018), the method by Tsubaki et al. (Tsubaki et al., 2019), and DeepAffinity (Karimi et al., 2019). As in the previous sections, MONN and these baseline methods were evaluated under three different settings of clustering-based cross-validation (i.e., new-compound, new-protein, and both-new), in terms of Pearson correlation (Figure 5) and root mean squared error (RMSE, Figure S3). To investigate whether the extra supervision from the pairwise interaction labels can help predict the binding affinities, we mainly tested MONN under two conditions: one was a single-objective model, denoted as MONNsingle, which used only the affinity labels as supervision information, while the other was a multi-objective model, denoted as MONNmulti, which considered both pairwise interactions and binding affinities in the training objectives.

Our tests showed that both MONNsingle and MONNmulti outperformed the other baseline methods in all three cross-validation settings with different clustering thresholds, on both the IC50 and KIKD datasets (Figure 5). In particular, compared with the baseline methods, the multi-objective model (MONNmulti) achieved an increase in Pearson correlation of up to 3.6% (average 2.3%). In addition, the multi-objective model performed slightly better than the single-objective one, which indicated that incorporating extra supervision information from pairwise interaction labels can further improve the binding affinity prediction task.

Since compound-protein complexes generally have limited structural availability, we further tested our model on a large-scale structure-free CPI dataset. To the best of our knowledge, among the baseline methods, only DeepAffinity has been evaluated previously on a large dataset with more than 260,000 training samples and more than 110,000 test samples, with the IC50 values derived from the BindingDB database (Gilson et al., 2016). We followed the same experimental settings as in DeepAffinity and also tested MONN and DeepDTA on the same dataset. The method by Tsubaki et al. and CGKronRLS are not suitable for this test, mainly due to their limited scalability in processing such a large dataset. To make a fair comparison, we also evaluated an ensemble version (i.e., averaging predictions from several single models) of MONN on this BindingDB dataset, as in the DeepAffinity paper (Karimi et al., 2019) (details can be found in STAR Methods). The performances of DeepAffinity, DeepDTA, and MONN were evaluated in terms of RMSE and Pearson correlation, as listed in Table 1. When evaluating the single models, MONN achieved the best Pearson correlation (0.858). For all the methods, performance can be largely improved by using the ensemble-based models. Among them, the ensemble version of MONN achieved the best performance (RMSE 0.658 and Pearson correlation 0.895). This comparison result suggested that MONN can achieve better performance than the state-of-the-art baseline methods even when structural data are not available.

To make a direct comparison between MONN and existing structure-based CPI prediction methods (including molecular docking and deep-learning-based models [Koes et al., 2013; Wallach et al., 2015; Ragoza et al., 2017; Gonczarek et al., 2018; Torng and Altman, 2019; Lim et al., 2019]), we also evaluated our model on the DUD-E dataset (Mysinger et al., 2012), which has been widely used as a benchmark for evaluating structure-based CPI prediction tasks. Among the existing structure-based methods, Smina is a molecular docking method (Koes et al., 2013), while AtomNet and the methods by Lim et al., Gonczarek et al., Torng et al., and Ragoza et al. are deep-learning-based methods dealing with structural input information. Although the method by Gonczarek et al. is structure independent, we still included it in the comparison, as it was claimed to outperform most of the structure-based methods (Gonczarek et al., 2018). The DUD-E dataset contains 22,805 active compounds and 1,411,214 decoys (i.e., inactive compounds) for 102 proteins in total. Using the same training-test splitting strategy as in Wallach et al. (2015), we evaluated MONN under the "new-protein" condition, with 72 proteins as training data and the remaining 30 proteins as test data. As shown in Table 2, the average AUC score over the 30 test proteins achieved by MONN was higher than those of the structure-based methods.
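The two metrics used in these comparisons, RMSE and Pearson correlation, can be computed as in the following stdlib-only sketch (the function names and toy values are ours, not from the MONN codebase):

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error between measured and predicted affinities.
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def pearson(y_true, y_pred):
    # Pearson correlation coefficient between measured and predicted values.
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)

# Toy affinities: predictions shifted by a constant offset are perfectly
# correlated (Pearson 1.0) but still incur a non-zero RMSE.
measured = [5.0, 6.2, 7.1, 8.3]
predicted = [5.5, 6.7, 7.6, 8.8]
```

The toy case illustrates why both metrics are reported together: Pearson correlation captures ranking agreement, while RMSE also penalizes systematic offsets.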
(F–H) Conditional likelihood scores (whose definition can also be found in the main text) measuring the preference of specific residue types given the properties of
interaction sites (atoms) of the compounds, including hydrogen-bond acceptor atoms (F), hydrogen-bond donor atoms (G), and aromatic atoms (H). For each
given type of atoms from the compounds, we considered the interaction residues for four different situations, as described in (B–E).
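The conditional likelihood score p(residue property = x | atom property = y) referenced in this legend and defined in the main text can be sketched as follows; the residue sets S(x) are taken from the text, while the interaction pairs below are invented toy data:

```python
# Residue sets S(x) from the main text: residues whose side chains carry
# at least one element satisfying property x (one-letter amino acid codes).
S = {
    "H-bond donor": set("HKNQRSTWY"),
    "H-bond acceptor": set("DEHNQSTY"),
    "aromatic": set("YWF"),
}

def conditional_likelihood(pairs, residue_property):
    """p(residue property = x | atom property = y) for a list of
    (atom_property, residue) interaction pairs sharing atom property y:
    the fraction of interacting residues that belong to S(x)."""
    residues = [res for _, res in pairs]
    if not residues:
        return 0.0
    hits = sum(1 for res in residues if res in S[residue_property])
    return hits / len(residues)

# Toy interactions: hydrogen-bond acceptor atoms paired with residues.
toy_pairs = [("acceptor", "S"), ("acceptor", "K"), ("acceptor", "R"),
             ("acceptor", "A"), ("acceptor", "T")]
```

In the paper, the pair lists come from PLIP-derived true labels, MONN predictions, or random choices; here they are purely illustrative.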
Table 2. Performance Evaluation of Different Prediction Approaches on the DUD-E Dataset

Method                                     Average AUC
Smina (Koes et al., 2013)                  0.7
AtomNet (Wallach et al., 2015)             0.855
Ragoza et al. (Ragoza et al., 2017)        0.868
Torng et al. (Torng and Altman, 2019)      0.886
Gonczarek et al. (Gonczarek et al., 2018)  0.904
Lim et al. (Lim et al., 2019)              0.968
MONN                                       0.974

The performances of previous methods were directly obtained from the original papers (Wallach et al., 2015; Ragoza et al., 2017; Gonczarek et al., 2018; Torng and Altman, 2019; Lim et al., 2019). Note that the train-test split schemes of some methods were slightly different: Lim et al. used 72 proteins as training data and 25 as test data (Lim et al., 2019); Torng et al. and Ragoza et al. used 4-fold and 3-fold cross-validation strategies to evaluate their models, respectively (Torng and Altman, 2019; Ragoza et al., 2017).

We also examined the running time of our model. One of the structure-based models, AtomNet, requires about a week of training on 6 Nvidia K10 GPUs, as stated in the original paper (Wallach et al., 2015). In our test, MONN was able to fit this DUD-E dataset in about 20 h on one GeForce GTX 1080Ti GPU.

MONN Captures the Global Molecular Property
From the perspective of chemical properties, the size, shape, and hydrophobicity of a protein-binding pocket are essential for its interaction with a compound (Volkamer et al., 2012). Information about the size and shape of a binding pocket is usually hard to derive only based on its raw sequence, so we mainly examined the hydrophobicity of the potential binding residues predicted by MONN, through calculating the correlation between the hydrophobicity scores of the entire compounds and the average hydrophobicity scores of the predicted interaction sites (residues) in the proteins. Here, the hydrophobicity of the compound was measured by the logP value calculated by RDKit (Landrum, 2006), which is defined as the log ratio of the solubility of the compound in an organic solvent (e.g., 1-octanol) against water (Wildman and Crippen, 1999). The hydrophobicity of the (predicted) interaction sites of a protein is defined as the average hydrophobicity score over the corresponding side chains (Lehninger et al., 2005). Here, the predicted interaction sites of the protein were selected from the top-scored atom-residue pairs in the predicted pairwise interaction matrix P, according to a cutoff value of mean(P) + 3 × std(P), where std(·) stands for the standard deviation. Next, the residues involved in the selected top atom-residue pairs were used for the downstream analysis.

The true interaction sites of proteins derived from the solved structures in the benchmark dataset showed a certain level of correlation (Pearson correlation 0.487) in hydrophobicity with their ligands (Figure 5B). As a control, no significant correlation was observed for randomly chosen residues (Figure 5C). The interaction sites of proteins predicted by MONN had similar correlations in hydrophobicity scores with their ligands (0.515 for cross-validation and 0.499 for the additional test dataset, Figures 5D and 5E), close to those of the true labels. Note that the correlation achieved by the prediction results was slightly higher than that of the true labels; this was probably because the information about hydrophobicity was somewhat over-represented in the predicted interaction sites selected according to the current threshold (i.e., 3 times the standard deviation above the mean). If we used a stricter threshold, e.g., 4 or 5 times the standard deviation above the mean, the resulting correlations (0.489 and 0.471, respectively) became closer to or lower than that of the true labels, which suggested that MONN may focus more on other features under stricter thresholds. These results indicated that the predictions of MONN can also well reflect the relationships between compounds and proteins in terms of the global molecular property (i.e., hydrophobicity).

MONN Captures the Chemical Rules of Non-covalent Interactions
The rules of non-covalent interactions and the information of interaction types between compounds and proteins are not explicitly incorporated into MONN. Nevertheless, we examined whether MONN can automatically capture such chemical rules. Among the three most common non-covalent interaction types between proteins and their ligands (i.e., hydrophobic interactions, hydrogen bonds, and π-stackings), we chose to analyze the preference of interaction partners for the atoms that can form hydrogen bonds or π-stackings. Hydrophobic interactions were not considered here, as hydrophobic carbons exist in all 20 types of residues.

To characterize the preference of residues with a specific property under a given atom type of their interaction partners, we first define the conditional likelihood score p(residue property = x | atom property = y) = (number of residues ∈ S(x) that interact with the atoms of property y) / (total number of residues interacting with the atoms of property y), where S(x) represents the set of residues whose side chains contain at least one kind of element satisfying the property x. To be more specific, S("H-bond donor") = {H, K, N, Q, R, S, T, W, Y}, in which each residue has at least one hydrogen-bond donor in its side chain. Similarly, S("H-bond acceptor") = {D, E, H, N, Q, S, T, Y}, and S("aromatic") = {Y, W, F}. The corresponding properties of atoms from the compounds were calculated using RDKit (Landrum, 2006). Here, we calculated the conditional likelihood scores under different situations, in which the pairwise interactions referred to in the above definition were obtained from either true labels, MONN predictions, or random choices (used as a control).

A hydrogen bond is generally formed between a hydrogen donor group and an acceptor group. When the atoms from compounds are hydrogen-bond acceptors, the conditional likelihood of hydrogen-bond donor residues as their interaction partners (0.63, calculated using true labels) was much higher than that of the control residues (0.38, calculated using the randomly chosen residues, Figure 5F). The conditional likelihood scores calculated using MONN-predicted interaction sites were also relatively high (0.62 for cross-validation and 0.64 for the additional test, Figure 5F). Similarly, the hydrogen-bond acceptor residues from the MONN prediction results also had significantly higher conditional likelihood scores than the random control when their interaction partners were the hydrogen-bond donor atoms from the compounds (Figure 5G).

The π-stacking interactions generally occur between aromatic rings. There are three amino acids containing aromatic rings, i.e., phenylalanine, tryptophan, and tyrosine. They generally had
higher conditional likelihood scores when their interaction partners were aromatic atoms from the compounds (0.44 calculated from true labels compared with 0.09 from random control, Figure 5H). In the MONN prediction results, the three aromatic residues also had higher conditional likelihood scores (0.35 for cross-validation and 0.37 for the additional test set, Figure 5H) than those from random control, which thus provided another piece of evidence to support the reasonableness of the MONN prediction results.

In summary, the above results indicated that MONN can correctly capture the preferred interaction partners for different types of atoms in the compounds, according to the possibility of forming different kinds of non-covalent interactions.

DISCUSSION

Accurately predicting CPIs can greatly facilitate the drug discovery process. While several deep-learning-based tools have been proposed to predict binding affinities and improve virtual high-throughput screening, our approach MONN goes further to explore more about the mechanisms underlying CPIs. In this work, we demonstrated that MONN can successfully predict the pairwise non-covalent interaction matrices, which can also be used to infer the interaction sites in compounds and proteins. Comparison tests showed that MONN can outperform other state-of-the-art machine learning methods in predicting binding affinities. Besides, the structure-free input of MONN allows it to have a wider range of applications than structure-dependent approaches. We also verified that the predictions of MONN are consistent with chemical rules, in terms of the correlation in hydrophobicity between interaction sites in compounds and proteins, and the preference of interaction partners for different atom types. All these results indicated that MONN can provide a powerful and useful tool to advance the drug development process.

MONN takes molecule graphs of compounds and protein sequences as input, which brings both advantages and limitations to our method. Structure-free inputs allow MONN to make predictions for proteins without known 3D structures. On the other hand, the sequence-only inputs may limit the amount of information directly conveyed into the model. Since most existing computer-aided drug design (CADD) tools rely on 3D structure data to predict binding poses, this kind of information is indeed useful for inferring the detailed binding mechanisms between proteins and compounds. For example, it would be beneficial if the region of the binding pocket in a protein were already defined before predicting its interaction with compounds. Although the definition of "interaction sites" in our problem setting is not equivalent to that of binding pockets, we checked the percentage of our predicted interaction sites included in the regions of known binding pockets provided by PDBbind (Wang et al., 2005) (76.8%, 38.5%, and 33.4% for
dant information extracted from high-quality 3D structures with the large-scale sequence data under the generalizable deep-learning-based frameworks.

STAR+METHODS

Detailed methods are provided in the online version of this paper and include the following:

- KEY RESOURCES TABLE
- LEAD CONTACT AND MATERIALS AVAILABILITY
- METHOD DETAILS
  - Detailed Implementation of Individual Modules in MONN
  - Training
  - Construction of the Benchmark Dataset
  - Construction of the Additional Test Dataset for Validating the Pairwise Non-covalent Interaction Predictions
  - Evaluation of Different Types of Neural Attentions
  - Implementation of the Tested Neural Attentions
  - Clustering-based Cross Validation
  - Hyper-parameter Selection
  - Evaluation of MONN and Other Methods on the BindingDB-derived Dataset
- QUANTIFICATION AND STATISTICAL ANALYSIS
- DATA AND CODE AVAILABILITY

SUPPLEMENTAL INFORMATION

Supplemental Information can be found online at https://doi.org/10.1016/j.cels.2020.03.002.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (61872216, 81630103, and 31900862), the Turing AI Institute of Nanjing, and the Zhongguancun Haihua Institute for Frontier Information Technology. The authors thank Mr. Tingzhong Tian and Dr. Hailin Hu for helpful discussions about this work and suggestions for the manuscript.

AUTHOR CONTRIBUTIONS

J.Z. and D.Z. conceived and supervised the research project. S.L. and F.W. developed the method. S.L., F.W., H.S., and T.J. conducted the analyses. All the authors contributed to the manuscript.

DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: January 21, 2020
Revised: February 19, 2020
Accepted: March 5, 2020
Published: April 2, 2020
REFERENCES

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. (2000). The Protein Data Bank. Nucleic Acids Res 28, 235–242.

Bickerton, G.R., Paolini, G.V., Besnard, J., Muresan, S., and Hopkins, A.L. (2012). Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98.

Chen, X., Yan, C.C., Zhang, X., Zhang, X., Dai, F., Yin, J., and Zhang, Y. (2016). Drug–target interaction prediction: databases, web servers and computational models. Brief. Bioinform. 17, 696–712.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv https://arxiv.org/abs/1412.3555v1.

Cichonska, A., Ravikumar, B., Parri, E., Timonen, S., Pahikkala, T., Airola, A., Wennerberg, K., Rousu, J., and Aittokallio, T. (2017). Computational-experimental approach to drug-target interaction mapping: a case study on kinase inhibitors. PLoS Comput. Biol. 13, e1005678.

Fensome, A., Ambler, C.M., Arnold, E., Banker, M.E., Brown, M.F., Chrencik, J., Clark, J.D., Dowty, M.E., Efremov, I.V., Flick, A., et al. (2018). Dual inhibition of TYK2 and JAK1 for the treatment of autoimmune diseases: discovery of ((S)-2,2-difluorocyclopropyl)((1R,5S)-3-(2-((1-methyl-1H-pyrazol-4-yl)amino)pyrimidin-4-yl)-3,8-diazabicyclo[3.2.1]octan-8-yl)methanone (PF-06700841). J. Med. Chem. 61, 8597–8612.

Gao, K.Y., Fokoue, A., Luo, H., Iyengar, A., Dey, S., and Zhang, P. (2018). Interpretable drug target prediction using deep neural representation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp. 3371–3377.

Gilson, M.K., Liu, T., Baitaluk, M., Nicola, G., Hwang, L., and Chong, J. (2016). BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 44, D1045–D1053.

Gonczarek, A., Tomczak, J.M., Zaręba, S., Kaczmar, J., Dąbrowski, P., and Walczak, M.J. (2018). Interaction prediction in structure-based virtual screening using deep learning. Comput. Biol. Med. 100, 253–258.

Gower, J.C., and Ross, G.J.S. (1969). Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics) 18, 54–64.

Henikoff, S., and Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919.

Inglese, J., and Auld, D.S. (2008). High throughput screening (HTS) techniques: applications in chemical biology. In Wiley Encyclopedia of Chemical Biology, T.P. Begley, ed. (American Cancer Society), pp. 1–15.

Ishiguro, K., Maeda, S., and Koyama, M. (2019). Graph warp module: an auxiliary module for boosting the power of graph neural networks. arXiv https://arxiv.org/abs/1902.01020v4.

Karimi, M., Wu, D., Wang, Z., and Shen, Y. (2019). DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 35, 3329–3338.

Koes, D.R., Baumgartner, M.P., and Camacho, C.J. (2013). Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J. Chem. Inf. Model. 53, 1893–1904.

Kola, I., and Landis, J. (2004). Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov. 3, 711–715.

Landrum, G. (2006). RDKit: open-source cheminformatics. http://www.rdkit.org.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324.

Lehninger, A.L., Nelson, D.L., and Cox, M.M. (2005). Lehninger Principles of Biochemistry (MacMillan).

Lei, T., Jin, W., Barzilay, R., and Jaakkola, T. (2017). Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2024–2033.

Lim, J., Ryu, S., Park, K., Choe, Y.J., Ham, J., and Kim, W.Y. (2019). Predicting drug–target interaction using a novel graph neural network with 3D structure-embedded graph representation. J. Chem. Inf. Model. 59, 3981–3988.

Lipton, Z.C., Berkowitz, J., and Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv https://arxiv.org/abs/1506.00019v4.

Mattle, D., Kuhn, B., Aebi, J., Bedoucha, M., Kekilli, D., Grozinger, N., Alker, A., Rudolph, M.G., Schmid, G., Schertler, G.F.X., et al. (2018). Ligand channel in pharmacologically stabilized rhodopsin. Proc. Natl. Acad. Sci. USA 115, 3640–3645.

Mayr, A., Klambauer, G., Unterthiner, T., Steijaert, M., Wegner, J.K., Ceulemans, H., Clevert, D.A., and Hochreiter, S. (2018). Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Sci. 9, 5441–5451.

Mysinger, M.M., Carchia, M., Irwin, J.J., and Shoichet, B.K. (2012). Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J. Med. Chem. 55, 6582–6594.

Nam, H., Ha, J.-W., and Kim, J. (2017). Dual attention networks for multimodal reasoning and matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 299–307.

Nogales, E., and Scheres, S.H. (2015). Cryo-EM: a unique tool for the visualization of macromolecular complexity. Mol. Cell 58, 677–689.

Öztürk, H., Özgür, A., and Ozkirimli, E. (2018). DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 34, i821–i829.

Paul, S.M., Mytelka, D.S., Dunwiddie, C.T., Persinger, C.C., Munos, B.H., Lindborg, S.R., and Schacht, A.L. (2010). How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nat. Rev. Drug Discov. 9, 203–214.

Price, A.J., Howard, S., and Cons, B.D. (2017). Fragment-based drug discovery and its application to challenging drug targets. Essays Biochem 61, 475–484.

Quevedo, C.E., Cruz-Migoni, A., Bery, N., Miller, A., Tanaka, T., Petch, D., Bataille, C.J.R., Lee, L.Y.W., Fallon, P.S., Tulmin, H., et al. (2018). Small molecule inhibitors of RAS-effector protein interactions derived using an intracellular antibody fragment. Nat. Commun. 9, 3169.

Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J., and Koes, D.R. (2017). Protein–ligand scoring with convolutional neural networks. J. Chem. Inf. Model. 57, 942–957.

Rester, U. (2008). From virtuality to reality - virtual screening in lead discovery and lead optimization: a medicinal chemistry perspective. Curr. Opin. Drug Discov. Dev. 11, 559–568.

Salentin, S., Schreiber, S., Haupt, V.J., Adasme, M.F., and Schroeder, M. (2015). PLIP: fully automated protein–ligand interaction profiler. Nucleic Acids Res 43, W443–W447.

Salsbury, F.R., Jr. (2010). Molecular dynamics simulations of protein dynamics and their relevance to drug discovery. Curr. Opin. Pharmacol. 10, 738–744.

Santos, C.d., Tan, M., Xiang, B., and Zhou, B. (2016). Attentive pooling networks. arXiv https://arxiv.org/abs/1602.03609v1.

Sousa, S.F., Ribeiro, A.J., Coimbra, J.T., Neves, R.P., Martins, S.A., Moorthy, N.S., Fernandes, P.A., and Ramos, M.J. (2013). Protein-ligand docking in the new millennium - a retrospective of 10 years in the field. Curr. Med. Chem. 20, 2296–2314.

Svergun, D.I., Petoukhov, M.V., and Koch, M.H. (2001). Determination of domain structure of proteins from X-ray solution scattering. Biophys. J. 80, 2946–2953.

Torng, W., and Altman, R.B. (2019). Graph convolutional neural networks for predicting drug-target interactions. J. Chem. Inf. Model. 59, 4131–4149.

Trott, O., and Olson, A.J. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comp. Chem. 31, 455–461.

Tsubaki, M., Tomii, K., and Sese, J. (2019). Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35, 309–318.

UniProt Consortium (2019). UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47, D506–D515.
STAR+METHODS

LEAD CONTACT AND MATERIALS AVAILABILITY
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Jianyang Zeng (zengjy321@tsinghua.edu.cn). This study did not generate new reagents.
METHOD DETAILS
where $f(\cdot)$ stands for the leaky ReLU activation function $f(x) = \max(0, x) + 0.1\min(0, x)$, and $W_{\mathrm{init}} \in \mathbb{R}^{h_1 \times 82}$. Note that for all the neural network layers described in this paper, unless otherwise stated, $f(\cdot)$ stands for the leaky ReLU activation function, $W_x$ ($x$ can be any subscript) stands for the learnable weight parameters, and the bias terms are omitted for clarity.
Each edge (i.e., chemical bond) $e_{i_1,i_2} \in E$ is represented by a feature vector $e_{i_1,i_2}$ of length 6, which is the concatenation of one-hot encodings representing the bond type (i.e., single, double, triple, or aromatic) and other properties, e.g., whether the bond is conjugated and whether it is in a ring.
The atom features are then processed by $L$ iterations of graph convolution to produce a set of updated atom features $\{v_i^L \in \mathbb{R}^{h_1}\}_{i=1}^{N_a}$ and a super node feature $s^L \in \mathbb{R}^{h_1}$, which is an overall feature representation for the compound of interest. Note that the bond features are not updated during the whole process. In particular, at each iteration of graph convolution, the atom features are sequentially updated using both a basic message passing unit (Lei et al., 2017) and a graph warp unit (Ishiguro et al., 2019). The message passing unit executes the following two steps to extract the local features from the given graph: gathering information and updating information. During the first step (i.e., gathering information), each atom $v_i$ gathers local information $t_i^l$ from both its neighboring atoms and bonds, that is,

$t_i^l = f\Big( \sum_{v_k \in \mathrm{Neighbor}(v_i)} W_{\mathrm{gather}}^l \big[ v_k^{l-1}; e_{i,k} \big] \Big),$ (Equation 2)
During the second step (i.e., updating information), the gathered information and the atom features learned from the previous iteration are then processed to obtain the updated features $\{u_i^l\}_{i=1}^{N_a}$ at each iteration $l$, that is,

$u_i^l = f\big( W_{\mathrm{update}}^l \big[ t_i^l; v_i^{l-1} \big] \big),$ (Equation 3)

where $v_i^0$ stands for the transformed initial feature of the $i$-th atom, which is described in STAR Methods of the main text.
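Equations 2 and 3 describe a single gather-then-update message-passing pass; the following is a minimal stdlib-only sketch, with tiny fixed matrices standing in for the learned weights $W_{\mathrm{gather}}^l$ and $W_{\mathrm{update}}^l$:

```python
def leaky_relu(v):
    # f(x) = max(0, x) + 0.1 * min(0, x), applied element-wise.
    return [max(0.0, x) + 0.1 * min(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gather_update(atom_feats, edge_feats, neighbors, W_gather, W_update):
    """One message-passing iteration (Equations 2 and 3): each atom sums
    a transform of [neighbor feature; bond feature] over its neighbors,
    then combines the result with its own previous feature."""
    new_feats = []
    for i, v_i in enumerate(atom_feats):
        # Equation 2: gather t_i from neighboring atoms and bonds.
        z = [0.0] * len(W_gather)
        for k in neighbors[i]:
            msg = matvec(W_gather, atom_feats[k] + edge_feats[(i, k)])
            z = [a + b for a, b in zip(z, msg)]
        t_i = leaky_relu(z)
        # Equation 3: update u_i from the concatenation [t_i; v_i].
        new_feats.append(leaky_relu(matvec(W_update, t_i + v_i)))
    return new_feats

# Toy graph: two atoms (2-d features) joined by one bond (1-d feature).
updated = gather_update(
    atom_feats=[[1.0, 0.0], [0.0, 1.0]],
    edge_feats={(0, 1): [0.5], (1, 0): [0.5]},
    neighbors={0: [1], 1: [0]},
    W_gather=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # ignores the bond feature
    W_update=[[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],  # adds t_i and v_i
)
```

With these toy weights each atom simply adds its neighbor's feature to its own, which is the degenerate case of the gather-update scheme.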
Through information sharing between the super node and all the atoms, distant atoms in the graph can communicate effectively and efficiently through this super node, and thus a global feature can be extracted based on this technique (Ishiguro et al., 2019). In the $l$-th iteration, the message passing unit is used to obtain the updated atom features $\{u_i^l\}_{i=1}^{N_a}$, as described above. Accordingly, the graph warp unit first updates the super node feature by a single-layer neural network to obtain $u_s^l$, that is,

$u_s^l = \tanh\big( W_{\mathrm{super}}^l s^{l-1} \big),$ (Equation 5)

where $l = 1, 2, \ldots, L$, $\tanh(\cdot)$ stands for the hyperbolic tangent activation function, $s^{l-1}$ stands for the super node feature from the $(l-1)$-th iteration, and $W_{\mathrm{super}}^l \in \mathbb{R}^{h_1 \times h_1}$ denotes the learnable parameters.
Then, three steps are conducted to obtain the updated atom and super node features for each iteration.

Step 1: gathering information from the super node and the main nodes (atoms). The information $u_{s\to v}^l$ gathered from the super node is calculated by a single-layer neural network, that is,

$u_{s\to v}^l = \tanh\big( W_{s\to v}^l s^{l-1} \big),$ (Equation 6)

$\alpha_{v,i}^{k,l} = \mathrm{softmax}\big( W_{\mathrm{att}}^{k,l} b_{v,i}^{k,l} \big), \quad k = 1, 2, \ldots, K, \; i = 1, 2, \ldots, N_a,$ (Equation 8)

$b_{v,i}^{k,l} = \tanh\big( W_{\mathrm{vatt}}^{k,l} v_i^{l-1} \big) * \tanh\big( W_{\mathrm{satt}}^{k,l} s^{l-1} \big), \quad k = 1, 2, \ldots, K, \; i = 1, 2, \ldots, N_a,$ (Equation 9)

where $\mathrm{softmax}(\cdot)$ stands for the softmax function, $[\cdot; \cdot; \ldots; \cdot]$ denotes the concatenation operation, $*$ denotes the element-wise multiplication, and $K$ is the number of heads.
Step 2: calculating the passed information using warp gates. For the super node, an element-wise warp gate glv/s ˛Rh1 is used to
combine the information from the super node itself uls and the main nodes (atoms) ulv/s , that is,
glv/s = s W lgate11 ulv/s + W lgate12 uls ; (Equation 10)
tlv/s = 1 glv/s ulv/s + glv/s uls ; (Equation 11)
where l = 1; 2;/;L, W lgate11 ;W lgate12 ˛Rh1 3h1 , sð ,Þ stands for the sigmoid activation function, 1 is an all-one vector of length h1 , and
tlv/s denotes the information passed to super node.
For each atom, similarly, an element-wise warp gate gls/i ˛Rh1 is used to combine the updated atom features uli and information
from the super node uls/v , that is:
gls/i = s W lgate21 uli + W lgate22 uls/v ; (Equation 12)
tls/i = 1 gls/i uli + gls/i uls/v ; (Equation 13)
$s^l = \mathrm{GRU}_s\big( s^{l-1}, t_{v\to s}^l \big).$ (Equation 15)

After completing $L$ iterations of graph convolution, the final atom features $\{v_i^L \in \mathbb{R}^{h_1}\}_{i=1}^{N_a}$ and the super node feature $s^L \in \mathbb{R}^{h_1}$ are generated and then fed into the downstream modules. In the remaining part of this paper, we will drop the superscript $L$ for clarity.
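The warp gates of Equations 10-13 are element-wise sigmoid gates that interpolate between two feature vectors; a toy sketch with fixed weights in place of the learned gate matrices:

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def warp_gate(u_self, u_other, W1, W2):
    """Equations 10-11 (and symmetrically 12-13): compute an element-wise
    gate g = sigmoid(W1 u_other + W2 u_self), then mix the two inputs as
    t = (1 - g) * u_other + g * u_self."""
    z1, z2 = matvec(W1, u_other), matvec(W2, u_self)
    g = sigmoid([a + b for a, b in zip(z1, z2)])
    return [(1 - gi) * uo + gi * us for gi, uo, us in zip(g, u_other, u_self)]

# With zero weights the gate is 0.5 everywhere, so the output is the
# element-wise average of the two inputs.
mixed = warp_gate(u_self=[2.0], u_other=[0.0], W1=[[0.0]], W2=[[0.0]])
```

This gating is what lets the graph warp unit blend local (per-atom) and global (super node) information adaptively rather than by a fixed rule.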
The CNN Module
The protein sequence is encoded using the BLOSUM62 matrix (Henikoff and Henikoff, 1992); that is, the initial feature of each residue is represented by the corresponding column of the BLOSUM62 matrix. The features of non-standard amino acids are zero-initialized. We use this encoding strategy instead of the commonly used one-hot encoding scheme for protein sequences, mainly because the BLOSUM62 matrix is a $20 \times 20$ matrix that encodes the evolutionary relationships between amino acids, while the one-hot encoding scheme lacks such information. Then, the initial features are updated through typical 1-D convolution layers (LeCun et al., 1998) with a leaky ReLU activation function. Note that before being fed into each convolutional layer, the input is zero-padded to ensure that the number of output features remains fixed. The specific architecture of the employed convolutional neural network is determined by three hyper-parameters: the number of convolution layers, and the number and size of filters in each layer. In the end, we obtain the final output features $\{r_j \in \mathbb{R}^{h_1}\}_{j=1}^{N_r}$ for all the residues along the protein sequence (Figure 2C), where $h_1$ stands for the number of output channels and $N_r$ stands for the number of residues in the input protein sequence.
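The zero-padding described above keeps the number of per-residue outputs equal to the input length; a stdlib-only sketch of a single "same"-padded 1-D convolution channel (the toy filter is ours, not a learned one):

```python
def conv1d_same(seq_feats, kernel):
    """1-D convolution over a per-residue feature sequence with zero
    padding on both ends, so len(output) == len(input).
    Assumes an odd kernel size."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq_feats) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(seq_feats))]

# A length-3 input convolved with a length-3 filter stays length 3.
out = conv1d_same([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

Keeping the output length fixed is what allows every residue $r_j$ to retain a feature vector after several stacked convolution layers.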
The Pairwise Interaction Prediction Module
To predict the pairwise interactions for a given compound-protein pair, the pairwise interaction prediction module (Figure 3A) uses the atom features $\{v_i \in \mathbb{R}^{h_1}\}_{i=1}^{N_a}$ and the residue features $\{r_j \in \mathbb{R}^{h_1}\}_{j=1}^{N_r}$ derived from the modules described above. The atom and residue features are first transformed into a compatible space by two single-layer neural networks separately. Then, the predicted probability of the interaction between an atom $v_i$ and a residue $r_j$ is derived based on the inner product between the transformed atom and residue features, normalized by a sigmoid function, that is,

$P_{ij} = \sigma\big( f( W_{\mathrm{atom}} v_i ) \cdot f( W_{\mathrm{residue}} r_j ) \big),$ (Equation 16)

where $i = 1, 2, \ldots, N_a$, $j = 1, 2, \ldots, N_r$, $W_{\mathrm{atom}}, W_{\mathrm{residue}} \in \mathbb{R}^{h_1 \times h_1}$, $\sigma(\cdot)$ represents the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $\cdot$ denotes the inner product.
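Equation 16 reduces to a sigmoid over inner products of transformed atom and residue features; in the toy sketch below, the learned transforms $W_{\mathrm{atom}}$ and $W_{\mathrm{residue}}$ (and the activation $f$) are replaced by the identity, so $P_{ij} = \sigma(v_i \cdot r_j)$:

```python
import math

def pairwise_interaction_matrix(atom_feats, residue_feats):
    """P_ij = sigmoid(f(W_atom v_i) . f(W_residue r_j)); here the learned
    transforms are the identity, so P_ij = sigmoid(v_i . r_j)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [[sigmoid(sum(a * b for a, b in zip(v, r))) for r in residue_feats]
            for v in atom_feats]

# One atom against two residues: an orthogonal residue gives P = 0.5
# (inner product 0), while an aligned residue pushes P toward 1.
P = pairwise_interaction_matrix([[1.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]])
```

The resulting $N_a \times N_r$ matrix is exactly the object that is later thresholded (mean plus several standard deviations) to call predicted interaction sites.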
The Affinity Prediction Module
The affinity prediction module (Figure 3B) integrates information from not only the previously learned atom features $\{v_i\}_{i=1}^{N_a}$, the super node feature $s$, and the residue features $\{r_j\}_{j=1}^{N_r}$, but also the predicted pairwise interaction matrix $P$. Intuitively, $P$ can be used to construct the links to share information between atom and residue features, which may thus provide additional useful information for predicting the binding affinity. Here, we describe how the binding affinity is predicted by our affinity prediction module.

First, the atom features $\{v_i\}_{i=1}^{N_a}$ and the super node feature $s$, which are originally constructed in the compound space, as well as the residue features $\{r_j\}_{j=1}^{N_r}$, which are originally constructed in the protein space, are transformed into a compatible space for affinity prediction by single-layer neural networks, that is,

$h_{v,i} = f( W_v v_i ),$ (Equation 17)
modify the original DAN framework by further exploiting the predicted pairwise interaction matrix to construct the direct links between
atoms and residues. Information passing is thus enabled by gathering features of interaction partners through such links for each
atom of the compound and each residue of the protein. The passed information is then incorporated into the calculation of compound
and protein attentions by DAN. Next we will describe how to use the modified DAN framework to derive compound and protein
attentions in the affinity prediction module to transform the atom and residue features into fixed-size vector representations.
Before all the DAN iterations, we first define the initial compound feature $h_c^0 \in \mathbb{R}^{h_2}$, the initial protein feature $h_p^0 \in \mathbb{R}^{h_2}$ and the initial memory vector $m^0 \in \mathbb{R}^{h_2}$ ($h_2$ is the size of the hidden units used in the affinity prediction module), that is,

$h_c^0 = \frac{1}{N_a} \sum_{i=1}^{N_a} h_{v,i}$, (Equation 20)

$h_p^0 = \frac{1}{N_r} \sum_{j=1}^{N_r} h_{r,j}$, (Equation 21)

$m^0 = h_c^0 \odot h_p^0$. (Equation 22)

In each iteration $d$, information is first passed between atoms and residues along the links defined by the predicted pairwise interaction matrix $P$, that is,

$s_{r \to v,i}^d = \sum_{j=1}^{N_r} P_{ij} \tanh\big(W_{r \to v}^d h_{r,j}\big), \quad i = 1, 2, \ldots, N_a$, (Equation 23)

$s_{v \to r,j}^d = \sum_{i=1}^{N_a} P_{ij} \tanh\big(W_{v \to r}^d h_{v,i}\big), \quad j = 1, 2, \ldots, N_r$, (Equation 24)

where $d = 1, 2, \ldots, D$, $W_{r \to v}^d, W_{v \to r}^d \in \mathbb{R}^{h_2 \times h_2}$, $\{s_{r \to v,i}^d\}_{i=1}^{N_a}$ is the information delivered from residues to atoms, $\{s_{v \to r,j}^d\}_{j=1}^{N_r}$ is the information delivered from atoms to residues, and $P_{ij}$ stands for the corresponding element of $P$.
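The information passing of Equations 23 and 24 reduces to two matrix products weighted by $P$. A minimal NumPy sketch (shapes are illustrative; bias terms omitted):

```python
import numpy as np

def pass_messages(P, Hv, Hr, W_rv, W_vr):
    """Equations 23-24: exchange features along predicted atom-residue contacts.

    P: (Na, Nr) predicted interaction matrix; Hv: (Na, h2) atom features;
    Hr: (Nr, h2) residue features; W_rv / W_vr: (h2, h2) per-iteration weights.
    Returns the messages delivered to atoms (S_rv) and to residues (S_vr).
    """
    S_rv = P @ np.tanh(Hr @ W_rv.T)    # s_{r->v,i} = sum_j P_ij tanh(W_{r->v} h_{r,j})
    S_vr = P.T @ np.tanh(Hv @ W_vr.T)  # s_{v->r,j} = sum_i P_ij tanh(W_{v->r} h_{v,i})
    return S_rv, S_vr
```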
Next, the atom or residue features ($\{h_{v,i}\}_{i=1}^{N_a}$ or $\{h_{r,j}\}_{j=1}^{N_r}$), the memory vector from the previous iteration ($m^{d-1}$) and the above derived shared information ($\{s_{r \to v,i}^d\}_{i=1}^{N_a}$ or $\{s_{v \to r,j}^d\}_{j=1}^{N_r}$) are combined to calculate the hidden states of the compound and protein attentions, that is,

$b_{v,i}^d = \tanh\big(W_{vc}^d h_{v,i}\big) \odot \tanh\big(W_{mc}^d m^{d-1}\big) \odot s_{r \to v,i}^d$, (Equation 25)

$b_{r,j}^d = \tanh\big(W_{rp}^d h_{r,j}\big) \odot \tanh\big(W_{mp}^d m^{d-1}\big) \odot s_{v \to r,j}^d$, (Equation 26)

where $\odot$ denotes element-wise multiplication. The attention weights are then obtained by softmax normalization,

$a_{v,i}^d = \mathrm{softmax}\big(W_{vs}^d b_{v,i}^d\big)$, (Equation 27)

$a_{r,j}^d = \mathrm{softmax}\big(W_{rs}^d b_{r,j}^d\big)$, (Equation 28)

and used to summarize the atom and residue features into compound and protein representations,

$h_c^d = \sum_{i=1}^{N_a} a_{v,i}^d h_{v,i}$, (Equation 29)

$h_p^d = \sum_{j=1}^{N_r} a_{r,j}^d h_{r,j}$, (Equation 30)

$m^d = \mathrm{GRU}\big(m^{d-1},\, h_c^d \odot h_p^d\big)$, (Equation 31)
where GRU stands for the gated recurrent unit (Chung et al., 2014).
After completing all $D$ iterations of updating the attentions, we obtain fixed-size feature representations of the input compound graph and protein sequence (that is, $h_c^D$ and $h_p^D$, respectively).
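Putting Equations 25-31 together, one iteration of the modified DAN can be sketched as follows. This is an illustrative NumPy version, not the trained implementation: the element-wise products and the reduced single-gate memory update (standing in for the full GRU of Equation 31) are assumptions where the extracted text leaves the exact operators to the cited DAN and GRU papers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dan_iteration(Hv, Hr, S_rv, S_vr, m_prev, W):
    """One modified-DAN iteration (Equations 25-31), in simplified form.

    W is a dict of (h2, h2) weight matrices plus (h2,) scoring vectors
    'w_vs' / 'w_rs'.  S_rv / S_vr are the messages of Equations 23-24.
    """
    # hidden states gated by the memory and the passed messages (Eqs. 25-26)
    Bv = np.tanh(Hv @ W['vc'].T) * np.tanh(W['mc'] @ m_prev) * S_rv
    Br = np.tanh(Hr @ W['rp'].T) * np.tanh(W['mp'] @ m_prev) * S_vr
    # attention weights over atoms and residues (Eqs. 27-28)
    a_v = softmax(Bv @ W['w_vs'])
    a_r = softmax(Br @ W['w_rs'])
    # attention-weighted compound / protein summaries (Eqs. 29-30)
    h_c = a_v @ Hv
    h_p = a_r @ Hr
    # memory update (Eq. 31); a single update gate replaces the full GRU here
    z = 1.0 / (1.0 + np.exp(-(W['z'] @ m_prev + W['zx'] @ (h_c * h_p))))
    m = (1 - z) * m_prev + z * np.tanh(W['h'] @ m_prev + W['hx'] @ (h_c * h_p))
    return h_c, h_p, m
```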
Finally, $h_c^D$ is concatenated with the transformed super node feature $h_s$ to obtain a combined representation of the compound features (i.e., $[h_c^D; h_s]$). To fully exploit the relationship between this combined representation of the compound features and the representation of the protein features, we calculate their outer product, normalized by a leaky ReLU activation function $\phi$, followed by a linear regression layer to predict the binding affinity, that is,

$a = W_{\text{affinity}}\, \phi\big(\mathrm{flatten}\big([h_c^D; h_s] \otimes h_p^D\big)\big)$, (Equation 32)

where $\otimes$ denotes the outer product, $\mathrm{flatten}(\cdot)$ reshapes the result of the outer product into a column vector of length $2h_2^2$, and $W_{\text{affinity}} \in \mathbb{R}^{1 \times 2h_2^2}$.
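Equation 32 can be sketched directly: concatenate, take the outer product, flatten, and apply a linear layer. A minimal NumPy version (bias omitted):

```python
import numpy as np

def predict_affinity(h_c, h_s, h_p, w_affinity, slope=0.01):
    """Equation 32: a = W_affinity . phi( flatten( [h_c; h_s] (x) h_p ) ).

    h_c, h_s, h_p: (h2,) compound, super-node and protein summaries;
    w_affinity: (2 * h2 * h2,) linear-regression weights (bias omitted).
    """
    outer = np.outer(np.concatenate([h_c, h_s]), h_p)  # (2*h2, h2) outer product
    feat = np.where(outer > 0, outer, slope * outer)   # leaky ReLU phi
    return float(w_affinity @ feat.ravel())            # scalar binding affinity
```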
Training
For a training dataset with $N$ samples (i.e., compound-protein pairs), we minimize the cross-entropy loss for pairwise non-covalent interaction prediction, which is defined as

$L_P = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{N_a^{(n)}} \sum_{j=1}^{N_r^{(n)}} \left[ \hat{P}_{ij}^{(n)} \log P_{ij}^{(n)} + \big(1 - \hat{P}_{ij}^{(n)}\big) \log\big(1 - P_{ij}^{(n)}\big) \right]$, (Equation 33)

where $P_{ij}^{(n)}$ and $\hat{P}_{ij}^{(n)}$ stand for the predicted probability and the true binary label of the interaction between the $i$-th atom and the $j$-th residue in the $n$-th sample, respectively, and $N_a^{(n)}$ and $N_r^{(n)}$ stand for the total number of atoms in the compound and the total number of residues in the protein of the $n$-th sample, respectively.
For binding affinity prediction, the objective is to minimize the mean squared error, which is defined as

$L_A = \frac{1}{N} \sum_{n=1}^{N} \big(a^{(n)} - \hat{a}^{(n)}\big)^2$, (Equation 34)

where $a^{(n)}$ and $\hat{a}^{(n)}$ stand for the predicted affinity and the true affinity label of the $n$-th sample, respectively.
In our multi-objective training process, we aim to minimize the combination of two losses to further enhance the binding affinity
prediction, that is,
$L = L_A + \lambda L_P$, (Equation 35)

where $\lambda$ stands for a weight parameter controlling the contribution of $L_P$ to the final affinity prediction. During the training process, we
use a mini-batch stochastic gradient descent scheme to optimize the model parameters. For each training batch, compounds with
different numbers of atoms and proteins with different numbers of residues are zero-padded to obtain the same input feature lengths.
During the training process, the padded regions of features are masked so that they do not contribute to the calculation of the losses
and gradients. MONN has about two million learnable parameters. A single MONN model can be trained within an hour on a Linux
server with 48 logical CPU cores and one Nvidia Geforce GTX 1080Ti GPU.
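The masked multi-objective loss described above can be sketched as follows. This is an illustrative NumPy version: the zero-padded tensor layout and the value of `lam` are assumptions, not the tuned configuration.

```python
import numpy as np

def monn_loss(P_pred, P_true, pair_mask, a_pred, a_true, lam=0.1):
    """Combined training loss L = L_A + lambda * L_P (Equations 33-35) on a batch.

    P_pred / P_true: (B, Na_max, Nr_max) zero-padded predictions and binary labels;
    pair_mask: same shape, 1 for real atom-residue pairs and 0 for padding;
    a_pred / a_true: (B,) predicted and true affinities.
    """
    eps = 1e-8  # numerical guard for the logarithms
    ce = -(P_true * np.log(P_pred + eps)
           + (1.0 - P_true) * np.log(1.0 - P_pred + eps))
    L_P = (ce * pair_mask).sum() / len(a_pred)  # masked Equation 33
    L_A = np.mean((a_pred - a_true) ** 2)       # Equation 34
    return L_A + lam * L_P                      # Equation 35
```

Multiplying the element-wise cross-entropy by the mask is what keeps the padded regions out of the loss and the gradients, as described in the text.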
ssalentin/plip/). The ligand atoms involved in the non-covalent interactions were then mapped to the corresponding compound struc-
tures (downloaded from http://ligand-expo.rcsb.org/ld-download.html), which contain the unique names and indices for all the
non-hydrogen atoms. For proteins, the residues involved in the non-covalent interactions were first mapped to the UniProt
sequences using a sequence alignment tool (Zhao et al., 2013) (https://github.com/mengyao/Complete-Striped-Smith-Waterman-
Library/). Then, we examined the mappings to control the quality of the generated interaction labels, and discarded structures whose detected interactions could not be correctly mapped onto the molecular graphs of the compounds and the protein sequences.
In addition, to further improve the mapping quality, we also filtered the complexes whose protein sequences in the PDB structures
and the corresponding UniProt sequences had less than 90% matched residues. After the mapping process, we filled the pairwise
interaction matrix according to the indices of the atoms and the residues involved in the non-covalent interactions to obtain the final
interaction labels. After these procedures, we successfully constructed pairwise non-covalent interaction labels for about 95% of the
compound-protein pairs, resulting in 12,738 interaction matrices out of the 13,306 complex structures.
After constructing the benchmark dataset as described above, the performance of the pairwise interaction prediction was evalu-
ated using all the available data. For binding affinity prediction, we further separated the compound-protein pairs according to the
measurement types of binding affinities (i.e., Ki, Kd or IC50), resulting in two affinity datasets, called the IC50 dataset and the KIKD dataset (which contained both Ki and Kd values), respectively. The reason for this separation was that IC50 values usually depend on the experimental conditions and are thus often considered noisier than Ki or Kd values. Here, we mainly used the IC50
dataset for hyper-parameter tuning for binding affinity prediction. For those repetitive records (defined as pairs with the same protein
IDs and the same compound InChIs), we only kept the pairs with pairwise interaction labels and higher binding affinities. The raw
affinity values were transformed into p(affinity) values (i.e., $-\log_{10}(\text{affinity [M]})$) to obtain the affinity labels. Finally, we obtained 5,340
and 6,689 unique pairs for the IC50 and KIKD datasets, respectively.
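The affinity transformation and the deduplication rule can be sketched as below. The dict-based record format is hypothetical (not the project's actual data schema), and "higher binding affinity" is interpreted here as the larger p(affinity) value, i.e., stronger binding.

```python
import math

def to_paffinity(affinity_molar):
    """p(affinity) = -log10(affinity), with the affinity expressed in mol/L."""
    return -math.log10(affinity_molar)

def deduplicate(records):
    """Keep one record per (protein ID, compound InChI) key, preferring records
    that carry pairwise interaction labels and then the higher p(affinity),
    mirroring the rule described above.
    """
    best = {}
    for r in records:
        key = (r['protein_id'], r['inchi'])
        rank = (r['has_labels'], r['paffinity'])  # labels first, then affinity
        if key not in best or rank > (best[key]['has_labels'], best[key]['paffinity']):
            best[key] = r
    return list(best.values())
```

For example, a Kd of 1 nM (1e-9 M) maps to a p(affinity) of 9.0.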
Construction of the Additional Test Dataset for Validating the Pairwise Non-covalent Interaction Predictions
We also downloaded the compound-protein complexes from the RCSB PDB database (Berman et al., 2000) to construct an addi-
tional test set for evaluating the pairwise non-covalent interaction prediction results of MONN. Since the PDBbind v2018 dataset,
which was used as our training data, already contained the high-quality compound-protein complex structures with releasing
date up to the end of 2017, here we downloaded structures released between January 2018 and March 2019 to avoid overlap between the training and additional test datasets. Three criteria were used to select the compound-protein complexes and control the quality of this additional dataset: (1) Each protein sequence can be mapped to a UniProt sequence, with at least 90% matched residues in the sequence alignment; (2) To remove ions, coenzymes and other crystallization additives, we retained only those compound-protein pairs in which the quantitative estimation of drug-likeness (QED) score (Bickerton et al., 2012) of the compound is larger than 0.5;
(3) Overlaps between training and test datasets were removed by discarding the test samples with both compound and protein sim-
ilarities larger than 0.9 with any compound-protein pair in the training data. Then, the selected compound-protein complexes were
processed using PLIP (Salentin et al., 2015) to extract the non-covalent interactions and construct the pairwise interaction labels
using the same procedure as in the construction of the benchmark dataset.
$\text{average AUC score} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{AUC}(n)$, (Equation 36)

where AUC($n$) stands for the area under the ROC curve calculated between the labels and the predictions of the $n$-th sample.
The average enrichment score is defined as:

$\text{average enrichment score} = \frac{1}{N} \sum_{n=1}^{N} \text{enrichment}(n)$, (Equation 37)

$\text{enrichment}(n) = \frac{\text{precision}(n)}{\text{random\_precision}(n)}$, (Equation 38)

where precision($n$) stands for the precision score between the true labels and the binarized predictions (defined below) of the $n$-th sample, and random_precision($n$) stands for the expected precision of random predictions. Suppose that the positive-negative ratio of the whole dataset is $x_{pos} : x_{neg}$, and the number of predictions for a sample is $l_{pred}$. Then the binarization is realized by sorting the real-valued predictions and assigning 1 to the top $\lceil l_{pred} \cdot x_{pos}/(x_{pos} + x_{neg}) \rceil$ predictions ($\lceil \cdot \rceil$ stands for the ceiling operation) and 0 to the rest. The random precision is calculated as $\text{random\_precision}(n) = x_{pos}/(x_{pos} + x_{neg})$.
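The per-sample enrichment computation, with its ceiling-based binarization, can be sketched as:

```python
import numpy as np

def enrichment(y_true, y_score, x_pos, x_neg):
    """Equations 37-38 for one sample: precision of the binarized predictions
    over the precision expected from random guessing.

    y_true: binary labels; y_score: real-valued predictions;
    x_pos / x_neg: the dataset-wide positive-negative ratio.
    """
    l_pred = len(y_score)
    k = int(np.ceil(l_pred * x_pos / (x_pos + x_neg)))   # number of 1s to assign
    top = np.argsort(y_score)[::-1][:k]                  # indices of the top-k scores
    precision = float(np.mean(np.asarray(y_true)[top]))  # true-positive fraction in top-k
    random_precision = x_pos / (x_pos + x_neg)
    return precision / random_precision
```

With a 1:3 positive-negative ratio, a sample whose single top-ranked prediction is a true positive reaches the upper limit of 1 + 3/1 = 4.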
The upper limit of the average enrichment score is derived below:

$\text{average enrichment score} = \frac{1}{N} \sum_{n=1}^{N} \text{enrichment}(n)$ (Equation 39)

$= \frac{1}{N} \sum_{n=1}^{N} \frac{\text{precision}(n)}{x_{pos}/(x_{pos} + x_{neg})}$

$\leq \frac{1}{N} \sum_{n=1}^{N} \frac{1}{x_{pos}/(x_{pos} + x_{neg})}$

$= 1 + \frac{x_{neg}}{x_{pos}}.$

Thus, with a relatively small positive-negative ratio (i.e., a relatively large $x_{neg}/x_{pos}$), the upper limit of the average enrichment score is relatively high.
the affinity prediction module of MONN (as described in STAR Methods) to calculate the compound and protein attentions according to the neural-attention-based methods mentioned above. The resulting compound and protein attentions then substituted the corresponding part (i.e., the DAN part) of our affinity prediction module, and the models were trained on the binding affinity labels. Note that our pairwise interaction prediction module was not used in this process. More details on how we implemented these neural attentions under the MONN framework are given below.
The Bilinear Attention of the Method by Gao et al.
Variables and parameters used only by this algorithm are marked with the superscript $[G]$. The atom features and the residue features are combined to calculate a soft alignment matrix $P^{[G]}$ of size $N_a \times N_r$:

$P_{ij}^{[G]} = \tanh\Big(\big(W_v^{[G]} h_{v,i}\big)^{T} W_r^{[G]} h_{r,j}\Big)$, (Equation 40)

where $W_v^{[G]}, W_r^{[G]} \in \mathbb{R}^{h_2 \times h_2}$ stand for the learnable weight parameters. Note that the bias terms of single-layer neural networks are also omitted for clarity in this section.
Then, the compound attentions $\{a_{v,i}^{[G]}\}_{i=1}^{N_a}$ and protein attentions $\{a_{r,j}^{[G]}\}_{j=1}^{N_r}$ are calculated by max-pooling over the soft alignment matrix $P^{[G]}$, followed by a softmax normalization, that is:

$a_{v,i}^{[G]} = \mathrm{softmax}\Big(\max_{j = 1, 2, \ldots, N_r} P_{ij}^{[G]}\Big)$, (Equation 41)

$a_{r,j}^{[G]} = \mathrm{softmax}\Big(\max_{i = 1, 2, \ldots, N_a} P_{ij}^{[G]}\Big).$ (Equation 42)
These attentions are then used for reducing the sizes of compound and protein features for predicting binding affinity values.
To evaluate the interpretability, the compound attentions $\{a_{v,i}^{[G]}\}_{i=1}^{N_a}$ and the protein attentions $\{a_{r,j}^{[G]}\}_{j=1}^{N_r}$ are used as the predictions of the interaction sites in compounds and proteins, respectively. The soft alignment matrix $P^{[G]}$ is used as the predicted pairwise interaction matrix.
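A minimal NumPy sketch of this bilinear attention (Equations 40-42; shapes illustrative, biases omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention(Hv, Hr, Wv, Wr):
    """Equations 40-42: soft alignment plus max-pooled, softmax-normalized attentions.

    Hv: (Na, h2) atom features; Hr: (Nr, h2) residue features; Wv / Wr: (h2, h2).
    Returns the alignment matrix P and the compound / protein attention vectors.
    """
    P = np.tanh((Hv @ Wv.T) @ (Hr @ Wr.T).T)  # (Na, Nr) soft alignment (Eq. 40)
    a_v = softmax(P.max(axis=1))              # pool over residues, then softmax (Eq. 41)
    a_r = softmax(P.max(axis=0))              # pool over atoms, then softmax (Eq. 42)
    return P, a_v, a_r
```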
The Separate Attention of DeepAffinity
Variables and parameters used only by this algorithm are marked with the superscript $[Ds]$. The attentions for the atom features of a compound ($\{a_{v,i}^{[Ds]}\}$) are calculated by:

$e_{v,i}^{[Ds]} = \tanh\big(W_{ev}^{[Ds]} h_{v,i}\big)$, (Equation 43)

$a_{v,i}^{[Ds]} = \mathrm{softmax}\big(W_{av}^{[Ds]} e_{v,i}^{[Ds]}\big)$, (Equation 44)

where $W_{ev}^{[Ds]} \in \mathbb{R}^{h_2 \times h_2}$ and $W_{av}^{[Ds]} \in \mathbb{R}^{1 \times h_2}$ stand for the learnable weight parameters, and $\tanh(\cdot)$ stands for the hyperbolic tangent activation function.
Similarly, the attentions for the residue features of a protein ($\{a_{r,j}^{[Ds]}\}$) are calculated by:

$e_{r,j}^{[Ds]} = \tanh\big(W_{er}^{[Ds]} h_{r,j}\big)$, (Equation 45)

$a_{r,j}^{[Ds]} = \mathrm{softmax}\big(W_{ar}^{[Ds]} e_{r,j}^{[Ds]}\big)$, (Equation 46)

where $W_{er}^{[Ds]} \in \mathbb{R}^{h_2 \times h_2}$ and $W_{ar}^{[Ds]} \in \mathbb{R}^{1 \times h_2}$ stand for the learnable weight parameters.
The compound and protein attentions are then fed into the affinity prediction module of MONN. After training on the binding affinity labels, these attentions are used as the predictions of the interaction sites. Evaluation of pairwise interaction prediction is not applicable to this kind of attention, as the matching between atoms and residues is not considered in this setting.
The Joint Attention of DeepAffinity
Variables and parameters used only by this algorithm are marked with the superscript $[Dj]$. A pairwise interaction matrix $P^{[Dj]}$ of size $N_a \times N_r$ is first calculated through a single-layer neural network that combines both atom and residue features, that is,

$P_{ij}^{[Dj]} = \tanh\Big(\big(W_{pv}^{[Dj]} h_{v,i}\big)^{T} W_{pr}^{[Dj]} h_{r,j}\Big)$, (Equation 47)

where $W_{pv}^{[Dj]}, W_{pr}^{[Dj]} \in \mathbb{R}^{h_2 \times h_2}$ stand for the learnable weight parameters.
Then, a softmax function is used to normalize the pairwise interaction matrix over all its elements, yielding an $N_a \times N_r$ attention matrix $A^{[Dj]}$, that is,

$A_{ij}^{[Dj]} = \frac{\exp\big(P_{ij}^{[Dj]}\big)}{\sum_{i'=1}^{N_a} \sum_{j'=1}^{N_r} \exp\big(P_{i'j'}^{[Dj]}\big)}.$ (Equation 48)

This normalized pairwise attention matrix $A^{[Dj]}$ can be used in the evaluation of pairwise interaction prediction. In addition, by marginalizing $A^{[Dj]}$, we can also derive the predictions of interaction sites in compounds or proteins, that is,

$a_{v,i}^{[Dj]} = \max_{j \in \{1, 2, \ldots, N_r\}} P_{ij}^{[Dj]}$, (Equation 49)

$a_{r,j}^{[Dj]} = \max_{i \in \{1, 2, \ldots, N_a\}} P_{ij}^{[Dj]}.$ (Equation 50)
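Equations 47-50 can be sketched as follows (an illustrative NumPy version; biases omitted):

```python
import numpy as np

def joint_attention(Hv, Hr, Wpv, Wpr):
    """Equations 47-50: pairwise scores normalized by a single global softmax.

    Hv: (Na, h2) atom features; Hr: (Nr, h2) residue features; Wpv / Wpr: (h2, h2).
    Returns the attention matrix A plus the row / column maxima used as
    per-atom and per-residue interaction-site predictions.
    """
    P = np.tanh((Hv @ Wpv.T) @ (Hr @ Wpr.T).T)  # (Na, Nr) pairwise scores (Eq. 47)
    e = np.exp(P - P.max())
    A = e / e.sum()        # softmax over all Na * Nr elements (Eq. 48)
    a_v = P.max(axis=1)    # compound-side site scores (Eq. 49)
    a_r = P.max(axis=0)    # protein-side site scores (Eq. 50)
    return A, a_v, a_r
```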
Since the original implementation of DeepAffinity with joint attention did not define compound-wise/protein-wise attentions, here we modified our affinity prediction module by replacing the outer product between compound and protein features with a combined feature, as used in DeepAffinity:

$b_{ij}^{[Dj]} = \tanh\big(W_{bv}^{[Dj]} h_{v,i} + W_{br}^{[Dj]} h_{r,j}\big)$, (Equation 51)

$h^{[Dj]} = \sum_{i=1}^{N_a} \sum_{j=1}^{N_r} A_{ij}^{[Dj]} b_{ij}^{[Dj]}$, (Equation 52)

where $\phi(\cdot)$ stands for the leaky ReLU activation function, $[\,\cdot\,;\,\cdot\,]$ stands for the concatenation operation, and $s$ represents the super node feature.
where SW($\cdot$, $\cdot$) stands for the Smith-Waterman alignment score calculated with the SSW library (Zhao et al., 2013). The clustering threshold (a distance parameter used in the clustering algorithm) is defined as the minimal distance between any compounds (proteins) from different clusters. The clustering threshold value used in our paper is selected from {0.3, 0.4, 0.5, 0.6}. We chose 0.3 as the lower limit of the threshold, because a distance smaller than 0.3 would not provide enough separation to avoid the data redundancy problem, consistent with a previous study (Mayr et al., 2018). The upper limit of our clustering threshold was set to 0.6, because a higher threshold would lead to clusters so large that the splitting of training-test data would become highly imbalanced (i.e., too much training data and too little test data, or vice versa; Tables S2 and S3).
Cross Validation Settings
After generating the compound and protein clusters, three settings are considered during the cross validation process, i.e., the new-
compound setting, the new-protein setting and the both-new setting. To explain these settings, we denote the training, validation and test sets by $D_{train}$, $D_{valid}$ and $D_{test}$, respectively, and use $(c_i, p_i)$ to represent the compound-protein pair of the $i$-th sample ($i = 1, 2, \ldots, N$).
In the new-compound setting, cross validation is performed on compound clusters, so that compound-protein pairs with compounds from the same cluster cannot be shared across the training, validation and test sets. That is, for any two compound-protein pairs $(c_i, p_i)$ and $(c_j, p_j)$ from different sets, $c_i$ and $c_j$ must come from different compound clusters.
In the new-protein setting, cross validation is performed on protein clusters, so that compound-protein pairs with proteins from the same cluster cannot be shared across the training, validation and test sets. That is, for any two compound-protein pairs $(c_i, p_i)$ and $(c_j, p_j)$ from different sets, $p_i$ and $p_j$ must come from different protein clusters.
In the both-new setting, neither compound clusters nor protein clusters can be shared across the training, validation and test sets. That is, for any two compound-protein pairs $(c_i, p_i)$ and $(c_j, p_j)$ from different sets, $c_i$ and $c_j$ must come from different compound clusters, and $p_i$ and $p_j$ must come from different protein clusters as well.
For the new-compound and new-protein settings, we use five-fold cross validation, with a train-validation-test splitting ratio of approximately 7 : 1 : 2. Note that this ratio is an approximation, because the splitting is performed on clusters, and the number of data points in individual clusters is not necessarily evenly distributed. For the both-new setting, we randomly partition the pairs of compound-protein clusters into a 3×3 grid. Then, nine-fold cross validation (Airola and Pahikkala, 2018) was conducted according to the following three steps: 1) select a cell as the test set; 2) discard the four cells that share compound or protein clusters with the selected one; 3) re-partition the remaining clusters into a new 3×3 grid, randomly select one cell as the validation set, and use the four cells that do not share any compound or protein cluster with the validation set as the training set. Such a cross-validation strategy results in an approximately 16 : 4 : 9 train-validation-test ratio.
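The three-step both-new splitting can be sketched as follows. This is an illustrative implementation: the even grouping of cluster IDs into thirds and the output format are assumptions, not the paper's actual code.

```python
import itertools
import random

def both_new_folds(compound_clusters, protein_clusters, seed=0):
    """Sketch of the nine-fold both-new splitting described above.

    Each axis of clusters is partitioned into three groups, giving a 3x3 grid of
    cluster-pair cells.  Each cell serves once as the test set; the clusters not
    sharing a group with it are re-gridded 3x3 to choose the validation and
    training cells.  Cluster IDs can be any hashable values.
    """
    rng = random.Random(seed)

    def grid3(items):
        items = list(items)
        rng.shuffle(items)
        return [items[i::3] for i in range(3)]  # three roughly equal groups

    cg, pg = grid3(compound_clusters), grid3(protein_clusters)
    folds = []
    for ti, tj in itertools.product(range(3), range(3)):
        test = (cg[ti], pg[tj])
        # clusters sharing no compound / protein group with the test cell
        rest_c = [c for i in range(3) if i != ti for c in cg[i]]
        rest_p = [p for j in range(3) if j != tj for p in pg[j]]
        cg2, pg2 = grid3(rest_c), grid3(rest_p)
        vi, vj = rng.randrange(3), rng.randrange(3)
        valid = (cg2[vi], pg2[vj])
        train = ([c for i in range(3) if i != vi for c in cg2[i]],
                 [p for j in range(3) if j != vj for p in pg2[j]])
        folds.append({'test': test, 'valid': valid, 'train': train})
    return folds
```

By construction, the training clusters never overlap the test or validation clusters on either axis, which is the property the both-new setting is meant to guarantee.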
Hyper-parameter Selection
Four baseline models were used in the performance comparison for the binding affinity prediction task, including CGKronRLS (Cichonska et al., 2017), DeepDTA (Öztürk et al., 2018), the method by Tsubaki et al. (Tsubaki et al., 2019) and DeepAffinity (Karimi et al., 2019). The method by Gao et al. was not included here, because its source code was not released and the model requires additional input information (i.e., gene ontology terms of proteins) (Gao et al., 2018). For our model and all the baseline methods, each cross-
validation setting (i.e., new-compound, new-protein or both-new) has a specific set of hyper-parameters. For MONN, the hyper-
parameter selection was performed with both training objectives. The details of the hyper-parameter spaces for MONN and the base-
line methods are provided below:
Apart from all the hyper-parameters mentioned above, all the methods have another hyper-parameter, i.e., the number of epochs
(or iterations) during the training process. We used the RMSE as the evaluation metric from the validation set to select the best value
of this hyper-parameter for all the methods. The maximum number of epochs for our method was set to 30. For DeepDTA, the method
by Tsubaki et al. and DeepAffinity, we used their default maximum numbers of epochs (which is 100). For CGKronRLS, we set the
maximum number of iterations to 500, as the performance no longer increased after 500 iterations.
For each cross-validation setting, the best hyper-parameters were selected based on the IC50 dataset with clustering threshold
0.3. The same parameters were used for other scenarios (i.e., other thresholds and the KIKD dataset) under the corresponding cross-
validation setting. For pairwise interaction prediction, we also used the best hyper-parameters selected based on the affinity prediction results. We did not select hyper-parameters according to the performance of the pairwise prediction task (that is, using only a single training objective of MONN), for the following two reasons. First, in MONN, the hyper-parameters in the affinity prediction module would not be optimized under this condition. Second, the baseline methods do not include a direct supervised optimization procedure for local interaction prediction. Thus, for a fair comparison, we did not specifically tune the hyper-parameters for pairwise interaction prediction in MONN.
Here, we did not specifically optimize the hyper-parameters of MONN and DeepDTA over the BindingDB dataset, and directly used
the hyper-parameter settings derived previously from the PDBbind-derived benchmark dataset.
The benchmark dataset created in this work and the source code of the MONN model can be downloaded from https://github.com/
lishuya17/MONN.