Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure
Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure
Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure
Abstract
Identifying a protein’s functional sites is an important step towards characterizing its molecular function. Numerous
structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small
molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-
based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain
protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand
binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct
comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely
complementary information, which can be combined to improve upon either approach alone. We also demonstrate that
ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms
and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our
understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of
proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (http://compbio.cs.
princeton.edu/concavity/).
Citation: Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA (2009) Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence
Conservation and 3D Structure. PLoS Comput Biol 5(12): e1000585. doi:10.1371/journal.pcbi.1000585
Editor: Thomas Lengauer, Max-Planck-Institut für Informatik, Germany
Received May 11, 2009; Accepted October 30, 2009; Published December 4, 2009
Copyright: ß 2009 Capra et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: JAC has been supported by the Quantitative and Computational Biology Program NIH grant T32 HG003284. MS thanks the NSF for grant PECASE MCB-
0093399, and the NIH for grant GM076275. MS and TAF thank the NSF for grant IIS-0612231. This research has also been supported by the NIH Center of
Excellence grant P50 GM071508 and NIH grant CA041086. TAF also thanks the Leverhulme Trust and the BBSRC for funding his sabbatical at EBI. The funders had
no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: mona@cs.princeton.edu (MS); funk@cs.princeton.edu (TAF)
between sequence conservation, structural patterns, and functional methods are evaluated on 332 proteins from the non-redundant
importance. LigASite 7.0 dataset [24]. To evaluate pocket identification
performance, we predict ligand locations on the the holo version
Results of the dataset, in order to use the bound ligands’ locations as
positives. When evaluating residue predictions, we predict ligand
Preliminaries binding residues on the apo structures, and the residues annotated
For simplicity of exposition, we begin by comparing ConCavity’s as ligand binding (as derived from the holo structures) are used as
performance to a representative structural method and a positives.
representative conservation method. We use Ligsite+ as the We quantify the overall performance of each method’s
representative structure-based method, and refer to it as predictions in two ways. First, for both pocket and residue
‘‘Structure’’. Ligsite+ is our implementation (as indicated by prediction, we generate precision-recall (PR) curves that reflect
superscript ‘‘+’’) of a popular geometry based surface pocket the ability of each method’s grid and residue scores to identify ligand
identification algorithm. We demonstrate in the Methods section atoms and ligand binding residues, respectively. (Just as residues are
that Ligsite+ provides a fair representation of these methods. We assigned a range of ligand binding scores, grid points in predicted
choose Jensen-Shannon divergence (JSD) to represent conservation pockets get a range of scores, since there may be more evidence that
methods and refer to it as ‘‘Conservation.’’ JSD has been previously a ligand is bound in one part of a pocket than another.) Second, for
shown to provide state-of-the-art performance in identifying each set of predicted pockets (corresponding to groups of non-zero
catalytic sites and ligand binding sites [2]. We have developed values in the 3D grid), we consider how well they overlap known
three versions of ConCavity that integrate evolutionary conservation ligands via the Jaccard coefficient. The Jaccard coefficient captures
into different surface pocket prediction algorithms (Ligsite [16], the tradeoff between precision and recall by taking the ratio of the
Surfnet [14], or PocketFinder [23]). When the underlying algorithm is intersection of the predicted pocket and the actual ligand over their
relevant, we refer to these versions as ConCavityL, ConCavityS, and union. The Jaccard coefficient ranges between zero and one, and a
ConCavityP. However, for simplicity, we will use ConCavityL as high value implies that the prediction covers the ligand well and has
representative of these approaches and call it ‘‘ConCavity.’’ a similar volume. We assess the significance of the difference in
ConCavity and Structure produce predictions of ligand binding performance of methods on the dataset with respect to a given
pockets and residues. The pocket predictions are given as non-zero statistic via the Wilcoxon rank-sum test.
values on a regular 3D grid that surrounds the protein; the score
associated with each grid point represents an estimated likelihood Integrating evolutionary sequence conservation and
that it overlaps a bound ligand atom. Similarly, each residue in the structure-based pocket finding to predict ligand-binding
protein sequence is assigned a score that represents its likelihood of sites improves on either approach alone
contacting a bound ligand. Conservation only makes residue-level Figure 1 compares ConCavity with its constituent structure and
predictions, because it does not consider protein structure. All conservation based components. Figure 1A shows that, within
Figure 1. Ligand binding site prediction performance. (A) PR curves for prediction of the spatial location of biologically relevant bound
ligands. (B) PR curves for ligand binding residue prediction. Our ConCavity algorithm, which combines sequence conservation with structure-based
predictors, significantly outperforms either of the constituent methods at both tasks. Prediction based on structural information alone outperforms
considering sequence conservation alone. Comparing (A) and (B), we see that accurately predicting the location of all ligand atoms is harder for the
methods than finding all the contacting residues. Random gives the expected performance of a method that randomly ranks grid points and residues.
Conservation could not be included in (A), because it only predicts at the residue level. The curves are based on binding sites in 332 proteins from the
non-redundant LigASite 7.0 dataset.
doi:10.1371/journal.pcbi.1000585.g001
predicted pockets, grid points with higher scores are more likely to positions are highly conserved as well due to other functional
overlap the ligand, and that the significant improvement of constraints. In contrast, the structure-based predictions are
ConCavity over Structure (p,2.2e216) exists across the range of strongly clustered around surface pockets (Figure 3, left column);
score thresholds. Figure 1B demonstrates that the superior many of these residues near pockets are not evolutionarily
performance of ConCavity holds when predicting ligand binding conserved. However, these features provide largely complemen-
residues as well (p = 6.80e213). ConCavity’s ability to identify tary information about importance for ligand binding. Over the
ligand binding residues is striking: across this diverse dataset, the entire dataset, 68% of residues predicted by both Conservation and
first residue prediction of ConCavity will be in contact with a ligand Structure are in contact with ligands, while only 16% and 43% of
in nearly 80% of proteins. ConCavity also maintains high precision those predicted by only conservation or structure respectively are
across the full recall range: precision of 65% at 50% recall and ligand binding. ConCavity takes advantage of this complementarity
better than 30% when all ligand-binding residues have been to achieve its dramatic improvement; it gives high scores to
identified. As mentioned above, this large improvement exists positions that show evidence of both being in a well-formed pocket
when predicting ligand locations as well; however, the PR curves and being evolutionarily conserved.
illustrate that fully identifying a ligand’s position is more difficult The examples of Figures 2 and 3 illustrate this and highlight
for each of the methods than finding all contacting residues. several common patterns in ConCavity’s improved predictions. For
The ligand overlap statistics presented in Table 1 also 3CWK, a cellular retinoic acid-binding protein, Structure and
demonstrate the superior performance of ConCavity. In nearly ConCavity’s residue predictions center on the main ligand binding
95% of structures, ConCavity’s predictions overlap with a bound pocket (Figure 3A), while Conservation gives high scores to some
ligand. Structure’s predictions overlap ligands in nearly 92% of the positions in the binding site, but also to some unrelated residues
proteins considered. The differences between the methods become (Figure 2A). Looking at the ligand location predictions (green
more stark when we examine the magnitude of these overlaps. Both meshes in Figure 3A), Structure and ConCavity both find the pocket,
ConCavity and Structure predict pockets with total volume (Prediction but the signal from conservation enables ConCavity to more
Vol.) similar to that of all relevant ligands (Ligand Vol.), but accurately trace the ligand’s location. This illustrates how the
ConCavity’s pockets overlap a larger fraction of the ligand volume. pattern of functional conservation observed at the protein surface
influences the shape of the predicted pocket. Ligands often do not
Thus ConCavity has a significantly higher Jaccard coefficient
completely fill surface pockets; if the contacting residues are
(p,2.2e216). This suggests that the integration of sequence
conserved, our approach can suggest a more accurate shape.
conservation with structural pocket identification results in more
The results for 2CWH (Figure 3B) and 1G6C (Figure 3C)
accurate pockets than when using structural features alone.
demonstrate that ConCavity can predict dramatically different sets
Figure 1B also provides a direct comparison of ligand binding site
of pockets than are obtained when considering structure alone. In
prediction methods based on sequence conservation with those
2CWH, both methods identify the ligands, but Structure over-
based on structural features. Structure outperforms Conservation, a state-
predicts the bottom left binding pocket and predicts an additional
of-the-art method for estimating sequence conservation. Protein pocket that does not have a ligand bound. ConCavity traces the
residues can be evolutionarily conserved for a number of reasons, so ligands more closely and does not predict any additional pockets.
it is not surprising that Conservation identifies many non-ligand- Structure performs quite poorly on the tetramer 1G6C: it predicts
binding residues, and thus, does not perform as well as Structure. several pockets that do not bind ligands; it fails to completely
identify several ligands; and it misses one ligand entirely. In stark
ConCavity’s improvement comes from integrating contrast, ConCavity’s four predicted pockets each accurately trace a
complementary information from evolutionary sequence ligand. The incorporation of conservation resulted in the accurate
conservation and structure-based pocket identification prediction of a pocket in a region where no pocket was predicted
Figures 2 and 3 present pocket and residue predictions of using structure alone. Images of predictions for all methods on all
Conservation, Structure, and ConCavity on three example proteins. In proteins in the dataset are available in the Text S1 file, and
general, different types of positions are predicted by Conservation ConCavity’s predictions for all structures in the Protein Quaternary
and Structure. If we consider the number of known ligand binding Structure (PQS) database are available online.
residues for each protein in the dataset, and take this number of
top predictions for the Structure and Conservation methods, the ConCavity significantly outperforms available prediction
overlap is only 26%. The residues predicted by sequence servers
conservation are spread throughout the protein (Figure 2); We now compare the performance of ConCavity to several
ligand-binding residues are often very conserved, but many other existing ligand binding site identification methods with publicly
Table 1. The overlap between predicted pockets and bound ligands in holo protein structures from the LigASite database.
Fraction with Prediction Vol. Ligand Vol. Prediction \ Ligand Prediction | Ligand Jaccard
Method Ligand Overlap (Å3) (Å3) (Å3) (Å3) coefficient
The first column gives the fraction of proteins for which a method’s predictions overlap a ligand. The second column (Prediction Vol.) lists the average volume of the
predicted pockets for each protein, while the third column (Ligand Vol.) lists the average volume of ligands observed in the structure. The next columns give the
average volumes of the Intersection and Union of the predictions and ligands and the Jaccard coefficient (Intersection/Union). ConCavity and Structure predict pockets
of similar sizes---both use a similar pocket volume threshold---but ConCavity’s predictions overlap more of the bound ligands. ConCavity’s higher Jaccard coefficient
demonstrates that it better manages the tradeoff between precision and recall.
doi:10.1371/journal.pcbi.1000585.t001
Figure 3. Comparison of the binding site predictions of Structure and ConCavity on three example proteins. The three proteins presented here
correspond to those shown in Figure 2. In each pane, ligand binding residue scores have been mapped to the protein surface. Warmer colors indicate a
higher binding score. Pocket predictions are shown as green meshes. (A) PDB: 3CWK. Both methods identify the binding site, but by considering
conservation information (Figure 2A), ConCavity more accurately traces the ligand. (B) PDB: 2CWH. Structure significantly overpredicts the extent of the ligand
in the bottom left corner as well as predicting an additional pocket on the reverse of the protein. ConCavity predicts only the two ligand binding pockets. (C)
PDB: 1G6C. In order to visualize the predictions more clearly, only the secondary structure diagram of the protein is shown. This example illustrates the
difficulty presented by multichain proteins; there are many cavities in the structure, but not all bind ligands. Structure identifies some of the relevant pockets,
but focuses on the large, non-binding central cavity formed between the chains. Referring to this protein’s conservation profile (Figure 2C), we see that the
ligand binding pockets exhibit high conservation while the non-binding pockets do not. As a result, ConCavity selects only the relevant binding pockets. In
each example, ConCavity selects the binding pocket(s) out of all potential pockets and more accurately traces the ligands’ locations in these pockets.
doi:10.1371/journal.pcbi.1000585.g003
ConCavity performs similarly for geometry and energetics Shannon divergence (JSD) into Ligsite+, to represent the perfor-
based grid creation methods mance of the ConCavity approach. However, our strategy for
combining sequence conservation with structural predictions is
In the previous sections, we used ConCavityL, which integrates general; it can be used with a variety of grid-based surface pocket
evolutionary sequence conservation estimates from the Jensen- identification algorithms and conservation estimation methods.
Figure 5. Comparison of different versions of ConCavity. ConCavity provides a general framework for binding site prediction. We use Ligsite+ -
based ConCavity as representative, but it is possible to use other algorithms in ConCavity. This figure compares the PR curves for three versions
(ConCavityL, ConCavityP, ConCavityS )---each based on integrating sequence conservation with a different grid creation strategy (Ligsite+, PocketFinder+,
or Surfnet+). All three versions perform similarly, and all significantly outperform the methods based on structure analysis alone (dashed lines). These
conclusions hold for both ligand binding pocket (A) and ligand binding residue (B) prediction.
doi:10.1371/journal.pcbi.1000585.g005
The methods better identify ligand binding sites in specific region of the protein surface. Structure based surface
enzymes than non-enzymes cavity identification algorithms can guide analysis in such
The LigASite apo dataset contains protein molecules that carry situations [52].
out a range of different functions. Enzymes are by far the most To test ConCavity’s ability to identify drug binding sites, we
common; they make up 254 of the 332 proteins in the dataset. The evaluated it on a set of 98 protein-drug complexes [62]. The
remaining 78 non-enzyme ligand binding proteins are involved in superior performance provided by ConCavity over Structure on the
a wide variety of functions, e.g., transport, signaling, nucleic acid diverse set of proteins considered above suggests that ConCavity
binding, and immune system response. would likely be useful in the drug screening pipeline. Table 4
Table 3 compares the performance of the ligand binding site compares the ligand overlap PR-AUC and Jaccard coefficient for
prediction methods on enzymes and non-enzymes. There is more the three versions of ConCavity and their structure-based analogs.
variation within each method’s performance on non-enzyme Each ConCavity method significantly improves on the methods that
proteins, and all methods perform significantly better on the only consider structural features (e.g., p = 1.25e26 on overlap PR-
enzymes (e.g., p = 3.336e24 for ConCavityL ). Active sites in AUC and p = 2.06e26 on Jaccard for ConCavityL). While the
enzymes are usually found in large clefts on the protein surface improvement is not quite as large on this dataset as that seen on
and consistently exhibit evolutionary sequence conservation the more diverse LigASite dataset, it is still significant. It is possible
[60,61], so even though enzymes bind a wide array of substrates, that this is due to the fact that drug compounds are not the
these common features may simplify prediction when compared to proteins’ natural ligands; the evolutionary conservation of the
the variety of binding mechanisms found in other proteins. residues in binding pockets may reflect the pressures related to
Despite the drop in performance on non-enzyme proteins, the binding the actual ligands rather than the drugs.
main conclusions from the earlier sections still hold. However, the
improvement provided by ConCavity is not as great on the non- Examples of difficult structures
enzymes. This could be the result of the more complex patterns of While ConCavity signficantly outperforms previous approaches,
conservation found in non-enzyme proteins, and the compara- its performance is not flawless. In Figure 7, we give three example
tively poor performance of Conservation in this setting. It is also structures that illustrate patterns observed when ConCavity
possible that Ligsite+’s approach is particularly well suited to performs poorly. Handling these cases is likely to be important
identifying binding sites in non-enzymes. Overall, these results for further improvements in ligand binding site prediction.
highlight the importance of using a diverse dataset to evaluate The first pattern common among these difficult cases is
functional site predictions. evolutionary sequence conservation information leading predic-
tions away from actual ligand binding sites. Figure 7A provides an
ConCavity improves identification of drug binding sites example in which the ligand binding site is less conserved than
Knowledge of small molecule binding sites is of considerable use other parts of the protein. The ActR protein from Streptomyces
in drug discovery and design. Many of the techniques used to coelicolor (PDB: 3B6A) contains both a small molecule ligand-
screen potential targets, e.g., docking and virtual screening, are binding and a DNA-binding domain [63]. The ligand-binding
computationally intensive and feasible only when focused on a domain is in the bottom, less-conserved half of the structure. The
Residue PR-AUC
All methods perform better on holo structures than apo structures, but the
drop in performance is not dramatic, and the relative ranking of the methods is
the same across both datasets.
doi:10.1371/journal.pcbi.1000585.t002
This table compares the average grid value precision-recall AUC and the
average Jaccard coefficient of prediction-ligand overlap for ConCavity and
methods based on structural analysis alone on a set of 98 protein-drug
complexes. Integrating sequence conservation and structure-based pocket
finding improves the identification of drug binding sites. Conservation is not
included in this evaluation, because it does not make pocket-level predictions.
doi:10.1371/journal.pcbi.1000585.t004
Table 5. Catalytic residue identification (LigASite apo). well on multi-chain proteins; the presence of many non-ligand
binding pockets between chains has little effect on its performance.
While ConCavity outperforms previous approaches, we have
Method PR-AUC found two main causes of poor results: misleading evolutionary
sequence conservation information and ligands that bind partially
L
ConCavity 0.315 or entirely outside of well-defined concave surface pockets. Ligand
ConCavityP 0.301 binding sites may lack strong conservation for a number of
ConCavityS 0.288 reasons: the underlying sequence alignment may be of low quality,
Conservation 0.249
there may be other more conserved functional regions in the
+
protein, and some sites are hypervariable for functional reasons
Ligsite 0.190
[67]. The alignment quality issue will become less relevant as
PocketFinder+ 0.149 sequence data coverage and conservation estimation methods
Surfnet+ 0.142 improve. The second two cases may require the integration of
Random 0.012 additional features to better distinguish different types of functional
sites. Similarly, finding biologically relevant ligands that bind
ConCavity identifies more catalytic sites than other methods. However, in outside of concave surface pockets will likely require the
contrast to ligand binding residue prediction, Conservation outperforms the
structure-based approaches at detecting catalytic sites.
development of additional structural descriptors. Missing or
doi:10.1371/journal.pcbi.1000585.t005 incomplete ligands also affect the apparent performance of the
methods, but such issues are unavoidable due to the nature of the
structural data.
Several previous methods have combined sequence conserva- In implementing and evaluating previous 3D grid-based ligand
tion and structural properties in machine learning frameworks to binding site prediction approaches, we have found that the
predict catalytic sites [5,50,51]. Direct comparison with these methods used both to aggregate grid values into coherent pockets
methods is difficult because most datasets and algorithms are not as well as to map these pockets onto surface residues can have a
readily available. Tong et al. [51] compared the precision and large effect on performance. In order to focus on the improvement
recall of several machine learning methods on different datasets in provided by considering evolutionary sequence conservation, the
an attempt to develop a qualitative understanding of their relative results for previous structure-based methods presented above use
performance. While it is not prudent to draw conclusions based on our new algorithms for these steps. We describe the details of our
cross-dataset comparisons, we note for completeness that Con- approaches in the Methods section. On a high level, the new
Cavity’s catalytic site predictions the diverse LigASite dataset methodologies we propose provide significant improvement by
achieve higher precision (23.8%) at full recall than the maximum predicting a flexible number of well-formed pockets for each
precision (over all recall levels) reported for methods in their structure and by assigning each residue a likelihood of binding a
comparisons. ligand based on its local environment rather than on the rank of
the entire pocket. We have used morphological properties of
Discussion ligands to guide pocket creation, but the most appropriate
algorithms for these steps depend strongly on the nature of the
Evolutionary sequence conservation and protein 3D structures prediction task. These steps have received considerably less
have commonly been used to identify functionally important sites; attention than computing grid values; our results suggest that
here, we integrate these two approaches in ConCavity, a new they should be given careful consideration in the future.
algorithm for ligand binding site prediction. By evaluating a range We have focused on the prediction of ligand binding sites, but
of conservation and structure-based prediction strategies on a the direct synthesis of conservation and structure information is
large, diverse dataset of ligand binding sites, we establish that likely to be beneficial for predicting other types of functionally
structural approaches generally outperform sequence conserva- important sites. Our application of ConCavity to catalytic site
tion, and that by combining the two, ConCavity outperforms prediction illustrates the promise and challenges of such an
conservation-alone and structure-alone on about 95% and 70% of approach. Catalytic sites are usually found in surface pockets, but
structures respectively. Overall, ConCavity’s first predicted residue considering structural evidence alone performs quite poorly---
contacts a ligand in nearly 80% of the apo structures examined, worse than sequence conservation. Combining structure with
and it maintains high precision across all recall levels. These results evolutionary conservation provides a modest gain in performance
hold for the three variants of ConCavity we considered, each of over conservation alone. Protein-protein interface residues are
which uses a different underlying structure-based component. In another appealing target for prediction; much can be learned
addition, ConCavity’s integrated approach provides significant about a protein by characterizing its interactions with other
improvement over conservation and structure-based approaches proteins. However, protein-protein interaction sites provide
on the common task of identifying drug binding sites. additional challenges; they are usually large, flat, and often poorly
Combining sequence conservation-based methods with struc- conserved [68]. ConCavity is not appropriate for this task. Other
ture information is especially powerful in the case of multi-meric types of functional sites also lack simple attributes that correlate
proteins. Our analysis has shown that the performance of strongly with functional importance. Analysis of these sites’
structural approaches for identifying ligand binding sites dramat- geometries, physical properties, and functional roles will produce
ically decreases as the number of chains in the structure increases; more accurate predictors, and may also lead to new insights about
conservation alone outperforms structure-based approaches on the general mechanisms by which proteins accomplish their
proteins with five or more chains. It is difficult to determine from molecular functions.
structural attributes alone if a pocket formed at a chain interface In summary, this article significantly advances the state-of-the-
binds a ligand or not. However, ligand binding pockets usually art in ligand binding site identification by improving the
exhibit high evolutionary sequence conservation. ConCavity, which philosophy, methodology, and evaluation of prediction methods.
takes advantage of this complementary information, performs very It also increases our understanding of the relationship between
evolutionary sequence conservation, structural attributes of The goal is to produce grid values that correlate with the
proteins, and functional importance. By making our source code likelihoods of finding a bound ligand at each grid point.
and predictions available online, we hope to establish a platform Several methods have been proposed to produce grids of this
from which the prediction of functional sites and the integration of type. For example, Ligsite [16] produces a grid with values between
sequence and structure data can be investigated further. 0 and 7 by scanning for the protein surface along the three axes
and the four cubic diagonals. For each grid point outside of the
Methods protein, the number of scans that hit the protein surface in both
directions---so-called protein-solvent-protein (PSP) events---is the
ConCavity value given to that point. A large number of PSP events indicate
This section describes the components of the ConCavity that the grid point is surrounded by protein in many directions and
algorithm for predicting ligand binding residues from protein 3D thus likely to be in a pocket.
structures and evolutionary sequence conservation. Surfnet [14] assigns values to the grid by constructing spheres
ConCavity proceeds in three conceptual steps: grid creation, that fill the space between pairs of protein atoms without
pocket extraction, and residue mapping (Figure 8). First, the overlapping any other atoms. These sets of spheres are constructed
structural and evolutionary properties of a given protein are used for all pairs of protein surface atoms within 10 Å of each other.
to create a regular 3D grid surrounding the protein in which the Spheres with a radius smaller than 1.5 Å are ignored, and spheres
score associated with each grid point represents an estimated are allowed to have a maximum radius of 4 Å. This procedure
likelihood that it overlaps a bound ligand atom (Figure 8A). results in a set of overlapping spheres that fill cavities and clefts in
Second, groups of contiguous, high-scoring grid points are the protein. Extending the original algorithm slightly, we assign
clustered to extract pockets that adhere to given shape and size the value for each grid point to be the number of spheres that
constraints (Figure 8B). Finally, every protein residue is scored overlap it (rather than simply one for overlap and zero for no
with an estimate of how likely it is to bind to a ligand based on its overlap as in the original algorithm). Thus, higher values are
proximity to extracted pockets (Figure 8C). generally associated with the positions in the ‘‘center’’ of a pocket.
Grid-based strategies have been employed by several previous PocketFinder [23] assigns values to grid points by calculating the
systems for ligand binding site prediction (e.g., [14,16,23]). van der Waals interaction potential of an atomic probe with the
However, our adaptations to the three steps significantly affect protein. The Lennard-Jones function is used to estimate the
the quality of predictions. First, we demonstrate how to integrate interaction potential between the protein and a carbon atom
evolutionary information directly into the grid creation step for placed at each grid point. The potential at a grid point p is:
three different grid-based pocket prediction algorithms. Second,
we introduce a method that employs mathematical morphology X C a C6a
12
operators to extract well-shaped pockets from a grid. Third, we V ð pÞ~ { ð1Þ
provide a robust method for mapping grid-based ligand binding a[protein
r12 r6
predictions to protein residues based on Gaussian blurring. The
a
details of these three methods and an evaluation of their impacts where C12 and C6a are constants (taken from AutoDock [69]) that
on ligand-binding predictions are described in the following shape the Lennard-Jones function according to the interaction
subsections. energy between the carbon probe atom and protein atom a, and r
Grid creation. The first step of our process is to construct a is the distance between the grid point p and a (interactions over
3D regular grid covering the free-space surrounding a protein. distances greater than 10 Å are ignored).
Figure 8. ConCavity prediction pipeline. The large gray shape represents a protein 3D structure; the triangles represent surface residues; and the
gray gradient symbolizes the varying sequence conservation values in the protein. Darker shades of each color indicate higher values. (A) The initial
grid values come from the combination of evolutionary sequence conservation information and a structural predictor, in this example Ligsite. The
algorithm proceeds similarly for PocketFinder and Surfnet. (B) The grid generated in (A) is thresholded based on morphological criteria so that only
well-formed pockets have non-zero values. For simplicity, only grid values near the pockets are
0
shown. (C) Finally, the grid representing the pocket
predictions is mapped to the surface of the protein. We perform a 3D Gaussian blur (s~4A) of the pockets, and assign each residue the highest
overlapping grid value. Residues near regions of space with very high grid values receive the highest scores.
doi:10.1371/journal.pcbi.1000585.g008
Other grid creation methods have been proposed as well, but A slightly more adaptive method is used in PocketFinder [23]. In
these three (Ligsite, Surfnet, and PocketFinder) provide a representative ‘‘StdDev’’ the mean and standard deviation of values in the grid are
set for our study. used to determine a different threshold for0 every protein.
We augment these algorithms by integrating evolutionary Specifically, the grid is blurred with s~2:6A, and then the
information into the grid creation process. Our methodology is threshold is set to be 4.6 standard deviations above the mean of the
based on the observation that these (and other) grid creation grid values. This approach is problematic because the threshold
algorithms operate by accumulating evidence (‘‘votes’’) for ligand depends on the parameters of the grid; any change to how the
binding at grid points according to spatial relationships to nearby protein is embedded in the grid (e.g., orienting the protein
protein atoms. For PocketFinder, each protein atom casts a ‘‘vote’’ differently, changing the distance between the protein and the grid
for nearby grid points with magnitude equal to the (opposite of the) boundary, etc.) will affect the mean and standard deviation of the
van der Waals potential. In Ligsite, every pair of protein atoms grid values, which in turn will affect the threshold chosen to
‘‘votes’’ for solvent-accessible grid points on line segments between extract pockets. For example, simply making the extent of the grid
them. In our implementation of Surfnet, pairs of atoms ‘‘vote’’ for 10% greater will include a large number of near-zero values in the
all the grid points overlapping a sphere covering the solvent grid, which will bring the threshold down and make the extracted
accessible region between them. pockets larger. Also, no control is provided over the number and
Based on this observation, we weight the ‘‘votes’’ as the grid is size of pockets; it is possible that for some proteins no grid values
created by an estimate of sequence conservation of the residue(s) are 4.6 standard deviations above the mean, in which case no
associated with the atom(s) that generate the votes. We tested pockets will be predicted.
several schemes for scaling votes. If c1 and c2 are estimated It is difficult to control the number, sizes, and shapes of
conservation scores associated with the relevant atoms (e.g., extracted pockets with Threshold and StdDev. In both methods a
derived from their residues’ conservation in multiple sequence threshold is applied to every grid point independently and clusters
alignments), we scaled the structure-based component by the are formed only on the basis of geometric proximity between grid
c1 zc2 points, so it is possible to extract a set of pockets that have
product (c1 c2 ), the arithmetic mean ( ), the geometric mean
pffiffiffiffiffiffiffiffiffi 2 biologically implausible shapes. For example, there is no way to
( c1 c2 ), the product of exponentials (2c1 2c2 ), and the product of guarantee that the algorithm won’t extract one very large pocket
exponentials of transformed conservation values (22c1 {1 22c2 {1 ). that covers a significant fraction of the protein surface, or many
Each of these schemes provides improvement for all methods, but small pockets distributed across the protein surface, and/or
due to method specific differences, no single weighting scheme is pockets that contain long, thin regions where the cross-sectional
best for all methods. Specifically, for PocketFinder, which has only one diameter is too small to fit a bound ligand. Of course, it is possible
atom associated with each vote, we scale the vote (van der Waal’s to trim/discard such pockets after they have been extracted
potential) of each atom linearly by c1 . For Ligsite we scale the votes according to geometric criteria using post-processing algorithms
by arithmetic mean of the conservation values and for Surfnet by the [1,23,56]. However, unless there is feedback between the method
product of the exponentials of the transformed conservation values. used to select a grid threshold and the method used to cull pockets,
In our study, conservation scores are calculated by the Jensen- then there is no way to guarantee that a biologicaly plausible set of
Shannon divergence (JSD) with sequence weighting and a gap pockets is output, i.e., it is possible that none of the pockets
penalty [2]; however, any sequence conservation measure that extracted with the chosen grid threshold meet the culling criteria.
produces residue scores (which are then mapped to atoms within In ConCavity, we integrate extraction and culling of pockets into
the residues) could be incorporated. a single framework. We perform a binary search for the grid
Performance. The superior performance of our ConCavity grid threshold that produces a culled set of pockets that have specified
creation method at predicting ligand binding pockets and residues is properties (maximum number of pockets, total volume of all
demonstrated in Figure 5 of the Results section. The only difference pockets, minimum volume for any pocket, minimum cross-
between the ConCavity methods (ConCavityL, ConCavityS, ConCavityP) sectional radius for any pocket, and maximum distance from
and their counterparts based on structure alone (Ligsite+, Surfnet+, protein surface). Specifically, for each step of the binary search, we
PocketFinder+) is the use of sequence conservation in the grid creation select a grid threshold, extract a set of pockets (connected
step. For each grid creation strategy, considering evolutionary components of grid points having values above the threshold),
conservation yields significant improvement. and then apply a sequence of culling algorithms to trim/discard
Pocket extraction. The second step of our process is to pockets based their sizes and shapes. The algorithm iterates,
cluster groups of contiguous, high-scoring grid points into pockets adjusting the threshold up or down, if the set of pockets resulting
that most likely contain bound ligands. from the culling operations does not meet the specified global
Several methods have been previously proposed to address this properties. The binary search terminates when it has found a set of
problem. The simplest is to apply a fixed threshold to the grid, i.e., pockets meeting all of the specified properties or determines that
eliminate all grid points below some given value. Then, the none is possible. We call this method ‘‘Search’’.
remaining grid points can be clustered into pockets (e.g., Specifically, the culling steps are implemented with a series of
connected components), and small pockets can be discarded. This grid-based filters, each of which runs in compute time that grows
method, which we call ‘‘Threshold’’, has been used in previous linearly with the size of the grid. Given a current guess for the grid
versions of Ligsite [1,16]. A problem with this approach is that the threshold, the first filter simply zeroes all grid points whose value is
threshold is set to the same value for all proteins, which provides below the threshold value.
no control over the total number and size of pockets predicted by The second filter zeroes grid points whose distance from the van
the algorithm. In the worst case, when every grid value is below der Waal’s surface of the protein exceeds a given threshold,
the threshold, then the algorithm will predict no pockets. On the max_protein_offset. This filter is computed by first rasterizing a
other hand, if the threshold is too low, then there will be many sphere for all atoms of the protein into a grid, setting every grid
large pockets. Different proteins have different types of pockets, so point within the van der Waal’s radius of any protein atom to one
no one threshold can extract appropriately sized and shaped and the others to zero. Then, the square of the distance from each
pockets for all of them. grid point to the closest point on the van der Waal’s surface is
computed with three linear-time sweeps, and the resulting squared thus we set max_protein_offset to 5 Å. In order to target binding sites
distances are used to zero grid points of the original grid if the for biologically relevant ligands, we set min_pocket_radius to 1 Å and
squared distance is greater than max_protein_offset2. min_pocket_volume to 100 Å3. Based on the observation that the total
The third filter ensures that no part of a pocket has cross- volume of all bound ligands is roughly proportional to the total
sectional radius less than a given threshold, min_pocket_radius. This volume of the protein [17], we set total_pocket_volume to a given
filter is implemented with an ‘‘opening’’ operator from mathe- fraction of the total protein volume---2% in our studies (Text S1).
matical morphology. Intuitively, the boundary of every pocket
Finally, we set the grid resolution to 1 Å and e to 1 Å3.
(non-zero values of the grid) is ‘‘eroded’’ by min_pocket_radius and
Performance. To assess the impact of different pocket extraction
then ‘‘dilated’’ by the same amount, causing regions with cross-
strategies on the precison and accuracy of binding site detection,
sectional radius less than the threshold to be removed, while the
we implemented several alternative methods and compared how
others are unchanged. This operator is implemented with two well the pockets they predict overlap with ligands in holo
computations of the squared distances from pocket boundaries, structures from the LigASite dataset. Table 6 shows the results
each of which takes linear time in the size of the grid. of several pocket extraction algorithms (second column) on three
The fourth filter constructs connected components of the grid different grids types (first column). In addition to Thresh and StdDev,
and then zeros out grid points within components whose volume is Largest(N) refers to zeroing all grid entries not in the largest N
less than a given threshold, min_pocket_volume. Connected compo- pockets (connected components).
nents are computed with a series of depth-first traversals of The statistics presented in Table 6 reflect various attributes of
neighboring non-zero grid points, which take linear time all the pockets predicted by each extraction technique. The Jaccard
together, and pockets are sorted by volume using quicksort, which coefficient (Intersection/Union) ranges between zero and one and
takes OðplogpÞ time for p pockets. takes into account the natural tradeoff between recall and
After these filters are executed for each iteration, the total precision by rewarding predictions that overlap the known ligands
volume of all remaining pockets is computed and compared to a (large Intersection) and penalizing methods that predict very large
given target volume, total_pocket_volume. If the total volume is pockets (large Union). Thus, it is a suitable measure for comparing
greater (less) than the target, the grid threshold is increased the overall performance of the pocket extraction methods. For
(decreased) to a value half-way between the current threshold and example, though the pockets of PocketFinder+ with the StdDev
the maximum (minimum) possible threshold---initially the largest (meanzs) extraction method have very high recall (0.900), its
(smallest) value in the grid---and the minimum (maximum) is set to Jaccard coefficient is very low, because the predicted pockets have
the current threshold. The process is repeated with the new a very large average volume (49x more than the ligands). For each
threshold until the total volume of all pockets is within e of the grid type, our Search pocket extraction method predicts pockets
given total_pocket_volume. Note that we perform a 1 Å Gaussian blur with volumes close to the actual ligand volume and obtains the
on the Ligsite grid before beginning this search to provide finer best Jaccard coefficient. As a result, we use Search in ConCavity and
control over the predicted pockets than is provided by the Ligsite our implementations of previous grid based methods.
integer grid values. Residue mapping. The third step of our pipeline uses the
We set the parameters for these filters empirically. In previous extracted set of pockets to generate ligand-binding predictions for
studies, it has been observed that the vast majority of bound ligand residues. Our goal is to score every residue based on its
atoms reside within 5 Å of the protein’s van der Waal’s surface, relationship to the extracted pockets such that residues with
For three types of grids (first column), we ran different pocket extraction algorithms (second column) and compared how well the pockets overlap bound ligands in
holo PQS structures. The third column (‘‘Prediction Vol.’’) lists the average volume of all predicted pockets over each protein. For reference, the average volume of all
ligands observed in the PQS files (‘‘Ligand’’) is 1977.2 Å3. The next two columns list the average volumes of the Intersection (Ligand \ Prediction) and Union (Ligand |
Prediction) of the Prediction and Ligand grids. Finally, the rightmost four columns list the average over-prediction factor (Prediction/Ligand), precision (Intersection/
Prediction), recall (Intersection/Ligand), and Jaccard coefficient (Intersection/Union). For the last three columns, values range between zero and one, and higher values
represent better performance. Comparing the average volume of the pockets predicted by each method, we see that Search’s pockets are closest to the actual ligand
volumes. Moreover, Search’s high Jaccard coefficient for each grid type indicates that it provides the best tradeoff between recall and precision among the methods
tested.
doi:10.1371/journal.pcbi.1000585.t006
higher scores are more likely to bind ligands. This goal is more Table 7. Comparison of residue mapping strategies.
ambitious than that of previous residue mapping approaches
which have sought only to identify the residues associated with
predicted pockets. Pocket Grid Source
Perhaps the simplest and most common previous method is to L
Mapping Method ConCavity ConCavityP ConCavityS
mark all residues within some distance threshold, d, of any pocket
as binding (e.g., score = 1) and the rest as not binding (e.g., score Blur 0.608 0.602 0.587
= 0) [25]. We call this method ‘‘Dist-01.’’ Both the pocket surface Dist-Raw 0.477 0.553 0.509
and geometric center have been taken as the reference point
Dist-Size 0.442 0.486 0.474
previously; we use the pocket surface in Dist-01. This approach
Dist-Cons 0.426 0.473 0.437
ignores all local information about the predicted pockets. Two
related methods have incorporated attributes of predicted pockets Dist-01 0.404 0.455 0.414
into Dist-01. The first assigns residues near pockets scores that
We applied five residue mapping algorithms to three grids of predicted pockets
reflect the size of the closest pocket (‘‘Dist-Size’’) [1]; residues near (ConCavityL, ConCavityP, ConCavityS ). This table lists the PR-AUC for identifying
the largest pocket receive the highest score and so on. A similar ligand binding residues in the LigASite apo dataset for each combination. Our
approach uses the average conservation of all residues near the Blur algorithm achieves the best performance for each grid type.
pocket (‘‘Dist-Cons’’) [1] to rank the pockets and assign rank-based doi:10.1371/journal.pcbi.1000585.t007
scores to residues. previous methods considered in our evaluations. In some cases we
In ConCavity, our goal is to assign scores to residues based on have completely reimplemented strategies and in others we have
their likelihood of binding a ligand. We use the original grid values post-processed the output of existing implementations. Table 8
(which reflect the predicted likelihood of a ligand at every point in provides a summary of these details. As mentioned earlier, a ‘‘+’’
space) to weight the scores assigned to nearby residues. Starting appended to the method name indicates that it is (at least in part)
with the grid values within the set of0extracted pockets, we blur this our implementation, e.g., Ligsite+.
grid with a Gaussian filter (s~4A), and then assign to every Ligsite+, Surfnet+, and Pocketfinder+. We developed new
residue the maximum grid value evaluated at the location of any of implementations of the Ligsite, Surfnet, and Pocketfinder methods for
its atoms. This method, which we call ‘‘Blur,’’ assigns residues in grid generation. This was necessary to allow us to fully integrate
the same pocket different scores, since some residues are in the sequence conservation with these methods. However, it also enabled
middle of a binding site, next to the part of a pocket with highest us to investigate the the effect of different pocket extraction and
grid values, while others are at the fringe of a site, near marginal residue mapping algorithms on overall performance.
grid values. The score assigned by Blur reflects these differences in By default, we use Search to extract pockets and Blur to map to
the likelihood that an individual residue is ligand binding. residues for Ligsite+, Surfnet+, and Pocketfinder+, because as was shown
In contrast to Blur, none of the previous residue extraction above, these approaches yield the best performance. Our
methods give different scores to residues in the same pocket. For implementations output representations of the predicted ligand
comparison, we developed a version of the Dist strategy that (like binding pockets and ranked lists of contacting residues, so they can
Blur) considers the original grid values. Dist-Raw simply assigns to be included in both pocket and residue-based evaluations.
each residue within d of a pocket, the value of the nearest pocket LigsiteCS, Q-SiteFinder, and CASTp. For our experiments,
grid point. we generate binding site predictions using three publicly available
Performance. We analyze the performance of these residue web servers: LigsiteCS [1], QSiteFinder [25], and CASTp [19]. Each
mapping approaches by comparing their PR-AUC on the task of these servers produces a list of predicted pockets represented by
of predicting ligand binding residues as defined in the LigASite sets of residues. In each case, the residues do not have scores
apo dataset. In each case we start with the same grid of extracted associated with them. Thus to include these methods in the ligand
pockets and apply a different residue mapping algorithm. We binding residue prediction evaluation, we must assign scores to the
consider all residue mapping strategies on three different pocket residues. We tried two approaches. The first assigns all predicted
grids: ConCavityL, ConCavityS, and ConCavityP. For all Dist approach- residues a score of one and all others a score of zero. The second
es, we set d to 5 Å, and for Dist-Cons we consider the conservation ranks the residues by the highest ranking pocket to which they are
of all residues within 8 Å of the pocket (as in [1]). assigned, i.e., all residues from the first predicted pocket are given
The results presented in Table 7 demonstrate that Blur provides a higher score than those from the second and so on. These
better performance for each grid type than all versions of previous approaches are similar to the residue mapping algorithms
residue mapping approaches. Thus, we use Blur in ConCavity and discussed in the ConCavity section above; however, those exact
our implementations of previous ligand binding site prediction algorithms could not be applied here because the web servers do
algorithms. The two methods that assign residues scores based on not provide representations of the full extent of predicted pockets.
the values of nearby grid points (Blur and Dist-Raw) provide better We found that residue ranking produces better results (data not
performance in each case than those that assign all residues in a shown), so we use this approach. We consider the default number
pocket the same score based on a global property of the pocket of pockets predicted by each method: LigsiteCS returns three
(Dist-Size and Dist-Cons). This suggests that the local environment pockets; Q-SiteFinder returns ten pockets; and CASTp returns a
around residues should be considered when predicting binding variable number. The Q-SiteFinder web server would not accept
sites. proteins with more than 10,000 atoms.
LigsiteCS, Q-SiteFinder, and CASTp do not provide a representa-
Previous methods tion of each predicted pocket’s full extent, so they could not be
We have compared ConCavity to several methods for ligand included in the ligand location prediction evaluation.
binding site prediction. Many of these methods lack publicly LigsiteCSC+. The LigsiteCSC method is an extension of
accessible implementations, and those that are available output LigsiteCS that uses the evolutionary sequence conservation of
different representations of predicted pockets and residues. In this residues surrounding predicted pockets to reorder the pocket
section, we describe of how we generate predictions for all predictions. This feature on the LigsiteCS prediction server did not
This table summarizes the details of each step of the ligand binding site prediction process for the methods we evaluate. The new ConCavity methods are based entirely
on our code. We also developed new implementations (Ligsite+, PocketFinder+, and Surfnet+) of three previous methods. Predictions for the other previous methods
were obtained from the listed publicly accessible web servers. These servers output sets of residues associated with predicted binding pockets. For inclusion in the
residue prediction evaluation, the output of these servers was post-processed as specified. This step is not necessary for our methods, because Blur outputs ranked
residue predictions. A ‘‘+’’ appended to the method name indicates that it is based (at least in part) on our code. Implementation details of each algorithm are given in
the text, and code for our implementations is available online.
doi:10.1371/journal.pcbi.1000585.t008
work for many PQS structures in our dataset, so we implemented which biologically relevant ligands are identified in order to define
our own version on top of the LigsiteCS results. For each pocket, we ligand binding residues and map them to the apo structure. If
calculate the average conservation of all residues within 8 Å of the multiple holo structures are available for the protein, the sets of
pocket center. The JSD method on the HSSP alignments is used to contacting residues are combined to define the binding residues for
produce the conservation scores. The top three pockets in terms of the apo structure. We select the structures for our LigASite holo
size are then ranked in terms of average conservation. This evaluation set by taking the holo structure with the most ligand
implementation follows the published description of LigsiteCSC, contacting residues for each apo structure. The average number of
except for the use of JSD for conservation instead of ConSurf. holo structures for each apo structure is 2.58, and the maximum
Jensen-Shannon Divergence. The Jensen-Shannon divergence for any single structure is 32. The average chain length is 276
(JSD) is used to represent the performance of evolutionary sequence residues with a minimum of 59 and a maximum of 1023. The
conservation; it was recently shown to provide state-of-the-art average number of positives---sites contacting a biologically
performance on a range of functional site prediction tasks [2]. It relevant ligand---per chain is 25 residues (about 11% of the
compares the amino acid distribution observed in columns of a chain). The apo dataset includes many proteins with multiple
multiple sequence alignment of homologs to a background chains; the average number of chains per protein is 2.22. The
distribution. JSD scores range between zero and one. The code chain distribution is: 1 chain: 143, 2 chains: 112, 3 chains: 18, 4
provided in Capra and Singh [2] with the default sequence weighting chains: 35, 5 or more chains: 24.
and gap penalty was used to score all alignments. The drug dataset comes from a set of 100 non-redundant 3D
structures selected by [62]. This set contains a diverse set of high-
Data quality structures (resolution ,3 Å) with drug or drug-like
The prediction methods described in this paper take protein 3D molecules (molecular weight between 200 and 600, and 1212
structures and/or multiple sequence alignments as input. Protein rotatable bonds) bound. Structure 1LY7 has been removed from
structures were downloaded from the Protein Quaternary the PDB, and 1R09 could not be parsed. We consider the 98
Structure (PQS) server [58]. Predicted quaternary structures were remaining structures.
used (rather than the tertiary structures provided in PDB files) so as The catalytic site annotations were taken from version 2.2.9 of
to consider pockets and protein-ligand contacts for proteins in the Catalytic Site Atlas [66]. There are 153 proteins in the
their biologically active states. All alignments come from the LigASite apo dataset with entries in the Catalytic Site Atlas. These
Homology-derived Secondary Structure of Proteins (HSSP) proteins have an average of 3.2 catalytic sites per chain (just over
database [70]. All images of 3D structures were rendered with 1% of all residues in the chain).
PyMol [71].
Ligand binding sites as defined by the non-redundant version of Evaluation
the LigASite dataset (v7.0) [24] were used to evaluate method Predictions of ligand binding pockets are represented by non-
predictions. This set consists of 337 proteins with apo (unbound) zero values in a regular 3D grid around the protein. These
structures, each having less than 25% sequence identity with any represent regions in space thought to contain ligands. These
other protein in the set. Five of the 337 structures were left out of predictions are evaluated in two ways: on the pocket level by
the evaluation: 1P5T, 1YJG, and 3DL3 lacked holo ligand computing their overlap with known ligands, and on the grid level
information in the database, and 2PCY and 3EZM, because their by analyzing how well the grid scores rank grid points that overlap
corresponding holo structures are not in PQS or HSSP. Each apo ligand atoms. We use a grid with rasterized van der Waals spheres
structure has at least one associated holo (bound) structure in for ligand atoms from the PQS structure as the ‘‘positive’’ set of
grid points. From this, we calculate the intersection and union of dataset. The significance of the difference in performance of a
the actual ligand atoms and the predictions. We compare methods single method on different datasets is calculated with the Wilcoxon
using the over-prediction factor (Prediction Volume/Ligand rank-sum test.
Volume), precision (Intersection Volume/Prediction Volume), For the residue-based evaluation, we consider how well each
recall (Intersection Volume/Ligand Volume), and Jaccard coeffi- method’s residue scores identify ligand binding residues. Positives
cient (Intersection Volume/Union Volume). are those residues in contact with a ligand as defined by LigASite
We also create precision-recall (PR) curves, which compare database. PR curves were made by calculating, for each chain, the
precision (TP/(TP + FP)) on the y-axis with recall (TP/(TP + FN)) precision and recall at each position on the ranked list of residue
on the x-axis, to evaluate the ability of each method to predict scores. Composite PR curves were computed as described for the
whether a ligand atom is present at a grid point. We consider grid grid point evaluation, but curves were first averaged over the
points that overlap a ligand atom as positives. To construct the PR chains in a structure and then over structures. PR curves were
curve, we calculate the precision and recall at each cutoff of the constructed similarly for the catalytic site analysis, but positives
grid values in the pocket prediction grid. To summarize the were defined as those residues listed in the Catalytic Site Atlas.
performance of each method, we construct a composite PR curve
[72] by averaging the precision at each recall level for each Supporting Information
structure in the dataset. As a reference point, we include the
performance of a random classifier averaged over all the structures Text S1 Supplementary text, results, and analysis.
as well. The expected performance of a random method is the Found at: doi:10.1371/journal.pcbi.1000585.s001 (0.39 MB PDF)
number of positives over the number of all grid points. The
method and code of Davis and Goadrich [73] is used to calculate Author Contributions
the area under the PR curve (PR-AUC). The significance of the Conceived and designed the experiments: JAC MS TAF. Performed the
difference between methods is assessed using the Wilcoxon signed- experiments: JAC TAF. Analyzed the data: JAC MS TAF. Wrote the
rank test over paired performance statistics for all structures in the paper: JAC MS TAF. Provided methodological input: RAL JMT.
References
1. Huang B, Schroeder M (2006) LIGSITEcsc: predicting ligand binding sites using 20. Xie L, Bourne P (2007) A robust and efficient algorithm for the shape
the Connolly surface and degree of conservation. BMC Struct Bio 6: 19. description of protein structures and its application in predicting ligand binding
2. Capra J, Singh M (2007) Predicting functionally important residues from sites. BMC Bioinf 8: S9.
sequence conservation. Bioinf 23: 1875–1882. 21. Weisel M, Proschak E, Schneider G (2007) PocketPicker: analysis of ligand
3. Lopez G, Valencia A, Tress M (2007) firestar---prediction of functionally binding-sites with shape descriptors. Chem Cen J 1: 7.
important residues using structural templates and alignment reliability. Nucleic 22. Valdar W (2002) Scoring residue conservation. Proteins: Structure, Function,
Acids Res 35: W573–W577. and Genetics 48: 227–241.
4. Kuznetsov I, Gou Z, Li R, Hwang S (2006) Using evolutionary and structural 23. An J, Totrov M, Abagyan R (2005) Pocketome via comprehensive identification
information to predict DNA-binding sites on DNA-binding proteins. Proteins: and classification of ligand binding envelopes. Mol Cell Prot 4: 752–761.
Stuct, Func, and Bioinf 64: 19–27. 24. Dessailly B, Lensink M, Orengo C, Wodak S (2008) LigASite: a database of
5. Youn E, Peters B, Radivojac P, Mooney S (2007) Evaluation of features for biologically relevant binding sites in proteins with known apo-structures. Nucleic
catalytic residue prediction in novel folds. Prot Sci 16: 216–226. Acids Res 36: D667–673.
6. Ofran Y, Rost B (2007) Protein-protein interaction hotspots carved into 25. Laurie A, Jackson R (2005) Q-SiteFinder: an energy-based method for the
sequences. PLoS Comput Biol 3: e119. prediction of protein–ligand binding sites. Bioinf 21: 1908–1916.
7. Zhou H, Qin S (2007) Interaction-site prediction for protein complexes: a critical 26. Mayrose I, Graur D, Ben-Tal N, Pupko T (2004) Comparison of site-specific
assessment. Bioinf 23: 2203–2209. rate-inference methods: Bayesian methods are superior. Mol Biol Evol 21:
8. Hannenhalli S, Russell R (2000) Analysis and prediction of functional sub-types 1781–1791.
from protein sequence alignments. J Mol Biol 303: 61–76. 27. Wang K, Samudrala R (2006) Incorporating background frequency improves
9. del Sol Mesa A, Pazos F, Valencia A (2003) Automatic methods for predicting entropy-based residue conservation measures. BMC Bioinf 7: 385.
functionally important residues. J Mol Biol 326: 1289–1302. 28. Mihalek I, Res I, Lichtarge O (2004) A family of evolution–entropy hybrid
10. Kalinina O, Mironov A, Gelfand M, Rakhmaninova A (2004) Automated methods for ranking protein residues by importance. J Mol Biol 336: 1265–1282.
selection of positions determining functional specificity of proteins by 29. Sankararaman S, Sjolander K (2008) Intrepid–information-theoretic tree
comparative analysis of orthologous groups in protein families. Prot Sci 13: traversal for protein functional site identificantion. Bioinf 24: 2445–2452.
443–456. 30. Bahadur KD, Livesay D (2008) Improving position specific predictions of
11. Chakrabarti S, Bryant S, Panchenko A (2007) Functional specificity lies within protein functional sites using phylogenetic motifs. Bioinf 24: 2308–2316.
the properties and evolutionary changes of amino acids. J Mol Biol 373: 31. Fischer J, Mayer C, Soeding J (2008) Prediction of protein functional residues
801–10. from sequence by probability density estimation. Bioinf 24: 613–620.
12. Capra J, Singh M (2008) Characterization and prediction of residues 32. Elcock A (2001) Prediction of functionally important residues based solely on the
determining protein functional specificity. Bioinf 24: 1473–1480. computed energetics of protein structure. J Mol Biol 312: 885–896.
13. Levitt D, Banaszak L (1992) Pocket: A computer graphics method for identifying 33. Bate P, Warwicker J (2004) Enzyme/non-enzyme discrimination and prediction
and displaying protein cavities and their surrounding amino acids. J Mol of enzyme active site location using charge-based methods. J Mol Biol 340:
Graphics 10: 229–234. 263–276.
14. Laskowski R (1995) Surfnet: a program for visualizing molecular surfaces, 34. Hernandez M, Ghersi D, Sanchez R (2009) SITEHOUND-web: a server for
cavities, and intermolecular interactions. J Mol Graph 12: 323–330. ligand binding site identification in protein structures. Nucleic Acids
15. Peters K, Fauck J, Frömmel C (1996) The automatic search for ligand binding Res;doi:10.1093/nar/gkp281.
sites in proteins of known three-dimensional structure using only geometric 35. Ko J, Murga L, Andre P, Yang H, Ondrechen M, et al. (2005) Statistical criteria
criteria. J Mol Biol 256: 201–213. for the identification of protein active sites using theoretical microscopic titration
16. Hendlich M, Rippman F, Barnickel G (1997) LIGSITE: automatic and efficient curves. Proteins: Stuct, Func, and Bioinf 59: 193–195.
detection of potential small molecule-binding sites in proteins. J Mol Graph 36. Brylinski M, Skolnick J (2008) A threading-based method (FINDSITE) for
Model 15: 359–363. ligand-binding site prediction and functional annotation. Proc Natl Acad Sci
17. Liang J, Edelsbrunner H, Woodward C (1998) Anatomy of protein pockets and 105: 129–134.
cavities: Measurement of binding site geometry and implications for ligand 37. Halperin I, Wolfson H, Nussinov R (2003) SiteLight: binding-site prediction
design. Prot Sci 7: 1884–1897. using phage display libraries. Prot Sci 12: 1344–1359.
18. Brady Jr G, Stouten P (2000) Fast prediction and visualization of protein binding 38. Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, et al. (2004) Network
pockets with PASS. J Comp-Aided Mol Design 14: 383–401. analysis of protein structures identifies functional residues. J Mol Biol 344:
19. Dundas J, Ouyang Z, Tseng J, Binkowski A, Turpaz Y, et al. (2006) CASTp: 1135–1146.
computed atlas of surface topography of proteins with structural and 39. Landau M, Mayrose I, Rosenberg Y, Glaser Y, Martz E, et al. (2005) ConSurf
topographical mapping of functionally annotated residues. Nucleic Acids Res 2005: the projection of evolutionary conservation scores of residues on protein
34: W116–W118. structures. Nucleic Acids Res 33: W299–W302.
40. Nimrod G, Schushan M, Steinberg D, Ben-Tal N (2008) Detection of 57. Morgan D, Kristensen D, Mittleman D, Lichtarge O (2006) ET Viewer: An
functionally important regions in ‘‘hypothetical proteins’’ of known structure. application for predicting and visualizing functional sites in protein structures.
Structure 16: 1755–1763. Bioinf 22: 2049–2050.
41. Yao H, Kristensen D, Mihalek I, Sowa M, Shaw C, et al. (2003) An accurate, 58. Henrick K, Thornton J (1998) PQS: a protein quaternary structure file server.
sensitive, and scalable method to identify functional sites in protein structures. Trends Biochem Sci 23: 358–361.
J Mol Biol 326: 255–261. 59. Najmanovich R, Kuttner J, Sobolev V, Edelman M (2000) Side-chain flexibility
42. Panchenko A, Konrashov F, Bryant S (2004) Prediction of functional sites by in proteins upon ligand binding. Proteins: Structure, Function, and Genetics 39:
analysis of sequence and structure conservation. Prot Sci 13: 884–892. 261–268.
43. Chelliah V, Chen L, Blundell T, Lovell S (2004) Distinguishing structural and 60. Laskowski R, Luscombe N, Swindells M, Thornton J (1996) Protein clefts in
functional restraints in evolution in order to identify interaction sites. J Mol Biol molecular recognition and function. Prot Sci 5: 2438–2452.
342: 1487–1504. 61. Bartlett G, Porter C, Borkakoti N, Thornton J (2002) Analysis of catalytic
44. Cheng G, Qian B, Samudrala R, Baker D (2005) Improvement in protein residues in enzyme active sites. J Mol Biol 324: 105–121.
functional site prediction by distinguishing structural and functional constraints
62. Perola E, Walters W, Charifson P (2004) A detailed comparison of current
on protein family evolution using computational design. Nucleic Acids Res 33:
docking and scoring methods on systems of pharmaceutical relevance. Proteins:
5861–5867.
Stuct, Func, and Bioinf 56: 235–249.
45. Wang K, Horst J, Cheng G, Nickle D, Samudrala R (2008) Protein meta-
63. Willems A, Tahlan K, Taguchi T, Zhang K, Lee Z, et al. (2008) Crystal
functional signatures from combining sequence, structure, evolution, and amino
acid property information. PLoS Comput Biol 4: 9. structures of the Streptomyces coelicolor TetR-like protein ActR alone and in
46. Chen B, Fofanov V, Bryant D, Dodson B, Kristensen D, et al. (2007) The mash complex with actinorhodin or the actinorhodin biosynthetic precursor (S)-
pipeline for protein function prediction and an algorithm for the geometric DNPA. J Mol Biol 376: 1377–1387.
refinement of 3D motifs. J Comp Biol 14: 791–816. 64. Ling H, Boodhoo A, Hazes B, Cummings M, Armstrong G, et al. (1998)
47. Burgoyne N, Jackson R (2006) Predicting protein interaction sites: binding hot- Structure of the shiga-like toxin I B-pentamer complexed with an analogue of its
spots in protein-protein and protein-ligand interfaces. Bioinf 22: 1335–1342. receptor Gb3. Biochem 37: 1777–1788.
48. Yoon S, Ebert J, Chung E, DeMicheli D, Altman R (2007) Clustering protein 65. Charnock S, Bolam D, Nurizzo D, Szabó L, McKie V, et al. (2002) Promiscuity
environments for function prediction: finding PROSITE motifs in 3D. BMC in ligand-binding: The three-dimensional structure of a piromyces carbohydrate-
Bioinf 8 (Suppl 4): S10. binding module, cbm29-2, in complex with cello- and mannohexaose. Proc Natl
49. Gutteridge A, Bartlett G, Thornton J (2003) Using a neural network and spatial Acad Sci 99: 14077–14082.
clustering to predict the location of active sites in enzymes. J Mol Biol 330: 66. Porter C, Bartlett G, Thornton J (2004) The catalytic site atlas: a resource of
719–734. catalytic sites and residues identified in enzymes using structural data. Nucleic
50. Petrova N, Wu C (2006) Prediction of catalytic residues using support vector Acids Res 32: D129–D133.
machines with selected protein sequence and structural properties. BMC Bioinf 67. Magliery T, Regan L (2005) Sequence variation in ligand binding sites in
7: 312. proteins. BMC Bioinf 6: 240.
51. Tong W, Wei Y, Murga L, Ondrechen M, Williams R (2009) Partial order 68. Caffrey D, Somaroo S, Hughes J, Mintseris J, Huang E (2004) Are protein-
optimum likelihood (POOL): Maximum likelihood prediction of protein active protein interfaces more conserved in sequence than the rest of the protein
site residues using 3D structure and sequence properties. PLoS Comput Biol 5: surface? Prot Sci 13: 190–202.
1. 69. Morris G, Goodsell D, Halliday R, Huey R, Hart W, et al. (1998) Automated
52. Nayal M, Honig B (2006) On the nature of cavities on protein surfaces: docking using a Lamarckian genetic algorithm and an empirical binding free
Application to the identification of drug-binding sites. Proteins: Stuct, Func, and energy function. J Comp Chem 19: 1639–1662.
Bioinf 63: 892–906. 70. Dodge C, Schneider R, Sander C (1998) The HSSP database of protein
53. Wei L, Altman R (2003) Recognizing complex, asymmetric functional sites in structure-sequence alignments and family profiles. Nucleic Acids Res 26:
protein structures using a bayesian scoring function. J Bioinform Comput Biol 1: 313–315.
119–138. 71. DeLano W (2002) The PyMOL User’s Manual. DeLano Scientific, Palo Alto,
54. Bordner A (2008) Predicting small ligand binding sites in proteins using CA, USA. http://www.pymol.org.
backbone structure. Bioinf 24: 2865–2871.
72. Manning C, Raghavan P, Schütze H (2008) Introduction to Information
55. Ebert J, Altman R (2008) Robust recognition of zinc binding sites in proteins.
Retrieval. Cambridge University Press. pp 158–163.
Prot Sci 17: 54–65.
73. Davis J, Goadrich M (2006) The relationship between precision-recall and ROC
56. Glaser F, Morris R, Najmanovich R, Laskowski R, Thornton J (2006) A method
curves. Proc 23rd Int Conf on Machine Learning 23: 233–240.
for localizing ligand binding pockets in protein structures. Proteins: Stuct, Func,
and Bioinf 62: 479–488.