Proteins: Automated Prediction of Ligand-Binding Sites in Proteins
Proteins: Automated Prediction of Ligand-Binding Sites in Proteins
Proteins: Automated Prediction of Ligand-Binding Sites in Proteins
ABSTRACT INTRODUCTION
Methods for the prediction of protein function have progressed hand-in-hand with
We present a method, termed
the growth of structural information. For decades, visualization and simulation have
AutoLigand, for the predic-
tion of ligand-binding sites been used to test hypotheses about molecular recognition and catalysis, based on the
in proteins of known struc- crystallographic and NMR structures of macromolecules. One of the major successes of
ture. The method searches this work is rational drug design, wherein the knowledge of molecular interaction is
the space surrounding the applied to predict new, more effective inhibitors for a given target. Currently, with the
protein and finds the contig- growing body of structural data from structural genomics, these studies have expanded
uous envelope with the speci- to include de novo function prediction. The Protein Data Bank (http://www.pdb.org,
fied volume of atoms, which Oct 3, 2006 release) has over 500 structures designated as ‘‘hypothetical proteins,’’ for
has the largest possible inter- which a structure is known but the function is not. Methods are needed to predict
action energy with the pro- functions for these proteins and many others that will be provided in the near future
tein. It uses a full atomic by structural genomics projects.1–4 Both of these applications—rational drug design
representation, with atom
and function prediction—require a reliable method for identifying and characterizing
types for carbon, hydrogen,
oxygen, nitrogen and sulfur the primary ligand-binding site of a protein.
(and others, if desired), and Existing methods for ligand-site identification can be divided into three broad cate-
is designed to minimize the gories: genomic comparisons, ligand shape matching, and site geometry searching.
need for artificial geometry. Genomic techniques predict ligand-binding sites by the comparison of homologous
Testing on a set of 187 genomes or functional site residues and may include surface geometry.5–7 These tech-
diverse protein-ligand com- niques are based on the hypothesis that important binding site residues are conserved
plexes has shown that the within the gene families. Ligand-shape methods, on the other hand, are based on the
method is successful in pre- chemistry of a ligand, building three-dimensional templates or databases to represent
dicting the location and ap- characteristic distributions of hydrogen bonds, hydrophobicity, and other chemical pa-
proximate volume of the
rameters.8–12 However, the majority of binding-site prediction methods, including the
binding site in 73% of cases.
one described here, analyze the geometry of the receptor.
Additional testing was per-
formed on a set of 96 pro- Since most binding sites are found in clefts or pockets on a protein surface,13 man-
tein-ligand complexes with ual examination of a protein surface is a remarkably effective method for finding a
crystallographic structures of binding site. Thus, much of the work on structure-based binding site prediction has
apo and holo forms, and focused on geometric methods that look for clefts or pockets. The receptor geometry
AutoLigand was able to pre- methods can be divided into pure geometric methods,14–19 geometry with energy field
dict the binding site in 80% methods,20–22 and geometry mixed with other methods.23–25 Below, we describe
of the apo structures. three methods that are similar to the approach presented in this report. Each of these
Proteins 2008; 70:1506–1517. methods makes predictions based on the receptor structure only, with the addition of
V
C 2007 Wiley-Liss, Inc. sequence conservation information in one case.
The PocketFinder algorithm23 calculates a Lennard–Jones potential over a 1 Å grid,
Key words: functionalsitepre- clamps all unfavorable regions and smoothes the clamped potential grid, and then con-
diction; structural genomics; tours the grid to identify pockets. Finally, pockets of volumes smaller than 100 Å3 are
rational drug design; Auto- ignored. The method uses several tunable parameters, including a smoothing parameter
Dock.
and the contour level, which were derived from the analysis of a training step. The authors
*Correspondence to: David S. Goodsell, The Scripps Research Institute—MB5, 10550 N. Torrey Pines Road, La Jolla, CA 92037.
E-mail: goodsell@scripps.edu
Received 19 December 2006; Revised 9 May 2007; Accepted 30 May 2007
Published online 1 October 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21645
1506 PROTEINS V
C 2007 WILEY-LISS, INC.
Prediction of Ligand-Binding Sites
report that in 80.9% of their test cases, at least 50% of the optimal site for binding, and the optimal shape for the
ligand overlapped with the largest predicted pocket, and in ligand, given this number of atoms. This approach was
11.8%, the ligand overlapped the second largest pocket. presented in our previous report, using a less compre-
Q-SiteFinder24 uses the GRID method26 to evaluate hensive version of the algorithm.29
the van der Waals potential of a methyl probe. A thresh-
old is then applied to identify the best binding sites, and
a clustering method is used to identify sites in close spa- METHODS
tial proximity. The method has several tunable parame-
Overview
ters, including the threshold and a ‘‘connection range’’
for clustering. The program is very efficient, quickly pro- AutoLigand identifies a contiguous envelope of maxi-
ducing clusters of favorable energy points. The authors mal affinity for a given macromolecular structure, using
report that in 90% of proteins tested, the known ligand a grid-based representation of the binding affinity poten-
was identified in one of the top three predicted sites. tial. We use the term ‘‘envelope’’ to represent a contigu-
SURFNET-ConSurf25 is a two-stage method that first ous region in space filled with points that represent
identifies clefts and then trims them according to residue potential atomic centers for ligand atoms. In brief, the
conservation. The SURFNET code27 is a pure geometry user first calculates an affinity potential grid around the
method that identifies clefts by filling the space with protein, and then a flood-fill and a site-optimization pro-
spheres and then clusters the spheres within pockets. cess are used to identify the best contiguous envelope
Using SURFNET, the authors found that the most within the energy grid. As shown in Figure 1, the results
ligand-binding sites occur in the largest pocket. In the may be quite different than results from a simple con-
second part of the method, the spheres are trimmed touring method. In real ligands, the different atoms in
based on their distance from conserved residues scored the full molecule rarely hold positions of equal binding
from the ConSurf-HSSP database.28 By mixing a pure affinity—some will provide strong contributions through
geometry method with a genomic method, this program hydrogen bonds or salt bridges, and others are present
combines the strengths of both methods to generate bet- merely to act as bridges between these high-affinity frag-
ter results than these methods achieve individually. The ments. A simple contouring algorithm will identify the
method has many tunable parameters that define the locations of the high-affinity fragments, but not the loca-
placement of spheres and incorporation of genomic in- tion (or need for) of the bridging atoms. In addition,
formation. The authors report that in 75% of proteins very small changes in contour thresholds often have very
tested, the known ligands will show at least a 20% over- large effects on the shape of the enclosed space. AutoLi-
lap with one of the four largest pockets. gand is designed to address these problems: it finds the
We present a binding site prediction method, termed contiguous envelope that achieves the highest total affin-
AutoLigand, that has several advantages. It uses a full ity, wherein some of the individual points may be of
atomic representation, with atom types for carbon, lower affinity offset by points of higher affinity. Thus,
hydrogen, oxygen, nitrogen, and sulfur (and others, if AutoLigand will consistently find an envelope that enclo-
desired), to yield a chemically detailed prediction of the ses the desired number of atoms, with the best total
shape of a ligand within a predicted binding site. It is binding energy, whereas a contouring method cannot
designed to minimize the need for artificial geometry: ensure that a similar solution will be found.
the user chooses (1) the particular force field that will be
used, and (2) the number of atoms in the ligand. The
Affinity potential grids
method then searches the entire protein and finds the
contiguous envelope with the specified volume of atoms, Affinity potentials are precalculated by AutoGrid,30
which has the largest possible interaction energy with the using the default AutoGrid parameters in all cases. The
protein. It is our hypothesis that protein-binding sites protein is placed within a large grid, and a probe atom is
have evolved to create a region of maximal affinity for traced through the full grid, tabulating the interaction
the given size and shape of their ligand(s), and that energy at each point. The AutoGrid3 empirical free
by identifying the contiguous region of maximal-binding energy force field includes terms for dispersion/repulsion,
affinity, we will identify functional ligand-binding sites. hydrogen bonding, electrostatics, and desolvation. Affin-
We envision two ways to use the method. With a pro- ity potentials were created for six atom types: aliphatic
tein of unknown function, the user will test a wide range carbon, aromatic carbon, hydrogen, oxygen, nitrogen,
of ligand volumes, choosing the one that shows the best and sulfur, using a grid spacing of 1 Å and a smoothing
overall binding strength to identify the predicted ligand value of 0.5 Å (see Fig. 1). The parameters for nitrogen
that binds within the predicted binding site. This and oxygen are so similar that they were combined into
approach is presented in this manuscript. The method a single type for this analysis. Finally, the affinity grids
may also be used in rational drug design, starting with a were combined by choosing the atom type with strongest
desired size for the inhibitor, and then identifying the affinity at each point. Two grids were created for use by
Figure 1
HIV-1 protease (PDB file 1hps). The backbone of the HIV-1 protease is shown with a blue tube. (a) The inhibitor is shown with balls-and-sticks, with carbon gray,
nitrogen blue, oxygen red, and hydrogen light blue. (b) The affinity potential for carbon is shown with white surfaces, and the potential for oxygen is shown with red
surfaces. (c) An AutoLigand envelope with 200 points is shown as spheres with carbon gray, oxygen/nitrogen in magenta, sulfur in yellow, and hydrogen in light blue. All
molecular figures in this report were generated using the Python Molecular Viewer.34
AutoLigand: one holds the combined affinity values, and points is attained. This flood-fill envelope is necessarily
the other stores the atom type that gives the best affinity contiguous due to being constructed only from neighbor
at each point. points.
The grid spacing of 1 Å has the possibility of causing To speed up this process, we actually initiate flood fills
significant artifacts, given the steeply rising potential of at a reduced set of points. Every other point in each
the repulsive part of the interaction energy. The smooth- direction is used for small envelopes of less that 50
ing process, however, is designed to address this problem, points (i.e., stepping through the grid in 2 Å incre-
as described in our previous work.30 We tested this in ments), and every fourth point is used for larger enve-
AutoLigand by generating HIV-1 protease envelopes from lopes (stepping through the grid in 4 Å increments),
maps translated by 0.5 Å in each of the three coordinate yielding identical results as the exhaustive search, which
directions. In all cases, the maps were very similar in uses every point (data not shown). Because nearby start-
form and total interaction energies. ing points may coalesce to the same flood solution with
the same total energy, each new starting point is also
required to start outside of the existing flood fills. The 10
AutoLigand
best flood fill solutions are then used in the next step.
AutoLigand identifies optimal envelopes within the The second step starts with each of the flood-fill enve-
grid maps in three steps: (1) flood fill, (2) local migra- lopes, which may have some grid points in common, and
tion, and (3) a ray-casting neighborhood search. allows the envelope to migrate via an ‘‘affinotaxis’’ move-
The migration and ray-casting steps are performed in ment (analogous to the chemotaxis movement of an
iterative cycles until a stable envelope is generated. Sam- amoeba). The movement of the constant-volume enve-
ple envelopes from each step in the process are shown in lope occurs by adding the neighbor point with the best
Figure 2. energy/volume value and removing the worst point from
In the first step, AutoLigand moves to each point in the current envelope list.
the combined affinity map and does a simple flood fill But, before this point-exchange movement can take
around that point. Points are added one at a time to the place, the point to be removed must be checked to see if
envelope by searching all orthogonal neighboring points its removal would break the contiguity requirement with
and adding the one of best binding energy. This process any other points in the envelope. The check is performed
is repeated until the user-specified number of envelope in two passes. The first pass checks the point in the enve-
maximum user-defined length (a maximum length of 10 remove alternate locations and to have consistent naming
points was used here). Sample rays from one particular of atoms. Hydrogen atoms were then added automati-
point in an envelope are shown in Figure 2. The total cally using Babel, and then nonpolar hydrogen atoms
energy value for each ray for each length is ranked and if were merged.
the highest ranked ray is better than the lowest total The test set of 96 protein complexes with apo and
value for the same number of removable points within holo forms was taken from the work of Gunasekaran and
the envelope, the set of ray grid points is added to the Nussinov.31 Coordinates for the apo and holo structures
envelope and the worst envelope points are removed. were taken from the Protein Data Bank, superimposed
Iterative cycles of migration and ray extension are per- using all alpha carbon positions, and then treated simi-
formed until the envelope converges on a consistent low larly to the other test set.
energy solution. We use an intersection/union (I/U) ratio to quantify
the overlap between a predicted envelope and a known
ligand in ligand-receptor test systems. The intersection is
Points versus atomic volumes
the number of points within the envelope that are within
A challenging aspect of the method is the difference 1 Å of the ligand. The union is the total footprint of
between the number of points in the contiguous enve- both the envelope and the ligand, including both regions
lope, and the number of atoms that it represents. In that overlap and regions that do not. The value of I/U is
these affinity grids, each point in the grid represents a close to 1.0 when the envelope overlaps all the ligand
possible position for an atom center, and so the envelope without any extra fill exterior to the ligand. Envelopes
is a representation of the collection of atom centers, not that protrude into nonligand occupied spaces produce a
a representation of the space-filling volume of the mole- lower value.
cule. The difference is underscored by a simple example:
imagine a hypothetical envelope of four points on a 1 Å
grid. If these points are arranged in a square, the block-
Comparison of envelopes
shaped envelope may represent a diatomic molecule,
with atoms placed on two diagonal points—the remain- In applications of the AutoLigand method, it will be
ing two points are too close to represent real atoms in a useful to compare envelopes from different proteins, for
small molecule. If, however, the four points are arranged instance, to assign the similar function to proteins with
in a line-shaped envelope, we can imagine a molecule similar ligand-binding envelopes or to explore the effect
composed of three atoms in a line that will fit within the of protein motion on the local characteristics of the
envelope represented by predicted points. Thus, the shape binding site. To compare the envelopes generated from
of the predicted set of points will affect the actual num- different receptor proteins, we have used a modified ver-
ber of atoms that it represents: blocky sets of points rep- sion of AutoDock to overlap two envelopes. AutoDock
resent fewer atoms than the linear sets of points. uses a grid-based technique for the fast evaluation of
We have solved this problem by calculating an atomic interaction energies during a docking simulation. In our
volume for each set of points. Atomic volumes are esti- modified version of AutoDock, we have developed a dif-
mated using the grid representation, by counting the ferent set of potential grids that describe the overlap of
number of points within a 1 Å radius to the points in two sets of points.
the envelope. However, it is conceptually difficult to do The overlap is performed in two steps. First, the set of
the search and optimization of the ligand envelope keep- points in one envelope is chosen as the ‘‘target’’ envelope,
ing the atomic volume of the envelope constant—as the embedded in a grid, and an overlap potential is created
shape of the envelope changes during the search—a given on that grid. The potential has a value of 21.0 for dis-
volume may correspond to different numbers of points tances of 0–1.0 Å from the nearest point, and values
depending on how extended or blocky it is. Instead, we from 21.0 to 0.0 at distances between 1.0 and 2.0 Å
perform the search using a constant number of points, from the nearest point. (Negative values are used for easy
but optimize the envelope based on the value of total compatibility with the existing free energy evaluation
energy/atomic volume. used in the AutoDock code). A separate overlap grid
is calculated for each atom type represented in the
Atomic coordinates and overlap calculation
envelope.
The second envelope is then overlapped on the target
The test set of 187 protein complexes was taken from envelope using the docking methods in AutoDock. If
the reported calibration of the AutoDock energy func- there is a perfect overlap, each point in the envelope will
tion.30 In that study, coordinates were obtained from the overlap a portion of the overlap grid with a value of 21,
Protein Data Bank (http://www.pdb.org). These com- so that the total docking score will be equal to the num-
plexes were checked and corrected if necessary for the ber of points in the envelope. For less perfect overlaps,
proper biological unit, and files were regularized to the score is a direct measure of the intersection volume.
Figure 4
AutoLigand analysis is done using the volume of the envelope. (a) The envelope for L-arabinose-binding protein is calculated by optimizing the value of total energy/
volume. As in Figure 3, the data points are colored by the I/U values of each envelope. Insets show three sample envelopes, colored by predicted atom type as in Figure
1(c). The ligand, L-arabinose, is shown with balls-and-sticks. (b) A similar analysis of HIV-1 protease (PDB entry 1hps). Insets show three sample envelopes.
Figure 10
Figure 8
Overlap of HIV protease envelopes. (a) A sample overlap of a 300-point 1aaq
AutoLigand analysis of holo ligand-receptor complexes and corresponding apo envelope onto a 300-point 1hvk envelope. (b) Composite of fifteen 300-point
structures. Figures are generated similarly to Figure 5(a), with each point HIV-1 protease envelopes, colored by predicted atom type as in Figure 1(a). The
colored by the I/U value of the envelope. (a) Holo enzyme structures separated larger spheres show regions of the overlapped volumes that are most similar
into three classes based on the amount of conformational change induced by between the different envelopes, and the smallest spheres show regions that are
ligand binding. In each class, the smallest ligands are at the bottom and the found in only 1 of 15.
largest are at the top. (b) Corresponding apo enzyme structures.
Figure 9 demonstrates the types of problems that may the ligand binds. Thus, the AutoLigand envelope from
be encountered with flexible proteins that undergo con- the holo structure fully overlaps the ligand position, but
formational change upon ligand binding. In retinol-bind- the half of the binding site is filled with the phenylala-
ing protein, a phenylalanine sidechain is displaced when nine sidechain in the apo structure, so that the envelope
Figure 9
Comparison of envelopes generated for the structures of human retinol-binding protein. (a) Holo enzyme form bound to retinol (PDB entry 1brp). (b) Apo form of the
enzyme (PDB entry 1brq). Retinol is shown in magenta and a particularly mobile phenylalanine is shown in green.
Overlapping envelopes
Table II
PDB Accession Codes for Holo-Apo Protein Structures Our test set included 15 different HIV-1 protease
complexes with various ligands. The 300-point enve-
Holo Apo Holo Apo Holo Apo Holo Apo
Class I
lopes for each of these 15 systems were overlapped using
154l 153l 1bk9 1psj 1dmy 1dmx 1nft 1tfa a modified method in AutoDock3, as described in the
1a0t 1a0s 1bm7 1bmz 1duc 1dun 1pnf 1png methods. Table III shows the percentage of overlap for
1a8u 1a7u 1br6 1rtc 1dud 1dup 1pnl 1pnk each of these comparisons. The average overlap between
1afa 1afd 1bso 2blg 1eus 1eur 1rca 1aqp
1aha 1ahc 1bxq 3app 1gmp 1gmq 1tal 2alp
the different envelopes was found to be 92.6%. This
1aqm 1aqh 1byq 1yer 1hor 1dea 1vps 1vpn result indicates that the method is capable of finding
1arm 1yme 1com 2chs 1hvq 2hvm 1xzb 1xza the ligand site consistently for different crystal struc-
1awb 2hhm 1dcp 1dco 1icm 1ifb 2enb 1ena tures of the same receptor. It also implies that the
1b5d 1b49 1did 1xla 1kev 1ped 4tim 1ag1
1bj0 1bjz 1dil 2sil 1log 1loe 5enl 3enl various ligands only induce differences of about 7.4% in
the highest affinity zone for the HIV-1 protease recep-
Class II tor. Figure 10(a) shows the overlap of two sample enve-
1a26 2paw 1epb 1epa 1jul 1igs 1ra1 5dfr lopes with a moderate overlap value. Figure 10(b) shows
1adl 1alb 1fga 2fgf 1kel 1kem 1ses 1sry
1ai2 3icd 1fkl 1fkk 1ltt 1lts 1ubw 1ubv a composite of all 15 HIV-1 protease envelopes. The
1aj0 1ajz 1fut 1fus 1mjl 1mjk 1xva 1bhj larger spheres depict points of higher occupancy com-
1alw 1alv 1gca 1gcg 1mzm 1mzl 2gal 1bkz mon in most of the envelopes and the single dots show
1aq7 2ptn 1gd1 2gd1 1nhk 2nck 2izf 2izd points filled by only a single result. The distinctive cross
1beh 1bd9 1igb 1amp 1pgn 2pgd 3pca 2pcd
1ben 1trz 1jdc 1jda 1ptr 1ptq 6cha 4cha shape of typical protease inhibitors shows up, with the
1ddt 1sgk 1jef 135l 1qpq 1qpo P1 and P10 positions extending up and down from the
center and the P2 and P20 positions extending to left
Class III and right. The strong preference for a hydrogen-bond
1a9p 1a9o 1cen 1ceo 1mpj 3ins 3gal 1bkz
1aj7 2rcs 1ebg 1ebh 1o0r 1fgx 4fua 1fua acceptor at the center is also seen to be a common de-
1ake 4ake 1hex 1xaa 1swd 1swa 7tim 1ypi terminant of binding across these enzyme structures.
1anf 1omp 1hii 1hsi 1vrt 1rtj This hydrogen-bonding position is occupied by a con-
1brp 1brq 1hnl 1lz4 2bgu 2bgt
served water molecule in most structures, which is dis-
1byb 1bya 1lca 4tms 2cht 2chs
placed by an oxygen atom in cyclic urea inhibitors. We
are currently exploring the use of this technique to
compare envelopes from different mutants of the same
receptor to characterize the features of the binding site
is shifted, overlapping only part of the known ligand that change during the development of drug resistance
position. In these cases, a dynamic model is needed to and compare the binding sites of different proteins that
explore the range of possible conformations of sidechains bind to the same ligand to characterize the determinants
to make a fully correct prediction. of ligand specificity.
Table III
Percentage Overlap for Aligned HIV-1 Protease Envelopes
1aaq 1hbv 1hih 1hps 1hpx 1htf 1htg 1hvi 1hvj 1hvk 1hvl 1hvr 4phv 5hvp 9hvp
1aaq 95.8 93.0 97.8 98.3 95.4 95.9 94.4 96.9 94.6 95.7 94.1 96.8 97.2 96.7
1hbv 91.3 87.1 94.3 90.4 91.4 89.4 86.8 92.0 91.1 93.6 94.0 89.4 91.1 92.1
1hih 86.5 87.2 89.3 90.8 75.4 81.3 90.6 90.5 91.0 91.0 82.8 86.5 80.3 86.6
1hps 93.4 93.8 89.3 93.6 84.1 92.2 92.1 92.8 92.1 95.0 91.0 92.5 92.1 93.1
1hpx 95.4 92.8 94.0 96.5 93.1 94.8 95.2 98.0 98.4 98.1 90.1 95.3 95.8 92.7
1htf 91.4 91.0 88.3 93.9 92.9 97.8 96.3 87.7 94.9 95.0 89.3 94.0 94.5 90.7
1htg 92.8 86.1 83.3 96.3 96.1 92.3 92.1 91.2 87.0 94.8 87.4 95.7 95.3 91.0
1hvi 82.2 82.4 92.1 88.7 93.6 84.2 88.2 94.1 94.4 95.1 79.0 90.2 88.2 84.5
1hvj 91.2 87.1 88.4 91.1 94.7 86.5 92.1 93.7 98.3 98.3 84.0 91.2 87.9 89.0
1hvk 88.0 84.2 89.1 88.7 95.8 85.6 87.5 95.8 96.4 99.3 84.5 84.3 76.1 79.6
1hvl 88.2 89.2 90.4 93.3 97.1 85.0 92.8 96.2 97.8 99.4 85.6 95.5 92.8 91.2
1hvr 91.8 97.3 87.8 94.8 94.3 94.6 94.9 88.3 90.7 89.2 92.5 94.2 92.1 94.6
4phv 93.2 96.1 91.8 97.4 97.1 95.1 98.1 83.5 96.7 96.2 98.2 89.1 99.0 94.6
5hvp 95.6 93.4 93.1 94.7 98.7 94.0 98.9 95.6 92.2 97.1 82.6 83.3 98.3 89.6
9hvp 97.3 97.4 91.7 98.4 97.9 91.3 95.7 94.4 96.8 96.6 96.2 96.8 96.9 96.7
Ave 91.3 91.0 90.0 93.9 95.1 89.1 92.8 92.5 93.8 94.3 94.7 87.9 92.9 91.4 90.4
SD 4.0 4.8 2.9 3.2 2.7 5.8 4.9 3.9 3.3 3.7 4.2 5.1 4.1 6.5 4.5