(PDF Download) Algorithms and Methods in Structural Bioinformatics Nurit Haspel Filip Jagodzinski Kevin Molloy Eds Fulll Chapter
(PDF Download) Algorithms and Methods in Structural Bioinformatics Nurit Haspel Filip Jagodzinski Kevin Molloy Eds Fulll Chapter
(PDF Download) Algorithms and Methods in Structural Bioinformatics Nurit Haspel Filip Jagodzinski Kevin Molloy Eds Fulll Chapter
com
https://ebookmeta.com/product/algorithms-and-
methods-in-structural-bioinformatics-nurit-haspel-
filip-jagodzinski-kevin-molloy-eds/
OR CLICK BUTTON
DOWLOAD EBOOK
https://ebookmeta.com/product/algorithms-in-bioinformatics-
theory-and-implementation-1st-edition-gagniuc-paul-a/
https://ebookmeta.com/product/international-handbook-of-
structural-fire-engineering-kevin-lamalva/
https://ebookmeta.com/product/algorithms-4th-edition-robert-
sedgewick-kevin-wayne/
https://ebookmeta.com/product/advances-in-protein-molecular-and-
structural-biology-methods-1st-edition-timir-tripathi-editor/
https://ebookmeta.com/product/bioinformatics-and-medical-
applications-big-data-using-deep-learning-algorithms-1st-edition-
a-suresh-editor/
https://ebookmeta.com/product/matrix-numerical-and-optimization-
methods-in-science-and-engineering-1st-edition-kevin-w-cassel/
https://ebookmeta.com/product/translational-bioinformatics-and-
systems-biology-methods-for-personalized-medicine-1st-edition-
qing-yan/
https://ebookmeta.com/product/expression-purification-and-
structural-biology-of-membrane-proteins-methods-in-molecular-
biology-2127-camilo-perez-editor/
Computational Biology
Nurit Haspel
Filip Jagodzinski
Kevin Molloy Editors
Algorithms
and Methods
in Structural
Bioinformatics
Computational Biology
Advisory Editors
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joseph Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken,
Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Lonnie Welch, Ohio University, Athens, OH, USA
Editor-in-Chief
Andreas Dress, CAS-MPG Partner Institute for Computational Biology, Shanghai,
China
Michal Linial, Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya, Princeton University, Princeton, NJ, USA
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany
Kevin Molloy
ISAT/CS Building Room 216
James Madison University
Harrisonburg, VA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
v
vi Preface
vii
viii Contents
1 Introduction
N. Mishra
Department of Chemistry and Biochemistry, California State University, Los Angeles,
Los Angeles, CA, USA
e-mail: nmishra2@calstatela.edu
N. Forouzesh ()
Department of Computer Science, California State University, Los Angeles, Los Angeles,
CA, USA
e-mail: neginf@calstatela.edu
and vibrational d.o.f (specifying the relative positions of various nuclei) [14]. With
higher d.o.f., the atoms are more free to move around and therefore, the disorder
increases. More disordered systems are favored, but as we will see later in this
section, the favorability of a reaction is a compromise between enthalpy and entropy.
Given the two factors that contribute to energetic favorability, Gibbs free energy,
G, is known as an essential thermodynamic quantity that describes the spontaneity
of a reaction under isobaric and isothermal conditions [14]. The change in G is
defined as:
G = H − T S, (1)
where T is the temperature of the system in kelvin and H and S refer to the
enthalpy and entropy, respectively. Processes are spontaneous when G is negative
and non-spontaneous when G is positive. This equation exemplifies the “toss-up”
between enthalpy and entropy in determining the favorability of a reaction: the aim
is to find a balance between decreasing energy and increasing disorder to minimize
G [13, 14].
Thermodynamic factors are important to consider when determining G, but
kinetic factors play an important role as well. In the kinetic context, protein-ligand
binding is dependent on the concentrations of the protein and the ligand (denoted
as [P ] and [L], respectively), as this determines the speed with which protein-
ligand complexes can form (the concentration of the complex is denoted as [P L]).
Michaelis-Menten kinetics [13] introduces the equilibrium association constant, Ka ,
which relates to how quickly the protein-ligand complex can form. This is inversely
proportional to Kd , which relates to how quickly a protein and ligand unattach from
one another.
[P L] 1
Ka = = (2)
[P ][L] Kd
Higher Ka values and lower Kd values are associated with higher binding affinity
and therefore a lower G◦ , which is the standard free energy. Binding free energy
(BFE), G, is the change in free energy when a protein and ligand bind to one
another. It is defined as
where R is the gas constant and Q is the reaction quotient, which is equal to Ka
at equilibrium. The extent of the protein-ligand interaction is determined by how
negative the change in free energy is upon binding [7].
4 N. Mishra and N. Forouzesh
There are several methods by which binding free energy can be calculated based
on computational analysis of the protein and the ligand, including three principle
methods: alchemical methods, transition path sampling, and end-point methods.
Each of these techniques and recent advancements will be discussed in greater detail
below.
Free energy is a state function, meaning that the G value does not depend on
the path taken to reach the starting conformation to the bound conformation [15].
Alchemical methods take advantage of this fact and introduce “dummy” molecules,
or other chemical species, to bridge the high-probability region between the
unbound and bound states of the protein and the ligand [7, 16]. The use of these
chemical species are considered non-physical (alchemical) intermediate states,
and their use allows for a robust computational method that can calculate large
G values [16]. The general idea of alchemical free energy calculations can be
summarized by a thermodynamic cycle, shown in Fig. 1. An important parameter in
alchemical free energy calculations is the alchemical reaction coordinate, λ. It helps
define the thermodynamic path connecting the two states and shifts from 0 to 1 as
the coordinate moves.
In the alchemical category, there are two primary methodologies used for free
energy calculations; the first is free energy perturbation (FEP). This method relates
the relative free energy between the unbound and bound protein-ligand complex
to an average function of their energy [17]. Potential energies based on FEP
theory can be averaged over ensembles generated via molecular dynamics (MD)
or Monte Carlo (MC) simulations [17–19]. FEP simulations drive the perturbation
of one molecule into another by using alchemical intermediate states, also known
as windows, that lead one state into the other [7, 19]. Once the calculations are
complete, the convergence free energy results of each of the windows is estimated.
One caveat with FEP is that the convergence is only straightforward when U (the
change in potential) between the initial and final states is small.
The second method is known as thermodynamic integration (TI). It calculates the
change in free energy as follows:
λ=1 δH
G = dλ (4)
λ=0 δλ λ
As shown in Fig. 1, the G is the difference between the G of the bound and
unbound states calculated from Eq. 4. The important characteristics of TI are that it
relies on the average force exerted in a system undergoing an alchemical transition,
and integrating it along the alchemical coordinate. This method is often considered
more straightforward than other alchemical approaches to binding free energy [20].
There have been a few techniques developed to try and integrate TI into simulations.
One method is known as slow-growth TI, which progresses through transitions
in an extremely gradual manner and keeps the system very close to equilibrium.
The work done through the transition is equivalent to the free energy change [21].
Another method is discrete TI, which functions very similarly to
FEP
and divides the
thermodynamic path into separate windows. A calculation for δH δλ is done, followed
by the integration necessary to get the value [20, 22, 23].
methods pull the ligand from the binding site back into the solvent. There are also
non-equilibrium methods based on the Jarzynski relationship that can be used to
obtain the BFE [27, 28].
While TPS has many advantages, especially regarding its speed and ability
to analyze rare dynamical events that ordinary MD simulations cannot, there are
disadvantages that also coincide with this technique. One of the primary issues
is the delicacy of defining the stable states. If states are not defined precisely,
conformations can get “stuck” in a specific transition state or in an intermediate
state, which leads to insufficient sampling of the conformations and transition
pathways [7, 26]. Additionally, the unbiased nature of TPS makes it powerful,
however, it is also more difficult to implement because of the bookkeeping necessary
to keep track of both the forward and backward time integrations employed in
this method [26]. There have been some techniques based off of TPS which have
enhanced sampling and led to more accurate and faster PMF calculations. For
example, temperature acceleration method uses high temperatures to accelerate
calculations [29]. Other enhanced sampling techniques include metadynamics [30],
blue-moon sampling [31], umbrella sampling [32], and replica exchange [33].
End-point methods are the final and most efficient G calculation technique
discussed in this chapter. The low expense of these methods relies on the fact that
they sample only the final states of a system (unlike alchemical and path sampling,
which sample non-physical and physical intermediates, respectively) [34]. End-
point methods aim to serve as an intermediate between the speed of scoring
approaches and the accuracy of computationally expensive alchemical and path
sampling methods [35, 36]. Molecular mechanics Poisson–Boltzmann surface area
(MM/PBSA) and molecular mechanics generalized Born surface area (MM/GBSA)
are of the most commonly used end-point methods and are closely related [37]. The
basis of the MM/PB(GB)SA approach relies on some of the ideas discussed in the
introduction to this chapter. The change in binding free energy is:
G = H − T S = GP L − GP − GL (5)
Where P refers to the protein, L to the ligand, and P L to the complex. This equation
can be broken down into more specific contributions based on different interactions
with the protein-ligand system [7, 36, 38]. Specifically, the enthalpic contribution
is the sum of the changes in gas-phase molecular mechanics and the solvation free
energy:
EMM is the gas-phase energy of the solute and consists of the changes in internal
energy (i.e., the changes from bond, angle and dihedral energies), van der Waals
energy (such as the formation and breaking of hydrogen bonds), and the change in
electrostatic energy [36, 38, 39]. Gsolv is the solvation free energy which consists
of both polar and nonpolar components. The polar component is calculated via
PB (or GB) [40–44] and the nonpolar component is calculated using the solvent-
accessible surface area [45, 46]. The −T S term refers to the absolute temperature
multiplied by the entropy, estimated from normal mode analysis (NMA). This
method is quite computationally expensive and is often disregarded in PB/GB
calculations. Recently, a few computational and theoretical techniques have been
introduced to overcome the intractability of NMA calculations, such as system
truncation [39], consideration of distance-dependent dielectric constant [47], and
reduction in translational and rotational freedom of the ligand upon protein-ligand
binding [48].
MD simulations can be employed for MM/PB(GB)SA calculations in one of
three ways [49]. The first way, the separate-trajectory method (3A), uses three
different trajectories to analyze the protein, ligand, and bound complex separately.
Another method is known as the single trajectory (1A) method and uses one
trajectory to generate the ensemble of snapshots [36, 38, 39]. The performance of
the separate versus the single method was found to vary between systems and test
models, but the separate trajectory approach has been shown to have greater standard
errors and uncertainties, leading the 1A method to be generally deemed as more
accurate [50–52]. Another benefit of the single trajectory method is that the change
in bond energy is no longer relevant to EMM as it cancels out [49]. One of the
downfalls with the single trajectory approach is that it disregards structural changes
to the ligand upon binding to the protein/receptor that are potentially important in
getting accurate characterization of the protein-ligand interaction [7, 53]. To remedy
this, there have been suggestions of a “compromise”—the 2A approach, which
employs trajectories to sample both the complex as well as the ligand in order
to account for energetic changes due to restructuring the ligand [49, 53]. The 2A
method has been found to improve results of binding affinity prediction [54].
The computational protocol for MM/PB(GB)SA can be generally described as
follows: first, an explicit solvent model is used to perform an MD simulation on the
protein-ligand complex to get the free energy contributions from the protein, ligand,
and complex and produce the necessary ensemble of snapshots. Then, the solvent
molecules and charged ions are deleted from each snapshot to prepare for the use
of the implicit solvent model [55], and then the MM/PB(GB)SA method is used to
analyze and compute the total Gsolv for each snapshot. The overall Gsolv is the
sum of the individual contributions [7, 34, 39]. This thermodynamic cycle has been
shown in Fig. 2. The use of an explicit solvent model in the MD simulation followed
by the implicit solvent model for calculation might seem inconsistent since different
energy functions are used, but studies have actually found that using an implicit
solvent model for the MD simulation as well has led to less accurate results [56].
The unreliable double implicit model has sometimes led to a dissociation of the
protein or ligand, rendering it less functional than an explicit simulation followed
by an implicit calculation [57].
8 N. Mishra and N. Forouzesh
The final popular database is Binding Mother of All Databases, better known as
Binding MOAD [66]. Similar to PDBBind, Binding MOAD has also generated sets
based on high-quality data from the PDB. Binding MOAD finds proteins with x-
ray crystal structures, and the subset is said to fall between PDBbind’s general and
refined sets [66, 67]. Binding MOAD provides a myriad of information and data.
The webpage for each protein shows ligand information (both valid and invalid
reports), any available binding data, the chemical structures of the ligands, and
proteins within the 90%, 70%, and 50% homologies. These tools are all extremely
useful because they allow for an easy way to search for similar structures [66, 67].
As of the 2019 release, there are 38,702 protein-ligand structures, 14,324 instances
of binding data, 18,939 ligands, and 10,500 protein families. Overall, Binding
MOAD is one of the most robust protein-ligand databases available, providing large
amounts of information across many types of complexes.
4 Molecular Docking
have been developed using energy functions to pull out the putative binding poses
between a protein and ligand [10]. There are many algorithms by which these poses
can be found, and one goal of algorithms currently being developed is to expedite
the search process [10, 75].
The other part of the docking process is the scoring phase, wherein the various
poses are ranked from most to least likely. The scoring function is used to assess the
binding affinity between the protein and ligand after docking; ideally, the scoring
function should predict both the binding free energy as well as allow for high-
throughput screening of various drugs [7]. Different scoring functions have been
developed for docking purposes, and they can be broken down into three principal
groups: empirically-based functions, knowledge-based functions, and force-field
based functions.
Empirically-based scoring functions are perhaps the most simplistic of those
listed above. This model involves getting a descriptor for the binding, a training
set of a vast number and types of protein-ligand complexes with the experimental
binding affinities, and a regression, classification, or machine learning algorithm
to help form a relationship between the descriptor(s) and the experimental affinity
values [76, 77]. Empirical scoring functions rely on many different energy terms
to determine favorability; these types of functions are more simplistic in that they
do not consider the underlying physical interactions of the protein and ligand. Some
areas of research within empirically-based functions include finding accurate energy
terms to incorporate into the function as well as avoiding overfitting energy terms,
which is an issue that stems from the large number of energy terms used in empirical
functions [78].
Knowledge-based scoring functions are statistically-based functions which come
from the assumption that interactions between the protein and ligand that occur
more frequently than expected by random chance contribute more favorably to the
binding affinity [79]. These types of scoring functions also rely on training sets
to find the statistical potentials of various interactions. One of the biggest issues
with knowledge-based scoring functions is their computational implementation,
however, they serve as a “happy medium” between empirical and force-field based
scoring functions due to their relative swiftness and simplicity, as well as their ability
to avoid some of the challenges posed by empirical functions [80].
Force-field based approaches account for the underlying physical interactions
occurring between binding partners. As mentioned previously, entropic calculations
are computationally expensive; thus, they are often ignored. However, it is important
to acknowledge that entropic effects can play a major role in the binding affinity,
generating a need for a more efficient method to find the entropic contribution.
Force-field based methods are of particular relevance because they can employ
the implicit solvent models within PBSA and GBSA to account for the physical
interactions of the solvent, improving the accuracy with which the binding affinity
can be predicted [81, 82]. One of the benefits of using MM/PB(GB)SA is that such
methods can explore structure-activity relationships within ligands as well as their
selectivity profiles, and are less computationally demanding than FEP techniques.
Additionally, computation time can be decreased by modifying MM/PB(GB)SA to
Protein-Ligand Binding with Applications in Molecular Docking 11
5 Conclusion
Acknowledgments This works has been partially supported by a California State University
Program for Education and Research in Biotechnology (CSUPERB) New Investigator funding
awarded to N.F.
References
1. W. L. Jorgensen, “The many roles of computation in drug discovery,” Science, vol. 303,
no. 5665, pp. 1813–1818, 2004.
12 N. Mishra and N. Forouzesh
2. H.-J. Woo and B. Roux, “Calculation of absolute protein–ligand binding free energy from
computer simulations,” Proceedings of the National Academy of Sciences, vol. 102, no. 19,
pp. 6825–6830, 2005.
3. M. M. Pierce, C. Raman, and B. T. Nall, “Isothermal titration calorimetry of protein–protein
interactions,” Methods, vol. 19, no. 2, pp. 213–221, 1999.
4. R. Karlsson and A. Fält, “Experimental design for kinetic analysis of protein-protein inter-
actions with surface plasmon resonance biosensors,” Journal of Immunological Methods,
vol. 200, no. 1-2, pp. 121–133, 1997.
5. A. M. Rossi and C. W. Taylor, “Analysis of protein-ligand interactions by fluorescence
polarization,” Nature Protocols, vol. 6, no. 3, pp. 365–387, 2011.
6. M. Jerabek-Willemsen, T. André, R. Wanner, H. M. Roth, S. Duhr, P. Baaske, and D. Breit-
sprecher, “Microscale thermophoresis: Interaction analysis and beyond,” Journal of Molecular
Structure, vol. 1077, pp. 101–113, 2014.
7. X. Du, Y. Li, Y.-L. Xia, S.-M. Ai, J. Liang, P. Sang, X.-L. Ji, and S.-Q. Liu, “Insights
into protein–ligand interactions: mechanisms, models, and methods,” International Journal
of Molecular Sciences, vol. 17, no. 2, p. 144, 2016.
8. M. K. Gilson and H.-X. Zhou, “Calculation of protein-ligand binding affinities,” Annual Review
of Biophysics and Biomolecular Structure, vol. 36, pp. 21–42, 2007.
9. S. J. Y. Macalino, V. Gosu, S. Hong, and S. Choi, “Role of computer-aided drug design in
modern drug discovery,” Archives of Pharmacal Research, vol. 38, no. 9, pp. 1686–1701, 2015.
10. P. H. Torres, A. C. Sodero, P. Jofily, and F. P. Silva-Jr, “Key topics in molecular docking for
drug design,” International Journal of Molecular Sciences, vol. 20, no. 18, p. 4574, 2019.
11. J. Li, A. Fu, and L. Zhang, “An overview of scoring functions used for protein–ligand
interactions in molecular docking,” Interdisciplinary Sciences: Computational Life Sciences,
vol. 11, no. 2, pp. 320–328, 2019.
12. H. Li, Y. Xie, C. Liu, and S. Liu, “Physicochemical bases for protein folding, dynamics, and
protein-ligand binding,” Science China Life Sciences, vol. 57, no. 3, pp. 287–302, 2014.
13. R. Miesfeld and M. McEvoy, Biochemistry. W.W. Norton, 2017.
14. D. A. McQuarrie and J. D. Simon, Physical Chemistry: a Molecular Approach, vol. 1.
University science books Sausalito, CA, 1997.
15. J. D. Chodera, D. L. Mobley, M. R. Shirts, R. W. Dixon, K. Branson, and V. S. Pande,
“Alchemical free energy methods for drug discovery: progress and challenges,” Current
Opinion in Structural Biology, vol. 21, no. 2, pp. 150–160, 2011.
16. A. S. Mey, B. Allen, H. E. B. Macdonald, J. D. Chodera, M. Kuhn, J. Michel, D. L. Mobley,
L. N. Naden, S. Prasad, A. Rizzi, et al., “Best practices for alchemical free energy calculations,”
arXiv preprint arXiv:2008.03067, 2020.
17. W. L. Jorgensen and L. L. Thomas, “Perspective on free-energy perturbation calculations for
chemical equilibria,” Journal of Chemical Theory and Computation, vol. 4, no. 6, pp. 869–876,
2008.
18. Y. Meng, D. Sabri Dashti, and A. E. Roitberg, “Computing alchemical free energy differences
with Hamiltonian replica exchange molecular dynamics (H-REMD) simulations,” Journal of
Chemical Theory and Computation, vol. 7, no. 9, pp. 2721–2727, 2011.
19. W. Jespers, M. Esguerra, J. Åqvist, and H. Gutiérrez-de Terán, “Qligfep: an automated
workflow for small molecule free energy calculations in q,” Journal of Cheminformatics,
vol. 11, no. 1, pp. 1–16, 2019.
20. V. Gapsys, S. Michielssens, J. H. Peters, B. L. de Groot, and H. Leonov, “Calculation of binding
free energies,” in Molecular Modeling of Proteins, pp. 173–209, Springer, 2015.
21. M. J. Mitchell and J. A. McCammon, “Free energy difference calculations by thermodynamic
integration: difficulties in obtaining a precise value,” Journal of Computational Chemistry,
vol. 12, no. 2, pp. 271–275, 1991.
22. M. Jorge, N. M. Garrido, A. J. Queimada, I. G. Economou, and E. A. Macedo, “Effect of the
integration method on the accuracy and computational efficiency of free energy calculations
using thermodynamic integration,” Journal of Chemical Theory and Computation, vol. 6, no. 4,
pp. 1018–1027, 2010.
Protein-Ligand Binding with Applications in Molecular Docking 13
23. S. Bruckner and S. Boresch, “Efficiency of alchemical free energy simulations. I. A practical
comparison of the exponential formula, thermodynamic integration, and Bennett’s acceptance
ratio method,” Journal of Computational Chemistry, vol. 32, no. 7, pp. 1303–1319, 2011.
24. W. You, Z. Tang, and C.-e. A. Chang, “Potential mean force from umbrella sampling
simulations: What can we learn and what is missed?,” Journal of Chemical Theory and
Computation, vol. 15, no. 4, pp. 2433–2443, 2019.
25. S. Wan, A. P. Bhati, S. J. Zasada, and P. V. Coveney, “Rapid, accurate, precise and reproducible
ligand–protein binding free energy prediction,” Interface Focus, vol. 10, no. 6, p. 20200007,
2020.
26. P. Bolhuis and C. Dellago, “Practical and conceptual path sampling issues,” The European
Physical Journal Special Topics, vol. 224, no. 12, pp. 2409–2427, 2015.
27. C. Jarzynski, “Nonequilibrium equality for free energy differences,” Physical Review Letters,
vol. 78, no. 14, p. 2690, 1997.
28. C. F. Narambuena, D. M. Beltramo, and E. P. Leiva, “Polyelectrolyte adsorption on a charged
surface. free energy calculation from Monte Carlo simulations using Jarzynski equality,”
Macromolecules, vol. 41, no. 21, pp. 8267–8274, 2008.
29. L. Maragliano and E. Vanden-Eijnden, “A temperature accelerated method for sampling free
energy and determining reaction pathways in rare events simulations,” Chemical Physics
Letters, vol. 426, no. 1-3, pp. 168–175, 2006.
30. A. Laio and M. Parrinello, “Escaping free-energy minima,” Proceedings of the National
Academy of Sciences, vol. 99, no. 20, pp. 12562–12566, 2002.
31. G. Ciccotti, R. Kapral, and E. Vanden-Eijnden, “Blue moon sampling, vectorial reaction
coordinates, and unbiased constrained dynamics,” ChemPhysChem, vol. 6, no. 9, pp. 1809–
1814, 2005.
32. J.-F. St-Pierre, M. Karttunen, N. Mousseau, T. Rog, and A. Bunker, “Use of umbrella sampling
to calculate the entrance/exit pathway for z-pro-prolinal inhibitor in prolyl oligopeptidase,”
Journal of Chemical Theory and Computation, vol. 7, no. 6, pp. 1583–1594, 2011.
33. M. Fajer, D. Hamelberg, and J. A. McCammon, “Replica-exchange accelerated molecular
dynamics (REXAMD) applied to thermodynamic integration,” Journal of Chemical Theory
and Computation, vol. 4, no. 10, pp. 1565–1569, 2008.
34. E. Wang, H. Sun, J. Wang, Z. Wang, H. Liu, J. Z. Zhang, and T. Hou, “End-point binding
free energy calculation with MM/PBSA and MM/GBSA: strategies and applications in drug
design,” Chemical Reviews, vol. 119, no. 16, pp. 9478–9508, 2019.
35. E. A. Rifai, M. van Dijk, and D. P. Geerke, “Recent developments in linear interaction energy
based binding free energy calculations,” Frontiers in Molecular Biosciences, vol. 7, p. 114,
2020.
36. S. Genheden and U. Ryde, “Comparison of the efficiency of the lie and MM/GBSA methods
to calculate ligand-binding energies,” Journal of Chemical Theory and Computation, vol. 7,
no. 11, pp. 3768–3778, 2011.
37. H. Gohlke and D. A. Case, “Converging free energy estimates: Mm-pb (gb) sa studies on the
protein–protein complex ras–raf,” Journal of Computational Chemistry, vol. 25, no. 2, pp. 238–
250, 2004.
38. J. Srinivasan, T. E. Cheatham, P. Cieplak, P. A. Kollman, and D. A. Case, “Continuum solvent
studies of the stability of DNA, RNA, and phosphoramidate- DNA helices,” Journal of the
American Chemical Society, vol. 120, no. 37, pp. 9401–9409, 1998.
39. N. Forouzesh and N. Mishra, “An effective MM/GBSA protocol for absolute binding free
energy calculations: A case study on sars-cov-2 spike protein and the human ace2 receptor,”
Molecules, vol. 26, no. 8, p. 2383, 2021.
40. A. Onufriev, D. Bashford, and D. A. Case, “Modification of the generalized born model suitable
for macromolecules,” J. Phys. Chem. B, vol. 104, no. 15, pp. 3712–3720, 2000.
41. N. Forouzesh, S. Izadi, and A. V. Onufriev, “Grid-based surface generalized born model
for calculation of electrostatic binding free energies,” Journal of Chemical Information and
Modeling, vol. 57, no. 10, pp. 2505–2513, 2017.
14 N. Mishra and N. Forouzesh
61. M. K. Gilson, T. Liu, M. Baitaluk, G. Nicola, L. Hwang, and J. Chong, “BindingDB in 2015: a
public database for medicinal chemistry, computational chemistry and systems pharmacology,”
Nucleic Acids Research, vol. 44, no. D1, pp. D1045–D1053, 2016.
62. R. Wang, X. Fang, Y. Lu, and S. Wang, “The PDBbind database: Collection of binding affinities
for protein- ligand complexes with known three-dimensional structures,” Journal of Medicinal
Chemistry, vol. 47, no. 12, pp. 2977–2980, 2004.
63. R. Wang, X. Fang, Y. Lu, C.-Y. Yang, and S. Wang, “The PDBbind database: methodologies
and updates,” Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 4111–4119, 2005.
64. M. Su, Q. Yang, Y. Du, G. Feng, Z. Liu, Y. Li, and R. Wang, “Comparative assessment of
scoring functions: the casf-2016 update,” Journal of Chemical Information and Modeling,
vol. 59, no. 2, pp. 895–913, 2018.
65. Z. Liu, M. Su, L. Han, J. Liu, Q. Yang, Y. Li, and R. Wang, “Forging the basis for developing
protein–ligand interaction scoring functions,” Accounts of Chemical Research, vol. 50, no. 2,
pp. 302–309, 2017.
66. L. Hu, M. L. Benson, R. D. Smith, M. G. Lerner, and H. A. Carlson, “Binding moad (mother of
all databases),” Proteins: Structure, Function, and Bioinformatics, vol. 60, no. 3, pp. 333–340,
2005.
67. R. D. Smith, J. J. Clark, A. Ahmed, Z. J. Orban, J. B. Dunbar Jr, and H. A. Carlson, “Updates
to binding moad (mother of all databases): polypharmacology tools and their utility in drug
repurposing,” Journal of Molecular Biology, vol. 431, no. 13, pp. 2423–2433, 2019.
68. X. Zhang, H. Perez-Sanchez, and F. C Lightstone, “A comprehensive docking and MM/GBSA
rescoring study of ligand recognition upon binding antithrombin,” Current Topics in Medicinal
Chemistry, vol. 17, no. 14, pp. 1631–1639, 2017.
69. S.-Y. Huang and X. Zou, “Advances and challenges in protein-ligand docking,” International
Journal of Molecular Sciences, vol. 11, no. 8, pp. 3016–3034, 2010.
70. S. F Sousa, N. MFSA Cerqueira, P. A Fernandes, and M. Joao Ramos, “Virtual screening
in drug design and development,” Combinatorial Chemistry & High Throughput Screening,
vol. 13, no. 5, pp. 442–453, 2010.
71. R. G. Coleman, M. Carchia, T. Sterling, J. J. Irwin, and B. K. Shoichet, “Ligand pose and
orientational sampling in molecular docking,” PloS One, vol. 8, no. 10, p. e75992, 2013.
72. K. A. Johnson, “Role of induced fit in enzyme specificity: a molecular forward/reverse switch,”
Journal of Biological Chemistry, vol. 283, no. 39, pp. 26297–26301, 2008.
73. N. Forouzesh, M. R. Kazemi, and A. Mohades, “Structure-based analysis of protein binding
pockets using von Neumann entropy,” in International Symposium on Bioinformatics Research
and Applications, pp. 301–309, Springer, 2014.
74. S. F. Sousa, A. J. Ribeiro, J. Coimbra, R. Neves, S. Martins, N. Moorthy, P. Fernandes, and
M. Ramos, “Protein-ligand docking in the new millennium–a retrospective of 10 years in the
field,” Current Medicinal Chemistry, vol. 20, no. 18, pp. 2296–2314, 2013.
75. I. Halperin, B. Ma, H. Wolfson, and R. Nussinov, “Principles of docking: An overview
of search algorithms and a guide to scoring functions,” Proteins: Structure, Function, and
Bioinformatics, vol. 47, no. 4, pp. 409–443, 2002.
76. I. A. Guedes, F. S. Pereira, and L. E. Dardenne, “Empirical scoring functions for structure-
based virtual screening: applications, critical aspects, and challenges,” Frontiers in Pharma-
cology, vol. 9, p. 1089, 2018.
77. L. P. Pason and C. A. Sotriffer, “Empirical scoring functions for affinity prediction of protein-
ligand complexes,” Molecular Informatics, vol. 35, no. 11-12, pp. 541–548, 2016.
78. S.-Y. Huang, S. Z. Grinter, and X. Zou, “Scoring functions and their evaluation methods for
protein–ligand docking: recent advances and future directions,” Physical Chemistry Chemical
Physics, vol. 12, no. 40, pp. 12899–12908, 2010.
79. I. Muegge, “PMF scoring revisited,” Journal of Medicinal Chemistry, vol. 49, no. 20, pp. 5895–
5902, 2006.
80. S. Z. Grinter and X. Zou, “Challenges, applications, and recent advances of protein-ligand
docking in structure-based drug design,” Molecules, vol. 19, no. 7, pp. 10150–10176, 2014.
16 N. Mishra and N. Forouzesh
1 Introduction
the geometric and chemical aspects of the proteins being compared. Second,
the digital representation is paired with a comparison metric, which is used to
evaluate similarity between representations of multiple protein structures. Third,
a comparison algorithm is developed to identify the fairest comparison of two or
more proteins based on the metric. Finally, a statistical model is used to evaluate
the significance of the final measurement relative to baseline structural noise.
Enhancements in representations, metrics, comparison algorithms and statistical
models, while sometimes presented together as a singular method, represent
progress in the field towards the solution of problems in specificity annotation.
Due to their integrated nature, novel designs for some components, especially new
representations of protein structure, can represent major advancements to the field,
because they can permit new analytical capabilities, as we will discuss later in this
chapter.
2 Specificity Assignment
a) b) c)
{g,v,c}
{r,k}
{f,w,y}
{a,v,i,l}
Fig. 1 Examples of point-based motif designs. (a) Multiple amino acid labels describe variations
that occur in major evolutionary divergences. (b) Vectors at each motif point encode sidechain
orientation. (c) Cavity spheres denote regions of the binding cavity that must remain empty in the
matching structure, to accommodate the ligand
statistically significant false positives while maintaining 87% of true positives (some
true positives are also lost in the process) [19].
Beyond the enhancement of motifs with additional information, algorithms have
also been developed to better select the points in the motif, to achieve superior
prediction accuracy. By repeatedly comparing variations of potential motifs against
a nonredundant subset of the PDB, Geometric Sieving [22] identifies motifs that
exhibit greater sensitivity than if an arbitrary set of points near the binding site were
used for the motif. Composite motifs use multiple structures of the same binding
site atoms to generate averaged structures can exhibit greater similarity to proteins
with similar binding preferences [18]. These composite motifs can be significantly
extended to represent the same protein in different conformations and variations of
the same family of proteins with different binding specificities [8].
Spherical harmonics have also been developed as an alternative representation
of protein-ligand binding pockets [58]. These infinitely differentiable single-value
functions on two polar coordinates can represent some three dimensional regions,
as long as they fulfill the star-convexity property: A region r is star-convex if there
exists a point p0 ∈ r such that for every point p ∈ r, all points on the segment
p0 p are inside r. Whereas the comparison of point-based binding sites requires
many pairs of corresponding points to be found and aligned, spherical harmonics
can be compared by finding optimal rotational superpositions. Spherical harmonics
can also be used to represent ligands and other small molecules.
1 n
RMSD = d(mi , pi ),
n
1
where d(mi , pi ) is the Euclidean distance between two points. LRMSD is the
minimal value of RMSD when over all rigid transformations of the motif, which
can be rapidly determined in linear time using eigenvector [44, 82] or quaternion
[27] methods. Because of the minimal nature of LRMSD, measuring LRMSD yields
both a measurement, in Angstroms, and a rigid superposition of one structure onto
the other. Many methods refer to LRMSD as RMSD, because there is no value in
using RMSD as a comparison metric when it has not been minimized.
Many methods treat this core element of protein structure comparison as a black
box, focusing instead on the selection of atoms to align and on the algorithm for
rapidly finding atoms that yield low LRMSD alignments. Since LRMSD amounts
to measuring the minimum geometric mean of the distances between points, newer
methods have developed additional geometric criteria to exclude other biologically
irrelevant matches. For example, Cavity Aware Match Augmentation [19] uses
spheres to detect and eliminate proteins that do not maintain an empty binding cavity
within the motif. pvSOAR maps points around a pocket onto the unit sphere, and
measures their LRMSD to evaluate structural similarity [81].
An important constraint on LRMSD is the fact that it depends on bijections
between two equally sized pointsets. This constraint has serious implications for
specificity assignment in two ways: First, the structure of different amino acids
cannot be fully compared, because different amino acids have different numbers of
atoms. Many methods partially sidestep this problem by representing amino acids
with only a single point. Unfortunately, that approach partially ignores variations
in sidechain geometry, causing important differences in shape to be overlooked.
Second, depending on a bijection requires that any amino acid that occupies
different positions in two binding sites, thereby altering specificity, must also be
part of the motif, otherwise the difference may not be detected. As a result, effective
motif design, whether by expert design (e.g. [41]) or algorithmic refinement (e.g.
[7, 18, 20, 22]) is mandatory for practical specificity assignment.
biologically irrelevant. Even though they are biologically irrelevant, however, very
poor matches must be accounted for in any statistical models, to prevent algorithmic
bias [32].
Geometric Hashing Point-based motifs can be rapidly compared using Geometric
Hashing [33, 48], a technique that represents points in a rotationally and translation-
ally invariant vector. Vectors are generated for every triplet in the motif and stored
in a range search structure. When searching for matches for a given motif inside a
protein structure, the triplets of points in the structure are encoded into invariants and
compared against the invariants of the motif. Similar invariants represent a triplet of
corresponding points that can be used to generate initial alignments [10] or used in
voting systems to generate final correspondences of points between the protein and
the motif.
Depth First Search A second group of algorithms uses depth first search techniques
to systematically construct bijections between motif points and target points (Fig. 2)
[73]. Algorithms like Match Augmentation [10], and Labelhash [57], begin with
one or a handful of point-point correspondences. From this initial state, depth first
searches superpose the corresponding points and then identify additional correspon-
dences between motif points that are brought into proximity with acceptable target
points. Multiple potential correspondences can be detected for each motif point,
creating a branching nature in the depth first search as different possibilities are
explored. After exploring many possibilities, both geometric hashing and depth
first search matching algorithms return the match with the largest number of
corresponding points.
c1 d1 e)
a) b)
c2 d2
X
?
Fig. 2 Operation of depth first search comparison algorithms. (a) three points (black, white have
correspondences (thin black lines), and we seek to find acceptable correspondences for the next
motif point (grey). (b) Range search (circle) for points that may correspond (white) with the
next motif point. At this point, the depth first search first considers one possibility (c1–e), and
eventually considers the other (c2, d2). (c1) A tentative correspondence is considered with the
first point. (d1) LRMSD superposition between corresponding points is computed. (e) The final
distance between points in LRMSD superposition is acceptable, so we consider the next motif
point (grey). (c2) A tentative correspondence is considered with the second point. (d2) The final
distance between points in LRMSD superposition is not acceptable, because they get too far apart.
This set of correspondences is not used further, and we backtrack to other correspondences
24 Z. Guo and B. Y. Chen
Algorithms that compare points in space always require a data structure that enables
range search. A common structure that achieves this purpose is a three dimensional
kd-tree [4], a space partitioning data structure which is well documented and does
not need to be described further. Due to the fact that atoms can only pack into limited
densities, however, three dimensional cubic lattice structures, Lattice Hashes, can
also be very useful. Geometric comparisons generally search for atoms nearby a
given point, which amounts to a range search on a spherical range. Lattice Hashes
can support spherical range search very easily.
Construction To construct a Lattice Hash, we begin with a given set of points in
three dimensions and a resolution that defines the size of each cube in the lattice.
We first find the bounding box of the input points and determine the smallest lattice
of cubes, based on the resolution parameter, that fully contains the bounding box
(Fig. 3a). Rather than allocating each cube immediately, we refer to each cube using
a unique index:
where xpos, ypos, and zpos are the position of the cube, counting from the low-x,
low-y, low-z corner of the lattice. ydim and zdim are the number of cubes along
the y and z dimensions of the lattice. All cubes that contain points are stored in a
hash table based on this index. To insert each point into the lattice, we associate it
with the cube that contains it, allocating memory only for cubes that contain points
(Fig. 3b).
Spherical Range Search We begin with a point and a radius that define the spherical
range. To discover points in the lattice within the range, we first identify the set of
cubes that contain the bounding box of the range, and then identify individual cubes
a) b) c) d)
Fig. 3 Lattice Hash construction and operation. (a) The bounds of the Lattice Hash (edge of
lattice) surround the points to be represented (black circles) in uniform cubes. (b) Only occupied
cubes (black squares) are allocated on a hash table (second row). (c) Spherical range search (large
circle) centered at a point (white with heavy black outline). (d) Range search first identifies cubes
that intersect the range (black squares), then identifies occupied cubes (shaded squares) based on
their presence in the hash table
Explaining Specificity with Volumetric Representations 25
that overlap the sphere, using the distance of cube corners and the intersection of
cube segments (Fig. 3c). Finally, individual points are tested, and points within the
range are returned.
3 Component Localization
Many parts of a protein work together to achieve specificity. Finding these com-
ponents within the rest of the protein can be extremely difficult. However, once
the molecular mechanisms that implement specificity are found, they can provide a
crucial platform for applied research: Later studies could mutate these mechanisms
to examine how, for example, the protein could be reengineered to achieve novel
binding preferences for industrial applications, or how the protein might mutate
to achieve inhibitor resistance. Likewise, before the molecular mechanism of
specificity is known, predicting elements of protein structures that might control
specificity can narrow the space of possibilities that must be examined in order
to unravel how specificity is actually achieved. By finding individually influential
components, component localization is the new effort to use computational preci-
sion and speed to suggest hypotheses about binding mechanisms that might have
been overlooked by human investigators.
The approach that component localization algorithms take is to search for subtle
differences in very similar proteins that cause different binding preferences. This
approach differs fundamentally from that of specificity assignment algorithms,
which search for subtle similarities that necessitate similar binding preferences
in very different proteins. Second, there are many ways in which differences in
protein structure can cause differences in binding preferences, including differences
in steric hindrance at binding sites, differences in electric fields, and so on. Finding a
difference that necessitates differences in binding preferences requires a biophysical
rationale for different binding preferences, and it also requires the consideration
of many different kinds of components. While amino acids are one kind of
component that influences specificity, empty clefts and cavities also play a role in
accommodating binding partners, as do local patterns of molecular flexibility. Many
components of protein structure and their biophysical roles must be considered for
a comprehensive approach to component localization.
Component localization algorithms originated in sequence based methods that
analyze alignments of related sequences in different specificity groups, looking for
sequence positions that are conserved within groups and different between groups
Explaining Specificity with Volumetric Representations 27
[11, 12, 45, 61, 63, 65]. Amino acids with these properties are under evolutionary
pressure to support the different binding preferences of distinct groups. Generally,
these methods require sequences that are categorized into groups with different
binding preferences, though some exceptions exist [34, 62, 70]. This evolutionary
rationale offers one explanation for the role of each amino acid in specificity.
Evolutionary information can be mapped onto protein structures for component
localization. Evolutionary Trace Annotation [2] (ETA), is a specificity assignment
algorithm that uses Paired Distance matching [83] to identify binding sites with
specificity identical to that of motifs generated with the Evolutionary Trace [61].
Since ETA motifs are generated using the Evolutionary Trace, an algorithm that
performs sequence-based component localization, the amino acids that structurally
match the motif are also implied to influence specificity in the same way as
the amino acids in the motif sequences. Recently, this approach was verified
experimentally against carboxylesterases with different binding preferences [2].
Three dimensional solids can be defined with a closed surface as a boundary for a
solid region. To represent proteins, the molecular surface (also known as the solvent
excluded surface[49]) is a useful boundary, and practical surfaces, constructed from
thousands of triangles, are provided by many existing methods, including MSMS
[75], GRASP2 [64], VADAR [85], and others [26]. Geometric solids, like spheres,
cubes, and tetrahedra, can also define useful boundaries for solid representations.
Comparisons of these solid representations are not founded on an analysis of
surface geometry, but actual comparisons of the region occupied by the solid. This
kind of comparison is made possible through CSG operations (Fig. 4), also called
boolean operations. Solid comparisons of protein structures, therefore, are direct
28 Z. Guo and B. Y. Chen
Fig. 4 A demonstration of CSG operations union, intersection, and difference, illustrating the
borders of two input (dotted) and output regions (solid)
a b c
d e f
Fig. 5 The union operation computed with Marching Cubes. (a) The input solid regions: A and
B (gray). (b) The regions in the lattice. (c) Interior points (dark gray) and exterior points (light
gray). (d) The lines connecting one interior point and one exterior point. (e) The crossing points
that intersects the boundary of the union. (f) Output triangles (solid lines) for union approximation
a b c d
Fig. 6 Computing the volume of a given closed region. (a) Input region A and the its geometric
centroid (black dot). (b) One triangle and its normal vector. (c) Tetrahedra (triangles inclosed in
solid black lines) based on triangles (thick black lines) with normals (while arrows) facing away
from the centroid. (d) Tetrahedra (triangles inclosed in solid black lines) based on triangles (thick
black lines) with normals (black arrows) facing towards the centroid. The volume of the region is
the difference between volume sum in (c) and volume sum in (d)
Computing the volume within a closed region A is also a crucial aspect of VASP,
and we summarize it here: First, the centroid of the corners of all triangles is
determined. In A, the three points of each triangle and the centroid together define
a tetrahedron. Since the triangle either faces towards or away from the centroid,
by the Surveyor’s Formula [76], we can evaluate the volume within A by adding
the volumes of all tetrahedra that face away from the centroid, and subtracting
the volumes of all tetrahedra that face towards the centroid. The volume of each
tetrahedron can be accurately computed using Tartaglia’s Rule [5] (Fig. 6).
30 Z. Guo and B. Y. Chen
Variations in structure that cause differences in binding specificity are most likely
to occur at binding sites. To compare binding site geometry, solid representations
can be used to describe the shape of the empty region that accommodates ligands
using a series of CSG operations. As an instructive example, the following method
describes one simple way to represent a binding cavity. We begin with the whole
structure of one protein A and the ligand l that binds at its functional site. For
each atom in l, a sphere is generated with radius 5.0 Å, and the union of all the
spheres, Sl , are calculated with VASP using CSG operations. Sl defines the vicinity
of the ligand binding cavity. GRASP2 [64] can be used to compute the molecular
surface m(A) using the classic rolling probe method [71]. The molecular surface
is generated using a 1.4 Å probe. A second “envelope” surface, e(A), is generated
with 5.0 Å probe using the same algorithm. The binding cavity a is generated using
the following CSG expression:
— Ah, hyvät herrat, näin juuri niin kaunista unta. Uneksuin, että
kuningas korotti Lehdonrauhan maat markiisikunnaksi. Voi, se oli
ainoastaan unelma ja minä tiedän liiankin hyvin, että kuninkaan
tarkoitusperät ovat aivan päinvastaiset.
— Mennään eteenpäin, sanoi Pyhä-Sylvanus. On myöhä,
emmekä saa hukata aikaa.
Hieronymus.