Algorithms and Methods in Structural Bioinformatics (Computational Biology) (Nurit Haspel (Editor) Etc.)
Nurit Haspel
Filip Jagodzinski
Kevin Molloy Editors
Algorithms
and Methods
in Structural
Bioinformatics
Computational Biology
Advisory Editors
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joseph Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken,
Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Lonnie Welch, Ohio University, Athens, OH, USA
Editors-in-Chief
Andreas Dress, CAS-MPG Partner Institute for Computational Biology, Shanghai,
China
Michal Linial, Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya, Princeton University, Princeton, NJ, USA
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany
Kevin Molloy
ISAT/CS Building Room 216
James Madison University
Harrisonburg, VA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
1 Introduction
N. Mishra
Department of Chemistry and Biochemistry, California State University, Los Angeles,
Los Angeles, CA, USA
e-mail: nmishra2@calstatela.edu
N. Forouzesh ()
Department of Computer Science, California State University, Los Angeles, Los Angeles,
CA, USA
e-mail: neginf@calstatela.edu
and vibrational degrees of freedom (d.o.f., specifying the relative positions of the various nuclei) [14]. With more d.o.f., the atoms are freer to move around, and the disorder therefore increases. More disordered systems are entropically favored, but as we will see later in this section, the favorability of a reaction is a compromise between enthalpy and entropy.
Given the two factors that contribute to energetic favorability, the Gibbs free energy, G, is an essential thermodynamic quantity that describes the spontaneity of a reaction under isobaric and isothermal conditions [14]. The change in G is defined as:

ΔG = ΔH − TΔS, (1)

where T is the temperature of the system in kelvin and ΔH and ΔS refer to the changes in enthalpy and entropy, respectively. Processes are spontaneous when ΔG is negative and non-spontaneous when ΔG is positive. This equation exemplifies the “toss-up” between enthalpy and entropy in determining the favorability of a reaction: the aim is to find a balance between decreasing energy and increasing disorder to minimize ΔG [13, 14].
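The sign logic of Eq. (1) can be illustrated with a few lines of Python; the enthalpy and entropy values below are hypothetical, chosen purely for demonstration:

```python
# Sketch of Eq. (1); the numerical values are invented, not from the text.
def gibbs_free_energy_change(delta_H, T, delta_S):
    """Return delta_G (kJ/mol) from delta_H (kJ/mol), T (K), delta_S (kJ/(mol K))."""
    return delta_H - T * delta_S

# An exothermic (delta_H < 0), disorder-increasing (delta_S > 0) process:
dG = gibbs_free_energy_change(delta_H=-40.0, T=298.15, delta_S=0.05)
print(f"delta_G = {dG:.2f} kJ/mol; spontaneous: {dG < 0}")
```

For this combination both terms push ΔG negative, so the process is spontaneous at any positive temperature; when the enthalpic and entropic terms oppose each other, the temperature decides which one wins.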
Thermodynamic factors are important to consider when determining ΔG, but kinetic factors play an important role as well. In the kinetic context, protein-ligand binding depends on the concentrations of the protein and the ligand (denoted [P] and [L], respectively), as these determine the speed with which protein-ligand complexes can form (the concentration of the complex is denoted [PL]). Michaelis-Menten kinetics [13] introduces the equilibrium association constant, Ka, which relates to how quickly the protein-ligand complex can form. It is inversely proportional to the dissociation constant, Kd, which relates to how quickly the protein and ligand detach from one another.
Ka = [PL] / ([P][L]) = 1/Kd (2)
Higher Ka values and lower Kd values are associated with higher binding affinity and therefore a lower (more negative) standard free energy, ΔG°. Binding free energy (BFE), ΔG, is the change in free energy when a protein and ligand bind to one another. It is defined as

ΔG = ΔG° + RT ln Q, (3)

where R is the gas constant and Q is the reaction quotient, which is equal to Ka at equilibrium. The extent of the protein-ligand interaction is determined by how negative the change in free energy is upon binding [7].
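As a sketch of Eqs. (2) and (3), the snippet below computes Ka from hypothetical equilibrium concentrations and converts it to a standard binding free energy via the textbook relation ΔG° = −RT ln Ka (the special case of Eq. (3) at equilibrium, where ΔG = 0 and Q = Ka):

```python
import math

R = 8.314462618e-3  # gas constant, kJ/(mol K)

def Ka_from_concentrations(PL, P, L):
    """Eq. (2): association constant from equilibrium concentrations (M)."""
    return PL / (P * L)

def standard_binding_free_energy(Ka, T=298.15):
    """dG0 = -R*T*ln(Ka): higher Ka => more negative (more favorable) dG0."""
    return -R * T * math.log(Ka)

# Hypothetical equilibrium concentrations, chosen only for illustration.
Ka = Ka_from_concentrations(PL=1e-6, P=1e-8, L=1e-8)  # 1e10 1/M
Kd = 1.0 / Ka                                          # 1e-10 M
print(f"Ka = {Ka:.3g} 1/M, Kd = {Kd:.3g} M, "
      f"dG0 = {standard_binding_free_energy(Ka):.2f} kJ/mol")
```

The tighter the binder (higher Ka, lower Kd), the more negative the computed ΔG°, mirroring the statement above.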
4 N. Mishra and N. Forouzesh
There are several methods by which binding free energy can be calculated based on computational analysis of the protein and the ligand, including three principal classes: alchemical methods, transition path sampling, and end-point methods.
Each of these techniques and recent advancements will be discussed in greater detail
below.
Free energy is a state function, meaning that the ΔG value does not depend on the path taken from the unbound conformation to the bound conformation [15].
Alchemical methods take advantage of this fact and introduce “dummy” molecules,
or other chemical species, to bridge the high-probability region between the
unbound and bound states of the protein and the ligand [7, 16]. These chemical species are considered non-physical (alchemical) intermediate states, and their use allows for a robust computational method that can calculate large ΔG values [16]. The general idea of alchemical free energy calculations can be summarized by a thermodynamic cycle, shown in Fig. 1. An important parameter in alchemical free energy calculations is the alchemical reaction coordinate, λ, which defines the thermodynamic path connecting the two states and varies from 0 to 1 as the system is transformed from one state into the other.
In the alchemical category, there are two primary methodologies used for free
energy calculations; the first is free energy perturbation (FEP). This method relates
the relative free energy between the unbound and bound protein-ligand complex
to an average function of their energy [17]. Potential energies based on FEP
theory can be averaged over ensembles generated via molecular dynamics (MD)
or Monte Carlo (MC) simulations [17–19]. FEP simulations drive the perturbation
of one molecule into another by using alchemical intermediate states, also known
as windows, that lead one state into the other [7, 19]. Once the calculations are complete, the converged free energy results from each of the windows are combined to estimate the total change. One caveat with FEP is that convergence is only straightforward when ΔU (the change in potential energy) between the initial and final states is small.
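The Zwanzig exponential-averaging formula underlying FEP, ΔG = −kT ln⟨exp(−ΔU/kT)⟩, can be sketched for a single window as follows; the ΔU samples are synthetic Gaussian values, not output of a real MD or MC simulation:

```python
import numpy as np

kB = 0.0019872041  # Boltzmann constant, kcal/(mol K)

def fep_delta_G(delta_U, T=300.0):
    """Zwanzig estimator for one FEP window:
    dG = -kT * ln < exp(-dU / kT) >, averaged over the reference ensemble,
    with dU the potential-energy difference between neighboring states."""
    kT = kB * T
    dU = np.asarray(delta_U)
    return -kT * np.log(np.mean(np.exp(-dU / kT)))

# Synthetic dU samples (kcal/mol): small and narrowly distributed, as
# convergence of the exponential average requires.
rng = np.random.default_rng(0)
dU = rng.normal(loc=0.5, scale=0.2, size=10_000)
print(f"window dG = {fep_delta_G(dU):.3f} kcal/mol")
# The total dG is the sum of such window contributions along lambda.
```

Consistent with the caveat above, this estimator behaves well here only because the ΔU values are small; broad ΔU distributions make the exponential average converge very slowly.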
The second method is known as thermodynamic integration (TI). It calculates the
change in free energy as follows:

ΔG = ∫_{λ=0}^{λ=1} ⟨∂H/∂λ⟩_λ dλ (4)

As shown in Fig. 1, the binding ΔG is the difference between the ΔG values of the bound and unbound states calculated from Eq. 4. The defining characteristic of TI is that it relies on the average force exerted in a system undergoing an alchemical transition and integrates this average along the alchemical coordinate. This method is often considered
more straightforward than other alchemical approaches to binding free energy [20].
There have been a few techniques developed to integrate TI into simulations.
One method is known as slow-growth TI, which progresses through transitions
in an extremely gradual manner and keeps the system very close to equilibrium.
The work done through the transition is equivalent to the free energy change [21].
Another method is discrete TI, which functions very similarly to FEP and divides the thermodynamic path into separate windows. The ensemble average ⟨∂H/∂λ⟩ is calculated in each window, followed by the numerical integration necessary to obtain the total ΔG [20, 22, 23].
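A minimal sketch of this discrete-TI bookkeeping, assuming the per-window averages ⟨∂H/∂λ⟩ have already been obtained from simulation (the numbers here are invented):

```python
import numpy as np

def ti_delta_G(lambdas, dHdl_means):
    """Discrete thermodynamic integration (Eq. 4): trapezoidal integration of
    the per-window ensemble averages <dH/dlambda> over lambda in [0, 1]."""
    x = np.asarray(lambdas, dtype=float)
    y = np.asarray(dHdl_means, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])))

# Hypothetical window averages (kcal/mol) at five evenly spaced lambda values.
lams = np.linspace(0.0, 1.0, 5)
means = [12.0, 8.5, 5.0, 2.5, 1.0]
print(f"dG = {ti_delta_G(lams, means):.3f} kcal/mol")  # 5.625
```

In practice more windows are placed where ⟨∂H/∂λ⟩ changes rapidly, since the accuracy of the quadrature depends on how smoothly the integrand varies between windows.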
methods pull the ligand from the binding site back into the solvent. There are also
non-equilibrium methods based on the Jarzynski relationship that can be used to
obtain the BFE [27, 28].
While TPS has many advantages, especially regarding its speed and ability
to analyze rare dynamical events that ordinary MD simulations cannot, there are
disadvantages that also coincide with this technique. One of the primary issues
is the delicacy of defining the stable states. If states are not defined precisely,
conformations can get “stuck” in a specific transition state or in an intermediate
state, which leads to insufficient sampling of the conformations and transition
pathways [7, 26]. Additionally, while the unbiased nature of TPS makes it powerful, it is also more difficult to implement because of the bookkeeping necessary to keep track of both the forward and backward time integrations employed in this method [26]. Some techniques based on TPS have
enhanced sampling and led to more accurate and faster PMF calculations. For
example, the temperature acceleration method uses high temperatures to accelerate
calculations [29]. Other enhanced sampling techniques include metadynamics [30],
blue-moon sampling [31], umbrella sampling [32], and replica exchange [33].
End-point methods are the final and most efficient G calculation technique
discussed in this chapter. The low expense of these methods relies on the fact that
they sample only the final states of a system (unlike alchemical and path sampling,
which sample non-physical and physical intermediates, respectively) [34]. End-
point methods aim to serve as an intermediate between the speed of scoring
approaches and the accuracy of computationally expensive alchemical and path
sampling methods [35, 36]. Molecular mechanics Poisson–Boltzmann surface area (MM/PBSA) and molecular mechanics generalized Born surface area (MM/GBSA) are among the most commonly used end-point methods and are closely related [37]. The
basis of the MM/PB(GB)SA approach relies on some of the ideas discussed in the
introduction to this chapter. The change in binding free energy is:
ΔG = ΔH − TΔS = G_PL − G_P − G_L (5)

where P refers to the protein, L to the ligand, and PL to the complex. This equation
can be broken down into more specific contributions based on different interactions
with the protein-ligand system [7, 36, 38]. Specifically, the enthalpic contribution is the sum of the changes in gas-phase molecular mechanics energy and the solvation free energy:

ΔH = ΔE_MM + ΔG_solv (6)

ΔE_MM is the gas-phase energy of the solute and consists of the changes in internal energy (i.e., the changes from bond, angle and dihedral energies), van der Waals energy (such as the formation and breaking of hydrogen bonds), and the change in electrostatic energy [36, 38, 39]. ΔG_solv is the solvation free energy, which consists
of both polar and nonpolar components. The polar component is calculated via
PB (or GB) [40–44] and the nonpolar component is calculated using the solvent-
accessible surface area [45, 46]. The −TΔS term refers to the absolute temperature multiplied by the change in entropy, estimated from normal mode analysis (NMA). This estimate is quite computationally expensive and is often disregarded in PB/GB
calculations. Recently, a few computational and theoretical techniques have been
introduced to overcome the intractability of NMA calculations, such as system
truncation [39], consideration of distance-dependent dielectric constant [47], and
reduction in translational and rotational freedom of the ligand upon protein-ligand
binding [48].
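The decomposition described above (Eqs. 5 and 6, plus the optional −TΔS term) amounts to a simple sum, sketched below with hypothetical per-term values in kcal/mol:

```python
# Hypothetical per-term averages (kcal/mol); none are real simulation output.
def mmpbsa_binding_energy(dE_internal, dE_vdw, dE_elec,
                          dG_polar, dG_nonpolar, minus_TdS=0.0):
    dE_MM = dE_internal + dE_vdw + dE_elec  # gas-phase MM terms
    dG_solv = dG_polar + dG_nonpolar        # PB/GB polar + SASA nonpolar
    return dE_MM + dG_solv + minus_TdS      # -T*dS defaults to 0 (often omitted)

dG = mmpbsa_binding_energy(dE_internal=0.0,  # cancels in the single-trajectory scheme
                           dE_vdw=-45.2, dE_elec=-20.1,
                           dG_polar=38.7, dG_nonpolar=-4.9)
print(f"dG_bind = {dG:.1f} kcal/mol")  # -31.5
```

Note how the unfavorable polar solvation term partially offsets the favorable gas-phase electrostatics, a cancellation that is typical of this decomposition.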
MD simulations can be employed for MM/PB(GB)SA calculations in one of
three ways [49]. The first way, the separate-trajectory method (3A), uses three
different trajectories to analyze the protein, ligand, and bound complex separately.
Another method is known as the single trajectory (1A) method and uses one
trajectory to generate the ensemble of snapshots [36, 38, 39]. The performance of
the separate versus the single method was found to vary between systems and test
models, but the separate trajectory approach has been shown to have greater standard
errors and uncertainties, leading the 1A method to be generally deemed more accurate [50–52]. Another benefit of the single trajectory method is that the change in bond energy is no longer relevant to ΔE_MM, as it cancels out [49]. One of the
downfalls with the single trajectory approach is that it disregards structural changes
to the ligand upon binding to the protein/receptor that are potentially important in
getting accurate characterization of the protein-ligand interaction [7, 53]. To remedy
this, there have been suggestions of a “compromise”—the 2A approach, which
employs trajectories to sample both the complex as well as the ligand in order
to account for energetic changes due to restructuring the ligand [49, 53]. The 2A
method has been found to improve results of binding affinity prediction [54].
The computational protocol for MM/PB(GB)SA can be generally described as
follows: first, an explicit solvent model is used to perform an MD simulation on the
protein-ligand complex to get the free energy contributions from the protein, ligand,
and complex and produce the necessary ensemble of snapshots. Then, the solvent
molecules and charged ions are deleted from each snapshot to prepare for the use
of the implicit solvent model [55], and then the MM/PB(GB)SA method is used to analyze and compute the total ΔG_solv for each snapshot. The overall ΔG_solv is the
sum of the individual contributions [7, 34, 39]. This thermodynamic cycle is shown in Fig. 2. The use of an explicit solvent model in the MD simulation followed by the implicit solvent model for the calculation might seem inconsistent since different energy functions are used, but studies have found that using an implicit solvent model for the MD simulation as well leads to less accurate results [56].
The unreliable double-implicit protocol has sometimes led to dissociation of the protein-ligand complex, making it less dependable than an explicit simulation followed by an implicit calculation [57].
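The per-snapshot averaging at the heart of this protocol can be sketched as follows; the snapshot energies are hypothetical placeholders rather than real MM/PB(GB)SA output:

```python
import statistics

def mmpbsa_over_snapshots(snapshot_energies):
    """Average per-snapshot dG = G_PL - G_P - G_L (single-trajectory scheme),
    where each entry is a (G_PL, G_P, G_L) tuple in kcal/mol."""
    per_frame = [g_pl - g_p - g_l for g_pl, g_p, g_l in snapshot_energies]
    return statistics.mean(per_frame), statistics.stdev(per_frame)

# Three hypothetical snapshots (invented numbers, not real output).
frames = [(-5100.0, -4000.0, -1085.0),
          (-5095.0, -3998.0, -1084.0),
          (-5102.0, -4001.0, -1086.0)]
mean_dG, sd = mmpbsa_over_snapshots(frames)
print(f"dG = {mean_dG:.2f} +/- {sd:.2f} kcal/mol")
```

Taking the difference per frame before averaging is what makes the internal-energy terms cancel in the single-trajectory (1A) scheme discussed above.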
The final popular database is Binding Mother of All Databases, better known as
Binding MOAD [66]. Similar to PDBbind, Binding MOAD has also generated sets based on high-quality data from the PDB. Binding MOAD includes proteins with X-ray crystal structures, and the subset is said to fall between PDBbind’s general and
refined sets [66, 67]. Binding MOAD provides a myriad of information and data.
The webpage for each protein shows ligand information (both valid and invalid
reports), any available binding data, the chemical structures of the ligands, and
proteins within the 90%, 70%, and 50% homologies. These tools are all extremely
useful because they allow for an easy way to search for similar structures [66, 67].
As of the 2019 release, there are 38,702 protein-ligand structures, 14,324 instances
of binding data, 18,939 ligands, and 10,500 protein families. Overall, Binding
MOAD is one of the most robust protein-ligand databases available, providing large
amounts of information across many types of complexes.
4 Molecular Docking
have been developed using energy functions to pull out the putative binding poses
between a protein and ligand [10]. There are many algorithms by which these poses
can be found, and one goal of algorithms currently being developed is to expedite
the search process [10, 75].
The other part of the docking process is the scoring phase, wherein the various
poses are ranked from most to least likely. The scoring function is used to assess the
binding affinity between the protein and ligand after docking; ideally, the scoring function should both predict the binding free energy and allow for high-throughput screening of various drugs [7]. Different scoring functions have been
developed for docking purposes, and they can be broken down into three principal
groups: empirically-based functions, knowledge-based functions, and force-field
based functions.
Empirically-based scoring functions are perhaps the simplest of the three. This approach requires one or more descriptors of the binding, a training set comprising a large number and variety of protein-ligand complexes with experimental binding affinities, and a regression, classification, or machine learning algorithm that relates the descriptor(s) to the experimental affinity values [76, 77]. Empirical scoring functions rely on many different energy terms to determine favorability; they are simpler in the sense that they do not consider the underlying physical interactions of the protein and ligand. Some
areas of research within empirically-based functions include finding accurate energy terms to incorporate into the function and avoiding overfitting, an issue that stems from the large number of energy terms used in empirical functions [78].
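A toy version of this empirical-scoring recipe, fitting a linear combination of energy-term descriptors to affinities by ordinary least squares (all descriptors, weights, and affinities below are synthetic, not any published scoring function):

```python
import numpy as np

# Synthetic training data: 50 complexes x 3 energy-term descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_w = np.array([-1.5, 0.8, -0.3])              # hidden "true" weights
y = X @ true_w + rng.normal(scale=0.05, size=50)  # noisy affinity values

# Fit the linear empirical scoring function by ordinary least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted weights:", np.round(w, 2))

def empirical_score(descriptors, weights=w):
    """Predicted affinity of a new complex from its energy descriptors."""
    return float(np.asarray(descriptors) @ weights)
```

With many more descriptors than this, the overfitting risk mentioned above becomes real, which is why regularization or careful term selection is an active research area.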
Knowledge-based scoring functions are statistically-based functions which come
from the assumption that interactions between the protein and ligand that occur
more frequently than expected by random chance contribute more favorably to the
binding affinity [79]. These types of scoring functions also rely on training sets
to find the statistical potentials of various interactions. Although one of the biggest issues with knowledge-based scoring functions is their computational implementation, they serve as a “happy medium” between empirical and force-field based scoring functions due to their relative swiftness and simplicity, as well as their ability to avoid some of the challenges posed by empirical functions [80].
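The inverse-Boltzmann idea behind such statistical potentials can be sketched as follows; the per-bin pair counts are invented for illustration:

```python
import math

kT = 0.593  # kcal/mol at roughly 298 K

def statistical_potential(observed, reference):
    """Inverse-Boltzmann pseudo-energy per distance bin:
    u(r) = -kT * ln(N_obs(r) / N_ref(r)); pairs observed more often than the
    reference expectation receive a favorable (negative) energy."""
    return [-kT * math.log(o / r) for o, r in zip(observed, reference)]

# Hypothetical counts of one atom-pair type in five distance bins.
obs = [5, 40, 120, 80, 60]
ref = [10, 50, 60, 80, 90]
u = statistical_potential(obs, ref)
print([round(x, 3) for x in u])  # the over-represented bin comes out negative
```

The choice of the reference state (the expected counts) is the main design decision in these potentials and largely determines their behavior.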
Force-field based approaches account for the underlying physical interactions
occurring between binding partners. As mentioned previously, entropic calculations
are computationally expensive; thus, they are often ignored. However, it is important
to acknowledge that entropic effects can play a major role in the binding affinity,
generating a need for a more efficient method to find the entropic contribution.
Force-field based methods are of particular relevance because they can employ
the implicit solvent models within PBSA and GBSA to account for the physical
interactions of the solvent, improving the accuracy with which the binding affinity
can be predicted [81, 82]. One of the benefits of using MM/PB(GB)SA is that such
methods can explore structure-activity relationships within ligands as well as their
selectivity profiles, and are less computationally demanding than FEP techniques.
Additionally, computation time can be decreased by modifying MM/PB(GB)SA to
Protein-Ligand Binding with Applications in Molecular Docking 11
5 Conclusion
Acknowledgments This work has been partially supported by a California State University
Program for Education and Research in Biotechnology (CSUPERB) New Investigator funding
awarded to N.F.
References
1. W. L. Jorgensen, “The many roles of computation in drug discovery,” Science, vol. 303,
no. 5665, pp. 1813–1818, 2004.
2. H.-J. Woo and B. Roux, “Calculation of absolute protein–ligand binding free energy from
computer simulations,” Proceedings of the National Academy of Sciences, vol. 102, no. 19,
pp. 6825–6830, 2005.
3. M. M. Pierce, C. Raman, and B. T. Nall, “Isothermal titration calorimetry of protein–protein
interactions,” Methods, vol. 19, no. 2, pp. 213–221, 1999.
4. R. Karlsson and A. Fält, “Experimental design for kinetic analysis of protein-protein inter-
actions with surface plasmon resonance biosensors,” Journal of Immunological Methods,
vol. 200, no. 1-2, pp. 121–133, 1997.
5. A. M. Rossi and C. W. Taylor, “Analysis of protein-ligand interactions by fluorescence
polarization,” Nature Protocols, vol. 6, no. 3, pp. 365–387, 2011.
6. M. Jerabek-Willemsen, T. André, R. Wanner, H. M. Roth, S. Duhr, P. Baaske, and D. Breit-
sprecher, “Microscale thermophoresis: Interaction analysis and beyond,” Journal of Molecular
Structure, vol. 1077, pp. 101–113, 2014.
7. X. Du, Y. Li, Y.-L. Xia, S.-M. Ai, J. Liang, P. Sang, X.-L. Ji, and S.-Q. Liu, “Insights
into protein–ligand interactions: mechanisms, models, and methods,” International Journal
of Molecular Sciences, vol. 17, no. 2, p. 144, 2016.
8. M. K. Gilson and H.-X. Zhou, “Calculation of protein-ligand binding affinities,” Annual Review
of Biophysics and Biomolecular Structure, vol. 36, pp. 21–42, 2007.
9. S. J. Y. Macalino, V. Gosu, S. Hong, and S. Choi, “Role of computer-aided drug design in
modern drug discovery,” Archives of Pharmacal Research, vol. 38, no. 9, pp. 1686–1701, 2015.
10. P. H. Torres, A. C. Sodero, P. Jofily, and F. P. Silva-Jr, “Key topics in molecular docking for
drug design,” International Journal of Molecular Sciences, vol. 20, no. 18, p. 4574, 2019.
11. J. Li, A. Fu, and L. Zhang, “An overview of scoring functions used for protein–ligand
interactions in molecular docking,” Interdisciplinary Sciences: Computational Life Sciences,
vol. 11, no. 2, pp. 320–328, 2019.
12. H. Li, Y. Xie, C. Liu, and S. Liu, “Physicochemical bases for protein folding, dynamics, and
protein-ligand binding,” Science China Life Sciences, vol. 57, no. 3, pp. 287–302, 2014.
13. R. Miesfeld and M. McEvoy, Biochemistry. W.W. Norton, 2017.
14. D. A. McQuarrie and J. D. Simon, Physical Chemistry: a Molecular Approach, vol. 1. University Science Books, Sausalito, CA, 1997.
15. J. D. Chodera, D. L. Mobley, M. R. Shirts, R. W. Dixon, K. Branson, and V. S. Pande,
“Alchemical free energy methods for drug discovery: progress and challenges,” Current
Opinion in Structural Biology, vol. 21, no. 2, pp. 150–160, 2011.
16. A. S. Mey, B. Allen, H. E. B. Macdonald, J. D. Chodera, M. Kuhn, J. Michel, D. L. Mobley,
L. N. Naden, S. Prasad, A. Rizzi, et al., “Best practices for alchemical free energy calculations,”
arXiv preprint arXiv:2008.03067, 2020.
17. W. L. Jorgensen and L. L. Thomas, “Perspective on free-energy perturbation calculations for
chemical equilibria,” Journal of Chemical Theory and Computation, vol. 4, no. 6, pp. 869–876,
2008.
18. Y. Meng, D. Sabri Dashti, and A. E. Roitberg, “Computing alchemical free energy differences
with Hamiltonian replica exchange molecular dynamics (H-REMD) simulations,” Journal of
Chemical Theory and Computation, vol. 7, no. 9, pp. 2721–2727, 2011.
19. W. Jespers, M. Esguerra, J. Åqvist, and H. Gutiérrez-de Terán, “QligFEP: an automated workflow for small molecule free energy calculations in Q,” Journal of Cheminformatics, vol. 11, no. 1, pp. 1–16, 2019.
20. V. Gapsys, S. Michielssens, J. H. Peters, B. L. de Groot, and H. Leonov, “Calculation of binding
free energies,” in Molecular Modeling of Proteins, pp. 173–209, Springer, 2015.
21. M. J. Mitchell and J. A. McCammon, “Free energy difference calculations by thermodynamic
integration: difficulties in obtaining a precise value,” Journal of Computational Chemistry,
vol. 12, no. 2, pp. 271–275, 1991.
22. M. Jorge, N. M. Garrido, A. J. Queimada, I. G. Economou, and E. A. Macedo, “Effect of the
integration method on the accuracy and computational efficiency of free energy calculations
using thermodynamic integration,” Journal of Chemical Theory and Computation, vol. 6, no. 4,
pp. 1018–1027, 2010.
23. S. Bruckner and S. Boresch, “Efficiency of alchemical free energy simulations. I. A practical
comparison of the exponential formula, thermodynamic integration, and Bennett’s acceptance
ratio method,” Journal of Computational Chemistry, vol. 32, no. 7, pp. 1303–1319, 2011.
24. W. You, Z. Tang, and C.-e. A. Chang, “Potential mean force from umbrella sampling
simulations: What can we learn and what is missed?,” Journal of Chemical Theory and
Computation, vol. 15, no. 4, pp. 2433–2443, 2019.
25. S. Wan, A. P. Bhati, S. J. Zasada, and P. V. Coveney, “Rapid, accurate, precise and reproducible
ligand–protein binding free energy prediction,” Interface Focus, vol. 10, no. 6, p. 20200007,
2020.
26. P. Bolhuis and C. Dellago, “Practical and conceptual path sampling issues,” The European
Physical Journal Special Topics, vol. 224, no. 12, pp. 2409–2427, 2015.
27. C. Jarzynski, “Nonequilibrium equality for free energy differences,” Physical Review Letters,
vol. 78, no. 14, p. 2690, 1997.
28. C. F. Narambuena, D. M. Beltramo, and E. P. Leiva, “Polyelectrolyte adsorption on a charged
surface. free energy calculation from Monte Carlo simulations using Jarzynski equality,”
Macromolecules, vol. 41, no. 21, pp. 8267–8274, 2008.
29. L. Maragliano and E. Vanden-Eijnden, “A temperature accelerated method for sampling free
energy and determining reaction pathways in rare events simulations,” Chemical Physics
Letters, vol. 426, no. 1-3, pp. 168–175, 2006.
30. A. Laio and M. Parrinello, “Escaping free-energy minima,” Proceedings of the National
Academy of Sciences, vol. 99, no. 20, pp. 12562–12566, 2002.
31. G. Ciccotti, R. Kapral, and E. Vanden-Eijnden, “Blue moon sampling, vectorial reaction
coordinates, and unbiased constrained dynamics,” ChemPhysChem, vol. 6, no. 9, pp. 1809–
1814, 2005.
32. J.-F. St-Pierre, M. Karttunen, N. Mousseau, T. Rog, and A. Bunker, “Use of umbrella sampling to calculate the entrance/exit pathway for the Z-pro-prolinal inhibitor in prolyl oligopeptidase,” Journal of Chemical Theory and Computation, vol. 7, no. 6, pp. 1583–1594, 2011.
33. M. Fajer, D. Hamelberg, and J. A. McCammon, “Replica-exchange accelerated molecular
dynamics (REXAMD) applied to thermodynamic integration,” Journal of Chemical Theory
and Computation, vol. 4, no. 10, pp. 1565–1569, 2008.
34. E. Wang, H. Sun, J. Wang, Z. Wang, H. Liu, J. Z. Zhang, and T. Hou, “End-point binding
free energy calculation with MM/PBSA and MM/GBSA: strategies and applications in drug
design,” Chemical Reviews, vol. 119, no. 16, pp. 9478–9508, 2019.
35. E. A. Rifai, M. van Dijk, and D. P. Geerke, “Recent developments in linear interaction energy
based binding free energy calculations,” Frontiers in Molecular Biosciences, vol. 7, p. 114,
2020.
36. S. Genheden and U. Ryde, “Comparison of the efficiency of the lie and MM/GBSA methods
to calculate ligand-binding energies,” Journal of Chemical Theory and Computation, vol. 7,
no. 11, pp. 3768–3778, 2011.
37. H. Gohlke and D. A. Case, “Converging free energy estimates: MM-PB(GB)SA studies on the protein–protein complex Ras–Raf,” Journal of Computational Chemistry, vol. 25, no. 2, pp. 238–250, 2004.
38. J. Srinivasan, T. E. Cheatham, P. Cieplak, P. A. Kollman, and D. A. Case, “Continuum solvent studies of the stability of DNA, RNA, and phosphoramidate–DNA helices,” Journal of the American Chemical Society, vol. 120, no. 37, pp. 9401–9409, 1998.
39. N. Forouzesh and N. Mishra, “An effective MM/GBSA protocol for absolute binding free energy calculations: A case study on SARS-CoV-2 spike protein and the human ACE2 receptor,” Molecules, vol. 26, no. 8, p. 2383, 2021.
40. A. Onufriev, D. Bashford, and D. A. Case, “Modification of the generalized Born model suitable for macromolecules,” J. Phys. Chem. B, vol. 104, no. 15, pp. 3712–3720, 2000.
41. N. Forouzesh, S. Izadi, and A. V. Onufriev, “Grid-based surface generalized born model
for calculation of electrostatic binding free energies,” Journal of Chemical Information and
Modeling, vol. 57, no. 10, pp. 2505–2513, 2017.
61. M. K. Gilson, T. Liu, M. Baitaluk, G. Nicola, L. Hwang, and J. Chong, “BindingDB in 2015: a
public database for medicinal chemistry, computational chemistry and systems pharmacology,”
Nucleic Acids Research, vol. 44, no. D1, pp. D1045–D1053, 2016.
62. R. Wang, X. Fang, Y. Lu, and S. Wang, “The PDBbind database: Collection of binding affinities for protein–ligand complexes with known three-dimensional structures,” Journal of Medicinal Chemistry, vol. 47, no. 12, pp. 2977–2980, 2004.
63. R. Wang, X. Fang, Y. Lu, C.-Y. Yang, and S. Wang, “The PDBbind database: methodologies
and updates,” Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 4111–4119, 2005.
64. M. Su, Q. Yang, Y. Du, G. Feng, Z. Liu, Y. Li, and R. Wang, “Comparative assessment of scoring functions: the CASF-2016 update,” Journal of Chemical Information and Modeling, vol. 59, no. 2, pp. 895–913, 2018.
65. Z. Liu, M. Su, L. Han, J. Liu, Q. Yang, Y. Li, and R. Wang, “Forging the basis for developing
protein–ligand interaction scoring functions,” Accounts of Chemical Research, vol. 50, no. 2,
pp. 302–309, 2017.
66. L. Hu, M. L. Benson, R. D. Smith, M. G. Lerner, and H. A. Carlson, “Binding MOAD (Mother of All Databases),” Proteins: Structure, Function, and Bioinformatics, vol. 60, no. 3, pp. 333–340, 2005.
67. R. D. Smith, J. J. Clark, A. Ahmed, Z. J. Orban, J. B. Dunbar Jr, and H. A. Carlson, “Updates to Binding MOAD (Mother of All Databases): polypharmacology tools and their utility in drug repurposing,” Journal of Molecular Biology, vol. 431, no. 13, pp. 2423–2433, 2019.
68. X. Zhang, H. Perez-Sanchez, and F. C. Lightstone, “A comprehensive docking and MM/GBSA rescoring study of ligand recognition upon binding antithrombin,” Current Topics in Medicinal Chemistry, vol. 17, no. 14, pp. 1631–1639, 2017.
69. S.-Y. Huang and X. Zou, “Advances and challenges in protein-ligand docking,” International
Journal of Molecular Sciences, vol. 11, no. 8, pp. 3016–3034, 2010.
70. S. F. Sousa, N. M. F. S. A. Cerqueira, P. A. Fernandes, and M. J. Ramos, “Virtual screening in drug design and development,” Combinatorial Chemistry & High Throughput Screening, vol. 13, no. 5, pp. 442–453, 2010.
71. R. G. Coleman, M. Carchia, T. Sterling, J. J. Irwin, and B. K. Shoichet, “Ligand pose and
orientational sampling in molecular docking,” PloS One, vol. 8, no. 10, p. e75992, 2013.
72. K. A. Johnson, “Role of induced fit in enzyme specificity: a molecular forward/reverse switch,”
Journal of Biological Chemistry, vol. 283, no. 39, pp. 26297–26301, 2008.
73. N. Forouzesh, M. R. Kazemi, and A. Mohades, “Structure-based analysis of protein binding
pockets using von Neumann entropy,” in International Symposium on Bioinformatics Research
and Applications, pp. 301–309, Springer, 2014.
74. S. F. Sousa, A. J. Ribeiro, J. Coimbra, R. Neves, S. Martins, N. Moorthy, P. Fernandes, and
M. Ramos, “Protein-ligand docking in the new millennium–a retrospective of 10 years in the
field,” Current Medicinal Chemistry, vol. 20, no. 18, pp. 2296–2314, 2013.
75. I. Halperin, B. Ma, H. Wolfson, and R. Nussinov, “Principles of docking: An overview
of search algorithms and a guide to scoring functions,” Proteins: Structure, Function, and
Bioinformatics, vol. 47, no. 4, pp. 409–443, 2002.
76. I. A. Guedes, F. S. Pereira, and L. E. Dardenne, “Empirical scoring functions for structure-
based virtual screening: applications, critical aspects, and challenges,” Frontiers in Pharma-
cology, vol. 9, p. 1089, 2018.
77. L. P. Pason and C. A. Sotriffer, “Empirical scoring functions for affinity prediction of protein-
ligand complexes,” Molecular Informatics, vol. 35, no. 11-12, pp. 541–548, 2016.
78. S.-Y. Huang, S. Z. Grinter, and X. Zou, “Scoring functions and their evaluation methods for
protein–ligand docking: recent advances and future directions,” Physical Chemistry Chemical
Physics, vol. 12, no. 40, pp. 12899–12908, 2010.
79. I. Muegge, “PMF scoring revisited,” Journal of Medicinal Chemistry, vol. 49, no. 20, pp. 5895–
5902, 2006.
80. S. Z. Grinter and X. Zou, “Challenges, applications, and recent advances of protein-ligand
docking in structure-based drug design,” Molecules, vol. 19, no. 7, pp. 10150–10176, 2014.
16 N. Mishra and N. Forouzesh
1 Introduction
the geometric and chemical aspects of the proteins being compared. Second,
the digital representation is paired with a comparison metric, which is used to
evaluate similarity between representations of multiple protein structures. Third,
a comparison algorithm is developed to identify the fairest comparison of two or
more proteins based on the metric. Finally, a statistical model is used to evaluate
the significance of the final measurement relative to baseline structural noise.
Enhancements in representations, metrics, comparison algorithms and statistical
models, while sometimes presented together as a singular method, represent
progress in the field towards the solution of problems in specificity annotation.
Due to their integrated nature, novel designs for some components, especially new
representations of protein structure, can represent major advancements to the field,
because they can permit new analytical capabilities, as we will discuss later in this
chapter.
2 Specificity Assignment
Fig. 1 Examples of point-based motif designs. (a) Multiple amino acid labels describe variations
that occur in major evolutionary divergences. (b) Vectors at each motif point encode sidechain
orientation. (c) Cavity spheres denote regions of the binding cavity that must remain empty in the
matching structure, to accommodate the ligand
statistically significant false positives while maintaining 87% of true positives (some
true positives are also lost in the process) [19].
Beyond the enhancement of motifs with additional information, algorithms have
also been developed to better select the points in the motif, to achieve superior
prediction accuracy. By repeatedly comparing variations of potential motifs against
a nonredundant subset of the PDB, Geometric Sieving [22] identifies motifs that
exhibit greater sensitivity than if an arbitrary set of points near the binding site were
used for the motif. Composite motifs use multiple structures of the same binding
site to generate averaged motifs, which can exhibit greater similarity to proteins
with similar binding preferences [18]. These composite motifs can be significantly
extended to represent the same protein in different conformations and variations of
the same family of proteins with different binding specificities [8].
Spherical harmonics have also been developed as an alternative representation
of protein-ligand binding pockets [58]. These infinitely differentiable, single-valued
functions of two polar coordinates can represent some three dimensional regions,
as long as they fulfill the star-convexity property: A region r is star-convex if there
exists a point p0 ∈ r such that for every point p ∈ r, all points on the segment
p0 p are inside r. Whereas the comparison of point-based binding sites requires
many pairs of corresponding points to be found and aligned, spherical harmonics
can be compared by finding optimal rotational superpositions. Spherical harmonics
can also be used to represent ligands and other small molecules.
RMSD = √( (1/n) Σ_{i=1}^{n} d(mi , pi )² ),
where d(mi , pi ) is the Euclidean distance between corresponding points. LRMSD is the
minimal value of RMSD over all rigid transformations of the motif, which
can be rapidly determined in linear time using eigenvector [44, 82] or quaternion
[27] methods. Because of the minimal nature of LRMSD, measuring LRMSD yields
both a measurement, in Angstroms, and a rigid superposition of one structure onto
the other. Many methods refer to LRMSD as RMSD, because there is no value in
using RMSD as a comparison metric when it has not been minimized.
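To make the minimization concrete, the following Python sketch computes LRMSD with the SVD-based (Kabsch) superposition; it is an illustration of the idea, not the exact eigenvector [44, 82] or quaternion [27] implementations cited above:

```python
import numpy as np

def lrmsd(motif, target):
    """Least RMSD between two equally sized point sets.

    A minimal sketch of the SVD (Kabsch) approach: center both point
    sets, compute the optimal rotation, and return the RMSD of the
    superposed points. Inputs are (n, 3) arrays with corresponding rows.
    """
    m = motif - motif.mean(axis=0)
    t = target - target.mean(axis=0)
    h = m.T @ t                                   # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T       # optimal rotation
    diff = m @ r.T - t
    return float(np.sqrt((diff ** 2).sum() / len(m)))
```

Because the rotation r and the centroids are computed along the way, the same procedure also yields the rigid superposition mentioned in the text, not just the scalar measurement.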
Many methods treat this core element of protein structure comparison as a black
box, focusing instead on the selection of atoms to align and on the algorithm for
rapidly finding atoms that yield low LRMSD alignments. Since LRMSD amounts
to measuring only the minimum root mean square of the distances between points, newer
methods have developed additional geometric criteria to exclude other biologically
irrelevant matches. For example, Cavity Aware Match Augmentation [19] uses
spheres to detect and eliminate proteins that do not maintain an empty binding cavity
within the motif. pvSOAR maps points around a pocket onto the unit sphere, and
measures their LRMSD to evaluate structural similarity [81].
An important constraint on LRMSD is the fact that it depends on bijections
between two equally sized pointsets. This constraint has serious implications for
specificity assignment in two ways: First, the structure of different amino acids
cannot be fully compared, because different amino acids have different numbers of
atoms. Many methods partially sidestep this problem by representing amino acids
with only a single point. Unfortunately, that approach partially ignores variations
in sidechain geometry, causing important differences in shape to be overlooked.
Second, depending on a bijection requires that any amino acid that occupies
different positions in two binding sites, thereby altering specificity, must also be
part of the motif, otherwise the difference may not be detected. As a result, effective
motif design, whether by expert design (e.g. [41]) or algorithmic refinement (e.g.
[7, 18, 20, 22]) is mandatory for practical specificity assignment.
biologically irrelevant. Even though they are biologically irrelevant, however, very
poor matches must be accounted for in any statistical models, to prevent algorithmic
bias [32].
Geometric Hashing Point-based motifs can be rapidly compared using Geometric
Hashing [33, 48], a technique that encodes point triplets as rotationally and
translationally invariant vectors. Invariants are generated for every triplet in the motif and stored
in a range search structure. When searching for matches for a given motif inside a
protein structure, the triplets of points in the structure are encoded into invariants and
compared against the invariants of the motif. Similar invariants represent a triplet of
corresponding points that can be used to generate initial alignments [10] or used in
voting systems to generate final correspondences of points between the protein and
the motif.
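As an illustration of the invariant idea (the exact encoding used in [33, 48] differs), the sorted pairwise distances of a triplet are unchanged by rotation and translation, so they can serve as a hash key:

```python
import itertools
import numpy as np

def triplet_invariant(p, q, r):
    """Rotation- and translation-invariant key for three points:
    the sorted triangle side lengths (a simplified stand-in for the
    invariants used in Geometric Hashing)."""
    return tuple(sorted((float(np.linalg.norm(p - q)),
                         float(np.linalg.norm(q - r)),
                         float(np.linalg.norm(r - p)))))

def hash_motif(points, ndigits=1):
    """Store a rounded invariant for every triplet of motif points."""
    table = {}
    for i, j, k in itertools.combinations(range(len(points)), 3):
        key = tuple(round(x, ndigits) for x in
                    triplet_invariant(points[i], points[j], points[k]))
        table.setdefault(key, []).append((i, j, k))
    return table
```

Triplets of a target structure hashed the same way will collide with motif triplets of similar geometry, providing the candidate correspondences described above.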
Depth First Search A second group of algorithms uses depth first search techniques
to systematically construct bijections between motif points and target points (Fig. 2)
[73]. Algorithms like Match Augmentation [10], and Labelhash [57], begin with
one or a handful of point-point correspondences. From this initial state, depth first
searches superpose the corresponding points and then identify additional
correspondences between motif points that are brought into proximity with acceptable target
points. Multiple potential correspondences can be detected for each motif point,
creating a branching nature in the depth first search as different possibilities are
explored. After exploring many possibilities, both geometric hashing and depth
first search matching algorithms return the match with the largest number of
corresponding points.
Fig. 2 Operation of depth first search comparison algorithms. (a) Three points (black, white) have
correspondences (thin black lines), and we seek to find acceptable correspondences for the next
motif point (grey). (b) Range search (circle) for points that may correspond (white) with the
next motif point. At this point, the depth first search first considers one possibility (c1–e), and
eventually considers the other (c2, d2). (c1) A tentative correspondence is considered with the
first point. (d1) LRMSD superposition between corresponding points is computed. (e) The final
distance between points in LRMSD superposition is acceptable, so we consider the next motif
point (grey). (c2) A tentative correspondence is considered with the second point. (d2) The final
distance between points in LRMSD superposition is not acceptable, because they get too far apart.
This set of correspondences is not used further, and we backtrack to other correspondences
24 Z. Guo and B. Y. Chen
Algorithms that compare points in space always require a data structure that enables
range search. A common structure that achieves this purpose is a three dimensional
kd-tree [4], a space partitioning data structure which is well documented and does
not need to be described further. Because atoms can only pack into limited
densities, however, three dimensional cubic lattice structures, Lattice Hashes, can
also be very useful. Geometric comparisons generally search for atoms nearby a
given point, which amounts to a range search on a spherical range. Lattice Hashes
can support spherical range search very easily.
Construction To construct a Lattice Hash, we begin with a given set of points in
three dimensions and a resolution that defines the size of each cube in the lattice.
We first find the bounding box of the input points and determine the smallest lattice
of cubes, based on the resolution parameter, that fully contains the bounding box
(Fig. 3a). Rather than allocating each cube immediately, we refer to each cube using
a unique index:

index = xpos · ydim · zdim + ypos · zdim + zpos,

where xpos, ypos, and zpos are the position of the cube, counting from the low-x,
low-y, low-z corner of the lattice. ydim and zdim are the number of cubes along
the y and z dimensions of the lattice. All cubes that contain points are stored in a
hash table based on this index. To insert each point into the lattice, we associate it
with the cube that contains it, allocating memory only for cubes that contain points
(Fig. 3b).
Spherical Range Search We begin with a point and a radius that define the spherical
range. To discover points in the lattice within the range, we first identify the set of
cubes that contain the bounding box of the range, and then identify individual cubes
Fig. 3 Lattice Hash construction and operation. (a) The bounds of the Lattice Hash (edge of
lattice) surround the points to be represented (black circles) in uniform cubes. (b) Only occupied
cubes (black squares) are allocated on a hash table (second row). (c) Spherical range search (large
circle) centered at a point (white with heavy black outline). (d) Range search first identifies cubes
that intersect the range (black squares), then identifies occupied cubes (shaded squares) based on
their presence in the hash table
Explaining Specificity with Volumetric Representations 25
that overlap the sphere, using the distance of cube corners and the intersection of
cube segments (Fig. 3c). Finally, individual points are tested, and points within the
range are returned.
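The construction and spherical range search described above can be sketched compactly in Python; this version keys the hash table directly by integer cube coordinates rather than the linearized index, which is equivalent in effect:

```python
import math
from collections import defaultdict

class LatticeHash:
    """A sketch of the Lattice Hash: uniform cubes of edge `resolution`,
    with only occupied cubes stored in a hash table."""

    def __init__(self, points, resolution):
        self.res = resolution
        self.cells = defaultdict(list)
        for p in points:
            self.cells[self._cell(p)].append(p)

    def _cell(self, p):
        # Integer cube coordinates containing point p.
        return tuple(int(math.floor(c / self.res)) for c in p)

    def spherical_range(self, center, radius):
        """Points within `radius` of `center`: visit the cubes covering
        the range's bounding box, then test individual points."""
        lo = self._cell(tuple(c - radius for c in center))
        hi = self._cell(tuple(c + radius for c in center))
        hits = []
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    for p in self.cells.get((i, j, k), ()):
                        if math.dist(p, center) <= radius:
                            hits.append(p)
        return hits
```

For brevity this sketch tests every cube in the bounding box of the sphere; the cube-corner distance test described above would prune cubes that intersect the box but not the sphere itself.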
3 Component Localization
Many parts of a protein work together to achieve specificity. Finding these com-
ponents within the rest of the protein can be extremely difficult. However, once
the molecular mechanisms that implement specificity are found, they can provide a
crucial platform for applied research: Later studies could mutate these mechanisms
to examine how, for example, the protein could be reengineered to achieve novel
binding preferences for industrial applications, or how the protein might mutate
to achieve inhibitor resistance. Likewise, before the molecular mechanism of
specificity is known, predicting elements of protein structures that might control
specificity can narrow the space of possibilities that must be examined in order
to unravel how specificity is actually achieved. By finding individually influential
components, component localization is an emerging effort to use computational
precision and speed to suggest hypotheses about binding mechanisms that might have
been overlooked by human investigators.
The approach that component localization algorithms take is to search for subtle
differences in very similar proteins that cause different binding preferences. This
approach differs fundamentally from that of specificity assignment algorithms,
which search for subtle similarities that necessitate similar binding preferences
in very different proteins. Second, there are many ways in which differences in
protein structure can cause differences in binding preferences, including differences
in steric hindrance at binding sites, differences in electric fields, and so on. Finding a
difference that necessitates differences in binding preferences requires a biophysical
rationale for different binding preferences, and it also requires the consideration
of many different kinds of components. While amino acids are one kind of
component that influences specificity, empty clefts and cavities also play a role in
accommodating binding partners, as do local patterns of molecular flexibility. Many
components of protein structure and their biophysical roles must be considered for
a comprehensive approach to component localization.
Component localization algorithms originated in sequence based methods that
analyze alignments of related sequences in different specificity groups, looking for
sequence positions that are conserved within groups and different between groups
[11, 12, 45, 61, 63, 65]. Amino acids with these properties are under evolutionary
pressure to support the different binding preferences of distinct groups. Generally,
these methods require sequences that are categorized into groups with different
binding preferences, though some exceptions exist [34, 62, 70]. This evolutionary
rationale offers one explanation for the role of each amino acid in specificity.
Evolutionary information can be mapped onto protein structures for component
localization. Evolutionary Trace Annotation (ETA) [2] is a specificity assignment
algorithm that uses Paired Distance matching [83] to identify binding sites with
specificity identical to that of motifs generated with the Evolutionary Trace [61].
Since ETA motifs are generated using the Evolutionary Trace, an algorithm that
performs sequence-based component localization, the amino acids that structurally
match the motif are also implied to influence specificity in the same way as
the amino acids in the motif sequences. Recently, this approach was verified
experimentally against carboxylesterases with different binding preferences [2].
Three dimensional solids can be defined with a closed surface as a boundary for a
solid region. To represent proteins, the molecular surface (also known as the solvent
excluded surface [49]) is a useful boundary, and practical surfaces, constructed from
thousands of triangles, are provided by many existing methods, including MSMS
[75], GRASP2 [64], VADAR [85], and others [26]. Geometric solids, like spheres,
cubes, and tetrahedra, can also define useful boundaries for solid representations.
Comparisons of these solid representations are not founded on an analysis of
surface geometry, but on direct comparisons of the region occupied by the solid. This
kind of comparison is made possible through CSG operations (Fig. 4), also called
boolean operations. Solid comparisons of protein structures, therefore, are direct
Fig. 4 A demonstration of CSG operations union, intersection, and difference, illustrating the
borders of two input (dotted) and output regions (solid)
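On lattice (voxel) representations of solids, the boolean operations of Fig. 4 reduce to elementwise logic. A minimal sketch with boolean occupancy grids (the spheres and grid size here are illustrative only):

```python
import numpy as np

def sphere_grid(center, radius, shape, spacing=1.0):
    """Boolean occupancy grid of a sphere: a toy solid representation."""
    idx = np.indices(shape).astype(float) * spacing
    dist2 = sum((idx[d] - center[d]) ** 2 for d in range(3))
    return dist2 <= radius ** 2

a = sphere_grid((4, 4, 4), 3, (10, 10, 10))
b = sphere_grid((6, 4, 4), 3, (10, 10, 10))

union        = a | b          # points in either solid
intersection = a & b          # points in both solids
difference   = a & ~b         # points in a but not in b
```

Surface-based CSG, as performed by VASP with Marching Cubes (Fig. 5), is more precise than this voxel sketch, but the set semantics of the three operations are the same.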
Fig. 5 The union operation computed with Marching Cubes. (a) The input solid regions: A and
B (gray). (b) The regions in the lattice. (c) Interior points (dark gray) and exterior points (light
gray). (d) The lines connecting one interior point and one exterior point. (e) The crossing points
that intersect the boundary of the union. (f) Output triangles (solid lines) for the union approximation
Fig. 6 Computing the volume of a given closed region. (a) Input region A and its geometric
centroid (black dot). (b) One triangle and its normal vector. (c) Tetrahedra (triangles enclosed in
solid black lines) based on triangles (thick black lines) with normals (white arrows) facing away
from the centroid. (d) Tetrahedra (triangles enclosed in solid black lines) based on triangles (thick
black lines) with normals (black arrows) facing towards the centroid. The volume of the region is
the difference between the volume sum in (c) and the volume sum in (d)
Computing the volume within a closed region A is also a crucial aspect of VASP,
and we summarize it here: First, the centroid of the corners of all triangles is
determined. In A, the three points of each triangle and the centroid together define
a tetrahedron. Since the triangle either faces towards or away from the centroid,
by the Surveyor’s Formula [76], we can evaluate the volume within A by adding
the volumes of all tetrahedra that face away from the centroid, and subtracting
the volumes of all tetrahedra that face towards the centroid. The volume of each
tetrahedron can be accurately computed using Tartaglia’s Rule [5] (Fig. 6).
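A sketch of this volume computation, assuming a closed mesh whose triangles are consistently oriented outward; using signed tetrahedron volumes makes the add/subtract bookkeeping by facing direction automatic:

```python
import numpy as np

def mesh_volume(triangles):
    """Volume enclosed by a closed, consistently oriented triangle mesh.

    A sketch of the centroid/tetrahedron idea: each triangle and the
    centroid form a tetrahedron whose signed volume (a determinant rule)
    adds when the triangle faces away from the centroid and subtracts
    when it faces toward it.
    """
    tris = np.asarray(triangles, float)       # shape (T, 3, 3)
    c = tris.reshape(-1, 3).mean(axis=0)      # centroid of all corners
    a, b, d = tris[:, 0] - c, tris[:, 1] - c, tris[:, 2] - c
    signed = np.einsum('ij,ij->i', a, np.cross(b, d)) / 6.0
    return abs(signed.sum())
```

The signed sum is independent of the chosen centroid for a closed mesh; the centroid merely keeps the intermediate tetrahedra small and numerically well behaved.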
Variations in structure that cause differences in binding specificity are most likely
to occur at binding sites. To compare binding site geometry, solid representations
can be used to describe the shape of the empty region that accommodates ligands
using a series of CSG operations. As an instructive example, the following method
describes one simple way to represent a binding cavity. We begin with the whole
structure of one protein A and the ligand l that binds at its functional site. For
each atom in l, a sphere is generated with radius 5.0 Å, and the union of all the
spheres, Sl , is calculated with VASP using CSG operations. Sl defines the vicinity
of the ligand binding cavity. GRASP2 [64] can be used to compute the molecular
surface m(A) using the classic rolling probe method [71]. The molecular surface
is generated using a 1.4 Å probe. A second “envelope” surface, e(A), is generated
with a 5.0 Å probe using the same algorithm. The binding cavity a is generated using
the following CSG expression:

a = (Sl − m(A)) ∩ e(A),

the difference region inside Sl but outside the molecular surface, intersected with the envelope (Fig. 7).
Fig. 7 Generating binding cavity regions. (a) The molecular surface m(A) (gray region) of protein
A and the binding substrate l. (b) The same molecular surface with respect to the border (dotted
line) of envelope surface e(A) of A. (c) The union of all the atom spheres Sl . The spheres are shown
in dotted lines. (d) The difference region (D) that lies inside Sl but not m(A). (e) The difference
region relative to the envelope surface border. (f) The output binding cavity. It is generated by
intersecting D and e(A)
cases, solid comparisons of ligand binding cavities can reveal differences in patterns
of steric hindrance that control binding preferences.
We begin with two proteins, A and B. To identify amino acids of B that have
a steric influence on the difference in specificity between A and B, we use the
following process: First, we superpose B onto A using whole structure (e.g. [35, 40,
78, 87]) or binding site superposition algorithms (e.g. [10, 73]). Second, we generate
the binding cavity of A, called a, using the methods described in Sect. 3.1.2. Third,
we find every amino acid bi of B, and generate its individual molecular surface,
m(bi ). We then measure the volume of m(bi ) ∩ a using the method described in
Fig. 6. If the intersection volume is small or zero, then bi is unlikely to have a
steric influence on differences in specificity between A and B because it does not
create differences in steric hindrance at the binding site. However, if the volume is
large, then we can say that bi creates large steric differences between a and b, and
therefore is likely to be a steric cause for differences in specificity. At a practical
level, sterically influential amino acids often create very large differences that are
hard to overlook [23].
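This per-residue test can be sketched on aligned boolean occupancy grids; the residue names, grid shapes, and the volume cutoff below are hypothetical choices for illustration:

```python
import numpy as np

def steric_influences(cavity_a, residue_surfaces, voxel_volume, cutoff=10.0):
    """For each residue surface grid m(b_i) of protein B, measure the
    volume of m(b_i) ∩ a as a voxel count times `voxel_volume`, and flag
    residues whose intersection volume exceeds `cutoff` (a hypothetical
    threshold) as possible steric causes of different specificity."""
    flagged = []
    for name, grid in residue_surfaces.items():
        vol = np.logical_and(cavity_a, grid).sum() * voxel_volume
        if vol > cutoff:
            flagged.append((name, vol))
    return flagged
```

In the actual pipeline the intersections are computed by VASP on surface representations; the voxel version above only conveys the shape of the test.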
To identify parts of binding cavities that influence specificity, consider two
families of protein structures A and B , where each family exhibits different
binding preferences and all family members are aligned to the same structure. First,
extract binding cavities from every protein in A and B , then compute the intersection i
of all cavities of A and the union u of all cavities of B . The difference, i − u,
is the cavity subregion that is conserved among all cavities of A and is not present
in any cavity of B . As we have observed in studies of serine proteases [23], this
difference can be the reason why some proteins prefer to bind different ligands.
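On aligned boolean occupancy grids, these family-level operations reduce to elementwise reductions; a sketch computing the subregion conserved in every cavity of one family and absent from every cavity of the other:

```python
from functools import reduce
import numpy as np

def conserved_difference(cavities_one, cavities_other):
    """Subregion present in every cavity of the first family and in no
    cavity of the second (a sketch on aligned boolean grids)."""
    conserved = reduce(np.logical_and, cavities_one)    # in all of family 1
    occupied = reduce(np.logical_or, cavities_other)    # in any of family 2
    return conserved & ~occupied
```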
Variations in ligand binding cavities can occur for many reasons that are unrelated
to specificity. They might arise from minor changes in backbone and sidechain
conformation, so they must be eliminated to reduce incorrect predictions. Towards
the ultimate goal of applying component localization algorithms at a large scale,
eliminating even minor variations automatically is essential: Variations that an
expert might immediately dismiss as having no effect on binding can lead to
false predictions, without an automated way to eliminate them. Likewise, subtle
but important differences might be overlooked by the human eye. To address this
issue, we developed VASP-S [15] to automatically isolate the elements of protein
structures that may influence preferential binding.
VASP-S (VASP with Statistics) was developed to filter out differences in binding
cavities that are too small to alter binding preferences, in order to isolate influential
differences. It is a parametric statistical model on the volume of individual structural
differences between binding cavities, which we call fragments (Fig. 8f, h). The
design of VASP-S begins with a null hypothesis, that two binding cavities have
no steric explanation for different binding preferences. Because only cavity shape is
considered here, the same model does not rule out the influence of other biophysical
effects on specificity, which we discuss later.
Suppose we wish to evaluate whether cavity b exhibits binding preferences
similar to a. The two cavities are already structurally aligned. Using a population
of cavities {a0 , a1 , . . . an } with specificity identical to a, we first compute the
volume of all fragments that occur between any pair of cavities ai and aj . We
fit a log-normal distribution to the fragment volumes to produce a probability
density function. Then, using any fragment f between a and b, we can test the
null hypothesis: if the probability of observing a fragment with volume larger
than v(f ), the volume of f , is less than our threshold for statistical significance, α, we
reject the null hypothesis, and adopt the alternative hypothesis: There exists a steric
explanation for a and b to exhibit different binding preferences. Moreover, the
Fig. 8 Isolating the conserved or the varying regions between aligned binding cavities with CSG
operations. (a, b) The molecular surfaces of two protein structures. (c, d) The corresponding two
binding cavities. (e) The superposition of binding cavities based on the whole structure alignment
method. (g) The intersection used in the VASP-I model. (f, h) The fragments calculated by CSG
difference in the VASP-S model. Statistically significant fragments usually have large volume
while insignificant fragments usually have small volume
Fig. 9 S1 specificity site of Atlantic salmon trypsin (pdb: 1a0j), shown in transparent grey. A
statistically significant fragment between the trypsin site and the S1 specificity site of porcine
pancreatic elastase (pdb:1b0e) shown in opaque grey. The fragment, which occupies the lower half
of the trypsin site, sterically hinders elastase from binding the larger substrates that trypsin prefers,
such as the tryptophan, shown in black sticks
fragment f creates variations in binding cavity shape that account for this predicted
difference in specificity. One example of a statistically significant fragment is shown
in Fig. 9. A related model of cavity intersections, VASP-I [14], was developed using
related concepts.
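A sketch of this parametric test, assuming the log-normal is fit by taking moments of the log-volumes (the fitting procedure itself is not specified above):

```python
import math
import numpy as np

def fragment_is_significant(null_volumes, v_f, alpha=0.05):
    """A sketch of the VASP-S idea: model log-volumes of fragments
    between same-specificity cavity pairs as normal (i.e. volumes as
    log-normal), then reject the null hypothesis when the upper-tail
    probability of v_f falls below alpha."""
    logs = np.log(null_volumes)
    mu, sigma = logs.mean(), logs.std(ddof=1)
    z = (math.log(v_f) - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))   # P(V > v_f) under the fit
    return p < alpha
```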
Different portions of a binding cavity do not necessarily have equal importance
for binding specificity. Statistical models, coupled with CSG operations, can also be
used to partially deconstruct empty cavity regions based on their impact on binding,
using a technique we call regionalization [16, 17]. Beginning with aligned binding
cavities, we generate a lattice of cubes that together cover a rectangular region that
surrounds the cavities. Each cube, a solid object, covers a part of the overlapping
binding cavities, and if we compute an intersection between that cube and all
binding cavities, we have a regional representation of each cavity inside each cube.
Fragments between regional representations are fragments inside each cube, and
using VASP-S, we can evaluate the statistical significance of each fragment. Once
VASP-S has been trained on every cube, some lattice cubes exhibit very different
fragment volume distributions: In some cubes, where cavities with identical binding
preferences vary considerably, large fragments are not statistically significant. In
other cubes, where cavities are highly conserved, even very small fragments are
statistically significant. These differences in cubic regions dissect binding cavities
into regions where the importance of steric hindrance is better understood.
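The per-cube bookkeeping behind regionalization can be sketched as follows; the cube size and the volume-in-voxels convention are illustrative:

```python
import numpy as np

def per_cube_volumes(fragment, cube):
    """Regionalization sketch: cover an (already aligned) fragment grid
    with cube-sized blocks and return each block's fragment volume in
    voxels, so a statistical model can be trained per cube."""
    nx, ny, nz = fragment.shape
    vols = {}
    for i in range(0, nx, cube):
        for j in range(0, ny, cube):
            for k in range(0, nz, cube):
                block = fragment[i:i + cube, j:j + cube, k:k + cube]
                vols[(i, j, k)] = int(block.sum())
    return vols
```

Collecting these per-cube volumes across many same-specificity cavity pairs yields the cube-specific fragment distributions described above.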
Fig. 10 Superposition using DFO-VASP. Beginning with unsuperimposed cavities (left), each
iteration of DFO-VASP performs one superposition, then evaluates the overlapping volume
between the binding cavities. Iterations repeat until the overlapping volume is maximized, enabling
the detection of differences in cavity shape independent of atomic positions
since the objective function has no closed-form expression and the function value
is prone to noise. DFO-VASP is a trust-region based algorithm [25] designed to
address this problem. The input is two protein cavities a and b, defined as solid
representations. The algorithm examines values of the vector x, which has seven
parameters: a rotation axis (three parameters), the rotation angle, and translations along three
axes, seeking the rotation and translation that superposes b onto a with the
largest volume of intersection. In a series of iterations, model functions are built
to approximate the objective function within a trust region. The model function is
optimized in the trust region to search for the best position for the next iteration.
To reduce the influence of noise, a dynamic incremental strategy is used: in
each iteration, given the VASP resolution r and the estimated noise ξr , reduce the
noise level by decreasing the resolution r if the reduction of the model function is
comparable to ξr .
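Under the axis/angle reading of the seven parameters, applying a candidate pose is straightforward with Rodrigues' rotation formula; this is a sketch of the transform only, and DFO-VASP's exact parameterization may differ:

```python
import numpy as np

def apply_pose(points, x):
    """Apply a 7-parameter pose: x[:3] a rotation axis, x[3] the angle,
    x[4:] a translation (Rodrigues' rotation formula)."""
    axis = x[:3] / np.linalg.norm(x[:3])
    theta = x[3]
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return points @ R.T + x[4:]
```

An objective for the derivative-free search would then be the intersection volume of the posed cavity b with cavity a, evaluated by VASP at resolution r.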
In practice, DFO-VASP tends to converge to a local optimum and find a
reasonable overlapping volume, but it cannot guarantee a globally optimal solution.
Given different initial values, DFO-VASP may converge to different final solutions.
Since the atomic coordinates and the optimal overlap can be highly related, the
structural backbone superposition can be used as a starting point. We call this a warm-
start superposition. The initial solution can also be randomly sampled by Latin
Hypercube Sampling methods [56], to produce well distributed random starting
points.
By visualizing the optimal superposition, it was found that DFO-VASP aligned
two protein binding cavities logically: the entrances to the cavities were oriented
in the same direction and their conserved regions were highly overlapped. Using
backbone alignment as the baseline, both warm-start and random DFO-VASP
exhibited greater intersection. In general, warm-start alignment had slightly
greater intersection volume, since a reasonable initial solution enabled DFO-VASP to explore
more of the neighborhood of the optimal superposition.
Proteins are flexible molecules. This flexibility complicates protein structure com-
parison and it interferes with accurate component localization. In recent years,
several protein structure comparison methods have represented backbone motions
using hinges [54], partial-order graphs [88] and fragment pairs [74]. These
representations describe protein structures using rigid components tied together with
flexible linkers. While these approaches approximate partial backbone flexibility,
they do not permit general flexibility in the backbone, and they also simplify the
flexibility of amino acid side chains. Here, we discuss two methods that achieve
a more general representation of the flexibility of protein structures and enable
component localization in spite of conformational variations.
One way to reduce errors from conformational change is to use protein structure
prediction to remodel proteins in nonfunctional conformations into conformations
that can be reasonably compared. We evaluated homology modeling to determine if
it can be an effective tool to reduce this kind of comparison error [36].
The statistical models and other methods developed above have generally
considered families of sequentially nonredundant proteins with identical fold and
function, but different binding preferences. Among these families, backbone or
sidechain variations are inevitable. We use remodeling to resolve this problem by
selecting one structure as a template. The structure is selected because it contains a
bound ligand or is otherwise in an active conformation. Beginning with the template
structure and the aligned sequences of the others, NEST [86], a homology modeling
package for protein structure prediction, was used to model the structures of the
other proteins with the template. Leveraging the fact that structure prediction by
homology modeling is highly accurate when applied to close homologs [3, 46],
the model structures from NEST were able to reduce conformational differences in
binding cavities. Model structures were then structurally aligned onto the template
structure using Ska [87] and the binding cavity of each model structure was
extracted using the method in Sect. 3.1.2. Between the binding cavities of same-
specificity proteins in inactive and active conformations, the largest fragment was
originally very large, and almost always statistically significant. After remodeling,
the largest fragment was generally consistent with the largest fragment observed
between other proteins with identical binding preferences [36]. Remodeling reduces
prediction errors associated with conformational changes.
Proteins are constantly in motion. The structural changes that result from these
motions do not change specificity, but they can create sources of error for component
localization. To test this hypothesis, we used molecular dynamics to simulate the
motion of nine sequentially nonredundant trypsin structures with the same binding
preferences for 100 nanoseconds. Examining 600 snapshots from each simulation,
we aligned each snapshot, generated the binding cavity, and identified the largest
fragment between the snapshot and a different trypsin. Over 65 percent of the largest
fragments were so large as to be statistically significant, indicating that normal
protein motions will cause VASP-S [15] to create erroneous predictions [37].
To perform component localization accurately despite the presence of flexibility,
we developed FAVA, an algorithm that uses conformational sampling to produce
a representation of the ligand binding cavity as it appears in the majority of
samples. As input, FAVA accepts conformational snapshots {A1 , A2 , ..., AN } of the
protein A, where N is the number of samples, and an overlap threshold k. FAVA
first extracts the binding cavity ci of the snapshot Ai with VASP (e.g. Sect. 3.1.2).
With each binding cavity snapshot, FAVA computes a frequent region, defined as
Explaining Specificity with Volumetric Representations 37
the region that is solvent accessible in more than k of the N samples. Ideally, the
frequent region could be computed as the union of the intersections of all distinct
sets of k cavity snapshots. Unfortunately, this computation would require N-choose-k
intersections, an impractical number for most values of k. Our approach in FAVA,
given any k, is instead to compute 500 different k-sized subsets of the cavity
snapshots, each selected at random from the set of all samples. We then compute the
intersection of each subset, and then the union of all 500 intersections, to approximate
the ideal frequent region. The frequent region ignores the subspace of unusual cavity
conformations and characterizes the average shape of the binding region in the
context of conformational flexibility.
Given two proteins A, B and the overlap threshold k, we can use FAVA to obtain
their frequent regions αk , βk , and it is of interest to compare them. To measure the
similarity of their frequent regions, we use the volumetric distance

D(αk , βk ) = 1 − v(αk ∩ βk ) / v(αk ∪ βk )    (1)
where v(·) is the volume of a given region. If two proteins have identical binding
preference, they should have more similar frequent regions and their volumetric
distance should be smaller than that of proteins with different binding preferences.
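Under the same voxel-set simplification used above, Eq. (1) reduces to a Jaccard-style distance; the function below is a hedged sketch rather than the exact VASP volume computation:

```python
def volumetric_distance(a, b):
    """Volumetric distance of Eq. (1): D = 1 - v(a ∩ b) / v(a ∪ b).
    Regions are modeled as sets of equal-sized voxels, so volume is
    proportional to set size (an illustrative assumption; VASP
    measures exact solid volumes)."""
    union = a | b
    if not union:
        return 0.0          # two empty regions are identical
    return 1.0 - len(a & b) / len(union)
```

Identical regions score 0, disjoint regions score 1, and intermediate overlaps fall in between, which is what makes the measure usable as input to UPGMA clustering.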
We tested FAVA on the selected non-redundant proteins of the serine protease and
enolase superfamilies. We constructed a UPGMA (Unweighted Pair Group Method
with Arithmetic mean [79]) clustering tree based on the measure D(αk , βk ) between
frequent regions, finding that almost all binding cavities were correctly separated by
their binding specificities [37].
Fig. 11 Generating and comparing cavity fields. (a, b) The molecular surfaces of proteins A and B
are shown as dotted lines. The binding cavities are shown in dark gray with solid lines. The
electrostatic isopotentials are shown in transparent light gray. (c, d) The cavity fields (dark gray
regions) lie inside both the binding cavities and the solid isopotentials. (e) The intersection I of the
two binding cavities. (f) To guarantee that cavity field comparison is not influenced by steric
differences, the intersections between the cavity fields and I are computed, named IA and IB; the
alignment between IA and IB is shown in dark gray. (g) IA and IB (dotted lines) with their
intersection (dark gray rectangle). (h, i) The differences between IA and IB. (j, k) The same
differences with respect to the molecular surfaces
is large, then it might accommodate the same ligand fragment. The difference
between two cavity fields is a region of solvent accessible electrostatic potential
that exists in one protein but not in another; an area where the two proteins may
not electrostatically stabilize similar ligands. VASP-E can thus be used to localize
electrostatic components that influence ligand binding specificity.
We tested VASP-E on cavity fields from serine proteases and cysteine proteases
and built UPGMA clustering trees by calculating the similarities with the volumetric
distance (Eq. 1). The cavity fields were generated at {−2.5, −5.0, −7.5, −10.0}
kT /e. The results showed that all of the clusterings correctly separated cavity fields
into groups by binding specificity, and that VASP-E is a useful tool for
characterizing binding cavities based on electrostatic elements.
VASP-E can also identify amino acids that have an electrostatic influence on
specificity. First, the electrostatic contribution of one amino acid at a time is
removed (nullified), and the cavity field is recomputed. Then the recomputed cavity
field is compared to a cavity field from a protein with different binding preferences.
If the two cavity fields are more similar than they were before nullification, then
the nullified amino acid was causing the cavity fields to differ, and that amino acid
is therefore an electrostatic influence on specificity. This approach correctly
identified electrostatically influential amino
acids in the serine proteases [13].
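The nullify-recompute-compare loop can be sketched as follows. Here `nullified_field` and `distance` are hypothetical callables standing in for the electrostatics solver and Eq. (1), and the residue names in the test are illustrative:

```python
def influential_residues(residues, cavity_field, other_field,
                         nullified_field, distance):
    """Hedged sketch of the VASP-E nullification test: remove one amino
    acid's electrostatic contribution at a time, recompute the cavity
    field, and flag the residue if the recomputed field moves closer to
    the field of a protein with different binding preferences.
    `nullified_field(res)` and `distance(f1, f2)` are hypothetical
    callables standing in for the electrostatics solver and Eq. (1)."""
    baseline = distance(cavity_field, other_field)
    influential = []
    for res in residues:
        recomputed = nullified_field(res)          # field with `res` nullified
        if distance(recomputed, other_field) < baseline:
            influential.append(res)                # res drove the difference
    return influential
```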
4 Discussion
Algorithms for specificity annotation represent a significant new direction for the
development of protein structure comparison. Earlier comparison algorithms were
focused on the function annotation problem, and they were developed to propose
functions for protein structures that are being solved without known function, such
as those generated by the international protein structure initiatives. Here, specificity
annotation algorithms are designed to propose binding preferences for proteins
with unknown binding preferences, and they can be applied to suggest binding
preferences for the many genetic variants that may be observed in the field.
Component localization adds additional capabilities. For example, as we discover
mutants that resist drug therapies, we are forced to reexamine the function of the
protein and the drug to discover how resistance occurs. Algorithms for component
localization can offer explanations as to which components of the protein structure
cause resistance, not only when those components are amino acids, but also when they
are specific regions inside ligand binding cavities or electrostatic fields. As methods
for component localization advance, they will be able to offer suggestions as to the
biophysical mechanism of resistance. These explanations are valuable because they
are testable hypotheses generated from biophysical measurements. Since the space
of mutations is combinatorial, the value of automated hypothesis generation in this
context is considerable.
The development of algorithms for specificity annotation presents novel and
interesting technical advancements in the field of protein structure comparison.
40 Z. Guo and B. Y. Chen
First, solid representations that contrast with the conventional point-based atomic
representations reveal new technical capabilities, such as the regionalization of
structural analysis. Second, solid representations are defined by a boundary surface,
so they can be used to compare elements of protein structure, such as cavity regions
or electrostatic isopotentials with high precision. Since they do not require the
bijective correspondences that are a prerequisite of point-based comparisons, they
can provide objective comparison scores of structures that involve different numbers
of atoms, such as a comparison of individual amino acids with the shape of a binding
cavity. We believe these capabilities, and others that are not yet discovered, create
significant opportunities for breakthrough algorithms in the space of specificity
annotation.
Specificity annotation is a field in its infancy with many areas for potential
advancement and specialization. The most open space for advancement lies in the
design of algorithms for component localization based on biophysical phenomena
that have not yet been exploited for specificity annotation, such as hydrophobicity or
hydrogen bonding. It is crucial to note that phenomenon-specific prediction methods
are not directed at the creation of energy functions. The goal of these predictors is
not to provide a single energy of interaction, but rather to predict whether or not
a difference in proteins can eliminate similarity in binding preferences because of
a particular phenomenon. The integration of methods that consider all phenomena
will be a comprehensive step towards general component localization.
A second direction for future development lies in the better use of manycore
architectures in protein structure comparison. Most structure comparison algorithms
to date have been developed for single threaded applications, and this approach is
increasingly archaic. Multithreaded software has considerable potential to enhance
the performance of protein structure comparisons, yielding faster and higher-resolution
comparisons. In the overall field, the main winner from such a transition
will be the enhanced sophistication of statistical models that will have much
expanded data upon which to train. More robust multipart validation will also be
possible.
Finally, a much needed and often overlooked aspect of the field is the enhancement
of user accessibility. Algorithms in computational structural biology are
especially technical, and for many disciplinary and practical reasons, user
interfaces are often overlooked. This challenge strongly limits impact in the field
of function annotation, which could otherwise better serve experimentalists who are not
highly trained computational researchers. As sophisticated user interfaces become
increasingly easy to develop because of their ubiquity in other fields, software for
function annotation must become more accessible as well.
Acknowledgments This work was funded in part by NIH Grant R01GM123131 to Brian Y. Chen.
References
1. Stark A., Sunyaev S., and Russell RB. A model for statistical significance of local similarities
in structure. J. Mol. Biol., 326:1307–1316, 2003.
2. Shivas R Amin, Serkan Erdin, R Matthew Ward, Rhonald C Lua, and Olivier Lichtarge.
Prediction and experimental validation of enzyme substrate specificity in protein structures.
Proceedings of the National Academy of Sciences, 110(45):E4195–E4202, 2013.
3. David Baker and Andrej Sali. Protein structure prediction and structural genomics. Science,
294(5540):93–96, 2001.
4. Jon Louis Bentley. Multidimensional binary search trees used for associative searching.
Communications of the ACM, 18(9):509–517, 1975.
5. G Biggiogero. La geometria del tetraedro. Enciclopedia delle Matematiche Elementari e
Complementi, 2(1):219–252, 1950.
6. T.A. Binkowski, P. Freeman, and J. Liang. pvSOAR: Detecting similar surface patterns of
pocket and void surfaces of amino acid residues on proteins. Nucl. Acid. Res., 32:W555–8,
2004.
7. Drew H Bryant, Mark Moll, Brian Y Chen, Viacheslav Y Fofanov, and Lydia E Kavraki.
Analysis of substructural variation in families of enzymatic proteins with applications to
protein function prediction. BMC bioinformatics, 11(1):242, 2010.
8. Drew H Bryant, Mark Moll, Paul W Finn, and Lydia E Kavraki. Combinatorial clustering
of residue position subsets predicts inhibitor affinity across the human kinome. PLoS
computational biology, 9(6):e1003087, 2013.
9. Chen B.Y., Bryant D.H, Fofanov V.Y., Kristensen D.M., Cruess A.E., Kimmel M., Lichtarge
O., and Kavraki L.E. Cavity-aware motifs reduce false positives in protein function prediction.
Proceedings of the 2006 IEEE Computational Systems Bioinformatics Conference (CSB 2006),
accepted, August 2006.
10. Chen B.Y., Fofanov V.Y., Kristensen D.M., Kimmel M., Lichtarge O., and Kavraki L.E.
Algorithms for structural comparison and statistical analysis of 3D protein motifs. Proceedings
of Pacific Symposium on Biocomputing 2005, pages 334–45, 2005.
11. John A Capra and Mona Singh. Characterization and prediction of residues determining protein
functional specificity. Bioinformatics, 24(13):1473–1480, 2008.
12. Saikat Chakrabarti, Stephen H Bryant, and Anna R Panchenko. Functional specificity lies
within the properties and evolutionary changes of amino acids. Journal of molecular biology,
373(3):801–810, 2007.
13. Brian Y Chen. VASP-E: specificity annotation with a volumetric analysis of electrostatic
isopotentials. PLoS computational biology, 10(8):e1003792, 2014.
14. Brian Y Chen and Soutir Bandyopadhyay. A statistical model of overlapping volume in
ligand binding cavities. In Bioinformatics and Biomedicine Workshops (BIBMW), 2011 IEEE
International Conference on, pages 424–431. IEEE, 2011.
15. Brian Y Chen and Soutir Bandyopadhyay. VASP-S: a volumetric analysis and statistical model
for predicting steric influences on protein-ligand binding specificity. In Bioinformatics and
Biomedicine (BIBM), 2011 IEEE International Conference on, pages 22–29. IEEE, 2011.
16. Brian Y Chen and Soutir Bandyopadhyay. Modeling regionalized volumetric differences in
protein-ligand binding cavities. Proteome science, 10(Suppl 1):S6, 2012.
17. Brian Y Chen and Soutir Bandyopadhyay. A regionalizable statistical model of intersecting
regions in protein–ligand binding cavities. Journal of bioinformatics and computational
biology, 10(03), 2012.
18. Brian Y Chen, Drew H Bryant, Amanda E Cruess, Joseph H Bylund, Viacheslav Y Fofanov,
Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Composite motifs integrating multiple
protein structures increase sensitivity for function prediction. In Comput Syst Bioinformatics
Conf, volume 6, pages 343–355, 2007.
19. Brian Y Chen, Drew H Bryant, Viacheslav Y Fofanov, David M Kristensen, Amanda E Cruess,
Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Cavity-aware motifs reduce false
positives in protein function prediction. In Proceedings of the 2006 IEEE Computational
Systems Bioinformatics Conference (CSB 2006), pages 311–23, 2006.
20. Brian Y Chen, Drew H Bryant, Viacheslav Y Fofanov, David M Kristensen, Amanda E
Cruess, Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Cavity scaling: automated
refinement of cavity-aware motifs in protein function prediction. Journal of bioinformatics
and computational biology, 5(02a):353–382, 2007.
21. Brian Y Chen, Viacheslav Y Fofanov, Drew H Bryant, Bradley D Dodson, David M Kristensen,
Andreas M Lisewski, Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Geometric
sieving: Automated distributed optimization of 3d motifs for protein function prediction. In
Research in Computational Molecular Biology, pages 500–515. Springer, 2006.
22. Brian Y Chen, Viacheslav Y Fofanov, Drew H Bryant, Bradley D Dodson, David M Kristensen,
Andreas M Lisewski, Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. The mash
pipeline for protein function prediction and an algorithm for the geometric refinement of 3d
motifs. Journal of Computational Biology, 14(6):791–816, 2007.
23. Brian Y Chen and Barry Honig. Vasp: a volumetric analysis of surface properties yields
insights into protein-ligand binding specificity. PLoS computational biology, 6(8):e1000881,
2010.
24. Ruobing Chen, Katya Scheinberg, and Brian Y Chen. Aligning ligand binding cavities by
optimizing superposed volume. In Bioinformatics and Biomedicine (BIBM), 2012 IEEE
International Conference on, pages 1–5. IEEE, 2012.
25. Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free
optimization, volume 8. Siam, 2009.
26. Michael L Connolly. The molecular surface package. Journal of molecular graphics,
11(2):139–141, 1993.
27. Evangelos A Coutsias, Chaok Seok, and Ken A Dill. Using quaternions to calculate RMSD.
Journal of computational chemistry, 25(15):1849–1857, 2004.
28. Porter C.T., Bartlett G.J., and Thornton J.M. The catalytic site atlas: a resource of catalytic sites
and residues identified in enzymes using structural data. Nucleic Acids Research, 32:D129–
D133, 2004.
29. Kristensen D.M., Chen B.Y., Fofanov V.Y., Ward R.M., Lisewski A.M., Kimmel M., Kavraki
L.E., and Lichtarge O. Recurrent use of evolutionary importance for functional annotation of
proteins based on local structural similarity. Protein Science, in press, 2006.
30. Edelsbrunner H., Facello M., and Liang J. On the definition and the construction of pockets in
macromolecules. Discrete Applied Mathematics, 88:83–102, 1998.
31. Ferré F., Ausiello G, Zanzoni A, and Helmer-Citterich M. Surface: a database of protein surface
regions for functional annotation. Nucl. Acid. Res., 32:D240–4, 2004.
32. Viacheslav Y Fofanov, Brian Y Chen, Drew H Bryant, Mark Moll, Olivier Lichtarge, Lydia
Kavraki, and Marek Kimmel. A statistical model to correct systematic bias introduced by
algorithmic thresholds in protein structural comparison algorithms. In Bioinformatics and
Biomedicine Workshops, 2008. BIBMW 2008. IEEE International Conference on, pages 1–8.
IEEE, 2008.
33. Verbitsky G., Nussinov R., and Wolfson H.J. Structural comparison allowing hinge bending.
Prot: Struct. Funct. Genet., 34(2):232–254, 1999.
34. Benjamin Georgi, Jörg Schultz, and Alexander Schliep. Context-specific independence mixture
modelling for protein families. In Knowledge Discovery in Databases: PKDD 2007, pages 79–
90. Springer, 2007.
35. Jean-Francois Gibrat, Thomas Madej, and Stephen H Bryant. Surprising similarities in
structure comparison. Current opinion in structural biology, 6(3):377–385, 1996.
36. Brian G Godshall and Brian Y Chen. Improving accuracy in binding site comparison with
homology modeling. In Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE
International Conference on, pages 662–669. IEEE, 2012.
37. Ziyi Guo, Trevor Kuhlengel, Steven Stinson, Seth Blumenthal, Brian Y Chen, and Soutir
Bandyopadhyay. A flexible volumetric comparison of protein cavities can reveal patterns
in ligand binding specificity. In Proceedings of the 5th ACM Conference on Bioinformatics,
Computational Biology, and Health Informatics, pages 445–454. ACM, 2014.
38. Edelsbrunner H. and Mucke E.P. Three-dimensional alpha shapes. ACM Trans. Graphics,
13:43–72, 1994.
39. Liisa Holm and Chris Sander. Protein structure comparison by alignment of distance matrices.
Journal of molecular biology, 233(1):123–138, 1993.
40. Liisa Holm and Chris Sander. Mapping the protein universe. Science, 273(5275):595–602,
1996.
41. Barker J.A. and Thornton J.M. An algorithm for constraint-based structural template matching:
application to 3D templates with statistical analysis. Bioinf., 19(13):1644–1649, 2003.
42. Tao Ju, Frank Losasso, Scott Schaefer, and Joe Warren. Dual contouring of hermite data. In
ACM Transactions on Graphics (TOG), volume 21, pages 339–346. ACM, 2002.
43. Kinoshita K. and Nakamura H. Identification of protein biochemical functions by similarity
search using the molecular surface database ef-site. Protein Science, 12:1589–1595, 2003.
44. Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystal-
lographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography,
32(5):922–923, 1976.
45. Olga V Kalinina, Pavel S Novichkov, Andrey A Mironov, Mikhail S Gelfand, and Aleksandra B
Rakhmaninova. Sdppred: a tool for prediction of amino acid residues that determine differences
in functional specificity of homologous proteins. Nucleic acids research, 32(suppl 2):W424–
W428, 2004.
46. Patrice Koehl and Michael Levitt. A brighter future for protein structure prediction. Nature
Structural Biology, 6:108–111, 1999.
47. David M Kristensen, R Matthew Ward, Andreas M Lisewski, Serkan Erdin, Brian Y Chen,
Viacheslav Y Fofanov, Marek Kimmel, Lydia E Kavraki, and Olivier Lichtarge. Prediction
of enzyme function based on 3d templates of evolutionarily important amino acids. BMC
bioinformatics, 9(1):17, 2008.
48. Yehezkel Lamdan and Haim J Wolfson. Geometric hashing: A general and efficient model-
based recognition scheme. In ICCV, volume 88, pages 238–249, 1988.
49. Byungkook Lee and Frederic M Richards. The interpretation of protein structures: estimation
of static accessibility. Journal of molecular biology, 55(3):379–IN4, 1971.
50. William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface
construction algorithm. In ACM Siggraph Computer Graphics, volume 21, pages 163–169.
ACM, 1987.
51. Rosen M., Lin S.L., Wolfson H., and Nussinov R. Molecular shape comparisons in searches
for active sites and functional similarity. Prot. Eng., 11(4):263–277, 1998.
52. Shatsky M., Shulman-Peleg A., Nussinov R., and Wolfson H.J. Recognition of binding patterns
common to a set of protein structures. Proceedings of RECOMB 2005, pages 440–55, 2005.
53. Shatsky M., Shulman-Peleg A., Nussinov R., and Wolfson H.J. The multiple common point set
problem and its application to molecule binding pattern detection. J. Comp. Biol., 13(2):407–
28, 2006.
54. Shatsky M., Nussinov R., and Wolfson H.J. A method for simultaneous alignment of multiple
protein structures. Proteins, 56(1):143–56, 2004.
55. Srinivasan Madabushi, Hui Yao, Mike Marsh, David M Kristensen, Anne Philippi, Mathew E
Sowa, and Olivier Lichtarge. Structural clusters of evolutionary trace residues are statistically
significant and common in proteins. Journal of molecular biology, 316(1):139–154, 2002.
56. Michael D McKay, Richard J Beckman, and William J Conover. Comparison of three
methods for selecting values of input variables in the analysis of output from a computer code.
Technometrics, 21(2):239–245, 1979.
57. Mark Moll, Drew H Bryant, and Lydia E Kavraki. The labelhash algorithm for substructure
matching. BMC bioinformatics, 11(1):555, 2010.
58. Richard J Morris, Rafael J Najmanovich, Abdullah Kahraman, and Janet M Thornton. Real
spherical harmonic expansion coefficients as 3d shape descriptors for protein binding pocket
and ligand comparisons. Bioinformatics, 21(10):2347–2355, 2005.
59. Anthony Nicholls, Kim A Sharp, and Barry Honig. Protein folding and association: insights
from the interfacial and thermodynamic properties of hydrocarbons. Proteins: Structure,
Function, and Bioinformatics, 11(4):281–296, 1991.
60. Bachar O., Fischer D., Nussinov R., and Wolfson H. A computer vision based technique for
3-d sequence independent structural comparison of proteins. Prot. Eng., 6(3):279–288, 1993.
61. Lichtarge O. and Sowa M.E. Evolutionary predictions of binding surfaces and interactions.
Curr. Opin. Struct. Biol., 12(1):21–27, 2002.
62. Jimin Pei, Wei Cai, Lisa N Kinch, and Nick V Grishin. Prediction of functional specificity
determinants from protein sequences using log-likelihood ratios. Bioinformatics, 22(2):164–
171, 2006.
63. Osnat Penn, Adi Stern, Nimrod D Rubinstein, Julien Dutheil, Eran Bacharach, Nicolas Galtier,
and Tal Pupko. Evolutionary modeling of rate shifts reveals specificity determinants in hiv-1
subtypes. PLoS computational biology, 4(11):e1000214, 2008.
64. Donald Petrey and Barry Honig. Grasp2: visualization, surface properties, and electrostatics
of macromolecular structures and sequences. Methods in enzymology, 374:492–509, 2002.
65. Walter Pirovano, K Anton Feenstra, and Jaap Heringa. Sequence comparison by sequence
harmony identifies subtype-specific functional sites. Nucleic acids research, 34(22):6540–
6548, 2006.
66. Benjamin J Polacco and Patricia C Babbitt. Automated discovery of 3d motifs for protein
function annotation. Bioinformatics, 22(6):723–730, 2006.
67. Norel R., Fischer D., Wolfson H.J., and Nussinov R. Molecular surface recognition by a
computer vision-based technique. Prot. Eng., 7:39–46, 1994.
68. Norel R., Petrey D., Wolfson H.J., and Nussinov R. Examination of shape complementarity in
docking of unbound proteins. Prot: Struct. Funct. Genet., 36:307–317, 1999.
69. Laskowski R.A., Watson J.D., and Thornton J.M. Protein function prediction using local 3D
templates. Journal of Molecular Biology, 351:614–626, 2005.
70. Boris Reva, Yevgeniy Antipin, and Chris Sander. Determinants of protein function revealed by
combinatorial entropy optimization. Genome Biol, 8(11):R232, 2007.
71. Frederick M Richards. Areas, volumes, packing, and protein structure. Annu. Rev. Biophys.
Bioeng., 6:151–176, 1977.
72. Walter Rocchia, Emil Alexov, and Barry Honig. Extending the applicability of the nonlinear
Poisson-Boltzmann equation: Multiple dielectric constants and multivalent ions. The Journal
of Physical Chemistry B, 105(28):6507–6514, 2001.
73. Robert B Russell. Detection of protein three-dimensional side-chain patterns: new examples
of convergent evolution. Journal of molecular biology, 279(5):1211–1227, 1998.
74. Saeed Salem, Mohammed J Zaki, and Chris Bystroff. Flexsnap: Flexible non-sequential protein
structure alignment. Algorithms for Molecular Biology, 5(1):12, 2010.
75. Michel F Sanner, Arthur J Olson, and Jean-Claude Spehner. Reduced surface: an efficient way
to compute molecular surfaces. Biopolymers, 38(3):305–320, 1996.
76. J Schaer and MG Stone. Face traverses and a volume algorithm for polyhedra. In New Results
and New Trends in Computer Science, pages 290–297. Springer, 1991.
77. Gideon Schreiber and Alan R Fersht. Rapid, electrostatically assisted association of proteins.
Nature Structural & Molecular Biology, 3(5):427–431, 1996.
78. Ilya N Shindyalov and Philip E Bourne. Protein structure alignment by incremental
combinatorial extension (ce) of the optimal path. Protein engineering, 11(9):739–747, 1998.
79. Peter HA Sneath and Robert R Sokal. Numerical taxonomy. Nature, 193(4818):855–860,
1962.
80. Binkowski T.A., Joachimiak A., and Liang J. Protein surface analysis for function annotation
in high-throughput structural genomics pipeline. Protein Science, 14:2972–2981, 2005.
81. Binkowski T.A., Adamian L., and Liang J. Inferring functional relationships of proteins from
local sequence and spatial surface patterns. J. Mol. Biol., 332:505–526, 2003.
1 Introduction
Fig. 1 Conformational changes of Adenylate kinase, going from its open conformation (4AKE)
to its closed conformation (1AKE). PDB IDs for conformation (a) 4AKE, (b) 1DVR, (c) 1ANK,
and (d) 1AKE
full range of conformations that each of the SARS-CoV-2 proteins can access in
order to understand how they replicate, infect cells, and avoid the host’s immune
system. Exploring the conformational space of SARS-CoV-2 proteins also provides
information concerning the vulnerability of its proteins that will help in drug and
vaccine design [53].
For the reasons mentioned above, researchers in the fields of computational,
structural, and experimental biology have made efforts to study protein
conformational pathways and to analyze the conformational space of proteins and
complexes. However, as these changes happen in microseconds and many of these
intermediate conformations are transient, it has been a challenge to capture them
experimentally or computationally. During the past decades, various experimental
and computational methods have been proposed to learn about these conformations.
We will review most of the highly used strategies with an emphasis on the new
machine learning-based approaches.
to capture low-frequency motions with large amplitudes and rapid motions that are
less likely to happen [37]. Consequently, in order to study protein motions in more
natural environments, alternate methods like X-ray solution scattering and nuclear
magnetic resonance (NMR) have been introduced.
Another approach to study protein dynamics is to observe the time evolution
of X-ray scattering by liquid solutions. X-ray scattering techniques are used as
analytical tools which reveal information about the crystal structure and physical
properties of proteins. In these techniques, X-rays are scattered by the electrons
of the atomic shells, which are set oscillating. The scattered
intensity of the X-rays hitting a sample is then observed as a function of incident and
scattered angle, polarization, and wavelength or energy. This method is promising,
especially to investigate biological motions on systems that are either not easy to
crystallize or undergo large-scale conformational changes that cannot take place
within a crystalline environment [29]. Small-angle X-ray scattering [6] provides
information about the overall size and shape of the protein while wide-angle X-ray
scattering [21] gives more information regarding the fold of helices and sheets [9].
The combination of X-ray solution scattering with crystallography has also been
utilized to enhance the accuracy and efficacy of protein structure determination
[30, 33].
One of the most efficient conformational searching methods is Monte Carlo (MC)
search. Each Monte Carlo search step has three main components: first, one
or more dihedral angles (or other movable degrees of freedom) of the starting
structure are chosen for manipulation. Second, each is rotated by a random
amount so as to generate a new conformation. This is called a “step”. Third, the
molecule gets energetically minimized. The new step is accepted or rejected based
on an optimization criterion. If the new conformation has lower energy than its
predecessor, it will be accepted. Otherwise, it will be accepted with a probability
that decreases exponentially with the energy difference (the Metropolis criterion). This
allows the search to accept some higher energy conformations, helping it avoid being
stuck in a local minimum. The
search ends when a low energy conformation is reached or after a certain number of
iterations. By repeating this search step and sorting and comparing all the different
generated conformers, it is possible to identify the different low-energy conformers
of the molecule. The entire process is called Monte Carlo Multiple Minimum
(MCMM) search. This procedure ends when the newly generated conformations are
the same as the ones already found [10]. Sometimes it is not possible to rotate all
the bonds. For instance, for cyclic molecules, we can’t rotate all the bonds without
cutting the ring. For these kinds of molecules, we can open the ring, rotate the
bonds, and then try to close the ring. If the ends are not close enough to each
other, we can revert the last step and choose another part randomly to break the
ring. Then the molecule gets minimized, and the whole process is repeated. One of
the problems of MC is the difficulty of finding all the low-energy conformations for
large molecules, because the number of possible combinations of bond rotations grows
exponentially with the number of rotatable bonds in the molecule.
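The accept/reject loop described above can be sketched as a minimal Metropolis search. Here `energy` and `perturb` are placeholders for a force-field evaluation and a random dihedral move, and the quadratic toy landscape in the usage is purely illustrative:

```python
import math
import random

def metropolis_mc(energy, perturb, x0, temperature=1.0, steps=5000, rng=None):
    """Minimal sketch of a Metropolis Monte Carlo search, the acceptance
    rule underlying MCMM-style conformational searches.  `energy(x)` and
    `perturb(x, rng)` are assumed placeholders for a force field and a
    random dihedral move.  Returns the lowest-energy state encountered."""
    rng = rng or random.Random(0)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        x_new = perturb(x, rng)                  # one random "step"
        e_new = energy(x_new)
        # always accept downhill moves; accept uphill moves with
        # Boltzmann probability exp(-dE / T) to escape local minima
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

On a toy 1-D landscape such as `energy = lambda x: (x - 3.0) ** 2` with moves `lambda x, r: x + r.uniform(-0.5, 0.5)`, the search drifts toward the minimum at x = 3 while occasionally accepting uphill moves.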
Sometimes our chance of finding a low energy conformation after changing one
or more bonds is very low. In this case we need MC to sample from a small low
energy region, A. However, a simple MC sampling may not sample from region A
even after many iterations. We can make MC do this by sampling from a distribution
that overweights region A. This weighting technique is called importance sampling.
It is a way to make Monte Carlo simulations converge much faster [40].
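A toy illustration of importance sampling, assuming a uniform target density and a proposal that overweights a small region A (all densities here are illustrative stand-ins for Boltzmann-weighted conformational distributions):

```python
import random

def importance_estimate(f, p_pdf, q_pdf, q_sample, n=10000, rng=None):
    """Importance sampling sketch: estimate E_p[f(X)] by drawing from a
    proposal q that overweights the region of interest, then correcting
    each sample with the weight p(x)/q(x)."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)   # reweighted contribution
    return total / n
```

For example, the probability that a Uniform(0, 1) variable lands in A = [0, 0.1) can be estimated efficiently by sampling from Uniform(0, 0.2), whose density 5.0 concentrates samples near A; the weight 1/5 corrects the bias.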
Another enhancement of the MC method called acceptance-rejection sampling
is very similar to importance sampling. In this method, instead of over-weighting a
region, all the high energy (or impossible) conformations are placed in a rejection
set, and the algorithm automatically rejects the new conformation if it belongs to
the rejection set [16].
MD (see Sect. 3.1) and MC search methods are the two most commonly utilized
techniques in physics-based computational methods. Both methods generally have
similar outlines, including consideration of molecules as collections of atoms,
employment of some force fields for the potential energy terms, and implementation
of periodic boundary conditions. The major differences are in how they sample
the available configuration space. MD needs 1.6–3.8 times more computation time
and it is 2–3 times less efficient than MC [24], but it produces a more physically
plausible path. Another difference between MD simulations and MC methods is
52 F. Afrasiabi et al.
that, unlike MC, MD simulations update the velocities of the particles as well, rather
than depending on just their positions.
The next class of methods are inspired by the robotic motion planning problem,
where the goal is to find a sequence of valid states (configurations) that moves the
object from the source to the destination in a physically constrained environment.
The idea is to consider proteins as robot-like objects. The conformational space
of a protein is modeled as the geometric space in which a robot operates (also
called the workspace). The physical constraints are captured in the energy function.
For more than a decade, researchers have applied robotics-inspired methods
to simulate protein folding, protein conformational changes, docking, binding,
and more. Motion planning algorithms build a graph (or a “roadmap”) that
approximates the search space. Then a search is conducted in the graph to find a
path from the start to the goal [26].
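The build-then-search pattern can be sketched in a few lines. The example below works in an invented 2D "conformation" space with a placeholder energy function standing in for the real physical constraints; it samples low-energy nodes, connects nearby ones into a roadmap, and searches the graph for a path.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

def energy(c):
    # Placeholder for a molecular energy function on a 2D "conformation".
    return float(np.sum((c - 0.5) ** 2))

# 1) Sample nodes, keeping only low energy "conformations"; the start and
#    goal states are always included as nodes 0 and 1.
start, goal = np.array([0.05, 0.05]), np.array([0.95, 0.95])
nodes = [start, goal] + [c for c in rng.random((400, 2)) if energy(c) < 0.5]

# 2) Connect nearby nodes to form the roadmap graph.
radius = 0.2
adj = {i: [] for i in range(len(nodes))}
for i in range(len(nodes)):
    for j in range(i + 1, len(nodes)):
        if np.linalg.norm(nodes[i] - nodes[j]) < radius:
            adj[i].append(j)
            adj[j].append(i)

# 3) Search the graph (BFS) for a path from start (node 0) to goal (node 1).
def bfs(adj, src, dst):
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

path = bfs(adj, 0, 1)
print(path is not None)
```

Real planners use distance metrics and validity checks appropriate to the conformational space; the energy threshold and connection radius here are arbitrary.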
Fig. 2 An illustration of the RRT* algorithm. (1) A new node is added to the tree and its
nearest neighbors (based on a predefined distance metric) are highlighted. (2) The best (least-cost)
neighbor is chosen as the new node’s parent. (3) For each nearest neighbor, the algorithm checks
whether its path cost is less through the new node rather than its current parent. If so, its parent is
changed to the new node
angles of the last found conformation towards the goal conformation. If this new
conformation has a valid energy value, it is accepted and gets connected to the
closest conformation in the tree. The strength of the RRT algorithm is that it can
rapidly explore the search space. However, because its node selection is random
and it expands into all areas of the workspace, the algorithm converges slowly in a
protein’s high-dimensional conformational landscape and does not always produce
smooth, optimized paths. An optimized version of the RRT algorithm, called RRT*,
was introduced by Karaman et al. [25] to enhance the smoothness of the resulting
paths by steering the system towards the goal. It also uses a rewiring procedure
that attempts to find more direct paths rather than creating ragged ones. An
illustration of the RRT* algorithm is shown in
Fig. 2. In (1), a new node is added to the tree with its nearest neighbors shown in a
circle. In (2), the least-cost parent for the newly added node is chosen among the
nearest neighbors. Next, the cost of rewiring is checked for all the neighbors,
and the tree gets rewired if the newly added node would be a better parent for one
of its neighbors (3).
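The choose-parent and rewire steps can be sketched as follows. This toy version works in a 2D unit square with no energy test, and, as a simplification, it does not propagate updated costs to a rewired node's descendants; all constants are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy RRT* skeleton illustrating the choose-parent and rewire steps of Fig. 2.
step, radius = 0.08, 0.15
pts = [np.zeros(2)]            # tree nodes; node 0 is the root
parent = {0: None}
cost = {0: 0.0}

def dist(a, b):
    return float(np.linalg.norm(a - b))

for _ in range(400):
    x_rand = rng.random(2)                       # sample the space
    near = min(range(len(pts)), key=lambda i: dist(pts[i], x_rand))
    d = dist(pts[near], x_rand)
    if d == 0.0:
        continue
    x_new = pts[near] + (x_rand - pts[near]) * min(1.0, step / d)   # steer
    # (1) neighbors of the new node within the rewiring radius
    nbrs = [i for i in range(len(pts)) if dist(pts[i], x_new) < radius]
    # (2) choose the least-cost parent among the nearest neighbors
    best = min(nbrs, key=lambda i: cost[i] + dist(pts[i], x_new))
    new = len(pts)
    pts.append(x_new)
    parent[new] = best
    cost[new] = cost[best] + dist(pts[best], x_new)
    # (3) rewire: make the new node the parent wherever that shortens a path
    for i in nbrs:
        through_new = cost[new] + dist(pts[i], x_new)
        if through_new < cost[i]:
            parent[i] = new
            cost[i] = through_new

print(len(pts))
```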
In [4], the RRT* method is used with an A*-like heuristic function and a rewiring
operation to produce smooth, more realistic paths that converge faster. The authors
of [14] evaluated their robotics-based algorithm, RRTMC [3], using Topological
Data Analysis (TDA) methods. They sampled the intermediate conformations that
RRTMC generates most frequently and showed that these conformations are close
to known intermediate conformations. They also demonstrated that TDA can be
useful in assessing computational methods that sample protein conformations.
Machine Learning (ML) is a method of data analysis that can learn and improve
automatically from experience (training data) without being explicitly programmed.
It’s a branch of Artificial Intelligence (AI) which concentrates on the use of data
and methods to simulate how humans learn. ML methods are being used for
different purposes, from stock market trading [39] to social media analyses [19].
Computational and structural biologists have also made many efforts to apply
ML methods to investigate protein structures and conformations. Protein structures
can theoretically be predicted by MD simulations as well, but MD simulations
are computationally expensive. Humans cannot interpret enormous sets of high-
dimensional data by simple visual analysis. Contrarily, ML methods are exactly
designed to process extensive data sets [17]. For example, Dehghanpoor et al. [15]
used ML methods on a very large dataset to find the effects of mutations on protein
stability. Algorithms have been designed that can extract complex arrangements
and highly nontrivial relationships from these large datasets and summarize this
information when assessing new data. A protein conformation is complex
and high-dimensional, with 3N spatial degrees of freedom (DOF) for a protein
with N atoms, but only a subset of this space is physically possible, because
physical constraints render a very large part of the conformations inaccessible.
Therefore, the protein configuration can be approximated as a manifold of lower
dimensionality, which is difficult to find and interpret. Hence, it is vital to use
dimensionality reduction techniques that reduce the complexity of our datasets
while maintaining the valuable information (important features of this manifold)
in the data. Unsupervised methods can help obtain metastable states from high-
dimensional simulation data and connect them to measurable observables [32]. We
need sufficient and reliable data to train these machine learning methods in order
to extract more and better predictions instead of running more simulations. The
problem arises when we have a data shortage for rare events, i.e., the ones for which
enhanced sampling is needed. As a solution, we should iterate between machine
learning and sampling methods [5]. The reduced dimensional space consists of
fewer parameters, called the reaction coordinates or collective variables (CV), while
retaining as much of the variance in the data as possible. Machine learning and
relevant techniques are increasingly being applied to provide a less heuristic way
for finding CVs. We should choose a good collection of CVs to answer a particular
question. For example, if the target is to improve the sampling of transitions from
one state to another, we should find CVs that represent the slowest motion of the
process. To answer biophysics-related questions, by contrast, a suitable set of CVs
should track molecular-level differences between states. Approaches for identifying
CVs include:
1. Choosing the few first principal components from the principal component
analysis (PCA) [43].
2. Selecting the linear combinations of a set of descriptors chosen by Harmonic
mean Linear Discriminant Analysis (HLDA) [7].
3. Implementing time-lagged independent component analysis (tICA); this technique
can create slow CVs by linearly combining potentially simple CVs (e.g.,
dihedral angles) such that the auto-correlations of the output data are maximized
[35].
4. Using autoencoders; refer to Sect. 5.2 for more details.
5. Using spectral gap optimization of order parameters (SGOOP) [46]; this
approach estimates the best combination of reduced dimensional CVs according
to the maximum separation of timescales between slow (visible) and fast (hidden)
processes, i.e., a path entropy estimate of the spectral gap.
6. Selecting the essential internal coordinates chosen by a decision tree (XGBoost
algorithm) trained with MD simulations and metastable states [8].
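As a minimal sketch of strategy 1 (PCA-based CVs), assuming synthetic feature data in place of real dihedral-angle time series, the first few principal components can be obtained from an SVD of the centered data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented stand-in for simulation features (e.g., dihedral angles): 1000
# frames of 10 features whose variance is concentrated in 2 directions.
frames = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 10)) \
         + 0.05 * rng.normal(size=(1000, 10))

X = frames - frames.mean(axis=0)                 # center the data
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)                  # variance fraction per PC

n_cv = 2
cvs = X @ Vt[:n_cv].T                            # the first two PCs as CVs
print(cvs.shape, float(explained[:n_cv].sum()))
```

Here nearly all the variance falls on the first two components, so two CVs retain most of the information, which is the situation the dimensionality-reduction argument above assumes.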
5.2 Autoencoders
Fig. 3 The illustration of an autoencoder with nine layers, including input, latent, output, and four
intermediate layers
close to some proteins from the training set, but they are not guaranteed to be
physically feasible structures. To address this problem, they trained a Random
Forest (RF) classifier using valid structures from the MD simulations and invalid
ones created by adding noise to the valid structures. This RF model takes a protein
structure and decides whether it is physically plausible.
Ramaswamy et al. [36] used an extra loss function during training instead
of applying an RF after training. Their second loss function equals the force field
energy of the generated protein structure, so their autoencoder tries to generate
a plausible structure close to the input structure. They trained the autoencoder on
many MD simulations around the start and end proteins of a protein conformational
change trajectory, and found a line from the cluster of conformations around
the start structure to the end structures. The decoder uses points at a close distance
to this line to generate intermediate structures along the conformational trajectory.
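The interpolate-in-latent-space idea can be sketched with a deliberately simple linear autoencoder. The published models are deep networks with a force-field loss; the data, dimensions, and training constants below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented stand-in data: 500 "conformations" with 12 features lying on a
# 2-dimensional subspace, so a 2-unit latent layer can reconstruct them.
d, k = 12, 2
data = rng.normal(size=(500, k)) @ rng.normal(size=(k, d))

W_enc = 0.1 * rng.normal(size=(d, k))   # encoder weights
W_dec = 0.1 * rng.normal(size=(k, d))   # decoder weights
lr = 5e-3
for _ in range(4000):
    z = data @ W_enc                    # encode
    recon = z @ W_dec                   # decode
    err = recon - data
    # gradient steps on the mean squared reconstruction loss
    W_dec -= lr * z.T @ err / len(data)
    W_enc -= lr * data.T @ (err @ W_dec.T) / len(data)

# decode points interpolated between the latent codes of two conformations
z0, z1 = data[0] @ W_enc, data[1] @ W_enc
latent_path = np.array([(1 - t) * z0 + t * z1 for t in np.linspace(0, 1, 5)])
intermediates = latent_path @ W_dec
print(intermediates.shape)
```

Each decoded point along the latent line is a candidate intermediate structure; in the real methods such candidates are additionally checked or penalized for physical plausibility.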
A novel approach was applied by Jin et al. [23]. In their method, instead of letting
the network decide the latent space, the user chooses the latent space to guide the
search. To this end, they used two loss functions for training the network: one
to compare the original and the reconstructed data, and a second to force the latent
space to be as close as possible to the user-defined space.
For each input data, they computed the desired latent space (for example, the first
two principal components of the 2D matrix of the pairwise RMSD between any pair
of input proteins) and fed it to the second loss function. This new approach aims
to examine whether we can use a very simple and understandable latent space to
predict new and physically valid protein structures.
Lemke et al. [28] also tried to make the latent space meaningful in another
way. They used a nonlinear distance metric cost function, named sketch-map, as
the second loss function for their autoencoder. The sketch-map cost function is
based on the pair-wise distances between data points in the high dimensional input
Machine Learning-Based Approaches for Protein Conformational Exploration 57
space and the low dimensional latent space. Sketch-map makes the latent space
more informative and prevents it from shifting toward large numbers. The low
dimensional representations of similar input data are collected in clusters, and the
latent space correctly captures major proximity relations between these clusters.
Wehmeyer et al. [50] explored a distinct type of autoencoder, the time-lagged
autoencoder (TAE), whose loss function is designed to predict the next frame of an
MD simulation instead of reconstructing the current frame of the input structure.
It thereby performs dimensionality reduction, generates the next frames of the MD
simulation at the same time, and captures the gradual changes of the underlying
stochastic processes.
Researchers have presented and utilized several methods, many of which are
mentioned in this chapter, to explore the conformational space of proteins. As a
consequence, there exist huge amounts of data on protein dynamics simulations
that need to be examined and evaluated. For example, the Folding@home project
(https://foldingathome.org/) ran MD simulations of more than a dozen SARS-CoV-2
proteins, including the spike protein [53], resulting in 36 datasets that revealed
potential drug design opportunities. In this section we review strategies that can
serve as helpful tools for analyzing the generated simulations and conformations.
In the last few years, clustering techniques and topology analysis methods have
received considerable attention as ways of studying molecular kinetics from
simulation data. In the free energy landscape of a protein, the local minima where
metastable conformations reside are of high interest. This landscape is often rugged
and contains a large number of local minima, making it hard to navigate.
Additionally, because a protein contains many atoms and its landscape has many
degrees of freedom, enumerating all its possible conformations is impractical.
Methods that attempt to sample this space have been presented earlier in this
chapter. Yet even the sampled space needs to be filtered and
clustered to provide meaningful information. Many clustering methods have been
designed to cluster the protein simulation data so that these clusters reflect the local
minima upon which the kinetics of the protein can be summarized in a model or a
graph. An example of such strategies is to build a disconnectivity graph for the local
minima and a Markov state model (MSM) for the metastable states. Chang et al.
[11] proposed a data exploration method, called Multi-Persistent Clustering (MPC),
to solve the model selection problem of MSMs by extending the topology analysis
Incorporating a priori knowledge about proteins can guide search algorithms
toward physically feasible conformations and hence reduce the time spent exploring
unproductive regions of the space. Such knowledge can be, for instance, evolutionary
information about the protein derived from its sequence or structure, or an
analysis of rigid residues that helps predict which groups of atoms are likely to
remain rigid when the protein undergoes conformational changes. This information
can be obtained from empirical and experimental methods or from previous
simulations. In [3], rigidity analysis information and a Monte Carlo criterion are
used to reduce the sampling search space and guide the algorithm toward
energetically feasible conformations by applying perturbations near flexible residues.
7 Conclusions
References
25. Karaman, S., Walter, M.R., Perez, A., Frazzoli, E., Teller, S.: Anytime motion planning using
the RRT* (2011).
26. Kavraki, L., Svestka, P., Latombe, J.C., Overmars, M.: Probabilistic roadmaps for path
planning in high-dimensional configuration spaces (1996).
27. Lavalle, S.M.: Rapidly-exploring random trees: A new tool for path planning. Tech. rep. (1998)
28. Lemke, T., Peter, C.: Encodermap: Dimensionality reduction and generation of molecule
conformations (2019).
29. Levantino, M., Yorke, B.A., Monteiro, D.C., Cammarata, M., Pearson, A.R.: Using syn-
chrotrons and XFELs for time-resolved X-ray crystallography and solution scattering experiments
on biomolecules (2015).
30. Levantino, M., Yorke, B.A., Monteiro, D.C., Cammarata, M., Pearson, A.R.: Using syn-
chrotrons and XFELs for time-resolved X-ray crystallography and solution scattering experiments
on biomolecules (2015).
31. Liu, P., Kim, B., Friesner, R.A., Berne, B.J.: Replica exchange with solute tempering: A method
for sampling biological systems in explicit water (2005).
32. Noé, F., De Fabritiis, G., Clementi, C.: Machine learning for protein folding and dynamics
(2020).
33. Putnam, C.D., Hammel, M., Hura, G.L., Tainer, J.A.: X-ray solution scattering (SAXS)
combined with crystallography and computation: defining accurate macromolecular structures,
conformations and assemblies in solution (2007).
34. Pérez, A., Martínez-Rosell, G., De Fabritiis, G.: Simulations meet machine learning in
structural biology (2018).
35. Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G., Noé, F.: Identification of slow
molecular order parameters for Markov model construction (2013).
36. Ramaswamy, V.K., Musson, S.C., Willcocks, C.G., Degiacomi, M.T.: Deep learning protein
conformational space with convolutions and latent interpolations (2021).
37. Ringe, D., Petsko, G.A.: [19] Study of protein dynamics by X-ray diffraction (1986).
38. Roy, K., Kar, S., Das, R.N.: Computational chemistry (2015).
39. Shen, J., Shafiq, M.O.: Short-term stock market price trend prediction using a comprehensive
deep learning system (2020).
40. Siegmund, D.: Importance sampling in the Monte Carlo study of sequential tests (1976).
41. Singh, G., Memoli, F., Carlsson, G.: Topological methods for the analysis of high dimensional
data sets and 3d object recognition. In: M. Botsch, R. Pajarola, B. Chen, M. Zwicker (eds.)
Eurographics Symposium on Point-Based Graphics. The Eurographics Association (2007).
42. Smyth, M.S.: X-ray crystallography (2000).
43. Spiwok, V., Lipovová, P., Králová, B.: Metadynamics in essential coordinates: Free energy
simulation of conformational changes (2007).
44. Stelzl, L.S., Hummer, G.: Kinetics from replica exchange molecular dynamics simulations
(2017).
45. Sun, L., Li, P., Ju, X., Rao, J., Huang, W., Ren, L., Zhang, S., Xiong, T., Xu, K., Zhou, X.,
Gong, M., Miska, E., Ding, Q., Wang, J., Zhang, Q.C.: In vivo structural characterization of
the SARS-CoV-2 RNA genome identifies host proteins vulnerable to repurposed drugs (2021).
46. Tiwary, P., Berne, B.J.: Spectral gap optimization of order parameters for sampling complex
molecular systems (2016).
47. Verkhivker, G.M., Di Paola, L.: Integrated biophysical modeling of the SARS-CoV-2 spike protein
binding and allosteric interactions with antibodies (2021).
48. Walls, A.C., Park, Y.J., Tortorici, M.A., Wall, A., McGuire, A.T., Veesler, D.: Structure,
function, and antigenicity of the SARS-CoV-2 spike glycoprotein (2020).
49. Wang, M.Y., Zhao, R., Gao, L.J., Gao, X.F., Wang, D.P., Cao, J.M.: SARS-CoV-2: Structure,
biology, and structure-based therapeutics development (2020).
50. Wehmeyer, C., Noé, F.: Time-lagged autoencoders: Deep learning of slow collective variables
for molecular kinetics (2018).
51. Wolf, S., Stock, G.: Targeted molecular dynamics calculations of free energy profiles using a
nonequilibrium friction correction (2018).
52. Zhang, L., Lin, D., Sun, X., Curth, U., Drosten, C., Sauerhering, L., Becker, S., Rox, K.,
Hilgenfeld, R.: Crystal structure of SARS-CoV-2 main protease provides a basis for design of
improved α-ketoamide inhibitors (2020).
53. Zimmerman, M.I., Porter, J.R., Ward, M.D., Singh, S., Vithani, N., Meller, A., Mallimadugula,
U.L., Kuhn, C.E., Borowsky, J.H., Wiewiora, R.P., Hurley, M.F.D., Harbison, A.M., Fogarty,
C.A., Coffland, J.E., Fadda, E., Voelz, V.A., Chodera, J.D., Bowman, G.R.: SARS-CoV-2
simulations go exascale to predict dramatic spike opening and cryptic pockets across the
proteome (2021).
Low Rank Approximation Methods
for Identifying Impactful Pairwise
Protein Mutations
1 Introduction
Early approaches to infer the effects of a mutation in the physical protein involved
modifying the gene that encodes for a specific amino acid in a polypeptide, followed
by protein expression and purification via in vitro assays. Free energy of unfolding
studies can then be performed by denaturing the protein mutant and its non-mutated
form (wild type).
One way in which a residue can be identified as critical is by performing a
mutation in a physical protein, and then measuring the effect of the substitution.
Matthews et al. have designed and analyzed many mutants of lysozyme from
bacteriophage T4, and concluded that the unoccupied volume that is caused by
some mutations induces a collapse of that region, while in other cases the cavity
remains empty [1]. Therefore, mutating a large residue does not necessarily have a
measurable impact on the stability and structure of a protein. Also, residues that are
held relatively rigidly within the core of the protein make the largest contribution
to the protein’s stability [2], while residues with a high solvent accessibe surface
area are often not as critical, because their mutations often have little bearing on
the stability of the molecule. The Protherm database [3] includes thermodynamic
data (unfolding Gibbs free energy change arrived at via the Schellman equation [4],
enthalpy change, heat capacity change, transition temperature, activity etc.), struc-
tural information (secondary structure, accessibility etc.), measurement methods,
experimental conditions and literature information for wildtype proteins and their
mutants. It is the most extensive source of experimentally available data on the effect
of point mutations on the thermodynamics of proteins.
The experimental studies above are valuable but time consuming and often cost
prohibitive. Moreover some mutant proteins cannot be expressed, due to dramatic
destabilization caused by the mutation, but we would still like to infer whether they
are critical or not. Most importantly, assessing which pairs of mutations have the
greatest impact on the structural stability of a protein is beyond the scope of wet-
lab work, because engineering a set of exhaustive mutants with two amino acid
substitutions is infeasible due to the combinatorial growth of possible decoys. To
address this, computational and analysis techniques are available.
Computational approaches range from estimating folding free energy changes
upon mutations [5], to the use of machine learning (ML) and statistical methods
for predicting the effects of mutations and to infer which residues are critical.
Approaches which utilize Support Vector Machines (SVMs) are able to predict
with 84% accuracy the sign of the stability change for a protein induced by a
single point mutation [6]. Also, data of amino acid replacements that are tolerated
within families of homologous proteins has been used to devise stability scores for
predicting the effect of residue substitutions [7, 8], while force fields for predicting
protein stability have enabled fast and quantitative estimation of interactions that
contribute to complex stability [9].
More recent approaches leverage new biochemistry technologies capable of
mutating many residues, and assessing the effects of each amino acid substitution.
Alanine Scanning [10, 11] involves systematic site-directed mutagenesis, which
produces a library of mutants with an Alanine at each possible residue location.
High-throughput protein expression in coordination with assay runs can reveal
the functional effects of the Alanine mutations. A newer technology still, deep
mutational scanning [12], relies on high-throughput sequencing to generate upwards
of 1 million mutant versions of a protein, which are analyzed to reveal the effects of
the amino acid substitutions via function assays. These new technologies are capable
of providing insights about the effects of a wide range of amino acid substitutions,
but still do not permit an exhaustive-style analysis due to the combinatorial
explosion when 2- or 3-site amino acid substitutions are performed. For example,
for an n = 200 residue protein, there are C(n, 2) × 19² = 7,183,900 possible protein
variants which have 2 residue mutations relative to the wild type. This number is
beyond the range and capability of even deep mutational scanning techniques.
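The count follows directly from choosing 2 of the n residue positions and, at each chosen position, one of the 19 non-wild-type amino acids:

```python
from math import comb

# Two of the n residue positions are chosen, and each chosen position can
# receive any of the 19 non-wild-type amino acids.
n = 200
variants = comb(n, 2) * 19**2
print(variants)   # 7183900
```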
Low Rank Approx. for Identifying Impactful Pairwise Protein Mutations 65
2 Related Work
Most methods that strive to infer how a mutation affects the stability of a protein
reason about single amino acid substitutions only. They include PoPMuSiC 2.1 [18]
which makes predictions about ΔΔG and generates a sequence optimality score,
AutoMute [19] which is an ML-based approach that requires a large training
set, and CUPSAT [20], which relies on energy potentials (atomic and torsional
angles), whose calculation is time intensive. Others include D-Mutant [21],
I-Mutant2.0 [22], and STRUM [23]. Respectively these rely on residue-specific
all-atom potential calculations, structural and sequence information, and multiple
threading template alignment. An Unfolding Mutation Screen (UMS) is also
available, that relies on propensity tables and free energy calculations [24].
66 C. Daw et al.
A few approaches permit reasoning about the effects of multiple mutations, but
none are able to perform screening-like analyses. These include MAESTRO and
MAESTROweb [25], as well as DUET [26]. In our recent work, we developed a
compute pipeline for generating in silico all mutants with pairwise mutations [27],
and generated an Allostery Impact Map to identify pairs of residues that cause a
disruption to the protein’s stability.
Our low rank smoothing techniques are inspired by work in matrix completion
[28–31], rank regularized least squares [32] and sparse plus low rank decompo-
sitions [33–35]. Matrix completion is a task in which a subset of the entries of
a matrix are known and the remainder need to be inferred (completed). Perhaps
the most famous example is the “Netflix Prize” [36], which involves a user–movie
matrix, where the (i, j)th element is user i’s rating of movie j. Given an
incomplete set of user-movie ratings, the remainder of the matrix is to be inferred.
Many solutions to this problem [37] assumed a low rank matrix structure, which
corresponds to an assumption that a relatively small number of underlying factors
govern movie preferences; phrased another way, that there are only a few kinds of
users and kinds of movies. Rank regularized least squares [32] involves solving a
least squares problem with a nuclear norm (also known as trace norm, Ky-Fan r-
norm, or Schatten-1 norm) regularization term, which is a convex relaxation of the
rank function. These problems differ from matrix completion in that the low rank
approximation’s corresponding elements are close, but not identical to, the known
elements. Some of our approaches assume a sparse plus low rank decomposition, in
which a matrix M is approximated as the sum of a sparse matrix and a low rank
matrix. Nearly all of the above utilize singular value decompositions and the optimal
approximation result of the famous Eckart-Young-Mirsky theorem [38].
3 Methods
Exhaustive mutation sets have been used in the past to explore and identify
impactful amino acid substitutions via Allostery Impact Maps (AIMs) [27, 39].
However, generating all possible mutants with two amino acid substitutions can take
several weeks—even months—of compute time. In this paper, we present a multi-
phase compute pipeline (Fig. 1) for evaluating several sampling and approximation
methods to reduce the number of pairwise mutations required to make good
predictions, based on rigidity analysis, about the effects of pairwise mutations.
The pipeline has been engineered to scale the prediction process to large proteins
and resulting exhaustive pairwise mutation sets. There are three phases to the
pipeline.
Phase 1: Generating exhaustive pairwise mutation set—This phase is required
for validating the effectiveness of our sampling methods. There are two tasks:
(i) identifying the exhaustive set of all possible mutants having two amino acid
substitutions, and (ii) analyzing the effects of these pairwise mutations using rigidity
analysis. Phase 2: Sampling from exhaustive pairwise mutation set—We utilize six
Fig. 1 Pipeline: We generate an exhaustive mutation set (blue), use sampling methods to generate
empirical Allostery Impact Maps (yellow), then reconstruct the empirical AIMs using sparse-plus-
low-rank, low rank, and svd matrix reconstruction methods (red)
In this phase we generate mutant structure and analyze rigidity in order to construct
an allostery impact map to understand the effect of pairwise mutations.
Generating Mutant Structures: We use the ProMuteHT software [40] to generate
an exhaustive sample of mutants with two amino acid substitutions for eight proteins
(Table 1). So that we could validate our approach against experimental data about
the effects of mutations done on physical proteins, for this work we focus on the
164-residue lysozyme from bacteriophage T4, in the PDB file 2LZM. Lysozyme
was the second protein, and the first enzyme, whose structure was solved in 1965
using X-ray crystallography. It is among the more well-studied proteins for which
many mutations have been performed via mutagenesis experiments. A large count of
entries in the ProTherm database are for lysozyme, and thus there is ample stability
data (ΔΔG) for a variety of mutants.
Rigidity Analysis: Rigidity analysis [41] is a fast graph-based method that
identifies rigid regions of biomolecules [42]. Atoms and their chemical interactions
Table 1 PDB files used, and mutants generated

PDB file   Num residues   Mutants      Runtime
1CRN       46             373,635      2.1 hr
1PGA       56             555,940      3.3 hr
1BPI       58             596,733      3.8 hr
1ROP       63             705,033      4.7 hr
1CSP       67             798,171      6.1 hr
1VQB       87             1,350,501    8.3 hr
1HHP       99             1,751,211    14.3 hr
2LZM       164            4,825,126    48.9 hr
are used to construct a mechanical model and associated graph of a protein, whose
analysis via a pebble game algorithm [43] identifies rigid clusters of atoms (Figs. 2
and 3). For this work, we tally the counts and distribution of rigid clusters in the
wild type, as well as a mutant, to quantitatively assess the effect of the amino acid
substitutions performed in silico. We use the following rigidity metric (see [39]):
RD_{WT→mutant} = Σ_{i=1}^{LRC} i × [WT_i − Mut_i]    (1)

where WT refers to Wild Type, Mut refers to mutant, and LRC is the size of the
Largest Rigid Cluster (in atoms). Each summation term of RD_{WT→mutant} calculates
the difference in the count of rigid clusters of a specific size, i, between the wild type
and mutant, and weighs that difference by i.
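Eq. (1) translates directly into code; the cluster-size histograms below are hypothetical and chosen only to exercise the formula:

```python
import numpy as np

def rigidity_distance(wt_counts, mut_counts):
    """Eq. (1): RD = sum over cluster sizes i of i * (WT_i - Mut_i).

    wt_counts[i] / mut_counts[i] give the number of rigid clusters of exactly
    i atoms; LRC is the largest cluster size present in either histogram.
    """
    lrc = max(len(wt_counts), len(mut_counts))
    wt = np.zeros(lrc)
    mut = np.zeros(lrc)
    wt[:len(wt_counts)] = wt_counts
    mut[:len(mut_counts)] = mut_counts
    return float(np.sum(np.arange(lrc) * (wt - mut)))

# Hypothetical histograms: a mutation splits one size-5 rigid cluster into a
# size-2 and a size-3 cluster (list index = cluster size; index 0 unused).
wt  = [0, 0, 0, 0, 0, 1]
mut = [0, 0, 1, 1, 0, 0]
print(rigidity_distance(wt, mut))   # 5*1 - (2 + 3) = 0.0
```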
Allostery Impact Map: We use the rigidity analysis data to create an Exhaustive
Mutation Tensor, T^ex ∈ R^{n×n×361}. The (i, j, k)th element, T^ex_{ijk}, contains the
rigidity data for performing the kth pair of substitutions (out of 19² = 361 total
possible pairs of substitutions) at residues i and j.
From T^ex, we build an Exhaustive Allostery Impact Map (AIM), M^ex [27],
which provides an infographic (Figs. 4 and 5) based on quantitative data for
reasoning about the effects of mutating two residues. The color of any one cell in the
AIM designates the sum of the metric over all 361 mutant structures obtained
when the residues indicated by the x and y values are exhaustively mutated.
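Collapsing the tensor into the AIM is simply a sum over the third axis; the toy-sized tensor below stands in for real rigidity data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy tensor for an n-residue protein: entry (i, j, k) holds the rigidity
# metric for the kth of the 19*19 = 361 substitution pairs at residues i, j.
n = 10
T_ex = rng.normal(size=(n, n, 361))

# Each AIM cell is the sum of the metric over all 361 mutants at (i, j).
M_ex = T_ex.sum(axis=2)
print(M_ex.shape)   # (10, 10)
```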
We were ultimately restricted to 500 processes, which required multiple runs. Our
total compute time for SASA was approximately 15 hours.
Using this exhaustive set of SASA values, we are able to study the impact of
pairwise mutations whose substituted amino acids exist at the core and surface of the
biomolecule. We consider a pairwise mutation to be at the core of the biomolecule if
both its substituted amino acids are below 60% SASA. We consider a pairwise
mutation to be at the surface of the biomolecule if both its substituted amino acids
are at or above 60% SASA.
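The classification rule can be written as a small helper; the "mixed" label for pairs that straddle the threshold is our addition, since the text defines only core and surface:

```python
def mutation_location(sasa_i, sasa_j, threshold=60.0):
    """Classify a pairwise mutation by the relative SASA (%) of both sites."""
    if sasa_i < threshold and sasa_j < threshold:
        return "core"
    if sasa_i >= threshold and sasa_j >= threshold:
        return "surface"
    return "mixed"   # one site buried, one exposed; not defined in the chapter

print(mutation_location(12.0, 35.0))   # core
print(mutation_location(75.0, 88.0))   # surface
print(mutation_location(12.0, 88.0))   # mixed
```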
Let WTSeq be the amino acid sequence of length n for a wildtype protein
and WTSeq_i denote the ith amino acid of that sequence for i ∈ {1, 2, . . . , n}.
Additionally, let M denote the set of mutation sites at which amino acid substitutions
have been made for a given protein mutation. Note that |M| = k = 2 in this
work. For k = 2, the number of possible mutations for any mutation site pair M is
19² = 361, meaning that MT_pho and MT_phil account for up to 22.4% and 33.5%
of the set of mutations for a given site pair M. Hence, the overhead of evaluating
multiple sampling methods for several exhaustive pairwise sets is bound by the
largest exhaustive pairwise mutation set.
While empirical AIMs are far more efficient to compute than exhaustive AIMs,
they tend to be highly noisy due to their inherently incomplete nature. In this
phase we explore several strategies for smoothing their values in order to improve
approximation quality. These approximations “fill in” the missing information by
making some assumptions about global structure of the exhaustive AIM; namely
that the exhaustive AIM is low rank, possibly with additive sparse noise.
The rank of a matrix is the number of linearly independent columns (and rows)
in the matrix; equivalently, it is the number of non-zero singular values. Rank can
be thought of as a notion of complexity in the matrix: low rank matrices can be
explained by a relatively small number of underlying factors. Figure 6 plots the
singular values (in the conventional descending order) for the exhaustive AIMs for
the proteins we considered. While none of the matrices are exactly low rank, all are
approximately low rank: most of the singular values are approximately zero.
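The Fig. 6 diagnostic can be reproduced on a synthetic approximately-low-rank matrix (a rank-3 signal plus small noise, standing in for an exhaustive AIM):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic approximately low rank matrix: rank-3 signal plus small noise.
n = 60
M = rng.normal(size=(n, 3)) @ rng.normal(size=(3, n)) \
    + 0.01 * rng.normal(size=(n, n))

s = np.linalg.svd(M, compute_uv=False)   # singular values, descending
s_norm = s / s[0]                        # normalized by the largest, as in Fig. 6
print(np.round(s_norm[:5], 4))           # the first 3 dominate; the rest ~ 0
```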
Fig. 6 Singular values for 8 proteins, revealing approximate low rank structure. Singular values
were normalized by dividing by the largest singular value
Our first method produces a low rank approximation of the empirical AIM by
solving the rank-constrained least squares problem

min_M ‖M − M^emp‖_F  subject to  rank(M) ≤ R,

whose solution is given by a truncated singular value decomposition. Writing
M^emp = UΣV^T, where U and V contain the left and right singular vectors of M^emp
and Σ is the matrix whose diagonal contains the singular values of M^emp, we form
Σ_R by replacing all but the R largest singular values with zeros. Our Singular Value
Decomposition (SVD) AIM, M^svd, is defined to be M_R^emp = UΣ_RV^T, the optimal
rank-R approximation of M^emp.
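A sketch of the SVD AIM computation on a synthetic matrix (the helper name and the data are ours):

```python
import numpy as np

def svd_aim(M_emp, R):
    """Optimal rank-R approximation of M_emp (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(M_emp, full_matrices=False)
    s_R = np.where(np.arange(len(s)) < R, s, 0.0)   # keep only the R largest
    return U @ np.diag(s_R) @ Vt

rng = np.random.default_rng(7)
M_emp = rng.normal(size=(20, 5)) @ rng.normal(size=(5, 20))   # exactly rank 5
print(np.allclose(svd_aim(M_emp, 5), M_emp))   # rank-5 truncation is exact here
```

For a real empirical AIM, which is only approximately low rank, the truncation discards the small trailing singular values and thereby smooths the sampling noise.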
Low Rank Approx. for Identifying Impactful Pairwise Protein Mutations 73
Our second method avoids fixing the rank in advance by solving a convex problem
that penalizes rank implicitly:

\[ M^{lr} = \operatorname{arg\,min}_{M} \; \tfrac{1}{2} \| M - M^{emp} \|_F^2 + \gamma_0 \| M \|_* . \]

Here \( \| M \|_* \) denotes the nuclear norm of M, which is the sum of its singular values
and is a convex function of M. Penalizing the nuclear norm of a matrix has the effect
of reducing its rank. The regularization coefficient γ0 controls the extent to which
we penalize the rank. We refer to this approximation as the Low Rank (LR) AIM.
While the exhaustive AIMs appear to have strong low rank structure, they also
exhibit some sparse patterns; in the visual representation of the empirical AIMs,
these patterns look like sparse noise. Rather than force the low rank matrix to
capture these sparse patterns, and artificially drive up its rank, we also consider
an approximation that is the sum of a low rank matrix and a sparse matrix. This is
accomplished with the following variant of the low rank optimization problem:

\[ (M^{l}, M^{s}) = \operatorname{arg\,min}_{L,\,S} \; \tfrac{1}{2} \| L + S - M^{emp} \|_F^2 + \gamma_0 \| L \|_* + \gamma_1 \| S \|_1 , \]

where the SLR AIM is the sum of the two components, \( M^{slr} = M^{l} + M^{s} \).
Fig. 7 An example of a Sparse plus Low Rank approximation of the 1CRN protein. Matrices a
and b are added together to produce matrix c, the SLR AIM, which is a low rank smoothing of
matrix d, the empirical AIM
For the SLR results, we tune a wide range of γ0 and γ1 values; for the LR, we tune
γ0 but fix γ1 ≈ ∞ to prevent the sparse component from being used.
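Both objectives can be minimized with proximal operators: singular value thresholding handles the nuclear norm term and elementwise soft-thresholding handles the ℓ1 term. The sketch below uses plain alternating minimization; it is a minimal illustration of the objectives, not the authors' implementation, and all names and parameter values are ours:

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the prox of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    """Elementwise soft-thresholding: the prox of tau * (l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def slr_aim(M_emp, gamma0, gamma1, iters=100):
    """Alternating minimization for
        min_{L,S} 0.5*||L + S - M_emp||_F^2 + gamma0*||L||_* + gamma1*||S||_1.
    Returns (L, S); the SLR AIM is L + S. Taking gamma1 very large forces
    S = 0, which recovers the LR AIM."""
    L = np.zeros_like(M_emp)
    S = np.zeros_like(M_emp)
    for _ in range(iters):
        L = svt(M_emp - S, gamma0)   # exact minimizer over L with S fixed
        S = soft(M_emp - L, gamma1)  # exact minimizer over S with L fixed
    return L, S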
We evaluate the quality of approximation using the Sum of Absolute Errors (SAE) relative
to the exhaustive AIM:
\[ \mathrm{SAE} = \sum_{i=1}^{n} \sum_{j=1}^{n} \left| M^{ex}_{ij} - M_{ij} \right| \qquad (10) \]
and Root Mean Squared Error (RMSE) relative to the exhaustive AIM:
\[ \mathrm{RMSE} = \sqrt{ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( M^{ex}_{ij} - M_{ij} \right)^2 } . \qquad (11) \]
In both cases, M is one of the following: an empirical AIM (M^{emp}), a singular value
decomposition AIM (M^{svd}), a sparse plus low rank AIM (M^{slr}), or a low rank AIM
(M^{lr}). As the number of samples increases, M^{emp} approaches M^{ex} and therefore its
RMSE approaches zero.
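Equations (10) and (11) translate directly into NumPy; a minimal sketch (function names are ours):

```python
import numpy as np

def sae(M_ex, M):
    """Sum of absolute errors relative to the exhaustive AIM, Eq. (10)."""
    return float(np.sum(np.abs(M_ex - M)))

def rmse(M_ex, M):
    """Root mean squared error relative to the exhaustive AIM, Eq. (11)."""
    n = M_ex.shape[0]
    return float(np.sqrt(np.sum((M_ex - M) ** 2) / n ** 2))

# Tiny example: one entry of a 2x2 "exhaustive" matrix is off by 4,
# giving SAE = 4 and RMSE = sqrt(16 / 4) = 2.
M_ex = np.array([[1.0, 2.0], [3.0, 4.0]])
M_approx = np.array([[1.0, 2.0], [3.0, 0.0]])
```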
Fig. 8 Percent improvement in SAE approximating 1CRN by SVD relative to the “filled”
empirical method at various sampling rates of 19 mutations across 25% of mutation site pairs
Fig. 9 Percent improvement in SAE approximating 1CRN by SVD relative to the “filled”
empirical method at various sampling rates of 19 mutations across 50% of mutation site pairs
Fig. 10 Percent improvement in SAE approximating 1CRN by SVD relative to the “filled”
empirical method at various sampling rates of 19 mutations across 75% of mutation site pairs
Fig. 11 Percent improvement in SAE approximating 1CRN by SVD relative to the “filled”
empirical method at various sampling rates of 19 mutations across 100% of mutation site pairs
Fig. 12 Percent improvement in SAE approximating 1PGA by SVD relative to the “filled”
empirical method at 25% of mutation site pairs
Fig. 13 Percent improvement in SAE approximating 1PGA by SVD relative to the “filled”
empirical method at 50% of mutation site pairs
Fig. 14 Percent improvement in SAE approximating 1PGA by SVD relative to the “filled”
empirical method at 75% of mutation site pairs
Fig. 15 Percent improvement in SAE approximating 1PGA by SVD relative to the “filled”
empirical method at 100% of mutation site pairs
Fig. 16 Empirical approximation error against SVD approximation error for various ranks when
sampling across 25% of mutation site pairs for 2LZM
We plot the sum of absolute errors (SAE) of the empirical and SVD approximated
matrices using three sampling strategies at 25% site pairs sampled for 2LZM (Y-
axis) against rank (X-axis) in Fig. 16. The plot shows a clear optimum along the
rank axis, centered around R = 32. As mentioned above, as the rank approaches the
total number of residues, the SVD approximation approaches the empirical matrix
and therefore the SAE for the SVD approximation approaches that of the empirical matrix.
Figure 17 shows a similar graph for 1HHP with 25% site pairs sampled. The value of
the optimal rank for the approximation method is much lower, at R = 2, suggesting
the underlying stability structure is relatively simple compared to 2LZM. The plot
also shows the SAE for the “Mutation to Hydrophobic” strategy to be much higher than
for the other sampling methods. This can be explained biophysically: a mutation
from a hydrophilic to a hydrophobic residue at the surface can render that residue
energetically unfavorable.
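The rank sweeps behind Figs. 16 and 17 amount to computing the error of the rank-R truncation of the empirical matrix for each candidate R and picking the minimum. A sketch on synthetic stand-ins (the matrices and names are illustrative, not the chapter's data):

```python
import numpy as np

def rank_sweep(M_ex, M_emp, ranks):
    """SAE of the rank-R SVD truncation of M_emp against M_ex, for each R.
    Returns (best_rank, list_of_saes)."""
    U, s, Vt = np.linalg.svd(M_emp, full_matrices=False)
    saes = []
    for R in ranks:
        s_R = np.where(np.arange(len(s)) < R, s, 0.0)  # keep R largest values
        M_R = (U * s_R) @ Vt                           # rank-R truncation
        saes.append(float(np.sum(np.abs(M_ex - M_R))))
    return ranks[int(np.argmin(saes))], saes

# Synthetic example: the true matrix is exactly rank 2; the "empirical"
# matrix carries two extra spurious directions, mimicking sampling noise.
M_ex = np.diag([5.0, 3.0, 0.0, 0.0])
M_emp = np.diag([5.0, 3.0, 0.1, 0.05])
best_R, saes = rank_sweep(M_ex, M_emp, ranks=[1, 2, 3, 4])
```

Truncating below the true rank discards signal, while truncating above it keeps noise, so the SAE curve has an interior minimum (here at best_R = 2), mirroring the minima at R = 32 for 2LZM and R = 2 for 1HHP.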
Figure 18 shows M ex , M emp and M svd for 1CRN. From this figure, we observe
that the SVD method is able to capture the banded, low rank nature of the exhaustive
AIM, although it does somewhat over-smooth. For other proteins (not shown), the
SVD method demonstrates similar behavior.
Fig. 17 Empirical approximation error against SVD approximation error for various ranks when
sampling across 75% of mutation site pairs for 1HHP
Fig. 18 1CRN: Exhaustive (left), empirical approximation (upper right) and SVD approximation
(lower right) AIMs
Fig. 19 AIM heatmaps for all sampling methods and all approximation methods. Each heatmap
shows its RMSE relative to the ground truth
Fig. 20 Correlation plot for atCore sampling method, SLR approximation method, and 6%
sampling
Fig. 21 Line plot for all approximation methods, where each line is averaged across all sampling
methods
Fig. 22 Line plot for all sampling methods, where each line is averaged across all approximation
methods
Furthermore, when the sampling percentage is lowest (and thus even random
sampling has missing rows or columns), the biologically-inspired atCore and
areHydrogenBonded sampling methods yield the best performance.
Using the 2LZM rigidity analysis results with pairwise mutations, we measured the
sampling strategies at sampling portions between 0.01 and 0.06, due to the low
sampling numbers obtained with certain sampling strategies. Surprisingly, the random
sampling strategy performed the best, which could be due to the sparsity of the
information in 2LZM needed for optimal reconstruction. The atCore sampling strategy
is also promising. We found that the LR approximation method consistently outperformed
the others, including SLR. It seems that SLR places too much emphasis on the values
of the sparse matrix, which increases error; hence, omitting the sparse component
improved the approximation. With LR, a sweet spot between sparsity and rank exists.
In general, random sampling with the LR approximation provides the best approximation.
It would be worth exploring how well these approximation and sampling methods perform
on new proteins, to further validate LR approximation and random sampling. In addition,
an increase in the total number of protein mutations could reveal different insights
into our methods. An extension of our current method would be to operate directly on
the 3D tensor rather than on a flattened matrix. Doing so would open the way to
applications of active learning and to quantifying the information provided per mutation.
Detection and Analysis of Amino Acid
Insertions and Deletions
1 Introduction
When analyzing the evolution of proteins, the most prevalent sequence variations
are amino acid substitutions and insertions or deletions (InDels) [1, 2]
(Figs. 1 and 2). Amino acid substitutions have been the focus of numerous recent and
well-established wet lab and computational studies. The research relating
to InDels, however, remains rather nascent. Many questions related to InDels remain
unanswered, despite the fact that InDels account for more changes in the structure and
function of proteins than do amino acid substitutions [3, 4]. There is abundant evidence
indicating that these variations, rather than substitutions, are primarily responsible
for the various kinds of functional changes in proteins that underlie evolution.
Due to their impact on protein structure and function, it can reasonably be
hypothesized that InDels are responsible for modifying the interaction
interfaces of proteins, resulting in the gain or loss of protein-protein interactions.
It follows that the interaction networks of proteins are substantially impacted by
InDels.
InDels can be viewed from two perspectives: a sequence perspective and a
structural perspective [5]. In the sequence perspective, we look at InDels as the addition
or deletion of amino acids from the protein sequence as a result of changes in
the genome. In the structural perspective, we look at domains of proteins and how
various proteins combine to form multi-domain architectures. Small InDels tend not
to have drastic impacts on the structure of a protein; rather, they alter
the activity of the protein as well as its interfaces with other proteins.
The lengths of the coding insertions and deletions in the human genome are
contributing factors to insertions and deletions in proteins [6, 7]. The two kinds
Detection and Analysis of Amino Acid Insertions and Deletions 91
of coding insertions and deletions in the human genome are those whose length is
divisible by three and those whose length is not. The latter kind causes frameshifts;
these occur more frequently either in redundant genes or at the ends of genes [8].
Thus, the location of an insertion or deletion in the genome is the deciding factor in
its impact on the resulting protein. Genome insertions and deletions with length divisible
by three are termed “3n” or nonframeshift (NFS), and these are the ones that
can result in the insertion or deletion of amino acids in the protein sequence
encoded by the nucleotide sequence [9, 10].
The structural integrity of proteins can be severely perturbed by large InDels,
resulting in many types of disease phenotypes in an organism. Oftentimes InDels are
accompanied by substitutions that occur prior to or following the InDel, and the
combined impact can be quite severe. For instance, when an active site loop is deleted
and the deletion is followed by a substitution, the enzymatic activity of the protein
can change [11, 12]. The proteins that more commonly accommodate InDels are those that
are essential in nature or that interact more with other proteins [13, 14]. InDels at
protein interaction interfaces play a significant role in the interactions of
biomolecules, while InDels that occur in the cores of protein structures can be
responsible for functional divergence among homologous proteins [15].
The implications and importance of studying InDels can be understood by
looking at the case of the spike protein of severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2) [16]. Among other things, what sets this virus
apart from other betacoronaviruses are InDels that occur at the S1/S2 boundary,
which constitutes the cleavage site and plays a role in virus transmission [17].
Understanding the nature and effects of these InDels can aid the study of their
origin and of the evolutionary patterns of the virus. Another disease caused by this
key structural variation is cystic fibrosis, for which several mutants are known;
the F508del mutation in nucleotide-binding domain-1 (NBD1) of the cystic fibrosis
transmembrane conductance regulator (CFTR) is the predominant cause of cystic
fibrosis [18]. The implications of InDels are wide-ranging and encompass various
types of cancers as well [19, 20].
In the remaining sections, we survey research efforts that explore various aspects
of InDels. Section 2 covers efforts focused on the detection of InDels, and Sect. 3
explores efforts in the InDel analysis realm. We conclude with a discussion of
various InDel related efforts.
InDel analysis is important for gaining insight into the effects of amino
acid insertions and deletions. Here we explore machine learning based approaches,
as well as efforts to study the effects of InDels, including plasticity.
Machine learning based methods for studying and analyzing the effects of InDels
range from causal analysis of InDels to studying the impact of InDels on protein
stability. In some instances, hidden Markov models are used for predicting the flanking
regions of InDels. In order to study the relationship between Mendelian disorders
and NFS InDels in the genome, InDels have been characterized based on various
parameters [27]. These parameters, which include functional and evolutionary
features, have subsequently been used to determine which InDels are disease
causing and which are neutral. The end result of this process is an approach called
KD4i [27], which can be used to predict the phenotypic effects of NFS InDels. Because
NFS InDels in the genome cause InDels in the resulting protein sequence, this is a
means of analyzing the impact of InDels at the protein level. An advantage
of this approach is that the rationale behind each prediction is provided as a set of
rules, which can be understood by individuals who are not necessarily experts in
the area.
Because they perturb the core structure of a protein, InDels often significantly
alter its biophysical properties. The positive unlabeled learning-based prediction
framework (PROFOUND) [28] is a method for modeling the changes in the stability
of a protein due to single point deletions; this is achieved via an Elliptic Envelope
outlier detector and a Random Forest classifier. The PROFOUND work also presents a
novel database containing single point deletions (SPD) and is a step towards screening
out unfavorable backbone modifications. In a subsequent publication [29], Banerjee et al.
expand this framework to distinguish multipoint deletion (MPD) instances that are
harmful. There was previously no such database for multipoint deletions in proteins,
hence a novel database for MPDs was introduced. The authors concluded that foldability
in proteins is not random; rather, unique constraints are imposed on the deleted regions.
Due to the complexity of the computational space, various machine learning based
efforts for InDel modeling are still in their incipient stages. One proposed
improvement is a probabilistic model of InDel evolution [30] that is able
to distinguish between insertions and deletions by using a preexisting inference
scheme. This work proposes two novel machine learning based models, which
assume either identical or differing rate and length distributions for insertions
and deletions. The model that allows differing rates appears to perform better than
the one that assumes identical rates.
When it comes to predicting flanking regions, a number of machine learning
based efforts have been developed or proposed. In order to predict the InDel flanking
regions in a given protein sequence for a given protein fold, Mufleh et al. [31]
proposed a framework based on a variable-order Markov model. The predictions
of flanking regions are performed using probabilistic suffix tree and partial match
approaches, and the resulting predictors are built in a fully automated manner.
A preliminary analysis of machine learning based methodologies for analyzing
the impact of InDels reveals that there is a need for further work in this area, as it is
94 M. Jilani et al.
relatively new. Novel efforts will further our understanding of protein
evolution and help identify the InDels that result in disease phenotypes, and the
area can be expanded upon in various engineering endeavors.
effects of short insertions and deletion mutations on the structure and stability
properties of proteins.
SWISS-PO [38] is a novel, free, interactive tool for visualizing the 3D structures
of proteins. This web-based tool aims to help users decide which structure is best to
analyze with respect to the inserted or deleted residues. The decision is based on the
interaction of the molecule with neighboring molecules within the same protein, or with
other macromolecules in the 3D structure. This novel computational effort greatly aids
InDel analysis.
The efforts in the InDel analysis realm are few and far between, but they are developing
nonetheless. There is scope for novel efforts that combine computational InDel
generation with analysis in this area to provide a more broad-based analysis.
Some studies explore the plasticity of proteins to InDels. TEM-1 β-lactamase is a
convenient model for the evolutionary analysis of proteins. Using Amp resistance as a
proxy for fitness, Courtney et al. [6] examined the relationship between the
tolerance of InDels and additional related measures in order to understand the origin
of insertions and deletions, quantifying the fitness effects of 4737 amino acid InDels
in the studied proteins. This study highlights the sparseness of insertion studies,
which are few indeed and examine only a small number of reasonably chosen sites in a
few proteins. Nonetheless, these studies reveal that insertions are tolerated more than
deletions. Such studies of fitness effects in other proteins are crucial for more robust
computational prediction of protein structure.
In order to evaluate plasticity, Arpino et al. [39] observed the tolerance of loop
deletion across the entirety of a protein using directed evolution in enhanced
green fluorescent protein (EGFP). This very comprehensive study revealed an enhancing
effect of rearrangement compared to mutations alone. It also observed that loops
occurring between other secondary structure elements are normally more tolerant of
deletions. Since Green Fluorescent Protein (GFP) has been revolutionary in biomedical
research endeavours, another thorough investigation worth mentioning here is a more
comprehensive study targeting the plasticity of GFP to amino acid deletions [40].
This study not only explores the impact of deletions, but also explores how
folding-enhancing mutations can reverse the impact of deleterious mutations on the
fluorescence of GFP. In certain cases, point deletions can introduce beneficial
structural effects; this phenomenon has been explored in GFP [41]. That study presents
the characterization of deletion mutants by X-ray crystallography and shows that a
protein can remain somewhat plastic to a deletion in the loop region.
In a similar study on human lysozyme, the role of amino acids in the loop regions
in conformational stability was explored using X-ray crystallographic analysis [42].
The study reveals that the impairment in stability is mainly due to the deletion of
hydrogen bonds between molecules, and not due to the amino acid deletions themselves.
The stability and folding rate of a protein are largely impacted by its loops.
Shortening loops has been demonstrated to increase the conformational entropy of a
protein [43]. Using all-atom molecular dynamics simulations, Gavrilov et al. [43]
generalized the effect of loop shortening on the native state dynamics
of four different proteins, and the impact of this phenomenon on the
stability of the protein structures. They later confirmed their results via NMR
experiments [44].
An approach called the weighted contact number (WCN) has been used by Jackson
et al. [45]. The method scans the structure of the protein in order to measure the
packing density of each residue; this information is used as an indicator of
whether the protein will tolerate an InDel at that position. The key finding of this
work is that structure is the main element constraining deletions in proteins.
These studies lay the groundwork for future examination of the tolerance of more
complex, larger InDels.
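The weighted contact number is commonly defined as an inverse-square-distance sum over all other residues; the sketch below assumes that common definition (the precise variant used by Jackson et al. may differ) and takes Cα-like coordinates as input:

```python
import numpy as np

def weighted_contact_number(coords):
    """WCN_i = sum over j != i of 1 / d_ij^2, where d_ij is the distance
    between residues i and j (e.g., between their C-alpha atoms).
    A higher WCN indicates a more densely packed residue."""
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement, (n, n, 3)
    d2 = np.sum(diff ** 2, axis=-1)                  # squared distances
    np.fill_diagonal(d2, np.inf)                     # exclude the self term
    return np.sum(1.0 / d2, axis=1)

# Toy example: three collinear "residues" spaced 1 Angstrom apart;
# the middle one is the most densely packed.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
wcn = weighted_contact_number(coords)
```

In a real application the coordinates would come from a parsed PDB structure, and low-WCN (loosely packed) positions would be the candidate sites for tolerated deletions.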
Since various analysis methods have revealed that InDels and disorder go hand in
hand, it is tempting to assume a causal relationship between them. A detailed
study [46] of InDels with respect to their region of occurrence reveals that InDels
are not a cause of disorder; rather, regions that are inherently disordered incur
InDels. This approach uses hidden Markov model pairwise alignments on two sets of
eukaryotic proteins.
In summary, there are many studies that explore the plasticity of proteins to
InDels. Some are more comprehensive and perform an in-depth analysis of this
phenomenon alongside exploring the relationship between InDels and disorder.
Understanding the impact InDels have on the plasticity of proteins is relevant for
protein engineering, making it easier to predict the sites that are more tolerant
of deletions.
4 Conclusion
References
35. Ágnes Tóth-Petróczy and Dan S Tawfik. Hopeful (protein indel) monsters? Structure,
22(6):803–804, 2014.
36. Raheleh Salari, Alexander Schönhuth, Fereydoun Hormozdiari, Artem Cherkasov, and S Cenk
Sahinalp. The relation between indel length and functional divergence: a formal study. In
International Workshop on Algorithms in Bioinformatics, pages 330–341. Springer, 2008.
37. Muneeba Jilani, Alistair Turcan, Nurit Haspel, and Filip Jagodzinski. Assessing the effects
of amino acid insertion and deletion mutations. In 2021 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), pages 2511–2518. IEEE, 2021.
38. Fanny S Krebs, Vincent Zoete, Maxence Trottet, Timothée Pouchon, Christophe Bovigny, and
Olivier Michielin. SWISS-PO: a new tool to analyze the impact of mutations on protein three-
dimensional structures for precision oncology. npj Precision Oncology, 5(1):1–9, 2021.
39. James AJ Arpino, Samuel C Reddington, Lisa M Halliwell, Pierre J Rizkallah, and D Dafydd
Jones. Random single amino acid deletion sampling unveils structural tolerance and the
benefits of helical registry shift on GFP folding and structure. Structure, 22(6):889–898, 2014.
40. Shu-su Liu, Xuan Wei, Xue Dong, Liang Xu, Jia Liu, and Biao Jiang. Structural plasticity of
green fluorescent protein to amino acid deletions and fluorescence rescue by folding-enhancing
mutations. BMC biochemistry, 16(1):1–11, 2015.
41. James AJ Arpino, Pierre J Rizkallah, and D Dafydd Jones. Structural and dynamic changes
associated with beneficial engineered single-amino-acid deletion mutations in enhanced
green fluorescent protein. Acta Crystallographica Section D: Biological Crystallography,
70(8):2152–2162, 2014.
42. Kazufumi Takano, Yuriko Yamagata, and Katsuhide Yutani. Role of amino acid residues
at turns in the conformational stability and folding of human lysozyme. Biochemistry,
39(29):8655–8665, 2000.
43. Yulian Gavrilov, Shlomi Dagan, and Yaakov Levy. Shortening a loop can increase protein
native state entropy. Proteins: Structure, Function, and Bioinformatics, 83(12):2137–2146,
2015.
44. Yulian Gavrilov, Shlomi Dagan, Ziv Reich, Tali Scherf, and Yaakov Levy. An NMR confirmation
for increased folded state entropy following loop truncation. The Journal of Physical Chemistry
B, 122(48):10855–10860, 2018.
45. Eleisha L Jackson, Stephanie J Spielman, and Claus O Wilke. Computational prediction of
the tolerance to amino-acid deletion in green-fluorescent protein. PLoS ONE, 12(4):e0164905,
2017.
46. Sara Light, Rauan Sagit, Diana Ekman, and Arne Elofsson. Long indels are disordered: a
study of disorder and indels in homologous eukaryotic proteins. Biochimica Et Biophysica
Acta (BBA)-Proteins and Proteomics, 1834(5):890–897, 2013.
47. Qi Wang, Esley Heizer, Bruce A Rosa, Scott A Wildman, James W Janetka, and Makedonka
Mitreva. Characterization of parasite-specific indels and their proposed relevance for selective
anthelminthic drug targeting. Infection, Genetics and Evolution, 39:201–211, 2016.
DeepTracer Web Service for Fast and
Accurate De Novo Protein Complex
Structure Prediction from Cryo-EM
Dong Si, Hanze Meng, Jonas Pfab, Yinrui Deng, Yutong Xie, Jackson Tan,
Sheung Him Martin Chow, Jason Chen, and Aditi Jain
1 Introduction
Since the advent of cryo-electron microscopy (cryo-EM), its imaging techniques have been constantly evolving to achieve higher resolution in the cryo-EM maps of macromolecules. Alongside this progress, the computational methods that derive structural information from these maps have gone through three distinct stages of evolution: from template-based homology modeling, to de novo cryo-EM modeling using traditional machine learning algorithms such as KNN and SVM, to deep-learning based de novo modeling [1]. As an important type of macromolecule, proteins have great value in medical applications such as drug and vaccine
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 101
N. Haspel et al. (eds.), Algorithms and Methods in Structural Bioinformatics,
Computational Biology, https://doi.org/10.1007/978-3-031-05914-8_6
Fig. 1 Part a shows the DeepTracer model determination pipeline. Part b shows the evaluation of models determined by DeepTracer (blue) and Phenix (red) for a set of 476 experimental cryo-EM maps. The dotted lines represent the trend for each method. More comparison results and other details can be found in our previous PNAS paper [4]
2 Procedures
Fig. 2 A general demonstration of the workflow of DeepTracer's web service. Users' jobs are placed in the job listener and dispatched to either of the two job queues whenever one is not occupied by another job. The predicted results are saved on the server, and the middleware responds to the clients promptly
Fig. 3 The job submission interface on the client side. With Option 1, users only need to enter the EMDB ID, and the web server automatically downloads the files from the PDB. With Option 2, only the cryo-EM map is required to be uploaded to the server. The Solved Structure is mainly used for result evaluation after prediction, and the Amino Acid Sequence is mainly used for better amino acid type prediction
Users have two ways to use the web server, as shown in Fig. 3: upload their own input in the form of an mrc or map file, or enter a deposited structure ID in the provided field. The former allows experienced researchers who are working with novel, unsolved structures to accelerate their work by minimizing the need for manual manipulation of the structure. The latter allows users of all backgrounds to produce highly accurate structural determinations for any structure deposited in the EMDataResource [9]. The input files contain the results of cryo-EM experiments: a 3D grid of voxels that estimates the electron density of the given structure. Either input triggers a call to DeepTracer's prediction pipeline to render a structural determination. Users' jobs are saved in a task queue, so they do not have to worry about losing their place in line. We keep a record of each job in our logs and present useful statistics such as RMSD, coverage, and sequence matching percentage.
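The queueing behavior described above can be sketched with Python's standard library. This is a minimal in-memory illustration under simplifying assumptions (the actual service persists jobs and records in a database, and the worker body is a stand-in for the prediction pipeline):

```python
import queue
import threading

job_queue = queue.Queue()  # FIFO: jobs keep their place in line
results = {}

def worker():
    # Each worker pulls the oldest pending job and records its result.
    while True:
        job = job_queue.get()
        if job is None:  # sentinel to shut the worker down
            break
        job_id, emdb_id = job
        # A real worker would run the prediction pipeline here.
        results[job_id] = f"predicted structure for {emdb_id}"
        job_queue.task_done()

# Two workers mirror the two parallel prediction slots on the server.
workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

for job_id, emdb_id in enumerate(["EMD-0401", "EMD-30127", "EMD-6272"]):
    job_queue.put((job_id, emdb_id))

job_queue.join()            # wait until every submitted job is processed
for _ in workers:
    job_queue.put(None)     # stop the workers
for w in workers:
    w.join()
```

Because `queue.Queue` is strictly first-in-first-out, submission order determines execution order, which is the property users rely on.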
Jobs are executed in chronological order. At each step of the prediction pipeline, we update the database and report the update to users, giving a clear visual representation of the job status (see Fig. 4) and the estimated waiting time. Once a job is finished, users can click on the Evaluation tab to compare different metrics between the predicted result and the solved structure, which is either downloaded automatically from the PDB (Option 1) or provided by the user (Option 2), as shown in Fig. 5.
We have built an update system that automatically initiates predictions for novel
coronavirus cryo-EM maps each week from EMDataResource [9]. A background thread periodically checks for newly added coronavirus maps, downloads them, and creates a prediction job for each. When a job finishes, our server will save
the results and update the records accordingly. As of January 4, 2022, there are 628 records displayed on our website from which users can download the prediction results and other corresponding files.
Fig. 4 The details of each step in a job predicting EMD-0401. This page refreshes every 20 seconds to request the current job status. Users can retrieve the server log for each step directly from the black box by clicking on the white check marks. Here the figure shows the results of the U-Net Prediction step, and the progress bar indicates the estimated waiting time until completion
Fig. 5 The evaluation metrics displayed to users in a concise dialog. These metrics give users a quick view of the accuracy and a quick inspection of the structure without any external tools
With the Coronavirus archive, users no longer need to
submit a related job and initiate a new prediction process to get the results. The web
service also offers 3D visualizations of prediction results using Miew from Life Science Open Resources (https://lifescience.opensource.epam.com/miew/), as shown in Fig. 6. For details of the visualization, see Sect. 3.3.2 below.
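The background update thread described above can be sketched as a periodic polling loop. This is a simplified illustration: `fetch_new_coronavirus_maps` is a hypothetical stand-in for the real query against EMDataResource, the EMDB IDs are placeholder data, and the short interval and fixed iteration count exist only to make the sketch testable:

```python
import threading
import time

def fetch_new_coronavirus_maps(already_seen):
    # Hypothetical stand-in for a query against EMDataResource; the real
    # service downloads any newly deposited coronavirus maps.
    deposited = ["EMD-30127", "EMD-11328"]
    return [m for m in deposited if m not in already_seen]

def poll_loop(interval_seconds, iterations, archive):
    # Background loop: check for new maps and record a job for each one.
    for _ in range(iterations):
        for emdb_id in fetch_new_coronavirus_maps(archive):
            archive.add(emdb_id)  # the real code enqueues a prediction job
        time.sleep(interval_seconds)

archive = set()
t = threading.Thread(target=poll_loop, args=(0.01, 2, archive), daemon=True)
t.start()
t.join()
```

In the deployed service the interval would be on the order of a week, and the set of seen maps would live in the database rather than in memory.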
Fig. 6 This figure shows the Coronavirus archive, which is updated once a week. Download and evaluation options are also available on this page, the same as on the job page. Concise labels are set next to each COVID prediction job's title: a red label marks proteins from the SARS-CoV-2 virus, and a green label marks the absence of a solved structure
3 Results
3.1 Architecture
3.1.1 Web Application Architecture
In order to make full use of the computing resources on the DeepTracer Web Server, parallelism is implemented through multiprocessing. Currently, two prediction jobs can run in parallel on the server, both with GPU acceleration enabled, and this number can be scaled up easily with any future increase in GPU resources. We also set up a simple load balancer to distribute tasks evenly across these computing resources. In addition, a maximum of 64 CPU cores can be utilized for one prediction task. Although this implementation of parallelism does not completely prevent jobs from queueing up, users' waiting time is greatly reduced. As a result, we have increased our job processing bandwidth to let users submit 5 jobs every 24 hours instead of every 72 hours.
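The per-user submission limit can be sketched as a sliding-window rate limiter. This is an illustrative design, not the server's actual code; the class name and method are ours:

```python
import time
from collections import defaultdict, deque

class SubmissionLimiter:
    """Sliding-window limit: at most max_jobs submissions per user per window."""

    def __init__(self, max_jobs=5, window_seconds=24 * 3600):
        self.max_jobs = max_jobs
        self.window = window_seconds
        self.history = defaultdict(deque)  # user -> submission timestamps

    def try_submit(self, user, now=None):
        now = time.time() if now is None else now
        stamps = self.history[user]
        # Drop submissions that have aged out of the 24-hour window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.max_jobs:
            return False  # over the limit; the job is rejected
        stamps.append(now)
        return True

limiter = SubmissionLimiter()
accepted = [limiter.try_submit("alice", now=t) for t in range(6)]
```

A sliding window avoids the burst at midnight that a fixed daily reset would allow: the sixth submission is rejected until the oldest one is a full 24 hours old.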
DeepTracer Web Server provides users with prediction evaluations if the user
uploads a corresponding solved structure. DeepTracer uses the solved structure as
the ground truth to calculate the evaluations. There are two types of evaluations:
metrics and visualization. The metrics provide users with a quick view of the
prediction accuracy and the visualizations provide a quick inspection of the structure
without the need for other visualization tools such as UCSF Chimera [10]. There are two categories of metrics: individual residue type predictions and overall residue connectivity. Because all residues are predicted from the cryo-EM map, we need to benchmark the accuracy of the identified residues and their attributes on a per-residue basis. This category consists of the percentage of correctly predicted residues (% Matching); the percentage of falsely predicted residues (% False Positive); the root-mean-square deviation of the backbone Cα atoms (Backbone RMSD); the percentage of matching residues with the same amino acid type (% Sequence Matching); and the percentage of matching secondary structure types (% SSE Matching). The connectivity metrics indicate how well DeepTracer predicted the sequence of the protein and report the percentage of incorrect connections (% False Connections). We are considering implementing other connectivity metrics in the future, such as the mean chain length.
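Two of these metrics can be sketched directly. The following is a minimal illustration assuming residues have already been paired one-to-one between the predicted and solved structures (the matching step itself is not shown); coordinates and sequences below are toy data:

```python
import math

def backbone_rmsd(predicted_ca, solved_ca):
    """Root-mean-square deviation between paired backbone C-alpha coordinates."""
    assert len(predicted_ca) == len(solved_ca)
    sq = sum((px - sx) ** 2 + (py - sy) ** 2 + (pz - sz) ** 2
             for (px, py, pz), (sx, sy, sz) in zip(predicted_ca, solved_ca))
    return math.sqrt(sq / len(predicted_ca))

def sequence_matching(predicted_types, solved_types):
    """Percentage of matched residues assigned the same amino acid type."""
    same = sum(p == s for p, s in zip(predicted_types, solved_types))
    return 100.0 * same / len(solved_types)

pred = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
solved = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
rmsd = backbone_rmsd(pred, solved)          # every atom offset by 1.0 Å
match = sequence_matching("ACDE", "ACDF")   # 3 of 4 types agree
```

With each Cα shifted by exactly 1.0 Å, the RMSD is 1.0, and three agreeing types out of four give 75% Sequence Matching.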
To make the DeepTracer Web Server as easy to use as possible and to provide
necessary functionality, we have implemented several features.
3.3.1 Logger
We provide users with step-by-step transparency into each job execution using a
logger component originally used by DeepTracer’s development team internally.
This feature provides insight into every step of DeepTracer’s pipeline: input map
resampling and normalization, U-Net prediction, atomic tracing of the backbone
structure, merging amino acid chains, secondary structure addition, amino acid
sequence mapping, structural refinement, sidechain prediction, and evaluation.
Figure 7 shows an example of the logs available for every step in DeepTracer’s
prediction pipeline.
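Per-step logging of this kind can be sketched with a small context manager that wraps each pipeline stage. This is an illustration of the idea, not the logger component's actual code; the step names match the pipeline stages listed above, while the loop body is a stand-in for real work:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline")

@contextmanager
def step(name):
    # Log entry, exit, and wall time for one pipeline step, so users can
    # follow progress step by step.
    log.info("step started: %s", name)
    start = time.perf_counter()
    try:
        yield
    finally:
        log.info("step finished: %s (%.2f s)", name, time.perf_counter() - start)

completed = []
for name in ["map resampling", "U-Net prediction", "backbone tracing"]:
    with step(name):
        completed.append(name)  # stand-in for the real pipeline work
```

Because the exit message sits in a `finally` block, a step that raises an exception still records its duration before the error propagates, which is what makes such logs useful for debugging failed jobs.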
3.3.2 Visualization
Fig. 7 Results page provided to users once the job is finished. Here the map used is EMD-30127.
DeepTracer Web Server automatically downloads the target structure from the PDB and provides
evaluation metrics such as RMSD, sequence coverage, sequence matching percentage, and mean
length of matched segments
Fig. 8 A side-by-side comparison of the predicted and solved protein structures provided by the
3D viewer. The predicted structure is less dense because it is missing the side-chains, but the backbone trace is similar in shape
Fig. 9 The matching results of EMD-6272. The amino acid sequence predicted by DeepTracer
closely matches the solved sequence. Shared color schemes represent amino acids with similar
properties
backbone and side-chain, but we are planning to incorporate more features in the future. The other visualization tool is the text-based sequence comparison tool shown in Fig. 9. The sequence comparison tool is useful because it provides quick feedback on the sequence prediction. The challenge of map-to-model is twofold: the first is to identify the correct residues and residue types, and the second is to connect the predicted residues in the correct order. It is therefore helpful to see whether the matching residues have the same amino acid types and whether they are connected in the correct order. The sequence comparison tool provides both pieces of information in a user-friendly way. The predicted residues are ordered in the same sequence as the solved sequence. The amino acids are color-coded based on Jmol's scheme [12] for easy identification, and the consensus signs (*, :, .) indicate the similarity between the predicted and solved amino acids based on the PAM 250 scoring matrix. Users can instantly identify incorrect connections in the predicted sequence, which is especially beneficial for DeepTracer developers in identifying problems and improving the prediction algorithm.
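The consensus line can be sketched as follows. This is a simplified illustration: only a handful of PAM250 entries are included (the real tool uses the full matrix), and the score thresholds for ":" and "." are our own convention for the sketch:

```python
# A small subset of the PAM250 scoring matrix (symmetric pairs);
# the full matrix covers all 20x20 amino acid pairs.
PAM250 = {
    ("L", "I"): 2, ("K", "R"): 3, ("S", "T"): 1,
    ("A", "G"): 1, ("D", "E"): 3, ("F", "W"): 0,
}

def pam250(a, b):
    return PAM250.get((a, b), PAM250.get((b, a), -1))

def consensus_line(predicted, solved):
    """Clustal-style consensus: '*' identical, ':' strongly similar,
    '.' weakly similar, ' ' otherwise, judged by PAM250 score."""
    signs = []
    for p, s in zip(predicted, solved):
        if p == s:
            signs.append("*")
        elif pam250(p, s) >= 2:
            signs.append(":")
        elif pam250(p, s) >= 1:
            signs.append(".")
        else:
            signs.append(" ")
    return "".join(signs)

line = consensus_line("ALKSD", "AIRTE")
```

For the toy pair above, the identical A gives "*", the conservative L/I, K/R, and D/E substitutions give ":", and the weaker S/T pair gives ".".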
With the advent of DeepTracer, researchers from more than 20 countries are utilizing our fully-automated atomic structure prediction service, and we have received many requests to build an offline application capable of conducting
Fig. 10 The offline application's architecture. The design is similar to our web application, with some adaptations to run offline
With the increasing number of users of DeepTracer, sending emails has become an essential way for DeepTracer to communicate with all users. In addition to verification during sign-up and password resets, sending important updates and notices to subscribed users is also important. To achieve this, we have implemented a subscription mechanism on the web server so that users can receive our latest news, such as the release of our offline applications. With our bulk email sending function, we make DeepTracer more user-friendly.
Since the launch of the DeepTracer Web Server, we have seen many users accessing the server from mobile devices, so we developed several mobile features to give them the best possible experience. We have implemented
features that render the DeepTracer Web Server as a Progressive Web App (PWA), merging the functionality of a web application with that of a native mobile application (https://web.dev/what-are-pwas/). This provides mobile users with advanced features, including the ability to install the application to their home screen, create shortcuts for easy access to data, use an offline viewing mode, and enable push notifications.
During 2021, our website was accessed by users from 10,855 different locations around the world. Their average daily engagement time is about 4 minutes, which reflects both strong interest in our service and the fast job running time: users can easily and quickly get what they need in minutes. The devices they use span all mainstream computer and mobile platforms, including Windows, macOS, Linux, iOS, and Android.
4 Discussion
Despite the mature frameworks used on our web server, there are still many aspects worth enhancing. Besides the previously mentioned increase in job
Acknowledgments We would like to thank the entire DeepTracer team for their discussion and
collaboration on the project, including Andrew Nakamura, Michael Chavez, Haowen Guan, Sean
Lin, Kiran Mitra, Victor Suciu, Kristin Ding, Hongyi Ji, Rene Gomez and the previous developers.
Funding This material is based upon work supported by the National Science Foundation under
Grant No. 2030381, the graduate research award of Computing and Software Systems division,
and the SRCP Seed Grant at University of Washington Bothell to D.S. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of the National Science Foundation.
References
1. Dong Si, Andrew Nakamura, Runbang Tang, Haowen Guan, Jie Hou, Ammaar Firozi, Renzhi
Cao, Kyle Hippe, and Minglei Zhao. Artificial intelligence advances for de novo molecular
structure modeling in cryo-electron microscopy. WIREs Computational Molecular Science,
page e1542.
2. Y. Cheng, N. Grigorieff, P. A. Penczek, and T. Walz. A Primer to Single-Particle Cryo-Electron
Microscopy. Cell, 161(3):438–449, April 2015.
3. Ewen Callaway. Revolutionary cryo-EM is taking over structural biology. Nature,
578(7794):201, February 2020.
4. Jonas Pfab, Nhut Minh Phan, and Dong Si. DeepTracer for fast de novo cryo-EM protein structure modeling and special studies on CoV-related complexes. Proceedings of the National Academy of Sciences, 118(2), 2021.
5. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional Networks for
Biomedical Image Segmentation. In International Conference on Medical Image Computing
and Computer-assisted Intervention, pages 234–241. Springer, 2015.
6. Thomas C Terwilliger, Paul D Adams, Pavel V Afonine, and Oleg V Sobolev. A fully automatic
method yielding initial models from high-resolution cryo-electron microscopy maps. Nature
methods, 15(11):905–908, 2018.
7. Jonas Pfab, Yinrui Deng, Hanze Meng, Yutong Xie, Jackson Tan, Michael Chavez, and Dong
Si. DeepTracer Engineering Team. https://deeptracer.uw.edu/about-us. Accessed 4 January
2022.
8. F. Kern, T. Fehlmann, and A. Keller. On the lifetime of bioinformatics web services. Nucleic
Acids Research, 48(22):12523–12533, 2020.
9. Catherine L. Lawson, Ardan Patwardhan, Matthew L. Baker, Corey Hryc, Eduardo Sanz
Garcia, Brian P. Hudson, Ingvar Lagerstedt, Steven J. Ludtke, Grigore Pintilie, Raul Sala,
John D. Westbrook, Helen M. Berman, Gerard J. Kleywegt, and Wah Chiu. EMDataBank unified data resource for 3DEM. Nucleic Acids Research, 44(D1):D396–D403, 2015.
10. Eric F Pettersen, Thomas D Goddard, Conrad C Huang, Gregory S Couch, Daniel M
Greenblatt, Elaine C Meng, and Thomas E Ferrin. UCSF chimera–a visualization system for
exploratory research and analysis. J. Comput. Chem., 25(13):1605–1612, October 2004.
11. M. Bertoni, F. Kiefer, M. Biasini, L. Bordoli, and T. Schwede. Modeling Protein Quaternary
Structure of Homo- and Hetero-oligomers Beyond Binary Interactions by Homology. Sci Rep,
7(10480):9654–8, 2017.
12. Jmol: an open-source Java viewer for chemical structures in 3D. http://jmol.sourceforge.net/.
Accessed 12 January 2022.