Algorithms and Methods in Structural Bioinformatics (Computational Biology) (Nurit Haspel (Editor) Etc.)

Computational Biology
Nurit Haspel
Filip Jagodzinski
Kevin Molloy Editors
Algorithms
and Methods
in Structural
Bioinformatics
Advisory Editors
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joseph Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken,
Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Lonnie Welch, Ohio University, Athens, OH, USA
Editor-in-Chief
Andreas Dress, CAS-MPG Partner Institute for Computational Biology, Shanghai,
China
Michal Linial, Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya, Princeton University, Princeton, NJ, USA
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany
Editorial Board Members

Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig,
Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics,
Dresden, Germany
Pavel Pevzner, University of California, San Diego, CA, USA
Endorsed by the International Society for Computational Biology, the Computa-
tional Biology series publishes the very latest, high-quality research devoted to
specific issues in computer-assisted analysis of biological data. The main emphasis
is on current scientific developments and innovative techniques in computational
biology (bioinformatics), bringing to light methods from mathematics, statistics
and computer science that directly address biological problems currently under
investigation.
The series offers publications that present the state-of-the-art regarding the
problems in question; show computational biology/bioinformatics methods at work;
and finally discuss anticipated demands regarding developments in future method-
ology. Titles can range from focused monographs, to undergraduate and graduate
textbooks, and professional text/reference works.
Nurit Haspel • Filip Jagodzinski • Kevin Molloy
Editors
Algorithms and Methods

in Structural Bioinformatics
Editors
Nurit Haspel Filip Jagodzinski
Department of Computer Science Computer Science
University of Massachusetts Boston Western Washington University
Boston, MA, USA Bellingham, WA, USA
Kevin Molloy
ISAT/CS Building Room 216
James Madison University
Harrisonburg, VA, USA
ISSN 1568-2684 ISSN 2662-2432 (electronic)

ISBN 978-3-031-05913-1 ISBN 978-3-031-05914-8 (eBook)
https://doi.org/10.1007/978-3-031-05914-8
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The three-dimensional structure and function of molecules present many challenges

and opportunities for developing an understanding of biological systems. With the
increasing availability of molecular structures and the advancing accuracy of struc-
ture predictions and molecular simulations, the space for algorithmic advancement
on many analytical and predictive problems is both broad and deep. To support this
field, a rich set of methods and algorithms are available, addressing a variety of
important problems such as protein-protein interactions, the effect of mutations on
protein structure and function, and protein structure determination.
Recently, a deep learning-based algorithm, AlphaFold, made tremendous
progress in predicting the three-dimensional structures of proteins, known as
the protein folding problem. However, many problems still remain unsolved.
In particular, the experimental resolution of protein structures, especially large
macromolecules, still lags behind the availability of protein sequences. Modeling
protein-protein interactions and protein binding still remain a challenge. Even
when the three-dimensional structure of two interacting proteins is known, it is still
difficult to determine the complex formed by the two proteins. It is also challenging
to model and analyze conformational transitions in proteins, due to the transient
nature of intermediate structures.
Chapter “Protein-Ligand Binding with Applications in Molecular Docking”
presents a multi-dimensional analysis approach to protein-ligand binding. Under-
standing protein-ligand binding from different perspectives (energetics, structure,
homology, etc.) can provide insights to further drug design and protein conformation
studies. This chapter reviews basic principles and recent advances in protein-
ligand binding, including the underlying thermodynamics basis, computational
methodologies, and freely available databases. End-point free energy methods,
which have shown to be significantly more efficient than the popular alchemical and
transitional path sampling methods, are discussed in more detail. The chapter ends
with a brief review of molecular docking and its applications to high throughput
screening in early-stage drug discovery.
Chapter “Explaining Small Molecule Binding Specificity with Volumetric Rep-
resentations of Protein Binding Sites” presents advances in algorithmic approaches
v
vi Preface
for studying the volumetric properties of molecular surfaces and electrostatic

isopotentials. Elucidating the surface properties of molecules is needed for advances
in drug design and protein ligand binding.
Chapter “Machine Learning-Based Approaches for Protein Conformational
Exploration” surveys computational methods for conformational exploration of
proteins. The chapter provides a detailed discussion of the challenges of using
these methods, their strengths, and their shortcomings. The surveyed topics include
physics-based methods such as molecular dynamics as well as geometry-based
methods and a focus on new machine learning-based strategies that have been a
research hot spot for computational biologists in recent years.
Chapter “Low Rank Approximation Methods for Identifying Impactful Pairwise
Protein Mutations” describes recent advances in the use of machine learning-based
approaches, including low rank sampling, for efficiently identifying similarities
among proteins and studying the effects of pairwise mutations. Identifying protein
classes and similarities among sets of proteins has relevance to homology modeling
and computational experiments that aim to better understand new protein structures
based on their similarity to other biomolecules.
Chapter “Detection and Analysis of Amino Acid Insertions and Deletions”
showcases on-going work that relies on robotics and coarse-grained combinatorial
approaches for predicting the effects on insertion and deletions (indels) on protein
structural stability. Even a single amino acid substitution can cause significant
changes to a protein’s shape and function. A variety of approaches have been
developed in the past decade for inferring the effects of substitution mutations, but
very few address the effects of insertion or deletion (indel) mutations.
Chapter “DeepTracer Web Service for Fast and Accurate De Novo Protein
Complex Structure Prediction from Cryo-EM” introduces DeepTracer—a Web
service for deep learning-based de novo protein complex structure prediction from
Cryo-EM. Cryo-EM is increasingly being used to resolve protein structures. The
resolution of most Cryo-EM resolved entries in the protein data bank (PDB) is
medium or low, in which fine-level details are obscured, but new technology and
improved computational methods allow for more accurate structure prediction.
Boston, MA, USA Nurit Haspel

Bellingham, WA, USA Filip Jagodzinski
Harrisonburg, VA, USA Kevin Molloy
January 2022
Contents
Protein-Ligand Binding with Applications in Molecular Docking . . . . . . . . . 1

Nikita Mishra and Negin Forouzesh
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Thermodynamic Basis of Protein-Ligand Interaction . . . . . . . . . . . . . . . . . 2
2 Computational Methods for Estimating Binding Free Energy . . . . . . . . . . . . . . 4
2.1 Alchemical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Transition Path Sampling Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 End-Point Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Protein-Ligand Binding Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Molecular Docking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Explaining Small Molecule Binding Specificity with Volumetric
Representations of Protein Binding Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Ziyi Guo and Brian Y. Chen
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1 Comparison Algorithms for Examining Specificity . . . . . . . . . . . . . . . . . . . 18
2 Specificity Assignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Binding Site Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Metrics for Binding Site Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Comparison Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Statistical Models for Binding Site Comparison . . . . . . . . . . . . . . . . . . . . . . 25
3 Component Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Foundations of Structure-Based Component Localization . . . . . . . . . . . . 27
3.2 Using CSG for Component Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Statistical Models for Component Localization . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Volumetric Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Flexible Representations for Component Localization. . . . . . . . . . . . . . . . 35
3.6 Solid Representations of Electrostatic Isopotentials . . . . . . . . . . . . . . . . . 37
4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
vii
viii Contents
4.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Machine Learning-Based Approaches for Protein
Conformational Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Fatemeh Afrasiabi, Ramin Dehghanpoor, and Nurit Haspel
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2 Biophysical and Empirical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3 Physics-Based Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Molecular Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Monte Carlo Based Search Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Geometric and Robotics-Inspired Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Motion Planning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Machine Learning-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1 Dimensionality Reduction Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6 Toolkits for Applying Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1 Topology and Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Using a priori Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Low Rank Approximation Methods for Identifying Impactful
Pairwise Protein Mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chris Daw, Brian Barragan-Cruz, Nicholas Majeske, Filip Jagodzinski,
Tanzima Islam, and Brian Hutchinson
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1 Phase 1: Generate Exhaustive Pairwise Data . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Phase 2: Sampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3 Phase 3: Smooth Approximation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Results: SVD Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.1 SVD Approximation and Sampling Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5 Results: Case Study on 2LZM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Detection and Analysis of Amino Acid Insertions and Deletions . . . . . . . . . . . 89
Muneeba Jilani, Nurit Haspel, and Filip Jagodzinski
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2 Computational Methods of InDel Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3 Computational Methods of InDel Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.1 Machine Learning Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Contents ix
3.2 Detecting Functional and Fitness Effects of InDels on

Protein Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3 Plasticity of Proteins to InDels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
DeepTracer Web Service for Fast and Accurate De Novo Protein
Complex Structure Prediction from Cryo-EM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Dong Si, Hanze Meng, Jonas Pfab, Yinrui Deng, Yutong Xie, Jackson
Tan, Sheung Him Martin Chow, Jason Chen, and Aditi Jain
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2 Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.2 Prediction Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.3 UI/UX Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.1 Design Philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2 Future Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Protein-Ligand Binding with
Applications in Molecular Docking
Nikita Mishra and Negin Forouzesh
1 Introduction
Protein-ligand binding is one of the most fundamental processes in various organ-

isms. There are several examples of this process occurring in nature: enzymes bind
to substrates to allow for energy production via catabolic and anabolic pathways,
viruses attach their proteins onto receptors of the host cell to transfer their genetic
material, and signaling ligands bind to intercellular receptors to implement a signal
transduction process. Protein-ligand binding also plays a key role in the molecular
recognition that is central to drug design and discovery [1]. Therefore, it is vital
to understand protein-ligand binding mechanism at molecular level in biological
systems.
One of the key features of protein-ligand interactions is the binding free energy
change that occurs between the protein and the ligand upon the ligand’s attachment.
Binding free energy heavily dictates how strongly a protein and ligand interact.
This is a particularly useful physiochemical feature to understand for drug design,
studying infectious diseases, and signal transduction pathways in the cell [2]. There
are various ways to determine the binding free energy. Some common experi-
mental techniques include isothermal titration calorimetry [3], surface plasmon
resonance [4], fluorescence polarization [5], and MicroScale thermophoresis [6],
but these types of methods are often costly and labor intensive. Thus, efficient
N. Mishra
Department of Chemistry and Biochemistry, California State University, Los Angeles,
Los Angeles, CA, USA
e-mail: nmishra2@calstatela.edu
N. Forouzesh ()
Department of Computer Science, California State University, Los Angeles, Los Angeles,
CA, USA
e-mail: neginf@calstatela.edu
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 1

N. Haspel et al. (eds.), Algorithms and Methods in Structural Bioinformatics,
Computational Biology, https://doi.org/10.1007/978-3-031-05914-8_1
2 N. Mishra and N. Forouzesh
computational methods have emerged for the quantitative determination of binding

free energy, e.g., alchemical methods, transition path sampling, and endpoint
methods [7, 8].
The intersection between computational and biochemical fields has led to new
technologies that can be used for computer-aided drug design [9]. One such novel
method is molecular docking, which can be used to screen various compounds as
they attach to their target proteins [10]. By examining many protein-ligand different
binding conformations and orientations, evaluating these poses, and ranking them
in accordance to a score, docking methods determine the most likely conformation
of a ligand [11]. Molecular docking is a powerful tool that can significantly expedite
and cut costs of drug discovery.
This chapter will discuss basic principles and recent advances in computational
study of protein-ligand binding. First, we review binding mechanisms from a
thermodynamic point of view. Next, popular computational methods are introduced
followed by a thorough description of three relevant binding databases. Finally,
molecular docking algorithms, scoring functions, and software are explained.
1.1 Thermodynamic Basis of Protein-Ligand Interaction
A property of a thermodynamic system, the heat content of a system at constant

pressure is defined as enthalpy, H . Enthalpy is not measured directly, and therefore
physical and chemical systems are generally more concerned with the net change
in enthalpy, H . In a protein-ligand system, enthalpy measures total energy—
both the internal energy of a system and the energy required to create the system
due to the volume it displaces in its environment [12]. Binding interactions, such
as covalent and non-covalent bonds, in biological systems impact the enthalpy
value. Covalent bonds involve the direct sharing of electrons between two atoms
and are very strong interactions. Noncovalent interactions are weak bonds that
do not involve sharing electrons; although weak, they are vital in determining
physiological macromolecule behavior. There are four basic types of noncovalent
interactions in biological systems: hydrogen bonds, ionic interactions, van der
Waals interactions, and hydrophobic effects [13]. The formation of noncovalent
interactions is energetically favorable and also releases heat, meaning that it is
exothermic (H < 0); breaking such interactions absorbs heat, and is considered
endothermic (H > 0).
There are examples of chemical interactions occurring although they are energet-
ically unfavorable, indicating that there are other factors besides enthalpy impacting
the progression and favorability of an interaction. One such factor is known as
entropy, S. Entropy is a measure of the dispersal of energy in a system, often
referred to as its disorder. On a molecular level, entropy is strongly related to the
degrees of freedom (d.o.f) of a molecule, which describe a molecule’s movement
in space. As such, there are three types of d.o.f to consider: translational degrees
of freedom (specifying the center of mass), rotational d.o.f (specifying orientation),
Protein-Ligand Binding with Applications in Molecular Docking 3
and vibrational d.o.f (specifying the relative positions of various nuclei) [14]. With
higher d.o.f., the atoms are more free to move around and therefore, the disorder
increases. More disordered systems are favored, but as we will see later in this
section, the favorability of a reaction is a compromise between enthalpy and entropy.
Given the two factors that contribute to energetic favorability, Gibbs free energy,
G, is known as an essential thermodynamic quantity that describes the spontaneity
of a reaction under isobaric and isothermal conditions [14]. The change in G is
defined as:
G = H − T S, (1)
where T is the temperature of the system in kelvin and H and S refer to the
enthalpy and entropy, respectively. Processes are spontaneous when G is negative
and non-spontaneous when G is positive. This equation exemplifies the “toss-up”
between enthalpy and entropy in determining the favorability of a reaction: the aim
is to find a balance between decreasing energy and increasing disorder to minimize
G [13, 14].
Thermodynamic factors are important to consider when determining G, but
kinetic factors play an important role as well. In the kinetic context, protein-ligand
binding is dependent on the concentrations of the protein and the ligand (denoted
as [P ] and [L], respectively), as this determines the speed with which protein-
ligand complexes can form (the concentration of the complex is denoted as [P L]).
Michaelis-Menten kinetics [13] introduces the equilibrium association constant, Ka ,
which relates to how quickly the protein-ligand complex can form. This is inversely
proportional to Kd , which relates to how quickly a protein and ligand unattach from
one another.
[P L] 1
Ka = = (2)
[P ][L] Kd
Higher Ka values and lower Kd values are associated with higher binding affinity
and therefore a lower G◦ , which is the standard free energy. Binding free energy
(BFE), G, is the change in free energy when a protein and ligand bind to one
another. It is defined as
G = G◦ + RT lnQ (3)
where R is the gas constant and Q is the reaction quotient, which is equal to Ka
at equilibrium. The extent of the protein-ligand interaction is determined by how
negative the change in free energy is upon binding [7].
2 Computational Methods for Estimating Binding

Free Energy
There are several methods by which binding free energy can be calculated based
on computational analysis of the protein and the ligand, including three principle
methods: alchemical methods, transition path sampling, and end-point methods.
Each of these techniques and recent advancements will be discussed in greater detail
below.
2.1 Alchemical Methods
Free energy is a state function, meaning that the G value does not depend on
the path taken to reach the starting conformation to the bound conformation [15].
Alchemical methods take advantage of this fact and introduce “dummy” molecules,
or other chemical species, to bridge the high-probability region between the
unbound and bound states of the protein and the ligand [7, 16]. The use of these
chemical species are considered non-physical (alchemical) intermediate states,
and their use allows for a robust computational method that can calculate large
G values [16]. The general idea of alchemical free energy calculations can be
summarized by a thermodynamic cycle, shown in Fig. 1. An important parameter in
alchemical free energy calculations is the alchemical reaction coordinate, λ. It helps
define the thermodynamic path connecting the two states and shifts from 0 to 1 as
the coordinate moves.
In the alchemical category, there are two primary methodologies used for free
energy calculations; the first is free energy perturbation (FEP). This method relates
the relative free energy between the unbound and bound protein-ligand complex
to an average function of their energy [17]. Potential energies based on FEP
theory can be averaged over ensembles generated via molecular dynamics (MD)
or Monte Carlo (MC) simulations [17–19]. FEP simulations drive the perturbation
of one molecule into another by using alchemical intermediate states, also known
as windows, that lead one state into the other [7, 19]. Once the calculations are
Fig. 1 The vertical arrows

indicate a biochemical
reaction that shifts the ligand
and receptor from being free
in solution to bound at the
binding site. The horizontal
arrows indicate a
transmutation that
alchemically alters the ligand.
G = GA − GB =
GC − GD
complete, the convergence free energy results of each of the windows is estimated.
One caveat with FEP is that the convergence is only straightforward when U (the
change in potential) between the initial and final states is small.
The second method is known as thermodynamic integration (TI). It calculates the
change in free energy as follows:
λ=1 δH
G = dλ (4)
λ=0 δλ λ
As shown in Fig. 1, the G is the difference between the G of the bound and
unbound states calculated from Eq. 4. The important characteristics of TI are that it
relies on the average force exerted in a system undergoing an alchemical transition,
and integrating it along the alchemical coordinate. This method is often considered
more straightforward than other alchemical approaches to binding free energy [20].
There have been a few techniques developed to try and integrate TI into simulations.
One method is known as slow-growth TI, which progresses through transitions
in an extremely gradual manner and keeps the system very close to equilibrium.
The work done through the transition is equivalent to the free energy change [21].
Another method is discrete TI, which functions very similarly to
FEP
and divides the
thermodynamic path into separate windows. A calculation for δH δλ is done, followed
by the integration necessary to get the value [20, 22, 23].
2.2 Transition Path Sampling Methods
Proteins and ligands undergo several conformational changes in stable states

before reaching the ideal and most stable bound conformation. Graphically, this
is represented by a curve with several local minima until the global minimum at
the end of the curve (i.e., the overall transition path). The transitions between each
of these minima creates a free energy map along the progression of the reaction
pathway (i.e., the reaction coordinate) [7]. This map is known as the potential
of mean force [24]. To analyze the free energy change using these minima and
transition states, MD simulations generate trajectories between the stable states of
the protein and ligand to analyze the nature of the binding mechanism [25]. The
beauty of the transition path sampling (TPS) method is the unbiased nature of
the transition trajectories: a trial trajectory is randomly generated from the initial
trajectory and then analyzed and subject to the Metropolis rule [26]. This rule either
accepts a trial trajectory if it connects two minima or rejects the trial trajectory if it
does not. The ensemble of paths from this analysis can be used to create the reaction
coordinate and free energy landscape, which is known as the potential of mean force
(PMF) [25, 26]. In the context of protein-ligand binding, the PMF-based method
restrains the ligand into the conformation it has while bound, and then translates
it into the binding site and removing the restraints; this technique is particularly
useful with large solvation free energies [2]. The more traditional double decoupling
methods pull the ligand from the binding site back into the solvent. There are also
non-equilibrium methods based on the Jarzynski relationship that can be used to
obtain the BFE [27, 28].
While TPS has many advantages, especially regarding its speed and ability
to analyze rare dynamical events that ordinary MD simulations cannot, there are
disadvantages that also coincide with this technique. One of the primary issues
is the delicacy of defining the stable states. If states are not defined precisely,
conformations can get “stuck” in a specific transition state or in an intermediate
state, which leads to insufficient sampling of the conformations and transition
pathways [7, 26]. Additionally, the unbiased nature of TPS makes it powerful,
however, it is also more difficult to implement because of the bookkeeping necessary
to keep track of both the forward and backward time integrations employed in
this method [26]. There have been some techniques based off of TPS which have
enhanced sampling and led to more accurate and faster PMF calculations. For
example, temperature acceleration method uses high temperatures to accelerate
calculations [29]. Other enhanced sampling techniques include metadynamics [30],
blue-moon sampling [31], umbrella sampling [32], and replica exchange [33].
2.3 End-Point Methods
End-point methods are the final and most efficient G calculation technique
discussed in this chapter. The low expense of these methods relies on the fact that
they sample only the final states of a system (unlike alchemical and path sampling,
which sample non-physical and physical intermediates, respectively) [34]. End-
point methods aim to serve as an intermediate between the speed of scoring
approaches and the accuracy of computationally expensive alchemical and path
sampling methods [35, 36]. Molecular mechanics Poisson–Boltzmann surface area
(MM/PBSA) and molecular mechanics generalized Born surface area (MM/GBSA)
are of the most commonly used end-point methods and are closely related [37]. The
basis of the MM/PB(GB)SA approach relies on some of the ideas discussed in the
introduction to this chapter. The change in binding free energy is:
G = H − T S = GP L − GP − GL (5)
Where P refers to the protein, L to the ligand, and P L to the complex. This equation
can be broken down into more specific contributions based on different interactions
with the protein-ligand system [7, 36, 38]. Specifically, the enthalpic contribution
is the sum of the changes in gas-phase molecular mechanics and the solvation free
energy:
G = EMM + Gsolv − T S (6)

EMM is the gas-phase energy of the solute and consists of the changes in internal
energy (i.e., the changes from bond, angle and dihedral energies), van der Waals
energy (such as the formation and breaking of hydrogen bonds), and the change in
electrostatic energy [36, 38, 39]. Gsolv is the solvation free energy which consists
of both polar and nonpolar components. The polar component is calculated via
PB (or GB) [40–44] and the nonpolar component is calculated using the solvent-
accessible surface area [45, 46]. The −T S term refers to the absolute temperature
multiplied by the entropy, estimated from normal mode analysis (NMA). This
method is quite computationally expensive and is often disregarded in PB/GB
calculations. Recently, a few computational and theoretical techniques have been
introduced to overcome the intractability of NMA calculations, such as system
truncation [39], consideration of distance-dependent dielectric constant [47], and
reduction in translational and rotational freedom of the ligand upon protein-ligand
binding [48].
MD simulations can be employed for MM/PB(GB)SA calculations in one of
three ways [49]. The first way, the separate-trajectory method (3A), uses three
different trajectories to analyze the protein, ligand, and bound complex separately.
Another method is known as the single trajectory (1A) method and uses one
trajectory to generate the ensemble of snapshots [36, 38, 39]. The performance of
the separate versus the single method was found to vary between systems and test
models, but the separate trajectory approach has been shown to have greater standard
errors and uncertainties, leading the 1A method to be generally deemed as more
accurate [50–52]. Another benefit of the single trajectory method is that the change
in bond energy is no longer relevant to EMM as it cancels out [49]. One of the
downfalls with the single trajectory approach is that it disregards structural changes
to the ligand upon binding to the protein/receptor that are potentially important in
getting accurate characterization of the protein-ligand interaction [7, 53]. To remedy
this, there have been suggestions of a “compromise”—the 2A approach, which
employs trajectories to sample both the complex as well as the ligand in order
to account for energetic changes due to restructuring the ligand [49, 53]. The 2A
method has been found to improve results of binding affinity prediction [54].
The computational protocol for MM/PB(GB)SA can be generally described as
follows: first, an explicit solvent model is used to perform an MD simulation on the
protein-ligand complex to get the free energy contributions from the protein, ligand,
and complex and produce the necessary ensemble of snapshots. Then, the solvent
molecules and charged ions are deleted from each snapshot to prepare for the use
of the implicit solvent model [55], and then the MM/PB(GB)SA method is used to
analyze and compute the total Gsolv for each snapshot. The overall Gsolv is the
sum of the individual contributions [7, 34, 39]. This thermodynamic cycle has been
shown in Fig. 2. The use of an explicit solvent model in the MD simulation followed
by the implicit solvent model for calculation might seem inconsistent since different
energy functions are used, but studies have actually found that using an implicit
solvent model for the MD simulation as well has led to less accurate results [56].
The unreliable double implicit model has sometimes led to a dissociation of the
protein or ligand, rendering it less functional than an explicit simulation followed
by an implicit calculation [57].
Fig. 2 The thermodynamic

cycle for calculating the
solvation component of
binding free energy,
Gsolv , using implicit
solvent modeling. The water
environment is shown in blue
and vacuum is in white
3 Protein-Ligand Binding Databases
Several freely available protein-ligand binding databases have been organized to

examine the affinities of ligands to various proteins, with some databases containing
several thousands of protein targets and over one million test compounds. While
these expansive databases are useful for many types of analysis, they are particularly
helpful in the process of drug discovery; similar to molecular docking (discussed
later in the chapter), they allow for pre-screening of candidate drugs to target
proteins. This section will briefly discuss a few of the protein-ligand binding
databases available as web services.
First introduced in 1995, BindingDB [58] primarily focuses on proteins discov-
ered to be drug targets, with one of the goals of the database being a resource to
help design self-assembling systems in vitro [59, 60]. The database has an expansive
query system, allowing for search by structure, name, sequence, pathways, journal
articles, affinity range, and more. One of the interesting features of BindingDB is
that it provides virtual screening on the web service, wherein the user gives a ligand
training set alongside candidate ligands; BindingDB will then use the training set to
rank the candidates [59–61]. As of September 2021, BindingDB has 8,625 protein
targets and 1,006,573 test compounds. This database is a vast and incredibly useful
resource, especially considering its accessibility and embedded features.
Another popular protein-ligand binding database is PDBBind, which aims to
align itself with biomolecular complexes in the Protein Data Bank (PDB). PDBbind
considers only experimentally determined complexes and reports binding data
obtained via experiment [62–65]. Each year, automated programs organize com-
plexes from the PDB database into four categories: protein-small ligand complexes,
nucleic acid-small ligand complexes, protein-nucleic acid complexes, and protein-
protein complexes. Once this categorization is complete, three sets are created from
this information: the general set, refined set, and core set. The general set is the
most broad, containing the biomolecular complexes for which binding affinity data
are provided. The refined set is more specific with complexes from the general set
that undergo several filters to obtain those with higher quality data. Finally, the
core set is further filtered to provide complexes for validating docking and scoring
methods [62–65]. The 2020 update of PDBBind found 78,460 valid complexes,
placed 24,496 in the general set, 5,316 in the refined set, and 285 complexes in the
core set.
The final popular database is Binding Mother of All Databases, better known as
Binding MOAD [66]. Similar to PDBBind, Binding MOAD has also generated sets
based on high-quality data from the PDB. Binding MOAD finds proteins with x-
ray crystal structures, and the subset is said to fall between PDBbind’s general and
refined sets [66, 67]. Binding MOAD provides a myriad of information and data.
The webpage for each protein shows ligand information (both valid and invalid
reports), any available binding data, the chemical structures of the ligands, and
proteins within the 90%, 70%, and 50% homologies. These tools are all extremely
useful because they allow for an easy way to search for similar structures [66, 67].
As of the 2019 release, there are 38,702 protein-ligand structures, 14,324 instances
of binding data, 18,939 ligands, and 10,500 protein families. Overall, Binding
MOAD is one of the most robust protein-ligand databases available, providing large
amounts of information across many types of complexes.
4 Molecular Docking
Molecular docking is a computational method used to determine the binding mode

of ligands when they interact with proteins of known structure. By examining
many protein-ligand binding conformations and orientations—known as poses—
evaluating these poses, and ranking them via a score, docking methods determine
the most likely conformation of a ligand [68]. Molecular docking has shown
applicability in a variety of research areas, particularly computer-aided drug design
(CADD). Docking can be used to screen various compounds as they attach to
their target proteins. It is a powerful tool that can significantly expedite and
cut costs of drug discovery [7]. This section will discuss the main principles
of protein-ligand docking as well as how protein-ligand (PL) docking relates to
concepts discussed earlier in this chapter. The end-goal of PL docking is to achieve
high-throughput screening of various compounds to select the most viable drug
candidates; determining the general nature of in silico binding between a drug and a
human receptor will allow for more efficient drug discovery methods [68–70]. There
are two main parts of molecular docking: sampling to find putative ligand binding
modes (poses), and scoring to rank the poses from sampling phase [69, 71].
The sampling step of PL docking accounts for both the nature of the ligand and
the way it binds to the receptor as well as the flexibility of the protein [69]. Protein-
ligand binding is generally considered to occur through the induced fit model of
protein-ligand interaction. In this model, the protein is considered flexible enough
to slightly change conformation upon a ligand reaching its binding site to better
accommodate the ligand [13, 72, 73]. This model allows for a tighter overall fit and
accounts for proteins and ligands that bind but do not have matching shapes prior
to interaction. Accounting for protein flexibility has shown to be one of the biggest
challenges in PL docking because of the large number of states possible for the
protein as well as the computational cost of analysis for the full conformational
space around the protein [74]. Sampling is done via search algorithms, which
have been developed using energy functions to pull out the putative binding poses
between a protein and ligand [10]. There are many algorithms by which these poses
can be found, and one goal of algorithms currently being developed is to expedite
the search process [10, 75].
The other part of the docking process is the scoring phase, wherein the various
poses are ranked from most to least likely. The scoring function is used to assess the
binding affinity between the protein and ligand after docking; ideally, the scoring
function should predict both the binding free energy as well as allow for high-
throughput screening of various drugs [7]. Different scoring functions have been
developed for docking purposes, and they can be broken down into three principal
groups: empirically-based functions, knowledge-based functions, and force-field
based functions.
Empirically-based scoring functions are perhaps the most simplistic of those
listed above. This model involves getting a descriptor for the binding, a training
set of a vast number and types of protein-ligand complexes with the experimental
binding affinities, and a regression, classification, or machine learning algorithm
to help form a relationship between the descriptor(s) and the experimental affinity
values [76, 77]. Empirical scoring functions rely on many different energy terms
to determine favorability; these types of functions are more simplistic in that they
do not consider the underlying physical interactions of the protein and ligand. Some
areas of research within empirically-based functions include finding accurate energy
terms to incorporate into the function as well as avoiding overfitting energy terms,
which is an issue that stems from the large number of energy terms used in empirical
functions [78].
Knowledge-based scoring functions are statistically-based functions which come
from the assumption that interactions between the protein and ligand that occur
more frequently than expected by random chance contribute more favorably to the
binding affinity [79]. These types of scoring functions also rely on training sets
to find the statistical potentials of various interactions. One of the biggest issues
with knowledge-based scoring functions is their computational implementation,
however, they serve as a “happy medium” between empirical and force-field based
scoring functions due to their relative swiftness and simplicity, as well as their ability
to avoid some of the challenges posed by empirical functions [80].
Force-field based approaches account for the underlying physical interactions
occurring between binding partners. As mentioned previously, entropic calculations
are computationally expensive; thus, they are often ignored. However, it is important
to acknowledge that entropic effects can play a major role in the binding affinity,
generating a need for a more efficient method to find the entropic contribution.
Force-field based methods are of particular relevance because they can employ
the implicit solvent models within PBSA and GBSA to account for the physical
interactions of the solvent, improving the accuracy with which the binding affinity
can be predicted [81, 82]. One of the benefits of using MM/PB(GB)SA is that such
methods can explore structure-activity relationships within ligands as well as their
selectivity profiles, and are less computationally demanding than FEP techniques.
Additionally, computation time can be decreased by modifying MM/PB(GB)SA to
shorter MD simulations or other parameter changes, allowing for a more flexible

method of calculation that can evaluate multiple complexes (i.e. multiple binding
poses) at once [80, 83]. Although this method is perhaps the most complex and
computationally expensive of the three types of functions discussed, the consid-
eration of the physical interactions involved in the binding mechanisms leads to
more specific binding affinity values and therefore more accurate rankings within
the scoring phase of docking.
The versatile nature of molecular docking and the different algorithms that have
been developed to try and maximize the accuracy and efficiency of this technique
has led to a vast number of available online web services and downloadable
software for docking. One of the older freewares developed for docking is Docking
wIth eVolutionary AlgorIthms, also known as DIVALI. It was developed in 1995
based off AMBER-type potential functions within the search algorithm [84]. More
modern docking resources include the Automated Active Site Detection, Docking,
and Scoring (AADS) protocol which uses an MC based method to construct a
robust web service with accurate results for protein-ligand binding [85]. AutoDock
and AutoDock Vina from the Scripps Research Institute are perhaps two of the
most commonly used docking software; Vina was developed as a faster and more
accurate version of AutoDock. Vina and Autodock each have their own advantages;
comparative studies between Vina and AutoDock 4 (the most recent version of
AutoDock) have found that Vina is better at predicting the experimental protein-
ligand binding poses, but AutoDock calculates more accurate and precise binding
energies [86, 87]. The variation among available docking software allows for a lot
of versatility within protein-ligand calculations.
5 Conclusion
Computational study of protein-ligand binding is useful in several fields, including

drug design and discovery. In this chapter, we presented a general background to
protein-ligand binding with emphasis on binding free energy calculation. Leading
computational methods, binding databases, and relevant applications to molecular
docking have been introduced and discussed thoroughly.
Acknowledgments This works has been partially supported by a California State University
Program for Education and Research in Biotechnology (CSUPERB) New Investigator funding
awarded to N.F.
References
1. W. L. Jorgensen, “The many roles of computation in drug discovery,” Science, vol. 303,
no. 5665, pp. 1813–1818, 2004.
2. H.-J. Woo and B. Roux, “Calculation of absolute protein–ligand binding free energy from
computer simulations,” Proceedings of the National Academy of Sciences, vol. 102, no. 19,
pp. 6825–6830, 2005.
3. M. M. Pierce, C. Raman, and B. T. Nall, “Isothermal titration calorimetry of protein–protein
interactions,” Methods, vol. 19, no. 2, pp. 213–221, 1999.
4. R. Karlsson and A. Fält, “Experimental design for kinetic analysis of protein-protein inter-
actions with surface plasmon resonance biosensors,” Journal of Immunological Methods,
vol. 200, no. 1-2, pp. 121–133, 1997.
5. A. M. Rossi and C. W. Taylor, “Analysis of protein-ligand interactions by fluorescence
polarization,” Nature Protocols, vol. 6, no. 3, pp. 365–387, 2011.
6. M. Jerabek-Willemsen, T. André, R. Wanner, H. M. Roth, S. Duhr, P. Baaske, and D. Breit-
sprecher, “Microscale thermophoresis: Interaction analysis and beyond,” Journal of Molecular
Structure, vol. 1077, pp. 101–113, 2014.
7. X. Du, Y. Li, Y.-L. Xia, S.-M. Ai, J. Liang, P. Sang, X.-L. Ji, and S.-Q. Liu, “Insights
into protein–ligand interactions: mechanisms, models, and methods,” International Journal
of Molecular Sciences, vol. 17, no. 2, p. 144, 2016.
8. M. K. Gilson and H.-X. Zhou, “Calculation of protein-ligand binding affinities,” Annual Review
of Biophysics and Biomolecular Structure, vol. 36, pp. 21–42, 2007.
9. S. J. Y. Macalino, V. Gosu, S. Hong, and S. Choi, “Role of computer-aided drug design in
modern drug discovery,” Archives of Pharmacal Research, vol. 38, no. 9, pp. 1686–1701, 2015.
10. P. H. Torres, A. C. Sodero, P. Jofily, and F. P. Silva-Jr, “Key topics in molecular docking for
drug design,” International Journal of Molecular Sciences, vol. 20, no. 18, p. 4574, 2019.
11. J. Li, A. Fu, and L. Zhang, “An overview of scoring functions used for protein–ligand
interactions in molecular docking,” Interdisciplinary Sciences: Computational Life Sciences,
vol. 11, no. 2, pp. 320–328, 2019.
12. H. Li, Y. Xie, C. Liu, and S. Liu, “Physicochemical bases for protein folding, dynamics, and
protein-ligand binding,” Science China Life Sciences, vol. 57, no. 3, pp. 287–302, 2014.
13. R. Miesfeld and M. McEvoy, Biochemistry. W.W. Norton, 2017.
14. D. A. McQuarrie and J. D. Simon, Physical Chemistry: a Molecular Approach, vol. 1.
University science books Sausalito, CA, 1997.
15. J. D. Chodera, D. L. Mobley, M. R. Shirts, R. W. Dixon, K. Branson, and V. S. Pande,
“Alchemical free energy methods for drug discovery: progress and challenges,” Current
Opinion in Structural Biology, vol. 21, no. 2, pp. 150–160, 2011.
16. A. S. Mey, B. Allen, H. E. B. Macdonald, J. D. Chodera, M. Kuhn, J. Michel, D. L. Mobley,
L. N. Naden, S. Prasad, A. Rizzi, et al., “Best practices for alchemical free energy calculations,”
arXiv preprint arXiv:2008.03067, 2020.
17. W. L. Jorgensen and L. L. Thomas, “Perspective on free-energy perturbation calculations for
chemical equilibria,” Journal of Chemical Theory and Computation, vol. 4, no. 6, pp. 869–876,
2008.
18. Y. Meng, D. Sabri Dashti, and A. E. Roitberg, “Computing alchemical free energy differences
with Hamiltonian replica exchange molecular dynamics (H-REMD) simulations,” Journal of
Chemical Theory and Computation, vol. 7, no. 9, pp. 2721–2727, 2011.
19. W. Jespers, M. Esguerra, J. Åqvist, and H. Gutiérrez-de Terán, “Qligfep: an automated
workflow for small molecule free energy calculations in q,” Journal of Cheminformatics,
vol. 11, no. 1, pp. 1–16, 2019.
20. V. Gapsys, S. Michielssens, J. H. Peters, B. L. de Groot, and H. Leonov, “Calculation of binding
free energies,” in Molecular Modeling of Proteins, pp. 173–209, Springer, 2015.
21. M. J. Mitchell and J. A. McCammon, “Free energy difference calculations by thermodynamic
integration: difficulties in obtaining a precise value,” Journal of Computational Chemistry,
vol. 12, no. 2, pp. 271–275, 1991.
22. M. Jorge, N. M. Garrido, A. J. Queimada, I. G. Economou, and E. A. Macedo, “Effect of the
integration method on the accuracy and computational efficiency of free energy calculations
using thermodynamic integration,” Journal of Chemical Theory and Computation, vol. 6, no. 4,
pp. 1018–1027, 2010.
23. S. Bruckner and S. Boresch, “Efficiency of alchemical free energy simulations. I. A practical
comparison of the exponential formula, thermodynamic integration, and Bennett’s acceptance
ratio method,” Journal of Computational Chemistry, vol. 32, no. 7, pp. 1303–1319, 2011.
24. W. You, Z. Tang, and C.-e. A. Chang, “Potential mean force from umbrella sampling
simulations: What can we learn and what is missed?,” Journal of Chemical Theory and
Computation, vol. 15, no. 4, pp. 2433–2443, 2019.
25. S. Wan, A. P. Bhati, S. J. Zasada, and P. V. Coveney, “Rapid, accurate, precise and reproducible
ligand–protein binding free energy prediction,” Interface Focus, vol. 10, no. 6, p. 20200007,
2020.
26. P. Bolhuis and C. Dellago, “Practical and conceptual path sampling issues,” The European
Physical Journal Special Topics, vol. 224, no. 12, pp. 2409–2427, 2015.
27. C. Jarzynski, “Nonequilibrium equality for free energy differences,” Physical Review Letters,
vol. 78, no. 14, p. 2690, 1997.
28. C. F. Narambuena, D. M. Beltramo, and E. P. Leiva, “Polyelectrolyte adsorption on a charged
surface. free energy calculation from Monte Carlo simulations using Jarzynski equality,”
Macromolecules, vol. 41, no. 21, pp. 8267–8274, 2008.
29. L. Maragliano and E. Vanden-Eijnden, “A temperature accelerated method for sampling free
energy and determining reaction pathways in rare events simulations,” Chemical Physics
Letters, vol. 426, no. 1-3, pp. 168–175, 2006.
30. A. Laio and M. Parrinello, “Escaping free-energy minima,” Proceedings of the National
Academy of Sciences, vol. 99, no. 20, pp. 12562–12566, 2002.
31. G. Ciccotti, R. Kapral, and E. Vanden-Eijnden, “Blue moon sampling, vectorial reaction
coordinates, and unbiased constrained dynamics,” ChemPhysChem, vol. 6, no. 9, pp. 1809–
1814, 2005.
32. J.-F. St-Pierre, M. Karttunen, N. Mousseau, T. Rog, and A. Bunker, “Use of umbrella sampling
to calculate the entrance/exit pathway for z-pro-prolinal inhibitor in prolyl oligopeptidase,”
Journal of Chemical Theory and Computation, vol. 7, no. 6, pp. 1583–1594, 2011.
33. M. Fajer, D. Hamelberg, and J. A. McCammon, “Replica-exchange accelerated molecular
dynamics (REXAMD) applied to thermodynamic integration,” Journal of Chemical Theory
and Computation, vol. 4, no. 10, pp. 1565–1569, 2008.
34. E. Wang, H. Sun, J. Wang, Z. Wang, H. Liu, J. Z. Zhang, and T. Hou, “End-point binding
free energy calculation with MM/PBSA and MM/GBSA: strategies and applications in drug
design,” Chemical Reviews, vol. 119, no. 16, pp. 9478–9508, 2019.
35. E. A. Rifai, M. van Dijk, and D. P. Geerke, “Recent developments in linear interaction energy
based binding free energy calculations,” Frontiers in Molecular Biosciences, vol. 7, p. 114,
2020.
36. S. Genheden and U. Ryde, “Comparison of the efficiency of the lie and MM/GBSA methods
to calculate ligand-binding energies,” Journal of Chemical Theory and Computation, vol. 7,
no. 11, pp. 3768–3778, 2011.
37. H. Gohlke and D. A. Case, “Converging free energy estimates: Mm-pb (gb) sa studies on the
protein–protein complex ras–raf,” Journal of Computational Chemistry, vol. 25, no. 2, pp. 238–
250, 2004.
38. J. Srinivasan, T. E. Cheatham, P. Cieplak, P. A. Kollman, and D. A. Case, “Continuum solvent
studies of the stability of DNA, RNA, and phosphoramidate- DNA helices,” Journal of the
American Chemical Society, vol. 120, no. 37, pp. 9401–9409, 1998.
39. N. Forouzesh and N. Mishra, “An effective MM/GBSA protocol for absolute binding free
energy calculations: A case study on sars-cov-2 spike protein and the human ace2 receptor,”
Molecules, vol. 26, no. 8, p. 2383, 2021.
40. A. Onufriev, D. Bashford, and D. A. Case, “Modification of the generalized born model suitable
for macromolecules,” J. Phys. Chem. B, vol. 104, no. 15, pp. 3712–3720, 2000.
41. N. Forouzesh, S. Izadi, and A. V. Onufriev, “Grid-based surface generalized born model
for calculation of electrostatic binding free energies,” Journal of Chemical Information and
Modeling, vol. 57, no. 10, pp. 2505–2513, 2017.
42. N. Forouzesh, A. Mukhopadhyay, L. T. Watson, and A. V. Onufriev, “Multidimensional global

optimization and robustness analysis in the context of protein–ligand binding,” Journal of
Chemical Theory and Computation, vol. 16, no. 7, pp. 4669–4684, 2020.
43. C. Tan, L. Yang, and R. Luo, “How well does Poisson- Boltzmann implicit solvent agree with
explicit solvent? a quantitative analysis,” Journal of Physical Chemistry B, vol. 110, no. 37,
pp. 18680–18687, 2006.
44. D. Chen, Z. Chen, C. Chen, W. Geng, and G.-W. Wei, “MIBPB: a software package for
electrostatic analysis,” Journal of Computational Chemistry, vol. 32, no. 4, pp. 756–770, 2011.
45. M. K. Gilson and B. Honig, “Calculation of the total electrostatic energy of a macromolecular
system: solvation energies, binding energies, and conformational analysis,” Proteins: Structure,
Function, and Bioinformatics, vol. 4, no. 1, pp. 7–18, 1988.
46. J. Wang, T. Hou, and X. Xu, “Recent advances in free energy calculations with a combination
of molecular mechanics and continuum models,” Current Computer-Aided Drug Design, vol. 2,
no. 3, pp. 287–306, 2006.
47. S. Genheden, O. Kuhn, P. Mikulskis, D. Hoffmann, and U. Ryde, “The normal-mode entropy
in the MM/GBSA method: effect of system truncation, buffer region, and dielectric constant,”
Journal of Chemical Information and Modeling, vol. 52, no. 8, pp. 2079–2088, 2012.
48. I. Y. Ben-Shalom, S. Pfeiffer-Marek, K.-H. Baringhaus, and H. Gohlke, “Efficient approxima-
tion of ligand rotational and translational entropy changes upon binding for use in mm-pbsa
calculations,” Journal of Chemical Information and Modeling, vol. 57, no. 2, pp. 170–189,
2017.
49. S. Genheden and U. Ryde, “The MM/PBSA and MM/GBSA methods to estimate ligand-
binding affinities,” Expert Opinion on Drug Discovery, vol. 10, no. 5, pp. 449–461, 2015.
50. S. Genheden and U. Ryde, “Comparison of end-point continuum-solvation methods for
the calculation of protein–ligand binding free energies,” Proteins: Structure, Function, and
Bioinformatics, vol. 80, no. 5, pp. 1326–1342, 2012.
51. P. Mikulskis, S. Genheden, and U. Ryde, “Effect of explicit water molecules on ligand-binding
affinities calculated with the MM/GBSA approach,” Journal of Molecular Modeling, vol. 20,
no. 6, pp. 1–11, 2014.
52. D. A. Pearlman, “Evaluating the molecular mechanics Poisson- Boltzmann surface area free
energy method using a congeneric series of ligands to p38 map kinase,” Journal of Medicinal
Chemistry, vol. 48, no. 24, pp. 7796–7807, 2005.
53. J. M. Swanson, R. H. Henchman, and J. A. McCammon, “Revisiting free energy calculations:
a theoretical connection to MM/PBSA and direct calculation of the association free energy,”
Biophysical Journal, vol. 86, no. 1, pp. 67–74, 2004.
54. C.-Y. Yang, H. Sun, J. Chen, Z. Nikolovska-Coleska, and S. Wang, “Importance of ligand
reorganization free energy in protein- ligand binding-affinity prediction,” Journal of the
American Chemical Society, vol. 131, no. 38, pp. 13709–13721, 2009.
55. A. Onufriev, “Implicit solvent models in molecular dynamics simulations: A brief overview,”
Annual Reports in Computational Chemistry, vol. 4, pp. 125–137, 2008.
56. A. Weis, K. Katebzadeh, P. Söderhjelm, I. Nilsson, and U. Ryde, “Ligand affinities predicted
with the MM/PBSA method: dependence on the simulation method and the force field,”
Journal of Medicinal Chemistry, vol. 49, no. 22, pp. 6596–6606, 2006.
57. F. Godschalk, S. Genheden, P. Söderhjelm, and U. Ryde, “Comparison of MM/GBSA
calculations based on explicit and implicit solvent simulations,” Physical Chemistry Chemical
Physics, vol. 15, no. 20, pp. 7731–7739, 2013.
58. T. Liu, Y. Lin, X. Wen, R. N. Jorissen, and M. K. Gilson, “BindingDB: a web-accessible
database of experimentally determined protein–ligand binding affinities,” Nucleic Acids
Research, vol. 35, no. suppl_1, pp. D198–D201, 2007.
59. X. Chen, Y. Lin, M. Liu, and M. K. Gilson, “The binding database: data management and
interface design,” Bioinformatics, vol. 18, no. 1, pp. 130–139, 2002.
60. X. Chen, Y. Lin, and M. K. Gilson, “The binding database: overview and user’s guide,”
Biopolymers: Original Research on Biomolecules, vol. 61, no. 2, pp. 127–141, 2001.
61. M. K. Gilson, T. Liu, M. Baitaluk, G. Nicola, L. Hwang, and J. Chong, “BindingDB in 2015: a
public database for medicinal chemistry, computational chemistry and systems pharmacology,”
Nucleic Acids Research, vol. 44, no. D1, pp. D1045–D1053, 2016.
62. R. Wang, X. Fang, Y. Lu, and S. Wang, “The PDBbind database: Collection of binding affinities
for protein- ligand complexes with known three-dimensional structures,” Journal of Medicinal
63. R. Wang, X. Fang, Y. Lu, C.-Y. Yang, and S. Wang, “The PDBbind database: methodologies
and updates,” Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 4111–4119, 2005.
64. M. Su, Q. Yang, Y. Du, G. Feng, Z. Liu, Y. Li, and R. Wang, “Comparative assessment of
scoring functions: the casf-2016 update,” Journal of Chemical Information and Modeling,
vol. 59, no. 2, pp. 895–913, 2018.
65. Z. Liu, M. Su, L. Han, J. Liu, Q. Yang, Y. Li, and R. Wang, “Forging the basis for developing
protein–ligand interaction scoring functions,” Accounts of Chemical Research, vol. 50, no. 2,
pp. 302–309, 2017.
66. L. Hu, M. L. Benson, R. D. Smith, M. G. Lerner, and H. A. Carlson, “Binding moad (mother of
all databases),” Proteins: Structure, Function, and Bioinformatics, vol. 60, no. 3, pp. 333–340,
2005.
67. R. D. Smith, J. J. Clark, A. Ahmed, Z. J. Orban, J. B. Dunbar Jr, and H. A. Carlson, “Updates
to binding moad (mother of all databases): polypharmacology tools and their utility in drug
repurposing,” Journal of Molecular Biology, vol. 431, no. 13, pp. 2423–2433, 2019.
68. X. Zhang, H. Perez-Sanchez, and F. C Lightstone, “A comprehensive docking and MM/GBSA
rescoring study of ligand recognition upon binding antithrombin,” Current Topics in Medicinal
69. S.-Y. Huang and X. Zou, “Advances and challenges in protein-ligand docking,” International
Journal of Molecular Sciences, vol. 11, no. 8, pp. 3016–3034, 2010.
70. S. F Sousa, N. MFSA Cerqueira, P. A Fernandes, and M. Joao Ramos, “Virtual screening
in drug design and development,” Combinatorial Chemistry & High Throughput Screening,
vol. 13, no. 5, pp. 442–453, 2010.
71. R. G. Coleman, M. Carchia, T. Sterling, J. J. Irwin, and B. K. Shoichet, “Ligand pose and
orientational sampling in molecular docking,” PloS One, vol. 8, no. 10, p. e75992, 2013.
72. K. A. Johnson, “Role of induced fit in enzyme specificity: a molecular forward/reverse switch,”
Journal of Biological Chemistry, vol. 283, no. 39, pp. 26297–26301, 2008.
73. N. Forouzesh, M. R. Kazemi, and A. Mohades, “Structure-based analysis of protein binding
pockets using von Neumann entropy,” in International Symposium on Bioinformatics Research
and Applications, pp. 301–309, Springer, 2014.
74. S. F. Sousa, A. J. Ribeiro, J. Coimbra, R. Neves, S. Martins, N. Moorthy, P. Fernandes, and
M. Ramos, “Protein-ligand docking in the new millennium–a retrospective of 10 years in the
field,” Current Medicinal Chemistry, vol. 20, no. 18, pp. 2296–2314, 2013.
75. I. Halperin, B. Ma, H. Wolfson, and R. Nussinov, “Principles of docking: An overview
of search algorithms and a guide to scoring functions,” Proteins: Structure, Function, and
Bioinformatics, vol. 47, no. 4, pp. 409–443, 2002.
76. I. A. Guedes, F. S. Pereira, and L. E. Dardenne, “Empirical scoring functions for structure-
based virtual screening: applications, critical aspects, and challenges,” Frontiers in Pharma-
cology, vol. 9, p. 1089, 2018.
77. L. P. Pason and C. A. Sotriffer, “Empirical scoring functions for affinity prediction of protein-
ligand complexes,” Molecular Informatics, vol. 35, no. 11-12, pp. 541–548, 2016.
78. S.-Y. Huang, S. Z. Grinter, and X. Zou, “Scoring functions and their evaluation methods for
protein–ligand docking: recent advances and future directions,” Physical Chemistry Chemical
Physics, vol. 12, no. 40, pp. 12899–12908, 2010.
79. I. Muegge, “PMF scoring revisited,” Journal of Medicinal Chemistry, vol. 49, no. 20, pp. 5895–
5902, 2006.
80. S. Z. Grinter and X. Zou, “Challenges, applications, and recent advances of protein-ligand
docking in structure-based drug design,” Molecules, vol. 19, no. 7, pp. 10150–10176, 2014.
81. W. C. Still, A. Tempczyk, R. C. Hawley, and T. Hendrickson, “Semianalytical treatment

of solvation for molecular mechanics and dynamics,” J. Am. Chem. Soc., vol. 112, no. 16,
pp. 6127–6129, 1990.
82. G. D. Hawkins, C. J. Cramer, and D. G. Truhlar, “Pairwise solute descreening of solute charges
from a dielectric medium,” Chemical Physics Letters, vol. 246, no. 1-2, pp. 122–129, 1995.
83. C. Granchi, M. Lapillo, S. Glasmacher, G. Bononi, C. Licari, G. Poli, M. El Boustani,
I. Caligiuri, F. Rizzolio, J. Gertsch, et al., “Optimization of a benzoylpiperidine class identifies
a highly potent and selective reversible monoacylglycerol lipase (magl) inhibitor,” Journal of
Medicinal Chemistry, vol. 62, no. 4, pp. 1932–1958, 2019.
84. K. P. Clark, “Flexible ligand docking without parameter adjustment across four ligand–receptor
complexes,” Journal of Computational Chemistry, vol. 16, no. 10, pp. 1210–1226, 1995.
85. T. Singh, D. Biswas, and B. Jayaram, “Aads-an automated active site identification, docking,
and scoring protocol for protein targets based on physicochemical descriptors,” Journal of
Chemical Information and Modeling, vol. 51, no. 10, pp. 2515–2527, 2011.
86. N. T. Nguyen, T. H. Nguyen, T. N. H. Pham, N. T. Huy, M. V. Bay, M. Q. Pham, P. C. Nam,
V. V. Vu, and S. T. Ngo, “Autodock vina adopts more accurate binding poses but autodock4
forms better binding affinity,” Journal of Chemical Information and Modeling, vol. 60, no. 1,
pp. 204–211, 2019.
87. S. Forli, R. Huey, M. E. Pique, M. F. Sanner, D. S. Goodsell, and A. J. Olson, “Computational
protein–ligand docking and virtual drug screening with the autodock suite,” Nature Protocols,
vol. 11, no. 5, pp. 905–919, 2016.
Explaining Small Molecule Binding
Specificity with Volumetric
Representations of Protein Binding Sites
Ziyi Guo and Brian Y. Chen
1 Introduction
Biological systems depend on proteins to perform almost every chemical function

in the cell. The catalysis of essential reactions, the transport of critical molecules,
the mechanical integrity of cells and many other functions rely on these diverse
and specialized worker molecules. Specialization is apparent in the way proteins
interact with other molecules: While there are tens of thousands of unique molecules
in the cell, most proteins only bind a narrow range of partners. This property, of
preferentially forming interactions with select molecules, is called specificity and it
organizes proteins into teams that function robustly even though they share crowded
cellular spaces with many unrelated molecules. Building an understanding of the
mechanisms that achieve specificity is a common goal in many areas of molecular
biology because it could reveal how teams of molecules function and how they
might be manipulated or reengineered for medical or industrial purposes. This
chapter aims to describe current computational methods by which specificity, and
the mechanisms that govern it, may be discovered.
The challenge of uncovering the function and specificity of all proteins is
staggering. In the human body alone there exist tens of thousands of unique proteins
and each is a complex machine that acts through multiple biophysical phenomena
to achieve particular functions. Some proteins achieve specificity using precise
structural complementarity. Others use the hydrophobic effect, the attraction and
repulsion of electric fields, or combinations of these and other phenomena. This
mechanistic complexity, coupled with the sheer number of proteins, is the reason
why the mechanisms controlling the binding preferences of most proteins remain
Z. Guo · B. Y. Chen ()

Lehigh University, Bethlehem, PA, USA
e-mail: zig312@lehigh.edu; byc210@lehigh.edu

18 Z. Guo and B. Y. Chen
uncertain: The space of hypothetical interactions and interaction mechanisms is

enormous, and exhaustive testing is out of the question. Despite this difficulty, the
expert insights of investigators in structural and molecular biology steadily reveal
the mechanisms of action employed by several more families of proteins every year.
We make progress in part because what we learn from the mechanisms of past
proteins has always informed the way we study the next protein. Crystallography
has revealed many interactions where close steric or electrostatic complementarity
occurs between molecules that bind tightly (e.g. [77]). Evolutionary conservation
exposes the presence of functional regions on molecular surfaces (e.g. [55]). These
lessons alert us to ways in which new interactions might occur. By systematizing
and reapplying these concepts, computational methods can be tools for hypothesis
generation, thereby narrowing the set of experiments necessary to discover binding
preferences and mechanisms. For example, algorithms for function annotation
automate the identification of geometrically similar functional sites in an effort to
identify proteins that catalyze the same reaction. This chapter reviews an emerging
subset of function annotation methods that specialize in specificity annotation,
the algorithmic prediction, deconstruction and functional analysis of the structural
mechanisms that achieve binding specificity. We maintain a focus on specificity
annotation for interactions between proteins and small molecules ( ligands), for
which a variety of recent techniques have been developed.
1.1 Comparison Algorithms for Examining Specificity
Specificity is a property that is fundamentally defined by comparison: A protein

prefers one ligand because it binds with less affinity to other ligands, not because of
absolute affinity. These preferences arise because, for example, the preferred ligand
has a shape that is more complementary than other ligands, not because of absolute
complementarity. The comparative nature of specificity, as a phenomenon, makes
protein sequence and structure comparison algorithms an ideal class of methods for
examining specificity.
At least two prediction problems in the field of specificity annotation relate to
protein-ligand binding specificity. Here, we refer to these problems using descriptive
names to clarify the aims of existing methods, even though consensus surrounding
the actual terminology in this emerging field has yet to be reached. The most
commonly studied problem is the specificity assignment problem of predicting the
binding preferences of a protein with unknown specificity. A second and emerging
category of problems is the component localization problem, the challenge of
identifying amino acids and other elements of protein sequence and structure
that influence specificity. Using structure comparison algorithms, some approaches
achieve objectives in multiple categories.
All protein structure comparison techniques share four integrated components,
which are the focus of specific methodological advancements in the field. First,
they employ a digital representation of protein structure that explicitly represents
Explaining Specificity with Volumetric Representations 19
the geometric and chemical aspects of the proteins being compared. Second,
the digital representation is paired with a comparison metric, which is used to
evaluate similarity between representations of multiple protein structures. Third,
a comparison algorithm is developed to identify the fairest comparison of two or
more proteins based on the metric. Finally, a statistical model is used to evaluate
the significance of the final measurement relative to baseline structural noise.
Enhancements in representations, metrics, comparison algorithms and statistical
models, while sometimes presented together as a singular method, represent
progress in the field towards the solution of problems in specificity annotation.
Due to their integrated nature, novel designs for some components, especially new
representations of protein structure, can represent major advancements to the field,
because they can permit new analytical capabilities, as we will discuss later in this
chapter.
2 Specificity Assignment
The prediction of small molecule binding specificity is founded on the principle

that ligand binding sites that are sufficiently similar to a site on another protein
will exhibit the same binding preferences. Therefore, if a geometric comparison
algorithm can find proteins that have binding sites with identical atoms in nearly
identical places, then we predict that those proteins prefer to catalyze the same
reaction on the same binding partners. It should be noted that some proteins catalyze
reactions with the same substrates using different molecular mechanisms. That
situation does not preclude the prediction of specificity based on similarity: it simply
means that the lack of similarity is not an indicator for different binding preferences.
This overall approach to structure comparison, with its focus on finding similar
binding sites, originates from and overlaps with earlier work intended to find
proteins with similar functions based on similar binding sites (e.g. function pre-
diction). Below, we review many methods developed for binding site comparison,
even if some have not been explicitly tested for identifying proteins with identical
specificity. In many cases, such methods could be modified or refined for specificity
assignment. Excluding them would paint an incomplete picture of the space of
technologies that can identify proteins with similar binding preferences. This section
organizes advancements in specificity assignment around the four common compo-
nents of structure comparison algorithms: First, we describe several computational
representations of binding sites (Sect. 2.1). We then describe the standard metric of
structural comparison, LRMSD, used to measure how similar two binding sites are
(Sect. 2.2). Algorithms in Sect. 2.3 describe how matching binding sites are found.
Finally, we describe several types of statistical models used to interpret whether
the matches indicate that matched binding sites exhibit similar binding preferences
(Sect. 2.4).
The prediction accuracy of most specificity assignment algorithms can be
evaluated in several ways. One way is to a benchmark against a nomenclature of
enzymes developed by the Nomenclature Committee of the International Union of

Biochemistry and Molecular Biology. This classification of an individual protein
is generally called the Enzyme Committee number (EC number), and it is written
as four period-delimited integers, w.x.y.z. The integer w denotes the broadest class
of chemical reactions catalyzed by proteins. x and y denote narrower categories
of distinct reactions. The integer z is used to classify proteins with identical
function and generally different specificity [84]. A specificity assignment method
thus operates correctly when it predicts the four digit EC number of a given protein
structure.
2.1 Binding Site Representations
The comparison of protein-ligand binding sites depends on a digital representation

called a motif that describes the atoms critical for binding and catalysis. Some
methods refer to this representation as a template (e.g. [2]). These atoms or amino
acids must first be selected for comparison, most frequently through literature
search [9], manual selection by biochemical experts [10, 41], or from databases
of condensed active site information [28, 69]. Other motifs selections are based
on the analysis of structure or sequence data, such as the largest cavity [80], or
using evolutionarily significant amino acids close to known ligand binding sites
[10, 29, 47]. The correct selection of atoms is essential for prediction accuracy in
specificity and function annotation, as even small variations can lead to considerable
variations in prediction accuracy [22]. Fortunately, several methods have been
developed to automatically refine motif designs [7, 18, 20–22, 66]. Once the atoms
or amino acids are selected, several binding site representations have been developed
to enable the comparison of the site.
The earliest specificity assignment and function prediction algorithms represent
binding sites with groups of points in three dimensions that describe the positions of
individual atoms. Labels have been used to identify points in space as representing
alpha carbons [10, 33], sidechain atoms [1, 60], or conserved binding patterns [52,
53]. Labeled points have also been used to represent molecular surfaces [51, 67, 68]
as well as electrostatic potentials on the molecular surface [43].
To reduce matches between unrelated binding sites, motifs of labeled points
were enhanced with additional information. These enhancements began with more
sophisticated labels, such as label sets that represent substitutions that may occur
in major evolutionary divergences [2, 10] (Fig. 1a). Other enhancements include
the incorporation of hinges that could partially represent the flexibility of protein
structures [54], pairs of points to describe sidechain orientation [31] (Fig. 1b), and
alpha shapes [30, 38, 80, 81] or spheres [19, 20] (Fig. 1c) that describe the empty
interior of a binding site. When the atoms of a protein match the motif, spheres
can be used to detect if other atoms occupy the region that must remain empty
to accommodate the ligand (Fig. 1c). If the ligand binding regions are not empty,
then the match can be eliminated. This elimination strategy can eliminate 80% of
a) b) c)
{g,v,c}
{r,k}
{f,w,y}
{a,v,i,l}
Fig. 1 Examples of point-based motif designs. (a) Multiple amino acid labels describe variations
that occur in major evolutionary divergences. (b) Vectors at each motif point encode sidechain
orientation. (c) Cavity spheres denote regions of the binding cavity that must remain empty in the
matching structure, to accommodate the ligand
statistically significant false positives while maintaining 87% of true positives (some
true positives are also lost in the process) [19].
Beyond the enhancement of motifs with additional information, algorithms have
also been developed to better select the points in the motif, to achieve superior
prediction accuracy. By repeatedly comparing variations of potential motifs against
a nonredundant subset of the PDB, Geometric Sieving [22] identifies motifs that
exhibit greater sensitivity than if an arbitrary set of points near the binding site were
used for the motif. Composite motifs use multiple structures of the same binding
site atoms to generate averaged structures can exhibit greater similarity to proteins
with similar binding preferences [18]. These composite motifs can be significantly
extended to represent the same protein in different conformations and variations of
the same family of proteins with different binding specificities [8].
Spherical harmonics have also been developed as an alternative representation
of protein-ligand binding pockets [58]. These infinitely differentiable single-value
functions on two polar coordinates can represent some three dimensional regions,
as long as they fulfill the star-convexity property: A region r is star-convex if there
exists a point p0 ∈ r such that for every point p ∈ r, all points on the segment
p0 p are inside r. Whereas the comparison of point-based binding sites requires
many pairs of corresponding points to be found and aligned, spherical harmonics
can be compared by finding optimal rotational superpositions. Spherical harmonics
can also be used to represent ligands and other small molecules.
2.2 Metrics for Binding Site Comparison
Structural similarity between point-based motifs is almost always measured using

least root mean squared distance (LRMSD). This measure is evaluated between n
pairs of corresponding points {m1 , p1 }, {m2 , p2 }, . . . {mn , pn }, where mi is a motif
point and pi is a point from the matching protein. RMSD is evaluated with the
expression:

1 n
RMSD = d(mi , pi ),
n
1
where d(mi , pi ) is the Euclidean distance between two points. LRMSD is the
minimal value of RMSD when over all rigid transformations of the motif, which
can be rapidly determined in linear time using eigenvector [44, 82] or quaternion
[27] methods. Because of the minimal nature of LRMSD, measuring LRMSD yields
both a measurement, in Angstroms, and a rigid superposition of one structure onto
the other. Many methods refer to LRMSD as RMSD, because there is no value in
using RMSD as a comparison metric when it has not been minimized.
Many methods treat this core element of protein structure comparison as a black
box, focusing instead on the selection of atoms to align and on the algorithm for
rapidly finding atoms that yield low LRMSD alignments. Since LRMSD amounts
to measuring the minimum geometric mean of the distances between points, newer
methods have developed additional geometric criteria to exclude other biologically
irrelevant matches. For example, Cavity Aware Match Augmentation [19] uses
spheres to detect and eliminate proteins that do not maintain an empty binding cavity
within the motif. pvSOAR maps points around a pocket onto the unit sphere, and
measures their LRMSD to evaluate structural similarity [81].
An important constraint on LRMSD is the fact that it depends on bijections
between two equally sized pointsets. This constraint has serious implications for
specificity assignment in two ways: First, the structure of different amino acids
cannot be fully compared, because different amino acids have different numbers of
atoms. Many methods partially sidestep this problem by representing amino acids
with only a single point. Unfortunately, that approach partially ignores variations
in sidechain geometry, causing important differences in shape to be overlooked.
Second, depending on a bijection requires that any amino acid that occupies
different positions in two binding sites, thereby altering specificity, must also be
part of the motif, otherwise the difference may not be detected. As a result, effective
motif design, whether by expert design (e.g. [41]) or algorithmic refinement (e.g.
[7, 18, 20, 22]) is mandatory for practical specificity assignment.
2.3 Comparison Algorithms
Comparison algorithms for function annotation and specificity assignment are

search algorithms that seek to identify bijective correspondences between the
motif points and the points of the protein that satisfy all matching criteria and
minimize LRMSD. An exhaustive approach is impractical, especially in cases where
evolutionary labels can cause the algorithm to consider comparing a single motif
point to many target points. Nonexhaustive approaches are practical, and generally
identify correct matches except when the only possible matches are so poor as to be
biologically irrelevant. Even though they are biologically irrelevant, however, very
poor matches must be accounted for in any statistical models, to prevent algorithmic
bias [32].
Geometric Hashing Point-based motifs can be rapidly compared using Geometric
Hashing [33, 48], a technique that represents points in a rotationally and translation-
ally invariant vector. Vectors are generated for every triplet in the motif and stored
in a range search structure. When searching for matches for a given motif inside a
protein structure, the triplets of points in the structure are encoded into invariants and
compared against the invariants of the motif. Similar invariants represent a triplet of
corresponding points that can be used to generate initial alignments [10] or used in
voting systems to generate final correspondences of points between the protein and
the motif.
Depth First Search A second group of algorithms uses depth first search techniques
to systematically construct bijections between motif points and target points (Fig. 2)
[73]. Algorithms like Match Augmentation [10], and Labelhash [57], begin with
one or a handful of point-point correspondences. From this initial state, depth first
searches superpose the corresponding points and then identify additional correspon-
dences between motif points that are brought into proximity with acceptable target
points. Multiple potential correspondences can be detected for each motif point,
creating a branching nature in the depth first search as different possibilities are
explored. After exploring many possibilities, both geometric hashing and depth
first search matching algorithms return the match with the largest number of
corresponding points.
c1 d1 e)
a) b)
c2 d2
X
?
Fig. 2 Operation of depth first search comparison algorithms. (a) three points (black, white have
correspondences (thin black lines), and we seek to find acceptable correspondences for the next
motif point (grey). (b) Range search (circle) for points that may correspond (white) with the
next motif point. At this point, the depth first search first considers one possibility (c1–e), and
eventually considers the other (c2, d2). (c1) A tentative correspondence is considered with the
first point. (d1) LRMSD superposition between corresponding points is computed. (e) The final
distance between points in LRMSD superposition is acceptable, so we consider the next motif
point (grey). (c2) A tentative correspondence is considered with the second point. (d2) The final
distance between points in LRMSD superposition is not acceptable, because they get too far apart.
This set of correspondences is not used further, and we backtrack to other correspondences
2.3.1 Data Structures
Algorithms that compare points in space always require a data structure that enables
range search. A common structure that achieves this purpose is a three dimensional
kd-tree [4], a space partitioning data structure which is well documented and does
not need to be described further. Due to the fact that atoms can only pack into limited
densities, however, three dimensional cubic lattice structures, Lattice Hashes, can
also be very useful. Geometric comparisons generally search for atoms nearby a
given point, which amounts to a range search on a spherical range. Lattice Hashes
can support spherical range search very easily.
Construction To construct a Lattice Hash, we begin with a given set of points in
three dimensions and a resolution that defines the size of each cube in the lattice.
We first find the bounding box of the input points and determine the smallest lattice
of cubes, based on the resolution parameter, that fully contains the bounding box
(Fig. 3a). Rather than allocating each cube immediately, we refer to each cube using
a unique index:
index = xpos ∗ (ydim ∗ zdim) + ypos ∗ (zdim) + zpos
where xpos, ypos, and zpos are the position of the cube, counting from the low-x,
low-y, low-z corner of the lattice. ydim and zdim are the number of cubes along
the y and z dimensions of the lattice. All cubes that contain points are stored in a
hash table based on this index. To insert each point into the lattice, we associate it
with the cube that contains it, allocating memory only for cubes that contain points
(Fig. 3b).
Spherical Range Search We begin with a point and a radius that define the spherical
range. To discover points in the lattice within the range, we first identify the set of
cubes that contain the bounding box of the range, and then identify individual cubes
a) b) c) d)
Fig. 3 Lattice Hash construction and operation. (a) The bounds of the Lattice Hash (edge of
lattice) surround the points to be represented (black circles) in uniform cubes. (b) Only occupied
cubes (black squares) are allocated on a hash table (second row). (c) Spherical range search (large
circle) centered at a point (white with heavy black outline). (d) Range search first identifies cubes
that intersect the range (black squares), then identifies occupied cubes (shaded squares) based on
their presence in the hash table
that overlap the sphere, using the distance of cube corners and the intersection of
cube segments (Fig. 3c). Finally, individual points are tested, and points within the
range are returned.
2.4 Statistical Models for Binding Site Comparison
Protein structure comparison algorithms identify matches that indicate geometric

similarity between the atoms of two protein structures. When only a few atoms are
being compared, as in the situation of motif comparison, matches occur frequently
by random chance. The existence of a match is therefore insufficient to assign
similar binding specificity. Furthermore, matches of motifs with just a few points
generally have lower LRMSDs than matches to motifs with larger numbers of
points, making any specific LRMSD threshold also insufficient to assign specificity.
To address this problem, statistical models can be used to assess the likelihood of
observing matches of a given motif at a particular LRMSD. When the likelihood p
of observing the match is so low that it could not have occurred by random chance,
it is called statistically significant.
Statistical significance is often measured with parametric models, which estimate
the probability of observing a match based on given parameters of a motif, such as
the number of atoms and the type of amino acids it represents. These parameters
are used to approximate the probability density function (PDF), which maps any
value of LRMSD l to the probability of observing matching sites with similarity
l. Parameters are also used to approximate the cumulative distribution function
(CDF), an analog of the PDF. The shape of the PDF has been approximated
using categories of functions that include extreme value distributions [1], empirical
distributions [6, 81], and mixtures of Gaussian distributions [41]. The advantage
of this approach is that the relationship between motif characteristics and the
coefficients of approximating function can be rapidly trained for a larger set of
motifs, and that the statistical significance of a given match can be evaluated in
constant time. Unfortunately, in unusual cases where the approximating function
cannot resemble the PDF, parametric models can produce inaccurate probabilities
and thus incorrectly label some matches as statistically significant.
Nonparametric statistical models use density estimation to create a custom PDF
that reflects any motif. Rather than generating the PDF based on motif parameters,
the shape of the PDF is sampled from a set of matches with the motif, before
statistical significance can be computed. This approach can be considerably more
accurate, but it also requires a large amount of computation to generate the PDF,
whereas a parametric model is simply a curve specified by motif parameters.
MASH [22, 32] and LabelHash [57] use nonparametric statistical models that can
eliminate 99% of matches to proteins that have different 4-digit EC numbers [32].
Leveraging parallel computation to train the model as rapidly as possible, MASH
demonstrated that training a nonparametric statistical modeling was not so compu-
tationally onerous as to be impractical. Enhancements to MASH demonstrated that
nonparametric models could compensate for incomplete matching algorithms that

are not guaranteed to identify matches when only low quality matches exist [32].
This kind of compensation is impossible with parametric models, which cannot alter
the shape of their probability density function based on algorithmic limitations.
Whether a parametric or nonparametric model is used, statistically significant
matches have been found to be strong predictors of proteins with identical binding
specificity (e.g. [1, 10, 32, 41]). As a result, statistical models act as a quantitative
tool for interpreting the output of specificity assignment algorithms, separating
matches that are likely to indicate that two proteins have similar specificity from
other matches that do not support the same inference.
3 Component Localization
Many parts of a protein work together to achieve specificity. Finding these com-
ponents within the rest of the protein can be extremely difficult. However, once
the molecular mechanisms that implement specificity are found, they can provide a
crucial platform for applied research: Later studies could mutate these mechanisms
to examine how, for example, the protein could be reengineered to achieve novel
binding preferences for industrial applications, or how the protein might mutate
to achieve inhibitor resistance. Likewise, before the molecular mechanism of
specificity is known, predicting elements of protein structures that might control
specificity can narrow the space of possibilities that must be examined in order
to unravel how specificity is actually achieved. By finding individually influential
components, component localization is the new effort to use computational preci-
sion and speed to suggest hypotheses about binding mechanisms that might have
been overlooked by human investigators.
The approach that component localization algorithms take is to search for subtle
differences in very similar proteins that cause different binding preferences. This
approach differs fundamentally from that of specificity assignment algorithms,
which search for subtle similarities that necessitate similar binding preferences
in very different proteins. Second, there are many ways in which differences in
protein structure can cause differences in binding preferences, including differences
in steric hindrance at binding sites, differences in electric fields, and so on. Finding a
difference that necessitates differences in binding preferences requires a biophysical
rationale for different binding preferences, and it also requires the consideration
of many different kinds of components. While amino acids are one kind of
component that influences specificity, empty clefts and cavities also play a role in
accommodating binding partners, as do local patterns of molecular flexibility. Many
components of protein structure and their biophysical roles must be considered for
a comprehensive approach to component localization.
Component localization algorithms originated in sequence based methods that
analyze alignments of related sequences in different specificity groups, looking for
sequence positions that are conserved within groups and different between groups
[11, 12, 45, 61, 63, 65]. Amino acids with these properties are under evolutionary
pressure to support the different binding preferences of distinct groups. Generally,
these methods require sequences that are categorized into groups with different
binding preferences, though some exceptions exist [34, 62, 70]. This evolutionary
rationale offers one explanation for the role of each amino acid in specificity.
Evolutionary information can be mapped onto protein structures for component
localization. Evolutionary Trace Annotation [2] (ETA), is a specificity assignment
algorithm that uses Paired Distance matching [83] to identify binding sites with
specificity identical to that of motifs generated with the Evolutionary Trace [61].
Since ETA motifs are generated using the Evolutionary Trace, an algorithm that
performs sequence-based component localization, the amino acids that structurally
match the motif are also implied to influence specificity in the same way as
the amino acids in the motif sequences. Recently, this approach was verified
experimentally against carboxylesterases with different binding preferences [2].
3.1 Foundations of Structure-Based Component Localization
While sequence-based approaches can reveal an evolutionary rationale for the

influence of particular structural components on specificity, an analysis of protein
structures can add biophysical rationales as well. In such cases, the components
localized are not simply correlated with specificity, as in sequence-based methods,
but actually implicated in binding mechanisms. Here we describe a purely structural
method for identifying amino acids for which mutations alter steric influences on
specificity called VASP (Volumetric Analysis of Surface Properties) [23]. VASP
uses solid representations of ligand binding sites and algorithms from constructive
solid geometry (CSG) to identify regions where patterns of steric hindrance are
different and thus implicated in different binding preferences. We explain this
method in Sect. 3.2 after introducing fundamental concepts and methods necessary
for generating and manipulating solid representations of protein structures.
3.1.1 Comparing Solid Representations with CSG
Three dimensional solids can be defined with a closed surface as a boundary for a
solid region. To represent proteins, the molecular surface (also known as the solvent
excluded surface[49]) is a useful boundary, and practical surfaces, constructed from
thousands of triangles, are provided by many existing methods, including MSMS
[75], GRASP2 [64], VADAR [85], and others [26]. Geometric solids, like spheres,
cubes, and tetrahedra, can also define useful boundaries for solid representations.
Comparisons of these solid representations are not founded on an analysis of
surface geometry, but actual comparisons of the region occupied by the solid. This
kind of comparison is made possible through CSG operations (Fig. 4), also called
boolean operations. Solid comparisons of protein structures, therefore, are direct
Fig. 4 A demonstration of CSG operations union, intersection, and difference, illustrating the
borders of two input (dotted) and output regions (solid)
comparisons of the patterns of steric hindrance imposed by protein structure onto

interacting ligands.
CSG operations can be computed using Marching Cubes [50]. Marching Cubes
is a computer graphics algorithm for extracting the surface of a three-dimensional
solid, and it has been used in the past for visualization [59]. Here, we paraphrase
how Marching Cubes can be used to compute CSG operations, as others have
demonstrated [42]. As input, Marching Cubes begins with two closed regions A
and B, a resolution parameter that defines the precision of the output and one
parameter that defines the nature of the CSG operation: boolean intersection, union
or difference (Fig. 4). The output generated is the boundary surface approximating
the boolean intersection, union or difference defined by a triangular mesh. When
describing sets of serial CSG operations, we will use the operators −, ∪, and ∩ to
denote difference, union, and intersection operations (Fig. 5).
Here, we paraphrase the union operation as an example CSG operation. First,
a cubic lattice is constructed to cover both the regions A and B with the input
resolution. The lattice can be described as a grid of points oriented along the three
primary axes, or as a collection of line segments connecting two adjacent co-axial
lattice points, or as a set of unit cubes formed by multiple lattice segments. Second,
it is decided whether each lattice point p is inside or outside the region of A and
B. This procedure can be achieved using a randomly orientated ray starting from
p and counting the number of intersections with A and with B: an even number
of intersections indicates that p is outside, while an odd number of intersections
indicates that p is inside. If p is inside either A or B, then p is inside the union
region. Third, a series of lattice segments are selected where one point is inside the
union region and the other is outside the union region. Since A and B are closed
regions, their union must also be closed. Therefore, on the selected segment there
must be a crossing point that intersects the boundary of A, or the boundary of B,
or both. In the end, all the crossing points are connected to form triangles that
approximate the boundary of the union region [50]. Variations on this procedure
can be used to compute intersection and difference operations.
a b c
d e f
Fig. 5 The union operation computed with Marching Cubes. (a) The input solid regions: A and
B (gray). (b) The regions in the lattice. (c) Interior points (dark gray) and exterior points (light
gray). (d) The lines connecting one interior point and one exterior point. (e) The crossing points
that intersects the boundary of the union. (f) Output triangles (solid lines) for union approximation
a b c d
Fig. 6 Computing the volume of a given closed region. (a) Input region A and the its geometric
centroid (black dot). (b) One triangle and its normal vector. (c) Tetrahedra (triangles inclosed in
solid black lines) based on triangles (thick black lines) with normals (while arrows) facing away
from the centroid. (d) Tetrahedra (triangles inclosed in solid black lines) based on triangles (thick
black lines) with normals (black arrows) facing towards the centroid. The volume of the region is
the difference between volume sum in (c) and volume sum in (d)
Computing the volume within a closed region A is also a crucial aspect of VASP,
and we summarize it here: First, the centroid of the corners of all triangles is
determined. In A, the three points of each triangle and the centroid together define
a tetrahedron. Since the triangle either faces towards or away from the centroid,
by the Surveyor’s Formula [76], we can evaluate the volume within A by adding
the volumes of all tetrahedra that face away from the centroid, and subtracting
the volumes of all tetrahedra that face towards the centroid. The volume of each
tetrahedron can be accurately computed using Tartaglia’s Rule [5] (Fig. 6).
3.1.2 Solid Representations of Binding Cavities
Variations in structure that cause differences in binding specificity are most likely
to occur at binding sites. To compare binding site geometry, solid representations
can be used to describe the shape of the empty region that accommodates ligands
using a series of CSG operations. As an instructive example, the following method
describes one simple way to represent a binding cavity. We begin with the whole
structure of one protein A and the ligand l that binds at its functional site. For
each atom in l, a sphere is generated with radius 5.0 Å, and the union of all the
spheres, Sl , are calculated with VASP using CSG operations. Sl defines the vicinity
of the ligand binding cavity. GRASP2 [64] can be used to compute the molecular
surface m(A) using the classic rolling probe method [71]. The molecular surface
is generated using a 1.4 Å probe. A second “envelope” surface, e(A), is generated
with 5.0 Å probe using the same algorithm. The binding cavity a is generated using
the following CSG expression:
a = (Sl − m(A)) ∩ e(A)
As an instructive example, this cavity definition can have some shortcomings. A

bound ligand may not be available, or the available ligand might only occupy part
of the binding site. In such cases, atoms of l can be substituted with waters in the
binding site, or points in space positioned in the binding site by an expert. Unlike in
point-based methods, the precise position of the point is not significant, because its
only purpose is to describe a solid sphere that occupies the binding region: Regions
outside the binding region but inside the molecular surface (e.g. Fig. 7d), or outside
the envelope surface (e.g. Fig. 7e), are eliminated. In other cases, perhaps due to
small internal voids, the final cavity might exhibit multiple disconnected regions.
These can be eliminated with a simple depth first traversal of the triangles on the
surface, identifying parts of the surface that are disconnected, and removing all but
the largest or most biologically relevant region. Solid representations are compatible
with many approaches for accurately defining ligand binding cavities.
3.2 Using CSG for Component Localization
To find components of protein structures that cause different binding preferences,

we compare proteins that perform the same function but exhibit different binding
preferences. This starting point ensures that any differences identified are not
involved in performing different functions, but rather in binding different ligands.
For example, a protrusion in one binding site that does not exist in another could
prevent certain ligands from binding. This localization process is achieved using
CSG differences, which isolate amino acids that make the one cavity different from
another, or regions in one cavity that are not inside another cavity [23]. In these
a b c
d e f
Fig. 7 Generating binding cavity regions. (a) The molecular surface m(A) (gray region) of protein
A and the binding substrate l. (b) The same molecular surface with respect to the border (dotted
line) of envelope surface e(A) of A. (c) The union of all the atom spheres Sl . The spheres are shown
in dotted lines. (d) The difference region (D) that lies inside Sl but not m(A). (e) The difference
region relative to the envelope surface border. (f) The output binding cavity. It is generated by
intersecting D and e(A)
cases, solid comparisons of ligand binding cavities can reveal differences in patterns
of steric hindrance that control binding preferences.
We begin with two proteins, A and B. To identify amino acids of B that have
a steric influence on the difference in specificity between A and B, we follow the
following process: First, we superpose B onto A using whole structure (e.g. [35, 40,
78, 87]) or binding site superposition algorithms (e.g. [10, 73]). Second, we generate
the binding cavity of A, called a, using the methods described in Sect. 3.1.2. Third,
we find every amino acid bi of B, and generate it’s individual molecular surface,
m(bi ). We then measure the volume of m(bi ) ∩ a using the method described in
Fig. 6. If the intersection volume is small or zero, then bi is unlikely to have a
steric influence on differences in specificity between A and B because it does not
create differences in steric hindrance at the binding site. However, if the volume is
large, then we can say that ai creates large steric differences between a and b, and
therefore is likely to be a steric cause for differences in specificity. At a practical
level, sterically influential amino acids often create very large differences that are
hard to overlook [23].
To identify parts of binding cavities that influence specificity, consider two
families of protein structures A and B , where each family exhibits different
binding preferences and all family members are aligned to the same structure. First,
extract binding cavities from every protein in A and B , then compute the union u
of all cavities of A and the intersection i of all cavities of B . The difference, u − i,
is the cavity subregion that is conserved among all cavities of A that is not present
in any cavity of B . As we have observed in studies of serine proteases [23], this
difference can be the reason why some proteins prefer to bind different ligands.
These algorithms, constructed almost entirely of CSG operations, describe an

initial approach to identifying components of protein structures that have a steric
influence on specificity. From this starting point, new questions organize our review
of component localization: While we can identify structural variations that cause
differences in specificity, how much variation is necessary (Sect. 3.3)? Alignments
of ligand binding sites are currently founded on superpositions of atom geometry.
Can we create an algorithm that superposes the cavity based on the molecular
surface of the cavity, rather than the geometry atoms nearby (Sect. 3.4)? Can we
mitigate conformational flexibility as a source of error in component localization
(Sect. 3.5)? Finally, can we find components of protein structures that influence
specificity by other biophysical mechanisms (Sect. 3.6)? The next sections address
these questions.
3.3 Statistical Models for Component Localization
Variations in ligand binding cavities can occur for many reasons that are unrelated
to specificity. They might arise from minor changes in backbone and sidechain
conformation, so they must be eliminated to reduce incorrect predictions. Towards
the ultimate goal of applying component localization algorithms at a large scale,
eliminating even minor variations automatically is essential: Variations that an
expert might immediately dismiss as having no effect on binding can lead to
false predictions, without an automated way to eliminate them. Likewise, subtle
but important differences might be overlooked by the human eye. To address this
issue, we developed VASP-S [15] to automatically isolate the elements of protein
structures that may influence preferential binding.
VASP-S (VASP with Statistics) was developed to filter out differences in binding
cavities that are too small to alter binding preferences, in order to isolate influential
differences. It is a parametric statistical model on the volume of individual structural
differences between binding cavities, which we call fragments (Fig. 8f, h). The
design of VASP-S begins with a null hypothesis, that two binding cavities have
no steric explanation for different binding preferences. Because only cavity shape is
considered here, the same model does not rule out the influence of other biophysical
effects on specificity, which we discuss later.
Suppose we wish to evaluate whether cavity b exhibits binding preferences
similar to a. The two cavities are already structurally aligned. Using a population
of cavities {a0 , a1 , . . . an } with specificity identical to a, we first compute the
volume of all fragments that occur between any pair of cavities ai and aj . We
fit a log-normal distribution to the fragment volumes to produce a probability
density function. Then, using any fragment f between a and b, we can test the
null hypothesis: if the probability of observing a fragment f with volume (denoted
v(f ) larger than v(f ) is less than our threshold for statistical significance, α, we
reject the null hypothesis, and adopt the alternative hypothesis: There exists a steric
explanation for a and b to exhibit different binding preferences. Moreover, the
a c d b
g f h
e
Insignif icant
Signif icant
Fragments
Fragments
Fig. 8 Isolating the conserved or the varying regions between aligned binding cavities with CSG
operations. (a, b) The molecular surfaces of two protein structures (c, d) The corresponding two
binding cavities. (e) The superposition of binding cavities based on the whole structure alignment
method. (g) The intersection used in the VASP-I model. (f, h) The fragments calculated by CSG
difference in the VASP-S model. Statistically significant fragments usually have large volume
while insignificant fragments usually have small volume
Fig. 9 S1 specificity site of Atlantic salmon trypsin (pdb: 1a0j), shown in transparent grey. A
statistically significant fragment between the trypsin site and the S1 specificity site of porcine
pancreatic elastase (pdb:1b0e) shown in opaque grey. The fragment, which occupies the lower half
of the trypsin site, sterically hinders elastase from binding the larger substrates that trypsin prefers,
such as the tryptophan, shown in black sticks
fragment f creates variations in binding cavity shape that account for this predicted
difference in specificity. One example of a statistically significant fragment is shown
in Fig. 9. A related model of cavity intersections, VASP-I [14], was developed using
related concepts.
Different portions of a binding cavity do not necessarily have equal importance
for binding specificity. Statistical models, coupled with CSG operations, can also be
used to partially deconstruct empty cavity regions based on their impact on binding,
using a technique we call regionalization [16, 17]. Beginning with aligned binding
cavities, we generate a lattice of cubes that together cover a rectangular region that
surrounds the cavities. Each cube, a solid object, covers a part of the overlapping
binding cavities, and if we compute an intersection between that cube and all
binding cavities, we have a regional representation of each cavity inside each cube.
Fragments between regional representations are fragments inside each cube, and
using VASP-S, we can evaluate the statistical significance of each fragment. Once
VASP-S has been trained on every cube, some lattice cubes exhibit very different
fragment volume distributions: In some cubes, where cavities with identical binding
preferences vary considerably, large fragments are not statistically significant. In
other cubes, where cavities are highly conserved, even very small fragments are
statistically significant. These differences in cubic regions dissect binding cavities
into regions where the importance of steric hindrance is better understood.
3.4 Volumetric Alignment
Localizing components of protein structure that influence ligand binding specificity

is affected by the initial structural alignment between proteins. Traditional methods
for binding site superposition usually minimize LRMSD between amino acids
between whole protein structures (e.g. [39, 78]) or motifs of amino acids repre-
senting catalytic sites [1, 6, 8, 22]. However, atomic alignments do not necessarily
optimize the alignment of the binding cavity regions that actually accommodate the
ligand.
To address this problem, we developed DFO-VASP [24], an optimization-
based method that incorporated derivative free optimization (DFO) and VASP, to
maximize the overlapping volume between the shape of empty cavities in two
binding sites, ignoring nearby atoms. The general framework of DFO-VASP is
demonstrated in Fig. 10.
DFO methods focus on the problem {min f (x) : x ∈ R n } where the derivative
of the objective function is usually not available. Finding the optimal superposition
of two binding cavities by maximizing their overlapping volume is such a problem
Fig. 10 Superposition using DFO-VASP. Beginning with unsuperimposed cavities (left), each
iteration of DFO-VASP performs one superposition, then evaluates the volume overlapping
between the binding cavities. Iterations repeat until the overlapping volume is maximized, enabling
the detection of differences in cavity shape independent of atomic positions
since the objective function has no closed-form expression and the function value
is prone to noise. DFO-VASP is a trust-region based algorithm [25] designed to
address this problem. The input is two protein cavities a and b, defined as solid
representations. The algorithm examines values of the vector x, which has seven
parameters: rotation about three axes, the rotation angle and translation on three
axes, seeking the set of rotations and translations that superposes b onto a with the
largest volume of intersection. In a series of iterations, model functions are built
to approximate the objective function within a trust region. The model function is
optimized in the trust region to search for the best position for the next iteration.
To reduce the influence of noise, a dynamic incremental strategy is designed: in
each iteration, given the resolution of VASP r and the estimate noise ξr , reduce the
noise level by decreasing the resolution r if the reduction of the model function is
comparable to ξr .
In practice, DFO-VASP tends to converge to the local optimal and find a
reasonable overlapping volume, but it cannot guarantee the global optimal solution.
Given different initial values, DFO-VASP may converge to different final solutions.
Since the atom coordinates and the optimal overlapping can be highly related, the
structural backbone superposition is be used as a starting point. We call this a warm-
start superposition. The initial solution can also be randomly sampled by Latin
Hypercube Sampling methods [56], to produce well distributed random starting
points.
By visualizing the optimal superposition, it was found that DFO-VASP aligned
two protein binding cavities logically: the entrances to the cavities were oriented
in the same direction and their conserved regions were highly overlapped. Using
backbone alignment as the baseline, both warm-start and random DFO-VASP
exhibited greater intersection. And in general warm-start alignment had slightly
greater intersection volume since a reasonable initial enabled DFO-VASP to explore
more in the neighborhood of the optimal superposition.
3.5 Flexible Representations for Component Localization
Proteins are flexible molecules. This flexibility complicates protein structure com-
parison and it interferes with accurate component localization. In recent years,
several protein structure comparison methods have represented backbone motions
using hinges [54], partial-order graphs [88] and fragment pairs [74]. These rep-
resentations describe protein structures using rigid components tied together with
flexible linkers. While these approaches approximate partial backbone flexibility,
they do not permit general flexibility in the backbone, and they also simplify the
flexibility of amino acid side chains. Here, we discuss two methods that achieve
a more general representation of the flexibility of protein structures and enable
component localization in spite of conformational variations.
3.5.1 Improving Binding Site Comparison
One way to reduce errors from conformational change is to use protein structure
prediction to remodel proteins in nonfunctional conformations into conformations
that can be reasonably compared. We evaluated homolog modeling to determine if
it can be an effective tool to reduce this kind of comparison error [36].
The statistical models and other methods developed above have generally
considered a families of sequentially nonredundant proteins with identical fold and
function, but different binding preferences. Among these families, backbone or
sidechain variations are inevitable. We use remodeling to resolve this problem by
selecting one structure as a template. The structure is selected because it contains a
bound ligand or is otherwise in an active conformation. Beginning with the template
structure and the aligned sequences of the others, NEST [86], a homology modeling
package for protein structure prediction, was used to model the structures of the
other proteins with the template. Leveraging the fact that structure prediction by
homology modeling is highly accurate when applied to close homologs [3, 46],
the model structures from NEST were able to reduce conformational differences in
binding cavities. Model structures were then structurally aligned onto the template
structure using Ska [87] and the binding cavity of each model structure was
extracted using the method in Sect. 3.1.2. Between the binding cavities of same-
specificity proteins in inactive and active conformations, the largest fragment was
originally very large, and almost always statistically significant. After remodeling,
the largest fragment was generally consistent with the largest fragment observed
between other proteins with identical binding preferences [36]. Remodeling reduces
prediction errors associated with conformational changes.
3.5.2 Flexible Volumetric Comparison of Protein Cavities
Proteins are constantly in motion. The structural changes that result from these
motions do not change specificity, but they can create sources of error for component
localization. To test this hypothesis, we used molecular dynamics to simulate the
motion of nine sequentially nonredundant trypsin structures with the same binding
preferences for 100 nanoseconds. Examining 600 snapshots from each simulation,
we aligned each snapshot, generated the binding cavity, and identified the largest
fragment between the snapshot and a different trypsin. Over 65 percent of the largest
fragments were so large as to be statistically significant, indicating that normal
protein motions will cause VASP-S [15] to create erroneous predictions [37].
To perform component localization accurately despite the presence of flexibility,
we developed FAVA, an algorithm that uses conformational sampling to produce
an representation of the ligand binding cavity that as it appears in the majority of
samples. As input, FAVA accepts conformational snapshot {A1 , A2 , ..., AN } of the
protein A where N is the number of samples and k, the overlap threshold. FAVA
first extracts the binding cavity ci of the snapshot Ai with VASP (e.g. Sect. 3.1.2).
With each binding cavity snapshot, FAVA computes a frequent region, defined as
the region that is solvent accessible in more than k/N samples. Ideally, the frequent
region could be computed as the union of all intersections of all distinct sets of k
cavity snapshots. Unfortunately, this computation would require Nk intersections,
an impractical goal for most values of k. Our approach with FAVA, given any
k, is to compute 500 different k-sized sets of conformations c0 , c1 , ..., ck , where
the subset is randomly selected from the set of all samples. We then compute the
intersection of each set, and then the union of all 500 intersections, to approximate
the ideal frequent region. The frequent region ignores the subspace of unusual cavity
conformations and characterizes the average shape of the binding region in the
context of conformational flexibility.
Given two proteins A, B and the overlap threshold k, we can use FAVA to obtain
their frequent regions αk , βk , and it is of interest to compare them. To measure the
similarity of their frequent regions, we use D(αk , βk )
v(αk ∩ βk )
D(αk , βk ) = 1 − (1)
v(αk ∪ βk )
where v(·) is the volume of a given region. If two proteins have identical binding
preference, they should have more similar frequent regions and their volumetric
distance should be smaller than that of proteins with different binding preferences.
We tested FAVA on the selected non-redundant proteins of the serine protease and
enolase superfamilies. We constructed a UPGMA (Unweighted Pair Group Method
with Arithmetic mean [79]) clustering tree based on the measure D(αk , βk ) between
frequent regions, finding that almost all binding cavities were correctly separated by
their binding specificities [37].
3.6 Solid Representations of Electrostatic Isopotentials
The volume-based algorithms described above permit a comparison of protein shape

that reveals differences in patterns of steric hindrance. Significant differences in
binding cavities can indicate that one protein can sterically accommodate ligands
that another cannot. Using the same solid representations, however, it is also possi-
ble to perform comparison of electrostatic fields, identifying amino acids that play
an electrostatic role in specificity. We developed VASP-E [13], a volumetric tool to
localize electrostatic components of protein structures that influence specificity.
VASP-E and VASP identify components of protein structure that may influence
specificity through different biochemical mechanisms. Together, they represent the
first tools in a space of techniques that will localize components of protein structures
according to well defined biochemical phenomena, thereby providing the user with
a rationale for why a particular amino acid influences specificity. This kind of
information could be used to design more specific experiments: if one amino acid
changes the shape of the binding cavity, then perhaps enlarging into a tryptophan
or shrinking it into glycine will achieve a more dramatic change in specificity.
If an amino acid causes two binding cavities to exhibit different electrostatic

potential, then perhaps substituting the amino acid for an oppositely charged residue
will reveal the role of electrostatics in the interaction. This biophysical approach
provides an additional dimension of information to experimental researchers that
is not provided by component localization algorithms, which simply indicate the
importance of some element of protein structure.
VASP-E generates volumetric representations of electrostatic isopotentials using
Marching Cubes. As input, VASP-E begins with a whole protein structure A, its
electrostatic fields E from DelPhi [72], the isopotentials threshold k and user’s
choice (+/−) of representing the close region with potential greater or less than k.
When generating isopotentials at k kT /e, VASP-E always represent the solid region
with the absolute value of isopotentials greater than k. The construction procedure
is almost the same as generating solid regions using VASP (Fig. 5) and the only
difference is how to mark each lattice point as being inside or outside the solid
region. In electrostatic representation, VASP-E evaluated the potential UE (p) for
each lattice point p. If |UE (r)| k, p is inside the isopotential region. Otherwise, p
is marked as being outside. Lattice segments that connect an inside and outside point
are then examined to identify a “crossing point” where the segment intersects the
isopotential. Combining the crossing points in each lattice cube yields the corners of
a series of triangles that define a triangular mesh that approximates the isopotential.
Solid representations of electrostatic isopotentials can be used to create cavity
fields which represent the region inside the geometric binding cavity and the
electrostatic isopotential (Fig. 11). For this reason, we generate cavity fields by
computing the CSG intersection between the isopotential and the binding cavity,
which we computed earlier. CSG operations also offer unique opportunities for
comparing cavity fields. The intersection of two cavity fields is that is solvent
accessible in both cavities and inside both solid isopotentials; if the intersection
a c e h j
b d f g i k
Fig. 11 Generating and comparing cavity field. (a, b) The molecular surface of protein A and B is
shown in dotted line. The binding cavities are shown in dark gray with solid lines. The electrostatic
isopotentials are shown in transparent light gray. (c, d) The cavity fields (dark gray regions) are
inside both the binding cavities and the solid isopotentials. (e) The intersection of two binding
cavities I . (f) To guarantee that cavity field comparison is not influenced by steric differences. The
intersection between cavity fields and I was created, named as IA and IB . The alignment between
IA and IB (in dark gray). (g) IA and IB (dotted lines) with the intersection (dark gray rectangle).
(h, i) The differences between IA and IB . (j, k) The same differences with respect to the molecular
surfaces
is large, then it might accommodate the same ligand fragment. The difference
between two cavity fields is a region of solvent accessible electrostatic potential
that exists in one protein but not in another; an area where the two proteins may
not electrostatically stabilize similar ligands. VASP-E can thus be used to localize
electrostatic components that influence ligand binding specificity.
VASP-E was tested on cavities fields from serine proteases and cysteine proteases
and built UPGMA clustering trees by calculating the similarities with volumetric
distance (Eq. 1). The cavity fields were generated at {−2.5, −5.0, −7.5, −10.0}
kT /e. The results showed that all of the clusterings correctly classified cavity fields
into groups with different binding specificities and VASP-E proved to be a useful
tool to characterize binding cavities based on electrostatic elements.
VASP-E can also identify amino acids that have an electrostatic influence on
specificity. First, the electrostatic contribution of one amino acid at a time is
removed (nullified), and the cavity field is recomputed. Then the recomputed cavity
field is compared to a cavity field from a protein with different binding preferences.
If the two cavity fields are more similar than they would have been if no amino
acids were nullified, then we know that the nullified amino acid was causing the
cavity fields to be different. Therefore, that amino acid is an electrostatic influence
on specificity. This approach correctly identified electrostatically influential amino
acids in the serine proteases [13].
4 Discussion
Algorithms for specificity annotation represent a significant new direction for the
development of protein structure comparison. Earlier comparison algorithms were
focused on the function annotation problem, and they were developed to propose
functions for protein structures that are being solved without known function, such
as those generated by the international protein structure initiatives. Here, specificity
annotation algorithms are designed to propose binding preferences for proteins
with unknown binding preferences, and they can be applied to suggest binding
preferences for the many genetic variants that may be observed in the field.
Component localization adds additional capabilities. For example, as we discover
mutants that resist drug therapies, we are forced to reexamine the function of the
protein and the drug to discover how resistance occurs. Algorithms for component
localization can offer explanations as to which components of the protein structure
cause resistance, not only when those parts that are amino acids, but also when they
are specific regions inside ligand binding cavities or electrostatic fields. As methods
for component localization advance, they will be able to offer suggestions as to the
biophysical mechanism of resistance. These explanations are valuable because they
are testable hypotheses generated from biophysical measurements. Since the space
of mutations is combinatorial, the value of automated hypothesis generation in this
context is considerable.
The development of algorithms for specificity annotation present novel and
interesting technical advancements in the field of protein structure comparison.
First, solid representations that contrast from the conventional point-based atomic
representations reveal new technical capabilities, such as the regionalization of
structural analysis. Second, solid representations are defined by a boundary surface,
so they can be used to compare elements of protein structure, such as cavity regions
or electrostatic isopotentials with high precision. Since they do not require the
bijective correspondences that are a prerequisite of point-based comparisons, they
can provide objective comparison scores of structures that involve different numbers
of atoms, such as a comparison of individual amino acids with the shape of a binding
cavity. We believe these capabilities, and others that are not yet discovered, create
significant opportunities for breakthrough algorithms in the space of specificity
annotation.
4.1 Future Directions
Specificity annotation is a field in its infancy with many areas for potential
advancement and specialization. The most open space for advancements lies in the
design of algorithms for component localization using on biophysical phenomena
that have not yet been exploited for specificity annotation, such as hydrophobicity or
hydrogen bonding. It is crucial to note that phenomenon-specific prediction methods
are not directed at the creation of energy functions. The goal of these predictors is
not to provide a single energy of interaction, but rather to predict whether or not
a difference in proteins can eliminate similarity in binding preferences because of
a particular phenomenon. The integration of methods that consider all phenomena
will be a comprehensive step towards general component localization.
A second direction for future development lies in the better use of manycore
architectures in protein structure comparison. Most structure comparison algorithms
to date have been developed for single threaded applications, and this approach is
increasingly archaic. Multithreaded software has considerable potential to enhance
the performance of protein structure comparisons, yielding faster, and higher
resolution comparisons. In the overall field, the main winner from such a transition
will be the enhanced sophistication of statistical models that will have much
expanded data upon which to train. More robust multipart validation will also be
possible.
Finally, a much needed and often overlooked aspect of the field is the enhance-
ment of user accessibility. Algorithms in computational structural biology are
especially technical and because of many disciplinary and practical reasons, user
interfaces are often overlooked. This challenge strongly limits impact in the field
of function annotation because it could better serve experimentalists who are not
highly trained computational researchers. As sophisticated user interfaces become
increasingly easy to develop because of their ubiquity in other fields, software for
function annotation must become more accessible as well.
Acknowledgments This work was funded in part by NIH Grant R01GM123131 to Brian Y. Chen.
References
1. Stark A., Sunyaev S., and Russell RB. A model for statistical significance of local similarities
in structure. J. Mol. Biol., 326:1307–1316, 2003.
2. Shivas R Amin, Serkan Erdin, R Matthew Ward, Rhonald C Lua, and Olivier Lichtarge.
Prediction and experimental validation of enzyme substrate specificity in protein structures.
Proceedings of the National Academy of Sciences, 110(45):E4195–E4202, 2013.
3. David Baker and Andrej Sali. Protein structure prediction and structural genomics. Science,
294(5540):93–96, 2001.
4. Jon Louis Bentley. Multidimensional binary search trees used for associative searching.
Communications of the ACM, 18(9):509–517, 1975.
5. G Biggiogero. La geometria del tetraedro. Enciclopedia delle Matematiche Elementari e
Complementi, 2(1):219–252, 1950.
6. T.A. Binkowski, P. Freeman, and J. Liang. pvSOAR: Detecting similar surface patterns of
pocket and void surfaces of amino acid residues on proteins. Nucl. Acid. Res., 32:W555–8,
2004.
7. Drew H Bryant, Mark Moll, Brian Y Chen, Viacheslav Y Fofanov, and Lydia E Kavraki.
Analysis of substructural variation in families of enzymatic proteins with applications to
protein function prediction. BMC bioinformatics, 11(1):242, 2010.
8. Drew H Bryant, Mark Moll, Paul W Finn, and Lydia E Kavraki. Combinatorial clustering
of residue position subsets predicts inhibitor affinity across the human kinome. PLoS
computational biology, 9(6):e1003087, 2013.
9. Chen B.Y., Bryant D.H, Fofanov V.Y., Kristensen D.M., Cruess A.E., Kimmel M., Lichtarge
O., and Kavraki L.E. Cavity-aware motifs reduce false positives in protein function prediction.
Proceedings of the 2006 IEEE Computational Systems Bioinformatics Conference (CSB 2006),
accepted, August 2006.
10. Chen B.Y., Fofanov V.Y., Kristensen D.M., Kimmel M., Lichtarge O., and Kavraki L.E.
Algorithms for structural comparison and statistical analysis of 3D protein motifs. Proceedings
of Pacific Symposium on Biocomputing 2005, pages 334–45, 2005.
11. John A Capra and Mona Singh. Characterization and prediction of residues determining protein
functional specificity. Bioinformatics, 24(13):1473–1480, 2008.
12. Saikat Chakrabarti, Stephen H Bryant, and Anna R Panchenko. Functional specificity lies
within the properties and evolutionary changes of amino acids. Journal of molecular biology,
373(3):801–810, 2007.
13. Brian Y Chen. Vasp-e: Specificity annotation with a volumetric analysis of electrostatic
isopotentials. PLoS computational biology, 10(8):e1003792, 2014.
14. Brian Y Chen and Soutir Bandyopadhyay. A statistical model of overlapping volume in
ligand binding cavities. In Bioinformatics and Biomedicine Workshops (BIBMW), 2011 IEEE
International Conference on, pages 424–431. IEEE, 2011.
15. Brian Y Chen and Soutir Bandyopadhyay. Vasp-s: A volumetric analysis and statistical model
for predicting steric influences on protein-ligand binding specificity. In Bioinformatics and
Biomedicine (BIBM), 2011 IEEE International Conference on, pages 22–29. IEEE, 2011.
16. Brian Y Chen and Soutir Bandyopadhyay. Modeling regionalized volumetric differences in
protein-ligand binding cavities. Proteome science, 10(Suppl 1):S6, 2012.
17. Brian Y Chen and Soutir Bandyopadhyay. A regionalizable statistical model of intersecting
regions in protein–ligand binding cavities. Journal of bioinformatics and computational
biology, 10(03), 2012.
18. Brian Y Chen, Drew H Bryant, Amanda E Cruess, Joseph H Bylund, Viacheslav Y Fofanov,
Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Composite motifs integrating multiple
protein structures increase sensitivity for function prediction. In Comput Syst Bioinformatics
Conf, volume 6, pages 343–355, 2007.
19. Brian Y Chen, Drew H Bryant, Viacheslav Y Fofanov, David M Kristensen, Amanda E Cruess,
Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Cavity-aware motifs reduce false
positives in protein function prediction. In Proceedings of the 2006 IEEE Computational
Systems Bioinformatics Conference (CSB 2006), pages 311–23, 2006.
20. Brian Y Chen, Drew H Bryant, Viacheslav Y Fofanov, David M Kristensen, Amanda E
Cruess, Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Cavity scaling: automated
refinement of cavity-aware motifs in protein function prediction. Journal of bioinformatics
and computational biology, 5(02a):353–382, 2007.
21. Brian Y Chen, Viacheslav Y Fofanov, Drew H Bryant, Bradley D Dodson, David M Kristensen,
Andreas M Lisewski, Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. Geometric
sieving: Automated distributed optimization of 3d motifs for protein function prediction. In
Research in Computational Molecular Biology, pages 500–515. Springer, 2006.
22. Brian Y Chen, Viacheslav Y Fofanov, Drew H Bryant, Bradley D Dodson, David M Kristensen,
Andreas M Lisewski, Marek Kimmel, Olivier Lichtarge, and Lydia E Kavraki. The mash
pipeline for protein function prediction and an algorithm for the geometric refinement of 3d
motifs. Journal of Computational Biology, 14(6):791–816, 2007.
23. Brian Y Chen and Barry Honig. Vasp: a volumetric analysis of surface properties yields
insights into protein-ligand binding specificity. PLoS computational biology, 6(8):e1000881,
2010.
24. Ruobing Chen, Katya Scheinberg, and Brian Y Chen. Aligning ligand binding cavities by
optimizing superposed volume. In Bioinformatics and Biomedicine (BIBM), 2012 IEEE
25. Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free
optimization, volume 8. Siam, 2009.
26. Michael L Connolly. The molecular surface package. Journal of molecular graphics,
11(2):139–141, 1993.
27. Evangelos A Coutsias, Chaok Seok, and Ken A Dill. Using quaternions to calculate RMSD.
Journal of computational chemistry, 25(15):1849–1857, 2004.
28. Porter C.T., Bartlett G.J., and Thornton J.M. The catalytic site atlas: a resource of catalytic sites
and residues identified in enzymes using structural data. Nucleic Acids Research, 32:D129–
D133, 2004.
29. Kristensen D.M., Chen B.Y., Fofanov V.Y., Ward R.M., Lisewski A.M., Kimmel M., Kavraki
L.E., and Lichtarge O. Recurrent use of evolutionary importance for functional annotation of
proteins based on local structural similarity. Protein Science, in press, 2006.
30. Liang J. Edelsbrunner H., Facello M. On the definition and the construction of pockets in
macromolecules. Discrete Applied Mathematics, 88:83–102, 1998.
31. Ferré F., Ausiello G, Zanzoni A, and Helmer-Citterich M. Surface: a database of protein surface
regions for functional annotation. Nucl. Acid. Res., 32:D240–4, 2004.
32. Viacheslav Y Fofanov, Brian Y Chen, Drew H Bryant, Mark Moll, Olivier Lichtarge, Lydia
Kavraki, and Marek Kimmel. A statistical model to correct systematic bias introduced by
algorithmic thresholds in protein structural comparison algorithms. In Bioinformatics and
Biomedicine Workshops, 2008. BIBMW 2008. IEEE International Conference on, pages 1–8.
IEEE, 2008.
33. Verbitsky G., Nussinov R., and Wolfson H.J. Structural comparison allowing hinge bending.
Prot: Struct. Funct. Genet., 34(2):232–254, 1999.
34. Benjamin Georgi, Jörg Schultz, and Alexander Schliep. Context-specific independence mixture
modelling for protein families. In Knowledge Discovery in Databases: PKDD 2007, pages 79–
90. Springer, 2007.
35. Jean-Francois Gibrat, Thomas Madej, and Stephen H Bryant. Surprising similarities in
structure comparison. Current opinion in structural biology, 6(3):377–385, 1996.
36. Brian G Godshall and Brian Y Chen. Improving accuracy in binding site comparison with
homology modeling. In Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE
37. Ziyi Guo, Trevor Kuhlengel, Steven Stinson, Seth Blumenthal, Brian Y Chen, and Soutir
Bandyopadhyay. A flexible volumetric comparison of protein cavities can reveal patterns
in ligand binding specificity. In Proceedings of the 5th ACM Conference on Bioinformatics,
Computational Biology, and Health Informatics, pages 445–454. ACM, 2014.
38. Edelsbrunner H. and Mucke E.P. Three-dimensional alpha shapes. ACM Trans. Graphics,
13:43–72, 1994.
39. Liisa Holm and Chris Sander. Protein structure comparison by alignment of distance matrices.
Journal of molecular biology, 233(1):123–138, 1993.
40. Liisa Holm and Chris Sander. Mapping the protein universe. Science, 273(5275):595–602,
1996.
41. Barker J.A. and Thornton J.M. An algorithm for constraint-based structural template matching:
application to 3D templates with statistical analysis. Bioinf., 19(13):1644–1649, 2003.
42. Tao Ju, Frank Losasso, Scott Schaefer, and Joe Warren. Dual contouring of hermite data. In
ACM Transactions on Graphics (TOG), volume 21, pages 339–346. ACM, 2002.
43. Kinoshita K. and Nakamura H. Identification of protein biochemical functions by similarity
search using the molecular surface database ef-site. Protein Science, 12:1589–1595, 2003.
44. Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystal-
lographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography,
32(5):922–923, 1976.
45. Olga V Kalinina, Pavel S Novichkov, Andrey A Mironov, Mikhail S Gelfand, and Aleksandra B
Rakhmaninova. Sdppred: a tool for prediction of amino acid residues that determine differences
in functional specificity of homologous proteins. Nucleic acids research, 32(suppl 2):W424–
W428, 2004.
46. Patrice Koehl, Michael Levitt, et al. A brighter future for protein structure prediction. nature
structural biology, 6:108–111, 1999.
47. David M Kristensen, R Matthew Ward, Andreas M Lisewski, Serkan Erdin, Brian Y Chen,
Viacheslav Y Fofanov, Marek Kimmel, Lydia E Kavraki, and Olivier Lichtarge. Prediction
of enzyme function based on 3d templates of evolutionarily important amino acids. BMC
bioinformatics, 9(1):17, 2008.
48. Yehezkel Lamdan and Haim J Wolfson. Geometric hashing: A general and efficient model-
based recognition scheme. In ICCV, volume 88, pages 238–249, 1988.
49. Byungkook Lee and Frederic M Richards. The interpretation of protein structures: estimation
of static accessibility. Journal of molecular biology, 55(3):379–IN4, 1971.
50. William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface
construction algorithm. In ACM Siggraph Computer Graphics, volume 21, pages 163–169.
ACM, 1987.
51. Rosen M., Lin S.L., Wolfson H., and Nussinov R. Molecular shape comparisons in searches
for active sites and functional similarity. Prot. Eng., 11(4):263–277, 1998.
52. Shatsky M., Shulman-Peleg A., Nussinov R., and Wolfson H.J. Recognition of binding patterns
common to a set of protein structures. Proceedings of RECOMB 2005, pages 440–55, 2005.
53. Shatsky M., Shulman-Peleg A., Nussinov R., and Wolfson H.J. The multiple common point set
problem and its application to molecule binding pattern detection. J. Comp. Biol., 13(2):407–
28, 2006.
54. Shatsky M., Nussinov R., and Wolfson H.J. A method for simultaneous alignment of multiple
protein structures. Proteins, 56(1):143–56, 2004.
55. Srinivasan Madabushi, Hui Yao, Mike Marsh, David M Kristensen, Anne Philippi, Mathew E
Sowa, and Olivier Lichtarge. Structural clusters of evolutionary trace residues are statistically
significant and common in proteins. Journal of molecular biology, 316(1):139–154, 2002.
56. Michael D McKay, Richard J Beckman, and William J Conover. Comparison of three
methods for selecting values of input variables in the analysis of output from a computer code.
Technometrics, 21(2):239–245, 1979.
57. Mark Moll, Drew H Bryant, and Lydia E Kavraki. The labelhash algorithm for substructure
matching. BMC bioinformatics, 11(1):555, 2010.
58. Richard J Morris, Rafael J Najmanovich, Abdullah Kahraman, and Janet M Thornton. Real
spherical harmonic expansion coefficients as 3d shape descriptors for protein binding pocket
and ligand comparisons. Bioinformatics, 21(10):2347–2355, 2005.
59. Anthony Nicholls, Kim A Sharp, and Barry Honig. Protein folding and association: insights
from the interfacial and thermodynamic properties of hydrocarbons. Proteins: Structure,
Function, and Bioinformatics, 11(4):281–296, 1991.
60. Bachar O., Fischer D., Nussinov R., and Wolfson H. A computer vision based technique for
3-d sequence independent structural comparison of proteins. Prot. Eng., 6(3):279–288, 1993.
61. Lichtarge O. and Sowa M.E. Evolutionary predictions of binding surfaces and interactions.
Curr. Opin. Struct. Biol., 12(1):21–27, 2002.
62. Jimin Pei, Wei Cai, Lisa N Kinch, and Nick V Grishin. Prediction of functional specificity
determinants from protein sequences using log-likelihood ratios. Bioinformatics, 22(2):164–
171, 2006.
63. Osnat Penn, Adi Stern, Nimrod D Rubinstein, Julien Dutheil, Eran Bacharach, Nicolas Galtier,
and Tal Pupko. Evolutionary modeling of rate shifts reveals specificity determinants in hiv-1
subtypes. PLoS computational biology, 4(11):e1000214, 2008.
64. Donald Petrey and Barry Honig. Grasp2: visualization, surface properties, and electrostatics
of macromolecular structures and sequences. Methods in enzymology, 374:492–509, 2002.
65. Walter Pirovano, K Anton Feenstra, and Jaap Heringa. Sequence comparison by sequence
harmony identifies subtype-specific functional sites. Nucleic acids research, 34(22):6540–
6548, 2006.
66. Benjamin J Polacco and Patricia C Babbitt. Automated discovery of 3d motifs for protein
function annotation. Bioinformatics, 22(6):723–730, 2006.
67. Norel R., Fischer D., Wolfson H.J., and Nussinov R. Molecular surface recognition by a
computer vision-based technique. Prot. Eng., 7:39–46, 1994.
68. Norel R., Petrey D., Wolfson H.J., and Nussinov R. Examination of shape complementarity in
docking of unbound proteins. Prot: Struct. Funct. Genet., 36:307–317, 1999.
69. Laskowski R.A., Watson J.D., and Thornton J.M. Protein function prediction using local 3D
templates. Journal of Molecular Biology, 351:614–626, 2005.
70. Boris Reva, Yevgeniy Antipin, and Chris Sander. Determinants of protein function revealed by
combinatorial entropy optimization. Genome Biol, 8(11):R232, 2007.
71. Frederick M Richards. Areas, volumes, packing, and protein structure. Annu. Rev. Biophys.
Bioeng., 6:151–176, 1977.
72. Walter Rocchia, Emil Alexov, and Barry Honig. Extending the applicability of the nonlinear
Poisson-Boltzmann equation: Multiple dielectric constants and multivalent ions. The Journal
of Physical Chemistry B, 105(28):6507–6514, 2001.
73. Robert B Russell. Detection of protein three-dimensional side-chain patterns: new examples
of convergent evolution. Journal of molecular biology, 279(5):1211–1227, 1998.
74. Saeed Salem, Mohammed J Zaki, and Chris Bystroff. Flexsnap: Flexible non-sequential protein
structure alignment. Algorithms for Molecular Biology, 5(1):12, 2010.
75. Michel F Sanner, Arthur J Olson, and Jean-Claude Spehner. Reduced surface: an efficient way
to compute molecular surfaces. Biopolymers, 38(3):305–320, 1996.
76. J Schaer and MG Stone. Face traverses and a volume algorithm for polyhedra. In New Results
and New Trends in Computer Science, pages 290–297. Springer, 1991.
77. Gideon Schreiber and Alan R Fersht. Rapid, electrostatically assisted association of proteins.
Nature Structural & Molecular Biology, 3(5):427–431, 1996.
78. Ilya N Shindyalov and Philip E Bourne. Protein structure alignment by incremental
combinatorial extension (ce) of the optimal path. Protein engineering, 11(9):739–747, 1998.
79. Peter HA Sneath and Robert R Sokal. Numerical taxonomy. Nature, 193(4818):855–860,
1962.
80. Binkowski T.A., Joachimiak A., and Liang J. Protein surface analysis for function annotation
in high-throughput structural genomics pipeline. Protein Science, 14:2972–2981, 2005.
81. Binkowski T.A., Adamian L., and Liang J. Inferring functional relationships of proteins from
local sequence and spatial surface patterns. J. Mol. Biol., 332:505–526, 2003.
82. S. Umeyama. Least-squares estimation of transformation parameters between two point

patterns. IEEE Trans. Pat. Anal. Mach. Int., PAMI-13(4):376–86, Apr 1991.
83. R Matthew Ward, Serkan Erdin, Tuan A Tran, David M Kristensen, Andreas Martin Lisewski,
and Olivier Lichtarge. De-orphaning the structural proteome through reciprocal comparison of
evolutionarily important structural features. PLoS One, 3(5):e2136, 2008.
84. Edwin C Webb et al. Enzyme nomenclature 1992. Recommendations of the Nomenclature
Committee of the International Union of Biochemistry and Molecular Biology on the Nomen-
clature and Classification of Enzymes. Number Ed. 6. Academic Press, 1992.
85. Leigh Willard, Anuj Ranjan, Haiyan Zhang, Hassan Monzavi, Robert F Boyko, Brian D Sykes,
and David S Wishart. Vadar: a web server for quantitative evaluation of protein structure
quality. Nucleic acids research, 31(13):3316–3319, 2003.
86. Zhexin Xiang and Barry Honig. Extending the accuracy limits of prediction for side-chain
conformations. Journal of molecular biology, 311(2):421–430, 2001.
87. An-Suei Yang and Barry Honig. An integrated approach to the analysis and modeling of protein
sequences and structures. i. protein structural alignment and a quantitative measure for protein
structural distance. Journal of molecular biology, 301(3):665–678, 2000.
88. Yuzhen Ye and Adam Godzik. Multiple flexible structure alignment using partial order graphs.
Bioinformatics, 21(10):2362–9, May 2005.
Machine Learning-Based Approaches
for Protein Conformational Exploration
Fatemeh Afrasiabi, Ramin Dehghanpoor, and Nurit Haspel
1 Introduction
A possible 3D shape of a protein is denoted a conformation. When a protein is

generated in the ribosome, it folds into a low-energy conformation called the native
state. However, since proteins are dynamic molecules, they can transition from
one conformation to another due to changes in the environment, binding to other
molecules, performing specific functions such as regulating cell signaling, protein
catalysis, and so forth. Therefore, studying the structure of proteins is not sufficient
to fully comprehend their functionalities. In order to understand and analyze these
functionalities, we need to investigate the conformational space of proteins and the
conformational pathways they take when going through these transitions. Proteins
may interact with each other without any change in their conformations. Though
in most cases, they need to undergo a conformational transition to optimally
interact with other molecules. These conformational changes may occur not only
at the binding interface of the protein but also away from the binding site (also
called allosteric changes). Figure 1 shows the conformational changes of Adenylate
kinase.
The SARS-CoV-2 virus currently endangering the global population is a good
example of how crucial it is to study the conformational space of proteins. As
of January 2022, there have been over 323 million confirmed cases and 5.5
millions confirmed deaths. These numbers show how important it is to find new
therapeutics and vaccines immediately. To do so, understanding the structures of
the SARS-CoV-2 proteins and the complexes they form is necessary [45, 47–
49, 52]. However, as mentioned before, it is of significance to also study the
F. Afrasiabi · R. Dehghanpoor · N. Haspel ()

University of Massachusetts Boston, Boston, MA, USA
e-mail: Fatemeh.Afrasiabi001@umb.edu; Ramin.Dehghanpoor001@umb.edu;
Nurit.Haspel@umb.edu

48 F. Afrasiabi et al.
Fig. 1 Conformational changes of Adenylate kinase, going from its closed conformation (4AKE)
to its open conformation (1AKE). PDB IDs for conformation (a) 4AKE, (b) 1DVR, (c) 1ANK,
and (d) 1AKE
full range of conformations that each of the SARS-CoV-2 proteins can access in
order to understand how they replicate, infect cells, and avoid the host’s immune
system. Exploring the conformational space of SARS-CoV-2 proteins also provides
information concerning the vulnerability of its proteins that will help in drug and
vaccine design [53].
Based on the reasons mentioned above, researchers in the fields of compu-
tational, structural, and experimental biology have made efforts to study protein
conformational pathways and analyzing the conformational space of proteins and
complexes. However, as these changes happen in microseconds and many of these
intermediate conformations are transient, it has been a challenge to capture them
experimentally or computationally. During the past decades, various experimental
and computational methods have been proposed to learn about these conformations.
We will review most of the highly used strategies with an emphasis on the new
machine learning-based approaches.
2 Biophysical and Empirical Methods
Experimental methods such as time-resolved optical spectroscopy, X-ray crystallog-

raphy, solution scattering, and several more have been used for decades to capture
the structural changes of proteins happening over a wide range of time scales and
amplitudes of motion.
X-ray crystallography, for instance, has for long been employed to discover
and study the structures of proteins [42]. It aims to resolve the 3D structure of
a crystallized protein molecule with an atomic resolution. However, because of
its rigorous requirements, namely needing highly ordered, soluble and radiation-
resistant crystals, its applicability has been limited. Interestingly enough, during
the crystallization process, substantial conformational motions may be neglected as
proteins are put in constrained environments. That said, data collection by high-
resolution X-ray diffraction crystallography with precise refinements can provide
the spatial distribution of high-frequency displacements in a protein. Yet, it fails
Machine Learning-Based Approaches for Protein Conformational Exploration 49
to capture low-frequency motions with large amplitudes and rapid motions that are
less likely to happen [37]. Consequently, in order to study protein motions in more
natural environments, alternate methods like X-ray solution scattering and nuclear
magnetic resonance (NMR) have been introduced.
Another approach to study protein dynamics is to observe the time evolution
of X-ray scattering by liquid solutions. X-ray scattering techniques are used as
analytical tools which reveal information about the crystal structure and physical
properties of proteins. In these techniques, X-rays are scattered at the electrons
of the atomic shell, during which the electrons start oscillating. The scattered
intensity of the X-ray hitting a sample is then observed as a function of incident and
scattered angle, polarization, and wavelength or energy. This method is promising,
especially to investigate biological motions on systems that are either not easy to
crystallize or undergo large-scale conformational changes that cannot take place
within a crystalline environment [29]. Small-angle X-ray scattering [6] provides
information about the overall size and shape of the protein while wide-angle X-ray
scattering [21] gives more information regarding the fold of helices and sheets [9].
The combination of X-ray solution scattering with crystallography has also been
utilized to enhance the accuracy and efficacy of protein structure determination
[30, 33].
3 Physics-Based Computational Methods
3.1 Molecular Dynamics
In classical Molecular Dynamics (MD) simulations, Newton’s equations of motion

are used to determine how each atom in a molecule changes its position, velocity,
and acceleration. In other words, it is a computer simulation method used to
predict the evolution of an interacting system in time by generating potential atomic
trajectories of the system using Newton’s equation of motion [38]. Forces acting
on the atoms and their potential energies are often calculated using inter-atomic
potentials or molecular mechanics force fields. MD simulations are widely used
to study the kinetic behavior of proteins. It can provide an accurate representation
of a system’s potential energy landscape and sample the conformational spaces of
the protein. Given that proteins are complex dynamic structures that are constantly
interacting with molecules in their environment, it is remarkably valuable to be able
to simulate all of these interactions.
Even though experimental methods such as X-ray crystallography and others
mentioned above have been used to discover 3D structures –used as a starting
point in molecular dynamics simulation– they only provide snapshots of the protein
structures and are unable to capture full scale conformational changes under
certain physiological environmental conditions. MD simulations can be extremely
beneficial in the sense that they can be used to quantify the properties of a system on
an otherwise inaccessible time scale with a certain amount of accuracy. However,

simulation time is limited to the microsecond level at best and is therefore not fast
enough to capture conformational transitions that take place over larger timescales;
milliseconds, and more. This problem still exists even when the simulations are
implemented on state-of-the-art computers with high-level processing abilities since
more and more simulations are now needed to be done on more complex systems
with better accuracy [2].
Various dynamic processes of proteins have time scales ranging from femtosec-
onds to hours. As standard Molecular Dynamics fall short of capturing complex and
timely dynamics of proteins, researchers have come up with alternate computational
approaches; variants of Molecular Dynamics.
Steered molecular dynamics (SMD) offers simulations that have a time/position-
dependent force to steer the system along specific degrees of freedom. The steering
function enables the simulation to spend the computational resources efficiently on
particular events like binding to a ligand. In other words, during SMD simulations
external forces, dependent on time, are applied to assist and facilitate the process
of a ligand binding to or unbinding from a protein. Substantial information about
the relationship between the structure and function of the simulated protein-
ligand complex can be gained by analyzing the interactions of the associating or
dissociating ligand with the protein complex [22]. It is worthwhile to mention that
steered Molecular Dynamics simulations are the equivalent of the experimental
methods mentioned above, which are based on external mechanical manipulations
to the proteins [1].
Standard Molecular Dynamics simulations usually sample only one path of
the conformational space. Therefore, the simulation results rely on the choice of
initial conditions as these conditions determine the path explored by the simulation.
On the other hand, several properties of a molecular system, barriers between
energy minima, for instance, can be difficult to cross at room temperatures over
accessible simulation time scales. Replica exchange simulations (REMD), also
known as parallel tempering, attempt to enhance the sampling procedure done in
standard MD by running numerous independent replicas in marginally different
initial conditions, and periodically exchanging the coordinates of these replicas
between the ensembles. This technique helps the simulation overcome energy
minima barriers on the system’s potential energy surface [44]. That means the
exploration of free energy landscapes can be expedited by using REMD. In each
time period, (at least one picosecond) replicas are exchanged among close-by tem-
peratures. Each replica exchange relies on doing Monte Carlo moves that preserve
canonical equilibrium distributions at each temperature between ensembles. REMD
simulations motivated researchers to promote variants of REMD with enhanced
sampling methods: Hamiltonian Replica Exchange [18], and Replica Exchange with
Solute Tempering [31], to name a few.
In Targeted MD (TMD) simulations, a subset of atoms is guided towards a final
target structure, hence causing a conformational change, using a steering force. This
is done by employing a moving distance constraint to enforce unique transitions
[51].
3.2 Monte Carlo Based Search Method
One of the most efficient conformational searching methods is Monte Carlo (MC)
search. Each Monte Carlo search step has three main components: first, one
or more dihedral angles (or other movable degrees of freedom) of the starting
structure are chosen for manipulation. Second, they each get rotated by a random
amount as to generate a new conformation. This is called a “step”. Third, the
molecule gets energetically minimized. The new step is accepted or rejected based
on an optimization criterion. If the new conformation has lower energy than its
predecessor, it will be accepted. Otherwise, it will be accepted with a probability
proportional to the energy difference. This allows the search to accept some higher
energy conformations that could help it avoid being stuck in a local minimum. The
search ends when a low energy conformation is reached or after a certain number of
iterations. By repeating this search step and sorting and comparing all the different
generated conformers, it is possible to identify the different low-energy conformers
of the molecule. The entire process is called Monte Carlo Multiple Minimum
(MCMM) search. This procedure ends when the newly generated conformations are
the same as the ones already found [10]. Sometimes it is not possible to rotate all
the bonds. For instance, for cyclic molecules, we can’t rotate all the bonds without
cutting the ring. For these kinds of molecules, we can open the ring, rotate the
bonds, and then try to close the ring. If the ends are not close enough to each
other, we can revert the last step and choose another part randomly to break the
ring. Then the molecule gets minimized, and the whole process is repeated. One of
the problems of MC is the difficulty of finding all the low-energy conformations for
large molecules because the number of possible bonds’ changes grows exponentially
with the number of bonds in the molecule.
Sometimes our chance of finding a low energy conformation after changing one
or more bonds is very low. In this case we need MC to sample from a small low
energy region, A. However, a simple MC sampling may not sample from region A
even after many iterations. We can make MC do this by sampling from a distribution
that overweights region A. This weighting technique is called importance sampling.
It is a way to make Monte Carlo simulations converge much faster [40].
Another enhancement of the MC method called acceptance-rejection sampling
is very similar to importance sampling. In this method, instead of over-weighting a
region, all the high energy (or impossible) conformations are placed in a rejection
set, and the algorithm automatically rejects the new conformation if it belongs to
the rejection set [16].
MD (see Sect. 3.1) and MC search methods are the two most commonly utilized
techniques in physics-based computational methods. Both methods generally have
similar outlines, including consideration of molecules as collections of atoms,
employment of some force fields for the potential energy terms, and implementation
of recurrent boundary conditions. The major differences are in the forms of sampling
the available configuration space. MD needs 1.6–3.8 times more computation time
and it is 2–3 times less efficient than MC [24], but it produces a more physically
plausible path. Another difference between MD simulations and MC methods is
that despite MC, MD simulations update the velocities of particles as well, rather
than depending on just their positions.
4 Geometric and Robotics-Inspired Methods
The next class of methods are inspired by the robotic motion planning problem,
where the goal is to find a sequence of valid states (configurations) that moves the
object from the source to the destination in a physically constrained environment.
The idea is to consider proteins as robot-like objects. The conformational space
of a protein is modeled as the geometric space in which a robot operates (also
called the workspace). The physical constraints are captured in the energy function.
Many researchers have applied robotics-inspired methods since more than a decade
ago to simulate protein folding, protein conformational change, docking, binding
and others. Motion planning algorithms build a graph (or a “roadmap”) that
approximates the search space. Then a search is conducted in the graph to find a
path from the start to the goal [26].
4.1 Motion Planning Methods
As mentioned in Sect. 3.1, MD simulations can compute protein dynamics in the

range of microseconds. Using Monte Carlo methods enables the computations of
dynamics occurring at larger timescales, but they are still restricted. The two major
restricting factors of these methods are the fact that they tend to get stuck in energy
local minima and also the high cost of the computing energy values. They also do
not calculate the physical path.
A path planning approach proposed by Cortes et al. [12] tries to enhance
the speed of the conformational searching. They applied sampling-based motion
planning techniques to first filter efficient and geometrically feasible conformations
(considering the conformations that avoid steric clashes between their atoms while
satisfying the kinematic closure constraints associated with loops and hydrogen
bonds) and then do an energy refinement on the filtered conformations. In the first
stage, plausible conformations and pathways are found using geometric filtering to
expedite the computations in the refinement stage that use classical energy-based
methods. The geometric filtering procedure is remarkably important for proteins
that undergo large conformational changes.
Rapidly exploring random tree (RRT)[27] is a notable algorithm that efficiently
searches nonconvex, high-dimensional spaces. The algorithm builds a tree rooted
at the start configuration, using random samples from the search space, accepts
the plausible ones, and connects them to the closest vertex in the tree. To find a
path from a start conformation to a goal conformation, random conformations are
generated where each new conformation is created by rotating one of the dihedral
Fig. 2 An illustration of the RRT* algorithm. (1) A new node is added to the tree and its
nearest neighbors (based on a predefined distance metric) are highlighted. (2) The best (least-cost)
neighbor is chosen as the new node’s parent. (3) For each nearest neighbor, the algorithm checks
whether its path cost is less through the new node rather than its current parent. If so, its parent is
changed to the new node
angles of the last found conformation towards the goal conformation. If this new
conformation has a valid energy value, it is accepted and gets connected to the
closest conformation in the tree. The RRT algorithm is impressive due to the fact that
it can rapidly explore the search space. However, because of its random nature of
node selection and the fact that it expands to all areas of the workspace, in a protein’s
conformational landscape that is high-dimensional, the algorithm converges slowly
and does not always produce smooth and optimized paths. The optimized version of
the RRT algorithm, called the RRT* algorithm, was introduced by Karaman et al.
[25] to enhance the smoothness of the resulting paths by steering the system towards
the goal. It also uses a rewiring procedure that attempts to find a more direct path
rather than creating ragged ones. An illustration of the RRT* algorithm is shown in
Fig. 2. In (1), a new node is added to the tree with its nearest neighbors shown in a
circle. The least-cost parent is chosen for the newly added node is chosen among the
nearest neighbors in (2). Next, the cost of rewiring is checked for all the neighbors,
and the tree gets rewired if the newly added node would be a better parent for one
of its neighbors (3).
In [4], the RRT* method is used with an A*-like heuristic function and a rewiring
operation to produce smooth, more realistic paths that converge faster. Authors in
[14] evaluated their robotics-based algorithm called RRTMC [3] using Topological
Data Analysis (TDA) methods. They sampled the intermediate conformations that
RRTMC generates most and depicted that these conformations are close to existing
known intermediate conformations. They also attempted to show that TDA can be
useful in assessing the computational methods that sample protein conformation.
5 Machine Learning-Based Methods
Machine Learning (ML) is a method of data analysis that can learn and improve
automatically from experience (training data) without being explicitly programmed.
It’s a branch of Artificial Intelligence (AI) which concentrates on the use of data
and methods to simulate how humans learn. ML methods are being used for
different purposes, from stock market trading [39] to social media analyses [19].
Computational and structural biologists have also made many efforts to apply
ML methods to investigate protein structures and conformations. Protein structures
can theoretically be predicted by MD simulations as well, but MD simulations
are computationally expensive. Humans cannot interpret enormous sets of high-
dimensional data by simple visual analysis. Contrarily, ML methods are exactly
designed to process extensive data sets [17]. For example, Dehghanpoor et al. [15]
used ML methods on a very large dataset to find the effects of mutations on protein
stability. Some algorithms have been designed that can get complex arrangements
and highly nontrivial relationships from these large datasets and summarize this
information in the assessment of new data. A protein conformation is complex
and high-dimensional, with 3N spatial degrees of freedom (DOF) for a protein
with N atoms, but only a subset of this space is physically possible, because
physical constraints render a very large part of the conformations inaccessible.
Therefore, the protein configuration can be approximated as a manifold of lower
dimensionality, which is difficult to find and interpret. Hence, it is vital to use
dimensionality reduction techniques that reduce the complexity of our datasets
while maintaining the valuable information (important features of this manifold)
in the data. Unsupervised methods can help obtain metastable states from high-
dimensional simulation data and connect them to measurable observables [32]. We
need sufficient and reliable data to train these machine learning methods in order
to extract more and better predictions instead of running more simulations. The
problem arises when we have a data shortage for rare events, i.e., the ones for which
enhanced sampling is needed. As a solution, we should iterate between machine
learning and sampling methods [5]. The reduced dimensional space consists of
fewer parameters, called the reaction coordinates or collective variables (CV), while
retaining as much of the variance in the data as possible. Machine learning and
relevant techniques are increasingly being applied to provide a less heuristic way
for finding CVs. We should choose a good collection of CVs to answer a particular
question. For example, if the target is to improve the sampling of transitions from
one state to another, we should find CVs that represent the slowest motion of the
process. Or, for the answer to questions related to biophysics, a suitable set of CVs
should track the differences on the molecular level between different states.
5.1 Dimensionality Reduction Techniques
Dimensionality reduction techniques are commonly applied to obtain CVs. The

following list showcases some common techniques used to process data produced
from MD simulations:
1. Choosing the few first principal components from the principal component
analysis (PCA) [43].
2. Selecting the linear combinations of a set of descriptors chosen by Harmonic
mean Linear Discriminant Analysis (HLDA) [7].
3. Implementing time-lagged independent component analysis (tICA); This tech-
nique can create slow CVs by linearly combining potentially simple CVs (e.g.,
dihedral angles) such that the auto-correlations of the output data are maximized
[35].
4. Using autoencoders; Refer to Sect. 5.2 for more details.
5. Using spectral gap optimization of order parameters (SGOOP) [46]; They
estimate the best mix of reduced dimensional CVs according to the maximum
separation of timescales between visible and invisible (fast) transforms, i.e., path
entropy estimate of the spectral gap.
6. Selecting the essential internal coordinates chosen by a decision tree (XGBoost
algorithm) trained with MD simulations and metastable states [8].
5.2 Autoencoders
Recently, applications of ML and especially Neural Networks (NN), have become

widespread. We can improve the performance and speed of predictive NN models
by combining MD and ML. These NN models (that are much faster than the
simulations) are good, for example, for predictions in drug design, where a lot of
molecules need to be assessed. [34] shows how we can develop innovative and better
predictive ML models by using the produced data from the simulations.
An Autoencoder is a type of NN that tries to project the data to a lower dimension
and reconstruct it to get as close as possible to the original data. It starts with
an encoder and ends with a decoder. The first layer of the encoder has the same
size as the input data, and it is followed by several layers with decreasing number
of neurons in each layer. The last layer of the encoder, i.e., the latent space, is
a summarized version of the input data. The decoder has the same layers as the
encoder but in reverse order. It gets the latent space as input and projects the data
to a higher dimension in each layer, to eventually reconstruct the output in the same
shape as the input data. A schematic illustration of an autoencoder is shown in Fig. 3.
In this example, the encoder has four layers: input, latent, and two intermediate
layers. The decoder has the same number of layers with the same size but in the
reverse order.
We can train an autoencoder to project the protein structures to a low-dimensional
latent space and then use the second part of this trained network, i.e., the decoder,
as a generative model by changing the latent space a little and create a new protein
conformation close to the ones from the training set using that new latent space.
In [13], the author used this idea and trained their network with MD simulations
of a protein. They chose the latent space to be a 2D space so that any point in the
latent space generates a protein structure. We know that the generated structures are
Fig. 3 The illustration of an autoencoder with nine layers, including input, latent, output, and four
intermediate layers
close to some proteins from the training set, but they will not surely be physically
feasible structures. For this problem, they trained a Random Forest (RF) classifier
using valid structures from the MD simulations and invalid ones created by adding
noise to the valid structures. This RF model gets a protein structure and decides if it
is physically plausible or not.
Ramaswamy et al. [36] used an extra loss function in the training phase instead
of using RF after the training. Their second loss function is equal to the force field
of the generated protein structure. In this way, their autoencoder tries to generate
a plausible structure close to the input structure. They used many MD simulations
around the start and end proteins in a protein conformational change trajectory for
training the autoencoder and found a line from the cluster of conformations around
the start structure to the end structures. The decoder uses points at a close distance
to this line to generate intermediate structures in the conformational trajectory.
A novel approach was applied by Jin et al. [23]. In their method, instead of letting
the network decide the latent space, the user chooses the latent space to guide the
search. For this reason, they used two loss functions for training the network. One
loss function to compare the original and the reconstructed data and a second loss
function to force the latent space to be as close as possible to the user-defined space.
For each input data, they computed the desired latent space (for example, the first
two principal components of the 2D matrix of the pairwise RMSD between any pair
of input proteins) and fed it to the second loss function. This new approach aims
to examine whether we can use a very simple and understandable latent space to
predict new and physically valid protein structures.
Lemke et al. [28] also tried to make the latent space meaningful in another
way. They used a nonlinear distance metric cost function, named sketch-map, as
the second loss function for their autoencoder. The sketch-map cost function is
based on the pair-wise distances between data points in the high dimensional input
space and the low dimensional latent space. Sketch-map makes the latent space
more informative and prevents it from shifting toward large numbers. The low
dimensional representations of similar input data are collected in clusters, and the
latent space correctly captures major proximity relations between these clusters.
Wehmeyer et al. [50] explored a distinct type of autoencoder, i.e., time-lagged
autoencoder (TAE), where the loss function is designed in a way that tries to predict
the next frame in MD simulations instead of reconstructing the current frame of the
input structure. It helps not only in dimensionality reduction but also to generate
the next frames of the MD simulations at the same time and to capture the gradual
changes of the underlying stochastic processes.
6 Toolkits for Applying Machine Learning
Researchers have presented and utilized several methods, many of which are
mentioned in this chapter, to explore the conformational space of proteins. As a
consequence, there exist huge amounts of data regarding protein dynamics simula-
tions that need to be examined and evaluated. As an example, the Folding@home
project (https://foldingathome.org/) was used in MD simulations of more than a
dozen of the spike proteins of the SARS-CoV-2 virus [53] that resulted in 36
datasets that discovered potential drug design opportunities. In this section we revise
strategies that can be used as helpful tools to analyze the generated simulations and
conformations.
6.1 Topology and Clustering
In the last few years, different clustering techniques and topology analysis methods
received a lot of attention with the aim of studying molecular kinetics from
simulation data. In the free energy landscape of a protein, local minima where
metastable conformations of proteins can be found are of high interest. This
landscape is often rugged and contains a large number of local minima and so
is hard to navigate. Additionally, the fact that a protein contains many atoms and
its landscape has many degrees of freedom makes enumerating all its possible
conformations impractical. However, methods that attempt to sample this space,
have been presented in this work. Yet even the sampled space needs to be filtered and
clustered to provide meaningful information. Many clustering methods have been
designed to cluster the protein simulation data so that these clusters reflect the local
minima upon which the kinetics of the protein can be summarized in a model or a
graph. An example of such strategies is to build a disconnectivity graph for the local
minima and a Markov state model (MSM) for the metastable states. Chang et al.
[11] proposed a data exploration method, called Multi-Persistent Clustering (MPC),
to solve the model selection problem of MSMs by extending the topology analysis
concept used in building a disconnectivity graph. In [20], a persistent homology and

dimensionality reduction-based hierarchical method is presented to detect clusters
of intermediate structures that span the protein conformational space. In [14], the
Mapper algorithm is used to examine how close the intermediate conformations
generated by the RRT*-inspired algorithm (refer to Sect. 4) are to the ones found by
experimental results. Mapper [41] is a topological data analysis algorithm that uses a
combination of dimensionality reduction, clustering, and graph networking methods
to provide a graph (called the Mapper graph) that summarizes the structure of a
dataset. It is used to identify what intermediate conformations appear the most in
simulated conformational pathways and how close they are to existing experimental
data.
6.2 Using a priori Knowledge
Incorporating a priori knowledge about proteins can guide the search algorithms
toward finding physically feasible conformations and hence reduce time spent on
exploring inefficient spaces. This knowledge can be, for instance, evolutionary
information about the protein coming from studying its sequence or structure
or analysis of rigid residues in the protein that help predict which groups of
atoms are likely to maintain their rigid structure when the protein goes through
conformational changes. This information can be obtained from empirical and
experimental methods or previously done simulations. In [3], rigidity analysis
information and MC criterion are used to reduce the sampling search space and
guide the algorithm by expanding it in the direction of energetically feasible
conformations using perturbation near flexible residues.
7 Conclusions
Understanding the conformational dynamics of proteins is essential in order to

know their function. Experimental methods are limited in their ability to explore
protein dynamics, and many computational methods can bridge the gap. In this
chapter we survey various computational methods that explore the conformational
spaces of proteins. These include physics based Molecular Dynamics (MD), Monte-
Carlo (MC), geometry-based robotics motion planning methods, machine learning
methods and most recently autoencoders. Despite great progress, several challenges
remain. Most of these methods strike a balance between speed and accuracy, and
often sacrifice one to achieve the other. Additionally, the relative dearth of exper-
imental intermediate structures makes validation difficult. Recent developments in
deep learning and AI based methods are encouraging. Combined with improved
experimental validation methods, more accurate and efficient exploration methods
can be developed.
References
1. Computational Molecular Dynamics: Challenges, Methods, Ideas. Springer Berlin Heidelberg

(1999).
2. Adcock, S.A., McCammon, J.A.: Molecular dynamics: Survey of methods for simulating the
activity of proteins (2006).
3. Afrasiabi, F., Dehghanpoor, R., Haspel, N.: Integrating rigidity analysis into the exploration of
protein conformational pathways using rrt* and mc (2021).
4. Afrasiabi, F., Haspel, N.: Efficient exploration of protein conformational pathways using rrt*
and mc (2020).
5. Allison, J.R.: Computational methods for exploring protein conformations (2020).
6. Bernadó, P., Mylonas, E., Petoukhov, M.V., Blackledge, M., Svergun, D.I.: Structural charac-
terization of flexible proteins using small-angle x-ray scattering (2007).
7. Bonati, L., Rizzi, V., Parrinello, M.: Data-driven collective variables for enhanced sampling
(2020).
8. Brandt, S., Sittel, F., Ernst, M., Stock, G.: Machine learning of biomolecular reaction
coordinates (2018).
9. Cammarata, M., Levantino, M., Schotte, F., Anfinrud, P.A., Ewald, F., Choi, J., Cupane, A.,
Wulff, M., Ihee, H.: Tracking the structural dynamics of proteins in solution using time-
resolved wide-angle x-ray scattering (2008).
10. Chang, G., Guida, W.C., Still, W.C.: An internal-coordinate Monte Carlo method for searching
conformational space (1989).
11. Chang, H.W., Bacallado, S., Pande, V.S., Carlsson, G.E.: Persistent topology and metastable
state in conformational dynamics (2013).
12. Cortes, J., Simeon, T., Ruiz de Angulo, V., Guieysse, D., Remaud-Simeon, M., Tran, V.: A path
planning approach for computing large-amplitude motions of flexible molecules (2005).
13. Degiacomi, M.T.: Coupling molecular dynamics and deep learning to mine protein conforma-
tional space (2019).
14. Dehghanpoor, R., Afrasiabi, F., Haspel, N.: Using topological data analysis and rrt to
investigate protein conformational spaces (2021).
15. Dehghanpoor, R., Ricks, E., Hursh, K., Gunderson, S., Farhoodi, R., Haspel, N., Hutchinson,
B., Jagodzinski, F.: Predicting the effect of single and multiple mutations on protein structural
stability (2018).
16. Dieter, U., Ahrens, J.H.: Acceptance-rejection techniques for sampling from the gamma and
beta distributions. (1974)
17. Fleetwood, O., Kasimova, M.A., Westerlund, A.M., Delemotte, L.: Molecular insights from
conformational ensembles via machine learning (2020).
18. Fukunishi, H., Watanabe, O., Takada, S.: On the Hamiltonian replica exchange method for
efficient sampling of biomolecular systems: Application to protein structure prediction (2002).
19. Garcia, G.G.P., Dehghanpoor, R., Stringfellow, E.J., Gupta, M., Rochelle, J., Mason, E., Pujol,
T.A., Jalali, M.S.: Identifying online advice-seekers for recovering from opioid use disorder
(2021).
20. Haspel, N., Luo, D., González, E.: Detecting intermediate protein conformations using
algebraic topology (2017).
21. Hirai, M., Iwase, H., Hayakawa, T., Miura, K., Inoue, K.: Structural hierarchy of several
proteins observed by wide-angle solution scattering (2002).
22. Izrailev, S., Stepaniants, S., Isralewitz, B., Kosztin, D., Lu, H., Molnar, F., Wriggers, W.,
Schulten, K.: Steered molecular dynamics (1999).
23. Jin, Y., Johannissen, L.O., Hay, S.: Predicting new protein conformations from molecular
dynamics simulation conformational landscapes and machine learning (2021).
24. Jorgensen, W.L., Tirado-Rives, J.: Monte Carlo vs molecular dynamics for conformational
sampling (1996).
25. Karaman, S., Walter, M.R., Perez, A., Frazzoli, E., Teller, S.: Anytime motion planning using
the rrt* (2011).
26. Kavraki, L., Svestka, P., Latombe, J.C., Overmars, M.: Probabilistic roadmaps for path
planning in high-dimensional configuration spaces (1996).
27. Lavalle, S.M.: Rapidly-exploring random trees: A new tool for path planning. Tech. rep. (1998)
28. Lemke, T., Peter, C.: Encodermap: Dimensionality reduction and generation of molecule
conformations (2019).
29. Levantino, M., Yorke, B.A., Monteiro, D.C., Cammarata, M., Pearson, A.R.: Using syn-
chrotrons and xfels for time-resolved x-ray crystallography and solution scattering experiments
on biomolecules (2015).
30. Levantino, M., Yorke, B.A., Monteiro, D.C., Cammarata, M., Pearson, A.R.: Using syn-
chrotrons and xfels for time-resolved x-ray crystallography and solution scattering experiments
on biomolecules (2015).
31. Liu, P., Kim, B., Friesner, R.A., Berne, B.J.: Replica exchange with solute tempering: A method
for sampling biological systems in explicit water (2005).
32. Noé, F., De Fabritiis, G., Clementi, C.: Machine learning for protein folding and dynamics
(2020).
33. Putnam, C.D., Hammel, M., Hura, G.L., Tainer, J.A.: X-ray solution scattering (SAXS)
combined with crystallography and computation: defining accurate macromolecular structures,
conformations and assemblies in solution (2007).
34. Pérez, A., Martínez-Rosell, G., De Fabritiis, G.: Simulations meet machine learning in
structural biology (2018).
35. Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G., Noé, F.: Identification of slow
molecular order parameters for Markov model construction (2013).
36. Ramaswamy, V.K., Musson, S.C., Willcocks, C.G., Degiacomi, M.T.: Deep learning protein
conformational space with convolutions and latent interpolations (2021).
37. Ringe, D., Petsko, G.A.: [19]study of protein dynamics by x-ray diffraction (1986).
38. Roy, K., Kar, S., Das, R.N.: Computational chemistry (2015).
39. Shen, J., Shafiq, M.O.: Short-term stock market price trend prediction using a comprehensive
deep learning system (2020).
40. Siegmund, D.: Importance sampling in the Monte Carlo study of sequential tests (1976).
41. Singh, G., Memoli, F., Carlsson, G.: Topological methods for the analysis of high dimensional
data sets and 3d object recognition. In: M. Botsch, R. Pajarola, B. Chen, M. Zwicker (eds.)
Eurographics Symposium on Point-Based Graphics. The Eurographics Association (2007).
42. Smyth, M.S.: x ray crystallography (2000).
43. Spiwok, V., Lipovová, P., Králová, B.: Metadynamics in essential coordinates: Free energy
simulation of conformational changes (2007).
44. Stelzl, L.S., Hummer, G.: Kinetics from replica exchange molecular dynamics simulations
(2017).
45. Sun, L., Li, P., Ju, X., Rao, J., Huang, W., Ren, L., Zhang, S., Xiong, T., Xu, K., Zhou, X.,
Gong, M., Miska, E., Ding, Q., Wang, J., Zhang, Q.C.: In vivo structural characterization of
the SARS-CoV-2 RNA genome identifies host proteins vulnerable to repurposed drugs (2021).
46. Tiwary, P., Berne, B.J.: Spectral gap optimization of order parameters for sampling complex
molecular systems (2016).
47. Verkhivker, G.M., Di Paola, L.: Integrated biophysical modeling of the sars-cov-2 spike protein
binding and allosteric interactions with antibodies (2021).
48. Walls, A.C., Park, Y.J., Tortorici, M.A., Wall, A., McGuire, A.T., Veesler, D.: Structure,
function, and antigenicity of the sars-cov-2 spike glycoprotein (2020).
49. Wang, M.Y., Zhao, R., Gao, L.J., Gao, X.F., Wang, D.P., Cao, J.M.: Sars-cov-2: Structure,
biology, and structure-based therapeutics development (2020).
50. Wehmeyer, C., Noé, F.: Time-lagged autoencoders: Deep learning of slow collective variables
for molecular kinetics (2018).
51. Wolf, S., Stock, G.: Targeted molecular dynamics calculations of free energy profiles using a
nonequilibrium friction correction (2018).
52. Zhang, L., Lin, D., Sun, X., Curth, U., Drosten, C., Sauerhering, L., Becker, S., Rox, K.,
Hilgenfeld, R.: Crystal structure of sars-cov-2 main protease provides a basis for design of
improved α-ketoamide inhibitors (2020).
53. Zimmerman, M.I., Porter, J.R., Ward, M.D., Singh, S., Vithani, N., Meller, A., Mallimadugula,
U.L., Kuhn, C.E., Borowsky, J.H., Wiewiora, R.P., Hurley, M.F.D., Harbison, A.M., Fogarty,
C.A., Coffland, J.E., Fadda, E., Voelz, V.A., Chodera, J.D., Bowman, G.R.: Sars-cov-2
simulations go exascale to predict dramatic spike opening and cryptic pockets across the
proteome (2021).
Low Rank Approximation Methods
for Identifying Impactful Pairwise
Protein Mutations
Chris Daw, Brian Barragan Cruz, Nicholas Majeske, Filip Jagodzinski,

Tanzima Islam, and Brian Hutchinson
1 Introduction
Early approaches to infer the effects of a mutation in the physical protein involved
modifying the gene that encodes for a specific amino acid in a polypeptide, followed
by protein expression and purification via in vitro assays. Free energy of unfolding
studies can then be performed by denaturing the protein mutant and its non-mutated
form (wild type).
One way in which a residue can be identified as critical is by performing a
mutation in a physical protein, and then measuring the effect of the substitution.
Matthews et al. have designed and analyzed many mutants of lysozyme from
bacteriophage T4, and concluded that the unoccupied volume that is caused by
some mutations induces a collapse of that region, while in other cases the cavity
remains empty [1]. Therefore, mutating a large residue does not necessarily have a
measurable impact on the stability and structure of a protein. Also, residues that are
held relatively rigidly within the core of the protein make the largest contribution
to the protein’s stability [2], while residues with a high solvent accessibe surface
C. Daw · B. Barragan-Cruz · N. Majeske · F. Jagodzinski ()

Western Washington University, Bellingham, WA, USA
T. Islam
Texas State University, San Marcos, TX, USA
e-mail: filip.jagodzinski@wwu.edu
B. Hutchinson
e-mail: tanzima@txstate.edu
Pacific Northwest National Laboratory, Richland, WA, USA

64 C. Daw et al.
area are often not as critical, because their mutations often have little bearing on
the stability of the molecule. The Protherm database [3] includes thermodynamic
data (unfolding Gibbs free energy change arrived at via the Schellman equation [4],
enthalpy change, heat capacity change, transition temperature, activity etc.), struc-
tural information (secondary structure, accessibility etc.), measurement methods,
experimental conditions and literature information for wildtype proteins and their
mutants. It is the most extensive source of experimentally available data on the effect
of point mutations on the thermodynamics of proteins.
The experimental studies above are valuable but time consuming and often cost
prohibitive. Moreover some mutant proteins cannot be expressed, due to dramatic
destabilization caused by the mutation, but we would still like to infer whether they
are critical or not. Most importantly, assessing which pairs of mutations have the
greatest impact on the structural stability of a protein is beyond the scope of wet-
lab work, because engineering a set of exhaustive mutants with two amino acid
substitutions is infeasible due to the combinatorial growth of possible decoys. To
address this, computational and analysis techniques are available.
Computational approaches range from estimating folding free energy changes
upon mutations [5], to the use of machine learning (ML) and statistical methods
for predicting the effects of mutations and to infer which residues are critical.
Approaches which utilize Support Vector Machines (SVMs) are able to predict
with 84% accuracy the sign of the stability change for a protein induced by a
single point mutation [6]. Also, data of amino acid replacements that are tolerated
within families of homologous proteins has been used to devise stability scores for
predicting the effect of residue substitutions [7, 8], while force fields for predicting
protein stability have enabled fast and quantitative estimation of interactions that
contribute to complex stability [9].
More recent approaches leverage new biochemistry technologies capable of
mutating many residues, and assessing the effects of each amino acid substitution.
Alanine Scanning [10, 11] involves systematic site-directed mutagenesis, which
produces a library of mutants with an Alanine at each possible residue location.
High-throughput protein expression in coordination with assay runs can reveal
the functional effects of the Alamine mutations. A newer technology still, deep
mutational scanning [12], relies on high-throughput sequencing to generate upwards
of 1 million mutant versions of a protein, which are analyzed to reveal the effects of
the amino acid substitutions via function assays. These new technologies are capable
of providing insights about the effects of a wide range of amino acid substitutions,
but still do not permit an exhaustive-style analysis due to the combinatorial
explosion when 2- or 3-site amino acid substitutions are performed. For example,
for a n = 200 residue protein, there are n2 × 192 = 7,183,900 possible protein
variants which have 2 residue mutations relative to the wild type. This number is
beyond the range and capability of even deep mutational scanning techniques.
Low Rank Approx. for Identifying Impactful Pairwise Protein Mutations 65
Motivation and Contributions

The majority of existing energy, homology and/or ML-based methods reason
about the effects of single mutations [13, 14], despite there being well-known
proteins whose structure and function are greatly impacted in response to multiple
mutations [15, 16]. One example is multiple mutations to the active site of α-
galactosidase cause Fabry disease, a lysosomal disorder, that results in a wide range
of symptoms including cardiac complications [16].
Note that for those proteins where free energy calculations and wet-lab values
are available, summing the free energy data for simple mutations cannot be used
to predict the effects when those mutations are performed all at once. For example,
the single W94L mutation in Barnase Bacillus amyloliquefaciens yields a G
of −1.59 (ProTherm entry 2262), and the single H18G mutation yields a G
of −0.98 (ProTherm entry 2263). These two sum to −1.59 + −0.98 = −2.57.
However, when both mutations are performed at the same time in the physical
protein, the experimental G value is −1.17 (ProTherm entry 2264).
For this work, we strive to develop a framework capable of reasoning about which
pairs of mutations are the most impactful to the structural stability of a protein. Even
using computational approaches makes this a challenging and big data task, because
of the vast number of possible mutants with two amino acid substitutions that can
be engineered for even a small protein.
This work extends our previously published research in [17], in which we
developed a computational pipeline for generating mutants with two amino acid
substitutions. We used it to generate an exhaustive set of possible mutants for each
of eight proteins. Our contributions in this work include a method to accurately
approximate the exhaustive data using a fraction of the total samples. In general, the
fewer samples these empirical models are based upon, the more computationally
efficient they will be, but at the expense of approximation quality. To counter-act the
effect of random noise on the empirical models, we employ smoothing techniques
based on sparse and low rank matrix approximation techniques, yielding low rank
estimates that are able to filter out noise and improve approximation quality.
2 Related Work
Most methods that strive to infer how a mutation affects the stability of a protein
reason about single amino acid substitutions only. They include PoPMuSiC 2.1 [18]
which makes predictions about G and generates a sequence optimality score,
AutoMute [19] which is an ML-based approach that requires a large training
set, and CUPSAT [20], which relies on energy potentials (atomic and torsional
angles), whose calculation is time intensive. Others include D-Mutant [21], I-
mutation2.0 [22], and STRUM [23]. Respectively these rely on residue-specific
all-atom potential calculations, structural and sequence information, and multiple
threading template alignment. An Unfolding Mutation Screen (UMS) is also
available, that relies on propensity tables and free energy calculations [24].
66 C. Daw et al.
A few approaches permit reasoning about the effects of multiple mutations, but
none are able to perform screening-like analyses. These include MAESTRO and
MAESTROweb [25], as well as DUET [26]. In our recent work, we developed a
compute pipeline for generating in silico all mutants with pairwise mutations [27],
and generated an Allostery Impact Map to identify pairs of residues that cause a
disruption to the protein’s stability.
Our low rank smoothing techniques are inspired by work in matrix completion
[28–31], rank regularized least squares [32] and sparse plus low rank decompo-
sitions [33–35]. Matrix completion is a task in which a subset of the entries of
a matrix are known and the remainder need to be inferred (completed). Perhaps
the most famous example is the “Netflix Prize,” [36] which involves an user-
movie matrix, where the (i, j )th element is user i’s rating of movie j . Given an
incomplete set of user-movie ratings, the remainder of the matrix is to be inferred.
Many solutions to this problem [37] assumed a low rank matrix structure, which
corresponds to an assumption that there are a relative small number of underlying
factors governing movie preferences—phrased another way, that there are kinds of
users and kinds of movies. Rank regularized least squares [32] involves solving a
least squares problem with a nuclear norm (also known as trace norm, Ky-Fan r-
norm, or Schatten-1 norm) regularization term, which is a convex relaxation of the
rank function. These problems differ from matrix completion in that the low rank
approximation’s corresponding elements are close, but not identical to, the known
elements. Some our approaches assume a sparse plus low rank decomposition, in
which a matrix M is approximated as the sum of a sparse matrix and a low rank
matrix. Nearly all of the above utilize singular value decompositions and the optimal
approximation result of the famous Eckart-Young-Mirksy theorem [38].
3 Methods
Exhaustive mutation sets have been used in the past to explore and identify
impactful amino acid substitutions via Allostery Impact Maps (AIMs) [27, 39].
However, generating all possible mutants with two amino acid substitutions can take
several weeks—even months—of compute time. In this paper, we present a multi-
phase compute pipeline (Fig. 1) for evaluating several sampling and approximation
methods to reduce the number of pairwise mutations required to make good
predictions, based on rigidity analysis, about the effects of pairwise mutations.
The pipeline has been engineered to scale the prediction process to large proteins
and resulting exhaustive pairwise mutation sets. There are three phases to the
pipeline.
Phase 1: Generating exhaustive pairwise mutation set—This phase is required
for validating the effectiveness of our sampling methods. There are two tasks:
(i) identifying the exhaustive set of all possible mutants having two amino acid
substitutions, and (ii) analyzing the effects of these pairwise mutations using rigidity
analysis. Phase 2: Sampling from exhaustive pairwise mutation set—We utilize six
Fig. 1 Pipeline: We generate an exhaustive mutation set (blue), use sampling methods to generate
empirical Allostery Impact Maps (yellow), then reconstruct the empirical AIMs using sparse-plus-
low-rank, low rank, and svd matrix reconstruction methods (red)
biologically-inspired sampling methods along with random sampling to analyze the

impact of different pairwise mutations types. Phase 3: Low Rank smoothing—To
improve the approximation quality of the empirical (sampled) AIMs, we consider
three different sparse and/or low rank constraints or regularization terms to smooth
the AIMs. This smoothing reduces noise and improves approximation quality. We
detail each of these phases, as well as details of the tasks involved, in the following
subsections.
3.1 Phase 1: Generate Exhaustive Pairwise Data
In this phase we generate mutant structure and analyze rigidity in order to construct
an allostery impact map to understand the effect of pairwise mutations.
Generating Mutant Structures: We use the ProMuteHT software [40] to generate
an exhaustive sample of mutants with two amino acid substitutions for eight proteins
(Table 1). So that we could validate our approach against experimental data about
the effects of mutations done on physical proteins, for this work we focus on the
164-residue lysozyme from bacteriophage T4, in the PDB file 2LZM. Lysozyme
was the second protein, and the first enzyme, whose structure was solved in 1965
using X-ray crystallography. It is among the more well-studied proteins for which
many mutations have been performed via mutagenesis experiments. A large count of
entries in the ProTherm database are for lysozyme, and thus there is ample stability
data—G—for a variety of mutants.
Rigidity Analysis: Rigidity analysis [41] is a fast graph-based method that
identifies rigid regions of biomolecules [42]. Atoms and their chemical interactions
68 C. Daw et al.
Table 1 PDB files used, and PDB file Num residues Mutants Runtime
mutants generated
1CRN 46 373,635 2.1 hr
1PGA 56 555,940 3.3 hr
1BPI 58 596,733 3.8 hr
1ROP 63 705,033 4.7 hr
1CSP 67 798,171 6.1 hr
1VQB 87 1,350,501 8.3 hr
1HHP 99 1,751,211 14.3 hr
2LZM 164 4,825,126 48.9 hr
Fig. 2 PDB file 1CSP
are used to construct a mechanical model and associated graph of a protein, whose
analysis via a pebble game algorithm [43] identifies rigid clusters of atoms (Figs. 2
and 3). For this work, we tally the counts and distribution of rigid clusters in the
wild type, as well as a mutant, to quantitatively assess the effect of the amino acid
substitutions performed in silico. We use the following rigidity metric (see [39]):
i=LRC
RDW T →mutant : i × [W Ti − Muti ] (1)
i=1
where W T refers to Wild Type, Mut refers to mutant, and LRC is the size of the
Largest Rigid Cluster (in atoms). Each summation term of RDW T →mutant calculates
the difference in the count of a specific cluster size, i, of the wild type and mutant,
and weighs that difference by i.
Allostery Impact Map: We use the rigidity analysis data to create an Exhaustive
Mutation Tensor, T ex ∈ Rn×n×361 . The (i, j, k)th element, Tijexk , contains the
rigidity data for performing the kth pair of substitutions (out of 192 = 361 total
possible pairs of substitutions) at residues i and j .
From T ex , we build an Exhaustive Allostery Impact Map (AIM), M ex , [27]
which provides an infographic (Figs. 4 and 5) based on quantitative data for
reasoning about the effects of mutating two residues. The color of any one cell in the
AIM designates a sum value of all the metrics for all of the 361 mutant structures
when the residues indicated by the x and y values are exhaustively mutated.
Fig. 3 Rigidity analysis for

PDB file 1CSP identifies
atoms belonging to the same
rigid clusters
Fig. 4 An Allostery Impact

Map (AIM) is a heatmap of
rigidity metrics for pairs of
mutations. The color at grid
entry i, j designates an
aggregate rigidity metric for
all possible mutations at
residues i and j
Fig. 5 Sample Allostery

Impact Map (AIM)
reproduced from [27]
70 C. Daw et al.
Because of the large count of structures that make up an exhaustive pairwise

mutation set for a protein, we distribute the computational tasks for Phase 1
among 165 machines. Each machine further subdivides each task via process-level
parallelism by spawning one mutex process for mutation for each available compute
core. We achieve a process-level granularity of 19k nk /(165 · 8) when generating
all possible protein mutants containing k = 2 amino acid substitutions for a protein
with n residues. Our compute pipeline leverages the knowledge that no two pairwise
protein mutations depend on each other to parallelize the generation and analysis of
all pairwise mutations, and the mutually independent computation tasks are run in a
distributed computing environment. The use of parallel computing ensures that the
Phase 1 finishes fast and uses local storage space to ensure that I/O bound tasks do
not overwhelm the attached network filesystem.
3.2 Phase 2: Sampling Methods
As described in the previous section, constructing the exhaustive AIM is highly

computationally intensive. One central question of our work is whether high quality
approximations of the exhaustive AIM can be made using a subset of all possible
mutations. We first consider random sampling, where we vary both the number of
sites sampled and the number of samples overall. We additionally explore several
biologically-inspired sampling strategies; drawing pairwise mutations whose sub-
stituted amino acids which
1. Change from hydrophilic to hydrophobic,
2. Change from hydrophobic to hydrophilic,
3. Exist at the core of the biomolecule,
4. Exist at the surface of the biomolecule,
5. Engage in hydrogen bonds, or
6. Engage in hydrophobic interactions.
Each of these sampling strategies is used to produce an empirical AIM, denoted
M emp and derived analogously to the exhaustive AIM, but using only the subset of
mutations sampled. We define two strategies for “completing” the empirical AIM
when mutation pair sites go unsampled: in “unfilled” empirical AIMs unsampled
mutation site pairs are left as zero, while in “filled” empirical AIMs unsampled
mutation site pairs are set to the average of all sampled mutation site pairs. Table 2
lists the fraction of sites and mutations that can be sampled using each of the
biologically inspired sampling strategies for protein 2LZM.
The SASA value of an amino acid is the total surface area of that amino acid
exposed to the surrounding solvent. This metric can be computed in Angstroms2 by
Free SASA, an open source SASA calculator. Free SASA uses the Lee & Richards
algorithm with a probe-radius of 1.2 Angstroms [44]. To compute the percent SASA
values for all 192 164
2 mutations, we distributed the work across multiple machines.
Table 2 The maximum Sampling method % Sites % Mutations

number of sites and samples
that can be obtained on Hydrophilic to Hydrophobic (phob) 28.0 6.3
2LZM with each biologically Hydrophobic to Hydrophilic (phil) 21.9 7.3
inspired sampling strategy At Core (core) 86.4 39.6
At Surface (surf) 30.6 13.5
Hydrogen Bonds (hbond) 74.5 39.5
Hydrophobic Bonds (phobbond) 47.9 7.3
We were ultimately restricted to 500 processes causing multiple runs. Our total
compute time for SASA was approximately 15 hours.
Using this exhaustive set of SASA values, we are able study the impact of
pairwise mutations whose substituted amino acids exist at the core and surface of the
biomolecule. We consider a pairwise mutation to be at the core of the biomolecule if
its substituted amino acids are below 60% SASA. We consider a pairwise mutation
to be at the surface of the biomolecule if both its substituted amino acids are at or
above 60% SASA.
Let W T Seq be the amino acid sequence of length n for a wildtype protein
and W T Seqi denote the ith amino acid of that sequence for i ∈ {1, 2, . . . , n}.
Additionally, let M denote the set of mutation sites at which amino acid substitutions
have been made for a given protein mutation. Note that M = k = 2 in this
work. For k = 2, the number of possible mutations for any mutation site pair M is
192 = 361, meaning that MTpho and MTphil account for up to 22.4% and 33.5%
of the set of mutations for a given site pair M. Hence, the overhead of evaluating
multiple sampling methods for several exhaustive pairwise sets is bound by the
largest exhaustive pairwise mutation set.
3.3 Phase 3: Smooth Approximation Methods
While empirical AIMs are far more efficient to compute than exhaustive AIMs,
they tend to be highly noisy due to their inherently incomplete nature. In this
phase we explore several strategies for smoothing their values in order to improve
approximation quality. These approximations “fill in” the missing information by
making some assumptions about global structure of the exhaustive AIM; namely
that the exhaustive AIM is low rank, possibly with additive sparse noise.
The rank of a matrix is the number of linearly independent columns (and rows)
in the matrix; equivalently, it is the number of non-zero singular values. Rank can
be thought of as a notion of complexity in the matrix: low rank matrices can be
explained by a relatively small number of underlying factors. Figure 6 plots the
singular values (in the conventional descending order) for the exhaustive AIMs for
the proteins we considered. While none of the matrices are exactly low rank, all are
approximately low rank: most of the singular values are approximately zero.
72 C. Daw et al.
Fig. 6 Singular values for 8 proteins, revealing approximate low rank structure. Singular values
were normalized by dividing by the largest singular value
3.3.1 Singular Value Decomposition Allostery Impact Map
Our first method produces a low rank approximation of the empirical AIM by
solving the following convex optimization problem:
arg min M emp − M F (2)

M
s.t. rank(M) ≤ R (3)
where · F is the Frobenius norm and R is the desired rank (a value to be

assessed empirically). The theorem of Eckert-Young-Mirsky states that the closed
form solution to this problem is:
emp
MR =U RV
T
, (4)
where U and R are the left and right singular values of M emp , respectively, and is
the matrix whose diagonal contains the singular values of M emp ; all three matrices
can be obtained by a singular value decomposition. R is with all but the R largest
singular values replaced by zeros. Our Singular Value Decomposition (SVD) AIM,
emp
M svd , is defined to be MR , the optimal rank R approximation of M emp . Note that
this assumes we want to approximate all elements of M emp , which is suboptimal

when using a sampling strategy that does not sample all sites.
3.3.2 Low Rank Allostery Impact Map
Instead of approximating M emp everywhere, we ideally want to approximate it only

at the sampled sites, ignoring any dummy values plugged in at the unsampled sites.
We can accomplish this by solving the following optimization problem:
arg min W ◦ (M emp − M) ◦ (M emp − M) 2

F (5)
M
s.t. rank(M) ≤ R (6)
where ◦ denotes elementwise multiplication. W is a weighting matrix; Wij is

proportional to the number of samples at site (i, j ), with zeros in all unsampled
sites. This pushes the approximation to be consistent with M emp for sites that are
“trustworthy,” leaving it free to fill in unsampled sites as it sees fit.
Unfortunately, the optimization problem (5) is non-convex and difficult to
optimize. Instead, we solve the following convex relaxation of the problem:

F + γ0 M ∗. (7)
M
Here M ∗ denotes the nuclear norm of M, which is the sum of its singular values
and is a convex function of M. Penalizing the nuclear norm of a matrix has the effect
of reducing its rank. The regularization coefficient γ0 controls the extent to which
we penalize the rank. We refer to this approximation as the Low Rank (LR) AIM.
3.3.3 Sparse Plus Low Rank Allostery Impact Map
While the exhaustive AIMs appear to have strong low rank structure, they also
exhibit some sparse patterns; in the visual representation of the empirical AIMs,
these patterns look like sparse noise. Rather than force the low rank matrix to
capture these sparse patterns, and artificially drive up its rank, we also consider
an approximation that is the sum of a low rank matrix and a sparse matrix. This is
accomplished with the following variant of the low rank optimization problem:

F + γ0 L ∗ + γ1 S 1 . (8)
M=S+L
In this objective, γ1 is the regularization coefficient on an 1 penalty: S 1 =

i,j |Sij |. This term has the effect of making S sparse (i.e., primarily zero);
γ1 controls the extent to which we sparsify S. We refer to this approximation
M = S + L as the Sparse and Low Rank (SLR) AIM.
74 C. Daw et al.
Fig. 7 An example of a Sparse plus Low Rank approximation of the 1CRN protein. Matrices a
and b are added together to produce matrix c, the SLR AIM, which is a low rank smoothing matrix
d, the empirical AIM
An example of decomposition on the 1CRN protein is shown in Fig. 7. In this

example, the values in the sparse matrix are likely noise in the empirical matrix. We
tried fitting a SLR model but then using only the low rank component (to filter out
this noise), but found that it performed worse than the LR and SLR methods, and
thus do not report those results in this paper.
The optimization problems for both the LR AIM and the SLR AIM are solved
with the block-coordinate accelerated proximal gradient descent algorithm listed in
Algorithm 1 from our previous work [17], which in turn adapted it from [45, 46]. In
the algorithm, Sμ is the soft-thresholding operator:
Sμ (X) = sgn(X) ◦ max(0, |X| − μ) (9)
For the SLR results, we tune a wide range of γ0 and γ1 values; for the LR, we tune
γ0 but fix γ1 ≈ ∞ to prevent the sparse component from being used.
3.4 Evaluation Metrics
We evaluate the quality of approximation using Sum Absolute Errors (SAE) relative
to the exhaustive AIM:
n n
SAE = |Mijex − Mij | (10)
i=1 j =1
and Root Mean Squared Error (RMSE) relative to the exhaustive AIM:
Algorithm 1 SLRD ecomposition()

1: S ← L ← pS ← pL ← 0; t ← t ← 1
2: while not converged do
3: mL ← pL + ((t − 1)/t)(pL − L)
4: mS ← pS + ((t − 1)/t)(pS − S)
5: mX ← mL + mS
6: Pick/ModifyτS
7: gS ← mS − (1/τS )∇mX [W ◦ (X − M emp ) ◦ (X − M emp )]
8: pS ← Sγ1 /τS (gS)
9: S ← pS
10: mX ← mL + pS
11: Pick/ModifyτL
12: gL ← mL − (1/τL )∇mX [W ◦ (X − M emp ) ◦ (X − M emp )]
13: U, , V = SV D(gL)
14: pL ← U Sγ0 /τL ( )V T
15: L ← pL √
16: t ← t; t ← 1 + 1 + 4t 2 /2
17: end while

n n
1
RMSE = (Mijex − Mij )2 . (11)
n2
i=1 j =1
In both cases, M is one of the following: an empirical AIM (M emp ), a singular value
decomposition AIM (M svd ), a sparse plus low rank AIM (M slr ), or a low rank AIM
(M lr ). As the number of samples increases, M emp approaches M ex and therefore its
RMSE approaches zero.
4 Results: SVD Smoothing
As we previously reported, and as shown in Figs. 8, 9, 10 and 11 the SVD approxi-

mation consistently improves SAE compared to the “filled” approximation method.
We see that smaller values for rank, R, improve the result. As R approaches the total
number of residues for 1CRN, 46, M svd approaches M emp and the improvement
converges to 0 since the matrices become identical. The “Mutation to Hydrophobic”
sampling method shows the greatest improvement when using the SVD smoothing
method. This suggests this subset of sites contains more information on the low-
rank, banded nature of the empirical matrix. In Figs. 12, 13, 14 and 15 we see
the same analysis performed on 1PGA. Interestingly, it shows steadily improving
performance as the percentage of site-pairs sampled increases to 100%. This result
suggests that all mutation sites of 1PGA encode unique, important information to
reconstruct the empirical matrix, more so than in 1CRN, which held a fairly constant
approximation error as a function of the amount of sampling.
76 C. Daw et al.
Fig. 8 Percent improvement in SAE approximating 1CRN by SVD relative to the “filled”
empirical method at various sampling rates of 19 mutations across 25% of mutation site pairs
78 C. Daw et al.
Fig. 12 Percent improvement in SAE approximating 1PGA by SVD relative to the “filled”
empirical method at 25% of mutation site pairs
80 C. Daw et al.
Fig. 16 Empirical approximation error against SVD approximation error for various ranks when
sampling across 25% of mutation site pairs for 2LZM
4.1 SVD Approximation and Sampling Error
We plot the sum of absolute error (SAE) of the empirical and SVD approximated
matrices using three sampling strategies at 25% site pairs sampled for 2LZM (Y-
axis) against rank (X-axis) in Fig. 16. The plot shows a clear optimum along the
rank axis centering around R = 32. As mentioned above, as the rank approaches the
total number of residues, the SVD approximation approaches the empirical matrix
and therefore the SAE for the SVD approximation approaches the empirical SVD.
Figure 17 shows a similar graph for 1HHP with 25% site pairs sampled. The value of
the optimal rank for the approximation method is much lower, at R = 2, suggesting
the underlying stability structure is relatively simple compared to 2LZM. The plot
also shows the SAE for the “Mutation to Hydrophobic” to be much higher than
the other sampling methods. This can be explained biophysically since a mutation
from a hydrophilic to a hydrophobic residue could cause the surface residue to be
energetically unfavorable.
Figure 18 shows M ex , M emp and M svd for 1CRN. From this figure, we observe
that the SVD method is able to capture the banded, low rank nature of the exhaustive
AIM, although it does somewhat over-smooth. For other proteins (not shown), the
SVD method demonstrates similar behavior.
Fig. 17 Empirical approximation error against SVD approximation error for various ranks when
sampling across 75% of mutation site pairs for 1hhp
Fig. 18 1CRN : Exhaustive (left), empirical approximation (upper right) and SVD approximation
(low right) AIMs
82 C. Daw et al.
5 Results: Case Study on 2LZM
Next, we explore the full range of sampling and approximation methods on

2LZM, to better understand their relative strengths. Figure 19 shows a systematic
comparison of all sampling and approximation methods on 2LZM. In addition to
the visualized reconstruction of the ground truth empirical matrix (last column), the
RMSE between the reconstruction and M ex is overlayed. While the SLR and LR
smoothing methods are more computationally intensive than the SVD method, they
almost always outperform SVD smoothing when the sampling percentage is small
(it is 6% for this figure). This can be seen both qualitatively (visual inspection)
Fig. 19 AIM heatmaps for all sampling methods and all approximation methods. Each heatmap
shows its RMSE relative to the ground truth
Fig. 20 Correlation plot for atCore sampling method, SLR approximation method, and 6%
sampling
and quantitatively according to RMSE. We note that the LR method generally

outperforms SLR, presumably due to SLR “memorizing” the sparse outliers (i.e.,
keeping too much noise from the empirical matrix). We also note that the different
sampling strategies yield very different empirical matrices, some more effective than
others (addressed further below).
Figure 20 show the correlation between rank, sparsity and RMSE. We first
note that low rank yields the best RMSE. This supports our original intuition that
AIM matrices are approximately low rank by nature. Likewise, lower sparsity also
correlates with lower RMSE; this is also intuitive since the less sparse S is, the
less smoothed M = L + S is. With dense enough S, M just memorizes the noise
in the empirical matrix. We further see, as expected, an inverse relation between
rank and the number of non-zero entries in S, since they both have to do with the
representational power of M, and when M is the sum of a high rank with a non-
sparse matrix, it again insufficiently smooths.
Figure 21 shows the RMSE as a function of sampling percentages for all
approximation methods. Each data point on a line is the average RMSE across
the seven sampling strategies. It reveals that across a range of small sampling
percentages, the LR and SLR approximation methods perform best, while the
unsmoothed empirical method lags behind. We focus on these small sampling
percentages because they are the most computationally feasible ranges for large
proteins.
Figure 22 plots complementary information: the sampling method, averaged over
all approximation strategies. We see that random sampling outperforms all other
sampling strategies for most sampling rates, presumably due to the fact that the
empirical matrix rarely lacks entire columns or rows, so there is some amount of
information about every mutation site. atCore performs second best overall, which
is biologically plausible, considering that work by Matthews et al. in mutagenesis
experiments performed on physical proteins has shown that residues at the core of
the protein have the largest contribution to the protein’s stability, and that surface
residues often have little impact on the molecule’s stability [1, 2].
84 C. Daw et al.
Fig. 21 Line plot for all approximation methods, where each line is averaged across all sampling
methods
Fig. 22 Line plot for all sampling methods, where each line is averaged across all approximation
methods
Furthermore, when the sampling percentage is lowest (and thus even random
sampling has missing rows or columns), the biologically-inspired atCore and
areHydrogenBonded sampling methods yield the best performance.
6 Conclusions and Future Work
Using the 2LZM rigidity analysis results with pairwise mutations, we measured the
sampling strategies across 0.01 and 0.06 sampling portions due to low sampling
numbers with certain sampling strategies. Surprisingly, the random sampling strat-
egy performed the best, which could be due to the sparsity of the information in
2LZM needed for optimal reconstruction. In addition the atCore sampling strategy is
promising. We found that the LR approximation method consistently outperformed
the others including SLR. It seems that SLR applies too much emphasis on the
sparsity matrix values, which increases error. Hence, omitting the sparse component
improved the approximation. It is the case with LR that a sweetspot between
sparsity and rank exists. In general, it seems that random sampling with the LR
approximation provides an optimal approximation. It would be worth exploring
how well these approximation and sampling methods perform on new proteins
for further validation of LR approximation and random sampling. In addition, an
increase in total protein mutations could reveal different insights into our methods.
An extension of our current method would be a direct operation on the 3D tensor
rather than a flattened matrix. Doing so makes way to applications of active learning
and quantification of information per mutation.
References
1. J. Xu, W.A. Baase, E. Baldwin, and B.W. Matthews. The response of T4 lysozyme to large-to-
small substitutions within the core and its relation to the hydrophobic effect. Protein Science,
7(1):158–177, 1998.
2. T. Alber, S. Dao-pin, J.A. Wozniak, S.P. Cook, and B.W. Matthews. Contributions of hydrogen
bonds of Thr 157 to the thermodynamic stability of phage T4 lysozyme. Nature, 330:41–46,
1987.
3. M.D. Kumar, K.A. Bava, M.M. Gromiha, P. Prabakaran, K. Kitajima, H. Uedaira, and
A. Sarai. Protherm and pronit : Thermodynamic databases for proteins and protein-nucleic
acid interactions. Nucleic Acids Research, 34:D204–D206, 2005.
4. J Schellman. The thermodynamic stability of proteins. Annual rev. of biophysics and chem,
16(1):115–137, 1987.
5. D. Gilis and M. Rooman. Predicting protein stability changes upon mutation using database-
derived potentials: Solvent accessibility determines the importance of local versus non-local
interactions along the sequence. Journal of Molecular Biology, 272(2):276–290, 1997.
6. J. Cheng, A. Randall, and P. Baldi. Prediction of protein stability changes for single-site
mutations using support vector machines. PROTEINS: Struct Func & Bioinfo, 62:1125–1132,
2006.
7. C.M. Topham, N. Srinivasan, and T. Blundell. Prediction of the stability of protein mutants
based on structural environment-dependent amino acid substitutions and propensity tables.
Protein Engineering, 10(1):7–21, 1997.
8. CL Worth, R Preissner, and L Blundell. Sdm-a server for predicting effects of mutations on
protein stability and malfunction. Nucleic Acids Research, 39(Web Server Issue):W215–W222,
2011.
86 C. Daw et al.
9. R. Guerois, J.E. Nielsen, and L. Serrano. Predicting changes in the stability of proteins and
protein complexes: A study of more than 1000 mutations. Journal of Molecular Biology,
320(2):369–387, 2002.
10. Brian C Cunningham and James A Wells. High-resolution epitope mapping of hgh-receptor
interactions by alanine-scanning mutagenesis. Science, 244(4908):1081–1085, 1989.
11. Tanja Kortemme, David E Kim, and David Baker. Computational alanine scanning of protein-
protein interfaces. Sci. STKE, 2004(219):pl2–pl2, 2004.
12. Douglas M Fowler and Stanley Fields. Deep mutational scanning: a new style of protein
science. Nature methods, 11(8):801, 2014.
13. S Henikoff and PC Ng. Predicting the effects of amnio acid substitutions on protein functions.
Annual Reviews of Genomics Human Genetics, 7:61–80, 2006.
14. S Teng, E Michonova-Alexova, and E Alexov. Approaches and resources for prediction
of the effects of non-synonymous single nucleotide polymorphisms on protein function and
interactions. Cur. Pharmacology Biotech., 9:123–133, 2008.
15. SY Rhee, J Taylor, J Fessel, D Kaufman, W Towner, P Troia, P Ruane, J Hellinger, V Shirvani,
A Zolopa, and R Shafer. Hiv-1 protease mutations and protease inhibitor cross-resistance.
Antimicrobial Agents & Chem., 59(8):4253–4261, 2010.
16. Garman SC and Garboczi DN. Structural basis of fabry disease. Molecular Genetics and
Metabolism, 77:3–11, 2002.
17. N Majeske, Jagodzinski, B Hutchinson, and T Islam. Low rank smoothed sampling methods
for identifying impactful pairwise utations. In Proc. CSBW, 2018.
18. Y Dehouck, J Kwasigroch, M Gilis, and Rooman M. Popmusic2.1: a web server for the
estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinfo,
12, 2011.
19. M Masso and I Vaisman. Auto-mute: web-based tools for predicting stability changes in
proteins due to single amino acid replacements. Protein Engineering Design and Selection,
23(8):683–687, 2010.
20. V Parthiban, M Gromiha, and D Schomburg. Cupsat: prediction of protein stability upon point
mutations. Nucleic Acids Res, 34(suppl 2):W239–W242, 2006.
21. H Zhou and Y Zhou. Distance-scaled, finite ideal-gas reference state improves structure-
derived potentials of mean force for structure selection and stability prediction. Protein science,
11(11):2714–2726, 2002.
22. E Capriotti, P Fariselli, and R Casadio. I-mutant2 : predicting stability changes upon mutation
from the protein sequence or structure. Nucleic Acids Res., 33(suppl 2):W306–W310, 2005.
23. L Quan, Q Lv, and Y Zhang. Strum: structure-based prediction of protein stability changes
upon single-point mutation. Bioinfo, 32(19):2936–2946, 2016.
24. C McCafferty and Y Sergeev. In silico mapping of protein unfolding mutations for inherited
disease. Scientific Reports, 6:37298, 2016.
25. J Laimer, H Hofer, M Fritz, S Wegenkittl, and P Lackner. Maestro-multi agent stability
prediction upon point mutations. BMC bioinformatics, 16(1):116, 2015.
26. D Pires, D Ascher, and T Blundell. Duet: a server for predicting effects of mutations on protein
stability using an integrated computational approach. Nucleic acids research, 42(W1):W314–
W319, 2014.
27. N Majeske and F Jagodzinski. Elucidating which pairwise mutations affect protein stability:
An exhaustive big data approach. In proc. of IEEE COMPSAC (International Conference on
Computers, Software & Applications), July 2018.
28. E. J. Cand‘es J.-F. Cai and Z. Shen. A singular value thresholding algorithm for matrix
completion. 2008.
29. A. Montanari R. H. Keshavan and S. Oh. Matrix completion from a few entries. 2009.
30. E. J. Candes and B. Recht. Exact matrix completion via convex optimization.
31. E. J. Candes and T. Tao. The power of convex relaxation: near-optimal matrix completion.
IEEE Trans. Inf. Theor., 56(5):2053–2080, 2010.
32. K.C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized
least squares problems. Pacific Journal of Optimization, 6(3):615–640, 2010.
33. V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Sparse and lowrank matrix
decompositions. In Proc. IFAC Symposium on System Identification, pages 962–967, Sep 2009.
34. X. Yuan and J. Yang. Sparse and low-rank matrix decomposition via alternating direction
methods. Technical report, Hong Kong Baptist University, 2009.
35. E. Candès, X Li, Y Ma, and J Wright. Robust principal component analysis? J. ACM,
58(3):11:1–11:37, 2011.
36. J. Bennet and S. Lanning. The netflix prize. In Proc. KDD Cup and Workshop, 2007.
37. Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems.
Computer, 42(8):30–37, Aug 2009.
38. C Eckart and G Young. The approximation of one matrix by another of lower rank.
Psychometrika, 1(3):211–218, Sep 1936.
39. M Siderius and F Jagodzinski. Mutation sensitivity maps: Identifying residue substitutions
that impact protein structure via a rigidity analysis in silico mutation approach. J of Comp Bio,
25(1):89–102, 2018.
40. E Andersson and F Jagodzinski. Promuteht: A high throughput compute pipeline for generating
protein mutants in silico. In Proceedings of the 8th ACM International Conference on
Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB ’17, pages 655–
660, 2017.
41. D Jacobs, AJ Rader, M Thorpe, and L Kuhn. Protein flexibility predictions using graph theory.
Proteins, 44:150–165, 2001.
42. Andreas G. Ladurner and Alan R. Fersht. Glutamine, alanine or glycine repeats inserted into
the loop of a protein have minimal effects on stability and folding rates1. Journal of Molecular
Biology, 273(1):330 – 337, 1997.
43. D Jacobs and B Hendrickson. An algorithm for two-dimensional rigidity percolation: the
pebble game. Journal of Computational Physics, 137:346–365, 1997.
44. S. Mitternacht. Freesasa: An open source c library for solven accessible surface area
calculations. F1000Research, 5, 2016.
45. Brian Hutchinson, Mari Ostendorf, and Maryam Fazel. A sparse plus low rank maximum
entropy language model. In Proc. Interspeech, 2012.
46. Brian Hutchinson, Mari Ostendorf, and Maryam Fazel. A sparse plus low-rank exponential
language model for limited resource scenarios. Audio, Speech, and Language Processing,
IEEE/ACM Transactions on, 23:494–504, 03 2015.
Detection and Analysis of Amino Acid
Insertions and Deletions
Muneeba Jilani, Nurit Haspel, and Filip Jagodzinski
1 Introduction
When analyzing the evolution of proteins, the most prevalent among sequence
variations in proteins are amino acid substitutions, and insertions or deletions [1, 2]
(Figs. 1 and 2). Amino acid substitutions have been the focus of numerous recent and
well established efforts in wet lab and computational studies. The research relating
to InDels, however, remains rather nascent. Many queries related to InDels remain
unrequited despite the fact that InDels account for more changes in the structure and
function of proteins than do amino acid substitutions [3, 4]. There is plentiful data
indicating that these variations, rather than substitutions, are primarily responsible
for causing various kinds of functional changes in corresponding proteins that
account for evolution.
Due to their impact on the protein structure and function, it can be reasonably
hypothesized that InDels can be responsible for the modification of interaction
interfaces of proteins resulting in the gain or loss of the protein-protein interactions.
It can be concluded that interaction networks of proteins are majorly impacted by
InDels.
The two perspectives of looking at InDels are a sequence perspective and a
structural perspective [5]. In the sequence perspective, we look at InDels as addition
or deletion of amino acids from the protein sequence as a result of changes in
the genome. In the structural perspective, we look at domains of proteins and how
various proteins combine to form multi-domain architectures. Small InDels tend not
M. Jilani · N. Haspel ()

University of Massachusetts Boston, Boston, MA, USA
e-mail: Muneeba.Jilani001@umb.edu; Nurit.Haspel@umb.edu
F. Jagodzinski
e-mail: filip.Jagodzinski@wwu.edu

90 M. Jilani et al.
Fig. 1 An insertion mutation
Fig. 2 A deletion mutation
to have drastic impacts on the structure of a protein. Rather, they cause an alteration
to the activity of the protein as well as the alteration of interfaces of proteins with
other proteins.
The length of the coding insertions and deletions in the human genome are
contributing factors to insertions and deletions in proteins [6, 7]. The two kinds
Detection and Analysis of Amino Acid Insertions and Deletions 91
of coding insertions and deletions in the human genome are those that are divisible
by three and those that are not. The latter kind causes frameshifts. More frequently,
these occur either in redundant genes or at the ends of genes [8]. Thus, the location
of the insertions and deletions in the genome is the deciding factor in their impact
on the resulting protein. For genome insertions and deletions with length divisible
by three, the term “3n” or nonframeshift (NFS) is used, and these are the ones that
can result in either insertion and deletion of amino acids in the protein sequence
encoded by the nucleotide sequence [9, 10].
The structural integrity of proteins can be severely perturbed by large InDels, thus
resulting in many types of disease phenotypes in an organism. Oftentimes InDels are
accompanied by substitutions that occur prior to or following an InDel. The impact
of an InDel combined with a substitution can be quite severe. For instance, when an
active site loop is deleted and this phenomenon is followed by a substitution, that
can lead to change in the enzyme related activity of a protein [11, 12]. The more
common types of proteins that accommodate InDels are the ones that are essential in
nature or that interact more with other proteins [13, 14]. The interaction interfaces of
proteins can contain InDels, and these are the ones that play a significant role in the
interactions of biomolecules. The InDels that occur in the cores of protein structures
can be deemed responsible for deviation in function of homologous proteins [15].
InDels more frequently occur in the proteins that are essential in nature, and also in
proteins that interact with other proteins more frequently.
The implications and importance of studying InDels can be understood by
looking at the case of the spike protein of severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2) [16]. Among other things, what sets this virus
apart from other betacoronaviruses are InDels that occur at the S1/S2 boundary,
which constitutes the cleavage site and play a role in virus transmission [17].
Understanding the nature and effects of these InDels can aid the study of their
origin and evolutionary patterns of virus. Another disease caused by this key
structural variation is cystic fibrosis. In that case, several mutants are known. The
F508del mutation in nucleotide-binding domain-1 (NBD1) of the cystic fibrosis
transmembrane conductance regulator (CFTR) is the predominant cause of cystic
fibrosis [18]. InDel implications are vast ranging and encompass various types of
cancers as well [19, 20].
In the remaining sections, we explore various research efforts and touch on
research efforts that aim at exploring various aspects of InDels. Section 2 targets
efforts focused on detection of InDels, Sect. 3 proceeds to explore efforts in the
InDel analysis realm. We conclude with a discussion of various InDel related efforts.
2 Computational Methods of InDel Detection
Identification of InDels is important in order to understand the evolution of proteins

and the purpose they serve. But due to the large search space, this remains a key
challenge. Identification of InDels from PDB protein structures is performed by
92 M. Jilani et al.
Indel PDB [21]. This resource is designed to identify multi-dimensional aspects of

InDel sites by utilizing structural information available related to proteins under
analysis. This resource supposedly maintained a large amount of data pertaining
to non-redundant sites of InDels alongside their respective proteins [21], but this
resource of InDel data is no longer available. Another promising effort was the
Sequence Feature and InDel Region Extractor (SeqFIRE) [22], to enable the
automated identification and extraction of InDels from protein sequence alignments.
The program extracts conserved blocks and identifies fast evolving sites using a
combination of conservation analysis and entropy. Like InDel PDB, this resource is
no longer available.
The flanking regions of InDels are disparate from other regions and pertain some
special qualities, hence their identification can be of interest for some research
endeavors. A database called IndelFR [23] was designed to highlight information
related to flanking regions of InDels in the corresponding domains of proteins.
This database like some other ones mentioned is not available anymore. The
method of obtaining InDels in this approach is via homologous structures’ pairwise
alignment. Such information relating flanking regions of InDels can be used for
protein sequence alignment which is a crucial task for numerous bioinformatics
applications [24].
In addition to detecting InDels, creating them is achieved by a recent technique
[25] called transposon-based mutagenesis (TRIAD), that generates variant libraries
containing short in-frame InDels. Prior protocols for introducing InDels mainly
focused on insertion mutants. This protocol consists of transposition and cloning
steps, and can be adapted to target certain regions of the protein as well as creating
large libraries. It is an effort similar to a previous mutagenesis based approach [26]
that facilitates multiple (up to 5 codon) deletions in green fluorescent protein.
Computational InDel detection is a topic that has received some attention over
time, but the majority of the tools designed for this purpose are either removed or
discontinued according to our knowledge. There is a dire need for novel efforts in
this area in order to assist in other computational endeavors such as computational
generation of mutants before carrying out expensive MD simulations. Methods for
creating InDels in proteins are mainly mutagenesis based, but there is a need for
computational such methods in order to assist and further accelerate the protein
engineering process.
3 Computational Methods of InDel Analysis
InDel analysis is highly important in order to gain insight into the effects of amino
acid insertions and deletions. Here we explore Machine Learning based approaches,
as well as efforts to study the effects of InDels, including plasticity.
3.1 Machine Learning Based Methods
Machine learning based methods for studying and analyzing the effects of InDels
range from causal analysis of InDels to studying the impact of InDels on protein
stability. In some instances, hidden Markov models are used for predicting flanking
regions of InDels. In order to study the relationship between Mendelian disorders
and NFS InDels in the genome, InDels have been characterized based on various
parameters [27]. These parameters, which include functional and evolutionary
features or metrics, have been subsequently used to find out which InDels are disease
causing and which are neutral. The end result of this process is an approach called
KD4i [27], which can be used to predict the phenotypic effects of NFS InDels. NFS
indels in the genome are the cause of InDels in the resulting protein sequence, thus
this is a means to analyze the impact of InDels in the protein sequence. An advantage
of this approach is that rationale behind each prediction is provided as a set of rules.
The said rules can be understood by individuals who are not necessarily experts in
the area.
Due to their perturbance of the core structure of a protein, InDels often result in
significantly altering the biophysical properties of a protein. The positive unlabeled
learning-based prediction framework (PROFOUND) [28] is a method for encoding
the changes in the stability of a protein due to single point deletions. This is
achieved via an Elliptic Envelope and Random Forest classifier outlier detector. The
PROFOUND work also presents a novel database containing single point deletions
(SPD) and is a step towards screening of unfavorable backbone modifications. In a
subsequent publication [29], Banerjee et al. expand this framework to distinguish
multipoint deletion (MPD) instances that are harmful. There was previously no
such database for multipoint deletions in proteins, hence a novel database for MPDs
has been introduced. The authors concluded that there is no random foldability in
proteins, but rather there are unique specifications imposed on the deleted regions.
Due to the complexity of computational space, various machine learning based
efforts for InDel modeling are still in their incipient stages. A proposed improve-
ment to this effort is a probabilistic model for InDel evolution [30] that is able
to distinguish between insertions and deletions by using a preexisting inferential
scheme. This approach proposes two novel machine learning based models that
assume the same or varying rate and length distributions of InDels. The model
that assumes variable rates of distribution seems to perform better than the one that
assumes the same rates of distribution.
When it comes to predicting flanking regions, a number of machine learning
based efforts are being developed or proposed. For a provided protein fold, in order
to predict the InDel flanking regions in a given protein sequence, Mufleh et al.
[31] proposed a framework, using a variable-order Markov model. The predictions
of flanking regions are performed by probabilistic suffix tree and partial match
approaches. The developed predictors are built in a fully automated manner.
The preliminary analysis of machine learning based methodologies for analyzing
the impact of InDels reveal that there is a need for further work in this area as it is
94 M. Jilani et al.
relatively new. The benefits of novel efforts will further our understanding of protein
evolution and help identify the InDels that result in disease phenotypes. The area can
be expanded upon in various engineering endeavors.
3.2 Detecting Functional and Fitness Effects of InDels on

Protein Structure
Compared to substitutions, the functional effect of InDels on proteins is far less

known and explored. The methods in this category attempt to make predictions
for single and multiple InDels. The Protein Variation Effect Analyzer (PROVEAN)
[32] is a tool in the form of a web server, developed for predicting the functional
effects of single or multiple insertions, deletions, as well as substitutions of amino
acids. The server provides a high-speed analysis of variants of a protein from any
organism. The score approach of PROVEAN is alignment-based. It is capable of
generating functional predictions for multiple InDels and substitutions. Another
effort to describe the fitness landscapes of InDels relies on computational assays as
well as a software package called InDelScanner that interprets the data that contain
InDels[33]. The main published finding of this research is that the fitness landscape
in this kinase pair is shaped by the activating effect of hydrophobic residues in the
docking groove, as well as widespread positive epistasis. This study highlights how
critical computational networks are in order to investigate the function-sequence
correlation in proteins.
A very important question in the evolution of proteins is how and at what point
are sequence variations translated into structural changes [24]. This question is
explored in detail [34] by analyzing the correlation of 75 homologous structural
families with the conclusion that the correlation between the degree of structural
variations and the number of substitutions and InDels is bilinear.
The impact of InDels depends on the context in which they occur [5]. An example
of such a phenomenon would be that if InDels occur in a protein-protein interface,
they can impact the set of interaction partners. Moreover, InDels impact function
and phenotype alike. Homo sapiens’ closest relative, chimpanzees, have 13%
coding sequence InDels, thus causing phenotypical alterations. Deemed “hopeful
monsters” [35], InDels that are extremely deleterious in nature can lead to sudden
and dramatic transitions in the protein structures leading to profound changes or
tentatively a novel fold to the structure [36].
In order to predict the impact of InDels on a protein’s structure and rigidity,
Jilani et al. [37] have devised a two step approach. The method first models
the InDels computationally by employing an inverse kinematics-based approach.
InDel modeling is followed by rigidity analysis of the wildtype protein structure,
experimental mutant and computationally generated mutant, in order to assess the
functional differences between them. The approach shows promise to predict the
effects of short insertions and deletion mutations on the structure and stability
properties of proteins.
A novel interactive and free tool for visualizing 3D structures of proteins is
called SWISS-PO [38]. The web-based tool aims to help the users in decision
making regarding what is the best structure to analyze related to the inserted or
deleted residues. The decision is based on the interaction of the molecule with
neighboring molecules within the same protein or with other macro-molecules in the
3D structure. This computational effort is novel in nature and it aids InDel analysis
greatly.
The efforts in the InDel analysis realm are few and far between but developed
nonetheless. There is scope for novel efforts that combine computational InDel
generation with analysis in this area to provide a more broad-based analysis.
3.3 Plasticity of Proteins to InDels
Some studies explore the plasticity of proteins to InDels. A convenient model for
evolution analysis of proteins is TEM-1 β-lactamase. Using Amp resistance as a
proxy for fitness, Courtney et al. [6] have examined the relationship between the
tolerance of InDels and additional related measures in order to understand the origin
of insertions or deletions. They conclude by quantifying the fitness effects of 4737
amino acid InDels in the studied proteins. This study highlights the sparseness of
insertion studies, which are few indeed and only study a few reasonably chosen
sites in a few proteins. But nonetheless, these studies reveal that insertions are
tolerated more than deletions. Such studies of fitness effects in other proteins is
crucial to predicting computational structure of proteins in a more robust manner.
In order to evaluate plasticity, the tolerance of loop deletion across the entirety of
a protein was observed by James et al. [39] using directed evolution in enhanced
green fluorescent protein (EGFP). This was a very comprehensive study which
revealed an enhancing effect of rearrangement compared to only mutations. It is also
observed that loops that occur between other secondary structures are normally more
tolerant of deletions. Since Green Fluorescent Protein (GFP) investigation has been
revolutionary in biomedical research endeavours, another thorough investigation
worth mentioning here is a more comprehensive study targeting plasticity GFP to
amino acid deletions [40]. This study not only explores the impact of deletions, but
also explores how fold enhancing mutations can reverse the impact of deleterious
mutations on fluorescence of GFP. In certain cases, point deletions can introduce
beneficial structural effect. This phenomena has been explored in GFP [41]. The
study highlights the creation of deletion mutants by X-Ray crystallography and
shows that a protein can remain somewhat plastic to a deletion in the loop region.
In a similar study on human lysozyme, the role of amino acids in the loop regions
in terms of conformational stability has been explored using X-Ray crystallography
analysis [42]. The study reveals that impairment in stability is mainly due to deletion
of hydrogen bonds between molecules, and not due to amino acid deletions.
96 M. Jilani et al.
The stability and folding rate of a protein is largely impacted by protein loops.
The conformational entropy of protein is demonstrated to be increased by shortening
of loops [43]. Using simulations of all-atom molecular dynamics, Yulian et al. [43]
have generalized the effect of the shortening of loops on native state dynamics
of protein for four different proteins, and the impact of this phenomenon on the
stability of the protein structures. They later confirmed their results via NMR
simulations [44].
An approach called weighted contact number (WCN) has been used by Jackson
et al. [45]. The method scans the structure of the protein in order to measure the
density of the packing of a residue. Such information is used as an indicator of
whether the protein will tolerate an InDel or not. The key finding of this novel work
is the key element contributing to deletion constraining in proteins is structure.
These studies lay the groundwork for future examination of tolerance of more
complex large InDels.
Since it has been revealed by various analysis methods that InDels and disorders
go hand in hand, it is fair to assume a causal relationship between InDels and
disorders. A detailed study [46] pertaining InDel analysis with respect to the region
of occurrence reveals that InDels are not a cause for disorder. Rather, regions that
are inherently disordered incur InDels. This approach uses hidden Markov model
pairwise alignment in two sets of Eukaryote proteins.
In summary, there are many studies that explore the plasticity of proteins to
InDels. Some are more comprehensive and perform an in depth analysis of this
phenomenon alongside exploring the relationship between InDels and disorders.
Understanding the impact InDels have on the plasticity of proteins is relevant for
protein engineering so that it is easier to predict the cites which are more tolerant
towards deletions.
4 Conclusion
The review of efforts pertaining InDel structural variation of proteins highlighted

some important findings. The detection of InDels is an area that is important for
many engineering applications. This area was explored by some previous efforts
but there is a dire need for continuous work. Such resources will aid in protein
bioinformatics analysis and help identify InDel-directed drug binding sites [21, 47].
Some novel efforts analyze InDels, their plasticity and the tolerance of proteins
to InDel occurrence. However, there is need for further research and improvement
in this territory of structural biology. This is a crucial step towards matching the
research progress on these “hopeful monsters” to their substitution counterparts.
References
1. Stefano Pascarella and Patrick Argos. Analysis of insertions/deletions in protein structures.

Journal of molecular biology, 224(2):461–471, 1992.
2. Fereydoun Hormozdiari, Raheleh Salari, Michael Hsing, Alexander Schönhuth, Simon K
Chan, S Cenk Sahinalp, and Artem Cherkasov. The effect of insertions and deletions on wirings
in protein-protein interaction networks: a large-scale study. Journal of Computational Biology,
16(2):159–167, 2009.
3. Pravech Ajawatanawong and Sandra L Baldauf. Evolution of protein indels in plants, animals
and fungi. BMC evolutionary biology, 13(1):1–15, 2013.
4. RyangGuk Kim and Jun-tao Guo. Systematic analysis of short internal indels and their impact
on protein folding. BMC structural biology, 10(1):1–11, 2010.
5. Romain A Studer, Benoit H Dessailly, and Christine A Orengo. Residue mutations and
their impact on protein structure and function: detecting beneficial and pathogenic changes.
Biochemical journal, 449(3):581–594, 2013.
6. Courtney E Gonzalez, Paul Roberts, and Marc Ostermeier. Fitness effects of single amino acid
insertions and deletions in tem-1 β-lactamase. Journal of molecular biology, 431(12):2320–
2330, 2019.
7. Monica Berrondo and Jeffrey J Gray. Computed structures of point deletion mutants and their
enzymatic activities. Proteins: Structure, Function, and Bioinformatics, 79(10):2844–2860,
2011.
8. Jing Hu and Pauline C Ng. Sift indel: predictions for the functional effects of amino acid
insertions/deletions in proteins. PloS one, 8(10):e77940, 2013.
9. Maoxuan Lin, Sarah Whitmire, Jing Chen, Alvin Farrel, Xinghua Shi, and Jun-tao Guo. Effects
of short indels on protein structure and function in human genomes. Scientific reports, 7(1):1–
9, 2017.
10. Stefanie Barbirz, Jürgen J Müller, Charlotte Uetrecht, Alvin J Clark, Udo Heinemann, and
Robert Seckler. Crystal structure of Escherichia coli phage hk620 tailspike: podoviral tailspike
endoglycosidase modules are evolutionarily related. Molecular microbiology, 69(2):303–316,
2008.
11. Agnes Tóth-Petróczy and Dan S Tawfik. Protein insertions and deletions enabled by neutral
roaming in sequence space. Molecular biology and evolution, 30(4):761–771, 2013.
12. Liat Rockah-Shmuel, Ágnes Tóth-Petróczy, Asaf Sela, Omri Wurtzel, Rotem Sorek, and Dan S
Tawfik. Correlated occurrence and bypass of frame-shifting insertion-deletions (indels) to give
functional proteins. PLoS genetics, 9(10):e1003882, 2013.
13. Bijendra Khadka, Mobolaji Adeolu, Robert E Blankenship, and Radhey S Gupta. Novel
insights into the origin and diversification of photosynthesis based on analyses of conserved
indels in the core reaction center proteins. Photosynthesis research, 131(2):159–171, 2017.
14. Yuri Wolf, Thomas Madej, Vladimir Babenko, Benjamin Shoemaker, and Anna R Panchenko.
Long-term trends in evolution of indels in protein sequences. BMC evolutionary biology,
7(1):1–10, 2007.
15. Zheng Zhang, Jinlan Wang, Ya Gong, and Yuezhong Li. Contributions of substitutions and
indels to the structural variations in ancient protein superfamilies. BMC genomics, 19(1):1–9,
2018.
16. Zhe Liu, Huanying Zheng, Huifang Lin, Mingyue Li, Runyu Yuan, Jinju Peng, Qianling Xiong,
Jiufeng Sun, Baisheng Li, Jie Wu, et al. Identification of common deletions in the spike protein
of severe acute respiratory syndrome coronavirus 2. Journal of virology, 94(17):e00790–20,
2020.
17. Yunkai Zhu, Fei Feng, Gaowei Hu, Yuyan Wang, Yin Yu, Yuanfei Zhu, Wei Xu, Xia Cai,
Zhiping Sun, Wendong Han, Rong Ye, Hongjun Chen, Qiang Ding, Qiliang Cai, Di Qu, Youhua
Xie, Zhenghong Yuan, and Rong Zhang. The s1/s2 boundary of sars-cov-2 spike protein
modulates cell entry pathways and transmission. bioRxiv, 2020.
98 M. Jilani et al.
18. HA Lewis, C Wang, X Zhao, Y Hamuro, K Conners, MC Kearins, F Lu, JM Sauder,

KS Molnar, SJ Coales, et al. Structure and dynamics of nbd1 from cftr characterized using
crystallography and hydrogen/deuterium exchange mass spectrometry. Journal of molecular
biology, 396(2):406–430, 2010.
19. Elisa Donnard, Paula F Asprino, Bruna R Correa, Fabiana Bettoni, Fernanda C Koyama,
Fabio CP Navarro, Rodrigo O Perez, John Mariadason, Oliver M Sieber, Robert L Strausberg,
et al. Mutational analysis of genes coding for cell surface proteins in colorectal cancer cell lines
reveal novel altered pathways, druggable mutations and mutated epitopes for targeted therapy.
Oncotarget, 5(19):9199, 2014.
20. Prathima Iengar. An analysis of substitution, deletion and insertion mutations in cancer genes.
Nucleic acids research, 40(14):6401–6413, 2012.
21. Michael Hsing and Artem Cherkasov. Indel pdb: a database of structural insertions and
deletions derived from sequence alignments of closely related proteins. BMC bioinformatics,
9(1):1–12, 2008.
22. Pravech Ajawatanawong, Gemma C Atkinson, Nathan S Watson-Haigh, Bryony MacKenzie,
and Sandra L Baldauf. Seqfire: a web application for automated extraction of indel regions
and conserved blocks from protein multiple sequence alignments. Nucleic acids research,
40(W1):W340–W347, 2012.
23. Zheng Zhang, Cheng Xing, Lushan Wang, Bin Gong, and Hui Liu. Indelfr: a database of indels
in protein structures and their flanking regions. Nucleic acids research, 40(D1):D512–D518,
2012.
24. Mufleh Al-Shatnawi, M Omair Ahmad, and MNS Swamy. Msaindelfr: a scheme for multiple
protein sequence alignment using information on indel flanking regions. BMC bioinformatics,
16(1):1–11, 2015.
25. Stephane Emond, Maya Petek, Emily J Kay, Brennen Heames, Sean RA Devenish, Nobuhiko
Tokuriki, and Florian Hollfelder. Accessing unexplored regions of sequence space in directed
enzyme evolution via insertion/deletion mutagenesis. Nature communications, 11(1):1–14,
2020.
26. Shu-su Liu, Xuan Wei, Qun Ji, Xiu Xin, Biao Jiang, and Jia Liu. A facile and efficient
transposon mutagenesis method for generation of multi-codon deletions in protein sequences.
Journal of biotechnology, 227:27–34, 2016.
27. Carlos Bermejo-Das-Neves, Hoan-Ngoc Nguyen, Olivier Poch, and Julie D Thompson. A
comprehensive study of small non-frameshift insertions/deletions in proteins and prediction of
their phenotypic effects by a machine learning method (kd4i). BMC bioinformatics, 15(1):1–
20, 2014.
28. Anupam Banerjee, Yaakov Levy, and Pralay Mitra. Analyzing change in protein stability
associated with single point deletions in a newly defined protein structure database. Journal of
proteome research, 18(3):1402–1410, 2019.
29. Anupam Banerjee, Amit Kumar, Kushal Kanti Ghosh, and Pralay Mitra. Estimating change in
foldability due to multipoint deletions in protein structures. Journal of Chemical Information
and Modeling, 60(12):6679–6690, 2020.
30. Gil Loewenthal, Dana Rapoport, Oren Avram, Asher Moshe, Alon Itzkovitch, Omer Israeli,
Dana Azouri, Reed Austin Cartwright, Itay Mayrose, and Tal Pupko. A probabilistic model for
indel evolution: differentiating insertions from deletions. bioRxiv, 2020.
31. Mufleh Al-Shatnawi, M Omair Ahmad, and MN Shanmukha Swamy. Prediction of indel
flanking regions in protein sequences using a variable-order Markov model. Bioinformatics,
31(1):40–47, 2015.
32. Yongwook Choi and Agnes P Chan. Provean web server: a tool to predict the functional effect
of amino acid substitutions and indels. Bioinformatics, 31(16):2745–2747, 2015.
33. Maya Petek. Characterising fitness landscapes of protein evolution by next-generation
sequencing. PhD thesis, University of Cambridge, 2020.
34. Zheng Zhang, Yuxiao Wang, Lushan Wang, and Peiji Gao. The combined effects of amino
acid substitutions and indels on the evolution of structure within protein families. PloS one,
5(12):e14316, 2010.
35. Ágnes Tóth-Petróczy and Dan S Tawfik. Hopeful (protein indel) monsters? Structure,
22(6):803–804, 2014.
36. Raheleh Salari, Alexander Schönhuth, Fereydoun Hormozdiari, Artem Cherkasov, and S Cenk
Sahinalp. The relation between indel length and functional divergence: a formal study. In
International Workshop on Algorithms in Bioinformatics, pages 330–341. Springer, 2008.
37. Muneeba Jilani, Alistair Turcan, Nurit Haspel, and Filip Jagodzinski. Assessing the effects
of amino acid insertion and deletion mutations. In 2021 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), pages 2511–2518. IEEE, 2021.
38. Fanny S Krebs, Vincent Zoete, Maxence Trottet, Timothée Pouchon, Christophe Bovigny, and
Olivier Michielin. Swiss-po: a new tool to analyze the impact of mutations on protein three-
dimensional structures for precision oncology. NPJ precision oncology, 5(1):1–9, 2021.
39. James AJ Arpino, Samuel C Reddington, Lisa M Halliwell, Pierre J Rizkallah, and D Dafydd
Jones. Random single amino acid deletion sampling unveils structural tolerance and the
benefits of helical registry shift on gfp folding and structure. Structure, 22(6):889–898, 2014.
40. Shu-su Liu, Xuan Wei, Xue Dong, Liang Xu, Jia Liu, and Biao Jiang. Structural plasticity of
green fluorescent protein to amino acid deletions and fluorescence rescue by folding-enhancing
mutations. BMC biochemistry, 16(1):1–11, 2015.
41. James AJ Arpino, Pierre J Rizkallah, and D Dafydd Jones. Structural and dynamic changes
associated with beneficial engineered single-amino-acid deletion mutations in enhanced
green fluorescent protein. Acta Crystallographica Section D: Biological Crystallography,
70(8):2152–2162, 2014.
42. Kazufumi Takano, Yuriko Yamagata, and Katsuhide Yutani. Role of amino acid residues
at turns in the conformational stability and folding of human lysozyme. Biochemistry,
39(29):8655–8665, 2000.
43. Yulian Gavrilov, Shlomi Dagan, and Yaakov Levy. Shortening a loop can increase protein
native state entropy. Proteins: Structure, Function, and Bioinformatics, 83(12):2137–2146,
2015.
44. Yulian Gavrilov, Shlomi Dagan, Ziv Reich, Tali Scherf, and Yaakov Levy. An nmr confirmation
for increased folded state entropy following loop truncation. The Journal of Physical Chemistry
B, 122(48):10855–10860, 2018.
45. Eleisha L Jackson, Stephanie J Spielman, and Claus O Wilke. Computational prediction of
the tolerance to amino-acid deletion in green-fluorescent protein. PloS one, 12(4):e0164905,
2017.
46. Sara Light, Rauan Sagit, Diana Ekman, and Arne Elofsson. Long indels are disordered: a
study of disorder and indels in homologous eukaryotic proteins. Biochimica Et Biophysica
Acta (BBA)-Proteins and Proteomics, 1834(5):890–897, 2013.
47. Qi Wang, Esley Heizer, Bruce A Rosa, Scott A Wildman, James W Janetka, and Makedonka
Mitreva. Characterization of parasite-specific indels and their proposed relevance for selective
anthelminthic drug targeting. Infection, Genetics and Evolution, 39:201–211, 2016.
DeepTracer Web Service for Fast and
Accurate De Novo Protein Complex
Structure Prediction from Cryo-EM
Dong Si, Hanze Meng, Jonas Pfab, Yinrui Deng, Yutong Xie, Jackson Tan,
Sheung Him Martin Chow, Jason Chen, and Aditi Jain
1 Introduction
Since the advent of cryo-electron microscopy, its imaging technique has been
constantly evolving in order to achieve higher resolution in the cryo-EM maps
of macro molecules. Through this process, the related computational methods to
derive structural information from these maps have also experienced three distinct
stages of evolution from template-based homology modeling, to de novo cryo-EM
modeling using traditional machine learning algorithms such as KNN and SVM,
then to deep-learning based de novo modeling [1]. As an important type of macro
molecules, proteins have great value in medical aspects such as drug and vaccine
D. Si () · J. Pfab · J. Tan · S. H. M. Chow · J. Chen

Division of Computing and Software Systems, University of Washington, Bothell, WA, USA
e-mail: dongsi@uw.edu
H. Meng
Department of Mathematics, University of Washington, Seattle, WA, USA
Y. Deng
Department of Economics, University of Washington, Seattle, WA, USA
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, USA
Y. Xie
Department of Mathematics, University of Washington, Seattle, WA, USA
Department of Applied Mathematics, University of Washington, Seattle, WA, USA
Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle,
WA, USA
Department of Statistics, University of Washington, Seattle, WA, USA
A. Jain
Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle,
WA, USA
102 D. Si et al.
development. Several methods exist to determine a protein’s tertiary structure, but

recently electron cryomicroscopy (cryo-EM) has become an increasingly popular
one due to its speed of determination and, most importantly, its increase in ability
to produce near-atomic (<4Å) resolution all-atom cryo-EM maps of a protein [2].
Over the past decade, the number of novel cryo-EM structures deposited annually
has increased at an exponential rate [3]. In order to derive the structural information
from these high-resolution maps, the DeepTracer team developed a fully-automatic
deep-learning based de novo modeling method to determine with high accuracy the
3D structure of a protein using cryo-EM maps [4].
Specifically, DeepTracer preprocesses each map by data grid resampling, density
value normalization and grid division. Then it constructs a neural network con-
necting 4 separate U-Nets [5] for structural aspects of atoms, backbone, secondary
structure elements, and amino acid types. Both the input and output layers of each
U-Net have the same 643 shape. Next, DeepTracer feeds the preprocessed cryo-EM
map to the network, and finally transforms the output into a protein structure. An
overview of the steps involved in this prediction pipeline is provided in Fig. 1, Part a.
Furthermore, since the neural network itself can predict the amino acid type through
the specific U-Net, DeepTracer is able to predict proteins without any known amino
acid sequences. But the accuracy can only be limited between 10 to 50% due
to the similar appearances of some amino acids in cryo-EM maps. Therefore, a
user-provided, known and true amino acid sequence in a FASTA format file still
helps improve the prediction accuracy which is achieved through a custom dynamic
sequence alignment algorithm. Generally, it aligns intervals of the initially predicted
sequence to the known true amino acid sequence (protein primary structure) and
then updates the types of the predicted amino acids accordingly.
Before the above sequence mapping step, the prediction pipeline will trace the
backbone first by performing a central postprocessing step which creates an initial
model structure connecting only the Cα atoms into chains. The backbone tracing
step contains three separate parts: identifying disconnected chains prior to atom
prediction, calculating coordinates of Cα atoms and connecting Cα atoms into
chains. Particularly, DeepTracer applies the traveling salesman algorithm on the
last part to address the factorial growth of combinations in which the atoms can be
connected. It is worthy of note that DeepTracer manually adds a start/end pseudo-
atom connected to every other atom and defines a custom confidence function to
accommodate the criterion of this algorithm as well as the fact that the ideal distance
between Cα atoms is 3.8Å instead of the shortest possible path. The function will
return a confidence score between 0 and 1 based on both the Euclidean distance
between the atoms and the average density values of voxels that lay in between
the atoms on the backbone confidence map predicted by the Backbone U-Net. This
shows the confidence level on whether the two Cα atoms are connected or not.
Then DeepTracer subtracts this score from 1 and provides it to the algorithm so that
it can find the minimized sum of these values. Following this backbone tracing step,
DeepTracer will work on the amino acid sequence mapping and then reconstruct
other carbon, nitrogen and oxygen atoms in the backbone to reach the end of the
pipeline [4].
DeepTracer Web Service for Fast and Accurate De Novo Protein Complex. . . 103
Fig. 1 Part a is the DeepTracer model determination pipeline. Part b describes the evaluation
results for set of 476 experimental cryo-EM maps. Evaluation of determined models from
DeepTracer (blue) and Phenix (red) for 476 cryo-EM maps. The dotted lines represent the trend
for each method. More comparison results and other details can be found in our previous PNAS
paper [4]
104 D. Si et al.
To evaluate the effectiveness of DeepTracer, we applied it on multiple test

datasets and compared the results with other similar tools. In Fig. 1, Part b, we
applied DeepTracer on 476 cryo-EM maps from EMDB and compared the predicted
results through four different metrics with the results produced by Phenix, which
are published on its website [6]. Generally, DeepTracer achieved better results
than the Phenix method for every metric shown in Fig. 1, Part b [4]. In doing
so, we recognized the need to provide access to its powerful prediction pipeline
to users with little or no programming knowledge. We present here DeepTracer
Web Server, an easy-to-use web interface to DeepTracer’s prediction pipeline that
enables users of all disciplines and experience levels to use DeepTracer for structural
determinations for proteins and protein complexes [7]. We hope providing quick
and easy access to structural determination of proteins will aid in finding new drug
treatments and understanding more about the novel coronavirus SARS-CoV-2 that
has caused a worldwide pandemic.
2 Procedures
There are many tools available to researchers for structural determination of

proteins. However, many of them are difficult to install and require some familiarity
with programming to set up. Of the tools that offer web servers as interfaces to back
end prediction pipelines, many can be slow with queues that have a backlog of jobs
for hours or even days [8]. Aside from high accuracy in determining structures, the
two fundamentals of DeepTracer web server are the ease of use and timely execution
for researchers in this field so that they can do the vital work of experimental
observation and analysis. With this focus, we set out to create an online service
that achieves these two goals with the workflow shown in Fig. 2.
Fig. 2 A general demonstration of the workflow of DeepTracer’s web service. Users’ jobs will
be put into the job listener and be dispatched to any of the two job queues whenever they are
not occupied by other jobs. The predicted results will be saved properly on the server and the
middleware will respond to the clients in time
Fig. 3 The job submission interface on client side. If using Option 1, users only need to input the
EMDB ID so that the web server will automatically download from PDB. If using Option 2, then
only the cryo-EM Map is required to be uploaded to the server. Solved Structure is mainly used
for the result evaluation after prediction and Amino Acid Sequence is mainly used for better amino
acid type prediction
Users have two ways to utilize the web server as shown in Fig. 3: upload their
own input in the form of an mrc or map file or use a provided field to enter a
deposited structure ID. The former allows for experienced researchers who are
working with novel, unsolved structures to accelerate their work by minimizing the
need for manual manipulation of the structure. The latter feature allows users of all
backgrounds to produce highly accurate structural determinations for any structures
deposited in the EMDataResource [9]. These input files consist of the results of
cryo-EM experiments, a 3D grid of voxels that estimates the electron density for
the given structure, and a reference to a call to DeepTracer’s prediction pipeline to
render a structural determination for the input. Users’ jobs will be saved in a task
queue, so that they do not have to worry about losing their place in line. We keep a
record for each job in our logs and present useful statistics such as RMSD, coverage,
and sequence matching percentage.
Every job is executed in chronological order. In each step of the prediction
pipeline, we update the database and report these updates to users so that it gives a
clear visual representation of the job status (see Fig. 4) and estimated waiting time
to users. Once the job is finished, users can click on the Evaluation tab to compare
different metrics between the predicted result and the solved structure, which is
either automatically downloaded from the PDB if using option one or self-provided
by users if using option two, as shown in Fig. 5.
We have built an update system that automatically initiates predictions for novel
coronavirus cryo-EM maps each week from EMDataResource [9]. A background
thread is running to check the newly added coronavirus maps, download them, and
create a job for prediction periodically. When the job finishes, our server will save
the results and update the records accordingly. By January 4th, 2022, there are 628
106 D. Si et al.
Fig. 4 The details for each step in a job predicting EMD-0401. This page refreshes every 20
seconds to request the current job status. Users can directly get the server log from the black box
for each step by clicking on different white check marks. Here the figure shows the results of the
U-Net Prediction step and the prediction progress bar indicates an estimated waiting time until
finishing
Fig. 5 The evaluation metrics displayed to users in a concise dialog. These metrics assist users in
having a quick view of the accuracy and a quick inspection of the structure without any external
tools
records displaying on our website for users downloading the prediction results and
other corresponding files. With the Coronavirus archive, users no longer need to
submit a related job and initiate a new prediction process to get the results. The web
service also offers 3D visualizations by Miew from Life Science Open Resources
(https://lifescience.opensource.epam.com/miew/) shown in Fig. 6 for prediction
results. For the details of visualization, see Sect. 3.3.2 below.
Fig. 6 This figure shows the Coronavirus archive which will be updated once a week. Download
and evaluation options are also available on this page, same as the job page. Concise labels are set
next to each covid prediction job’s title. Red label is to mark that the protein is a type of SARS-
CoV-2 virus and green label is to mark the absence of solved structure
3 Results
3.1 Architecture
3.1.1 Web Application Architecture
The DeepTracer Web Server’s architecture follows a layered client-server model.

The frontend is written in Angular (https://angular.io/), while the backend is utiliz-
ing Flask (https://palletsprojects.com/p/flask/) with Gevent (http://www.gevent.org/
index.html) as a high performance micro web framework, and a MongoDB database
(https://www.mongodb.com/) as the data layer to store the results of each step in
the prediction pipeline. The predicted structure, along with other downloaded or
uploaded data, will be saved directly on the server. We choose this model because it
allows for evolvability and reusability through components accessible as application
programming interfaces. Also, it enables current and future developers to maintain
a reliable codebase and add features rapidly. Some of the features implemented are
parallelism of DeepTracer prediction tasks, email notification upon the successful
completion of a job, and automatic downloading of input data to DeepTracer. We
will provide more details on these features in the Parallel Computing subsection and
UI/UX section below respectively. By using Angular, we are able to provide the fluid
experience of a single-page application, an interface most modern web-applications
use, and by using gevent, we are able to respond to considerable incoming requests
efficiently through the mechanism of coroutines. This helps work toward our
goal of broadening the user base beyond experienced researchers. In addition,
DeepTracer Web Server is mobile-friendly on all modern browsers. Finally, as the
backend prediction pipeline is constantly being updated, the DeepTracer Web Server
108 D. Si et al.
development team uses continuous-integration and continuous-deployment (CI/CD)

processes and tools to ensure all new features added to the backend prediction
pipeline are rapidly deployed and available on the DeepTracer Web Server.
3.1.2 Parallel Computing
In order to make full use of the computing resources on DeepTracer Web Server,
parallelism is implemented through multiprocessing. Currently, two prediction jobs
can be run in parallel on the server, both with GPU acceleration enabled and this
number can be scaled up easily with any increase of GPU resources in the future.
We also set up a simple load balancer in order to evenly distribute the tasks to
these computing resources. In addition, a maximum of 64 CPU cores can be utilized
for one prediction task. Overall, although this implementation of parallelism does
not completely prevent jobs from getting queued up, users’ waiting time is greatly
reduced. As a result, we have increased our job processing bandwidth to let users
submit 5 jobs every 24 hours instead of the previous duration of 72 hours.
3.2 Prediction Evaluation
DeepTracer Web Server provides users with prediction evaluations if the user
uploads a corresponding solved structure. DeepTracer uses the solved structure as
the ground truth to calculate the evaluations. There are two types of evaluations:
metrics and visualization. The metrics provide users with a quick view of the
prediction accuracy and the visualizations provide a quick inspection of the structure
without the need of other visualization tools, such as, UCSF Chimera [10]. There
are two metrics categories: individual residue type predictions and overall residue
connectivity. Due to the reason that all residues are predicted from the cryo-EM
map, we need to benchmark the accuracy of the identified residues and their
attributes on a per residue basis. It consists of the percentage of correctly predicted
residues (% Matching); the percentage of falsely predicted residues (% False
Positive); the root-mean-square deviation of the backbone Cα atoms (Backbone
RMSD); the percentage of matching residues with the same amino acid type (%
Sequence Matching); and the percentage of secondary structure types (% SSE
Matching). The connectivity metrics indicates how well DeepTracer predicted the
sequence of the protein and show the percentage of incorrect connections (% False
Connections). We are considering implementing other connectivity metrics in the
future, such as, the mean-chain-length.
3.3 UI/UX Features
To make the DeepTracer Web Server as easy to use as possible and to provide
necessary functionality, we have implemented several features.
3.3.1 Logger
We provide users with step-by-step transparency into each job execution using a
logger component originally used by DeepTracer’s development team internally.
This feature provides insight into every step of DeepTracer’s pipeline: input map
resampling and normalization, U-Net prediction, atomic tracing of the backbone
structure, merging amino acid chains, secondary structure addition, amino acid
sequence mapping, structural refinement, sidechain prediction, and evaluation.
Figure 7 shows an example of the logs available for every step in DeepTracer’s
prediction pipeline.
3.3.2 Visualization
Our visualization is a user-friendly feature that provides a quick overview of the

predicted structure without the need to import it in other software. We are currently
using 3D viewer plug-in provided by Miew (https://lifescience.opensource.epam.
com/miew/) to display the atomic structure [11]. 3D manipulation on the visualiza-
tion is enabled by simply dragging the image with different angles and zooming in
or zooming out. If a solved structure is existent, the web server will display both
structures side-by-side for comparison. An example of this is represented in Fig. 8.
Users can determine the structural similarity of the prediction by inspecting the side-
by-side comparison. Currently the 3D view only offers a rudimentary view of the
Fig. 7 Results page provided to users once the job is finished. Here the map used is EMD-30127.
DeepTracer Web Server automatically downloads the target structure from the PDB and provides
evaluation metrics such as RMSD, sequence coverage, sequence matching percentage, and mean
length of matched segments
110 D. Si et al.
Fig. 8 A side-by-side comparison of the predicted and solved protein structures provided by the
3D viewer. The predicted structure is less dense because it is missing the side-chains but the
backbone trace is similar by shape
Fig. 9 The matching results of EMD-6272. The amino acid sequence predicted by DeepTracer
closely matches the solved sequence. Shared color schemes represent amino acids with similar
properties
backbone and side-chain, but we are planning to incorporate more features in the
future. The other visualization tool is the text-based sequence comparison tool as
shown in Fig. 9. The sequence comparison tool is useful because it provides quick
feedback of the sequence prediction. The challenge of map-to-model is twofold,
the first is to identify the correct residues and residue types; and the second is to
connect the predicted residues in the correct order. Therefore, it is helpful to see if
the matching residues have the same amino acid types and if they are connected in
the correct order. The sequence comparison tool provides both pieces of information
in a user-friendly way. The predicted residues are ordered in the same sequence
as the solved sequence. The amino acids are color-coded based on Jmol’s scheme
[12] for easy identification and the consensus signs (*, ;, .) indicate the similarity
between the predicted and solved amino acids based on the PAM 250 scoring matrix.
Users can identify incorrect connections in the predicted sequence instantly and it
is especially beneficial for DeepTracer developers to identify problems and improve
on the prediction algorithm.
3.3.3 DeepTracer Offline
With the advent of DeepTracer, researchers from more than 20 countries are utilizing
our fully-automated atomic structure prediction service and we have received
plenty of requests to build an offline application being capable of conducting
Fig. 10 The offline application’s architecture. The design is similar to our web application with
some transplants to stay offline
prediction completely offline on users’ own machines due to privacy considerations.

In light of such considerations and demand, we have developed a desktop appli-
cation for MacOS and Windows distributions opened up to general public. Users
must apply for academic license at https://els2.comotion.uw.edu/product/deeptracer
before downloading.
We choose the Kivy framework (https://kivy.org/#home) for the offline appli-
cation because it allows us to package and distribute the same code to multiple
platforms. This reduces the complexities and workloads associated with developing
cross-platform applications. To integrate the prediction pipelines, we employ an
architecture that consists of a graphical user interface (GUI) and a job listener that
runs in a separate process. The job listener records the user-submitted jobs in a
priority queue and loads them into the prediction pipelines sequentially. Since the
prediction pipeline’s input and output are both facilitated through the file systems
on the users’ machines, we use a JSON file in place of a database to manage them.
The resulting architecture is shown in Fig. 10.
3.3.4 Email Subscription
With the increasing number of users using DeepTracer, sending emails becomes an
essential function for DeepTracer to communicate with all users. In addition to the
verification use during signing up and password resetting, sending out important
updates and notices to users subscribed is also important. To achieve this, we have
implemented a subscription mechanism on the web server so that users can receive
our latest news such as the publishing of our offline applications. With our bulk
email sending function, we make DeepTracer more user-friendly.
3.3.5 Mobile Features
Since the launch of DeepTracer Web Server, we have seen many users accessing
the server using mobile devices. We introduced and developed several mobile
features to let users have the best possible experience. We have implemented
112 D. Si et al.
features that render DeepTracer Web Server as a cutting-edge Progressive Web App
(PWA) that merges the functionality of a web application with that of a native
mobile application (https://web.dev/what-are-pwas/). This provides mobile users
with advanced features, including the ability to install the application to their home
screen, create shortcuts for easy access to data, use in offline viewing mode, and
enable push notifications.
3.3.6 User Engagement
During the year of 2021, our website has been accessed by users from 10855
different locations around the world. Their average daily engagement time is about
4 minutes which shows their strong interest in our service and the ultra-fast job
running time—users could easily and quickly get what they need in minutes. Also,
the devices they use are coming from all mainstream platforms for computers and
mobile devices including Windows, Mac, Linux, iOS and Android.
4 Discussion
4.1 Design Philosophy
We created DeepTracer to improve the speed and accuracy of de novo structural

determinations using cryo-EM data. DeepTracer Web Server is a natural extension
of this effort. Our goal in building and improving it is to make cryo-EM structural
determinations more easily accessible through a simple and intuitive interface.
Every decision by the team centers on what will be the best for users.
Reusability and extensibility are another two elements of DeepTracer Web
Server’s design. As discussed in the architecture section of this paper, we built
DeepTracer as a set of components that can be reused and extended as needed.
By doing so, we saved substantial development time and kept our code base simpler
than if we implemented the entire service from scratch.
The overall design philosophy of DeepTracer Web Server can be summarized
as nimble, user-centered, and state of the art. Our team is focused on conducting
groundbreaking research and making our solutions available to a wide audience
quickly. We recognize there is much work ahead.
4.2 Future Features
In spite of the mature frameworks used on our web server, there are still many
aspects worthy of enhancement. Besides the previously mentioned increase on job
processing bandwidth, we plan to improve the concurrency of our server as well.

This is because the number of registered users in DeepTracer has been continually
growing since the time when our website was made public, which leads to a
higher possibility to handle several heavy tasks such as data upload and protein
prediction at the same time. To avoid other functional parts of the web server
being affected by these tasks, we will fully utilize the power of gevent framework
through customization of the source code which helps achieve high performance
programming on the production server. Also, we plan to continue working on
new UI/UX features such as a position indicator for each user whenever there are
multiple jobs queued up. By implementing these features, we will eventually make
the DeepTracer website as close to the industry standard as possible.
Acknowledgments We would like to thank the entire DeepTracer team for their discussion and
collaboration on the project, including Andrew Nakamura, Michael Chavez, Haowen Guan, Sean
Lin, Kiran Mitra, Victor Suciu, Kristin Ding, Hongyi Ji, Rene Gomez and the previous developers.
Funding This material is based upon work supported by the National Science Foundation under
Grant No. 2030381, the graduate research award of Computing and Software Systems division,
and the SRCP Seed Grant at University of Washington Bothell to D.S.. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of the National Science Foundation.
References
1. Dong Si, Andrew Nakamura, Runbang Tang, Haowen Guan, Jie Hou, Ammaar Firozi, Renzhi
Cao, Kyle Hippe, and Minglei Zhao. Artificial intelligence advances for de novo molecular
structure modeling in cryo-electron microscopy. WIREs Computational Molecular Science,
page e1542.
2. Y. Cheng, N. Grigorieff, P. A. Penczek, and T. Walz. A Primer to Single-Particle Cryo-Electron
Microscopy. Cell, 161(3):438–449, April 2015.
3. Ewen Callaway. Revolutionary cryo-EM is taking over structural biology. Nature,
578(7794):201, February 2020.
4. Jonas Pfab, Nhut Minh Phan, and Dong Si. Deeptracer for fast de novo cryo-em protein
structure modeling and special studies on cov-related complexes. Proceedings of the National
Academy of Sciences, 118(2), 2021.
5. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional Networks for
Biomedical Image Segmentation. In International Conference on Medical Image Computing
and Computer-assisted Intervention, pages 234–241. Springer, 2015.
6. Thomas C Terwilliger, Paul D Adams, Pavel V Afonine, and Oleg V Sobolev. A fully automatic
method yielding initial models from high-resolution cryo-electron microscopy maps. Nature
methods, 15(11):905–908, 2018.
7. Jonas Pfab, Yinrui Deng, Hanze Meng, Yutong Xie, Jackson Tan, Michael Chavez, and Dong
Si. DeepTracer Engineering Team. https://deeptracer.uw.edu/about-us. Accessed 4 January
2022.
8. F. Kern, T. Fehlmann, and A. Keller. On the lifetime of bioinformatics web services. Nucleic
Acids Research, 48(22):12523–12533, 2020.
9. Catherine L. Lawson, Ardan Patwardhan, Matthew L. Baker, Corey Hryc, Eduardo Sanz
Garcia, Brian P. Hudson, Ingvar Lagerstedt, Steven J. Ludtke, Grigore Pintilie, Raul Sala,
John D. Westbrook, Helen M. Berman, Gerard J. Kleywegt, and Wah Chiu. Emdatabank unified
data resource for 3dem. Nucleic Acids Research, 44(D1):D396–D403, 11 2015.
114 D. Si et al.
10. Eric F Pettersen, Thomas D Goddard, Conrad C Huang, Gregory S Couch, Daniel M
Greenblatt, Elaine C Meng, and Thomas E Ferrin. UCSF chimera–a visualization system for
exploratory research and analysis. J. Comput. Chem., 25(13):1605–1612, October 2004.
11. M. Bertoni, F. Kiefer, M. Biasini, L. Bordoli, and T. Schwede. Modeling Protein Quaternary
Structure of Homo- and Hetero-oligomers Beyond Binary Interactions by Homology. Sci Rep,
7(10480):9654–8, 2017.
12. Jmol: an open-source Java viewer for chemical structures in 3D. http://jmol.sourceforge.net/.
Accessed 12 January 2022.

Algorithms and Methods in Structural Bioinformatics (Computational Biology) (Nurit Haspel (Editor) Etc.)

Uploaded by

Copyright:

Available Formats

Algorithms and Methods in Structural Bioinformatics (Computational Biology) (Nurit Haspel (Editor) Etc.)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Algorithms and Methods in Structural Bioinformatics (Computational Biology) (Nurit Haspel (Editor) Etc.)

Uploaded by

Copyright:

Available Formats

Computational Biology

Editorial Board Members

Algorithms and Methods

ISSN 1568-2684 ISSN 2662-2432 (electronic)

The three-dimensional structure and function of molecules present many challenges

for studying the volumetric properties of molecular surfaces and electrostatic

Boston, MA, USA Nurit Haspel

Protein-Ligand Binding with Applications in Molecular Docking . . . . . . . . . 1

4.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.2 Detecting Functional and Fitness Effects of InDels on

Nikita Mishra and Negin Forouzesh

Protein-ligand binding is one of the most fundamental processes in various organ-

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 1

computational methods have emerged for the quantitative determination of binding

1.1 Thermodynamic Basis of Protein-Ligand Interaction

A property of a thermodynamic system, the heat content of a system at constant

G = G◦ + RT lnQ (3)

2 Computational Methods for Estimating Binding

2.1 Alchemical Methods

Fig. 1 The vertical arrows

2.2 Transition Path Sampling Methods

Proteins and ligands undergo several conformational changes in stable states

2.3 End-Point Methods

G = EMM + Gsolv − T S (6)

Fig. 2 The thermodynamic

3 Protein-Ligand Binding Databases

Several freely available protein-ligand binding databases have been organized to

Molecular docking is a computational method used to determine the binding mode

shorter MD simulations or other parameter changes, allowing for a more flexible

Computational study of protein-ligand binding is useful in several fields, including

42. N. Forouzesh, A. Mukhopadhyay, L. T. Watson, and A. V. Onufriev, “Multidimensional global

81. W. C. Still, A. Tempczyk, R. C. Hawley, and T. Hendrickson, “Semianalytical treatment

Ziyi Guo and Brian Y. Chen

Biological systems depend on proteins to perform almost every chemical function

Z. Guo · B. Y. Chen ()

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 17

uncertain: The space of hypothetical interactions and interaction mechanisms is

1.1 Comparison Algorithms for Examining Specificity

Specificity is a property that is fundamentally defined by comparison: A protein

The prediction of small molecule binding specificity is founded on the principle

enzymes developed by the Nomenclature Committee of the International Union of

2.1 Binding Site Representations

The comparison of protein-ligand binding sites depends on a digital representation

2.2 Metrics for Binding Site Comparison

Structural similarity between point-based motifs is almost always measured using

2.3 Comparison Algorithms

Comparison algorithms for function annotation and specificity assignment are

2.3.1 Data Structures

index = xpos ∗ (ydim ∗ zdim) + ypos ∗ (zdim) + zpos

2.4 Statistical Models for Binding Site Comparison

Protein structure comparison algorithms identify matches that indicate geometric

nonparametric models could compensate for incomplete matching algorithms that

3.1 Foundations of Structure-Based Component Localization

While sequence-based approaches can reveal an evolutionary rationale for the

3.1.1 Comparing Solid Representations with CSG

comparisons of the patterns of steric hindrance imposed by protein structure onto

3.1.2 Solid Representations of Binding Cavities

a = (Sl − m(A)) ∩ e(A)

As an instructive example, this cavity definition can have some shortcomings. A