Computational Molecular Biology
Computational Molecular Biology
Computational Molecular Biology
net/publication/47842586
CITATIONS READS
8 501
3 authors, including:
Petra Mutzel
University of Bonn
302 PUBLICATIONS 6,053 CITATIONS
SEE PROFILE
All content following this page was uploaded by Petra Mutzel on 05 June 2014.
Contents
1. Books and Surveys
2. Sequence Alignment and Evolution
2.1 Pairwise Sequence Alignment
2.2 Multiple Sequence Alignment
2.3 Evolutionary Trees
2.4 Trees and Alignment
3. Tree Comparison
4. Genome Rearrangements
5. Sequencing and Mapping
5.1 Sequence Assembly
5.2 Sequencing by Hybridization
5.3 Digest Mapping
5.4 Mapping Using Hybridization Data
6. Protein Threading and Lattice Models
6.1 Threading
6.2 Lattice Models
Computational Biology is a fairly new subject that arose in response to the
computational problems posed by the analysis and the processing of biomolecular
sequence and structure data. The eld was initiated in the late 60's and early 70's
largely by pioneers working in the life sciences. Physicists and mathematicians entered
the eld in the 70's and 80's, while Computer Science became involved with the new
biological problems in the late 1980's. Computational problems have gained further
importance in molecular biology through the various genome projects which produce
enormous amounts of data.
For this bibliography we focus on those areas of computational molecular biology
that involve discrete algorithms or discrete optimization. We thus neglect several other
areas of computational molecular biology, like most of the literature on the protein
folding problem, as well as databases for molecular and genetic data, and genetic
mapping algorithms. Due to the availability of review papers and a bibliography from
1984 some older papers will not be explicitely mentioned in this bibliography.
1
1 Books and Surveys
In this section, we list some books and surveys that serve as a general introduction to
the eld.
M. S. Waterman. Introduction to Computational Biology. Chapman & Hall, 1995.
is a recent book on computational molecular biology containing a wealth of material
on all the subjects dealt with in this bibliography.
J. R. Jungk and R. M. Friedman. Bull. Math. Biol., 46:699{744, 1984.
is an annotated bibliography summarizing the literature up to 1984.
R. Doolittle. Molecular evolution: Computer analysis of protein and nucleic acid
sequences. Meth. Enzymol., 183, 1990.
is a collection of articles. This volume gives a good overview of the approaches and
the software that are in use in molecular biology. A new volume by the same editor is
due to appear in 1996.
P. A. Pevzner and M. S. Waterman. Open combinatorial problems in computational
molecular biology. In Proc. 3-rd Israel Symp. Theory of Comput. and Systems, Tel
Aviv, Israel, 1995. IEEE Computer Society Press.
is a collection of open problems and implicitely gives an excellent overview of the
area.
The journal Algorithmica devoted a special issue to computational molecular
biology: volume 13, numbers 1/2, 1995 (ed. by Myers). Similarly, Discrete Applied
Mathematics is going to publish a special issue. A brief overview that gives a taste of
the area is
E. S. Lander, R. Langridge, and D. M. Saccocio. A report on computing in molecular
biology: Mapping and interpreting biological information. Commun. ACM, 34(11):33{
39, 1991.
2
be a set of nite strings over A. We de ne a new alphabet A^ = A [ f g by adding to
A the symbol dash \ " to represent indels. A set S^ = fs^1; s^2 ; ; s^k g of strings over
the alphabet A^ is called an alignment of the set S , if the following properties hold: (1)
The strings in S^ have the same length. (2) Ignoring dashes, string s^i is identical with
string si (for all i 2 f1; 2; ; kg). Hence an alignment can be interpreted as an array
with k rows. The ith row contains string s^i . Each column must contain at least one
letter of a string in S (columns lled only with dashes are forbidden). The score of an
alignment is based on a distance function d(^si ; s^j ) for aligned sequences.
D. Sanko and J. B. Kruskal. Time Warps, String Edits and Macromolecules: the
Theory and Practice of Sequence Comparison. Addison Wesley, 1983.
is a book entirely dedicated to sequence comparison.
M. S. Waterman. General methods of sequence comparison. Bull. Math. Biol., 46:473{
500, 1984.
M. S. Waterman. Sequence alignments. In Mathematical Methods for DNA Sequences.,
pages 53{92. CRC Press, Boca Raton, Fl, 1989.
E. Myers. An overview of sequence comparison algorithms in molecular biology.
Technical Report 29, Department of Computer Science of the University of Arizona
at Tucson Arizona, 1991.
are reviews on sequence alignment.
3
M. S. Waterman, M. Eggert, and E. Lander. Parametric sequence comparison. Proc.
Natl. Acad. Sci. USA, 89:6090{6093, 1992.
D. Gus eld, K. Balasubramanian, and D. Naor. Parametric optimization of sequence
alignment. Algorithmica, 12(4/5):312{326, 1994.
M. Waterman. Parametric and ensemble sequence alignment algorithms. Bull. Math.
Biol., 56(4):743{767, 1994.
describe algorithms for parametric sequence comparison.
D. Eppstein, Z. Galil, R. Giancarlo, and G. Italiano. Sparse dynamic programming I:
Linear cost functions. J. ACM, 39:519{545, 1992.
D. Eppstein, Z. Galil, R. Giancarlo, and G. Italiano. Sparse dynamic programming
II: Concave, convex cost functions. J. ACM, 39:546{567, 1992.
study sparse dynamic programming for sequence alignment.
The following books contain extensive material on approximate string matching and
are thus in a wider sense related to our topic:
G. Stephen. String Searching Algorithms. World Scienti c, 1994.
M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.
Much like sequence alignment, deriving the optimal fold of an RNA molecule is
usually done by dynamic programming. This topic, too, is summarized in Waterman's
\Introduction to computational biology".
2.2 Multiple Sequence Alignment
The biological task of comparing several sequences simultaneously has been formalized
in di erent ways. An overview is given in
S. Chan, A. Wong, and D. Chiu. A survey of multiple sequence comparison methods.
Bull. Math. Biol., 54(4):563{598, 1992.
2.2.1 Sum-of-Pairs Multiple Alignment
The Sum of Pairs Multiple Alignment Problem (SPMA) is de ned as follows: Compute
the alignment S^ of S that minimizes the sum of the distances of all pairs s^i ; s^j :
2 3
X
SPMA(S ) := min 4 d(^si ; s^j )5 :
^ S i;j
L. Wang and T. Jiang. On the complexity of multiple sequence alignment. J. Comput.
Biol., 1:337{348, 1994.
prove that the SPMA-problem is NP-complete by reduction from the shortest com-
mon supersequence problem [5].
The SPMA-problem can be formulated as a shortest-path problem: Assume for sim-
plicity that all strings in S have length n. The set of possible alignments of S can be
4
represented by a k dimensional mesh-shaped graph with nk vertices and O(nk 2k ) di-
rected edges. Each path in the graph from the source to the sink represents a possible
multiple alignment of S . The sequence of edges of a path represents the sequence of
columns of the alignment, i.e., each edge codes for one column of the alignment. A
vertex in the graph can be interpreted as a \frontier" in the string set S and thus
also represents a set of pre xes of S ending at this frontier. Hence the set of all paths
from the source to a vertex encode the set of all possible alignments of the pre xes
represented by the vertex. Solving the SPMA-problem for S means computing the
(shortest) path that minimizes the cost function.
The SPMA-problem can be solved using dynamic programming:
M. Waterman, T. Smith, and W. Beyer. Some biological sequence metrics. Adv. Math.,
20:367{387, 1976.
For k = 3 see
R. Jue, N. Woodbury, and R. Doolittle. Sequence homologies among E. Coli ribosomal
proteins: Evidence for evolutionary related groupings and internal duplications. J.
Mol. Evol., 15:129{148, 1980.
M. Fredman. Algorithms for computing evolutionary similarity measures with length
independent gap penalties. Bull. Math. Biol., 46:553{566, 1984.
M. Murata, J. Richardson, and J. Sussman. Simultaneous comparisons of three protein
sequences. Proc. Natl. Acad. Sci. U.S.A., 82:3037{3077, 1985.
O. Gotoh. Alignment of three biological sequences with an ecient traceback
procedure. J. Theor. Biol., 121:327{337, 1986.
Since the size of the graph grows exponentially with the number of strings k (O(nk )
vertices and O(nk 2k ) edges) the dynamic programming approach works only for very
small k.
H. Carrillo and D. J. Lipman. The multiple sequence alignment problem in biology.
SIAM J. Appl. Math., 48(5):1073{1082, 1988.
present an elegant branch-and-bound approach. They describe a technique for reduc-
ing the part of the graph that has to be examined. Only the paths that are contained
in a certain \polytope" around the shortest path will be explored by their algorithm.
D. Lipman, S. Altschul, and J. Kececioglu. A tool for multiple sequence alignment.
Proc. Natl. Acad. Sci. U.S.A., 86:4412{4415, 1989.
implemented a slightly modi ed version of the algorithm of Carrillo and Lipman.
Since the bounds of Carrillo and Lipman are not suciently tight for solving \real
world" multiple sequence alignment instances, Lipman et al. propose heuristics to im-
prove bounds.
S. Gupta, J. Kececioglu, and A. Schae er. Improving the practical space and time
eciency of the shortest-paths approach to sum-of-pairs multiple sequence alignment.
J. Comput. Biol., 2:459{472, 1995.
achieve further, signi cant reduction in space requirements by introducing new al-
gorithmic invariants determining when edges and vertices of the graph rst need to
5
be created and when they can be safely destroyed. Speedup results from the usage of
more ecient data structures for the shortest-path problem.
This algorithm for computing optimal multiple sequence alignments can handle at
most a dozen sequences of length 200 400. Practical problem instances, however,
may contain hundreds of sequences and sequence length may be above 1000. Such
instances cannot be expected to be solved to optimality. Approximation algorithms
and heuristics are required.
D. Gus eld. Ecient methods for multiple sequence alignment with guaranteed error
bounds. Bull. Math. Biol., 55(1):141{154, 1993.
presents the rst approximation algorithm for the SPMA-problem. The approxima-
tion factor of this algorithm is 2 2=k, where k is the number of sequences. The main
idea of Gus eld's approach is to consider alignments that are derived from the star
trees of S .
P. Pevzner. Multiple alignment, communication cost, and graph matching. SIAM J.
Appl. Math., 52(6):1763{1779, 1992.
improves the approximation bound to 2 3=k by deriving multiple alignments from
so-called 3-stars. Using the elegant concept of communication cost, Pevzner proved
that there are 3-stars whose derived multiple alignments are at most a factor 2 3=k
from optimal. Pevzner's algorithm computes an alignment in time O(n3k3 + k4 ), where
n is the (maximal) length of the sequences.
V. Bafna, E. Lawler, and P. Pevzner. Approximation algorithms for multiple sequence
alignment. In Proc. 5-th Annual Symp. Combinatorial Pattern Matching, pages 43{53.
Springer-Verlag, 1994.
By exploring cliques of l sequences with one common center (l-stars instead of 3-
stars) Bafna, Lawler and Pevzner achieve an approximation bound of 2 l=k.
2.2.2 Maximum Weight Trace
The input to the Maximum Weight Trace (MWT) alignment problem is a set S of k
strings and a graph G = (V; E ). The letters of the strings si of S are the vertices V of
the graph. Every edge e 2 E of the graph has a positive weight we and connects two
vertices (letters) that belong to di erent strings (i.e., there are no edges that connect
two letters of the same string). An alignment S^ realizes an edge e if the two letters
connected by the edge are placed in the same column of the alignment. The set of edges
realized by an alignment S^ is called the trace (trace(S^)) of S^. The MWT problem is
de ned as follows: Compute an alignment S^ that realizes a trace with maximal weight:
2 3
X
MWT(S ) := max 4 we 5 :
^ S e2trace(S^)
J. Kececioglu. The maximum weight trace problem in multiple sequence alignment.
In Proc. 4-th Symp. Combinatorial Pattern Matching, pages 106{119. Springer-Verlag,
6
1993.
introduced the MWT-problem. He proves that the MWT-problem contains the
SPMA-problem under certain conditions, as a special case, and that the MWT-
problem is NP-complete by reduction from the Feedback Arc Set Problem [5].
Kececioglu presents a branch-and-bound algorithm for the MWT-problem whose
implementation could optimally align six sequences of length 250 in a few minutes.
Here, a heuristic alignment of the k sequences yields a lower bound. Upper bounds
for the branch-and-bound approach are calculated by adding up the weights of all
MWT-optimal pairwise alignments of sux sets of S .
2.2.3 Chaining Multiple-Alignment Fragments
A fragment of a set S of k strings is a set S 0 = fs0 ; s0 ; ; s0k g, where s0i is a
1 2
non-empty substring of si (for all i). A fragment f1 is smaller than a fragment f2
(f1 < f2 ), if the substrings of f1 and f2 do not overlap and the substrings of f1
are to the left of the substrings of f2 . The set of letters of S that lie between the
substrings of two fragments f1 and f2 with f1 < f2 are called the gap of f1 and f2 .
We call a set ff1 < f2 < < fl g of ordered fragments a chain of fragments. The
Chaining Multiple-Alignment Fragments Problem is de ned as follows: Let F be a set
of fragments of S , where each fragment f 2 F has a positive score (score(f )). Let
gap cost(; ) be a "gap" penalty function that assigns a cost to a gap between two
ordered fragments. Compute a chain of fragments F^ that maximizes the following
function:
2 3
X
CMAF(S ) := max 4 score(f ) gap cost(f; successor(f ))5 :
^ F f 2F
^
7
implemented a practical algorithm for the CMAF-problem that uses kD-trees to
compute the optimal chain.
E. Myers and W. Miller. Chaining multiple-alignment fragments in sub-quadratic
time. In Proc. 6-th Annual ACM-SIAM Symp. Discr. Algorithms, pages 38{47, 1995.
presented the rst sub-quadratic time algorithm for special gap cost functions.
Myers and Miller prove that the maximal fragment chain can be computed in time
O(jF j(log jF j)k) with O(kjF j(log jF j)k 1) space.
2.3 Evolutionary Trees
Molecular sequences are used to reconstruct the course of evolution. Since evolution is
assumed to have proceeded from a common ancestral species in a tree-like branching
of species (molecules), this process is generally modeled by a tree. When the most
ancestral species is known, the model will be a rooted tree. The leaves of the tree are
labeled with contemporary species while the inner nodes correspond to hypothetical
ancestors. The key question is the reconstruction of this tree based on contemporary
data. These data may come in one of two forms: As a multiple alignment with the
sequences corresponding to leaves or as a matrix of distances between leaf-labels.
Methods are thus divided into character-based methods and distance-based methods.
2.3.1 Character-based methods
F. K. Hwang, D. S. Richards, and P. Winter. The Steiner Tree Problem. North-
Holland, 1992.
contains a good introduction to character-based methods.
An idealized but interesting model of evolution is embodied in the Perfect Phy-
logeny Problem. Let the number of species be k. Let a set of m characters (e.g., the
columns of the multiple alignment), each character having r possible states (e.g., the
four nucleotides A, C, G, and T), be given. We say that a character is compatible
with a tree when the inner nodes of the tree can be labeled such that each character
state induces a subtree. A tree is said to be a perfect phylogeny when all characters
are compatible with it. The perfect phylogeny problem is to decide whether a given
set of characters has a perfect phylogeny and if so construct it.
H. Bodlaender, M. Fellows, and T. Warnow. Two strikes against perfect phylogeny. In
Proc. 19-th Int. Coll. Automata, Lang. and Program., pages 273{283. Lecture Notes
Comp. Sci., 1992.
M. A. Steel. The complexity of reconstructing trees from qualitative characters and
subtrees. J. Classi cation, 9:91{116, 1992.
both show NP-completeness for arbitrary number of character states.
D. Gus eld. Ecient algorithms for inferring evolutionary trees. Networks, 21:19{28,
1991.
solves the perfect phylogeny problem for binary characters (r = 2) in linear time.
8
A. Dress and M. Steel. Convex tree realizations. Appl. Math. Lett., 5:3{6, 1992.
present a solution for r = 3 in O(km2 ).
S. K. Kannan and T. J. Warnow. Inferring evolutionary history from DNA sequences.
SIAM J. Comput., 23(4):713{737, 1994.
This paper presents an O(k2m) algorithm for r 4 which is especially important
because it allows the modeling of evolution of DNA sequences.
F. R. McMorris, T. Warnow, and T. Wimer. Triangulating vertex-colored graphs.
SIAM J. Discr. Math., 7:296{306, 1993.
show that the perfect phylogeny problem is polynomially equivalent to coloring tri-
angulated graphs and use this to design a perfect phylogeny algorithm for arbitrary r
running in O((rm)m+1 + km2 ).
R. Agarwala and D. Fernandez-Baca. A polynomial time algorithm for the perfect
phylogeny problem when the number of character states is xed. In Proc. 34-th Annual
IEEE Symp. Found. Comput. Sci., 1993. Also to appear in SIAM J. Comp.
improve this result by giving a dynamic programing algorithm for the perfect
phylogeny problem. Based on their ideas
S. Kannan and T. Warnow. A fast algorithm for the computation and enumeration
of perfect phylogenies when the number of character states is xed. pages 595{603,
1995.
give an O(22r km2 ) algorithm.
T. Warnow. Tree compatibility and inferring evolutionary history. J. Algorithms,
16:388{407, 1994.
For a given alignment, the question of identifying the maximal number of alignment
columns that allow a perfect phylogeny is addressed. It is mapped to a maximum
clique problem.
9
mutations over all trees one nds the \most parsimonious tree".
L. R. Foulds and R. Graham. The Steiner problem in phylogeny is NP-complete. Adv.
Appl. Math., 3:43{49, 1982.
show NP-completeness of nding the most parsimonious tree. This is based on an
analogy to Steiner trees: The most parsimonious tree is the minimal Steiner tree link-
ing the given sequences.
L. R. Foulds, M. D. Hendy, and D. Penny. A graph theoretic approach to the
development of minimal phylogenetic trees. J. Mol. Evol., 13:127{149, 1979.
give an approximation algorithm for the most parsimonious tree based on the min-
imum spanning tree heuristic for Steiner trees.
M. D. Hendy and D. Penny. Branch and bound algorithms to determine minimal
evolutionary trees. Math. Biosci., 59:277{290, 1982.
apply branch-and-bound algorithms for calculating the most parsimonious tree.
2.3.2 Distance based methods
Distance-based methods attempt to approximate a given set of distances on the leaf-
labels of the tree by the path-metric of an edge-weighted tree. A distance matrix that
coincide with the path-metric of a tree is called an additive matrix. A characterization
of such matrices is given in
P. Bunemann. The recovery of trees from measures of dissimilarity. In F. Hodson,
D. Kendall, and P. Tautu, editors, Mathematics in the archaeological and historical
sciences, pages 387{395. Edinburgh University Press, 1971.
M. S. Waterman, T. F. Smith, M. Singh, and W. A. Beyer. Additive evolutionary
trees. J. Theoret. Biol., 64:199{213, 1977.
Let an additive matrix be given. This paper presents an algorithm to compute the
tree whose path-metric coincides with the given matrix. Furthermore it proves unique-
ness of the tree.
Other algorithms for this purpose are given in:
H.-J. Bandelt and A. Dress. Reconstructing the shape of a tree from observed
dissimilarity data. Adv. Appl Math., 7:309{343, 1986.
N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing
phylogenetic trees. Mol. Biol. Evol., 4:406{425, 1987.
J. Culberson and P. Rudnicki. A fast algorithm for constructing trees from distance
matrices. Inform. Process. Lett., 30:215{220, 1989.
H.-J. Bandelt. Recognition of tree metrics. SIAM J. Discr. Math., 3:1{6, 1990.
The idea that for all species the same amount of time has passed since the existence
of some common ancestor has led to the study of rooted, edge-weighted trees with all
leaves being the same distance away from the root. Such a tree is called an ultrametric
tree and its corresponding path-metric is called an ultrametric. A simple clustering
method like single linkage clustering [3] suces to recognize such metrics and compute
10
the tree.
Let (A)i;j be the (additive) path-metric of a tree and let (B )i;j be an arbitrary
distance matrix. To judge how well the additive matrix (and thus the tree that goes
with it) approximates
P the distance matrix, a distance between matrices is used. Usually
it is de ned as i;j jAij Bij j , with = 1 corresponding to the L1 -norm, = 2
corresponding to the L2-norm. Some authors use the L1 -norm (maxi;j jAij Bij j).
The problem of nding the closest tree under either an L1 or and L2 -norm has been
proven NP-complete in
W. H. E. Day. Computational complexity of inferring phylogenies from dissimilarity
matrices. Bull. Math. Biol., 49:461{467, 1987.
using results from
M. Krivanek and J. Moravek. NP-hard problems in hierarchical-tree clustering. Acta
Inform., 23:311{323, 1986.
Recently, guaranteed error bound algorithms for approximation of a distance matrix
by ultrametric and additive trees have been developed:
M. Farach, S. Kannan, and T. Warnow. A robust model for nding optimal
evolutionary trees. Algorithmica, 13(1/2):155{179, 1995.
R. Agarwala, V. Bafna, M. Farach, B. Naryanan, M. Paterson, and M. Thorup. On the
approximability of numerical taxonomy ( tting distances by tree metrics). In Proc.
7-th Annual ACM-SIAM Symp. Discr. Algorithms, pages 365{372, 1996.
In this paper the approximation is measured in L1 -norm.
The reader may have noted a certain abundance of algorithms to construct trees
from additive matrices. The importance of having several algorithms on hand for this
purpose lies in the fact that they also constitute a source of ideas for heuristics. The
review
D. L. Swo ord and G. J. Olsen. Phylogeny Reconstruction. In D. M. Hillis
and C. Moritz, editors, Molecular Systematics, pages 411{501. Sinauer Associates,
Sunderland, Massachusetts, 1990.
give an overview of heuristics for tree approximation as well as many other ap-
proaches to phylogeny reconstruction that are in practical use.
A prominent method applies maximum likelihood estimation to judge the quality
of a tree:
J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood
approach. J. Mol. Evol., 17:368{376, 1981.
In the last few years, several other interesting approaches to phylogeny
reconstruction have emerged which are not based on discrete optimization. Some key
papers are
L. Szekely, M. Steel, and P. Erdos. Fourier calculus on evolutionary trees. Adv. Appl.
Math., 14:200{216, 1993.
S. N. Evans and T. P. Speed. Invariants of some probability models used in
phylogenetic inference. Ann. Statist., 21:355{377, 1993.
H. Bandelt and A. Dress. A canonical decomposition theory for metrics on a nite
11
set. Adv. Appl. Math., 92:47{105, 1992.
12
and Macromolecules: The Theory and Practice of Sequence Comparison., pages 93{
120. Addison-Wesley, 1983.
Sanko raised the question of multiple sequence alignment and introduced the
MSTA-problem. It can be formulated as a shortest-path problem in a k-dimensional
mesh-shaped graph and can be optimally solved using dynamic programming. The
edges of the graph represent possible columns of alignments. We have to assign letters
to the lower m rows of each column (letters for the inner vertices), such that the cost
of each column is minimized. The m lower letters of a column can be computed using
a dynamic programming approach (the Fitch-Hartigan algorithm to construct a par-
simonious tree, see Section 2.3). This \inner minimization" has to be carried out for
each edge (possible column) of the graph. The shortest-path from the source to the
sink codes the optimal tree alignment.
S. Altschul and D. J. Lipman. Tree, stars, and multiple biological sequence alignment.
SIAM J. Appl. Math., 49(1):179{209, 1989.
present a branch-and-bound algorithm for the MSTA-problem that is an extension
of Carrillo and Lipman's algorithm ([Carrillo and Lipman 1988], see Subsection 2.2).
Altschul and Lipman describe a new approach to compute suitable bounds for the
branch-and-bound algorithm by solving optimization problems that are \almost" clas-
sic linear programming problems.
D. Sanko , R. J. Cedergren, and G. Lapalme. Frequency of insertion-deletion,
transversion, and transition in evolution of 5S ribosomal RNA. J. Mol. Evol., 7:133{
149, 1976.
designed an iterative procedure for local optimization in the tree. Sanko et al. de-
compose the tree into small overlapping subtrees (star trees). In each iteration step
all subtrees will be locally optimized starting with the more \peripheral" subtrees.
The algorithm stops when an iteration step has been performed without change in the
subtree cost.
J. Hein. A new method that simultaneously aligns and reconstructs ancestral sequences
for any number of homologous sequences, when the phylogeny is given. Mol. Biol.
Evol., 6:649{668, 1989.
J. Hein. A tree reconstruction method that is economical in the number of pairwise
comparisons used. Mol. Biol. Evol., 6:669{684, 1989.
Hein designed and implemented a heuristic method that yields good approxima-
tions for tree alignment and suggests an algorithm for the GMSTA-problem. Hein
introduced the concept of sequence graphs for storing large sets of sequences. A se-
quence graph is a directed, acyclic and connected graph with a source and a sink.
Each edge represents a letter or even a subsequence of a sequence. Each path from the
source to the sink codes a sequence. A sequence graph represents the set of sequences
that is coded by all source-to-sink paths.
D. Gus eld. Ecient methods for multiple sequence alignment with guaranteed error
bounds. Bull. Math. Biol., 55(1):141{154, 1993.
13
Since the MSTA-problem can be formulated as a Steiner tree problem on graphs,
approximation approaches for Steiner minimal trees have been successfully applied to
the MSTA-problem. Gus eld has shown that the minimum spanning tree approach
yields an approximation ratio of 2. By using better approximation techniques for the
Steiner tree problem, for instance the approach of Zelikovsky [8] or of Berman and
Ramaiyer [1], better approximation ratios can be obtained.
R. Ravi and J. Kececioglu. Approximation algorithms for multiple sequence alignment.
In Z. Galil and E. Ukkonen, editors, Proc. 6-th Symp. Combinatorial Pattern Matching,
pages 330{339, 1995.
propose an approximation algorithm for regular deg-ary trees (each internal node
has exactly deg children) that nds solutions with an approximation ratio of degdeg+1 .
1
3 Tree Comparison
Given two or more evolutionary trees computed from di erent gene families, the prob-
lem of comparing these phylogenies arises. Several notions of similarity between trees
have been suggested, some of which are merely a similarity measure while others com-
pute a consensus tree. We give only a few references to some prominent approaches.
M. S. Waterman and T. F. Smith. On the similarity of dendrograms. J. Theor. Biol.,
73:784{900, 1978.
and
K. Culik II and D. Wood. A note on some tree similarity measures. Inform. Process.
Lett., 15:39{42, 1982.
introduce the Nearest Neighbor Interchange (NNI) metric on trees. The metric
counts the number of elementary operations, the NNI's, required to transform one
tree into the other. For an edge with two subtrees at either node, two NNIs are pos-
sible, representing the two alternative topologies.
E. K. Brown and W. H. E. Day. A computationally ecient approximation to the
14
nearest neighbor interchange metric. J. Classi cation, 1:93{124, 1984.
give a heuristic approximation algorithm for calculation of NNI distance.
The existence of a polynomial time algorithm for NNI is still open.
M. Krivanek. Computing the nearest neighbor interchange metric for unlabeled binary
trees is NP-complete. J. Classi cation, 3:55{60, 1986.
has shown NP-completeness only for unlabeled trees.
C. R. Finden and A. D. Gordon. Obtaining common pruned trees. J. Classi cation,
2:255{276, 1985.
introduced Maximum Agreement Subtree as the homeomorphous subtree common
to two trees spanning the maximum number of leaves.
M. Steel and T. Warnow. Kaikoura tree theorems: Computing the maximum
agreement subtree. Inform. Process. Lett., 48:77{82, 1993.
M. Farach and M. Thorup. Fast comparison of evolutionary trees. In Proc. 5-th
Annual ACM-SIAM Symp. Discr. Algorithms, pages 481{488, 1994.
have successively improved the corresponding algorithms.
Maximum agreement subtrees for more than two trees are considered in
A. Amir and D. Keselman. Maximum agreement subtree in a set of evolutionary trees
- metrics and ecient algorithms. In Proc. 35-th Annual IEEE Symp. Found. Comput.
Sci., pages 758{769, 1994.
J. Hein, T. Jiang, L. Wang, and K. Zhang. On the complexity of comparing
evolutionary trees. In Proc. 6-th Annual Symp. Combinatorial Pattern Matching,
pages 177{190, 1995.
4 Genome Rearrangements
The order in which genes are arranged on the DNA molecule is the result of an evolu-
tionary process. Over time, a gene order formerly present in an ancient species may,
due to certain rearrangements in the genome, have evolved into a gene order we can
observe today. The computational task lies in reconstructing the changes that may
have occured to transform one gene order into another. A much studied elementary
operation in these transformations is, e.g., the reversal of subsets of genes. The choice
of elementary operation depends on the organism or cell organelle under study. More
sophisticated scenarios model the evolution of sets of chromosomes which can exchange
genes among each other.
S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (polynomial
algorithm for sorting signed permutations by reversals). In Proc. 27-th Annual ACM
Symp. Theory of Comput., 1995.
is an excellent overview of the eld of genome rearrangements.
15
Let an ordered set of n genes be given. A reversal is the operation which cuts out
a contiguous subset, inverts the order of genes in this subset, and reinserts it again
in its original position. Reversal distance between two permutations is the minimal
number of reversals required to transform one permutation into another. Since genes
also have a direction, a more accurate model introduces signs for the genes. A given set
of genes is represented by a permutation with signs on each entry. Reversing a subset
of genes then has the additional e ect of changing all signs of the a ected genes. The
problem is to calculate the minimal number of signed reversals necessary to transform
one signed permutation into another.
G. A. Watterson, W. J. Ewens, T. E. Hall, and A. Morgan. The chromosome inversion
problem. J. Theor. Biol., 99:1{7, 1982.
was the rst paper to raise the formal problems of comparing gene orders.
D. Sanko , G. Leduc, N. Antoine, B. Paquin, B. F. Lang, and R. Cedergren. Gene
order comparisons for phylogenetic inference: Evolution of the mitochondrial genome.
Proc. Natl. Acad. Sci. USA, 89:6575{6579, 1992.
use heuristics to estimate the number of rearrangements that occured between two
contemporary species. The resulting distance measure is used to reconstruct an evo-
lutionary tree.
J. Kececioglu and D. Sanko . Exact and approximation algorithms for the inversion
distance between two permutations. Algorithmica, 13:180{210, 1995.
give an approximation algorithm for the number of (unsigned) reversals with a
guaranteed error bound of 2 and devise a branch-and-bound algorithm. They specu-
late that the problem is NP-complete.
V. Bafna and P. Pevzner. Genome rearrangements and sorting by reversals. In Proc.
34-th Annual IEEE Symp. Found. Comput. Sci., pages 148{157, 1993. To appear in
SIAM J. Comp., vol 25(2), April 1996.
improve the error bound to 1:75. They further show that it may take up to n 1
reversals to transform one permutation into another and also devise a factor 1:5 ap-
proximation algorithm for signed reversals.
J. Kececioglu and D. Sanko . Ecient bounds for oriented chromosome inversion
distance. In Proc 5-th Annual Symp. Combinatorial Pattern Matching, pages 307{
325, 1994.
give a branch-and-bound algorithm for signed reversals.
S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (polynomial
algorithm for sorting signed permutations by reversals). In Proc. 27-th Annual ACM
Symp. Theory of Comput., 1995.
found a duality theorem for the number of signed reversals. Based on this theorem
they devised an O(n4) algorithm.
16
S. Hannenhalli and P. Pevzner. To cut ... or not to cut. In Proc. 7-th Annual ACM-
SIAM Symp. Discr. Algorithms, pages 304{313, 1996.
use the improved algorithm for signed permutations and connections between the
two problems to devise a practically ecient algorithm for unsigned reversals.
V. Bafna and P. Pevzner. Sorting permutations by transpositions. In Proc. 6-th
Annual ACM-SIAM Symp. Discr. Algorithms, pages 614{623, 1995.
study another operation on permutations, so-called transpositions. In this context a
transposition does not exchange two single elements but two adjacent stretches from
a permutation.
Genome rearrangements are more complicated when genes are distributed over dif-
ferent chromosomes. Chromosomes can exchange genetic material. A translocation is
the process where a contiguous set of genes from an end of a chromosome is exchanged
with a contiguous set of genes from an end of another chromosome. Fusion, the com-
bination of two chromosomes, and ssion, the breakage of one chromosome into two
new ones, are other relevant processes.
J. Kececioglu and R. Ravi. Of mice and men: Evolutionary distances between genomes
under translocation. In Proc. 6-th Annual ACM-SIAM Symp. Discr. Algorithms, pages
604{613, 1995.
S. Hannenhalli. Polynomial-time algorithm for computing translocation distance
between genomes. pages 162{176.
study distances between genomes based on the above operations.
J. Kececioglu and D. Gus eld. Reconstructing a history of recombinations from a
set of sequences. In Proc. 5-th Annual ACM-SIAM Symp. Discr. Algorithms, pages
471{480, 1994.
study related operations in the context of reconstructing evolutionary history.
17
R. M. Karp. Mapping of the genome: Some combinatorial problems arising in
molecular biol. In Proc. 25-th Annual ACM Symp. Theory of Comput., pages 278{285,
1993.
P. A. Pevzner, editor. Combinatorial Methods for DNA Mapping and Sequencing,
volume 2, 2 of Special Issue of J. Comput. Biol. 1995.
This volume contains eleven presentations of the DIMACS workshop \Combinatorial
Methods in DNA Mapping and Sequencing", that opened the DIMACS computational
molecular biology year 1994/1995. The articles give a strong emphasis on applications
of combinatorial methods in molecular biology.
See also [Special Issue of Algorithmica, ed. by Myers] and [Open combinatorial
problems in computational molecular biology, by Pevzner and Waterman] (see
Section 1).
5.1 Sequence Assembly
Current technology permits experimentalists to directly determine the sequence of a
DNA strand of up to 800 nucleotides in length. To sequence a long piece of DNA many
such reads are taken and subsequently re-assembled to produce the original sequence.
Computationally, this gives rise to the problem of assembling the fragments using
the overlap information among them. Overlaps are deduced from sequence similarity.
Since 1{10% of the nucleotides in the fragment data are missing or incorrect, and
since a fragment's sequence can be reversed with respect to the others these overlaps
cannot be perfectly determined. Thus, the formal Sequence Reconstruction Problem,
rst formalized in
H. Peltola, H. Soderland, J. Tarhio, and E. Ukkonen. Algorithms for some string
matching problems arising in molecular genetics. In Proc. 9-th IFIP World Computer
Congress, pages 59{64, 1983.
can be stated as follows: Given a collection of fragment sequences F and an error
rate 0 < 1, nd a shortest sequence S such that every fragment F 2 F , or its
reverse complement, matches a substring of S with at most jF j errors.
H. Peltola, H. Soderland, and E. Ukkonen. SEQUAID: A DNA sequence assembly
program based on a mathematical model. Nucl. Acids Res., 12:307{321, 1984.
decompose this extremely hard problem into three combinatorial optimization prob-
lems: the overlap graph construction, the layout phase, and the alignment problem.
J. Kececioglu and E. Myers. Exact and approximate algorithms for the sequence
reconstruction problem. Algorithmica, 13(1-2):7{51, 1995.
Here, for each of the subproblems either exact algorithms or approximate algorithms
are suggested.
J. D. Kececioglu. Exact and approximation algorithms for DNA sequence
reconstruction. PhD thesis, U. Arizona, 1991.
This thesis gives a detailed overview of all aspects concerning the sequence recon-
struction problem such as the biological aspects, the related combinatorial optimiza-
18
tion problems, the literature and the computer software available.
19
This is one of the most recent papers in a series of papers which subsequently im-
proved the approximation factor for the shortest common superstring problem. Here,
an approximation factor of 2:75 is obtained by using the structure of strings with large
amounts of overlap. The algorithm runs in time O(jS j + k3), where k is the number
of given strings and S is the sum of the lengths of all the strings.
T. Jiang, Z. Jiang, and D. Breslauer. Rotation of periodic strings and short
superstrings. Technical Report, Max-Planck-Institut f. Informatik, Saarbrucken,
Germany, May 1996.
Recently, Jiang, Jiang and Breslauer suggested an approximation algorithm for the
shortest common superstring with approximation factor 2:667 (2:596 in the near fu-
ture). Their algorithm is simpler than the previous approximation algorithms that
lead to a factor less than three.
Most of the above authors conjecture the existence of an approximation algorithm
of factor two, and that the greedy algorithm itself is a candidate for it, because no
example is known where the greedy algorithm does worse.
20
1994.
This article gives a survey of the state of the art in sequencing by hybridization
through 1994.
Recently, modi cations of sequencing by hybridization have been proposed to re-
duce ambiguities in sequence reconstruction:
S. Hannenhalli, W. Feldman, H. F. Lewis, S. S. Skiena, and P. A. Pevzner. Positional
sequencing by hybridization. CABIOS, 12(1):19{24, 1996.
The authors consider the case where additional information about the approximate
position of each l-tuple in the unknown DNA fragment is given. No polynomial algo-
rithms for the Positional Sequence Reconstruction Problem are known. A special case
is the Bounded Positional SBH Reconstruction Problem, where the range of positions
for each l-tuple is bounded by a small xed number. For this case the authors present
two polynomial time algorithms.
R. M. Idury and M. S. Waterman. A new algorithm for DNA sequence assembly.
J. Comput. Biol., 2(2):291{306, 1995.
This article suggests the application of the SBH-related computational techniques
to fragment assembly.
21
when the enzyme sites are modelled by a random process.
W. Schmitt and M. S. Waterman. Multiple solutions of DNA restriction mapping
problem. Adv. Appl. Math., 12:412{427, 1991.
The authors suggest partitioning the entire set of maps into equivalence classes. But
still the number of equivalence classes grows very fast, as observed by the following
author.
P. A. Pevzner. DNA physical mapping and alternating eulerian cycles in colored
graphs. Algorithmica, 13(1-2):77{105, 1995.
The author studies the combinatorics of multiple solutions to the double digest
problem and shows that the solutions are closely associated with alternating Eulerian
cycles in colored graphs. Furthermore he gives a complete characterization of equiva-
lent maps as introduced by Schmitt and Waterman.
S. Ho, L. Allison, and C. N. Yee. Restriction site mapping for three or more enzymes.
Comp. Appl. Biosci., 6:195{204, 1990.
The authors generalize the double digest problem to more than two enzymes.
G. Zehetner, A. Frischauf, and H. Lehrach. Approaches to restriction map
determination, pages 147{164. IRL Press, Oxford, 1988. M. J. Bishop and C. J.
Rawlings (eds.), Nucl. Acid and Protein Sequence Analysis, Practical Approaches.
This article gives a survey on commonly used methods for attacking this problem.
In the Partial Digest Problem only a single enzyme, say A, is used to cleave sev-
eral clones of a segment. Let a1 < a2 < < ak be the set of the restriction sites.
In contrast to a double digest experiment, the digestion process is stopped earlier,
such that not all of the restriction sites are cut. Again, the task is to reconstruct
the restriction sites from the given fragment length data. The Probed Partial Digest
Problem is similar, except that a site p on the DNA segment is labeled, and the sizes
of only those restriction fragments are measured that contain this site. Both problems
can be reduced to polynomial factorization [7], which is \unlikely to be NP-complete"
([Karp], see above).
Like in the double digest problem the number of solutions can grow fast for these
problems. The following two papers deal with this:
S. S. Skiena, W. D. Smith, and P. Lemke. Reconstructing sets from interpoint
distances. In Proc. 6-th Ann. Symp. Computational Geometry, pages 332{339, 1990.
L. A. Newberg and D. Naor. A lower bound on the number of solutions to the exact
probed partial digest problem. Adv. Appl. Math., 14:172{185, 1993.
The authors show how the degree of ambiguity can grow as a function of the number
of restriction sites.
22
5.4 Mapping Using Hybridization Data
For large pieces of DNA digest mapping has mostly been superseded by mapping
techniques that are based on experimental determination of overlap between fragments
(also called clones). This overlap data can be represented as a graph in which vertices
are clones and there is an edge between two vertices if the corresponding clones overlap.
In the absence of error, this graph is an interval graph [6].
Several experimental methods to obtain the required overlap information are in use.
One approach detects certain, very speci c sequence segments in the fragments. This
is done using probes hybridizing to speci c bits of the sequence. The hope is that
two clones share a probe if and only if they overlap. If a probe is absolutely unique
it is called a Sequence Tagged Site (STS). Ideally, the matrix describing which clones
contain such a probe will have the consecutive ones property [6]. Hence, the correct
orderings can be found very easily using the PQ-tree data structure [2] to generate
the set of all arrangements of probes consistent with the data. Alternatively, there is
another experiment that allows determination of overlap between two clones directly.
However, both techniques are error prone. In practice, false positive overlaps will be
reported and overlaps may not be recognized, i.e., there are false negatives. A further
complication is that sometimes two clones are merged into one. This phenomenon is
called chimerism.
M. S. Waterman and J. R. Griggs. Interval graphs and maps of DNA. Bull. Math. Biol.,
48(2):189{195, 1986.
study a model that combines digest mapping with information on overlap between
the fragments. They introduced the representation of maps as interval graphs [6] and
characterize the interval graphs arising in the speci c experimental setup modeled in
this paper. A simple linear time algorithm for recognizing and representing the data
in interval representation is given.
Since in practice the overlap data is error prone the question arises if there exists
an interval graph that is in some sense \close" to the given overlap data.
M. C. Golumbic, H. Kaplan, and R. Shamir. On the complexity of DNA physical
mapping. Adv. Appl. Math., 15:251{261, 1994.
Given some \noisy" and some correct part of the overlap data, the authors show
that the following problem is NP-complete. Is there an interval graph induced by E1
satisfying E0 E1 E2 for given edge sets E0 and E2 (E0 E2) (Interval Sandwich
Problem).
H. Kaplan, R. Shamir, and R. E. Tarjan. Tractability of parameterized completion
problems on chordal and interval graphs: Minimum ll-in and physical mapping. In
Proc. 35-th IEEE Symp. Found. Comp. Sci., pages 780{791, 1994.
First, the authors consider the case of unidenti ed overlaps. Although the problem
of building a map with fewest errors is NP-hard (Proper Interval Graph Completion
Problem), the authors present a linear time algorithm which gives an augmenting set
with no more than k edges (k xed) if one exists. Observing that the arising interval
23
graphs have small clique size, the authors use this fact to present a polynomial time
algorithm for the proper interval graph completion problem with bounded clique size.
M. R. Fellows, M. T. Hallett, and W. T. Wareham. DNA physical mapping: Three
ways dicult. In Proc. European Symp. Algorithms, Lecture Notes Comp. Sci. 726,
pages 157{168, Berlin, 1993. Springer-Verlag.
consider the case of digest mapping when k enzymes are used and noisy overlap
data between the fragments are given. This is a generalization of the model used by
Waterman and Griggs (1986). They study the following problem: Given a graph G
and a coloring c of the vertices to k colors, is there a supergraph of G which is prop-
erly colored by c and which is an interval graph? It is shown that this problem is
NP-complete, and is not xed-parameter tractable, i.e., it cannot be solved in time
f (k)n , where is independent of k (unless an apparently resistant problem can be
solved).
The following articles give algorithms for constructing maps using STS probes.
E. D. Green and P. Greene. Sequence-tagged site (STS) content mapping of human
chromosomes: Theoretical considerations and early experiences. PCR Methods and
Appl., pages 77{90, 1991.
In this article some background information on STS mapping is given. Furthermore,
the authors discuss a strategy for developing clone-based STS maps of chromosomes.
D. S. Greenberg and S. Istrail. Physical mapping by STS hybridization: Algorithmic
strategy and the challenge of software evaluation. J. Comput. Biol., 2(2):219{273,
1995.
develop some algorithmic theory of the mapping process, and propose a performance
evaluation procedure. Furthermore, they suggest various combinatorial optimization
problems such as the hamming distance travelling salesman problem that could be
useful for solving the practical mapping problem.
F. Alizadeh, R. M. Karp, D. K. Weisser, and G. Zweig. Physical mapping of
chromosomes using unique probes. J. Comput. Biol., 2(2):159{184, 1995.
The authors present several combinatorial methods for reconstructing a DNA frag-
ment in the presence of errors. The methods include techniques for the hamming
distance travelling salesman problem, and simulated annealing as well as screening
methods for detecting errors in the given data.
Several experimenters have approached mapping using probes that do not satisfy
the uniqueness condition.
F. Alizadeh, R. M. Karp, L. A. Newberg, and D. K. Weisser. Physical mapping
of chromosomes: A combinatorial problem in molecular biology. Algorithmica, 13(1-
2):52{76, 1995.
present the rst result that e ectively addresses the ordering problem for map-
ping with hybridization ngerprints and non-unique probes. The authors introduce
24
approximations to a likelihood function quantifying overlap information. This leads
to optimization problems that are reasonably tractable in practice, although they are
NP-hard.
R. Mott, A. Grigoriev, J. H. E. Maier, and H. Lehrach. Algorithms and software
tools for ordering clone libraries: application to the mapping of the genome of
schizosaccharomyces pombe. Nucl. Acids Res., 21(8):1965{1974, 1993.
The authors describe a complete set of software tools for mapping of a genome that
has been successfully applied to the genome of ssion yeast.
6.1 Threading
The Threading Problem is a generalization of the pairwise sequence alignment problem
where for one of the two proteins a three-dimensional structure is given. An alignment
thus implies spatial proximity between certain amino acids which is in turn weighted
by a so-called pair-potential. The resulting optimization problem attempts to nd an
alignment that minimizes the sum over all implied pair-potentials.
R. H. Lathrop. The protein threading problem with sequence amino acid interaction
preferences is NP-complete. Protein Engineering, 7:1059{1068, 1994.
has shown that the Threading Problem is NP-complete (by reduction to One-in-
Three 3SAT).
A branch-and-bound algorithm for a slightly modi ed version of threading is given
in
R. H. Lathrop and T. F. Smith. Global optimum protein threading with gapped
alignment and empirical pair score functions. J. Mol. Biol., 255:641{665, 1996.
Other heuristic approaches to threading are summarized in
M. Sippl. Knowledge-based potentials for proteins. Current Opinion in Structural
Biology, 5:229{235, 1995.
25
R. Unger and J. Moult. Finding the lowest free energy conformation of a protein is an
NP-hard problem: Proof and implications. Bull. Math. Biol., 55:1183 { 1198, 1993.
A. S. Fraenkel. Complexity of protein folding. Bull. Math. Biol., 55:1199 { 1210, 1993.
give NP-completeness proofs for discretized versions of protein folding.
Algorithms for nding minimal energy structures for lattice models in proteins are
summarized and tested in
K. Yue, M. Fiebig, P. D. Thomas, H. S. Chan, E. I. Shakhnovich, and K. A. Dill. A
test of lattice protein folding algorithms. Proc. Natl. Acad. Sci. USA, 92:325 { 329,
1995.
W. E. Hart and S. Istrail. Fast protein folding in the hydrphobic-hydrophilic model
within three{eights of optimal. In Proc. 27-th Annual ACM Symp. Theory of Comput.,
pages 157{168, 1995.
give a factor 83 approximation algorithm for lattice models.
References
[1] Berman and Ramaiyer. Improved approximations for the steiner tree problem. J.
Algorithms, 17:381{408, 1994.
[2] K. S. Booth and G. S. Lueker. Testing for consecutive ones property, interval
graphs and planarity using PQ-tree algorithms. J. Computer Syst. Sci., 13:335{
379, 1976.
[3] R. O. Duda and P. E. Hart. Pattern Classi cation and Scene Analysis. John Wiley
& Sons, 1973.
[4] J. Gallant, D. Maier, and J. Storer. On nding minimal length superstrings.
J. Computer Syst. Sci., 20:50{58, 1980.
[5] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman, 1979.
[6] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press,
1980.
[7] J. Rosenblatt and P. D. Seymour. The structure of homometric sets. SIAM
J. Algebraic Discr. Methods, 3:343{350, 1982.
[8] A. Zelikovsky. The 11/6 approximation algorithm for the steiner problem on
networks. Algorithmica, 9:463{470, 1993.
26