Algorithms in Bioinformatics: Dan Brown Burkhard Morgenstern
Lecture Notes in Bioinformatics (LNBI) 8701

Algorithms in Bioinformatics
14th International Workshop, WABI 2014
Wroclaw, Poland, September 8–10, 2014
Proceedings
Volume Editors
Dan Brown
University of Waterloo
David R. Cheriton School of Computer Science
Waterloo, ON, Canada
E-mail: dan.brown@uwaterloo.ca
Burkhard Morgenstern
University of Göttingen
Institute of Microbiology and Genetics
Department of Bioinformatics
Göttingen, Germany
E-mail: bmorgen@gwdg.de
Program Chairs
Dan Brown University of Waterloo, Canada
Burkhard Morgenstern University of Göttingen, Germany
Program Committee
Mohamed Abouelhoda Cairo University, Egypt
Tatsuya Akutsu Kyoto University, Japan
Anne Bergeron Université du Québec à Montréal, Canada
Sebastian Böcker Friedrich Schiller University Jena, Germany
Paola Bonizzoni Università di Milano-Bicocca, Italy
Marilia Braga Inmetro - Ditel, Brazil
Broňa Brejová Comenius University in Bratislava, Slovakia
C. Titus Brown Michigan State University, USA
Philipp Bucher Swiss Institute for Experimental Cancer Research, Switzerland
Rita Casadio University of Bologna, Italy
Cedric Chauve Simon Fraser University, Canada
Matteo Comin University of Padova, Italy
Lenore Cowen Tufts University, USA
Keith Crandall George Washington University, USA
Nadia El-Mabrouk University of Montreal, Canada
David Fernández-Baca Iowa State University, USA
Anna Gambin Institute of Informatics, Warsaw University, Poland
Olivier Gascuel LIRMM, CNRS - Université Montpellier 2, France
Raffaele Giancarlo Università di Palermo, Italy
Nicholas Hamilton The University of Queensland, Australia
Barbara Holland University of Tasmania, Australia
Katharina Huber University of East Anglia, UK
Steven Kelk University of Maastricht, The Netherlands
Carl Kingsford Carnegie Mellon University, USA
Gregory Kucherov CNRS/LIGM, France
Zsuzsanna Lipták University of Verona, Italy
Stefano Lonardi UC Riverside, USA
Veli Mäkinen University of Helsinki, Finland
Ion Mandoiu University of Connecticut, USA
Additional Reviewers
Claudia Landi Gunnar W. Klau
Pedro Real Jurado Annelyse Thévenin
Yuri Pirola João Paulo Pereira Zanetti
James Cussens Andrea Farruggia
Seyed Hamid Mirebrahim Faraz Hach
Mateusz Łącki Alberto Policriti
Anton Polishko Sławomir Lasota
Giosue Lo Bosco Heng Li
Christian Komusiewicz Daniel Doerr
Piotr Dittwald Leo van Iersel
Giovanna Rosone Thu-Hien To
Flavio Mignone Rayan Chikhi
Marco Cornolti Hind Alhakami
Fábio Henrique Viduani Martinez Ermin Hodzic
Yu Lin Guillaume Holley
Salem Malikic Ibrahim Numanagic
Clustering of Reads with Quality Values

1 Introduction
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 1–13, 2014.
c Springer-Verlag Berlin Heidelberg 2014
2 M. Comin, A. Leoni, and M. Schimd
Alignment-based methods have been used for quite some time to establish similarity between sequences [3]. However, there are cases where alignment methods cannot be applied or are not well suited. For example, the comparison of whole genomes is impossible to conduct with traditional alignment techniques, because of events like rearrangements that cannot be captured by an alignment [4–6]. Although fast alignment heuristics exist, another drawback is that alignment methods are usually time consuming, and thus not suited to the large-scale sequence data produced by Next-Generation Sequencing (NGS) technologies [7, 8]. For these reasons, a number of alignment-free techniques have been proposed over the years [9].
The use of alignment-free methods for comparing sequences has proved useful in different applications. Some alignment-free measures use pattern distributions to study evolutionary relationships among different organisms [4, 10, 11]. Several alignment-free methods have been devised for the detection of enhancers in ChIP-Seq data [12–14], and also for entropic profiles [15, 16]. Another application is the classification of remotely related proteins, which can be addressed with sophisticated word-counting procedures [17, 18]. The assembly-free comparison of genomes based on NGS reads has been investigated only recently [7, 8]. For a comprehensive review of alignment-free measures and applications we refer the reader to [9].
In this study we explore the ability of alignment-free measures to cluster read data. Clustering techniques are widely used in many applications based on NGS data, from error correction [19] to the discovery of groups of microRNAs [20]. With the increasing throughput of NGS technologies, another important aspect is the reduction of data complexity: collapsing redundant reads into a single cluster can improve the running time, memory requirements, and quality of subsequent steps such as assembly.
In [21], Solovyov et al. presented one of the first comparisons of alignment-free measures applied to the clustering of NGS reads. They focused on clustering reads coming from different genes and different species based on k-mer counts. They showed that D-type measures (see Section 2), in particular $D_2^*$, can efficiently detect and cluster reads from the same gene or species (as opposed to [20], where the clustering is focused on errors). In this paper we extend this study by incorporating quality value information into these measures.
Quality scores produced by NGS platforms are fundamental for various analyses of NGS data: mapping reads to a reference genome [22], error correction [19], detection of insertions and deletions [23], and many others. Moreover, future-generation sequencing technologies will produce longer and less biased reads with a large number of erroneous bases [24]. The average number of errors per read is expected to grow up to 15%; it will therefore be fundamental to exploit quality value information within the alignment-free framework, as well as in de novo assembly, where longer and less biased reads could have a dramatic impact.
In the following section we briefly review some alignment-free measures. In Section 3 we present a new family of statistics, called Dq-type, that takes advantage of quality values. The software QCluster is discussed in Section 4, and relevant results on simulated and real data are presented in Section 5. In Section 6 we summarize the findings and discuss future directions of investigation.
This is the inner product of the word vectors $X_w$ and $Y_w$, each one representing the number of occurrences of words of length k, i.e. k-mers, in the two sequences:

$$D_2 = \sum_{w \in \Sigma^k} X_w Y_w.$$

However, it was shown by Lippert et al. [26] that the $D_2$ statistic can be biased by the stochastic noise in each sequence. To address this issue another popular statistic, called $D_2^z$, was introduced in [13]. This measure standardizes $D_2$ in the following manner:

$$D_2^z = \frac{D_2 - \mathrm{E}(D_2)}{\mathrm{V}(D_2)},$$
where $\mathrm{E}(D_2)$ and $\mathrm{V}(D_2)$ are the expectation and the standard deviation of $D_2$, respectively. Although the $D_2^z$ similarity improves on $D_2$, it is still dominated by the specific variation of each pattern from the background [27, 28]. To account for different distributions of the k-mers, two other statistics, named $D_2^*$ and $D_2^s$, are defined in [27] and [28]. Let $\tilde{X}_w = X_w - (n-k+1)\,p_w$ and $\tilde{Y}_w = Y_w - (n-k+1)\,p_w$, where $p_w$ is the probability of $w$ under the null model. Then $D_2^*$ and $D_2^s$ can be defined as follows:

$$D_2^* = \sum_{w \in \Sigma^k} \frac{\tilde{X}_w \tilde{Y}_w}{(n-k+1)\,p_w}$$

and

$$D_2^s = \sum_{w \in \Sigma^k} \frac{\tilde{X}_w \tilde{Y}_w}{\sqrt{\tilde{X}_w^2 + \tilde{Y}_w^2}}.$$
This latter similarity measure responds to the need to normalize $D_2$. This set of alignment-free measures is usually called the D-type statistics. All these statistics have been studied by Reinert et al. [27] and Wan et al. [28] for the detection of regulatory sequences. From the word vectors $X_w$ and $Y_w$ several other measures can be computed, such as $L_2$, the Kullback-Leibler divergence (KL), the symmetrized KL divergence [21], etc.
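As an illustration, the D-type statistics above can be computed directly from k-mer counts. The following Python sketch (our own, not the authors' implementation; the null-model probabilities $p_w$ are assumed to be supplied, e.g. uniform) computes $D_2$, $D_2^*$ and $D_2^s$ for two equal-length sequences:

```python
import math
from itertools import product

def kmer_counts(seq, k):
    """Count occurrences of every k-mer over the DNA alphabet."""
    counts = {"".join(w): 0 for w in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

def d2_statistics(x, y, k, p):
    """Compute D2, D2* and D2s for sequences x and y of equal length n,
    given null-model k-mer probabilities p[w]."""
    n = len(x)
    X, Y = kmer_counts(x, k), kmer_counts(y, k)
    d2 = sum(X[w] * Y[w] for w in X)
    d2star = d2s = 0.0
    for w in X:
        xt = X[w] - (n - k + 1) * p[w]   # centered count X~_w
        yt = Y[w] - (n - k + 1) * p[w]   # centered count Y~_w
        d2star += xt * yt / ((n - k + 1) * p[w])
        denom = math.sqrt(xt * xt + yt * yt)
        if denom > 0:
            d2s += xt * yt / denom
    return d2, d2star, d2s
```

With a uniform null model one would pass `p[w] = 4 ** (-k)` for every word w.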
Upon producing base calls for a read x, sequencing machines also assign a quality score $Q_x(i)$ to each base in the read. These scores are usually given as the phred-scaled probability [29] of the i-th base being wrong. For example, if $Q_x(i) = 30$ then there is a 1 in 1000 chance that base i of read x is incorrect. If we assume that quality values are produced independently of each other (similarly to [22]), we can calculate the probability of an entire read x being correct as:

$$P_x\{\text{the read } x \text{ is correct}\} = \prod_{j=0}^{n-1} \left(1 - 10^{-Q_x(j)/10}\right),$$

where n is the length of the read x. In the same way we define the probability of a word w of length k, occurring at position i of read x, being correct as:

$$P_{w,i}\{\text{the word } w \text{ at position } i \text{ of read } x \text{ is correct}\} = \prod_{j=0}^{k-1} \left(1 - 10^{-Q_x(i+j)/10}\right).$$
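As a small illustration (a sketch, not part of the original software; function and variable names are ours), these formulas translate directly into code:

```python
def phred_from_ascii(qual_string, offset=33):
    """Decode a FASTQ quality string (Sanger/Illumina 1.8+ encoding)
    into a list of phred scores."""
    return [ord(c) - offset for c in qual_string]

def word_correct_probability(quals, i, k):
    """Probability P_{w,i} that the k-mer starting at position i of a
    read is error-free, given the read's phred scores Q_x."""
    p = 1.0
    for j in range(k):
        # each base is correct with probability 1 - 10^(-Q/10)
        p *= 1.0 - 10.0 ** (-quals[i + j] / 10.0)
    return p
```

For instance, a run of three bases with quality 30 yields $(1 - 10^{-3})^3 \approx 0.997$.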
In all previous alignment-free statistics the k-mers are counted so that each occurrence contributes 1, irrespective of its quality. Here we can instead use the quality of each occurrence, to also account for erroneous k-mers. The idea is to model sequencing as the process of reading k-mers from the reference and assigning a probability to them. Thus the probability $P_{w,i}$ can be used to weight the occurrences of all k-mers used in the previous statistics.
We extend here the D-type statistics [27, 28] to account for quality values. We define $X_w^q$ as the sum of the probabilities of all the occurrences of w in x:

$$X_w^q = \sum_{i \,:\, w \text{ occurs in } x \text{ at position } i} P_{w,i}$$

and the centered counts as

$$\tilde{X}_w^q = X_w^q - (n-k+1)\, p_w\, \mathrm{E}(P_w).$$

Then the quality-aware statistics are:

$$D_2^{*q} = \sum_{w \in \Sigma^k} \frac{\tilde{X}_w^q \tilde{Y}_w^q}{(n-k+1)\, p_w\, \mathrm{E}(P_w)}$$

and

$$D_2^{sq} = \sum_{w \in \Sigma^k} \frac{\tilde{X}_w^q \tilde{Y}_w^q}{\sqrt{(\tilde{X}_w^q)^2 + (\tilde{Y}_w^q)^2}}.$$
We call these alignment-free measures Dq-type. Now, $\mathrm{E}(P_w)$ depends on w and on the actual sequencing machine, and it can therefore be very hard, if not impossible, to calculate precisely. However, if the set D of all the reads is large enough, we can estimate the prior probability using the posterior relative frequency, i.e. the frequency observed on the actual set D, similarly to [22]. We assume that, given the quality values, the error probability of a base is independent of its position within the read and of all other quality values (see [22]). We define two different approximations. The first estimates $\mathrm{E}(P_w)$ as the average error-free probability of the k-mer w among all reads $x \in D$:

$$\mathrm{E}(P_w) \approx \frac{\sum_{x \in D} X_w^q}{\sum_{x \in D} X_w} \qquad (1)$$

while the second defines, for each base j of w, the average quality observed over all occurrences of w in D:

$$\bar{Q}_w[j] = \frac{\sum_{x \in D} \sum_{i \,:\, w \text{ occurs in } x \text{ at position } i} Q_x(i+j)}{\sum_{x \in D} X_w}$$

and uses these average quality values to compute the expected word probability:

$$\mathrm{E}(P_w) \approx \prod_{j=0}^{k-1} \left(1 - 10^{-\bar{Q}_w[j]/10}\right). \qquad (2)$$
We call the first approximation Average Word Probability (AWP) and the second one Average Quality Probability (AQP). Both approximations are implemented in the software QCluster and will be tested in Section 5.
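A minimal sketch of the AWP estimate, formula (1); reads are assumed to be given as (sequence, phred-score list) pairs, and the function name is ours:

```python
def estimate_expected_word_probability(reads, w):
    """Average Word Probability (AWP): estimate E(P_w) as the ratio
    between the total quality-weighted count X^q_w and the plain
    count X_w of w, summed over all reads in the dataset D."""
    k = len(w)
    total_q = 0.0   # sum over reads of X^q_w
    total = 0       # sum over reads of X_w
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] == w:
                # probability that this occurrence is error-free
                p = 1.0
                for j in range(k):
                    p *= 1.0 - 10.0 ** (-quals[i + j] / 10.0)
                total_q += p
                total += 1
    return total_q / total if total else 0.0
```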
All the described algorithms are implemented in the software QCluster. The program takes as input a file in FASTQ format and performs centroid-based clustering (k-means) of the reads, based on the counts and qualities of k-mers. The software supports KL divergence as well as other distances such as $L_2$ (Euclidean), $D_2$, $D_2^*$, the symmetrized KL divergence, etc. When using the Dq-type measures, one needs to choose the method for the computation of the expected word probability, AWP or AQP, and the quality redistribution.
Since some of the implemented distances (symmetrized KL, $D_2^*$) are not guaranteed to converge, we implemented a stopping criterion: the execution of the algorithm stops if the number of iterations without improvement exceeds a certain threshold, and the best solution found so far is returned. The maximum number of such iterations may be set by the user; for our experiments we use the value 5. Several other options, such as reverse complements and different normalizations, are available. All implemented measures can be computed in linear time and space, which is desirable for large NGS datasets. The QCluster¹ software has been implemented in C++ and compiled and tested using GNU GCC.
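The clustering loop with its no-improvement stopping rule can be sketched as follows (a simplified stand-in for QCluster, using squared Euclidean distance rather than the full set of distances):

```python
import random

def kmeans(vectors, n_clusters, max_no_improve=5, seed=0):
    """Simplified centroid-based clustering of k-mer count vectors with
    squared-L2 distance, stopping after max_no_improve iterations
    without improvement of the total distortion."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    best = float("inf")
    stalled = 0
    assign = [0] * len(vectors)
    while stalled < max_no_improve:
        # assignment step: move each vector to its closest centroid
        distortion = 0.0
        for idx, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            assign[idx] = dists.index(min(dists))
            distortion += min(dists)
        # update step: recompute centroids as cluster means
        for c in range(n_clusters):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if distortion < best - 1e-12:
            best = distortion
            stalled = 0
        else:
            stalled += 1
    return assign
```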
5 Experimental Results
Several tests were performed in order to estimate the effectiveness of the different distances, on both simulated and real datasets. In particular, we wanted to verify that, with the use of the additional information of quality values, the clustering improves compared to that produced by the original algorithms.
For simulations we use the dataset of human mRNA genes downloaded from NCBI², also used in [21]. We randomly selected 50 sets of 100 human mRNA sequences each, with the length of each sequence ranging between 500 and 10000 bases. From each sequence, 10000 reads of length 200 were simulated using Mason³ [30] with different parameters, e.g. percentage of mismatches and read length. We apply QCluster, using different distances, to the whole set of reads and then measure the quality of the clusters produced by evaluating the extent to which the partitioning agrees with the natural splitting of the sequences. In other words, we measure how well reads originating from the same sequence are grouped together. We calculate the recall rate as follows: for each mRNA sequence S, we identify the set of reads originating from S and look for the cluster C that contains most of the reads of S. The percentage of the reads of S that have been grouped in C is the recall value for the sequence S. We repeat the same operation for each sequence and report the average recall rate over all sequences.
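The recall computation just described can be sketched as follows (names are ours; each read carries its source sequence and its assigned cluster):

```python
from collections import Counter

def average_recall(read_sources, read_clusters):
    """Recall rate: for each source sequence S, find the cluster C that
    contains most of S's reads; recall(S) is the fraction of S's reads
    that fall in C. Return the average over all sources."""
    per_source = {}
    for src, cl in zip(read_sources, read_clusters):
        per_source.setdefault(src, []).append(cl)
    recalls = []
    for src, clusters in per_source.items():
        # size of the largest cluster among this source's reads
        _, biggest = Counter(clusters).most_common(1)[0]
        recalls.append(biggest / len(clusters))
    return sum(recalls) / len(recalls)
```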
Several clusterings were produced using the following distance types: $D_2^*$, $D_2$, $L_2$, KL, and symmetrized KL, and compared with $D_2^{*q}$ in all its variants, using the expectation formula (1) AWP or (2) AQP, with and without quality redistribution (q-red). In order to avoid, as much as possible, biases due to the initial random generation of centroids, each algorithm was executed 5 times with different random seeds, and the clustering with the lowest distortion was chosen.
Table 1 reports the recall while varying the error rate, the number of clusters and the parameter k. For all distances the recall rate decreases with the number of clusters, as expected. Among the traditional distances, if the reads do not contain errors then $D_2^*$ performs consistently better than the others ($D_2$, $L_2$, KL). When the sequencing process becomes noisier, the KL distance appears to be less sensitive to sequencing errors. However, if quality information is used, $D_2^{*q}$ outperforms all other methods, and the advantage grows with the error rate. This confirms that the use of quality values can improve clustering accuracy. When the number of clusters increases, the advantage of $D_2^{*q}$ becomes more evident. In these experiments the use of AQP for the expectation within $D_2^{*q}$ is more stable and performs better than formula AWP. The contribution of
¹ http://www.dei.unipd.it/~ciompin/main/qcluster.html
² ftp://ftp.ncbi.nlm.nih.gov/refseq/H-sapiens/mRNA-Prot/
³ http://seqan.de/projects/mason.html
Table 1. Recall rates of clustering of mRNA simulated reads (10000 reads of length
200) for different measures, error rates, number of clusters and parameter k
Table 2. Recall rates for clustering of mRNA simulated reads (10000 reads, k = 3, 4
clusters) for different measures, error rates and read length
as input the read dataset SRR017901 (454 technology), with 23.5 Mbases corresponding to 10× coverage. We apply the clustering algorithms, with k = 3, and divide the dataset of reads into two clusters. Then we produce an assembly, as a set of contigs, for each cluster using Velvet, and merge the generated contigs. In order to evaluate the clustering quality, we compare this merged set with the assembly obtained, without clustering, from the whole set of reads. Commonly used metrics such as the number of contigs, N50 and the percentage of mapped contigs are presented in Table 3. When merging contigs from different clusters, some contigs might be very similar or cover the same region of the genome, which can artificially increase these values. Thus we also compute a less biased measure: the percentage of the genome that is covered by the contigs (last column).
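Genome coverage can be computed from the contigs' alignment intervals by taking the length of their union (a sketch with our own naming; interval ends are assumed exclusive, so overlapping contigs are counted once):

```python
def genome_coverage(intervals, genome_length):
    """Fraction of the genome covered by mapped contigs, given their
    (start, end) alignment intervals with exclusive ends."""
    covered = 0
    last_end = 0
    for start, end in sorted(intervals):
        if end > last_end:
            # add only the part not already covered
            covered += end - max(start, last_end)
            last_end = end
    return covered / genome_length
```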
6 Conclusions
The comparison of reads with quality values is essential in many genome projects. The importance of quality values will increase in the near future with the advent of future sequencing technologies, which promise to produce long reads, but with up to 15% errors. In this paper we presented a family of alignment-free measures, called Dq-type, that incorporate quality value information and k-mer counts for the comparison of read data. A set of experiments on simulated and real read data confirms that the new measures are superior to the classical alignment-free statistics, especially when erroneous reads are considered. If quality information is used, $D_2^{*q}$ outperforms all other methods, and the advantage grows with the error rate and with the length of the reads. This confirms that the use of quality values can improve clustering accuracy.
Preliminary experiments on real read data show that the quality of assembly can also improve when clustering is used as a preprocessing step. All these measures are implemented in a software called QCluster. As future work we plan to explore other applications, such as genome diversity estimation and metagenome assembly, in which the impact of read clustering might be substantial.
References
1. Medini, D., Serruto, D., Parkhill, J., Relman, D., Donati, C., Moxon, R., Falkow, S., Rappuoli, R.: Microbiology in the post-genomic era. Nature Reviews Microbiology 6, 419–430 (2008)
2. Jothi, R., et al.: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 36, 5221–5231 (2008)
3. Altschul, S., Gish, W., Miller, W., Myers, E.W., Lipman, D.: Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410 (1990)
4. Sims, G.E., Jun, S.-R., Wu, G.A., Kim, S.-H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. PNAS 106(8), 2677–2682 (2009)
5. Comin, M., Verzotto, D.: Whole-genome phylogeny by virtue of unic subwords. In: Proc. 23rd Int. Workshop on Database and Expert Systems Applications (DEXA-BIOKDD 2012), pp. 190–194 (2012)
6. Comin, M., Verzotto, D.: Alignment-free phylogeny of whole genomes using underlying subwords. BMC Algorithms for Molecular Biology 7(34) (2012)
7. Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M., Sun, F.: Alignment-free sequence comparison based on next-generation sequencing reads. Journal of Computational Biology 20(2), 64–79 (2013)
8. Comin, M., Schimd, M.: Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns. Accepted at RECOMB-SEQ 2014: 4th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing. Proceedings to appear in BMC Bioinformatics (2014)
28. Wan, L., Reinert, G., Chew, D., Sun, F., Waterman, M.S.: Alignment-free sequence comparison (II): theoretical power of comparison statistics. Journal of Computational Biology 17(11), 1467–1490 (2010)
29. Ewing, B., Green, P.: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8(3), 186–194 (1998)
30. Holtgrewe, M.: Mason - a read simulator for second generation sequencing data. Technical Report, FU Berlin (2010)
31. Birney, E.: Assemblies: the good, the bad, the ugly. Nature Methods 8, 59–60 (2011)
32. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18, 821–829 (2008)
Improved Approximation for the Maximum Duo-Preservation String Mapping Problem*
1 Introduction
String comparison is a central problem in stringology, with a wide range of applications including data compression and bioinformatics. There are various ways to measure the similarity of two strings: one may use the Hamming distance, which counts the number of positions at which the corresponding symbols differ, the Jaro-Winkler distance, the overlap coefficient, etc. However, in computer science the most common measure is the so-called edit distance, which measures the minimum number of edit operations that must be performed to transform the first string into the second. In biology, this number may provide some measure of the kinship between different species based on the similarities of their DNA. In data compression, it may help to store efficiently a set of similar
yet different data (e.g. different versions of the same object) by storing only one "base" element of the set, and then storing the series of edit operations that result in the other versions of the base element.

* Research supported by the Swiss National Science Foundation project 200020_144491/1 "Approximation Algorithms for Machine Scheduling Through Theory and Experiments", and by the Sciex-Project 12.311.

D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 14–25, 2014.
© Springer-Verlag Berlin Heidelberg 2014
The definition of the edit distance depends on the set of edit operations that are allowed. When the only allowed edit operation is to shift a block of characters, the edit distance can be measured by solving the min common string partition problem.
The min common string partition (MCSP) is a fundamental problem in the field of string comparison [7,13], and can be applied more specifically to genome rearrangement issues, as shown in [7]. Consider two strings A and B, both of length n, such that B is a permutation of A. Also, let $P_A$ denote a partition of A, that is, a set of substrings whose concatenation results in A. The MCSP problem, introduced in [13] and [19], asks for partitions $P_A$ of A and $P_B$ of B of minimum cardinality such that $P_A$ is a permutation of $P_B$. k-MCSP denotes the restricted version of the problem where each letter has at most k occurrences. This problem is NP-hard and even APX-hard, even when the number of occurrences of each letter is at most 2 (note that the problem is trivial when this number is at most 1) [13]. Since then, the problem has been intensively studied, especially in terms of polynomial approximation [7,8,9,13,15,16], but also of parameterized computation [4,17,10,14]. The best approximations known so far are an O(log n log* n)-approximation for the general version of the problem [9], and an O(k)-approximation for k-MCSP [16]. On the other hand, the problem was proved to be fixed-parameter tractable (FPT), first with respect to both k and the cardinality φ of an optimal partition [4,10,14], and more recently with respect to φ only [17].
In [6], the maximization version of the problem is introduced and denoted by max duo-preservation string mapping (MPSM). Recalling that a duo denotes a pair of consecutive letters, it is clear that when a solution $(P_A, P_B)$ for min common string partition partitions A and B into φ substrings, this solution can be translated into a mapping π from A to B that preserves exactly n − φ duos. Hence, given two strings A and B, the MPSM problem asks for a mapping π from A to B that preserves a maximum number of duos (a formal definition is given in Subsection 3.1). An example is provided in Figure 1.
A: a b c b a c
B: b a b c a c
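For intuition, MPSM on such a tiny instance can be solved by brute force over all proper mappings (exponential time, for illustration only; the function is ours, not from the paper). On the example strings above, the optimum preserves 3 duos:

```python
from itertools import permutations

def max_preserved_duos(A, B):
    """Brute-force MPSM: try every proper mapping pi (a bijection with
    A[i] == B[pi(i)] for all i) and count preserved duos, i.e. the
    positions i where pi maps i, i+1 to consecutive positions in B."""
    n = len(A)
    best = 0
    for pi in permutations(range(n)):
        # proper: each character of A maps to an equal character of B
        if all(A[i] == B[pi[i]] for i in range(n)):
            preserved = sum(1 for i in range(n - 1) if pi[i + 1] == pi[i] + 1)
            best = max(best, preserved)
    return best
```

On A = abcbac and B = babcac this returns 3 (e.g. the duos ab, bc and ac can be preserved simultaneously).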
2 Hardness of Approximation
We will show that MPSM is APX-hard, which essentially rules out any polynomial-time approximation scheme unless P = NP. The result follows, with slight modifications, from the known approximation hardness result for MCSP. Indeed, in [13] it is shown that any instance of max independent set in a cubic graph (3-MIS) can be reduced to an instance of 2-MCSP (proof of Theorem 2.1 in [13]). We observe that the construction used in their reduction also works as a reduction from 3-MIS to 2-MPSM. In particular, given a cubic graph with n vertices and independence number α, the corresponding reduction to 2-MPSM has an optimum value of m = 4n + α.
Given a ρ-approximation to 2-MPSM, we will hence always find an independent set of size at least ρm − 4n. It is shown in [3] that it is NP-hard to approximate 3-MIS within 139/140 + ε for any ε > 0. Therefore, unless P = NP, for every ε > 0 there is an instance I of 3-MIS such that:

$$\frac{\mathrm{APP}_I}{\mathrm{OPT}_I} \le \frac{139}{140} + \varepsilon,$$

where $\mathrm{APP}_I$ is the solution produced by any polynomial-time approximation algorithm and $\mathrm{OPT}_I$ the optimum value of I. Substituting, we get:

$$\frac{\rho m - 4n}{m - 4n} \le \frac{139}{140} + \varepsilon.$$
3.1 Preliminaries
For i = 1, ..., n, we denote by $a_i$ the i-th character of string A, and by $b_i$ the i-th character of B. We also denote by $D^A = (D^A_1, \dots, D^A_{n-1})$ and $D^B = (D^B_1, \dots, D^B_{n-1})$ the sets of duos of A and B, respectively. For i = 1, ..., n − 1, $D^A_i$ corresponds to the duo $(a_i, a_{i+1})$, and $D^B_i$ corresponds to the duo $(b_i, b_{i+1})$.
A mapping π from A to B is said to be proper if it is bijective and if, for all i = 1, ..., n, $a_i = b_{\pi(i)}$. In other words, each letter of A must be mapped to the same letter in B for the mapping to be proper. A couple of duos $(D^A_i, D^B_j)$ is said to be preservable if $a_i = b_j$ and $a_{i+1} = b_{j+1}$. Given a mapping π, a preservable couple of duos $(D^A_i, D^B_j)$ is said to be preserved by π if π(i) = j and π(i + 1) = j + 1. Finally, two preservable couples of duos $(D^A_i, D^B_j)$ and $(D^A_h, D^B_l)$ will be called conflicting if there is no proper mapping that preserves both of them. These conflicts can be of two types; w.l.o.g., we suppose that i < h (resp. j < l):
σ(i) = j means that the duo $D^A_i$ is mapped to the duo $D^B_j$. Again, a duo-mapping σ is said to be proper if it is bijective and if $D^A_i = D^B_{\sigma(i)}$ for all duos mapped through σ. Note that a proper duo-mapping might map some conflicting couples of duos. Revisit the example of Figure 2(b): having σ(i) = j and σ(h) = l defines a proper duo-mapping that maps conflicting couples of duos. Notice, however, that a proper duo-mapping might generate conflicts of Type 2 only. We finally define the concept of an unconflicting duo-mapping, which is a proper duo-mapping that does not map any pair of conflicting duos.

Remark 1. An unconflicting duo-mapping σ on some subset of duos of size f(σ) immediately yields a proper mapping π on the whole set of characters with $f(\pi) \ge f(\sigma)$: it suffices to map the characters mapped by σ in the same way that σ does, and to map the remaining characters arbitrarily.
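Whether two preservable couples conflict can be tested mechanically: the four positional constraints they impose must form a consistent partial bijection. A small sketch (our own helper, not from the paper):

```python
def conflicting(c1, c2):
    """Given two preservable couples of duos, each written as a pair
    (i, j) meaning D^A_i is mapped to D^B_j, decide whether they
    conflict: the constraints pi(i)=j, pi(i+1)=j+1, pi(h)=l,
    pi(h+1)=l+1 must be consistent as a partial bijection."""
    (i, j), (h, l) = c1, c2
    forward, backward = {}, {}
    for a, b in [(i, j), (i + 1, j + 1), (h, l), (h + 1, l + 1)]:
        # position a may have only one image, position b only one preimage
        if forward.get(a, b) != b or backward.get(b, a) != a:
            return True
        forward[a] = b
        backward[b] = a
    return False
```

For instance, couples (1, 2) and (2, 5) conflict (position 2 of A would need two distinct images), while (1, 2) and (3, 4) do not.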
$$f(\pi^*) \le |M^*| \qquad (1)$$
[Figure 2: the duo sets $D^A$ = (ab, bc, ca, ab, bc) and $D^B$ = (ca, ab, ba, ab, bc); (a) a proper duo-mapping with 2 conflicts, (b) an unconflicting duo-mapping]
$$f(\pi) \ge f(\sigma) = |\hat{M}| \ge \frac{|M^*|}{4} \overset{(1)}{\ge} \frac{f(\pi^*)}{4}$$
A 4-approximate solution can thus be computed by creating the graph G from strings A and B, computing a maximum matching M* on it, partitioning M* four ways by index parity, and returning the largest part M̂. Then the matched duos are mapped following the edges of M̂, and all other characters are mapped arbitrarily. The complexity of the whole procedure is dominated by the computation of a maximum matching in G, which takes $O(n^{3/2})$ time.
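The parity-partition step can be sketched as follows (a fragment with our own naming, assuming the matching is given as a list of (i, j) pairs of duo indices); keeping a single parity class makes the chosen duos pairwise non-overlapping in both strings:

```python
def largest_parity_class(matching):
    """Partition matched duo pairs (i, j) into four classes by the
    parities of i and j, and return the largest class."""
    buckets = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for i, j in matching:
        buckets[(i % 2, j % 2)].append((i, j))
    return max(buckets.values(), key=len)
```

Since the four classes cover all of M*, the largest one contains at least |M*|/4 edges, matching the bound used above.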
20 N. Boria et al.
It is likely that the simple edge-removal procedure that nullifies all conflicts of Type 2 could be replaced by a more involved heuristic method in order to solve real-life problems efficiently.
Proof. Consider a vertex $v_{ij}$ of degree 6 in such a graph H. This vertex corresponds to a preservable couple of duos that conflicts with 6 other preservable couples. There exists only one possible configuration in the strings A and B that can create this situation, which is illustrated in Figure 4(a).
In return, this configuration always corresponds to the gadget illustrated in Figure 4(b), where the vertices $v_{ij}$, $v_{hj}$, $v_{il}$, and $v_{hl}$ have no connection with the rest of the graph.
Now, consider any maximal independent set S that picks some vertex $v_{ij}$ of degree 6 in H. The existence of this degree-6 vertex implies that the graph H contains the gadget of Figure 4(a). Since S is maximal, it necessarily contains the vertex $v_{hl}$ as well. Let $S' = (S \setminus \{v_{ij}, v_{hl}\}) \cup \{v_{il}, v_{hj}\}$. Recalling that $v_{il}$ and $v_{hj}$ have no neighbors outside of the gadget, it is clear that S' also defines an independent set.
Hence, in a maximal (and a fortiori optimal) independent set, any pair of degree-6 vertices (in such graphs, degree-6 vertices always appear in pairs) can
Notice that the reduction from k-MPSM to 6(k − 1)-MIS also yields the following simple parameterized algorithm:

NLP:
$$\max \sum_{(v_p^{ij},\, v_q^{kl}) \in E} x_{pq}^{ijkl}$$
subject to
$$x_{pq}^{ijkl} \le x_p^{ij} \quad \text{for } i, j, k, l \in [n_p],\ p, q \in [m],$$
$$\sum_{i=1}^{n_p} x_p^{ij} \le 1 \quad \text{for } j \in [n_p],\ p \in [m], \qquad (2)$$
$$\sum_{j=1}^{n_p} x_p^{ij} \le 1 \quad \text{for } i \in [n_p],\ p \in [m],$$
$$0 \le x_{pq}^{ijkl} \le 1 \quad \text{for } i, j, k, l \in [n_p],\ p, q \in [m],$$
$$0 \le x_p^{ij} \le 1 \quad \text{for } i, j \in [n_p],\ p \in [m].$$
Note that when the size of each grid is constant, the CLP is of polynomial size. The first constraint ensures that the value of an edge variable is not greater than the value of the vertex variable of any of its endpoints. The second and third constraints ensure that within each grid at most one vertex is taken in each column and in each row, respectively.
Notice that within each grid there are k! possible ways of taking a feasible subset of vertices. We call a configuration a feasible subset of vertices for a given grid. Let us denote by $\mathcal{C}_p$ the set of all possible configurations for a grid p. Now,
consider that we have a boolean variable $x_{C_p}$ for each possible configuration. The variable $x_{C_p}$ takes value 1 if all the vertices contained in $C_p$ are chosen, and 0 otherwise. The induced linear program is called the Configuration-LP (CLP). The CLP formulation for the CMIS problem is the following:

CLP:
$$\max \sum_{(v_p^{ij},\, v_q^{kl}) \in E} x_{pq}^{ijkl}$$
subject to
$$x_{pq}^{ijkl} \le x_p^{ij} \quad \text{for } i, j, k, l \in [n_p],\ p, q \in [m],$$
$$x_p^{ij} = \sum_{C_p \in \mathcal{C}_p \,:\, v_p^{ij} \in C_p} x_{C_p} \quad \text{for } i, j \in [n_p],\ p \in [m], \qquad (3)$$
$$\sum_{C_p \in \mathcal{C}_p} x_{C_p} = 1 \quad \text{for } p \in [m],$$
$$0 \le x_{pq}^{ijkl} \le 1 \quad \text{for } i, j, k, l \in [n_p],\ p, q \in [m],$$
$$0 \le x_{C_p} \le 1 \quad \text{for } C_p \in \mathcal{C}_p,\ p \in [m].$$

The first constraint is the same as in the NLP. The second ensures that the value of each vertex variable equals the sum of the values of the configuration variables containing the considered vertex. The third constraint ensures that within each grid exactly one configuration is taken. Notice that the vertex variables are redundant and serve just as an additional description; in particular, the first and second constraints could be merged into one constraint without vertex variables.
One can easily see that the CLP is at least as strong as the NLP formulation: a feasible solution to the CLP always translates into a feasible solution to the NLP.
Proof. Consider a randomized algorithm that, in each grid $G_p$, takes the vertices from configuration C with probability $\frac{\sqrt{x_C}}{\sum_{C_p \in \mathcal{C}_p} \sqrt{x_{C_p}}}$.
Consider any vertex, w.l.o.g. $v_p^{1,1}$. Each vertex is contained in two configurations; w.l.o.g. let $v_p^{1,1}$ be contained in $C_p^1$ and $C_p^2$. The probability that $v_p^{1,1}$ is chosen is:

$$\Pr\left[v_p^{1,1} \text{ is taken}\right] = \frac{\sqrt{x_{C_p^1}} + \sqrt{x_{C_p^2}}}{\sum_{C_p \in \mathcal{C}_p} \sqrt{x_{C_p}}}.$$

Minimizing the expression $\sqrt{x_{C_p^1}} + \sqrt{x_{C_p^2}}$ under the condition $x_{C_p^1} + x_{C_p^2} = x_p^{1,1}$, the minimum is attained when either $x_{C_p^1} = 0$ or $x_{C_p^2} = 0$, which implies $\sqrt{x_{C_p^1}} + \sqrt{x_{C_p^2}} \ge \sqrt{x_p^{1,1}}$. Thus:

$$\Pr\left[v_p^{1,1} \text{ is taken}\right] \ge \frac{\sqrt{x_p^{1,1}}}{\sum_{C_p \in \mathcal{C}_p} \sqrt{x_{C_p}}}.$$
References
1. Berman, P., Fujito, T.: On Approximation Properties of the Independent Set Problem for Low Degree Graphs. Theory of Computing Systems 32(2), 115–132 (1999)
2. Berman, P., Fürer, M.: Approximating Maximum Independent Set in Bounded Degree Graphs. In: Sleator, D.D. (ed.) SODA, pp. 365–371. ACM/SIAM (1994)
3. Berman, P., Karpinski, M.: On Some Tighter Inapproximability Results (Extended Abstract). In: Wiedermann, J., van Emde Boas, P., Nielsen, M. (eds.) ICALP 1999. LNCS, vol. 1644, pp. 200–209. Springer, Heidelberg (1999)
Further Results on the MPSM Problem 25
4. Bulteau, L., Fertin, G., Komusiewicz, C., Rusu, I.: A Fixed-Parameter Algorithm
for Minimum Common String Partition with Few Duplications. In: Darling, A.,
Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 244–258. Springer, Heidelberg
(2013)
5. Chen, J., Kanj, I.A., Jia, W.: Vertex Cover: Further Observations and Further
Improvements. In: Widmayer, P., Neyer, G., Eidenbenz, S. (eds.) WG 1999. LNCS,
vol. 1665, pp. 313–324. Springer, Heidelberg (1999)
6. Chen, W., Chen, Z., Samatova, N.F., Peng, L., Wang, J., Tang, M.: Solving the
maximum duo-preservation string mapping problem with linear programming.
Theoretical Computer Science 530, 1–11 (2014)
7. Chen, X., Zheng, J., Fu, Z., Nan, P., Zhong, Y., Lonardi, S., Jiang, T.: Assignment
of Orthologous Genes via Genome Rearrangement. Transactions on Computational
Biology and Bioinformatics 2(4), 302–315 (2005)
8. Chrobak, M., Kolman, P., Sgall, J.: The Greedy Algorithm for the Minimum Com-
mon String Partition Problem. In: Jansen, K., Khanna, S., Rolim, J.D.P., Ron, D.
(eds.) RANDOM 2004 and APPROX 2004. LNCS, vol. 3122, pp. 84–95. Springer,
Heidelberg (2004)
9. Cormode, G., Muthukrishnan, S.: The string edit distance matching problem with
moves. ACM Transactions on Algorithms 3(1) (2007)
10. Damaschke, P.: Minimum Common String Partition Parameterized. In: Cran-
dall, K.A., Lagergren, J. (eds.) WABI 2008. LNCS (LNBI), vol. 5251, pp. 87–98.
Springer, Heidelberg (2008)
11. Downey, R.G., Fellows, M.R.: Parameterized Complexity, p. 530. Springer (1999)
12. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman and Co., San Francisco (1979)
13. Goldstein, A., Kolman, P., Zheng, J.: Minimum Common String Partition Problem:
Hardness and Approximations. In: Fleischer, R., Trippen, G. (eds.) ISAAC 2004.
LNCS, vol. 3341, pp. 484–495. Springer, Heidelberg (2004)
14. Jiang, H., Zhu, B., Zhu, D., Zhu, H.: Minimum common string partition revisited.
Journal of Combinatorial Optimization 23(4), 519–527 (2012)
15. Kolman, P., Walen, T.: Approximating reversal distance for strings with bounded
number of duplicates. Discrete Applied Mathematics 155(3), 327–336 (2007)
16. Kolman, P., Walen, T.: Reversal Distance for Strings with Duplicates: Linear Time
Approximation using Hitting Set. Electronic Journal of Combinatorics 14(1) (2007)
17. Bulteau, L., Komusiewicz, C.: Minimum common string partition parameterized
by partition size is fixed-parameter tractable. In: SODA, pp. 102–121 (2014)
18. Lund, C., Yannakakis, M.: The Approximation of Maximum Subgraph Problems.
In: Lingas, A., Karlsson, R.G., Carlsson, S. (eds.) ICALP 1993. LNCS, vol. 700,
pp. 40–51. Springer, Heidelberg (1993)
19. Swenson, K.M., Marron, M., Earnest-DeYoung, J.V., Moret, B.M.E.: Approximat-
ing the true evolutionary distance between two genomes. ACM Journal of Experi-
mental Algorithmics 12 (2008)
A Faster 1.375-Approximation Algorithm
for Sorting by Transpositions
1 Introduction
By comparing the orders of common genes between two organisms, one may
estimate the series of mutations that occurred in the underlying evolutionary
process. In a simplified genome rearrangement model, each mutation is a trans-
position, and the sole chromosome of each organism is modeled by a permutation,
which means that there are no duplicated or deleted genes. A transposition is a
rearrangement of the gene order within a chromosome, in which two contiguous
blocks are swapped. The transposition distance is the minimum number of trans-
positions required to transform one chromosome into another. Bulteau et al. [3]
proved that the problem of determining the transposition distance between two
permutations – or Sorting by Transpositions (SBT) – is NP-hard.
Several approaches to handle the SBT problem have been considered. Our
focus is to explore approximation algorithms for estimating the transposition
distance between permutations, providing better practical results or lowering
time complexities.
Bafna and Pevzner [2] designed a 1.5-approximation algorithm running in O(n²) time, based
on the cycle structure of the breakpoint graph. Hartman and Shamir [10] later gave a
simpler and faster 1.5-approximation algorithm.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 26–37, 2014.
© Springer-Verlag Berlin Heidelberg 2014
2 Background
For our purposes, a gene is represented by a unique integer and a chromo-
some with n genes is a permutation π = [π0 π1 π2 . . . πn πn+1 ], where π0 =
0, πn+1 = n + 1 and each πi is a unique integer in the range 1, . . . , n. The
transposition t(i, j, k), where 1 ≤ i < j < k ≤ n + 1, applied to π yields the permutation
π · t(i, j, k), in which the two contiguous blocks πi πi+1 . . . πj−1 and πj πj+1 . . . πk−1
are interchanged. A sequence of q transpositions sorts a permutation π if
π · t1 t2 · · · tq = ι, where every ti is a transposition and ι is the
identity permutation [0 1 2 . . . n n + 1]. The transposition distance of π, denoted
d(π), is the length of a minimum sequence of transpositions that sorts π.
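As an illustration (not from the paper), applying t(i, j, k) to a permutation stored as a Python list is a matter of slicing; the function name and encoding below are choices made for this sketch, with the sentinels π₀ = 0 and πₙ₊₁ = n + 1 kept in the list.

```python
def transpose(pi, i, j, k):
    """Apply t(i, j, k): swap the contiguous blocks pi[i..j-1] and pi[j..k-1]."""
    assert 1 <= i < j < k <= len(pi) - 1  # indices stay inside the sentinels
    return pi[:i] + pi[j:k] + pi[i:j] + pi[k:]

pi = [0, 3, 1, 2, 4]           # 0 and n + 1 = 4 are the sentinels
print(transpose(pi, 1, 2, 4))  # -> [0, 1, 2, 3, 4], so d(pi) = 1
```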
Given a permutation π, the breakpoint graph of π is G(π) = (V, R ∪ D); the set of
vertices is V = {0, −1, +1, −2, +2, . . . , −n, +n, −(n + 1)}, and the edges are partitioned
into two sets: the directed reality edges R = {$\overrightarrow{i}$ = (+πi , −πi+1 ) | i =
0, . . . , n} and the undirected desire edges D = {(+i, −(i + 1)) | i = 0, . . . , n}.
Fig. 1 shows G([0 10 9 8 7 1 6 11 5 4 3 2 12]); the horizontal lines represent the
edges in R and the arcs represent the edges in D.
Every vertex in G(π) has degree 2, so G(π) can be partitioned into disjoint
cycles. We shall use the terms a cycle in π and a cycle in G(π) interchangeably.
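Since each vertex carries exactly one reality edge and one desire edge, the cycle decomposition can be traced by alternating between the two edge types. A minimal sketch (the function name and the (sign, value) vertex encoding are assumptions of this sketch, not the paper's):

```python
def breakpoint_graph_cycles(pi):
    """Count the alternating cycles of G(pi); pi includes sentinels 0 and n+1.
    Vertices are ('+', x) / ('-', x); each has one reality and one desire edge."""
    n = len(pi) - 2
    reality, desire = {}, {}  # endpoint -> other endpoint of that edge
    for i in range(n + 1):
        u, v = ('+', pi[i]), ('-', pi[i + 1])   # reality edge i
        reality[u], reality[v] = v, u
        a, b = ('+', i), ('-', i + 1)           # desire edge i
        desire[a], desire[b] = b, a
    cycles, seen = 0, set()
    for start in reality:
        if start in seen:
            continue
        cycles += 1
        v, use_reality = start, True
        while v not in seen:                    # walk the alternating cycle
            seen.add(v)
            v = reality[v] if use_reality else desire[v]
            use_reality = not use_reality
    return cycles

# identity on n = 3: every cycle is a 2-cycle, giving n + 1 = 4 cycles
print(breakpoint_graph_cycles([0, 1, 2, 3, 4]))  # -> 4
```

Running it on the permutation of Fig. 1, `breakpoint_graph_cycles([0, 10, 9, 8, 7, 1, 6, 11, 5, 4, 3, 2, 12])`, also yields four cycles, each containing three reality edges.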
28 L.F.I. Cunha et al.
If a configuration C has only 3-cycles and no open gates, then C is a full configuration. Some full
configurations, such as the one in Fig. 2(a), do not correspond to the breakpoint
graph of any permutation [6].
A configuration C that has k edges is in the cromulent form¹ if every edge
$\overrightarrow{0}, \ldots, \overrightarrow{k-1}$ is in C. Given a configuration C having k edges, a cromulent
relabeling (Fig. 2b) of C is a configuration C′ such that C′ is in the cromulent
form and there is a function σ satisfying that, for every pair of edges $\overrightarrow{i}$, $\overrightarrow{j}$ in
C with i < j, the edges $\overrightarrow{\sigma(i)}$, $\overrightarrow{\sigma(j)}$ are in C′ and σ(i) < σ(j).
Given an integer x, a circular shift of a configuration C, which is in the cromulent
form and has k edges, is the configuration denoted C + x in which every
edge $\overrightarrow{i}$ of C corresponds to the edge $\overrightarrow{i + x \ (\mathrm{mod}\ k)}$ in C + x. Two configurations C and K
are equivalent if there is an integer x such that C′ + x = K′, where C′ and K′ are
their respective cromulent relabelings.
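Equivalence up to circular shift can be checked mechanically. The sketch below models a cromulent-form configuration only as a set of cycles, each cycle a set of its reality-edge labels 0..k−1 (so it ignores the internal order of edges within a cycle), and assumes both inputs are already cromulent relabelings; the function name and encoding are illustrative, not the paper's data structures.

```python
def equivalent(c1, c2):
    """Are two cromulent-form configurations (sets of frozensets of edge
    labels 0..k-1) equal up to some circular shift of the labels?"""
    k = sum(len(c) for c in c1)
    if k != sum(len(c) for c in c2):
        return False
    shift = lambda conf, x: {frozenset((e + x) % k for e in c) for c in conf}
    return any(shift(c1, x) == c2 for x in range(k))

B = {frozenset({0, 1, 2}), frozenset({3, 4, 5})}
print(equivalent(B, {frozenset({2, 3, 4}), frozenset({5, 0, 1})}))  # -> True (shift by 2)
```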
(a) (b)
Fig. 2. (a) Full configuration {C1 , C2 , C3 , C4 } = {0 2 5, 1 3 10, 4 7 9, 6 8 11}. (b)
The cromulent relabeling of {C1 , C2 } is {0 2 4, 1 3 5}.
Elias and Hartman’s algorithm Elias and Hartman [6] performed a systematic
enumeration of all components having nine or less cycles, in which all cycles have
length 3. Starting from single 3-cycles, components were obtained by applying a
series of sufficient extensions, as described next. An extension of a configuration
C is a connected configuration C ∪ {C}, where C ∈ C. A sufficient extension is an
extension that either: 1) closes an open gate; or 2) extends a full configuration
such that the extension has at most one open gate. A configuration obtained by
a series of sufficient extensions is named sufficient configuration, which has an
(x, y)-, or xy -, sequence if it is possible to apply such a sequence to its cycles.
Lemma 1. [6] Every unoriented sufficient configuration of nine cycles has an
11/8-sequence.
Components with fewer than nine cycles are called small components. Elias and
Hartman showed that there are just five kinds of small components that do not
have an 11/8-sequence; these components are called bad small components. Small
components that have an 11/8-sequence are good small components.
Lemma 2. [6] The bad small components are: A = {0 2 4, 1 3 5}; B =
{0 2 10, 1 3 5, 4 6 8, 7 9 11}; C = {0 5 7, 1 9 11, 2 4 6, 3 8 10};
D = {0 2 4, 1 12 14, 3 5 7, 6 8 10, 9 11 13}; and E = {0 2 16, 1 3 5,
4 6 8, 7 9 11, 10 12 14, 13 15 17}.
¹ cromulent: neologism coined by David X. Cohen, meaning "normal" or "acceptable."
If a permutation has bad small components, it is still possible to find
11/8-sequences, as Lemma 3 states.
Lemma 3. [6] Let π be a permutation with at least eight cycles and containing
only bad small components. Then π has an (11, 8)-sequence.
Corollary 1. [6] If every cycle in G(π) is a 3-cycle, and there are at least eight
cycles, then π has an 11/8-sequence.
Lemmas 1 and 3, and Corollary 1 form the theoretical basis for Elias and
Hartman's 11/8 = 1.375-approximation algorithm for SBT, shown in Algorithm 1.
Feng and Zhu’s permutation tree Feng and Zhu [7] introduced the permutation
tree, a binary balanced tree that represents a permutation, and provided four
algorithms: to build a permutation tree in O(n) time, to join two permutation
trees into one in O(h) time, where h is the height difference between the trees, to
split a permutation tree into two in O(log n) time, and to query a permutation
→ −
− →
tree and find reality edges that intersect a given pair i , j in O(log n) time.
Firoz’s et al. use of the permutation tree Firoz et al. [8] suggested the use of
the permutation tree to reduce the running time of Elias and Hartman’s [6]
algorithm. In [5], we showed that this strategy fails to extend some full configu-
rations.
Firoz et al. [8] stated that extensions can be done in O(log n) time. To do that,
they categorized sufficient extensions of a configuration A into type 1 extensions
– those that add a cycle that closes open gates – and type 2 extensions – those
that extend a full configuration by adding a cycle C such that A ∪ {C} has at
most one open gate.
A type 1 extension can be performed in logarithmic time by running query
for an open gate. In a type 2 extension, since there are no open gates, Firoz et
al. claimed that it is sufficient to perform queries on all pairs of reality edges
belonging to the same cycle in a configuration that is being extended. But, as
shown in [5], there is an infinite family of configurations for which this strat-
egy fails; some instances are subsets of two cycles of [0 10 9 8 7 1 6 11 5 4 3 2 12]
(Fig. 1). Consider the configuration A = {C1 }; try to sufficiently extend A (step
9 in Algorithm 1) using the steps proposed by Firoz et al.:
1. Configuration A has three open gates. Executing the query for an open gate
results in a pair of edges belonging to the cycle C2 . Therefore, we add this cycle
to the configuration A, which becomes A = {C1 , C2 }.
2. Configuration A has no more open gates. Executing the query for every pair
of edges in the same cycle of A, we observe that the query will return a pair
that is already in A. So far, Firoz et al.’s method has failed to extend A.
Algorithm 4 summarizes our approach towards finding and applying a (2, 2)-
sequence in O(n) time.
Our strategy to find a (2, 2)-sequence in linear time starts with checking
whether a breakpoint graph satisfies Lemma 5, as described in detail in Al-
gorithm 2. It differs from previous approaches [6,8] in that the leftmost oriented
cycle, dubbed K1 , is fixed when verifying conditions 2 and 3, avoiding compar-
isons between every pair of cycles.
Given a simple permutation π, it is trivial to enumerate all of its cycles in lin-
ear time. The size of each cycle, and whether it is oriented, are both determined
in constant time.
Christie [4] proved that every permutation has an even number (possibly zero)
of even cycles; he also showed that, given a simple permutation, when the number
of even cycles is not zero, there exists a (2, 2)-sequence that affects those cycles
if, and only if, there are either four 2-cycles, or there are two intersecting even
cycles. Therefore, in these cases, a (2, 2)-sequence can be applied in O(log n) time.
Fig. 3. Oriented cycles represented by their reality edges. All oriented cycles interleave
with K1 , but Ki and Kj do not interleave each other.
At the end of Section 2, we discussed Firoz et al.'s use of the permutation tree;
as proven in [5], their strategy does not account for configurations with fewer
than nine cycles that are not components, since successive invocations of the
query procedure may result in a full configuration with fewer than nine cycles
that is not a small component. Our proposed strategy generalizes the definitions
related to small components by defining a small configuration: a configuration
with fewer than nine cycles.
Proof. Consider all breakpoint graphs of F and its circular shifts combined with
B, C, D, E, and their circular shifts. A combination of a pair of small full config-
urations is obtained by starting from one small full configuration and inserting
a new one in different positions in the breakpoint graph. Altogether, there are
324 such graphs. A computerized case analysis, in [1], enumerates every possible
breakpoint graph and provides an 11/8-sequence for each of them.
Proof. The 11/8-sequences for the cases enumerated above were also found through
a computerized case analysis [1]. Note that Fi Fj is equivalent to Fi+6 Fj for
i ∈ {0, 1, . . . , 5}, which simplifies our analysis.
each pair of F is naughty; the same can also be said of every combination of F
and three copies of A such that each triple F−A−A is naughty. Therefore, at
most 12 cycles are in S, since there are in the worst case three copies of F ; or
one copy of F and three copies of A. In all these cases we apply 11/8-sequences as
proved in [1].
New Algorithm. The previous results allow us to devise Algorithm 5, which
obtains configurations using the query procedure and applies 11/8-sequences
to configurations of size at most 9. It differs from Algorithm 1 not only in the
use of permutation trees, but also because we continuously deal with bad small
full configurations instead of only at the end.
The comparisons in Steps 12, 14, 15, 17 and 20 are done in constant time using
lookup tables of size bounded by a constant. Updating the set S also requires
constant time, since it has at most 12 cycles. Every sequence of transpositions
of size bounded by a constant can be applied in O(log n) time thanks to the use
of permutation trees. The time complexity of the loop between Steps 6 and 23 is
O(n log n), since the number of 3-cycles is linear in n and the number of cycles
decreases, in the worst case, once every three iterations. In Step 24, the search for
a (4, 3)- or a (3, 2)-sequence is done in constant time, since the number of cycles
is bounded by a constant. Steps 24 and 25 also run in O(n log n) time.
5 Conclusion
The goal of this paper is to lower the time complexity of Elias and Hartman’s [6]
1.375-approximation algorithm down to O(n log n). Our new approach provides,
so far, both the lowest fixed approximation ratio and time complexity of any
non-trivial algorithm for sorting by transpositions.
We have previously shown that a simple application of permutation trees [7],
as claimed in [8], does not suffice to correctly improve the running time of Elias
and Hartman’s algorithm. In order to lower the time complexity, it is necessary
to add more configurations [1] to the original analysis in [6], and also to perform
some changes in the sorting procedure, as shown in Algorithm 5.
References
1. http://compscinet.org/research/sbt1375 (2014)
2. Bafna, V., Pevzner, P.A.: Sorting by transpositions. SIAM J. Discrete Math. 11(2),
224–240 (1998)
3. Bulteau, L., Fertin, G., Rusu, I.: Sorting by transpositions is difficult. SIAM J.
Discrete Math. 26(3), 1148–1180 (2012)
4. Christie, D.A.: Genome Rearrangement Problems. Ph.D. thesis, University of Glas-
gow, UK (1999)
5. Cunha, L.F.I., Kowada, L.A.B., de A. Hausen, R., de Figueiredo, C.M.H.: On the
1.375-approximation algorithm for sorting by transpositions in O(n log n) time.
In: Setubal, J.C., Almeida, N.F. (eds.) BSB 2013. LNCS, vol. 8213, pp. 126–135.
Springer, Heidelberg (2013)
6. Elias, I., Hartman, T.: A 1.375-approximation algorithm for sorting by transposi-
tions. IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(4), 369–379 (2006)
7. Feng, J., Zhu, D.: Faster algorithms for sorting by transpositions and sorting by
block interchanges. ACM Trans. Algorithms 3(3) (2007)
8. Firoz, J.S., Hasan, M., Khan, A.Z., Rahman, M.S.: The 1.375 approximation
algorithm for sorting by transpositions can run in O(n log n) time. J. Comput.
Biol. 18(8), 1007–1011 (2011)
9. Hannenhalli, S., Pevzner, P.A.: Transforming cabbage into turnip: Polynomial al-
gorithm for sorting signed permutations by reversals. J. ACM 46(1), 1–27 (1999)
10. Hartman, T., Shamir, R.: A simpler and faster 1.5-approximation algorithm for
sorting by transpositions. Inf. Comput. 204(2), 275–290 (2006)
A Generalized Cost Model for DCJ-Indel Sorting
1 Introduction
Large scale chromosomal mutations were observed indirectly via the study of
linkage maps near the beginning of the 20th Century, and these genome rear-
rangements were first directly observed by Dobzhansky and Sturtevant in 1938
(see [8]). Yet only in the past quarter century has the combinatorial study of
genome rearrangements taken off, as researchers have attempted to create and
adapt discrete genomic models along with distance functions modeling the evo-
lutionary distance between two genomes. See [9] for an overview of the combi-
natorial methods used to compare genomes.
Recent research has moved toward multichromosomal genomic models as well as
distance functions that allow for mutations involving more than one chromosome.
Perhaps the most commonly used such model represents an ordered collection of
disjoint chromosomal intervals along a chromosome as either a path or cycle, de-
pending on whether the chromosome is linear or circular. For genomes with equal
gene content, the double cut and join operation (DCJ), introduced in [11], incor-
porates a wide class of operations into a simple graph operation. It has led to a
large number of subsequent results over the last decade, beginning with a linear-
time algorithm for the problem of DCJ sorting (see [3]), in which we attempt to
transform one genome into another using a minimum number of DCJs.
For genomes with unequal gene content, the incorporation of insertions and
deletions of chromosomes and chromosomal intervals (collectively called “in-
dels”) into the DCJ framework was discussed in [12] and solved in [5]. The latter
The author would like to thank Pavel Pevzner and the reviewers for very insightful
comments.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 38–51, 2014.
© Springer-Verlag Berlin Heidelberg 2014
2 Preliminaries
A genome Π is a graph containing an even number of labeled nodes and com-
prising the edge-disjoint union of two perfect matchings: the genes¹ of Π, denoted
g(Π); and the adjacencies of Π, denoted a(Π). Consequently, each node
of Π has degree 2, and the connected components of Π form cycles that al-
ternate between genes and adjacencies; these cycles are called chromosomes.
This genomic model, in which chromosomes are circular, offers a reasonable and
commonly used approximation of genomes having linear chromosomes.
A double cut and join operation (DCJ) on Π, introduced in [11], forms
a new genome by replacing two adjacencies of Π with two new adjacencies on
the same four nodes. Despite being simply defined, the DCJ incorporates the
reversal of a chromosomal segment, the fusion of two chromosomes into one
chromosome, and the fission of one chromosome into two chromosomes (Fig. 1).²
For genomes Π and Γ with the same genes, the DCJ distance, denoted d(Π, Γ ),
is the minimum number of DCJs needed to transform Π into Γ .
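As a quick illustration (not from the paper), a genome can be stored as a set of adjacencies, each a frozenset of node labels such as "1h"/"1t" for the head and tail of gene 1; a DCJ then just swaps two adjacencies for two new ones on the same four nodes. The function name and this encoding are assumptions of the sketch.

```python
def dcj(adjacencies, old1, old2, new1, new2):
    """Replace adjacencies old1 and old2 with new1 and new2 on the same four nodes."""
    old1, old2, new1, new2 = map(frozenset, (old1, old2, new1, new2))
    assert old1 in adjacencies and old2 in adjacencies
    assert old1 | old2 == new1 | new2  # the four node labels are preserved
    return (adjacencies - {old1, old2}) | {new1, new2}

# a reversal of gene 2 on the circular chromosome (1 2 3):
# adjacencies (1h,2t) and (2h,3t) become (1h,2h) and (2t,3t)
genome = {frozenset(p) for p in [("1h", "2t"), ("2h", "3t"), ("3h", "1t")]}
genome = dcj(genome, ("1h", "2t"), ("2h", "3t"), ("1h", "2h"), ("2t", "3t"))
print(sorted(sorted(a) for a in genome))  # -> [['1h', '2h'], ['1t', '3h'], ['2t', '3t']]
```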
The breakpoint graph of Π and Γ , denoted B(Π, Γ ) (introduced in [2]),
is the edge-disjoint union of a(Π) and a(Γ ) (Fig. 2). The line graph of the
breakpoint graph is the adjacency graph, which was introduced in [3] and is
also commonly used in genome rearrangement studies. Note that the connected
components of B(Π, Γ ) form cycles (of length at least 2) that alternate between
adjacencies of Π and Γ , and so we will let c(Π, Γ ) denote the number of cycles in B(Π, Γ ).
¹ In practice, gene edges typically represent synteny blocks containing a large number
of contiguous genes.
² When the DCJ is applied to circularized linear chromosomes, it encompasses a larger
variety of operations. See [5] for details.
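Because every node carries exactly one adjacency in each genome, c(Π, Γ) can be computed by walking the breakpoint graph while alternating between the two matchings. A minimal sketch, assuming each genome is given as a node-to-partner dictionary (an encoding chosen here only for illustration):

```python
def cycle_count(adj_pi, adj_gamma):
    """Number of alternating cycles in the breakpoint graph B(Pi, Gamma).
    Each genome is a perfect matching given as a node -> partner dict."""
    seen, cycles = set(), 0
    for start in adj_pi:
        if start in seen:
            continue
        cycles += 1
        v, in_pi = start, True
        while v not in seen:                 # alternate Pi-edge, Gamma-edge, ...
            seen.add(v)
            v = adj_pi[v] if in_pi else adj_gamma[v]
            in_pi = not in_pi
    return cycles

def matching(pairs):
    m = {}
    for a, b in pairs:
        m[a], m[b] = b, a
    return m

pi = matching([("1h", "2t"), ("2h", "3t"), ("3h", "1t")])
gamma = matching([("1h", "3t"), ("3h", "2t"), ("2h", "1t")])
print(cycle_count(pi, gamma))  # -> 1 (a single cycle of length 6)
```

When Π = Γ the count is maximal: `cycle_count(pi, pi)` gives one 2-cycle per adjacency.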
40 P.E.C. Compeau
Fig. 1. DCJs replace two adjacencies of a genome and incorporate three operations on
circular chromosomes: reversals, fissions, and fusions. Genes are shown in black, and
adjacencies are shown in red.
The DCJ distance offers a useful metric for measuring the evolutionary dis-
tance between two genomes having the same genes, but we strive toward a ge-
nomic model that incorporates insertions and deletions as well. A deletion in Π
is defined as the removal of either an entire chromosome or chromosomal interval
of Π, i.e., if adjacencies {v, w} and {x, y} are contained in the order (v, w, x, y)
on some chromosome of Π, then a deletion replaces the path connecting v to y
with the single adjacency {v, y}. An insertion is simply the inverse operation
of a deletion. The term indels refers collectively to insertions and deletions.
To consider genomes with unequal gene content, we will henceforth assume
that any pair of genomes Π and Γ satisfy g(Π) ∪ g(Γ ) = G, where G is a
perfect matching on a collection of nodes V. A transformation of Π into Γ
is a sequence of DCJs and indels such that any deleted node must belong to
V − V (Γ ) and any inserted node must belong to V − V (Π).³
³ This assumption follows the lead of the authors in [5]. It prevents, among other
things, a trivial transformation of genome Π into genome Γ of similar gene content
in which we simply delete all the chromosomes of Π and replace them with the
chromosomes of Γ.
Fig. 2. The construction of the breakpoint graph of genomes Π and Γ having the same
genes. First, the nodes of Γ are rearranged so that they have the same positions as in Π.
Then, the breakpoint graph is formed as the edge-disjoint union of the adjacencies of Π (red)
and Γ (blue).
The resulting genome, which we call ΠU , has the exact same adjacencies as ΠT
except that it contains the adjacencies {y1 , xk+1 } and {vk+1 , wk+1 } instead of
{vk+1 , y1 } and {wk+1 , xk+1 }. Because two genomes on the same genes are equiv-
alent if and only if they share the same adjacencies, a single DCJ on {y1 , xk+1 }
and {vk+1 , wk+1 } would change ΠU into ΠT . Furthermore, in ΠT , {vk+1 , y1 } belongs
to C and {wk+1 , xk+1 } belongs to C∗k+1 , so this DCJ must
be a fission producing C and C∗k+1 . In U, rather than applying this fission, we
simply delete the chromosomal interval containing the genes of C. As a result,
U is identical to T except that it replaces 2k + 1 DCJs and a deletion by 2k
DCJs and a deletion. Hence, U has strictly smaller cost than T, which provides
the desired contradiction.
Following Theorem 1, we recall the observation in [1] that we can view the deletion
of a chromosomal interval, which replaces adjacencies {v, w} and {x, y} with a
single adjacency, as a fission replacing {v, w} and {x, y} by the two adjacencies
{w, x} and {v, y}, thus forming a circular chromosome containing {v, y} that is
scheduled for later removal. By viewing this operation as a DCJ, we establish a
bijective correspondence between the deletions of a minimum cost transforma-
tion of Π into Γ (having no singletons) and a collection of chromosomes sharing
no genes with Π. (Insertions are handled symmetrically.)
Therefore, define a completion of genomes Π and Γ as a pair of genomes
(Π′, Γ′) such that Π is a subgraph of Π′, Γ is a subgraph of Γ′, and g(Π′) =
g(Γ′) = G. Each of Π′ − Π and Γ′ − Γ is formed of alternating cycles called
new chromosomes; in other words, the chromosomes of Π′ comprise the chromosomes
of Π in addition to some new chromosomes that are disjoint from Π.
\[
d_\omega(\Pi, \Gamma) = \min_{(\Pi', \Gamma')} \{ d(\Pi', \Gamma') + (\omega - 1) \cdot \mathrm{ind}(\Pi', \Gamma') \} \tag{2}
\]
\[
\phantom{d_\omega(\Pi, \Gamma)} = N - \max_{(\Pi', \Gamma')} \{ c(\Pi', \Gamma') + (1 - \omega) \cdot \mathrm{ind}(\Pi', \Gamma') \} \tag{3}
\]
Proposition 4. If 0 < ω < 2 and sing(Π, Γ ) = 0, then for any optimal com-
pletion (Π ∗ , Γ ∗ ) of Π and Γ , every path of length 2k − 1 in B(Π, Γ ) (k ≥ 1)
embeds into a cycle of length 2k in B(Π ∗ , Γ ∗ ).
B(Π, Γ ) to each other. Given any even-length path P in B(Π, Γ ), there is ex-
actly one other even-length path P1 that would form a new chromosome in Γ ∗ if
linked with P , and exactly one other even-length path P2 in B(Π, Γ ) that would
form a new chromosome in Π ∗ if linked with P (P1 and P2 may be the same).
As long as there are more than two other even-length paths to choose from, we
can simply link P to any path other than P1 or P2 . We then iterate this process
until two even-length paths remain, which we link to complete the construction
of Π ∗ and Γ ∗ ; each of these genomes has one new chromosome containing all of
that genome’s bracelet adjacencies.
It is easy to see that the conditions in the preceding three propositions are
sufficient (but not necessary) when constructing an optimal completion for the
boundary cases ω = 0 and ω = 2. We are now ready to state our first major
result with respect to DCJ-indel sorting.
Algorithm 7. When 0 ≤ ω ≤ 2 and sing(Π, Γ ) = 0, the following algorithm
solves the problem of DCJ-indel sorting Π into Γ in O(N ) time.
1. Link the endpoints of any odd-length path in B(Π, Γ ), which may create
some new chromosomes in Π ∗ and Γ ∗ .
2. Arbitrarily select an even-length path P of B(Π, Γ ) (if one exists).
(a) If there is more than one additional even-length path in B(Π, Γ ), link P
to an even-length path that produces no new chromosomes in Π ∗ or Γ ∗ .
(b) Otherwise, link the two remaining even-length paths in B(Π, Γ ) to form
a new chromosome in each of Π ∗ and Γ ∗ .
3. Iterate Step 2 until no even-length paths of B(Π, Γ ) remain. The resulting
completion is (Π ∗ , Γ ∗ ).
4. Apply the O(N )-time algorithm for DCJ sorting from [11] to transform Π ∗
into Γ ∗ .
Let podd (Π, Γ ) and peven (Π, Γ ) equal the number of odd- and even-length paths
in B(Π, Γ ), respectively. The optimal completion (Π ∗ , Γ ∗ ) constructed by Al-
gorithm 7 has the following properties:
\[
c(\Pi^*, \Gamma^*) = c(\Pi, \Gamma) + p_{\mathrm{odd}}(\Pi, \Gamma) + \frac{p_{\mathrm{even}}(\Pi, \Gamma)}{2} \tag{4}
\]
\[
\mathrm{ind}(\Pi^*, \Gamma^*) = k(\Pi, \Gamma) + \min\{2, p_{\mathrm{even}}(\Pi, \Gamma)\} \tag{5}
\]
These formulas, when combined with Theorem 3, yield a formula for the DCJ-
indel distance as a function of Π, Γ , and ω alone.
Corollary 8. If 0 ≤ ω ≤ 2 and sing(Π, Γ ) = 0, the DCJ-indel distance between
Π and Γ is given by the following equation:
\[
d_\omega(\Pi, \Gamma) = N - \left[ c(\Pi, \Gamma) + p_{\mathrm{odd}}(\Pi, \Gamma) + \frac{p_{\mathrm{even}}(\Pi, \Gamma)}{2} + (1 - \omega) \cdot \bigl( k(\Pi, \Gamma) + \min\{2, p_{\mathrm{even}}(\Pi, \Gamma)\} \bigr) \right] \tag{6}
\]
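Given the graph statistics, the closed form in Corollary 8 is straightforward to evaluate; a sketch in which all counts (N, c, p_odd, p_even, k) are assumed to be supplied by the caller:

```python
def dcj_indel_distance(N, c, p_odd, p_even, k, omega):
    """Corollary 8: DCJ-indel distance for 0 <= omega <= 2, no singletons."""
    assert 0 <= omega <= 2
    return N - (c + p_odd + p_even / 2 + (1 - omega) * (k + min(2, p_even)))

# with omega = 1 the indel term vanishes, leaving N - (c + p_odd + p_even / 2)
print(dcj_indel_distance(10, 3, 2, 4, 1, 1.0))  # -> 3.0
```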
We now turn our attention to the case ω > 2. Intuitively, as ω grows, we should
witness fewer indels. Let δΓ (Π) be equal to 1 if g(Π) − g(Γ ) is nonempty and
0 otherwise; then, set δ(Π, Γ ) = δΓ (Π) + δΠ (Γ ). Note that δ(Π, Γ ) is a lower
bound on the number of indels in any transformation of Π into Γ . The following
result shows that in the absence of singletons, this bound is achieved by every
minimum-cost transformation when ω > 2.
One can verify that the condition in Theorem 9 is sufficient but not necessary to
guarantee a minimum-cost transformation when ω = 2. Furthermore, a conse-
quence of Theorem 9 is that the optimal completion is independent of the value
of ω. In other words, if a completion achieves the maximum in (7), then this
completion is automatically optimal for all values of ω ≥ 2.
Fortunately, Algorithm 7 already describes the construction of a completion
(Π′, Γ′) that is optimal when ω = 2. Of course, we cannot guarantee that this
completion has the desired property that ind(Π′, Γ′) = δ(Π, Γ). However, if
ind(Π′, Γ′) > δ(Π, Γ), then we can apply ind(Π′, Γ′) − δ(Π, Γ) total fusions
to Π′ and Γ′ in order to obtain a different completion (Π∗, Γ∗). Each of these
fusions reduces the number of new chromosomes by 1 and (by (3)) must also
decrease the number of cycles in the breakpoint graph by 1, since (Π′, Γ′) is
optimal for ω = 2. As a result, c(Π∗, Γ∗) − ind(Π∗, Γ∗) = c(Π′, Γ′) − ind(Π′, Γ′).
Thus, (Π∗, Γ∗) is optimal for ω = 2, and since ind(Π∗, Γ∗) = δ(Π, Γ), we know
that (Π∗, Γ∗) must be optimal for any ω > 2, as already noted. This discussion
immediately implies the following algorithm.
We have thus far avoided genome pairs with singletons because Theorem 1, which
underlies the main results in the preceding section, only applied in the absence
of singletons. Yet fortunately, genomes with singletons will be relatively easy
to incorporate into a single DCJ-indel sorting algorithm. As we might guess,
different values of ω produce different results.
transformation of Π∅ into Γ∅ using at most n DCJs and indels. One can verify
that Π∅i+1 = Π∅i precisely when Πi+1 is produced from Πi either by a DCJ that
involves an adjacency belonging to a singleton or by an indel containing genes
that all belong to singletons. At least sing(Π, Γ) such operations must always
occur in T; hence,
In the case that ω ≤ 1, the bounds in (11) and (12) immediately yield (10).
Assume, then, that ω > 1. If δΓ ∅ (Π ∅ ) = 0, then g(Π ∅ ) ⊆ g(Γ ∅ ), meaning
that every deleted gene of Π must belong to a singleton of Π. In this case, the
total cost of removing any singletons of Π is trivially minimized by singΓ (Π) − 1
fusions consolidating the singletons of Π into a single chromosome, followed by
the deletion of this chromosome. Symmetric reasoning applies to the singletons
of Γ if δΠ ∅ (Γ ∅ ) = 0.
On the other hand, assume that ω > 1 and that δΓ ∅ (Π ∅ ) = 1, so that g(Π ∅ )−
g(Γ ∅ ) is nonempty. In this case, if Π has any singletons, then we can create a
minimum-cost transformation by applying singΓ (Π) − 1 fusions consolidating
the singletons of Π into a single chromosome, followed by another fusion that
consolidates these chromosomes into a chromosomal interval of Π that is about
to be deleted. Symmetric reasoning applies to the singletons of Γ if δΠ ∅ (Γ ∅ ) = 1.
Regardless of the particular values of δΓ ∅ (Π ∅ ) and δΠ ∅ (Γ ∅ ), we will obtain
the formula in (10).
Algorithm 13. The following algorithm solves the general problem of DCJ-
indel sorting genomes Π and Γ for any indel cost ω ≥ 0 in O(N ) time.
1. Case 1: ω ≤ 1.
(a) Delete any singletons of Π, then insert any singletons of Γ .
6 Conclusion
With the problem of DCJ-indel sorting genomes with circular chromosomes uni-
fied under a general model, we see three obvious future applications of this work.
First, an extension of these results for genomes with linear chromosomes would
prevent us from having to first circularize linear chromosomes when comparing
eukaryotic genomes. This work promises to be extremely tedious (if it is indeed
possible) without offering dramatic new insights.
Second, we would like to implement the linear-time method for DCJ-indel
sorting described in Algorithm 13 and publish the code publicly. Evolutionary
study analysis on real data would hopefully determine appropriate choices of ω.
Third, we are currently attempting to extend these results to fully characterize
the space of all solutions to DCJ-indel sorting, which would generalize the result
in [6] to arbitrary values of ω.
References
1. Arndt, W., Tang, J.: Emulating insertion and deletion events in genome rear-
rangement analysis. In: 2011 IEEE International Conference on Bioinformatics
and Biomedicine, pp. 105–108 (2011)
2. Bafna, V., Pevzner, P.A.: Genome rearrangements and sorting by reversals. SIAM
J. Comput. 25(2), 272–289 (1996)
3. Bergeron, A., Mixtacki, J., Stoye, J.: A unifying view of genome rearrange-
ments. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS (LNBI), vol. 4175,
pp. 163–173. Springer, Heidelberg (2006)
A Generalized Cost Model for DCJ-Indel Sorting 51
4. Braga, M., Machado, R., Ribeiro, L., Stoye, J.: On the weight of indels in genomic
distances. BMC Bioinformatics 12(suppl. 9), S13 (2011)
5. Braga, M.D.V., Willing, E., Stoye, J.: Genomic distance with DCJ and indels. In:
Moulton, V., Singh, M. (eds.) WABI 2010. LNCS, vol. 6293, pp. 90–101. Springer,
Heidelberg (2010)
6. Compeau, P.: DCJ-indel sorting revisited. Algorithms for Molecular Biology 8(1),
6 (2013)
7. Compeau, P.E.C.: A simplified view of DCJ-indel distance. In: Raphael, B.,
Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 365–377. Springer, Heidelberg
(2012)
8. Dobzhansky, T., Sturtevant, A.H.: Inversions in the chromosomes of drosophila
pseudoobscura. Genetics 23(1), 28–64 (1938)
9. Fertin, G., Labarre, A., Rusu, I., Tannier, E., Vialette, S.: Combinatorics of
Genome Rearrangements. MIT Press (2009)
10. da Silva, P.H., Braga, M.D.V., Machado, R., Dantas, S.: DCJ-indel distance with
distinct operation costs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS,
vol. 7534, pp. 378–390. Springer, Heidelberg (2012)
11. Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permuta-
tions by translocation, inversion and block interchange. Bioinformatics 21(16),
3340–3346 (2005)
12. Yancopoulos, S., Friedberg, R.: DCJ path formulation for genome transformations
which include insertions, deletions, and duplications. Journal of Computational
Biology 16(10), 1311–1338 (2009)
Efficient Local Alignment Discovery
amongst Noisy Long Reads
Gene Myers
MPI for Molecular Cell Biology and Genetics, 01307 Dresden, Germany
myers@mpi-cbg.de
The PacBio RS II sequencer is the first operational “long read” DNA sequencer
[2]. While its error rate is relatively high (ε = 12-15% error), it has two incredibly
powerful offsetting properties, namely, that (a) the set of reads produced is a
nearly Poisson sampling of the underlying genome, and (b) the location of errors
within reads is truly randomly distributed. Property (a), by the Poisson theory
of Lander and Waterman [3], implies that for any minimum target coverage level
k, there exists a level of sequencing coverage c that guarantees that every region
of the underlying genome is covered k times. Property (b), from the early work of
Churchill and Waterman [4], implies that the error of the consensus sequence
of k such sequences is O(ε^k), which goes to 0 as k increases. Therefore, provided
the reads are long enough that repetitive genome elements do not confound
assembling them, then in principle a (near) perfect de novo reconstruction of a
genome at any level of accuracy is possible given enough coverage c.
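Property (a) can be made concrete with a small calculation: under the Lander-Waterman model, the number of reads covering a fixed genome position is Poisson with mean c, so the chance that a position is covered at least k times is one minus a Poisson tail. A minimal sketch (function names are ours, not from the paper):

```python
from math import exp, factorial

# Illustrative companion to property (a): per-position read coverage is
# Poisson with mean c, so P[coverage >= k] is one minus the Poisson tail.
def prob_covered_at_least(k, c):
    return 1.0 - sum(exp(-c) * c ** i / factorial(i) for i in range(k))

# e.g. at c = 30x coverage, depth of at least k = 10 is nearly certain
p = prob_covered_at_least(10, 30.0)
```

This is the sense in which, for any target k, a coverage level c can be chosen so that essentially every region of the genome is covered k times.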
These properties of the reads are in stark contrast to those of existing technolo-
gies where neither property is true.
Supported by the Klaus Tschira Stiftung, Heidelberg, Germany.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 52–67, 2014.
© Springer-Verlag Berlin Heidelberg 2014
All previous technologies make reproducible sequencing errors. A typical rate
of occurrence for these errors is about 10^-4, implying that at best a Q40
reconstruction is possible, whereas in principle any desired
reconstruction accuracy is possible with the long reads, e.g., a Q60 reconstruc-
tion has been demonstrated for E. coli [5]. All earlier technologies also exhibit
clear sampling biases, typically due to a biased amplification or selection step,
implying that many regions of a target genome are not sequenced. For example,
some PCR-based instruments often fail to sequence GC-rich stretches. So
because their error and sampling are unbiased, the new long read technologies
are poised to enable a dramatic shift in the state of the art of de novo DNA
sequencing.
The questions then are (a) what level of coverage c is required for great assem-
bly, i.e. how cost-effectively can one get near the theoretical ideal above, and (b)
how does one build an assembler that works with such high error rates and long
reads? The second question is important because most current assemblers do
not work on such data as they assume much lower error rates and much shorter
reads, e.g. error rates less than 2% and read lengths of 100-250bp. Moreover,
the algorithms within these assemblers are specifically tuned for these operating
points, and some approaches, such as the de Bruijn graph [6], would catastrophically
fail at error rates over 10%.
Finding overlaps is typically the first step in an overlap-layout-consensus (OLC)
assembler design [7] and is the efficiency bottleneck for such assemblers. In this pa-
per, we develop an efficient algorithm and software for finding all significant local
alignments between reads in the presence of the high error rates of the long reads.
Finding local alignments is more general than finding overlaps, and we do so
because it allows us to find repeats, chimers, undetected vector sequence, and other
artifacts that must be detected in order to achieve near-perfect assemblies. To this
author's knowledge, the only previous algorithm and software that can effectively
accommodate the level of error in question is BLASR [8], which was originally designed
as a tool to map long reads to a reference genome but can also be used for the
assembly problem. Empirically, our program, DALIGN, is more sensitive while being
typically 20 to 40 times faster, depending on the data set.
We make use of the same basic filtration concept as BLASR, but realize it
with a series of highly optimized threaded radix sorts (as opposed to a BWT
index [9]). While we did not make a direct comparison here, we believe the
cache coherence and threadability of the simpler sorting approach make it more
time efficient than a more sophisticated but cache-incoherent data structure
such as a suffix array or BWT index. But the real challenge is improving the
speed of finding local alignments at a 30-40% difference rate about a seed hit
from the filter, as this step consumes the majority of the time, e.g. 85% or more
in the case of DALIGN. To find overlaps about a seed hit, we use a novel method
of adaptively computing furthest reaching waves of the classic O(nd) algorithm
[1] augmented with information that describes the match structure of the last
p columns of the alignment leading to a given furthest reaching point. Each
wave on average contains a small number of points, e.g. 8, so that in effect an
alignment is detected in time linear in the number of columns in the alignment.
A simple exercise in induction reveals that the sequence of labels on a path from
(i, j) to (g, h) in the edit graph spells out an alignment between A[i + 1, g] and
B[j + 1, h]. Let a match edge be a diagonal edge for which ai = bj and otherwise
call the diagonal edge a substitution edge. Then if match edges have weight 0 and
all other edges have weight 1, it follows that the weight of a path is the number
of differences in the alignment it models. So our goal in edit graph terms is
to find read subset pairs P such that len(P) ≥ τ and the lowest-scoring path
between (i, j) and (g, h) in the edit graph of A^a versus B^b has cost no more than
2ε · len(P).
In 1986 we presented a simple O(nd) algorithm [1] for comparing two se-
quences that centered on the idea of computing progressive “waves” of furthest
F(d, k) = Slide(k, max{F(d−1, k−1) + (1, 0), F(d−1, k) + (1, 1), F(d−1, k+1) + (0, 1)})   (1)

where Slide(k, (i, j)) = (i, j) + (Δ, Δ) and Δ = max{δ : a_{i+1}a_{i+2} . . . a_{i+δ} = b_{j+1}b_{j+2} . . . b_{j+δ}}.
In words, the f.r. d-point on k can be computed by first finding the furthest
of (a) the f.r. (d − 1)-point on k − 1 followed by an insertion, or (b) the f.r.
(d − 1)-point on k followed by a substitution, or (c) the f.r. (d − 1)-point on
k + 1 followed by a deletion, and thereafter progressing as far as possible along
match edges (a “slide”). Formally, a point (i, j) is furthest if its anti-diagonal,
i + j, is greatest. Next, it follows easily that the cost of the best alignment between
A and B is the smallest d such that (m, n) ∈ W_(0,0)(d), where m and n are the lengths
of A and B, respectively. So the O(nd) algorithm simply computes d-waves from
(0, 0) in order of d until the goal point (m, n) is reached in the dth wave. It
can further be shown that the expected complexity is actually O(n + d^2) under
the assumption that A and B are non-repetitive sequences. In what follows we
will be computing waves adaptively and in both the forward direction, as just
described, and in the reverse direction, which is conceptually simply a matter of
reversing the direction of the edges in the edit graph.
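As a concrete companion to recurrence (1), here is a minimal sketch of the wave computation for plain edit distance: one wave per difference count d, diagonals k = i − j, each furthest-reaching point extended by an insertion, substitution, or deletion and then slid along matching characters. Function and variable names are illustrative, not the paper's implementation:

```python
def levenshtein_waves(a, b):
    """Wave-style edit distance in the spirit of recurrence (1)."""
    n, m = len(a), len(b)
    NEG = -(10 ** 9)

    def slide(i, j):                        # follow 0-weight match edges
        while i < n and j < m and a[i] == b[j]:
            i, j = i + 1, j + 1
        return i

    wave = {0: slide(0, 0)}                 # wave[k] = furthest i on diagonal k
    d = 0
    while wave.get(n - m, NEG) < n:         # until the sink (n, m) is reached
        d += 1
        nxt = {}
        for k in range(-d, d + 1):
            i = max(wave.get(k - 1, NEG) + 1,   # insertion onto diagonal k
                    wave.get(k, NEG) + 1,       # substitution
                    wave.get(k + 1, NEG))       # deletion
            i = min(i, n, m + k)                # clamp to the edit graph
            nxt[k] = slide(i, i - k)
        wave = nxt
    return d
```

The adaptive version described in the text computes the same waves but trims each one before extending it.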
Given blocks A and B of long, noisy reads, we seek to find local alignments
between reads that are sufficiently long (parameter τ ) and sufficiently stringent
(parameter ε). For our application, ε is much larger than typically contemplated in
prior work, 10-15%, but the reads are very long, ~10 Kbp, so τ is large, 1 or 2 Kbp.
Here we build a filter that eliminates read pairs that cannot possibly contain
a local alignment of length τ or more, by counting the number of conserved
k-mers between the reads. A careful and detailed analysis of the statistics of
conserved k-mers in the operating range of ε and τ required by long-read data
has previously been given in the paper describing the BLASR program [8]. So here we
just illustrate the idea by giving a rough estimate, assuming all k-mer matches are
independent events. Under this simplifying assumption, it follows that a given
k-mer is conserved with probability π = (1 − ε)^(2k), and the number of conserved
k-mers in an alignment of τ base pairs is roughly binomially distributed with mean τ·π.
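Taking the rough estimate above with the extraction-garbled ε restored, i.e. assuming π = (1 − ε)^(2k), a quick numeric check (illustrative code, not from the paper) shows why the filter retains signal at long-read error rates:

```python
# Numeric check of the rough estimate, with epsilon restored in the
# garbled formula: pi = (1 - e)**(2*k), expected conserved count = tau * pi.
def expected_conserved_kmers(e, k, tau):
    pi = (1.0 - e) ** (2 * k)
    return pi, tau * pi

pi, mu = expected_conserved_kmers(0.15, 14, 2000)
# pi is about 1/100 here, matching the value quoted later in the text,
# and a 2 Kbp alignment still conserves about 20 k-mers on average.
```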
1. Build the list ListA = {(kmer(A^a, i), a, i)}_{a,i} of all k-mers of the A block
and their positions, where kmer(R, i) is the k-mer R[i − k + 1, i].
2. Similarly build the list ListB = {(kmer(B^b, j), b, j)}_{b,j}.
3. Sort both lists in order of their k-mers.
4. In a merge sweep of the two k-mer-sorted lists, build ListM = {(a, b, i, j) :
kmer(A^a, i) = kmer(B^b, j)} of read and position pairs that have the same
k-mer.
5. Sort ListM lexicographically on a, b, and i, where a is most significant and i
least.
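The five steps can be sketched as follows; a dictionary join stands in for the merge sweep of Step 4, reads are identified by their list positions, and the real implementation uses radix sorts throughout:

```python
from collections import defaultdict

# Compact sketch of Steps 1-5: index the k-mers of the B block, join on
# equal k-mers, and sort ListM on (a, b, i). Names are illustrative.
def kmer_matches(block_a, block_b, k):
    index = defaultdict(list)
    for b, rb in enumerate(block_b):                # ListB, bucketed by k-mer
        for j in range(len(rb) - k + 1):
            index[rb[j:j + k]].append((b, j + k - 1))
    list_m = [(a, b, i + k - 1, j)                  # kmer(R, i) ends at i
              for a, ra in enumerate(block_a)
              for i in range(len(ra) - k + 1)
              for b, j in index[ra[i:i + k]]]
    return sorted(list_m)                           # Step 5: sort on (a, b, i)

matches = kmer_matches(["ACGTACGT"], ["TTACGTAA"], 4)
```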
To keep the following analysis simple, let us assume that the sizes of the two
blocks are both roughly the same, say N. Steps 1 and 2 are easily seen to take
O(N) time and space. The sorts of steps 3 and 5 are in theory O(L log L), where
L is the list size. The only remaining complexity question is how large ListM is.
First note that there is a contribution (i) from k-mers that match by pure random
chance, and (ii) from conserved k-mers that are due to the reads actually being
correlated. The first term is N^2/Σ^k, as we expect to see a given k-mer N/Σ^k
times in each block. For case (ii), suppose that the data set is a c-fold covering
of an underlying genome, and, in the worst case, the A and B blocks are the
same block and contain all the data. The genome is then of size N/c and each
position of the genome is covered by c reads by construction. Because cπ k-mers
are on average conserved amongst the c reads covering a given position,
there are thus (N/c) · (cπ)^2 = Ncπ^2 matching k-mer pairs by non-random
correlations. In most projects c is typically 50-100 whereas π is typically 1/100
(e.g. k = 14 and ε = 15%), implying, somewhat counter-intuitively, that the
non-random contribution is dominated by the random contribution! Thus ListM
is O(N^2/Σ^k) in size, and so in expectation the time for the entire procedure
is dominated by Step 5, which takes O((N^2/Σ^k) log N). Finally, suppose the total
amount of data is M and we divide it into blocks of size Σ^k, all of which are
compared against each other. Then the time for each block comparison is O(kΣ^k)
using O(Σ^k) space, that is, linear time and space in the block size. Finally, there
are M/Σ^k blocks, implying the total time for comparing all blocks is O(kM ·
(M/Σ^k)). So our filter, like all others, still has a quadratic component in terms
of the number of occurrences of a given k-mer in a data set. With linear-time
indices such as BWTs, the time can theoretically be improved by a factor of k.
However, in practice the k arises from a radix sort that actually makes only k/4
passes and is so highly optimized, threaded, and cache coherent that we believe
it likely outperforms a BWT approach by a considerable margin. At the current
time all we can say is that DALIGN which includes alignment finding is 20-40
times faster than BLASR which uses a BWT (see Table 6).
For the sorted list ListM , note that all entries involving a given read pair
(a, b) are in a single contiguous segment of the list after the sort in Step 5. Given
parameters h and s, for each entry (a, b, i, j) in such a segment, we place the
entry in both diagonal bands d = (i − j)/2s and d + 1, and then determine the
number of bases in the A-read covered by k-mers in each pair of adjacent diagonal
bands, i.e. Count(a, b, d) = | ∪ {w(A^a, a, i) : (a, b, i, j) ∈ ListM and (i − j)/2s = d
or d + 1}|. Doing so is easy in linear time in the number of relevant entries, as
they are sorted on i. If Count(a, b, d) ≥ h then we have a hit and we call our
local alignment finding algorithm to be described, with each position (i, j) in the
bucket d unless the position i is already within the range of a local alignment
found with an index pair searched before it. This completes the description of
our filtration strategy and we now turn to its efficient realization.
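The band-counting step can be sketched as follows. Two labeled assumptions: bands here have width 2**s (one reading of the text's "2s"), and the covered-base union is kept naively as a set of A-read positions, with a k-mer match ending at i covering positions i−k+1..i:

```python
# Sketch of the band-counting filter. ASSUMPTIONS: band width 2**s, and
# the coverage union represented as a plain set. Names are illustrative.
def band_counts(list_m, k, s):
    width = 2 ** s
    covered = {}                       # (a, b, band) -> set of A positions
    for a, b, i, j in list_m:
        d = (i - j) // width
        for band in (d, d + 1):        # each entry votes for bands d and d+1
            covered.setdefault((a, b, band), set()).update(
                range(i - k + 1, i + 1))
    return {key: len(pos) for key, pos in covered.items()}
```

A triple (a, b, d) whose count reaches h would then be handed to the local alignment finder.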
for i = 0 to N-1 do
{ b = src[i]_p
trg[bucket[b]] = src[i]
bucket[b] += 1
}
Asymptotically the algorithm takes O(P(N + 2^B)) time, but B and P are fixed
small numbers, so the algorithm is effectively O(N).
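For reference, here is a complete single-threaded radix sort in the style of the placement loop above: P passes over B-bit digits, with bucket[] holding prefix sums so that trg[bucket[b]] is the next free slot for digit value b. This is an illustrative, unthreaded sketch, not the paper's implementation:

```python
# Single-threaded LSD radix sort: P passes over B-bit digits.
def radix_sort(keys, B=8, P=4):
    mask = (1 << B) - 1
    src, trg = list(keys), [0] * len(keys)
    for p in range(P):
        shift = p * B
        count = [0] * (1 << B)
        for x in src:                       # counting sweep
            count[(x >> shift) & mask] += 1
        bucket, total = [0] * (1 << B), 0
        for b in range(1 << B):             # prefix sums -> bucket starts
            bucket[b], total = total, total + count[b]
        for x in src:                       # the placement loop from the text
            b = (x >> shift) & mask
            trg[bucket[b]] = x
            bucket[b] += 1
        src, trg = trg, src
    return src
```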
While papers on threaded sorts are abundant [13], we nevertheless present
our pragmatic implementation of a threaded radix sort, because it uses half the
number of passes over the array that other methods use, and accomplishing this
is non-trivial, as follows. In order to exploit the parallelism of T threads, we let
each thread sort a contiguous segment of size part = N/T of the array src into
the appropriate locations of trg. This requires that each thread t ∈ [0, T − 1] has
its own bucket array bucket[t], where now bucket[t][b] = |{i : src[i]_p < b, or src[i]_p =
b and ⌊i/part⌋ < t}|. In order to reduce the number of sweeps over the arrays by
half, we produce the bucket array for the next pass while performing the current
pass. But this is a bit complex because each thread must count the number of
B-bit numbers in the next pass that will be handled by not only itself but every
other thread separately! That is, if the number at index i will be at index j and
in bucket b in the next pass, then the count in the current pass must be recorded
not for the thread ⌊i/part⌋ currently sorting the number, but for the thread ⌊j/part⌋
that will sort the number in the next pass. To do so requires that we actually
count the number of such events in next[⌊j/part⌋][⌊i/part⌋][b], where next is a
T × T × 2^B array. It remains to note that when src[i] is about to be moved in
the pth pass, then j = bucket[src[i]_p] and b = src[i]_{p+1}. The complete algorithm
is presented below in C-style pseudo-code, where unbound variables are assumed
to vary over the range of the variable. It is easily seen to take O(N/T + T^2) time
assuming B and P are fixed.
int64 MASK = 2^B-1
sort_thread(int t, int bit, int N, int64 *src, int64 *trg, int *bucket, int *next)
{ for i = t*N to (t+1)*N-1 do
    { c = src[i]
      b = (c >> bit) & MASK
      x = bucket[b]
      bucket[b] += 1
      trg[x] = c
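The fused-counting idea behind sort_thread can be sketched single-threaded: while placing keys for pass p (digit at offset shift), histogram the digit of pass p + 1, so the next pass needs no separate counting sweep. In the threaded version this histogram is further indexed by source and destination thread; names here are illustrative:

```python
# Single-threaded sketch of fused counting: place keys for the current
# B-bit digit while histogramming the next digit for the following pass.
def radix_pass_fused(src, trg, bucket, shift, B):
    mask = (1 << B) - 1
    nxt = [0] * (1 << B)                  # bucket counts for the next pass
    for x in src:
        b = (x >> shift) & mask
        trg[bucket[b]] = x                # place x in its bucket
        bucket[b] += 1
        nxt[(x >> (shift + B)) & mask] += 1
    return nxt
```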
computed. We use several strategies to trim the span of a wave by removing f.r.
points that are extremely unlikely to be in the desired local alignment.
A key idea is that a desired local alignment should not, over any reasonable
segment, have an exceedingly low correlation. To this end, imagine keeping a bit
vector B(d, k) that models the last, say, C = 60 columns of the best
path/alignment from ρ to a given f.r. point F(d, k) in the d-wave. That is, a
0 denotes a mismatch in a column of the alignment and a 1 denotes a
match. This is actually relatively easy to do: left-shift in a 0 when taking an
indel or substitution edge and then left-shift in a 1 with each matching edge
of a snake. One can further keep track of exactly how many matches M (d, k)
there are in the alignment by observing the bit that gets shifted out when a new
bit is shifted in. The pseudo-code below computes Wρ (d + 1)[low − 1, hgh + 1]
from Wρ (d)[low, hgh] assuming that [low, hgh] ⊆ [κ − d, κ + d] is the interval
of Wρ(d) that we have decided to retain (to be described below). Note that the
code computes the information for each wave in place within the arrays W, B,
and M, where W simply records the B-coordinate j of each f.r. point (i, j), since we
know the diagonal k of the point and hence that i = j + k.
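The bit-vector bookkeeping can be sketched as follows: a C-bit integer per f.r. point, with the match count maintained from the bit shifted out (names are illustrative):

```python
# Sketch of the last-C-columns bit-vector: shift in a 1 for a match
# column, a 0 otherwise; the match count is updated from the evicted bit.
C = 60
C_MASK = (1 << C) - 1

def push_column(bits, matches, is_match):
    shifted_out = (bits >> (C - 1)) & 1
    bits = ((bits << 1) | (1 if is_match else 0)) & C_MASK
    matches += (1 if is_match else 0) - shifted_out
    return bits, matches

bits, m = 0, 0
for col in [True, True, False, True]:
    bits, m = push_column(bits, m, col)
# m now counts matches among the last C columns
```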
A very simple principle for trimming a wave is to remove f.r. points for which
the last C columns of the alignment have fewer than, say, M matches; we call this
the regional alignment quality. For example, if ε = .15 then one almost certainly
does not want a local alignment that contains a C-column segment for which
M[k] < .55C = 33 when C = 60. A second trimming principle is to keep only f.r.
points which are within L anti-diagonals of the maximal anti-diagonal reached
by its wave. Intuitively, the f.r. point (i, j) on diagonal k∗ on the desired path is
on a greater anti-diagonal i + j than those of the points on either side of it in the
same wave, and as one progresses away from diagonal k∗, the anti-diagonal values
of the wave recede rapidly, giving the wave the appearance of an arrowhead. The
higher the correlation rate of the alignment, the sharper the arrowhead becomes,
and the points far enough behind the tip of the arrow are almost certainly not
points on an optimal local alignment. So for each portion of a wave computed
from the previous trimmed wave, we trim away those f.r. points k in [low − 1, hgh + 1]
that either have M[k] < M or (2W[k∗] + k∗) − (2W[k] + k) > L. In the experimental
section we show that L = 30 is a universally good value for trimming.
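Both trimming rules together can be sketched over f.r. points represented as (diagonal k, B-coordinate j, match count) triples, using the anti-diagonal 2j + k (an illustrative sketch, not the in-place array version):

```python
# Trim a wave: drop points whose last-C-columns match count is below M,
# or whose anti-diagonal lags the wave tip by more than L.
def trim_wave(points, M=33, L=30):
    tip = max(2 * j + k for k, j, _ in points)   # furthest anti-diagonal
    return [(k, j, m) for k, j, m in points
            if m >= M and tip - (2 * j + k) <= L]
```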
While not a formal proof per se, the following argument explains why, in the
empirical results section, we see that the average wave size hgh − low is a constant
for any fixed value of ε, and hence why the alignment-finding algorithm is linear
expected time in the alignment length. Imagine the extension of an f.r. point
that is actually on the path of an alignment with correlation 1 − 2ε or better.
For the next wave, this point jumps forward one difference and then "slides"
on average α = (1 − ε)^2/(1 − (1 − ε)^2) matching diagonals. Contrast this to
an f.r. point off the alignment path, which jumps one difference and then only
slides β = 1/(Σ − 1) diagonals, assuming every base is equally likely. On average,
then, an entry d diagonals away from the alignment path has involved d jumps
from f.r. points off the path, and hence is d(α − β) behind the f.r. point on the
alignment path in the same wave. Thus the average width of a wave trimmed
with lag cutoff L would be less than 2L/(α − β). This last step of the argument
is incorrect, as the statistics of average random path length under the difference
model are more complex than assuming all random steps are the same, but there is
a definite expected value of path length with d differences, and therefore the basis
of the argument holds, albeit with a different value for β. Since α increases as ε
decreases, this further explains why the wave becomes more pointy and narrower
as ε goes to zero.
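Plugging in numbers, with ε restored in the formulas above (an illustrative check, assuming DNA with Σ = 4):

```python
# alpha = expected slide length on the alignment path, beta = off the path.
def slide_lengths(e, sigma=4):
    q = (1 - e) ** 2                # a column matches on the true path
    alpha = q / (1 - q)             # mean of a geometric run of matches
    beta = 1.0 / (sigma - 1)        # random slide length over DNA
    return alpha, beta

alpha, beta = slide_lengths(0.15)
# alpha - beta ~ 2.3 diagonals of lead per difference at epsilon = 15%
```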
The computation of successive waves eventually ends because either (a) the
boundary of the edit graph of A and B is reached, or (b) all the f.r. points fail
the regional alignment quality criterion in which case one can assume that the
two reads no longer correlate with each other. In case (b), one should not report
the best point in the last wave, as the trimming criterion is overly permissive
(e.g. the last 5 columns could all be mismatches!). Because we seek alignments
that have an average correlation rate of 1 − 2ε, we choose to end the path at
a polished point with greatest anti-diagonal, for which the last E ≤ C columns
are such that every suffix of the last E columns has a correlation of 1 − 2ε or
better. We call such alignments suffix positive (at rate ε) for reasons that will
become obvious momentarily. We must then keep track of the polished f.r. point
with greatest anti-diagonal as the waves are computed, which in turn means
that we must test the alignment bit-vector of the leading f.r. point(s) for the
suffix-positive property in each wave.
One can in O(1) time determine if an alignment bit-vector e is suffix positive
by building a 2^E-element table SP[e] as follows. Let the Score of the empty
bit-vector be 0, and recursively let Score(1b) = Score(b) + α and Score(0b) =
Score(b) − β, where α = 2ε and β = 1 − 2ε. Note that if bit-vector b has m
matches and d differences, then Score(b) = αm − βd.
           Score(x−1) + SC[e_x]   if x ≥ 1
Score(x) =                                                              (2)
           0                      if x = 0

            Polish(x−1) and Score(x−1) + SP[e_x] ≥ 0   if x ≥ 1
Polish(x) =
            true                                       if x = 0
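A direct sketch of the underlying suffix-positive test, with ε restored in α = 2ε and β = 1 − 2ε: a bit-vector is suffix positive iff every suffix scores ≥ 0, i.e. every suffix has a match fraction of at least 1 − 2ε. The paper's SC/SP tables are numeric refinements of this; our boolean version and names are illustrative:

```python
# Bit x of e is taken as the x-th most recent column; scanning outward
# from the most recent column visits every suffix of the last E columns.
def suffix_positive(e, E, eps):
    alpha, beta = 2 * eps, 1 - 2 * eps
    score = 0.0
    for x in range(E):
        score += alpha if (e >> x) & 1 else -beta
        if score < -1e-12:          # some suffix falls below rate eps
            return False
    return True

SP = [suffix_positive(e, 8, 0.15) for e in range(1 << 8)]  # E = 8 for brevity
```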
In summary, we compute waves of f.r. points keeping only those that are
locally part of a good alignment and not too far behind the leading f.r. point.
The waves stop either when a boundary is reached, in which case the boundary
point is taken as the end of the alignment, or all possible points are eliminated,
in which case the furthest polished f.r. point is taken as the end of the alignment
(in the given direction). The search takes place both in the forward direction and
the reverse direction from a seed tip ρ. The intervals of A and B at which the
forward and reverse searches end are reported as a local alignment if the alignment
has length τ or more.
Clearly the algorithm is heuristic: (a) it could fail to find an alignment by
incorrectly eliminating an f.r. point on the alignment, and (b) it could
over-report alignments whose correlation is less than 1 − 2ε, as local segments
of worse quality are permitted depending on the setting of M. We will examine
the sensitivity and specificity of the algorithm in the Empirical Performance
section, but for the moment indicate that with reasonable choices of M and L
the algorithm fails less than once in a billion base pairs, i.e. (a) almost never
happens. It is our belief that this heuristic variation of the O(nd) algorithm is
superior to any other filter verification approach for local alignments in the case
of identity matching over DNA while simultaneously being extremely sensitive.
Intuitively this is because the heuristic explores many fewer vertices of the edit
graph than dynamic programming based approaches because in expectation the
span hgh − low of trimmed waves is a small constant, that is, an alignment is
found in linear expected time with near certainty.
6 Empirical Performance
All trials reported in this section were run on a Macbook Pro with a 2.7GHz
Intel Core i7 and the code was compiled with gcc version 4.2.1 with the -O4 level
of optimization set.
For a given setting of ε, we ran trials to determine the sensitivity of the local
alignment algorithm in terms of the trimming parameters M and L. Each trial
consisted of generating a 1Mbp random DNA sequence (with every base equally
likely) and then peppering random differences at rate ε into two distinct
copies. The two perturbed copies were then given to the wave algorithm with
seed point (0, 0). For various settings of the trimming parameters and ε, we ran
1000 trials and recorded (a) what fraction of the trials were successful, in that
the entire 1Mbp alignment between the two copies was reported (Table 2), (b)
the average wave span (Table 3), and (c) the time taken.
Table 1. Observed correlation and effective perturbation for each perturbation rate ε.
(Columns: Perturbation ε, Observed Correlation, Effective Perturbation.)
The first thing we observed was that the perturbed copies of a sequence actually
aligned with much better correlation than 1 − 2ε, and the larger ε, the larger
the relative improvement. We thus define the effective perturbation as the value
of ε for which 1 − 2ε equals the observed correlation. Table 1 gives the observed
correlation and effective perturbation for a range of values of ε.
The success rate and wave span both increase monotonically as L increases
and as M decreases. In Table 2, we observe that achieving a 100% success
rate depends crucially on M being small enough, e.g. M must be 55% or
less when the perturbation is ε = 15%, 60% or less for ε = 10%, and so on.
But one should further note in Table 3 that the average wave span is virtually
independent of M and really depends only on L, at least for the values of M
that are required to have a 100% success rate. One might then think that only
the lag threshold is important and that trimming on M can be dropped, but one
must remember that in the general case, when two sequences stop aligning, it is
the regional alignment quality that stops the extension beyond the end of the local
alignment.
So we then investigated how quickly the waves die off after the end of a local
alignment, with trials where two sequences completely random with respect to
each other were generated and the wave algorithm was called with seed
point (0, 0). We recorded the number of waves traversed in each trial, the av-
erage span of the waves, and the total number of furthest reaching (f.r.) points
computed all together before the algorithm quit. The results are presented in
Table 4. Basically, the total time to terminate grows quadratically in M for large
values, but as M moves towards the rate at which two random DNA sequences
will align (i.e. 48%), the growth in time becomes exponential, going to
infinity at 48%. One can begin to see this at M = 55% in the table.
We timed the local alignment algorithm on 15 operating points in Tables 2 and
3 for which the success rate was 100%, so that each measurement involved exactly
1 billion aligned base pairs. The points covered ε from 1% to 15% and L from 20
to 50. The structure of the algorithm implies that the time it takes should be
a linear function of (a) the number of waves, D, (b) the number of f.r. points
computed, DW̄, where W̄ is the average span of a wave, and (c) the number of
non-random aligned bases followed in snakes, a. But D = εN, and we know that
i + d + 2s + 2a = 2N, where i, d, and s are the number of insertions, deletions, and
substitutions in the alignment found. The latter implies a = N(1 − (1 + σ)ε/2),
where σ is the relative portion of the alignment that is substitutions versus indels.
Thus it follows that the time for the algorithm should be the linear function:

N(α + β · ε + γ · ε · W̄)                                                (3)

for some choice of α, β, and γ. A linear regression on our 15 timing values
gave a correlation of .9995 with the fit:
                  BLASR                      DALIGN
Block Size   Sensitivity  Time (sec.)   Sensitivity  Time (sec.)
   100           87%         2463          98.7%        109
   200           86%         5678          97.5%        222
   400           85%        15334          97.3%        393
in the data set. One should note carefully that for much bigger projects, the
time for alignment is considerably less. For example, a 40X dataset over a 1Gbp
synthetic genome would produce 100 400Mb blocks, but comparing each block
against itself would typically find only 12.3 thousand overlaps. Another way to
look at it is that there will be 100 times more overlaps found, but the filter has
to be run on roughly 5000 block pairs.
Real genomes are highly repetitive, implying that the number of overlaps found
in practical situations is much higher. For example, on the 218Mbp, 31,700-read
E. coli data set produced by PacBio, DALIGN found 1.44 million overlaps in 1256
total seconds (5.36 wall-clock minutes). Moreover, to obtain this result, overly
frequent k-mers had to be suppressed and low-complexity intervals of reads had to
be soft masked. So while the synthetic results above characterize performance in a
well-understood situation, performance on real data is harder to predict. As our last
result, we show in Table 6 the results of timing BLASR and DALIGN on blocks of
various sizes from the PacBio human data set. DALIGN was run with (k, h, s) =
(14, 35, 6), and k-mers occurring more than 20 times were suppressed. BLASR was
run with the parameters used by the PacBio team for their human genome assembly
(private communication, J. Chin), which were “-nCandidates 24 -minMatch 14
-maxLCPLength 15 -bestn 12 -minPctIdentity 70.0 -maxScore 1000 -nproc 4
-noSplitSubreads”. Reads in the block were mapped to the human genome reference
in order to obtain the sensitivity numbers. It is clear that DALIGN is much more
sensitive while also being substantially faster.
References
1. Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1,
251–266 (1986)
2. Eid, J., Fehr, A., . . . (51 authors) . . ., Korlach, J., Turner, S.W.: Real-Time DNA
Sequencing from Single Polymerase Molecules. Science 323(5910), 133–138 (2009)
3. Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones:
a mathematical analysis. Genomics 2(3), 231–239 (1988)
4. Churchill, G.A., Waterman, M.S.: The accuracy of DNA sequences: estimating
sequence quality. Genomics 14(1), 89–98 (1992)
5. Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C.,
Clum, A., Copeland, A., Huddleston, J., Eichler, E.E., Turner, S.W., Korlach, J.:
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing
data. Nature Methods 10, 563–569 (2013)
6. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA
fragment assembly. PNAS 98(17), 9748–9753 (2001)
7. Kececioglu, J., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly.
Algorithmica 13, 7–51 (1995)
8. Chaisson, M.J., Tesler, G.: Mapping single molecule sequencing reads using basic
local alignment with successive refinement (BLASR): application and theory. BMC
Bioinformatics 13, 238–245 (2012)
9. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm.
Technical Report 124, Digital Equipment Corporation (1994)
10. https://github.com/PacificBiosciences/DevNet/wiki/Datasets
11. Manber, U., Myers, E.: Suffix Arrays: A New Method for On-Line String Searches.
SIAM Journal on Computing 22, 935–948 (1993)
12. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms,
3rd edn., pp. 197–204. MIT Press (2009)
13. Yuan, W.: http://projects.csail.mit.edu/wiki/pub/SuperTech/
ParallelRadixSort/Fast Parallel Radix Sort Algorithm.pdf
Efficient Indexed Alignment
of Contigs to Optical Maps
1 Introduction
With the cost of next generation sequencing (NGS) continuing to fall, the last
decade has been witness to the production of draft whole genome sequences
for dozens of species. However, de novo genome assembly, the process of recon-
structing long contiguous sequences (contigs) from short sequence reads, still
produces a substantial number of errors [25,1] and is easily misled by repetitive
regions [26].
One way to improve the quality of assembly is to use secondary informa-
tion (independent of the short sequence reads themselves) about the order and
orientation of contigs. Optical mapping, which constructs ordered genome-wide
high-resolution restriction maps, can provide such information. Optical mapping
is a system that works as follows [4,10]: an ensemble of DNA molecules adhered
to a charged glass plate are elongated by fluid flow. An enzyme is then used
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 68–81, 2014.
c Springer-Verlag Berlin Heidelberg 2014
to cleave them into fragments at loci where the enzyme’s recognition sequence
occurs. Next, the remaining fragments are highlighted with fluorescent dye and
digitally photographed under a microscope. Finally, these images are analyzed
to estimate the fragment sizes, producing a molecular map. Since the fragments
stay relatively stationary during the aforementioned process, the images capture
their relative order and size [23]. Multiple copies of the genome undergo
this process, and a consensus map is formed that consists of an ordered sequence
of fragment sizes, each indicating the approximate number of bases between oc-
currences of the recognition sequence in the genome [2].
The raw optical mapping data identified by the image processing is an ordered
sequence of fragment lengths. Hence, an optical map with x fragments can be
denoted as F = {f1, f2, . . . , fx}, where fi is the length of the ith fragment in base
pairs. This raw data can then be converted into a sequence of locations, each
of which determines where a restriction site occurs. We denote the converted
data as follows: L = {L0 < L1 < · · · < Lx}, where fi = Li − Li−1 for
i = 1, . . . , x, and L0 and Lx are the boundaries of the original molecule, a
segment sheared from the whole genome. This latter representation is convenient
for algorithmic descriptions. The approximate mean and standard deviation of
the fragment size error for current data [31] are zero and 150 bp, respectively.
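This conversion from fragment sizes to cumulative restriction-site locations is a one-pass prefix sum. A minimal sketch (the function name and the zero offset are our illustrative choices, not part of any tool described here):

```python
def fragments_to_locations(fragments, offset=0):
    """Convert ordered fragment sizes f_1, ..., f_x into cumulative
    locations L_0 < L_1 < ... < L_x with f_i = L_i - L_{i-1}."""
    locations = [offset]
    for size in fragments:
        locations.append(locations[-1] + size)
    return locations
```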
See Figure 1 for an illustration of the data produced by this technique. Each
restriction enzyme recognizes a specific nucleotide sequence, so a unique optical
map results from each enzyme, and multiple enzymes can be used in combination
to derive denser optical maps. Optical maps have recently become commercially
available for mammalian-sized genomes¹, allowing them to be used in a variety
of applications.
Although optical mapping data has been used for structural variation detec-
tion [28], scaffolding and validating contigs for several large sequencing projects
— including those for various prokaryote species [24,32,33], Oryza sativa (rice)
[35], maize [34], mouse [9], goat [11], Melopsittacus undulatus (budgerigar) [16],
and Amborella trichopoda [8] — there exist few non-proprietary tools for ana-
lyzing this data. Furthermore, the currently available tools are extremely slow
because most of them were specifically designed for smaller, prokaryote genomes.
Our Contribution. We present the first index-based method for aligning contigs
to an optical map. We call our tool Twin to illustrate the association between
the assembly and optical map as two representations of the genome sequence.
The first step of our procedure is to in silico digest the contigs with the set
of restriction enzymes, computationally mimicking how each restriction enzyme
would cleave the short segment of DNA defined by the contig. Thus, in silico di-
gested contigs are miniature optical maps that can be aligned to the much longer
(sometimes genome-wide) optical maps. The objective is to search and align the
in silico digested contigs to the correct location in the optical map. By using a
suitably-constructed FM-Index data structure [12] built on the optical map, we
¹ OpGen (http://www.opgen.com) and BioNano (http://www.bionanogenomics.com)
are commercial producers of optical mapping data.
Fig. 1. An illustration of the data produced by optical mapping. Optical mapping
locates and measures the distance between restriction sites. Analogous to sequence
data, optical mapping data is produced for multiple copies of the same genome, and
overlapping single molecular maps are analyzed to produce a map for each chromosome.
show that alignments between contigs and optical maps can be computed in time
that is faster than competing methods by more than two orders of magnitude.
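The in silico digestion step described above can be sketched as follows. This is a minimal illustration, not Twin's implementation; for simplicity it cuts at the start of each occurrence of the recognition sequence, whereas a real enzyme cuts at a fixed offset within its recognition site:

```python
def in_silico_digest(contig, recognition_seq):
    """Cut a contig at every occurrence of a restriction enzyme's
    recognition sequence and return the ordered fragment sizes,
    i.e. a miniature optical map for the contig."""
    cut_sites = []
    start = contig.find(recognition_seq)
    while start != -1:
        cut_sites.append(start)  # simplification: cut at occurrence start
        start = contig.find(recognition_seq, start + 1)
    boundaries = [0] + cut_sites + [len(contig)]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]
```

The fragment sizes always sum to the contig length, so no sequence is lost by the digestion.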
Twin takes as input a set of contigs and an optical map, and produces a
set of alignments. The alignments are output in Pattern Space Layout (PSL)
format, allowing them to be visualized using any PSL visualization software,
such as IGV [29]. Twin is specifically designed to work on a wide range of
genomes, from relatively small genomes to large eukaryote genomes. Thus, we
demonstrate the effectiveness of Twin on the Yersinia kristensenii, rice, and
budgerigar genomes. Rice and budgerigar have genomes of total size 430 Mb
and 1.2 Gb, respectively. Yersinia kristensenii, a bacterium with a genome size
of 4.6 Mb, is the smallest genome we considered. Short read sequence data was
assembled for these genomes, and the resulting contigs were aligned to the respective
optical map. We compared the performance of our tool with available competing
methods; specifically, the method of Valouev et al. [30] and SOMA [22]. Twin has
superior performance on all datasets, and is demonstrated to be the only current
method that is capable of completing the alignment for the budgerigar genome in
a reasonable amount of CPU time; SOMA [22] required over 77 days of machine
time to solve this problem, whereas, Twin required just 35 minutes. Lastly, we
verify our approach on simulated E. coli data by showing our alignment method
found correct placements for the in silico digested contigs on a simulated optical
map. Twin is available for download at http://www.cs.colostate.edu/twin.
Roadmap. We review related tools for the problem in the remainder of this
section. Section 2 then sets notation and formally lays out the data-structural
tools we make use of. Section 3 gives details of our approach. We report our
experimental results in Section 4. Finally, Section 5 offers reflections and some
potentially fruitful avenues for future work.
Related Work. The most recent tools to make use of optical mapping data in
the context of assembly are AGORA [19] and SOMA [22]. AGORA [19] uses
the optical map information to constrain de Bruijn graph construction with the
aim of improving the resulting assembly. SOMA [22] is a scaffolding method that
uses an optical map and is specifically designed for short-read assemblies. SOMA
requires an alignment method for scaffolding and implements an O(n²m²)-time
dynamic programming algorithm. Gentig [2], and software developed by Val-
ouev et al. [30] also use dynamic programming to address the closely related
task of finding alignments between optical maps. Gentig is not available for
download. BACop [34] also uses a dynamic programming algorithm and corre-
sponding scoring scheme that gives more weight to contigs with higher fragment
density. Antoniotti et al. [3] consider the unique problem of validating an optical
map by using assembled contigs. This method assumes the contigs are error-free.
Optical mapping data was produced for Assemblathon 2 [6].
2 Background
Strings. Throughout we consider a string X = X[1..n] = X[1]X[2] . . . X[n] of |X| =
n symbols drawn from the alphabet [0..σ − 1]. For i = 1, . . . , n we write X[i..n]
to denote the suffix of X of length n − i + 1, that is X[i..n] = X[i]X[i + 1] . . . X[n].
Similarly, we write X[1..i] to denote the prefix of X of length i. X[i..j] is the
substring X[i]X[i + 1] . . . X[j] of X that starts at position i and ends at j.
An optical map is thus treated as a string of fragment sizes; for example,
F = 2, 4, 5, 3, 5 denotes an optical map with five fragments of the given lengths.
Suffix Arrays. The suffix array [20] SAX (we drop subscripts when they are clear
from the context) of a string X is an array SA[1..n] which contains a permutation
of the integers [1..n] such that X[SA[1]..n] < X[SA[2]..n] < · · · < X[SA[n]..n]. In
other words, SA[j] = i iff X[i..n] is the j th suffix of X in lexicographical order.
SA Intervals. For a string Y, the Y-interval in the suffix array SAX is the in-
terval SA[s..e] that contains all suffixes having Y as a prefix. The Y-interval is
a representation of the occurrences of Y in X. For a character c and a string Y,
the computation of cY-interval from Y-interval is called a left extension.
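The Y-interval and left extension can be illustrated with a naive sketch (ours): we materialize the sorted suffixes and locate the interval by binary search, whereas an FM-index computes a left extension in constant time from rank queries on the BWT. The sentinel '\x7f' is an assumption that sorts after every symbol used in the examples here:

```python
from bisect import bisect_left, bisect_right

def sa_interval(X, Y):
    """Return the Y-interval [s, e) over the lexicographically sorted
    suffixes of X: the range of suffixes having Y as a prefix."""
    suffixes = sorted(X[i:] for i in range(len(X)))
    s = bisect_left(suffixes, Y)
    e = bisect_right(suffixes, Y + "\x7f")  # '\x7f' sorts after A, C, G, T
    return s, e

def left_extension(X, Y, c):
    """Compute the cY-interval for character c (recomputed naively here;
    an FM-index derives it from the Y-interval alone)."""
    return sa_interval(X, c + Y)
```

An empty interval (s == e) means cY does not occur in X, which is what terminates a backward-search branch.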
3 Methods
We find alignments in four steps. First, we convert contigs from the sequence
domain to the optical map domain through the process of in silico digestion.
Second, an FM-index is built from the sequence of optical map fragment sizes.
Third, we execute a modified version of the FM-index backward search algorithm
described in Section 2 that allows inexact matches. As a result of allowing inexact
matches, there may be multiple fragments in an optical map that could each be
a reasonable match for an in silico digested fragment, and in order to include all
of these as candidate matches, backtracking becomes necessary in the backward
search. For every backward search path that maintains a non-empty interval for
the entire query contig, we emit the alignments denoted by the final interval.
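As a conceptual stand-in for this inexact search, the sketch below (ours, not Twin's indexed algorithm) scans every map position and accepts a placement when each contig fragment matches the aligned map fragment within a size tolerance; the backward search explores the same candidate set via backtracking on the index instead of scanning the whole map:

```python
def align_digested_contig(contig_frags, map_frags, tol=150):
    """Report every start position in the optical map where all contig
    fragments match the aligned map fragments within tol base pairs.
    Naive O(x * q) scan; a stand-in for the indexed backward search."""
    q = len(contig_frags)
    hits = []
    for start in range(len(map_frags) - q + 1):
        window = map_frags[start:start + q]
        if all(abs(a - b) <= tol for a, b in zip(contig_frags, window)):
            hits.append(start)
    return hits
```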
In preparation for finding alignments, we also keep two auxiliary data structures.
The first is the suffix array, SAF, corresponding to our FM-index, which
we use to report the positions in F where alignments of a contig occur. While we
could decode the relevant entries of SA on demand with the FM-index in O(p)
time, where p is the so-called sample period of the FM-index, storing SA explicitly
significantly improves runtime at the cost of a modest increase in memory
usage. The second data structure we store is M, which allows us to map from
positions in F to positions in the original genome in constant time.
For each in silico digested contig that has an approximate match in the optical
map, we emit the alignment, converting positions in the fragment string to
positions in the genome using the M table. We provide a script to convert the
human readable output into PSL format.
4 Results
arguments (0.2, 2, 1, 5, 17.43, 0.579, 0.005, 0.999, 3, 1). Twin was run with
Dσ = 4, t = 1000, and [250 . . . 1000] for the range of small fragments. Gentig [2]
and BACop [34] were not available for download so we did not test the data
using these approaches.
The sequence data was assembled for Yersinia kristensenii, rice and budgeri-
gar by using various assemblers. The relevant assembly statistics are given in
Table 1. An important statistic in this table is the number of contigs that have
at least two restriction sites, since contigs with fewer than two are unable to be
aligned meaningfully by any method, including Twin. This statistic was com-
puted to reveal cases of ambiguity in placement from lack of information. Indeed,
Assemblathon 2 required there to be nine restriction sites present in a contig to
align it to the optical mapping data [6]. All experiments were performed on Intel
x86-64 workstations with sufficient RAM to avoid paging, running 64-bit Linux.
The experiments for Yersinia kristensenii, rice and budgerigar illustrate how
each of the programs’ running time scale as the size of the genome increases. How-
ever, due to the possibility of mis-assemblies in these draft genomes, comparing
the actual alignments could possibly lead to erroneous conclusions. Therefore,
we will verify the alignments using simulated E. coli data. See Subsection 4.4 for
this experiment.
Table 1. Assembly and genome statistics for Yersinia kristensenii, rice and budgerigar.
The assembly statistics were obtained from QUAST [15].
The sequence and optical map data for the budgerigar genome were generated
for the Assemblathon 2 project of Bradnam et al. [6]. Sequence data consists of
a combination of Roche 454, Illumina, and Pacific Biosciences reads, providing
16x, 285x, and 10x coverage (respectively) of the genome. All sequence reads
are available at the NCBI Short Read Archive (accession ERP002324). For our
analysis we consider the assembly generated using Celera [21], which was com-
pleted by the CBCB team (Koren and Phillippy) as part of Assemblathon 2 [6].
The optical mapping data was created by Zhou, Goldstein, Place, Schwartz, and
Bechner using the SwaI restriction enzyme and consists of 92 separate pieces.
As with the two previous data sets, Twin found alignments for more contigs
than SOMA on the budgerigar genome. SOMA and Twin found alignments
for 9,668 and 9,826 contigs, respectively, out of 10,019 contigs that could be
aligned to the optical map. However, SOMA required over 77 days of CPU time
and Twin required 35 minutes. The software of Valouev et al. returned 9,814
alignments and required an order of magnitude more CPU time than Twin (6.5 hours).
Hence, Twin was the only method that efficiently aligned the in silico digested
budgerigar genome contigs to the optical map. It should be kept in mind that
the competing methods were developed for prokaryote genomes and so we are
repurposing them at a scale for which they were not designed. Lastly, the amount
of memory used by all the methods on all experiments was low enough for them
to run on a standard workstation.
We were forced to parallelize SOMA due to the enormous amount of CPU
time SOMA required for this dataset. To accomplish this task, the FASTA file
containing the contigs was split into 300 different files, and then IPython Parallel
library was used to invoke up to two instances of SOMA on each machine from a
set of 150 machines. Thus, when using a cluster with up to 300 jobs concurrently,
the alignment for the budgerigar genome took about a day of wall clock time.
In contrast, we ran the software of Valouev et al. and Twin with a single thread
running on a single core. However, it should be noted that the same paralleliza-
tion could have been accomplished for both these software methods too. Also,
even with parallelization of SOMA, Twin is still an order of magnitude faster
than it.
We compared the alignments given by Twin against the alignments of the contigs
of an E. coli assembly to the E. coli (str. K-12 substr. MG1655) reference
genome. Our prior experiments involved species for which the reference genome
may have regions that are mis-assembled; therefore, contig alignments to
the reference genome may be inaccurate and cannot be used for comparison
and verification of the in silico digested contig alignments. The E. coli reference
genome is likely to contain the fewest errors and thus is the one we used for
assembly verification. The sequence data consists of approximately 27 million
paired-end 100 bp reads from E. coli (str. K-12 substr. MG1655) generated by
Illumina, Inc. on the Genome Analyzer (GA) IIx platform; it was obtained
from the NCBI Short Read Archive (accession ERA000206) and was assembled
using SPAdes version 3.0.0 [5] with default parameters. This assembly consists
of 160 contigs, 50 of which contain at least two restriction sites (the minimum
required for any possible optical alignment) and have complete alignments with
minimal (<800 bp) total indels relative to the reference genome.
We simulated an optical map using the reference genome for E. coli (str. K-12
substr. MG1655) since there is no publicly available one for this genome.
The 50 contigs that contained at least two restriction sites were aligned to
the reference genome using BLAT [18]. These same contigs were then in silico
digested and aligned to the optical map using Twin. The resulting PSL files were
then compared. Twin found alignment positions within 10% of those found by
BLAT for all 50 contigs, indicating that our method finds correct alignments.
We repeated this verification approach with both SOMA and the software from
Valouev. All of SOMA’s reported alignments had matching BLAT alignments,
while of the 49 alignments the software from Valouev reported, only 18 could be
matched with alignments from BLAT.
Both Twin and the software of Valouev et al. are able to handle genomes at
least as large as the budgerigar genome directly,
whereas SOMA cannot feasibly complete the alignment for this genome in a
reasonable amount of time without significant parallelization, and even then is
orders of magnitude slower than Twin. Indeed, given its performance on the
budgerigar genome, and its O(m²n²) time complexity, larger genomes seem beyond
SOMA. For example, the loblolly pine tree genome, which is approximately
20 Gb [36], would take SOMA approximately 84 machine years, which, even with
parallelization, is prohibitively long.
Lastly, optical mapping is a relatively new technology, and thus, with so few
algorithms available for working with this data, we feel there remains good op-
portunities for developing more efficient and flexible methods. Dynamic pro-
gramming optical map alignment approaches are still important today, as the
assembly of the consensus optical maps from the individually imaged molecules
often has to deal with missing or spurious restriction sites in the single molecule
maps when enzymes fail to digest a recognition sequence or the molecule breaks.
Though coverage is high (e.g. about 1,241 Gb of optical data was collected for
the 2.66 Gb goat genome), there may be cases where missing restriction site
errors are not resolved by the assembly process. In these rare cases (only 1% of
alignments reported by SOMA on parrot contain such errors) they will inhibit
Twin’s ability to find correct alignments. In essence, Twin is trading a small
degree of sensitivity for a huge speed increase, just as other index-based aligners
have done for sequence data. Sirén et al. [27] recently extended the Burrows-
Wheeler transform (BWT) from strings to directed, acyclic, labeled graphs in
order to support path queries. In future work, an adaptation of this method for
optical map alignment may allow for the efficient handling of missing or spurious
restriction sites.
References
1. Alkan, C., Sajjadian, S., Eichler, E.: Limitations of next-generation genome se-
quence assembly. Nat. Methods 8(1), 61–65 (2010)
2. Anantharaman, T., Mishra, B.: A probabilistic analysis of false positives in optical
map alignment and validation. In: Proc. of WABI, pp. 27–40 (2001)
3. Antoniotti, M., Anantharaman, T., Paxia, S., Mishra, B.: Genomics via optical
mapping IV: Sequence validation via optical map matching. Technical report, New
York University (2001)
4. Aston, C., Schwartz, D.: Optical mapping in genomic analysis. John Wiley and
Sons, Ltd. (2006)
5. Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its
applications to single-cell sequencing. J. Comp. Biol. 19(5), 455–477 (2012)
6. Bradnam, K.R., et al.: Assemblathon 2: evaluating de novo methods of genome
assembly in three vertebrate species. GigaScience 2(1), 1–31 (2013)
7. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm.
Technical Report 124, Digital Equipment Corporation, Palo Alto, California (1994)
8. Chamala, S., et al.: Assembly and validation of the genome of the nonmodel basal
angiosperm Amborella. Science 342(6165), 1516–1517 (2013)
9. Church, D.M., et al.: Lineage-specific biology revealed by a finished genome assem-
bly of the mouse. PLoS Biology 7(5), e1000112+ (2009)
10. Dimalanta, E.T., et al.: A microfluidic system for large DNA molecule arrays. Anal.
Chem. 76(18), 5293–5301 (2004)
11. Dong, Y., et al.: Sequencing and automated whole-genome optical mapping of the
genome of a domestic goat (Capra hircus). Nat. Biotechnol. 31(2), 136–141 (2013)
12. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581
(2005)
13. Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and appli-
cations to information retrieval. Theor. Comput. Sci. 426-427, 25–41 (2012)
14. Gog, S., Petri, M.: Optimized succinct data structures for massive data. Software
Pract. Expr. (to appear)
15. Gurevich, A., Saveliev, V., Vyahhi, N., Tesler, G.: QUAST: quality assessment tool
for genome assemblies. Bioinformatics 29(8), 1072–1075 (2013)
16. Howard, J.T., et al.: De novo high-coverage sequencing and annotated assemblies
of the budgerigar genome (2013)
17. Kawahara, Y., et al.: Improvement of the Oryza sativa Nipponbare reference genome
using next generation sequence and optical map data. Rice 6(4), 1–10 (2013)
18. Kent, W.J.: BLAT – the BLAST-like alignment tool. Genome Res. 12(4), 656–664
(2002)
19. Lin, H., et al.: AGORA: Assembly Guided by Optical Restriction Alignment. BMC
Bioinformatics 12, 189 (2012)
20. Manber, U., Myers, G.W.: Suffix arrays: A new method for on-line string searches.
SIAM J. Comput. 22(5), 935–948 (1993)
21. Miller, J.R., et al.: Aggressive assembly of pyrosequencing reads with mates. Bioin-
formatics 24, 2818–2824 (2008)
22. Nagarajan, N., Read, T.D., Pop, M.: Scaffolding and validation of bacterial genome
assemblies using optical restriction maps. Bioinformatics 24(10), 1229–1235 (2008)
23. Neely, R.K., Deen, J., Hofkens, J.: Optical mapping of DNA: single-molecule-based
methods for mapping genomes. Biopolymers 95(5), 298–311 (2011)
24. Reslewic, S., et al.: Whole-genome shotgun optical mapping of Rhodospirillum
rubrum. Appl. Environ. Microbiol. 71(9), 5511–5522 (2005)
25. Ronen, R., Boucher, C., Chitsaz, H., Pevzner, P.: SEQuel: Improving the Accuracy
of Genome Assemblies. Bioinformatics 28(12), i188–i196 (2012)
26. Salzberg, S.: Beware of mis-assembled genomes. Bioinformatics 21(24), 4320–4321
(2005)
27. Sirén, J., Välimäki, N., Mäkinen, V.: Indexing graphs for path queries with ap-
plications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform. (to
appear, 2014)
28. Teague, B., et al.: High-resolution human genome structure by single-molecule
analysis. Proc. Natl. Acad. Sci. 107(24), 10848–10853 (2010)
29. Thorvaldsdóttir, H., Robinson, J.T., Mesirov, J.P.: Integrative Genomics Viewer
(IGV): High-performance Genomics Data Visualization and Exploration. Brief.
Bioinform. 14(2), 178–192 (2013)
30. Valouev, A., et al.: Alignment of optical maps. J. Comp. Biol. 13(2), 442–462 (2006)
31. VanSteenHouse, H.: Personal communication (2013)
32. Zhou, S., et al.: A whole-genome shotgun optical map of Yersinia pestis strain KIM.
Appl. Environ. Microbiol. 68(12), 6321–6331 (2002)
33. Zhou, S., et al.: Shotgun optical mapping of the entire Leishmania major Friedlin
genome. Mol. Biochem. Parasitol. 138(1), 97–106 (2004)
34. Zhou, S., et al.: A single molecule scaffold for the maize genome. PLoS Genet. 5(11),
e1000711 (2009)
35. Zhou, S., et al.: Validation of rice genome sequence by optical mapping. BMC
Genomics 8(1), 278 (2007)
36. Zimin, A., et al.: Sequencing and assembly of the 22-Gb loblolly pine genome.
Genetics 196(3), 875–890 (2014)
Navigating in a Sea of Repeats in RNA-seq
without Drowning
1 Introduction
Transcriptomes can now be studied through sequencing. However, in the ab-
sence of a reference genome, de novo assembly remains a challenging task. The
main difficulty certainly comes from the fact that sequencing reads are short,
and repeated sequences within transcriptomes could be longer than the reads.
This short read / long repeat issue is of course not specific to transcriptome
sequencing. It is an old problem that has been around since the first algorithms
for genome assembly. In the case of genome assembly, the problem is somewhat
easier because coverage can be used to discriminate contigs that correspond to
repeats, e.g. using Myers' A-statistic [8] or related methods [9]. In transcriptome
assembly, this idea does
not apply, since the coverage of a gene does not only reflect its copy-number
in the genome, but also and mostly its expression level. Some genes are highly
expressed and therefore highly covered, while most genes are poorly expressed
and therefore poorly covered.
Initially, it was thought that repeats would not be a major issue in RNA-
seq, since they are mostly in introns and intergenic regions. However, the truth
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 82–96, 2014.
c Springer-Verlag Berlin Heidelberg 2014
is that many regions which are thought to be intergenic are transcribed [3]
and introns are not always already spliced out when mRNA is collected to be
sequenced. Repeats, especially transposable elements, are therefore very present
in real samples and cause major problems in transcriptome assembly.
Most, if not all current short-read transcriptome assemblers are based on de
Bruijn graphs. Among the best known are Oases [14], Trinity [4], and to a
lesser degree Trans-Abyss [11] and IDBA-tran [10]. Common to all of them
is the lack of a clear and explicit model for repeats in RNA-seq data. Heuristics
are thus used to try and cope efficiently with repeats. For instance, in Oases
short nodes are thought to correspond to repeats and are therefore not used for
assembling genes. They are added in a second step, which hopefully causes genes
sharing repeats not to be assembled together. In Trinity, there is no attempt
to deal with repeats explicitly. The first module of Trinity, Inchworm, will try
and assemble the most covered contig which hopefully corresponds to the most
abundant alternative transcript. Then alternative exons are glued to this major
transcript to form a splicing graph. The last step is to enumerate all alternative
transcripts. If repeats are present, their high coverage may be interpreted as
a highly expressed link between two unrelated transcripts. Overall, assembled
transcripts may be chimeric or spliced into many sub-transcripts.
In the method we developed, KisSplice, which is a local transcriptome as-
sembler [12], repeats may be less problematic, since the goal is not to assemble
full-length transcripts. KisSplice instead aims at finding variations expressed
at the transcriptome level (SNPs, indels and alternative splicings). However, as
we previously reported in [12], KisSplice is not able to deal with large portions
of a de Bruijn graph containing subgraphs associated with highly repeated
sequences, e.g. transposable elements: the so-called complex BCCs.
Here, we try and achieve two goals: (i) give a clear formalization of the no-
tion of repeats with high copy-number in RNA-seq data, and (ii) based on it,
give a practical way to enumerate bubbles that are lost because of such re-
peats. Recall that we are in a de novo context, so we assume that neither a
reference genome/transcriptome nor a database of known repeats, e.g. Repeat-
Masker [15], are available.
First, we formally introduce a model for representing high copy-number re-
peats and exploit its properties to infer a parameter characterizing repeat-
associated subgraphs in a de Bruijn graph. We prove its relevance but we also
show that the problem of identifying, in a de Bruijn graph, a subgraph corre-
sponding to repeats according to such characterization is NP-complete. Hence,
a polynomial time algorithm is unlikely. We then show that in the specific case
of a local assembly of alternative splicing (AS) events, by using a strategy based
on that parameter, we can implicitly avoid such subgraphs. More precisely, it
is possible to find the structures (i.e. bubbles) corresponding to AS events in a
de Bruijn graph that are not contained in a repeat-associated subgraph. Finally,
using simulated RNA-seq data, we show that the new algorithm improves by a
factor of up to 2 the sensitivity of KisSplice, while also improving its precision.
For the specific task of calling AS events, we further show that our algorithm is
more sensitive, by a factor of 2, than Trinity, while also being slightly more
precise. Finally, we give an indication of the usefulness of our method on real
data.
2 Preliminaries
Let Σ be an alphabet of fixed size σ. Here we always assume Σ = {A, C, T, G}.
Given a sequence (string) s ∈ Σ ∗ , let |s| denote its length, s[i] the ith element
of s, and s[i, j] the substring s[i]s[i + 1] . . . s[j] for any 1 ≤ i < j ≤ |s|.
A k-mer is a sequence s ∈ Σ k . Given an integer k and a set S of sequences
each of length n ≥ k, we define span(S, k) as the set of all distinct k-mers that
appear as a substring in S.
Definition 1. Given a set of sequences (reads) R ⊆ Σ ∗ and an integer k, we
define the directed de Bruijn graph Gk (R) = (V, A) where V = span(R, k) and
A = span(R, k + 1).
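Definition 1 translates directly into code. In this sketch (ours), each (k+1)-mer of span(R, k + 1) is stored as the arc from its length-k prefix to its length-k suffix, one natural encoding of the arc set A:

```python
def span(S, k):
    """All distinct k-mers occurring as substrings of sequences in S."""
    return {s[i:i + k] for s in S for i in range(len(s) - k + 1)}

def de_bruijn_graph(R, k):
    """Build G_k(R) = (V, A): V = span(R, k), and each (k+1)-mer in
    span(R, k+1) yields an arc (prefix k-mer, suffix k-mer)."""
    V = span(R, k)
    A = {(w[:-1], w[1:]) for w in span(R, k + 1)}
    return V, A
```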
Given a directed graph G = (V, A) and a vertex v ∈ V , we denote its out-
neighborhood (resp. in-neighborhood) by N + (v) = {u ∈ V | (v, u) ∈ A} (resp.
N−(v) = {u ∈ V | (u, v) ∈ A}), and its out-degree (resp. in-degree) by d+(v) =
|N+(v)| (resp. d−(v) = |N−(v)|). A (simple) path π = s ⇝ t in G is a sequence of
distinct vertices s = v0 , . . . , vl = t such that, for each 0 ≤ i < l, (vi , vi+1 ) is
an arc of G. If the graph is weighted, i.e. there is a function w : A → Q≥0
associating a weight to every arc in the graph, then the length of a path π is the
sum of the weights of the traversed arcs, and is denoted by |π|.
An arc (u, v) ∈ A is called compressible if d+ (u) = 1 and d− (v) = 1. The intu-
ition behind this definition comes from the fact that every path passing through
u should also pass through v. It should therefore be possible to “compress”
or contract this arc without losing any information. Note that the compressed
de Bruijn graph [4,14] commonly used by transcriptomic assemblers is obtained
from a de Bruijn graph by replacing, for each compressible arc (u, v), the vertices
u, v by a new vertex x, where N − (x) = N − (u), N + (x) = N + (v) and the label is
the concatenation of the k-mer of u and the k-mer of v without the overlapping
part (see Fig. 1).
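Compressible arcs can be identified with a single pass over the arc set; a small sketch (ours), using generic vertex labels rather than k-mers:

```python
def compressible_arcs(V, A):
    """Return the arcs (u, v) with d+(u) = 1 and d-(v) = 1: every path
    through u must continue to v, so the arc can be contracted."""
    out_deg = {v: 0 for v in V}
    in_deg = {v: 0 for v in V}
    for u, v in A:
        out_deg[u] += 1
        in_deg[v] += 1
    return {(u, v) for u, v in A if out_deg[u] == 1 and in_deg[v] == 1}
```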
Fig. 1. (a) The arc (CT G, T GA) is the only compressible arc in the given de Bruijn
graph (k = 3). (b) The corresponding compressed de Bruijn graph.
(Figure: a matrix with columns c1, . . . , c10 whose rows s0, s1, . . . , s20 are DNA
sequences, each a mutated copy of the same underlying repeat.)
Proof. The probability that a sequence of length k − 1 occurs at a fixed position in
a randomly chosen sequence of length n is (1/4)^(k−1). Thus the expected number
of appearances of a sequence of length k − 1 in a set of m randomly chosen
sequences of length n is given by m(n − k + 2)(1/4)^(k−1). If m(n − k + 2) ≤ 4^(k−1),
then this value is upper bounded by 1, and all the sequences of length k − 1
are boundary rigid (as a sequence appears once). The claim follows by observing
that there are m(n − k + 1) different k-mers.
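The counting in the proof can be checked numerically; a small sketch (ours) of the expectation m(n − k + 2)(1/4)^(k−1):

```python
def expected_occurrences(m, n, k):
    """Expected number of occurrences of one fixed (k-1)-mer across m
    random length-n sequences over a 4-letter alphabet: there are
    n - k + 2 positions per sequence, each matching with probability
    (1/4)**(k-1)."""
    return m * (n - k + 2) * (1 / 4) ** (k - 1)
```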
We consider now γ(Gk (R)) for R = S(m, n, α). We upper bound the expected
number of compressible arcs by upper bounding the number of boundary rigid
(k − 1)-mers.
Pr[ŝ is boundary rigid | dH(ŝ, ŝ0) = d] ≤ (1 − (2α − (4/3)α²)(1 − α)^(k−1−d) (α/3)^d)^(m−1)

E[X] ≤ (n − k − 1) m Σ_{d=0..k−1} Pr[ŝ is boundary rigid | dH(ŝ, ŝ0) = d]   (1)
≤ (n − k − 1) m exp(−(m − 1)(2α − (4/3)α²)(α/3)^(k−1))

For a sufficiently large number of copies (e.g. m = (3/α)^k), and using the fact
that (3/α)^k ≥ (1/α)^k, we have that E[X] is o(mn). This concludes the proof.
The previous result shows that the number of compressible arcs is a good
parameter for characterizing a repeat-associated subgraph.
Proof. Given a complete graph G = (V, E), a set of terminal vertices N and
an upper bound B, i.e. an instance of STEINER(1, 2), we transform it into an
instance of the Repeat Subgraph Problem for a graph G′ with degree bounded by 3.
Let us first build the graph G′ = (V′, E′). For each vertex v in V \ N, add a
corresponding subgraph r(v) = R(|V|) in G′, and for each vertex v in N, add a
corresponding subgraph r(v) = R(|E| + |V|² + 1) in G′. For each arc (u, v) in E
with weight w ∈ {1, 2}, add a simple directed path composed of w compressible
arcs connecting r(u) to r(v) in G′; these are the subgraphs corresponding to
u and v. The first vertex of the path should be a sink of r(u) and the last
vertex a source of r(v). By construction, there are at least |V| vertices with
in-degree 2 and out-degree 0 (sinks) and |V| vertices with out-degree 2 and
in-degree 0 (sources) in both r(v) and r(u). It is clear that G′ has degree
bounded by 3. Moreover, the size of G′ is polynomial in the size of G and it
can be constructed in polynomial time.
In this way, the graph G has one subgraph for each vertex of G and a path
with one or two (depending on the weight of the corresponding arc) compressible
arcs for each arc of G. Thus, there exists a subgraph spanning N in G with
weight at most B if and only if there exists a subgraph in G with at least
m = 2|N | + 2|E||N | + 2|V |2 |N | vertices and at most t = |B| compressible arcs.
This follows from the fact that any subgraph of G′ with at least m vertices
necessarily contains all the subgraphs r(v) with v ∈ N, since the number
of vertices in all r(v) with v ∈ V \ N is at most |E| + 2|V|^2, and the only
compressible arcs of G′ are in the paths corresponding to the arcs of G.
We can obtain the same result for the specific case of de Bruijn graphs. The
reduction is very similar but uses a different graph family.
Fig. 3. An alternative splicing event in the SCN5A gene (human) trapped inside a
complex region, likely containing repeat-associated subgraphs, in a de Bruijn graph.
The alternative isoforms correspond to a pair of paths shown in red and blue.
KisSplice [12] is a method for de novo calling of AS events through the enu-
meration of so-called bubbles, that correspond to pairs of vertex-disjoint paths in
a de Bruijn graph. The bubble enumeration algorithm proposed in [12] was later
improved in [13]. However, even the improved algorithm is not able to enumerate
all bubbles corresponding to AS events in a de Bruijn graph. There are certain
complex regions in the graph, likely containing repeat-associated subgraphs but
also real AS events [12], where both algorithms take a huge amount of time (see Fig. 3).
90 G. Sacomoto et al.
– The bubbles of Bs(π1, π2, G) that use e, for each arc e = (u1, v) outgoing
from u1, that is, Bs(π1 · e, π2, G − u1), where G − u1 is the subgraph of G
after the removal of u1 and all its incident arcs.
– The bubbles that do not use any arc from u1, that is, Bs(π1, π2, G′), where
G′ is the subgraph of G after the removal of all arcs outgoing from u1.
In order to maintain the invariant (∗), we only perform the recursive calls when
Bs(π1 · e, π2, G − u1) or Bs(π1, π2, G′) are non-empty. In both cases, we have to
decide whether there exists a pair of (internally) vertex-disjoint paths π̄1 from u1 to t1
and π̄2 from u2 to t2, such that |π̄1| ≤ α1, |π̄2| ≤ α2, and π̄1, π̄2 have at most
b1, b2 branching vertices, respectively. Since both the length and the number
of branching vertices are monotonic properties, i.e. the length and the number
of branching vertices of a prefix of a path are at most those of the full
path, we can drop the vertex-disjointness condition. Indeed, let π̄1 and π̄2 be a pair
of paths satisfying all conditions but the vertex-disjointness one. The prefixes
π̄1* from u1 to t* and π̄2* from u2 to t*, where t* is the first intersection of the two
paths, satisfy all conditions and are internally vertex-disjoint. Moreover, using a
dynamic programming algorithm, we can obtain the following result.
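For intuition, the objects being enumerated can be generated by brute force on small graphs. The following sketch (our own illustration of the bubble definition with length bounds, not the polynomial-delay algorithm of [13]; all names are ours) lists ordered pairs of internally vertex-disjoint s-t paths:

```python
from itertools import product

def simple_paths(g, s, t, max_len):
    """All simple s->t paths with at most max_len arcs in graph g
    (dict: vertex -> list of successors)."""
    out, stack = [], [(s, [s])]
    while stack:
        v, path = stack.pop()
        if v == t:
            out.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue
        for w in g.get(v, []):
            if w not in path:          # keep the path simple
                stack.append((w, path + [w]))
    return out

def bubbles(g, s, t, a1, a2):
    """Ordered pairs of distinct, internally vertex-disjoint s->t paths
    with at most a1 and a2 arcs, respectively (brute force)."""
    ps = simple_paths(g, s, t, max(a1, a2))
    return [(p, q) for p, q in product(ps, ps)
            if len(p) - 1 <= a1 and len(q) - 1 <= a2 and p != q
            and not (set(p[1:-1]) & set(q[1:-1]))]
```

On the toy graph `{'s': ['a', 'b'], 'a': ['t'], 'b': ['t']}` this reports the single bubble (in both orders) formed by the paths through a and through b.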
5 Experimental Results
yet spliced). To achieve this, we first ran the FluxSimulator with the Refseq
annotations. We then modified the annotations to include the introns and re-ran
it on this modified version. In this second run, we additionally constrained the
expression values of the pre-mRNAs to be correlated to the expression values of
their corresponding mRNAs, as simulated in the first run. Finally, we mixed the
two sets of reads to obtain a total of 100M reads. We tested two values: 5% and
15% for the proportion of reads from pre-mRNAs. These values were chosen to match
realistic proportions observed in a cytoplasmic mRNA extraction
(5%) and in a total (cytoplasmic + nuclear) mRNA extraction (15%) [16].
On these simulated datasets, we ran KisSplice [12] versions 2.1.0 (KsOld)
and 2.2.0 (KsNew, with a maximum number of branching vertices set to 5) and
obtained lists of detected bubbles that are putative alternative splicing (AS)
events. We also ran the full-length transcriptome assembler Trinity version
r2013-08-14 on both datasets, obtaining a list of predicted transcripts, from
which we then extracted a list of putative AS events.
In order to assess the precision and the sensitivity of our method, we com-
pared our set of found AS events to the set of true AS events. Following the
definition of Astalavista, an AS event is composed of two sets of transcripts,
the inclusion/exclusion isoforms respectively. An AS event is said to be true if at
least one transcript among the inclusion isoforms and one among the exclusion
isoforms is present in the simulated dataset with at least one read. We stress that
this definition is very permissive and includes AS events with very low coverage.
This means that our ground truth, i.e. the set of true AS events, contains some
events that are very hard, or even impossible, to detect. We chose to proceed in
this way as it reflects what happens in real data.
To compare the results of KisSplice with the true AS events, we count
a true AS event as a true positive (TP) if there is a bubble such that one
path matches the inclusion isoform and the other the exclusion isoform. If there
is no such bubble among the results of KisSplice, the event is counted as a false
negative (FN). If a bubble does not correspond to any true AS event, it is counted
as a false positive (FP). To align the paths of the bubbles to transcript sequences,
we used the Blat aligner [7] with 95% identity and a constraint of 95% of each
bubble path length to be aligned (to account for the sequencing errors simulated
by FluxSimulator). We computed the sensitivity TP/(TP+FN) and precision
TP/(TP+FP) for each simulation case and we report their values for various
classes of expression of the minor isoform. Expression values are measured in
reads per kilobase (RPK).
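The computation of the two measures from these counts is straightforward; a minimal sketch under the counting convention described above (all names are ours):

```python
def sensitivity_precision(n_true, n_bubbles, matched_events, matched_bubbles):
    """TP = true AS events matched by some bubble, FN = unmatched true
    events, FP = bubbles matching no true event."""
    tp, fn = matched_events, n_true - matched_events
    fp = n_bubbles - matched_bubbles
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision
```

For example, with 3 true events, 2 reported bubbles, and a single matched event/bubble, the sensitivity is 1/3 and the precision is 1/2.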
(Plots: sensitivity, 0.0–1.0, versus expression class of the minor isoform (x-axis values −2 to 3) for KsNew, KsOld, and Trinity on the two simulated datasets.)
Fig. 4. Sensitivity of KsNew, KsOld and Trinity for several classes of expression
of the minor isoform. Each class (i.e. point in the graph) contains the same number of
AS events (250). It is therefore an average sensitivity on a potentially broad class of
expression.
the 15% pre-mRNA dataset. In this case, the sensitivities for KsNew and KsOld
are 48% and 24%, respectively. This represents an improvement of 100% over the
old version. The results reflect the fact that the most problematic repeats are
in intronic regions. A small unspliced mRNA rate leads to few repeat-associated
subgraphs, so few AS events are drowned in them (and then missed by KsOld);
in this case, the advantage of using KsNew is less obvious. A large proportion
of pre-mRNA, in contrast, leads to more AS events drowned in repeat-associated
subgraphs, which are identified by KsNew but missed by KsOld.
Clearly, any improvement in the sensitivity is meaningless if there is also a
significant decrease in precision. This is not the case here. In both datasets,
KsNew improves the precision of KsOld. It increases from 95% to 98% and
from 90% to 99%, in the 5% and 15% datasets, respectively. The high precision
we obtain indicates that very few FP bubbles, including the ones generated by
repeats, are mistakenly identified as AS events. Moreover, both running times
and memory consumption are very similar for the two versions.
The sensitivity plots for Trinity on the two simulated datasets are also
shown in Fig. 4. In both cases, KsNew performs considerably better than Trinity
over all expression levels, with a larger gap for highly expressed variants.
The overall sensitivity of Trinity for the 5% and 15% pre-mRNA datasets is
18% and 28%, whereas for KsNew we have 37% and 48%, respectively. Simi-
larly to both KsNew and KsOld, the specificity of Trinity improved from the
Fig. 5. One of the bubbles found only by KsNew with the corresponding sequences
mapped to the reference human genome and visualized using the UCSC Genome
Browser. The first two lines correspond to the sequences of, respectively, the
shortest (exon exclusion variant) and longest (exon inclusion variant) paths of the bubble
mapped to the genome. The blue line is the Refseq annotation. The last line shows the
annotated SINE and LINE sequences (transposable elements).
6 Conclusion
Although transcriptome assemblers are now commonly used, the way they handle
repeats is not satisfactory, arguably because the prevalence of repeats in transcriptomes
has so far been underestimated. Given that most RNA-seq datasets correspond
to total mRNA extractions, many introns are still present in the data and
their repeat content cannot simply be ignored. In this paper, we first proposed
a simple formal model for representing high copy-number repeats in RNA-seq
data. Exploiting the properties of this model we established that the number
of compressible arcs is a relevant quantitative characteristic of repeat-associated
subgraphs. We proved that the problem of identifying in a de Bruijn graph a
subgraph with this characteristic is NP-complete. However, this characteristic
drove the design of an algorithm for efficiently identifying AS events that are not
included in repeated regions. The new algorithm was implemented in KisSplice
(KsNew), and by using simulated RNA-seq data, we showed that it improves
by a factor of up to 2 the sensitivity of the previous version of KisSplice, while
also improving its precision. In addition, we compared our algorithm with Trinity
and showed that, for the specific task of calling AS events, our algorithm is
more sensitive, by a factor of 2, while also being slightly more precise. Finally,
we gave an indication of the usefulness of our method on real data.
Clearly our model could be improved, for instance by using a tree-like structure
to take into account the evolutionary nature of repeat (sub)families. Indeed,
many TE families are composed of different subfamilies that can be divergent
from each other. Consider for instance the human Alu family of TEs, which contains
at least 7 high copy-number subfamilies with intra-family divergence less
than 1% and substantially higher inter-family divergence [6]. In this model, the
repeats are generated through a branching process on binary trees. Starting from
the root, to which we associate a sequence s0, the tree is generated recursively
by the following rule: each node gives birth to two children with probability γ
and to a single child with probability 1 − γ. In each case, the child is associated
with a sequence obtained by independently mutating each symbol of the
parent's sequence with probability α. In this way, the height of the tree reflects the
passage of time; the maximum height of the tree corresponds to
the time elapsed since the appearance of the first element of this repeat family.
The leaves are associated with the set of repetitions of s0 in a genome. Besides
representing the generation of copies of transposable elements in a more realistic way,
this would also allow modeling subfamilies of repeats. Indeed, sequences
corresponding to leaves of the same subtree are more similar to each other than
to sequences corresponding to leaves outside the subtree.
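A minimal simulation of this branching process (our own sketch; for simplicity we stop after a fixed number of generations and return the sequences at the last level, every node always producing at least one child):

```python
import random

def mutate(seq, alpha, rng):
    """Independently substitute each symbol with probability alpha."""
    out = []
    for c in seq:
        if rng.random() < alpha:
            c = rng.choice([x for x in "ACGT" if x != c])
        out.append(c)
    return "".join(out)

def repeat_family(s0, gamma, alpha, generations, rng=None):
    """Sequences at the last level of the branching process: each node
    has two children with probability gamma and one child otherwise;
    each child mutates its parent's sequence at per-symbol rate alpha."""
    rng = rng or random.Random(0)
    level = [s0]
    for _ in range(generations):
        nxt = []
        for s in level:
            kids = 2 if rng.random() < gamma else 1
            nxt.extend(mutate(s, alpha, rng) for _ in range(kids))
        level = nxt
    return level
```

With α = 0 all copies are identical, recovering an exact-repeat family; with γ = 1 the number of copies doubles each generation.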
However, a formal mathematical analysis of this model seems more difficult
to obtain. Observe that when α is sufficiently small, this model
converges to the one presented in this paper.
Finally, an interesting open problem remains: how to efficiently enumerate
AS events whose variable region (i.e. the skipped exon) is itself a high
copy-number, low-divergence repeat.
References
1. Bern, M., Plassmann, P.: The Steiner problem with edge lengths 1 and 2. Information Processing Letters (1989)
2. Carroll, M.L., Roy-Engel, A.M., Nguyen, S.V., Salem, A.-H., et al.: Large-scale analysis of the Alu Ya5 and Yb8 subfamilies and their contribution to human genomic diversity. Journal of Molecular Biology 311(1), 17–40 (2001)
3. Djebali, S., Davis, C., Merkel, A., Dobin, A., et al.: Landscape of transcription in
human cells. Nature (2012)
4. Grabherr, M., Haas, B., Yassour, M., Levin, J., et al.: Full-length transcriptome
assembly from RNA-Seq data without a reference genome. Nat. Biot. (2011)
5. Griebel, T., Zacher, B., Ribeca, P., Raineri, E., et al.: Modelling and simulating
generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. (2012)
6. Jurka, J., Bao, W., Kojima, K.: Families of transposable elements, population
structure and the origin of species. Biology Direct 6(1), 44 (2011)
7. Kent, W.J.: BLAT–the BLAST-like alignment tool. Genome Res. 12 (2002)
8. Myers, E., Sutton, G., Delcher, A., Dew, I., et al.: A whole-genome assembly of
drosophila. Science 287(5461), 2196–2204 (2000)
9. Novák, P., Neumann, P., Macas, J.: Graph-based clustering and characterization
of repetitive sequences in next-generation sequencing data. BMC Bioinf. (2010)
10. Peng, Y., Leung, H., Yiu, S.-M., Lv, M.-J., et al.: IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels. Bioinf. 29(13) (2013)
11. Robertson, G., Schein, J., Chiu, R., Corbett, R., et al.: De novo assembly and
analysis of RNA-seq data. Nat. Met. 7(11), 909–912 (2010)
12. Sacomoto, G., Kielbassa, J., Chikhi, R., Uricaru, R., et al.: KISSPLICE: de-
novo calling alternative splicing events from RNA-seq data. BMC Bioinformat-
ics 13(Suppl 6), S5 (2012)
13. Sacomoto, G., Lacroix, V., Sagot, M.-F.: A polynomial delay algorithm for the enu-
meration of bubbles with length constraints in directed graphs and its application
to the detection of alternative splicing in RNA-seq data. In: Darling, A., Stoye, J.
(eds.) WABI 2013. LNCS, vol. 8126, pp. 99–111. Springer, Heidelberg (2013)
14. Schulz, M., Zerbino, D., Vingron, M., Birney, E.: Oases: robust de novo RNA-seq
assembly across the dynamic range of expression levels. Bioinf. (2012)
15. Smit, A.F.A., Hubley, R., Green, P.: RepeatMasker Open-3.0 (1996–2004)
16. Tilgner, H., Knowles, D., Johnson, R., Davis, C., et al.: Deep sequencing of subcel-
lular RNA fractions shows splicing to be predominantly co-transcriptional in the
human genome but inefficient for lncRNAs. Genome Res. (2012)
Linearization of Median Genomes under DCJ
1 Introduction
One of the key computational problems in comparative genomics is the genome
median problem (GMP), which asks to reconstruct a median genome M from
three given genomes such that the total number of genome rearrangements be-
tween M and the given genomes is minimized. The GMP represents a particular
case of the more general ancestral genome reconstruction problem (AGRP) and
is often used as a building block for AGRP solvers [1–5]. The GMP is NP-hard
under several models of genome rearrangements, such as reversals only [6] and
DCJs [7]. While Double-Cut-and-Join (DCJ) operations [8] (also known as 2-
breaks [9]) mimic most common genome rearrangements (i.e., reversals, translo-
cations, fissions, and fusions) and simplify their analysis, they do not take into
account linearity of genome chromosomes. As a result, a solution to the GMP
under DCJ may contain circular chromosomes even if the given genomes are
linear (i.e., consist only of linear chromosomes). We will therefore distinguish
between DCJ genome median problem (DCJ-GMP) and linear genome median
problem (L-GMP), where the latter is restricted to linear genomes.
There exist some advanced DCJ-GMP solvers [10–12], which allow the median
genome to have circular chromosomes. To the best of our knowledge, there exist
no solvers for L-GMP, so we pose the problem of using the solution for DCJ-
GMP to obtain a linear genome approximating the solution to L-GMP. In the
present study, we propose an algorithm that linearizes chromosomes of the given
DCJ-GMP solution in some optimal way. Our method also provides insights into
the combinatorial structure of genome transformations with DCJs with respect
to appearance of circular chromosomes.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 97–106, 2014.
© Springer-Verlag Berlin Heidelberg 2014
98 S. Jiang and M.A. Alekseyev
Fig. 1. Graph representation of a gene sequence gi−1, gi, gi+1 and a DCJ that inverts
the gene gi
¹ Transformations to the median genome may be produced by a DCJ-GMP solver or
constructed directly from the median genome and the given genomes, since finding
a shortest transformation between two genomes is polynomially solvable [19].
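For reference, the pairwise DCJ distance mentioned in the footnote can be computed from the adjacency graph as d = N − (c + i/2), where c is the number of cycles and i the number of odd paths [19]. A sketch under a genome representation of our own choosing (a chromosome is a list of signed genes plus a circularity flag; all function names are ours):

```python
from collections import Counter

def genome_vertices(genome):
    """Adjacency-graph vertices of one genome: adjacencies (two gene
    extremities) and telomeres (one extremity)."""
    verts = []
    for genes, circular in genome:
        ends = [((g, 't'), (g, 'h')) if g > 0 else ((-g, 'h'), (-g, 't'))
                for g in genes]
        for (_, r1), (l2, _) in zip(ends, ends[1:]):
            verts.append((r1, l2))                    # inner adjacency
        if circular:
            verts.append((ends[-1][1], ends[0][0]))   # closing adjacency
        else:
            verts.append((ends[0][0],))               # two telomeres
            verts.append((ends[-1][1],))
    return verts

def dcj_distance(P, Q, n):
    """d_DCJ = n - (cycles + odd_paths/2); one edge per gene extremity
    joins its P-vertex to its Q-vertex."""
    parent, allv = {}, []

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:          # path compression
            parent[x], x = root, parent[x]
        return root

    home = [{}, {}]                       # extremity -> vertex, per genome
    for side, genome in enumerate((P, Q)):
        for v in genome_vertices(genome):
            allv.append((side, v))
            find((side, v))
            for ext in v:
                home[side][ext] = (side, v)

    exts = [(g, e) for g in range(1, n + 1) for e in 'th']
    for ext in exts:                      # union the two endpoints of each edge
        parent[find(home[0][ext])] = find(home[1][ext])

    edges, verts = Counter(), Counter()
    for ext in exts:
        edges[find(home[0][ext])] += 1
    for key in allv:
        verts[find(key)] += 1

    cycles = sum(1 for r in verts if edges[r] == verts[r])
    odd_paths = sum(1 for r in verts
                    if edges[r] == verts[r] - 1 and edges[r] % 2 == 1)
    return n - cycles - odd_paths // 2
```

For example, inverting one gene of a two-gene linear chromosome, or splitting a circular chromosome into two, each costs a single DCJ.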
(Figure: the given genomes G1, G2, G3 and transformations T1, T2, T3 from the median genome M, together with transformations t1, t2, t3 from the linearized genome M′; t0 connects M and M′.)
Σ_{i=1}^{3} dDCJ(M′, Gi) − Σ_{i=1}^{3} dDCJ(M, Gi) = Σ_{i=1}^{3} |ti| − Σ_{i=1}^{3} |Ti|
≤ (|T1| + |T2| + |T3| + c(M)) − (|T1| + |T2| + |T3|) = c(M).
Applying Theorem 1 c(P) − c(Q) times, we easily get the following statement.

Corollary 1. Let P −z→ Q be a transformation between genomes P and Q with
c(P) > c(Q). Then there exists a transformation P −r1→ P(1) −r2→ P(2) −r3→ · · · −rk→
P(k) → Q of the same total length |z|, where k = c(P) − c(Q), r1, r2, . . . , rk are
DCJs, and c(P(k)) = c(Q).
Proof. If c(P ) > c(Q), ϑ must either destroy one circular chromosome in the
genome graph P or combine two circular chromosomes into a new one. In either
case, the two edges removed by ϑ must belong to different chromosomes in P
and at least one of them is circular.
If the two edges removed by ϑ in P belong to distinct chromosomes, one of
which is circular, then ϑ destroys this circular chromosome. Thus, c(Q) < c(P )
unless ϑ creates a new circular chromosome in Q. However, in the latter case
ϑ must also destroy another circular chromosome in P (i.e., ϑ is a fusion on
circular chromosomes), implying that c(Q) < c(P ).
Theorem 2. Let P −ϑ1→ Q −ϑ2→ R be a transformation between genomes P, Q, R
such that ϑ1, ϑ2 are independent DCJs and c(P) ≥ c(Q) > c(R). Then in the
transformation P −ϑ2→ Q′ −ϑ1→ R, we have c(P) > c(Q′).
Proof. Let a and b be the edges removed by ϑ2. Since ϑ1 and ϑ2 are independent,
the edges a and b are present in both P and Q. By Lemma 1, c(Q) > c(R)
implies that in Q one of these edges, say a, belongs to a circular chromosome, which does
not contain b.
Suppose that a belongs to a circular chromosome C in Q. Genome P can
be obtained from genome Q by a DCJ ϑ1^{−1} that reverses ϑ1 (i.e., ϑ1^{−1} replaces
the edges created by ϑ1 with the edges removed by it). We consider two cases,
depending on whether c(P) > c(Q) or c(P) = c(Q).
If c(P) > c(Q), ϑ1^{−1} must either split one circular chromosome in Q into two
or create a new circular chromosome from linear chromosome(s). In the former
case, the edge a belongs to a circular chromosome C′ in P even if ϑ1^{−1} splits the
chromosome C. The set of vertices of C′ is a subset of the vertices of C and thus
does not contain b. In the latter case, C is not affected, while b remains outside it.
Theorem 3. Let P −ϑ1→ Q −ϑ2→ R be a transformation between genomes P, Q, R
such that ϑ2 depends on ϑ1 and c(P) ≥ c(Q) > c(R). Then there exists a transformation
P −ϑ3→ Q′ −ϑ4→ R, where ϑ3 and ϑ4 are DCJs and c(P) > c(Q′).
Fig. 3. Illustration of Lemma 2. a) The initial genome graph, where the dashed edges
represent some gene sequences; the dashed edge and the black undirected edge between
w1 and w2 form a circular chromosome C. b)–d) The intermediate genomes after the first
DCJs in the three equivalent pairs of weakly dependent DCJs. e) The resulting genome
graph after the equivalent pairs of DCJs, where C is destroyed; namely, C is destroyed
by DCJs r2, r3, and r5.
4 Discussion
For three given linear genomes G1, G2, G3 and their DCJ median genome M
(which may contain circular chromosomes), we described an algorithm that constructs
a linear genome M′ such that the approximation accuracy of M′ (i.e.,
the difference in the DCJ median scores of M′ and M) is bounded by c(M),
the number of circular chromosomes in M. In the Appendix we give an example
where c(M) also represents a lower bound on the accuracy of any linearization
of M, and thus our algorithm achieves the best possible accuracy in this case.
It was earlier observed by Xu [11] on simulated data that the number of circular
chromosomes produced by their DCJ-GMP solver is typically very small,
implying that the approximation accuracy of M′ would be very close to 0.
We remark that the proposed algorithm relies on a transformation between
M and one of the genomes G1, G2, G3. For presentation purposes, we chose
it to be G1, but other choices may sometimes result in better approximation
accuracy. It therefore makes sense to apply the algorithm to each of the three
transformations from M to Gi and obtain three corresponding linear genomes
Mi′, among which we select the genome M′ with the minimum DCJ median score. At
the same time, we remark that the linear genomes Mi′ may be quite distant from
each other. In the Appendix, we show that the pairwise DCJ distances between
the genomes Mi′ may be as large as 2/3 · N, where N = |G1| = |G2| = |G3|
is the number of genes in the given genomes.
The proposed algorithm can be viewed as c(M) iterative applications of Theorem 1,
each of which takes at most dDCJ(G1, M) < N steps. Therefore, the
overall time complexity is O(c(M) · N) elementary operations on DCJs (elementary
in the sense of Theorems 2 and 3). The algorithm is implemented in the AGRP solver
MGRA [20, 21].
References
1. Sankoff, D., Cedergren, R.J., Lapalme, G.: Frequency of insertion-deletion,
transversion, and transition in the evolution of 5S ribosomal RNA. Journal of
Molecular Evolution 7(2), 133–149 (1976)
2. Kováč, J., Brejová, B., Vinař, T.: A practical algorithm for ancestral rearrange-
ment reconstruction. In: Przytycka, T.M., Sagot, M.-F. (eds.) WABI 2011. LNCS,
vol. 6833, pp. 163–174. Springer, Heidelberg (2011)
3. Gao, N., Yang, N., Tang, J.: Ancestral genome inference using a genetic algorithm
approach. PLoS One 8(5), e62156 (2013)
4. Moret, B.M., Wyman, S., Bader, D.A., Warnow, T., Yan, M.: A New Implemen-
tation and Detailed Study of Breakpoint Analysis. In: Pacific Symposium on Bio-
computing, vol. 6, pp. 583–594 (2001)
5. Bourque, G., Pevzner, P.A.: Genome-scale evolution: reconstructing gene orders in
the ancestral species. Genome Research 12(1), 26–36 (2002)
6. Caprara, A.: The reversal median problem. INFORMS Journal on Comput-
ing 15(1), 93–113 (2003)
7. Tannier, E., Zheng, C., Sankoff, D.: Multichromosomal median and halving prob-
lems under different genomic distances. BMC Bioinformatics 10(1), 120 (2009)
8. Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permutations
by translocation, inversion and block interchange. Bioinformatics 21(16), 3340–
3346 (2005)
9. Alekseyev, M.A., Pevzner, P.A.: Multi-Break Rearrangements and Chromosomal
Evolution. Theoretical Computer Science 395(2-3), 193–202 (2008)
10. Xu, A.W.: A fast and exact algorithm for the median of three problem: A graph de-
composition approach. Journal of Computational Biology 16(10), 1369–1381 (2009)
11. Xu, A.W.: DCJ median problems on linear multichromosomal genomes: Graph
representation and fast exact solutions. In: Ciccarelli, F.D., Miklós, I. (eds.)
RECOMB-CG 2009. LNCS, vol. 5817, pp. 70–83. Springer, Heidelberg (2009)
12. Zhang, M., Arndt, W., Tang, J.: An exact solver for the DCJ median problem. In:
Pacific Symposium on Biocomputing, vol. 14, pp. 138–149 (2009)
13. Maňuch, J., Patterson, M., Wittler, R., Chauve, C., Tannier, E.: Linearization
of ancestral multichromosomal genomes. BMC Bioinformatics 13(suppl. 19), S11
(2012)
14. Ma, J., Zhang, L., Suh, B.B., Raney, B.J., Burhans, R.C., Kent, W.J., Blanchette,
M., Haussler, D., Miller, W.: Reconstructing contiguous regions of an ancestral
genome. Genome Research 16(12), 1557–1565 (2006)
15. Muffato, M., Louis, A., Poisnel, C.-E., Crollius, H.R.: Genomicus: a database and
a browser to study gene synteny in modern and ancestral genomes. Bioinformat-
ics 26(8), 1119–1121 (2010)
16. Ma, J., Ratan, A., Raney, B.J., Suh, B.B., Zhang, L., Miller, W., Haussler, D.:
Dupcar: reconstructing contiguous ancestral regions with duplications. Journal of
Computational Biology 15(8), 1007–1027 (2008)
17. Alekseyev, M.A.: Multi-break rearrangements and breakpoint re-uses: from circular
to linear genomes. Journal of Computational Biology 15(8), 1117–1131 (2008)
18. Tesler, G.: Efficient algorithms for multichromosomal genome rearrangements.
Journal of Computer and System Sciences 65(3), 587–609 (2002)
19. Bergeron, A., Mixtacki, J., Stoye, J.: A unifying view of genome rearrange-
ments. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS (LNBI), vol. 4175,
pp. 163–173. Springer, Heidelberg (2006)
20. Alekseyev, M.A., Pevzner, P.A.: Breakpoint graphs and ancestral genome recon-
structions. Genome Research 19(5), 943–957 (2009)
21. Jiang, S., Avdeyev, P., Hu, F., Alekseyev, M.A.: Reconstruction of ancestral
genomes in presence of gene gain and loss (2014) (submitted)
Fig. 4. A circular median genome M of three unichromosomal linear genomes M1′, M2′,
M3′ on genes a, b, c with the specified pairwise DCJ distances
An LP-Rounding Algorithm for Degenerate
Primer Design
1 Introduction
Polymerase Chain Reaction (PCR) is an amplification technique widely used in
molecular biology to generate multiple copies of a desired region of a given DNA
sequence. In a PCR process, two short pieces of synthetic DNA called
primers, typically of length 15–30 bases, are required to identify the boundaries of
amplification. This pair of primers, referred to as the forward and reverse primers,
is obtained from the 5' end of the target sequences and of their opposite strand,
respectively. Each primer hybridizes to the 3' end of the other strand and starts
to amplify toward the 5' end.
In applications where a collection of similar sequences needs to be amplified
through PCR, degenerate primers can be used to improve the efficiency and
accuracy of amplification. Degenerate primers [1] can be thought of, conceptually,
as having ambiguous bases at certain positions, that is bases that represent
several different nucleotides. This enables degenerate primers to bind to several
different sequences at once, thus allowing amplification of multiple sequences in
a single PCR experiment. Degenerate primers are represented as strings formed
(Research supported by NSF grant CCF-1217314 and NIH grant 1R01AI078885.)
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 107–121, 2014.
© Springer-Verlag Berlin Heidelberg 2014
108 Y.-T. Huang and M. Chrobak
from IUPAC codes, where each code represents multiple possible alternatives for
each position in a primer sequence (see Table 1).
The degeneracy deg(p) of a primer p is the number of distinct non-degenerate
primers that it represents. For example, the degeneracy of primer p = ACMCM is 4,
because it represents the following four non-degenerate primers: ACACA, ACACC,
ACCCA, and ACCCC.
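The degeneracy is simply the product, over positions, of the number of nucleotides each IUPAC code represents. A small sketch reproducing the example above (the IUPAC table is the standard one; the function names are ours):

```python
from itertools import product

IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
         'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
         'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT'}

def degeneracy(p):
    """deg(p) = product over positions of the number of alternatives."""
    out = 1
    for c in p:
        out *= len(IUPAC[c])
    return out

def expand(p):
    """All non-degenerate primers represented by p."""
    return [''.join(t) for t in product(*(IUPAC[c] for c in p))]
```

For the primer ACMCM this yields degeneracy 4 and exactly the four non-degenerate primers listed above.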
(Example from the original figure: a target sequence 5'--GATGGACTGATTACCGATGACTGGACTTTTCTG--3' and its opposite strand 5'--CAGAAAAGTCCAGTCATCGGTAATCAGTCCATC--3', with a reverse primer 5'--AGAAAAGTCM--3' and a forward primer 5'--TRTAWTGATY--3'; the alignment bars mark matching positions, the forward primer matching only partially.)
Quite obviously, primers with higher degeneracy can cover more target se-
quences, but in practice high degeneracy can also negatively impact the quality
and quantity of amplification. This is because, in reality, degenerate primers are
just appropriate mixtures of regular primers, and including too many primers
in the mixture could lead to problems such as mis-priming, where unrelated
sequences may be amplified, or primer cross-hybridization, where primers may
hybridize to each other. Thus, when designing degenerate primers, it is essential
to find a good balance between high coverage and low degeneracy.
PCR experiments involving degenerate primers are useful in studying the
composition of microbial communities that typically include many different but
similar organisms (see, for example, [2,3]). This variant of PCR is sometimes
referred to as Multiplex PCR (MP-PCR) [4], although in the literature the term
MP-PCR is also used in the context of applications where non-similar sequences
are amplified, in which case using degenerate primers may not be beneficial.
Designing (non-degenerate) primers for MP-PCR applications also leads to in-
teresting algorithmic problems – see, for example, [5] and references therein.
For the purpose of designing primers we can assume that our target sequences
are single-stranded DNA sequences. Thus from now on target sequences will be
represented by strings of symbols A, C, T, and G.
We say that a (degenerate) primer p covers a target sequence s if at least
one of the non-degenerate primers represented by p occurs in s as a substring.
In practice, a primer can often hybridize to the target sequence even if it only
approximately matches the sequence. Formally, we will say that p covers s with
at most m mismatches if there exists a substring s′ of s of length |p| such that some
non-degenerate primer represented by p matches s′ on at least |p| − m positions.
We refer to m as the mismatch allowance.
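Coverage with a mismatch allowance can be tested per window: since the alternatives at different positions of p can be chosen independently, the best non-degenerate primer represented by p mismatches a window exactly on the positions where the window's symbol is not among the alternatives of p. A sketch of our own (names are ours):

```python
IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
         'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
         'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT'}

def covers(p, s, m=0):
    """True if degenerate primer p covers string s with at most m
    mismatches: some length-|p| window of s disagrees with the IUPAC
    alternatives of p in at most m positions."""
    k = len(p)
    return any(sum(s[i + j] not in IUPAC[p[j]] for j in range(k)) <= m
               for i in range(len(s) - k + 1))
```

For instance, ACM covers TTACATT exactly (window ACA), and covers TTAGATT only once one mismatch is allowed (window AGA).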
Following the approach in [6], we model the task as an optimization problem
that can be formulated as follows: given a collection of target sequences, a desired
primer length, and bounds on the degeneracy and mismatch allowance, we want
to find a pair of degenerate primers that meet these criteria and maximize the
number of covered target sequences. We add, however, that, as discussed in
Section 2, there are alternative approaches that emphasize other aspects
of primer design, for example the biological properties of primers.
As in [6], using heuristics taking advantage of properties of DNA sequences,
the above task can be reduced to the following problem, which, though concep-
tually simpler, still captures the core difficulty of degenerate primer design:
Problem: MCDPDmis.
Instance: A set A = {a1, a2, . . . , an} of n target strings over alphabet Σ, each of length k, and integers d (degeneracy threshold) and m (mismatch allowance).
Objective: Find a degenerate primer p of length k and degeneracy at most d that covers the maximum number of strings in A with up to m mismatches.
This reduction involves computing the left and right primers separately, as
well as using local alignment of target sequences to extract target strings that
have the same length as the desired primer. There may be many collections of
such target strings (see Section 4), and only those likely to produce good primer
candidates need to be considered. Once we solve the instance of MCDPDmis for
each collection, obtaining a number of forward and reverse primer candidates,
we select the final primer pair that optimizes the joint coverage, either through
exhaustive search or using heuristic approaches.
The main contribution of this paper is a new algorithm for MCDPDmis ,
called SRRdna , based on LP-rounding. We show that MCDPDmis can be formu-
lated as an integer linear program. (This linear program actually solves a slightly
modified version of MCDPDmis – see Section 3 for details.) Algorithm SRRdna
computes the optimal fractional solution of this linear program, and then uses
110 Y.-T. Huang and M. Chrobak
2 Related Work
The problem of designing high-quality primers for PCR experiments has been
extensively studied and has a vast literature. Much less is known about designing
degenerate primers. The work most relevant to ours is by Linhart and Shamir [7],
who introduced the MCDPDmis model, proved that the problem is NP-hard,
and gave some efficient approximation algorithms.
In [7], the ideas behind these approximation algorithms were incorporated into
a heuristic algorithm, HYDEN, for designing degenerate primers with good
coverage. HYDEN constructs primers of a specified length and with a specified
degeneracy threshold, and it consists of three phases. It first uses a gapless
local alignment algorithm to find the best-conserved regions among the target
sequences. These regions are called alignments. The degree to which an
alignment A is conserved is measured by its entropy score:
HA = − Σ_{j=1}^{k} Σ_{σ∈Σ} (DA(σ, j)/n) · log2 (DA(σ, j)/n),

where DA(σ, j) denotes the number of occurrences of symbol σ in column j of A.
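The entropy score above can be sketched directly (a hypothetical helper, assuming DA(σ, j) counts occurrences of σ in column j; a perfectly conserved alignment scores 0, and lower scores mean better conservation):

```python
# Sketch of the column-entropy score used by HYDEN to rank alignments.
from collections import Counter
from math import log2

def entropy_score(alignment):
    n = len(alignment)          # number of target strings
    k = len(alignment[0])       # common string length
    h = 0.0
    for j in range(k):
        counts = Counter(a[j] for a in alignment)   # D_A(sigma, j) per symbol
        for c in counts.values():
            p = c / n
            h -= p * log2(p)
        # symbols absent from column j contribute 0 (p log p -> 0)
    return h

assert entropy_score(['ACGT', 'ACGT']) == 0.0   # perfectly conserved
assert entropy_score(['AAAA', 'CCCC']) == 4.0   # one bit per column
```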
3 Randomized Rounding
We now present our randomized rounding approach to solving the MCDPDmis
problem defined in the introduction. Recall that in this problem we are given a
collection A of strings over an alphabet Σ, each of the same length k, a degener-
acy threshold d, and a mismatch allowance m, and the objective is to compute
a degenerate primer p of length k and degeneracy at most d, that covers the
maximum number of strings in A with at most m mismatches.
An optimal primer p covers at least one target string ai ∈ A with at most m
mismatches. In other words, p can be obtained from ai by (i) changing at most
m bases in ai to different bases, and (ii) changing some bases in ai to ambiguous
bases that match the original bases, without exceeding the degeneracy limit d.
Let Tmplm (A) denote the set of all strings of length k that can be obtained
from some target string ai ∈ A by operation (i), namely changing up to m
bases in ai . By trying all strings in Tmplm (A), we can reduce MCDPDmis
to its variant where p is required to cover a given template string (without
mismatches). Formally, this new optimization problem is:
Problem: MCDPDmis tmpl .
Instance: A set of n strings A = {a1 , a2 , . . . , an }, each of length k, a
template string p̂, and integers d (degeneracy threshold) and m (mismatch
allowance);
Objective: Find a degenerate primer p of length k, with deg(p) ≤ d that
covers p̂ and covers the maximum number of sequences in A with mis-
match allowance m.
We remark that our algorithm for MCDPDmis will not actually try all pos-
sible templates from Tmplm (A) – there are simply too many of these, if m is
large. Instead, we randomly sample templates from Tmplm (A) and apply the
algorithm for MCDPDmis tmpl only to those sampled templates. The number of
samples affects the running time and accuracy (see Section 5).
We present our algorithm for MCDPDmis tmpl in two steps. In Section 3.1 that
follows, we explain the fundamental idea of our approach, by presenting the
linear program and our randomized rounding algorithm for the case of binary
strings, where Σ = {0, 1}. The extension to DNA strings is somewhat compli-
cated due to the presence of several ambiguous bases. We present our linear
program formulation and the algorithm for DNA strings in Section 3.2.
solving a linear program at each step. At each iteration, the size of the linear
program can be reduced by discarding strings that are too different from the
current p, and by ignoring strings that are already matched by p. More precisely,
any ai which differs from the current p on more than m + log2 d positions cannot
be covered by any degenerate primer obtained from p, so this ai can be discarded.
On the other hand, if ai differs from p on at most m positions then it will
always be covered, in which case we can set xi = 1 and we can also remove it
from A. This pruning process in Algorithm SRRbin is implemented by function
FilterOut.
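The pruning rule can be sketched as follows (an illustrative version of FilterOut for the binary case, not the authors' code): a target differing from p on more than m + log2(d) positions can never be covered, and one differing on at most m positions always is.

```python
# Sketch of the FilterOut pruning step (binary alphabet).
from math import log2

def hamming(p, a):
    return sum(x != y for x, y in zip(p, a))

def filter_out(p, targets, d, m):
    """Return (still_undecided, number_guaranteed_covered)."""
    undecided, covered = [], 0
    for a in targets:
        dist = hamming(p, a)
        if dist <= m:
            covered += 1          # set xi = 1 and drop from the linear program
        elif dist <= m + log2(d):
            undecided.append(a)   # keep in the next linear program
        # else: discard; unreachable within the degeneracy budget
    return undecided, covered

undecided, covered = filter_out('0000', ['0001', '0011', '1111'], d=2, m=1)
assert covered == 1               # '0001' is within m = 1 mismatches
assert undecided == ['0011']      # '1111' differs on 4 > 1 + log2(2) positions
```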
We now present our randomized rounding scheme for MCDPDmis tmpl when the
input consists of DNA sequences.
We start with the description of the integer linear program for MCDPDmis tmpl
with Σ = {A, C, G, T}. Degenerate primers for DNA sequences, in addition to
four nucleotide symbols A, C, G and T, can use eleven symbols corresponding to
ambiguous positions, described by their IUPAC codes M, R, W, S, Y, K, V, H, D, B,
and N. The interpretation of these codes was given in Table 1 in Section 1. Let Λ
denote the set of these fifteen symbols. We think of each λ ∈ Λ as representing
a subset of Σ, and we write |λ| for the cardinality of this subset. For example,
we have |C| = 1, |H| = 3 and |N| = 4.
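The fifteen symbols and their cardinalities can be written down explicitly (a small sanity-check table of the IUPAC codes; deg(p) is the product of subset sizes):

```python
# The fifteen primer symbols and the base subsets they denote (IUPAC codes).
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'M': 'AC', 'R': 'AG', 'W': 'AT', 'S': 'CG', 'Y': 'CT', 'K': 'GT',
    'V': 'ACG', 'H': 'ACT', 'D': 'AGT', 'B': 'CGT', 'N': 'ACGT',
}

def deg(primer):
    """Degeneracy of a primer written with IUPAC symbols."""
    result = 1
    for symbol in primer:
        result *= len(IUPAC[symbol])
    return result

assert len(IUPAC) == 15
assert (len(IUPAC['C']), len(IUPAC['H']), len(IUPAC['N'])) == (1, 3, 4)
assert deg('ACGT') == 1 and deg('ARNT') == 8    # 1 * 2 * 4 * 1
```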
The complete linear program is given below. As for binary sequences, xi in-
dicates whether the i-th target sequence ai is covered. Then the objective of the
linear program is to maximize the primer coverage, that is, Σi xi.

    maximize  Σi xi
    subject to
        mj + rj + wj + sj + yj + kj + vj + hj + dj + bj + nj ≤ 1    for all j
        Σj μij ≤ m    for all i
        Σj [(mj + rj + wj + sj + yj + kj) + log 3 · (vj + hj + dj + bj) + 2 · nj] ≤ log d
To specify the constraints, we now have eleven variables representing the pres-
ence of ambiguous bases in the degenerate primer, namely mj , rj , wj , sj , yj ,
kj , vj , hj , dj , bj , and nj , denoted using letters corresponding to the ambigu-
ous symbols. Specifically, for each position j and for each symbol λ ∈ Λ, the
corresponding variable λj indicates whether p̂j is changed to this symbol in the
computed degenerate primer p. For example, rj represents the absence or pres-
ence of R in position j. For each j, at most one of these variables can be 1, which
can be represented by the constraint that their sum is at most 1.
Variables μij indicate a mismatch between p and ai on position j. Then the
bound on the number of mismatches can be written as Σj μij ≤ m, for each i.
The bound on the degeneracy of the primer p can be written as
deg(p) = Πj 2^(mj +rj +wj +sj +yj +kj) × 3^(vj +hj +dj +bj) × 4^nj ≤ d,
An LP-Rounding Algorithm for Degenerate Primer Design 115
which after taking logarithms of both sides gives us another linear constraint.
In order for ai to be covered (that is, when xi = 1), for each position j
for which aij = p̂j , we must either have a mismatch at position j or we need
aij ⊆ pj . Expressing this with linear constraints can be done by considering
cases corresponding to different values of p̂j and aij . For example, when p̂j = A
and aij = C (or vice versa), then either we have a mismatch at position j (that is,
μij = 1) or pj must be one of ambiguous symbols that match A and C (that is M,
V, H, or N). This can be expressed by the constraint xi ≤ mj + vj + hj + nj + μij .
We will have one such case for any two different choices of p̂j and aij , giving us
six groups of such constraints.
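Generating these constraints can be sketched as follows (illustrative code, not the authors' implementation): for a position where the template base and the target base differ, the admissible ambiguous symbols are exactly those whose base set contains both, giving the constraint xi ≤ Σλ λj + μij.

```python
# Sketch of generating one coverage constraint: list the ambiguous symbols
# whose base set contains both the template base and the target base.
IUPAC = {
    'M': 'AC', 'R': 'AG', 'W': 'AT', 'S': 'CG', 'Y': 'CT', 'K': 'GT',
    'V': 'ACG', 'H': 'ACT', 'D': 'AGT', 'B': 'CGT', 'N': 'ACGT',
}

def admissible_symbols(template_base, target_base):
    return sorted(sym for sym, bases in IUPAC.items()
                  if template_base in bases and target_base in bases)

# The example from the text: template A versus target C allows M, V, H, N,
# yielding the constraint x_i <= m_j + v_j + h_j + n_j + mu_ij.
assert admissible_symbols('A', 'C') == ['H', 'M', 'N', 'V']
```

The six unordered pairs of distinct bases give the six groups of constraints mentioned in the text.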
We then extend our randomized rounding approach from the previous section
to this new linear program. From the linear program, we can see that the integral
solution can be determined from the values of all variables λj , for λ ∈ Λ. In the
fractional solution, a higher value of λj indicates that pj is more likely to be the
ambiguous symbol λ. We thus determine ambiguous bases in p one at a time by
rounding the corresponding variables.
As for binary strings, Algorithm SRRdna will start with p = p̂ and gradually
change some bases in p to ambiguous bases, solving a linear program at each
step. At each iteration we first call function FilterOut that filters out target
sequences that are either too different from the template p̂, so that they cannot
be matched, or too similar, in which case they are guaranteed to be matched. The
pseudocode of Algorithm SRRdna is the same as in Pseudocode 1 except that
the procedure RandRoundingbin is replaced by the corresponding procedure
RandRoundingdna for DNA strings.
If no sequences are left in A then we output p and halt. Otherwise, we con-
struct a linear program for the remaining sequences. This linear program is a
slight modification of the one above, with p̂ replaced by p. Each base pj that was
rounded to an ambiguous symbol is essentially removed from consideration and
will not be changed in the future. Specifically, the constraints on xi associated
with this position j will be dropped from the linear program (because these
constraints apply only to positions where pj ∈ {A, C, G, T}). For each position j
that was already rounded, we appropriately modify the corresponding variables.
If pj = λ, for some λ ∈ Λ − Σ, then the corresponding variable λj is set to 1
and all other variables λj are set to 0. If aij ∈ pj , that is, aij is already matched,
then we set μij = 0, and if aij ∉ pj then we set μij = 1, which effectively reduces
the mismatch allowance for ai in the remaining linear program.
Next, Algorithm SRRdna solves the fractional relaxation of the integer program
constructed in this way, obtaining a fractional solution FracSol. Finally, the algorithm
calls function RandRoundingdna that will round one fractional variable λj to
1. (This represents setting pj to λ.) To choose j and the symbol λ for pj , we
randomly choose a fractional variable λj with probability proportional to its
value, among the undetermined positions. This is done similarly as in the binary case, by summing
up fractional values corresponding to different symbols and positions, and choos-
ing uniformly a random number c between 0 and this sum. This c determines
which variable should be rounded up to 1.
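The proportional choice can be sketched like this (a minimal illustration with hypothetical names; `frac` stands for the fractional LP values of the undetermined variables):

```python
# Sketch of the proportional choice in the rounding step: sum the fractional
# values, draw c uniformly in [0, total), and round up the variable whose
# cumulative interval contains c.
import random

def pick_variable(frac, rng=random.random):
    """frac maps (position, symbol) -> fractional LP value."""
    items = sorted(frac.items())              # fixed order for reproducibility
    total = sum(v for _, v in items)
    c = rng() * total
    running = 0.0
    for key, value in items:
        running += value
        if c < running:
            return key                        # this lambda_j is rounded to 1
    return items[-1][0]                       # guard against rounding error

frac = {(0, 'R'): 0.7, (1, 'N'): 0.2, (2, 'M'): 0.1}
# With the random draw pinned to 0.5, c = 0.5 falls in (0, 'R')'s interval.
assert pick_variable(frac, rng=lambda: 0.5) == (0, 'R')
```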
Table 2. Algorithm SRRdna versus the integral solution obtained with Cplex. The
numbers represent coverage values for the fifteen alignments.
              d = 10000, m = 0              |              d = 625, m = 2
Ai    1  2  3  4  5  6  7  8  9 10 11 12 13 |  1  2  3  4  5  6  7  8  9 10 11 12 13
Opt  26 24 24 24 26 26 24 24 24 26 24 24 24 | 43 42 42 42 43 43 42 42 42 43 42 42 42
SRR  26 24 23 23 26 26 23 24 23 26 24 23 23 | 42 40 42 42 43 43 40 41 42 43 42 42 40
This experiment was repeated for two different settings for m (the mismatch al-
lowance) and d (the degeneracy threshold), namely for (m, d) = (0, 1000), (2, 625).
The results are shown in Table 2. As can be seen from this table, Algorithm SRRdna
computes degenerate primers that are very close, and often equal, to the values ob-
tained from the integer program. Note that for m = 0 the value obtained with the
integer program represents the true optimal solution for the instance of
MCDPDmis , because we try all target strings as templates. For m = 2, to com-
pute the optimal solution we would have to try all template strings in Tmpl2 (Ah ),
which is not feasible; thus the values in the first row are only close approximations
to the optimum.
The linear programs we construct are very sparse. This is because for any
given ai and position j, the corresponding constraint on xi is generated only
when p and ai differ on position j (see Section 3.2), and our data sets are very
conserved. Thus, for sufficiently small data sets one could simply use integral
solutions from Cplex instead of rounding the fractional solution. For example,
the initial linear programs in the above instances had typically around 150 con-
straints, and computing each integral solution took only about 5 times longer
than for the fractional solution (roughly, 0.7s versus 0.15s). For larger datasets,
however, computing the optimal integral solution becomes quickly infeasible.
Computing primers. In the second part (Lines 2-7), the algorithm considers all
alignments Ah computed by Algorithm FindAlignments. For each Ah , we
use the list Th of template strings (see below), and for each p̂ ∈ Th we call
SRRdna (p̂, Ah , d, m) to compute a primer p that is added to the list of primers
PLh . All lists PLh are then combined into the final list of candidate primers.
It remains to explain how to choose the set Th of templates. If the set
Tmplm (Ah ) of all candidate templates is small then one can take Th to be the
whole set Tmplm (Ah ). (For instance, when m = 0 then Tmpl0 (Ah ) = Ah .) In
general, we take Th to be a random sample of r strings from Tmplm (Ah ), where
the value of r is a parameter of the program, which can be used to optimize the
tradeoff between the accuracy and the running time. Each p̂ ∈ Th is constructed
as follows: (i) choose uniformly a random ai ∈ Ah , (ii) choose uniformly a set
of exactly m random positions in ai , and (iii) for each chosen position j in ai ,
set aij to a randomly chosen base, where this base is selected with probability
proportional to its frequency in position j in all sequences from Ah .
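The three-step recipe above can be sketched as follows (an illustrative version, not the authors' code; picking a base directly from the column naturally weights it by its frequency):

```python
# Sketch of drawing one template from Tmpl_m(A_h): random target, m random
# positions, column-frequency-weighted replacement bases.
import random

def sample_template(targets, m, rng):
    a = list(rng.choice(targets))                 # step (i): random target
    positions = rng.sample(range(len(a)), m)      # step (ii): m random positions
    for j in positions:                           # step (iii): replace each
        column = [t[j] for t in targets]          # chosen base, weighted by
        a[j] = rng.choice(column)                 # its frequency in column j
    return ''.join(a)

rng = random.Random(1)
template = sample_template(['ACGT', 'ACGA', 'TCGT'], m=2, rng=rng)
assert len(template) == 4
# Every sampled template differs from some target on at most m positions.
assert min(sum(x != y for x, y in zip(template, t))
           for t in ['ACGT', 'ACGA', 'TCGT']) <= 2
```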
5 Experiments
[Figures: plots comparing the coverage achieved by RRD2P and HYDEN (y-axis: Coverage) across degeneracy thresholds ranging from 625 to 10000; the plotted data points are not recoverable from the extracted text.]
6 Discussion
References
1. Kwok, S., Chang, S., Sninsky, J., Wang, A.: A guide to the design and use of
mismatched and degenerate primers. PCR Methods and Applications 47, S39–S47
(1994)
2. Hunt, D.E., Klepac-Ceraj, V., Acinas, S.G., Gautier, C., Bertilsson, S., Polz, M.F.:
Evaluation of 23S rRNA PCR primers for use in phylogenetic studies of bacterial
diversity. Applied and Environmental Microbiology 72, 2221–2225 (2006)
3. Ihrmark, K., Bödeker, I.T., Cruz-Martinez, K., Friberg, H., Kubartova, A.,
Schenck, J., Strid, Y., Stenlid, J., Brandström-Durling, M., Clemmensen, K.E.,
Lindahl, B.D.: New primers to amplify the fungal ITS2 region – evaluation by
454-sequencing of artificial and natural communities. FEMS Microbiology Ecology 82,
666–677 (2012)
4. Chamberlain, J.S., Gibbs, R.A., Ranier, J.E., Nguyen, P.N., Caskey, C.T.: Deletion
screening of the Duchenne muscular dystrophy locus via multiplex DNA amplifica-
tion. Nucleic Acids Research 16, 11141–11156 (1988)
5. Konwar, K.M., Mandoiu, I.I., Russell, A.C., Shvartsman, A.A.: Improved algo-
rithms for multiplex PCR primer set selection with amplification length constraints.
In: Proc. 3rd Asia-Pacific Bioinformatics Conference, pp. 41–50 (2005)
6. Linhart, C., Shamir, R.: The degenerate primer design problem: theory and appli-
cations. Journal of Computational Biology 12(4), 431–456 (2005)
7. Linhart, C., Shamir, R.: The degenerate primer design problem. Bioinformatics 18
(suppl. 1), S172–S180 (2002)
8. Souvenir, R., Buhler, J.P., Stormo, G., Zhang, W.: Selecting degenerate multiplex
PCR primers. In: Benson, G., Page, R.D.M. (eds.) WABI 2003. LNCS (LNBI),
vol. 2812, pp. 512–526. Springer, Heidelberg (2003)
9. Balla, S., Rajasekaran, S.: An efficient algorithm for minimum degeneracy primer
selection. IEEE Transactions on NanoBioscience 6, 12–17 (2007)
10. Sharma, D., Balla, S., Rajasekaran, S., DiGirolamo, N.: Degenerate primer selec-
tion algorithms. Computational Intelligence in Bioinformatics and Computational
Biology, 155–162 (2009)
11. Duitama, J., Kumar, D.M., Hemphill, E., Khan, M., Mandoiu, I.I., Nelson, C.E.:
Primerhunter: a primer design tool for PCR-based virus subtype identification.
Nucleic Acids Research 37, 2483–2492 (2009)
12. http://dna.engr.uconn.edu/software/PrimerHunter/primerhunter.php
13. Boyce, R., Chilana, P., Rose, T.M.: iCODEHOP: a new interactive program for
designing COnsensus-DEgenerate Hybrid Oligonucleotide Primers from multiply
aligned protein sequences. Nucleic Acids Research 37, 222–228 (2009)
14. http://blocks.fhcrc.org/codehop.html
15. http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html
16. http://www.ncbi.nlm.nih.gov/genbank/collab/
17. Huang, Y.T., Yang, J.I., Chrobak, M., Borneman, J.: Prise2: Software for designing
sequence-selective PCR primers and probes (2013) (in preparation)
GAML: Genome Assembly by Maximum
Likelihood
1 Introduction
The second and third generation sequencing technologies have dramatically de-
creased the cost of sequencing. Nowadays, we have a surprising variety of se-
quencing technologies, each with its own strengths and weaknesses. For example,
Illumina platforms are characterized by low cost and high accuracy, but the reads
are short. Pacific Biosciences, on the other hand, offers long reads at the cost of
quality and coverage. Meanwhile, the cost of sequencing has been brought down
to the point where it is no longer the sole domain of large sequencing centers; even
small labs can experiment with cost-effective genome sequencing. In this setting,
it is no longer possible to recommend a single protocol that should be used to
sequence genomes of a particular size. In this paper, we propose a framework
for genome assembly that allows flexible combination of datasets from different
technologies in order to harness their individual strengths.
Modern genome assemblers are usually based either on the overlap–
layout–consensus framework (e.g. Celera by Myers et al. (2000), SGA by
Simpson and Durbin (2010)), or on de Bruijn graphs (e.g. Velvet by
Zerbino and Birney (2008), ALLPATHS-LG by Gnerre et al. (2011)). Both ap-
proaches can be seen as special cases of a string graph (Myers, 2005), in which
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 122–134, 2014.
c Springer-Verlag Berlin Heidelberg 2014
point of the assembly are short contigs derived from Velvet (Zerbino and Birney,
2008) with very conservative settings in order to avoid assembly errors. We then
use simulated annealing to combine these short contigs into high likelihood as-
semblies (Section 3). We compare our assembler to existing tools on benchmark
datasets (Section 4), demonstrating that we can assemble genomes up to 10 Mbp
long, with N50 sizes and error rates comparable to ALLPATHS-LG or Cerulean.
While ALLPATHS-LG and Cerulean each require a very specific combination of
datasets, GAML works on any combination.
Basics of the likelihood model. The model assumes that individual reads are
independently sampled, and thus the overall likelihood is the product of like-
lihoods of the reads: Pr(R|A) = Πr∈R Pr(r|A). To make the resulting value
independent of the number of reads in set R, we use as the main assembly
score the log average probability of a read, computed as follows: LAP(A|R) =
(1/|R|) Σr∈R log Pr(r|A). Note that maximizing Pr(R|A) is equivalent to max-
imizing LAP(A|R).
If the reads were error-free and each position in the genome was sequenced
equally likely, the probability of observing read r would simply be Pr(r|A) =
nr /(2L), where nr is the number of occurrences of the read as a substring of the
assembly A, L is the length of A, and thus 2L is the length of the two strands
combined (Medvedev and Brudno, 2009). Ghodsi et al. (2013) have shown a dy-
namic programming computation of read probability for more complex models,
accounting for sequencing errors. The algorithm marginalizes over all possible
alignments of r and A, weighting each by the probability that a certain number
of substitution and indel errors would happen during sequencing. In particular,
the probability of a single alignment with m matching positions and s errors
(substitutions and indels) is defined as R(s, m)/(2L), where R(s, m) = ε^s (1 − ε)^m
and ε is the sequencing error rate.
However, full dynamic programming is too time consuming, and in practice
only several best alignments contribute significantly to the overall probability.
Thus Ghodsi et al. (2013) propose to approximate the probability of observing
read r with an estimate based on a set Sr of a few best alignments of r to
Paired reads. Many technologies provide paired reads produced from the op-
posite ends of a sequence insert of certain size. We assume that the insert size
distribution in a set of reads R can be modeled by the normal distribution with
known mean μ and standard deviation σ. The probability of observing paired
reads r1 and r2 can be estimated from sets of alignments Sr1 and Sr2 as follows:
Pr(r1 , r2 |A) ≈ (1/(2L)) Σj1 ∈Sr1 Σj2 ∈Sr2 R(sj1 , mj1 ) R(sj2 , mj2 ) Pr(d(j1 , j2 )|μ, σ)    (2)
As before, mji and sji are the numbers of matches and sequencing errors in
alignment ji respectively, and d(j1 , j2 ) is the distance between the two alignments
as observed in the assembly. If alignments j1 and j2 are in two different contigs,
or on inconsistent strands, Pr(d(j1 , j2 )|μ, σ) is zero.
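Equation (2) can be sketched directly (an illustrative implementation under simplifying assumptions: alignments are given as (matches, errors, contig, strand, position) tuples, the observed distance is the position difference, and "consistent strands" is taken to mean opposite strands on the same contig):

```python
# Sketch of equation (2): paired-read probability from two alignment sets.
from math import exp, pi, sqrt

def R(s, m, eps=0.01):
    """Probability weight of an alignment with s errors and m matches."""
    return eps ** s * (1 - eps) ** m

def norm_pdf(d, mu, sigma):
    """Density of the insert-size distribution N(mu, sigma)."""
    return exp(-((d - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def paired_prob(S1, S2, L, mu, sigma):
    total = 0.0
    for (m1, s1, c1, str1, pos1) in S1:
        for (m2, s2, c2, str2, pos2) in S2:
            if c1 != c2 or str1 == str2:
                continue              # different contigs / inconsistent strands
            d = abs(pos2 - pos1)      # observed insert size in the assembly
            total += R(s1, m1) * R(s2, m2) * norm_pdf(d, mu, sigma)
    return total / (2 * L)

S1 = [(100, 0, 'ctg1', '+', 0)]
S2 = [(100, 1, 'ctg1', '-', 500), (100, 0, 'ctg2', '-', 0)]
p = paired_prob(S1, S2, L=10_000, mu=500, sigma=50)
assert p > 0                          # only the consistent ctg1 pair contributes
```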
Reads that have no good alignment to A. Some reads or read pairs do not align
well to A, and as a result, their probability Pr(r|A) is very low; our approxi-
mation by a set of high-scoring alignments can even yield zero probability if set
Sr is empty. Such extremely low probabilities then dominate the log likelihood
score. Ghodsi et al. (2013) propose a method that assigns such a read a score
approximating the situation when the read would be added as a new contig to
the assembly. We modify their formulas for variable read length, and use score
e^(c+kℓ) for a single read of length ℓ, or e^(c+k(ℓ1 +ℓ2 )) for a pair of reads of lengths
ℓ1 and ℓ2 . Values k and c are scaling constants set similarly as in Ghodsi et al.
(2013). These alternative scores are used instead of the read probability Pr(r|A)
whenever the probability is lower than the score.
Multiple read sets. Our work is specifically targeted at a scenario, where we have
multiple read sets obtained from different libraries with different insert lengths or
even with different sequencing technologies. We use different model parameters
for each set and compute the final score as a weighted combination of log average
probabilities for individual read sets R1 , . . . , Rk :
LAP(A|R1 , . . . , Rk ) = w1 LAP(A|R1 ) + . . . + wk LAP(A|Rk ) (3)
In our experiments we use weight wi = 1 for most datasets, but we lower the
weight for Pacific Biosciences reads, because otherwise they dominate the likeli-
hood value due to their longer length. The user could also increase or decrease
weights wi of individual sets based on their reliability.
126 V. Boža, B. Brejová, and T. Vinař
Penalizing spuriously joined contigs. The model of Ghodsi et al. (2013) does
not penalize obvious misassemblies when two contigs are joined together with-
out any evidence in the reads. We have observed that to make the likelihood
function applicable as an optimization criterion for the best assembly, we need
to introduce a penalty for such spurious connections. We say that a particular
base j in the assembly is connected with respect to read set R if there is a read
which covers base j and starts at least k bases before j, where k is a constant
specific to the read set. In this setting, we treat a pair of reads as one long read.
If the assembly contains d disconnected bases with respect to R, penalty αd is
added to the LAP(A|R) score (α is a scaling constant).
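The connectedness check can be sketched as follows (an illustrative version with reads represented as (start, end) intervals on the assembly, end exclusive; names are hypothetical):

```python
# Sketch of the connectedness check: base j is connected w.r.t. a read set if
# some read covers j and starts at least k bases before it.
def disconnected_bases(length, reads, k):
    connected = [False] * length
    for start, end in reads:
        # a read (start, end) connects bases j with start + k <= j < end
        for j in range(start + k, end):
            if 0 <= j < length:
                connected[j] = True
    return sum(not c for c in connected)

# With k = 2, a single read (0, 5) connects bases 2..4 only, so a length-8
# assembly has 5 disconnected bases, contributing penalty alpha * 5.
assert disconnected_bases(8, [(0, 5)], k=2) == 5
```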
Complex probabilistic models, like the one described in Section 2, were pre-
viously used to compare the quality of several assemblies (Ghodsi et al., 2013;
Rahman and Pachter, 2013; Clark et al., 2013). In our work, we instead attempt
to find the highest likelihood assembly directly. Of course, the search space is
huge, and the objective function too complex to admit exact methods. Here,
we describe an effective optimization routine based on the simulated annealing
framework (Eglese, 1990).
Our algorithm for finding the maximum likelihood assembly consists of three
main steps: preprocessing, optimization, and postprocessing. In preprocessing,
we decrease the scale of the problem by creating an assembly graph, where ver-
tices correspond to contigs and edges correspond to possible adjacencies between
contigs supported by reads. In order to make the search viable, we will restrict
our search to assemblies that can be represented as a set of walks in this graph.
Therefore, the assembly graph should be built in a conservative way, where the
goal is not to produce long contigs, but rather to avoid errors inside them. In
the optimization step, we start with an initial assembly (a set of walks in the
assembly graph), and iteratively propose changes in order to optimize the as-
sembly likelihood. Finally, postprocessing examines the resulting walks and splits
some of them into shorter contigs if there are multiple equally likely possibilities
of resolving ambiguities. This happens, for example, when the genome contains
long repeats that cannot be resolved by any of the datasets.
In the rest of this section, we discuss individual steps in more detail.
Fig. 1. Examples of proposal moves. (a) Walk extension joining two walks. (b)
Local improvement by addition of a new loop. (c) Repeat interchange.
Proposals of new assemblies are created from the current assembly using the
following moves:
– Walk extension. (Fig.1a) We start from one end of an existing walk and
randomly walk through the graph, at every step uniformly choosing one
of the edges outgoing from the current node. Each time we encounter the
end of another walk, the two walks are considered for joining. We randomly
(uniformly) decide whether we join the walks, end the current walk without
joining, or continue walking.
– Local improvement. (Fig.1b) We optimize the part of some walk connecting
two long contigs s and t. We first sample multiple random walks starting
from contig s. In each walk, we only consider nodes from which contig t is
reachable. Then we evaluate these random walks and choose the one that
increases the likelihood the most. If the gap between contigs s and t is too
big, we instead use a greedy strategy where in each step we explore multiple
random extensions of the walk (of length around 200bp) and pick the one
with the highest score.
– Repeat optimization. We optimize the copy number of short tandem repeats.
We do this by removing or adding a loop to some walk. We precompute the
list of all short loops (up to five nodes) in the graph and use it for adding
loops.
– Joining with advice. We join two walks that are spanned by long reads or
paired reads with long inserts. We first select a starting walk, align all reads
to it, and randomly choose a read whose other end lies outside the current
walk. Then we find the node to which this other end belongs and join the
appropriate walks. If possible, we fill the gap between the two
walks using the same procedure as in the local improvement move. Otherwise
we introduce a gap filled with Ns.
– Disconnecting. We remove a path through short contigs connecting two long
contigs in the same walk, resulting in two shorter walks.
– Repeat interchange. (Fig.1c) If a long contig has several incoming and out-
going walks, we optimize the pairing of incoming and outgoing edges. In
particular, we evaluate all moves that exchange parts of two walks through
this contig. If one of these changes improves the score, we accept it and
repeat this step, until the score cannot be improved at this contig.
At the beginning of each annealing step, the type of the move is chosen ran-
domly; each type of move has its own probability. We also choose randomly the
contig at which we attempt to apply the move.
Note that some moves (e.g. local improvement) are very general, while other
moves (e.g. joining with advice) are targeted at specific types of data. This does
not contradict a general nature of our framework; it is possible to add new moves
as new types of data emerge, leading to improvement when using specific data
sets, while not affecting the performance when such data is unavailable.
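The overall loop described above can be sketched with a generic simulated-annealing skeleton (all names hypothetical; move types are drawn with their own probabilities, and worse assemblies are accepted with the usual Boltzmann probability since we maximize the LAP score):

```python
# Generic simulated-annealing skeleton for the move-based search.
import math
import random

def anneal(initial, score, moves, weights, steps, t0, rng):
    current, current_score = initial, score(initial)
    for step in range(1, steps + 1):
        temperature = t0 / step                   # a simple cooling schedule
        move = rng.choices(moves, weights=weights)[0]
        candidate = move(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score   # positive = improvement
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
    return current

# Toy usage: maximize -(x - 3)^2 with a single "local improvement" move.
rng = random.Random(0)
best = anneal(0.0, lambda x: -(x - 3) ** 2,
              [lambda x, r: x + r.uniform(-1, 1)], [1.0],
              steps=2000, t0=1.0, rng=rng)
assert abs(best - 3) < 1.0
```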
For each window, we keep the position and edit distance of all alignments. In
each annealing step, we identify which windows of the assembly were modified.
We glue together overlapping windows and align reads against these sequences
using a read mapping tool. Finally, we use alignments in all windows to calculate
the probability of each read and combine them into the score of the whole as-
sembly. This step requires careful implementation to ensure that we count each
alignment exactly once.
To speed up read mapping even more, we use a simple prefiltering scheme,
where we only align reads which contain some k-mer (usually k = 13) from the
target sequence. In the current implementation, we store an index of all k-mers
from all reads in a simple hash map. In each annealing step, we can therefore
iterate over all k-mers in the target portion of the genome and retrieve reads that
contain them. We use a slightly different filtering approach for PacBio reads. In
particular, we take all reasonably long contigs (at least 100 bases) and align
them to PacBio reads. Since BLASR can find alignments where a contig and a
read overlap by only around 100 bases, we can use these alignments as a filter.
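The k-mer prefilter for short reads can be sketched as follows (an illustrative version of the hash-map index described above, not the authors' implementation):

```python
# Sketch of the k-mer prefilter: index all k-mers of the reads in a hash map,
# then fetch only reads sharing at least one k-mer with the modified window.
from collections import defaultdict

def build_index(reads, k):
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            index[read[j:j + k]].add(i)
    return index

def candidate_reads(index, target, k):
    hits = set()
    for j in range(len(target) - k + 1):
        hits |= index.get(target[j:j + k], set())
    return hits

reads = ['ACGTACGT', 'TTTTTTTT', 'GGGACGTA']
index = build_index(reads, k=4)
# Only reads sharing a 4-mer with the window survive the filter.
assert candidate_reads(index, 'ACGTAAAA', k=4) == {0, 2}
```

Only the surviving candidates are then passed to the read mapper, which keeps each annealing step fast.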
4 Experimental Evaluation
In each scenario, we use the short insert Illumina reads (SA1 or EC1) in Velvet
with conservative settings to build the initial contigs and assembly graph. In the
LAP score, we give Illumina datasets weight 1 and PacBio dataset weight 0.01.
The results are summarized in Tab.2. Note that none of the assemblers con-
sidered here can effectively run in all three of these scenarios, except for GAML.
In the first scenario, GAML performance ranks third among zero-error as-
semblers in N50 length. The best N50 assembly is given by ALLPATHS-LG
(Gnerre et al., 2011). A closer inspection of the assemblies indicates that GAML
missed several possible joins. One such miss was caused by a 4.5 kbp repeat, while
the longest insert size in this dataset is 3.5 kbp. Even though in such cases it
is sometimes possible to reconstruct the correct assembly thanks to small dif-
ferences in the repeated regions, the difference in likelihood between alternative
repeat resolutions may be very small. Another missed join was caused by a se-
quence coverage gap penalized in our scoring function. Perhaps in both of these
cases the manually set constants may have caused GAML to be overly conser-
vative. Otherwise, the GAML assembly seems very similar to the one given by
ALLPATHS-LG.
In the second scenario, Pacific Biosciences reads were employed instead of
jump libraries. These reads pose a significant challenge due to their high er-
ror rate, but they are very useful due to their long length. Assemblers such
as Cerulean (Deshpande et al., 2013) deploy special algorithms tailored to this
technology. GAML, even though not explicitly tuned to handle Pacific Bio-
sciences reads, builds an assembly with N50 size and the number of scaffolds
very similar to that of Cerulean. In N50, both programs are outperformed by
PacbioToCA (Koren et al., 2012), however, this is again due to a few very long
repeats (approx. 5000 bp) in the reference genome which were not resolved by
GAML or Cerulean. (Deshpande et al. (2013) also aim to be conservative in
repeat resolution.) Note that in this case, simulated annealing failed to give
the highest likelihood assembly among those that we examined, so perhaps our
results can be improved by tuning the likelihood optimization.
Finally, the third scenario shows that the assembly quality can be hugely
improved by including a long jump library, even if its coverage is very small
(we used 0.5× coverage in this experiment). This requires a flexible genome
assembler; in fact, only Celera (Myers et al., 2000) can process this data, but
the GAML assembly is clearly superior. We also attempted to run ALLPATHS-
LG, but the program could not process this combination of libraries. Compared
to the previous scenario, the GAML N50 size increased approximately 7-fold (and
approximately 4-fold compared to the best N50 among the second-scenario assemblies).
132 V. Boža, B. Brejová, and T. Vinař
5 Conclusion
References
Chaisson, M.J., Tesler, G.: Mapping single molecule sequencing reads using basic lo-
cal alignment with successive refinement (BLASR): application and theory. BMC
Bioinformatics 13(1), 238 (2012)
Clark, S.C., Egan, R., Frazier, P.I., Wang, Z.: ALE: a generic assembly likelihood eval-
uation framework for assessing the accuracy of genome and metagenome assemblies.
Bioinformatics 29(4), 435–443 (2013)
Deshpande, V., Fung, E.D.K., Pham, S., Bafna, V.: Cerulean: A hybrid assembly using
high throughput short and long reads. In: Darling, A., Stoye, J. (eds.) WABI 2013.
LNCS, vol. 8126, pp. 349–363. Springer, Heidelberg (2013)
Eglese, R.: Simulated annealing: a tool for operational research. European Journal of
Operational Research 46(3), 271–281 (1990)
English, A.C., Richards, S., et al.: Mind the gap: upgrading genomes with Pacific
Biosciences RS long-read sequencing technology. PLoS One 7(11), e47768 (2012)
Ghodsi, M., Hill, C.M., Astrovskaya, I., Lin, H., Sommer, D.D., Koren, S., Pop, M.: De
novo likelihood-based measures for comparing genome assemblies. BMC Research
Notes 6(1), 334 (2013)
Gnerre, S., MacCallum, I., et al.: High-quality draft assemblies of mammalian genomes
from massively parallel sequence data. Proceedings of the National Academy of
Sciences 108(4), 1513–1518 (2011)
Huang, W., Li, L., Myers, J.R., Marth, G.T.: ART: a next-generation sequencing read
simulator. Bioinformatics 28(4), 593–594 (2012)
Koren, S., Schatz, M.C., et al.: Hybrid error correction and de novo assembly of single-
molecule sequencing reads. Nature Biotechnology 30(7), 693–700 (2012)
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nature
Methods 9(4), 357–359 (2012)
Medvedev, P., Brudno, M.: Maximum likelihood genome assembly. Journal of Compu-
tational Biology 16(8), 1101–1116 (2009)
Medvedev, P., Pham, S., Chaisson, M., Tesler, G., Pevzner, P.: Paired de Bruijn graphs:
a novel approach for incorporating mate pair information into genome assemblers.
Journal of Computational Biology 18(11), 1625–1634 (2011)
Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(suppl 2), ii79–ii85
(2005)
Myers, E.W., Sutton, G.G., et al.: A whole-genome assembly of Drosophila.
Science 287(5461), 2196–2204 (2000)
Pham, S.K., Antipov, D., Sirotkin, A., Tesler, G., Pevzner, P.A., Alekseyev, M.A.:
Pathset graphs: a novel approach for comprehensive utilization of paired reads in
genome assembly. Journal of Computational Biology 20(4), 359–371 (2013)
Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni,
A., Swerdlow, H.P., Gu, Y.: A tale of three next generation sequencing platforms:
comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC
Genomics 13(1), 341 (2012)
Rahman, A., Pachter, L.: CGAL: computing genome assembly likelihoods. Genome
Biology 14(1), R8 (2013)
Salzberg, S.L., Phillippy, A.M., et al.: GAGE: a critical evaluation of genome assemblies
and assembly algorithms. Genome Research 22(3), 557–567 (2012)
Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the
FM-index. Bioinformatics 26(12), i367–i373 (2010)
Varma, A., Ranade, A., Aluru, S.: An improved maximum likelihood formulation for
accurate genome assembly. In: Computational Advances in Bio and Medical Sciences
(ICCABS 2011), pp. 165–170. IEEE (2011)
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de
Bruijn graphs. Genome Research 18(5), 821–829 (2008)
A Common Framework for Linear and Cyclic
Multiple Sequence Alignment Problems
1 Introduction
Although until recently considered a rare oddity, circular RNAs have been identified
as a quite common phenomenon in eukaryotic as well as archaeal transcriptomes.
In Mammalia, thousands of circular RNAs have been reported, together with
evidence for regulation of miRNAs and transcription [1]; in Archaea, "expected"
circRNAs such as excised tRNA introns and intermediates of rRNA processing, as
well as many circular RNAs of unknown function, have been revealed [2]. Most
methods to comparatively analyze biological sequences require the computation
of multiple alignments as a first step. While this task has received plenty of
attention for linear sequences, comparably little is known for the corresponding
circular problem. Although most bacterial and archaeal genomes are circular, this
fact can be ignored for the purpose of constructing alignments, because genome-
wide alignments are always anchored locally and then reduce to linear alignment
problems. Even for mitochondrial genomes, with a typical size of 10-100kb, an-
chors are readily identified so that alignment algorithms for linear sequences are
applicable. This situation is different, however, for short RNAs such as viroid
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 135–147, 2014.
c Springer-Verlag Berlin Heidelberg 2014
136 S. Will and P.F. Stadler
and other small satellite RNAs [3] or the abundant circular non-coding RNAs
in Archaea [2, 4].
Cyclic pairwise alignment problems were considered e.g. in [5–8], often with
applications to 2D shape recognition rather than to biological sequences. Cyclic
alignments can obviously be computed by dynamic programming by linearizing
the sequences at a given match, resulting in a quintic-time algorithm for general
gap cost functions, obtained by solving O(n²) general linear alignment problems in cubic
time [9]. For affine gap costs, the linear problem is solved in quadratic time by the
Needleman-Wunsch algorithm, resulting in a quartic-time solution for the cyclic
problem. For linear gap costs, an O(n² log n) cyclic alignment algorithm exists,
capitalizing on the fact that alignment traces do not intersect in this case [7]. A
variant was applied to identify cyclically permuted repeats [10]. This approach
does not generalize to other types of cost functions, however. In [11], a cubic-
time dynamic programming solution is described for affine gap cost. Multiple
alignments of cyclic sequences have received very little attention. A progressive
alignment algorithm was implemented and applied to viroid phylogeny [11].
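The linearization strategy described above can be sketched concretely: try every rotation of one sequence and run a standard quadratic DP on each, which is cubic overall. Unit edit costs stand in here for a general quadratic-time linear alignment; this is a didactic sketch, not an implementation from the paper:

```python
def edit_distance(a, b):
    """Standard quadratic DP with unit costs; a stand-in for any
    quadratic-time linear alignment algorithm."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1])) # (mis)match
        prev = cur
    return prev[n]

def cyclic_edit_distance(a, b):
    """Linearize the first sequence at every possible cut point:
    O(n) calls to an O(n^2) DP, i.e., cubic time overall."""
    return min(edit_distance(a[i:] + a[:i], b) for i in range(len(a)))

assert cyclic_edit_distance("abc", "cab") == 0  # "cab" is a rotation of "abc"
```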
Since the multiple alignment problem is NP-hard for all interesting scoring
functions [12–14], we focus here on a versatile ILP formulation that can easily ac-
commodate both linear and circular input strings. To this end, we start from the
combinatorial characterization of multiple alignments of linear sequences [15, 16]
and represent multiple alignments as order-preserving partitions, a framework
that lends itself to a natural ILP formulation. This approach is not restricted
to pure sequence alignments but also covers various models of RNA sequence-
structure alignments, requiring modifications of the objective function only.
sequence strings as finite, totally ordered sets. The actual letters s_{a,k} impact only
the scoring function, which we largely consider as given. For convenience, we
refer to the sequence index a ∈ {1, . . . , M} of the sequence string s(a) simply as
a sequence. The tuple (a, k) denotes sequence position k in sequence a. We denote
the set of sequence positions by X = {(a, k) | 1 ≤ a ≤ M, 1 ≤ k ≤ n_a}. Each
sequence carries a natural order < on its sequence positions. The order < on the
Cyclic and Linear MSAs 137
(IR) A ≠ B
(NC) (a, i), (b, j) ∈ A and (a, k) ∈ B then, for every (b, l) ∈ B holds i < k
implies j < l and i > k implies j > l.
(C) There is a ∈ {1, . . . , M } such that (a, i) ∈ A, (a, j) ∈ B and i < j.
By definition, ≺ is irreflexive (IR) and antisymmetric (i.e., A ≺ B implies not
B ≺ A). As the example in Fig. 2.1 shows, ≺ is not transitive. Note that A ≺ B
implies A ∩ B = ∅. We say that A and B are non-crossing if (NC) holds.
We say that A and B are comparable if A = B, A ≺ B, or B ≺ A. The
example in Fig. 2.1 shows that the transitive closure of ≺ is not irreflexive
(and consequently, not antisymmetric) in general.
A multiple sequence alignment (X, A, ≺) on X is a partition A of X such
that, for all A, B ∈ A, holds
The above construction critically depends on the existence of the linear order
< on the input sequences. On circular sequences, however, such a linear order
exists only locally. Instead there is a natural cyclic order. A ternary relation,
written i ◁ j ◁ k, on a set V is a cyclic order [18] if for all i, j, k ∈ V the following hold:
(cO1) i ◁ j ◁ k implies that i, j, k are pairwise distinct. (irreflexive)
(cO2) i ◁ j ◁ k implies k ◁ i ◁ j. (cyclic)
(cO3) i ◁ j ◁ k implies ¬(k ◁ j ◁ i). (antisymmetric)
(cO4) i ◁ j ◁ k and i ◁ k ◁ l implies i ◁ j ◁ l. (transitive)
(cO5) If i, j, k are pairwise distinct, then i ◁ j ◁ k or k ◁ j ◁ i. (total)
If only (cO1) to (cO4) hold, the relation is a partial cyclic order. A pair of points (p, q)
is adjacent in a total cyclic order on V if there is no h ∈ V such that p ◁ h ◁ q.
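These axioms are easy to test mechanically. A small sketch (encoding the ternary relation as a set of triples, an assumption of this illustration) derives the canonical total cyclic order induced by a circular arrangement and checks two of the axioms:

```python
from itertools import permutations

def cyclic_triples(circle):
    """Canonical total cyclic order of a circular arrangement: the triple
    (i, j, k) holds iff, walking clockwise from i, we meet j strictly
    before k."""
    pos = {v: p for p, v in enumerate(circle)}
    n = len(circle)
    return {(i, j, k)
            for i, j, k in permutations(circle, 3)
            if (pos[j] - pos[i]) % n < (pos[k] - pos[i]) % n}

rel = cyclic_triples(['a', 'b', 'c', 'd'])
assert all((k, i, j) in rel for i, j, k in rel)      # (cO2) cyclic
assert all((k, j, i) not in rel for i, j, k in rel)  # (cO3) antisymmetric
```

Since the order is total, exactly one of (i, j, k) and (k, j, i) holds for each triple of pairwise distinct points.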
In contrast to recognizing partial (linear) orders, i.e., testing for acyclicity
of (the transitive closure of) (X, ≺), the corresponding problem for cyclic
orders is NP-complete [19]. The conditions (cO1) through (cO4), however, are
easy enough to translate into ILP constraints. Furthermore, the multiple sequence
alignment problem is NP-complete already for linear sequences; thus, the extra
complication arising from the cyclic ordering problem is irrelevant.
Cyclic orders can be linearized by cutting them at any point, resulting in
a linear order with the cut point as its minimal (or maximal) element [20]. A
trivial variation on this construction is to insert an additional cut point 0 between
two adjacent points to obtain a linearized order that has a copy of the artificial
cut point as its minimal and maximal element, respectively. Formally, let ◁ be a
total cyclic order on V; furthermore, let p and q be adjacent points in this order
on V. Then the relation ◁ ∪ {(p, 0, q)} is a total cyclic order on the set V ∪ {0}.
The corresponding linearization is (V, <_{p0q}) with i <_{p0q} j and j <_{p0q} k iff i ◁ j ◁ k
for all i, j, k ∈ V. Of course, this can be extended by adding a (distinct) copy of
0 as both the minimal and the maximal element, i.e., V ∪ {0−, 0+} is also totally
ordered, if we set 0− <_{p0q} k and k <_{p0q} 0+ for all k ∈ V.
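The cut construction is easy to illustrate on a concrete circular arrangement (a toy sketch: the element after the cut point becomes minimal and the element before it maximal, mirroring the roles of 0− and 0+):

```python
def linearize(circle, p):
    """Cut a circular arrangement immediately after element p (i.e., between
    the adjacent points p and its clockwise successor q), yielding a linear
    order in which q is minimal and p is maximal."""
    i = circle.index(p) + 1
    return circle[i:] + circle[:i]

# cutting between the adjacent points 'b' and 'c' makes 'c' minimal, 'b' maximal
assert linearize(['a', 'b', 'c', 'd'], 'b') == ['c', 'd', 'a', 'b']
```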
Cyclic alignments can thus, as intuition suggests, equivalently be characterized
as linear alignments with a cut. The virtue of the technicalities above
is that we do not have to make the cut explicit by renumbering; instead, we
specify its position in terms of the circular orders on the input sequences.
ILP approaches to (linear) MSAs have so far been based on variables for individual
alignment edges, i.e., x_{ai,bj} = 1 if position i of sequence a is aligned with position
j in sequence b. In this picture, an alignment is viewed as a graph on X. The
correspondence between partitions A of X and the graph Γ(X, A, <) is very
simple: {(a, i), (b, j)} is an alignment edge if and only if a ≠ b and there is
A ∈ A such that (a, i) ∈ A and (b, j) ∈ A. It is customary to view Γ(X, A, <) by
connecting consecutive positions of the same sequence by a directed arc [21, 22].
A cycle Z in Γ (X, A, <) is called mixed if it contains at least one directed arc
and all arcs are oriented along the cycle. Z is critical if all vertices in the same
sequence occur consecutively along Z.
Proposition 1 ([21, 22]). A partition (X, A, ≺) satisfying (A1) is a MSA if
and only if Γ (X, A, ≺) contains no critical mixed cycle.
This observation forms the basis for the current ILP-based MSA implementa-
tions. Circular alignments, of course, have an analogous graph representation
Γ (X, A, ). The discussion of the previous section immediately implies
Proposition 2. A partition A of X is a cyclic MSA if and only if there is a cut
∅ for A such that the graph Γ(X∅, A ∪ {∅}) contains no critical mixed cycle
that does not intersect ∅.
The only extra complication for cMSAs is that an explicit representation of the
cut is required and only mixed cycles that do not cross the cut are inconsistent
with axioms (cA1), (cA2), and (cA3).
The main difficulty of using the mixed cycle condition in an ILP framework is
that there are exponentially many potential critical mixed cycles. While, conceptually,
the Maximum Weight Trace (MWT) ILP formulation includes all critical
mixed cycle inequalities, it is practically infeasible to feed all those constraints to an
ILP solver and apply branch-and-bound. This dilemma was resolved in [21] by
means of the branch-and-cut scheme: starting without mixed cycle constraints,
selected inequalities are added iteratively on demand during the branch-and-
bound optimization. A polynomial separation algorithm works at the core of
this approach: given a solution of the LP corresponding to the current ILP in-
stance, it selects a critical mixed cycle inequality that removes this solution. This
inequality is added to the current ILP and the process is iterated.
Instead of constructing the cyclic MSA ILP based on the graph formalization
of MSA as MWT, we devise here novel ILP formulations that build directly
on the partition formalization of MSAs. Remarkably, this model requires only
polynomially many variables and constraints, whereas the MWT formulation
requires an exponential number of constraints. Our formulation is based exclusively
on Boolean variables; consequently, we omit the constraints restricting all variables
to the domain {0, 1} for brevity.
Partition. A MSA (or cMSA) is represented by at most N = Σ_{a=1}^{M} n_a =
|X| classes α of positions x ∈ X. As before, we denote the set of classes by
A; we use Greek letters α, β, γ, . . . to denote single classes (where we used
A, B, C, . . . before). The classes are modeled by membership variables Pxα = 1
if x ∈ α. The simple constraint Σ_α Pxα = 1 for all x ∈ X ensures that this
describes a partition of X; the constraint Σ_{1≤i≤n_a} P(a,i)α ≤ 1 for all α ∈ A and all
sequences 1 ≤ a ≤ M guarantees that each class contains at most one position per
sequence.
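The two partition constraints are straightforward to verify on a candidate 0/1 assignment. A minimal sketch, using a dictionary of membership variables on a toy instance (the instance itself is an assumption of this illustration):

```python
def check_partition(P, positions, classes, seqs):
    """Verify the two partition constraints on 0/1 membership variables
    P[(x, alpha)]: every position lies in exactly one class, and no class
    contains two positions of the same sequence."""
    for x in positions:
        if sum(P[(x, alpha)] for alpha in classes) != 1:
            return False
    for alpha in classes:
        for a in seqs:
            if sum(P[(x, alpha)] for x in positions if x[0] == a) > 1:
                return False
    return True

# toy instance: two sequences of length 2, aligned column by column
positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
classes = ["alpha", "beta"]
P = {(x, c): 0 for x in positions for c in classes}
P[((1, 1), "alpha")] = P[((2, 1), "alpha")] = 1
P[((1, 2), "beta")] = P[((2, 2), "beta")] = 1
assert check_partition(P, positions, classes, [1, 2])
```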
Linear Order. Next, we model the partial ordering relation between classes;
in the linear case, this is the transitive closure of the relation ≺. For this purpose,
we introduce ordering variables Oαβ for α ≠ β ∈ A, with value 1 indicating α ≺ β.
First, the ordering variables are related to the membership variables and the
relation < on positions:

    ∀α ≠ β ∈ A, 1 ≤ a ≤ M, 1 < j ≤ n_a:
        Σ_{i<j} P(a,i)α + Σ_{i≥j} P(a,i)β ≤ Oαβ + 1.    (CI)
Together with (CI), the constraints (OI1)-(OI3), which model irreflexivity,
antisymmetry, and transitivity of the order relation, guarantee that
the modeled classes are non-crossing (A2).
by variables COαβγ, which are first related to the membership variables and
the relation ◁ by

    ∀α ≠ β ≠ γ ∈ A, 1 ≤ a ≤ M, 1 < j < k ≤ n_a:
        Σ_{1≤i<j} P(a,i)α + Σ_{j≤i<k} P(a,i)β + Σ_{k≤i≤n_a} P(a,i)γ ≤ COαβγ + 2.    (CCI)

Note that it is indeed sufficient to specify the above implication for 1 ≤ i < j <
k ≤ n_a (instead of all i, j, k with i ◁ j ◁ k), due to the cyclic implications in (cOI2)
below. Secondly, we guarantee to describe a partial cyclic order by
Note that without the linearity requirement, the same relation could be ex-
pressed by Exy = Σ_{α∈A} Pxα Pyα. In the simplest case, the objective function
is therefore given by the linear expression

    Σ_{x,y∈X} wxy Exy.    (OF)
Linear gap costs. We model linear gap costs with cost g per gap by introducing
gap variables G(a,i)b together with the equalities

    ∀(a, i) ∈ X, 1 ≤ b ≤ M, b ≠ a:
        G(a,i)b + Σ_{j:(b,j)∈X} E(a,i)(b,j) = 1.    (GI1)
Affine gap costs. Affine gap penalties of the form h + kg for gaps of length k
can be introduced by further variables GO(a,i)b together with (GI2).
By (GI2), GO(a,i)b equals one if and only if there is a gap (w.r.t. b) at (a, i) and
no gap at (a, i + 1), i.e., GO(a,i)b = 1 signals a gap opening (at the right end of
each gap). For (GI2), we define G(a, n_a+1)b := 0 to penalize (right) end gaps in
the linear case, and G(a, n_a+1)b := G(a, 1)b to avoid double counting of gaps in
the circular case. Finally, we model the additional gap opening penalties by
Σ_{(a,i)∈X, 1≤b≤M, b≠a} h · GO(a,i)b.
By the inequalities (SI2), for each base pair (i, i′) of a sequence a, Baii′ = 1
if and only if there exists a base pair (j, j′) of a second sequence b such that Baii′bjj′ = 1.
Given the variables Baii′, we can simply forbid all conflicts by pairwise con-
straints in (SINC). Only (SINC) is specific to linear alignment; in the circular
case, the only necessary modification is to replace, in the all-quantor of (SINC),
the linear ordering condition i < j < i′ < j′ for the base pairs (i, i′) and (j, j′) by
the corresponding expression for circular order, i ◁ j ◁ j′ ∧ j ◁ i′ ◁ j′.
Table 1. Preliminary results. See text and the electronic appendix† for details.
(−3), and constant contributions per arc match (32). Subsequently, we applied
the standalone CPLEX solver essentially out of the box. The features of our
test instances and the results are summarized in Table 1. The actual instances and
alignments are reported in the Appendix. All tests were performed on a Lenovo
T431s notebook; the time-out was set to 10 minutes. Run times are reported for solving
to proven optimality and to within 5% tolerance of the LP-relaxation bound
(CPLEX configuration: tolerance mipgap 0.02). The optimal alignments found
were structurally correct. Linear alignments were performed with and with-
out a "diagonal" restriction of alignment edges (i, j) to |i − j| < Δ. The cyclic
alignment failed for all instances but the one with three sequences of length 10.
Whether the sequences are given in the correct rotation (ID 1) or rotated relative to
each other (ID 1R) does not seem to make a large difference in this case.
6 Discussion
however, that in this naïve form even highly efficient commercial solvers such as
CPLEX cannot accommodate instances large enough to be of practical interest.
We therefore also investigated the cyclic analog of the "critical mixed cycle",
which forms the basis for branch-and-cut and Lagrangian relaxation approaches
[21, 22]. Reassuringly, it can be phrased in terms of a cut linearizing the cyclic
MSA and critical mixed cycles not interfering with the cut. Although our contri-
bution does not immediately provide production-grade software, it points out
many promising directions for future work. Furthermore, it is, to our knowledge,
the first systematic analysis of the cyclic multiple sequence alignment problem.
References
1. Jeck, W.R., Sharpless, N.E.: Detecting and characterizing circular RNAs. Nat.
Biotechnol. 32, 453–461 (2014)
2. Danan, M., Schwartz, S., Edelheit, S., Sorek, R.: Transcriptome-wide discovery of
circular RNAs in Archaea. Nucleic Acids Res. 40, 3131–3142 (2012)
3. Ding, B.: Viroids: self-replicating, mobile, and fast-evolving noncoding regulatory
RNAs. Wiley Interdiscip Rev. RNA 1, 362–375 (2010)
4. Doose, G., Alexis, M., Kirsch, R., Findeiß, S., Langenberger, D., Machné, R., Mörl,
M., Hoffmann, S., Stadler, P.F.: Mapping the RNA-seq trash bin: Unusual tran-
scripts in prokaryotic transcriptome sequencing data. RNA Biology 10, 1204–1210
(2013)
5. Bunke, H., Bühler, U.: Applications of approximate string matching to 2D shape
recognition. Patt. Recogn. 26, 1797–1812 (1993)
6. Gregor, J., Thomason, M.G.: Dynamic programming alignment of sequences rep-
resenting cyclic patterns. IEEE Trans. Patt. Anal. Mach. Intell. 15, 129–135 (1993)
7. Maes, M.: On a cyclic string-to-string correction problem. Inform. Process. Lett. 35,
73–78 (1990)
8. Mollineda, R.A., Vidal, E., Casacuberta, F.: Cyclic sequence alignments: approx-
imate versus optimal techniques. Int. J. Pattern Rec. Artif. Intel. 16, 291–299
(2002)
9. Dewey, T.G.: A sequence alignment algorithm with an arbitrary gap penalty
function. J. Comp. Biol. 8, 177–190 (2001)
10. Benson, G.: Tandem cyclic alignment. Discrete Appl. Math. 146, 124–133 (2005)
11. Mosig, A., Hofacker, I.L., Stadler, P.F.: Comparative analysis of cyclic sequences:
Viroids and other small circular RNAs. In: Giegerich, R., Stoye, J. (eds.) Proceed-
ings GCB 2006, vol. P-83. Lecture Notes in Informatics, pp. 93–102 (2006)
12. Wang, L., Jiang, T.: On the complexity of multiple sequence alignment. J. Comput.
Biol. 1, 337–348 (1994)
13. Just, W.: Computational complexity of multiple sequence alignment with SP-score.
J. Comput. Biol. 8, 615–623 (2001)
14. Elias, I.: Settling the intractability of multiple alignment. J. Comput. Biol. 13,
1323–1339 (2006)
15. Morgenstern, B., Frech, K., Dress, A., Werner, T.: DIALIGN: finding local simi-
larities by multiple sequence alignment. Bioinformatics 14(3), 290–294 (1998)
16. Morgenstern, B., Stoye, J., Dress, A.W.M.: Consistent equivalence relations: a
set-theoretical framework for multiple sequence alignments. Technical report,
University of Bielefeld, FSPM (1999)
17. Otto, W., Stadler, P.F., Prohaska, S.J.: Phylogenetic footprinting and consistent
sets of local alignments. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS,
vol. 6661, pp. 118–131. Springer, Heidelberg (2011)
18. Megiddo, N.: Partial and complete cyclic orders. Bull. Am. Math. Soc. 82, 274–276
(1976)
19. Galil, Z., Megiddo, N.: Cyclic ordering is NP-complete. Theor. Comp. Sci. 5,
179–182 (1977)
20. Novák, V.: Cuts in cyclically ordered sets. Czech. Math. J. 34, 322–333 (1984)
21. Reinert, K., Lenhof, H.P., Mutzel, P., Mehlhorn, K., Kececioglu, J.D.: A branch-
and-cut algorithm for multiple sequence alignment. In: Proceedings of the First
Annual International Conference on Research in Computational Molecular Biology
(RECOMB), pp. 241–250. ACM (1997)
22. Lenhof, H.P., Morgenstern, B., Reinert, K.: An exact solution for the segment-to-
segment multiple sequence alignment problem. Bioinformatics 15, 203–210 (1999)
23. Hofacker, I.L., Bernhart, S.H., Stadler, P.F.: Alignment of RNA base pairing prob-
ability matrices. Bioinformatics 20, 2222–2227 (2004)
24. Will, S., Reiche, K., Hofacker, I.L., Stadler, P.F., Backofen, R.: Inferring non-coding
RNA families and classes by means of genome-scale structure-based clustering.
PLoS Comput. Biol. 3, e65 (2007)
25. Bauer, M., Klau, G.W., Reinert, K.: Accurate multiple sequence-structure align-
ment of RNA sequences using combinatorial optimization. BMC Bioinformatics 8
(2007)
26. Möhl, M., Will, S., Backofen, R.: Lifting prediction to alignment of RNA
pseudoknots. J. Comp. Biol. 17, 429–442 (2010)
Entropic Profiles, Maximal Motifs
and the Discovery of Significant Repetitions
in Genomic Sequences
1 Introduction
Sequence data is growing in volume with the availability of ever more precise,
as well as accessible, assaying technologies. The study of patterns in biological sequences
is central to making sense of this exploding data space, and it continues to
be a problem of vital interest. Natural notions of maximality and irredundancy
have been introduced and studied in the literature in order to limit the number of
output patterns without losing information [3, 4, 6, 10, 11, 14–17]. Such notions
relate to both the length and the occurrences of the patterns in the input
sequence. Maximal patterns have been successfully applied to the identification
of biologically significant repetitions and to the compressibility of biological sequences,
to list a few areas of use.
Different flavors of patterns, based either on combinatorics or on statistics, can
usually be shown to be variations on this basic concept of maximal patterns. In
particular, it is well known that the degree of predictability of a sequence can be
measured by its entropy, which at the same time is closely related to the sequence's
repetitiveness and compressibility [9]. The entropic profile was introduced in [7, 8, 18]
to study the under- and over-representation of segments, and also the scale of
each conserved DNA region.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 148–160, 2014.
c Springer-Verlag Berlin Heidelberg 2014
Maximal Motifs and the Discovery of Significant Repetitions 149
2 Background
Maximal motifs are those subwords of the input string that cannot be ex-
tended to the left or to the right without losing at least one of their occurrences.
150 L. Parida et al.
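This definition can be checked brute-force on small strings. The sketch below is didactic, not the paper's algorithm: a repeated subword is maximal exactly when neither the left nor the right one-character extension preserves all occurrences:

```python
def occurrences(s, w):
    """Start positions of all (possibly overlapping) occurrences of w in s."""
    return [i for i in range(len(s) - len(w) + 1) if s.startswith(w, i)]

def is_maximal(s, w):
    """A repeated subword is maximal if every one-character extension
    (to the left or to the right) loses at least one occurrence."""
    occ = occurrences(s, w)
    if len(occ) < 2:
        return False
    left = {s[i - 1] for i in occ if i > 0}
    right = {s[i + len(w)] for i in occ if i + len(w) < len(s)}
    # extendable without loss iff all occurrences share the same flanking
    # character on that side and none touches the string boundary
    left_ext = len(left) == 1 and all(i > 0 for i in occ)
    right_ext = len(right) == 1 and all(i + len(w) < len(s) for i in occ)
    return not left_ext and not right_ext

assert is_maximal("TCAACGGCGGCT", "CGGC")
assert not is_maximal("TCAACGGCGGCT", "CGG")  # always followed by C
```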
Definition 5 (Normalized EP). Let m_{L,φ} be the mean and S_{L,φ} be the standard
deviation using all positions i = 1 . . . n:

    m_{L,φ} = (1/n) Σ_{i=1}^{n} f̂_{L,φ}(x_i)    and
    S_{L,φ} = sqrt( (1/(n−1)) Σ_{i=1}^{n} ( f̂_{L,φ}(x_i) − m_{L,φ} )² )

    FastEP_{L,φ}(i) = f_{L,φ}(i) / max_{0≤j<n} [ f_{L,φ}(j) ]
Proof. Let i be a generic position of the input string and let L′ be the length of
the subword x starting at i − L′ + 1 and ending at i, such that x scores the
maximum value of entropy at position i. Then the following inequalities hold
with respect to the two subwords x′ and x′′ of length L′ + 1 and L′ − 1, ending
at i and starting at i − L′ and at i − L′ + 2, respectively:

    ( n + Σ_{k=1}^{L′} 4^k φ^k c([i−k+1, i]) ) / Σ_{k=0}^{L′} φ^k
        ≥ ( n + Σ_{k=1}^{L′+1} 4^k φ^k c([i−k+1, i]) ) / Σ_{k=0}^{L′+1} φ^k

    ( n + Σ_{k=1}^{L′} 4^k φ^k c([i−k+1, i]) ) / Σ_{k=0}^{L′} φ^k
        ≥ ( n + Σ_{k=1}^{L′−1} 4^k φ^k c([i−k+1, i]) ) / Σ_{k=0}^{L′−1} φ^k

As shown in the Appendix, the two inequalities above can be rewritten as:

    ( n + Σ_{k=1}^{L′} 4^k φ^k c([i−k+1, i]) ) / Σ_{k=0}^{L′} φ^k
        ≥ 4^{L′+1} φ^{L′+1} c([i−L′, i]) / φ^{L′+1}

    ( n + Σ_{k=1}^{L′−1} 4^k φ^k c([i−k+1, i]) ) / Σ_{k=0}^{L′−1} φ^k
        ≤ 4^{L′} φ^{L′} c([i−L′+1, i]) / φ^{L′}
position: 0 1 2 3 4 5 6 7 8 9 10 11
string:   T C A A C G G C G G C  T

We wonder whether the maximal motif CGGC, ending at positions 7 and 10, corre-
sponds to a peak of entropy at one of those positions. We have f̂_{3,10}(7) =
9.85, f̂_{4,10}(7) = 39.38 and f̂_{5,10}(7) = 76.8; therefore CGGC does not have a peak of f̂
at that position. The same values occur for position i = 10.
4 Methods
In this section we discuss the algorithms available to compute entropic profiles
and present faster algorithms for entropic profile computation and normalization.
We recall that we want to analyze an input string x of length n by
means of entropic profiles of resolution L and for a fixed φ.
There are two algorithms available in the literature that compute entropic profiles.
The algorithm described in [8] is a faster version of the original algorithm
proposed by Fernandes et al. [7]. It relies on a truncated suffix trie data structure,
which is quadratic in both time and space, enhanced with a list of
side links that connect all the nodes at the same depth in the tree. This is needed
to speed up the normalization because, in the formulas used to compute the mean
and the standard deviation [7], counting subwords of the same length is a
routine operation. With this approach, the maximum value of L had to be set to
15.
The other method, presented in [5], uses a suffix tree on the reverse string to
obtain linear time and space computation of the absolute values of entropy for
given parameters L and φ. These values are then normalized with respect to the
maximum value of entropy among all the substrings of length L. To obtain the
maximum value max_L for a given L, all values max_l with
1 ≤ l < L are needed. The algorithm has a worst-case complexity of O(n²), but
since it is guided by a branch-and-cut technique, substantial savings are
possible in practice.
A key property of both suffix tries and suffix trees [12] is that, once the data
structure is built on a text string x, the occurrences of a pattern y = y_1 . . . y_m
in x can be found by following the path labelled with y_1 . . . y_m from the root
of the tree. If such a path exists, the occurrences are given by the indexes of
the leaves of the subtree rooted at the node in which the path ends. Moreover,
since the suffix tree is a compact version of the suffix trie, it has the further
property that all the strings corresponding to paths that end in the "middle" of
an arc between two nodes share the same set of occurrences. Figure 1 shows an
example of a suffix trie and a suffix tree.
4.2 Preprocessing
For the computation of the values needed to obtain both the absolute and the
normalized values of entropy, we perform the same preprocessing procedure de-
scribed in [5]. We recall the main steps here, as we will need the annotated suffix
tree for the subsequent description of the speed-up for computing the mean and
the standard deviation.
Consider the suffix tree T built on the reverse of the input string x. In such
a tree, strings that are described by paths ending at the same locus share the
same set of ending positions in the original input string. Hence, they are exactly
the strings we need to consider when computing the values of entropy. Some care
needs to be taken to map the actual positions during the computation, but this
does not affect the time complexity. Therefore, in the following discussion we will
just refer to the standard association between strings and positions in a suffix
tree, keeping in mind that they are actually reversed.
Fig. 1. A suffix trie (left) and a suffix tree (right) built on the same string x =
ATTACAC$. The leaves correspond to positions in x. The internal nodes of the suffix
tree hold the number of occurrences of the strings that have the node as a locus.
The main observation in [5] is that, in the reverse tree, the absolute value of
the EP function for n − i is equal to:

    f_{L,φ}(x_i) = ( 1 + (1/n) Σ_{k=1}^{L} 4^k φ^k · c([i, i+k−1]) ) / Σ_{k=0}^{L} φ^k
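This formula can also be evaluated brute-force in O(n²L) time, which is convenient for checking a fast implementation on small inputs. A didactic sketch, taking c(·) to be the occurrence count of the corresponding substring:

```python
def ep(x, L, phi):
    """Brute-force entropic profile: for each start position i, sum
    (4*phi)^k times the occurrence count of the k-mer x[i:i+k] for
    k = 1..L, then normalize as in the formula above."""
    n = len(x)
    denom = sum(phi ** k for k in range(L + 1))

    def count(w):
        # number of (possibly overlapping) occurrences of w in x
        return sum(x.startswith(w, p) for p in range(n - len(w) + 1))

    return [(1 + sum((4 * phi) ** k * count(x[i:i + k])
                     for k in range(1, L + 1)) / n) / denom
            for i in range(n - L + 1)]

# sanity check: on "AAAA" with L = 1, phi = 1, every position gets
# (1 + 4*4/4) / 2 = 2.5
assert ep("AAAA", 1, 1.0) == [2.5, 2.5, 2.5, 2.5]
```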
In the suffix tree T each node v is annotated with a variable count(v) which
stores the number of occurrences of the subword w(v), given by the concatenation
of labels from the root to the node v. This can be done in linear time with a
bottom-up traversal by setting up the value of the leaves to 1, and the value of
the internal nodes to the sum of the values of their children.
Each node v is also annotated with the value of the main summation in the
entropy formula. Let i be a position at which the string w(v) occurs:

    main(v) = Σ_{k=1}^{L} 4^k φ^k · c([i, i+k−1])
Note that once this value is available, the absolute value of entropy for w(v)
can be computed in constant time:

    ( 1 + (1/n) main(v) ) (1 − φ) / (1 − φ^{L+1})
Now let h(v) be the length of w(v) and parent(v) be the parent node of v. The
annotation takes linear time with a pre-order traversal of the tree that passes
the contribution of shorter prefixes on to the following nodes in the path:

    main(v) = main(parent(v)) + Σ_{k=h(parent(v))+1}^{h(v)} (4φ)^k count(v)

By the closed form of the geometric series, this is

    main(v) = main(parent(v)) + count(v) · ( (4φ)^{h(parent(v))+1} − (4φ)^{h(v)+1} ) / (1 − 4φ)

and, when the path is truncated at depth L,

    Σ_{k=h(parent(v))+1}^{L} (4φ)^k count(v) = count(v) · ( (4φ)^{h(parent(v))+1} − (4φ)^{L+1} ) / (1 − 4φ)
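The geometric-series identity behind the closed form is easy to verify numerically (the values of φ and the summation bounds below are arbitrary choices for the check):

```python
# numeric sanity check of sum_{k=a}^{b} q^k = (q^a - q^(b+1)) / (1 - q)
# with q = 4*phi, as used in the closed-form annotation above
phi, a, b = 0.1, 3, 7
q = 4 * phi
direct = sum(q ** k for k in range(a, b + 1))
closed = (q ** a - q ** (b + 1)) / (1 - q)
assert abs(direct - closed) < 1e-9
```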
Absolute Value of Entropy. We can collect the absolute value of entropy for
all positions in the input string with a simple traversal of the tree down to depth L
(in terms of the length of the strings that label the paths). The steps to follow when
computing the entropy once we reach the last node of a path are the same
as those already described for computing the entropy of a given substring. Unlike
before, when we reach the last node of a path we also store the value
of entropy in an array of size n (or any other suitable data structure) at the
positions corresponding to the leaves of the subtree rooted at the node, which
are the occurrences of the string that labels the path.
Moreover, as a by-product of this traversal, we can also collect the information
needed to compute the mean and the standard deviation in linear time.
The Mean. Consider the mean first. We need to sum up the values of the entropy
over all possible substrings of length L in the input string. Indeed, we can rewrite it as

$$m_{L,\phi} = \frac{1}{n}\sum_{i=1}^{n} \hat{f}_{L,\phi}(x_i) = \frac{1}{n}\sum_{w\in D_L} \mathrm{count}(v_w)\times \hat{f}_{L,\phi}(w)$$
Therefore, when traversing the tree, we also keep a variable in which we add
the value of entropies found at length L, multiplied by the value of count(·)
stored at their locus.
The Standard Deviation.

$$S_{L,\phi}^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{f}_{L,\phi}(x_i) - m_{L,\phi}\right)^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}\hat{f}_{L,\phi}^2(x_i) - n\, m_{L,\phi}^2\right]$$

Again we aggregate the contributions coming from the same subwords, so that
$\sum_{i=1}^{n}\hat{f}_{L,\phi}^2(x_i)$ becomes:

$$\sum_{w\in D_L}\left(\mathrm{count}(v_w)\times \hat{f}_{L,\phi}^2(w)\right)$$
To compute this sum, when traversing the tree we keep a variable in which
we add the square of the entropies we compute at length L, multiplied by the
value of count(·) stored at their locus.
Once the above summation and the mean have been computed with a single
traversal at depth L of our tree, we have all the elements needed to compute the
standard deviation in constant time.
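The aggregation above can be sketched as follows (our own helper, assuming the pairs (count(v_w), f̂_{L,φ}(w)) have been collected during the depth-L traversal):

```python
import math

def mean_and_std(pairs, n):
    """Mean and standard deviation of the entropy over all n positions,
    computed from per-word aggregates (count(v_w), f_hat(w))."""
    s1 = sum(c * f for c, f in pairs)      # sum of entropy values
    s2 = sum(c * f * f for c, f in pairs)  # sum of squared entropy values
    m = s1 / n
    var = (s2 - n * m * m) / (n - 1)
    return m, math.sqrt(var)

# Equivalent to the position-wise values [1, 1, 2, 2, 2]:
m, s = mean_and_std([(2, 1.0), (3, 2.0)], 5)
assert abs(m - 1.6) < 1e-12 and abs(s * s - 0.3) < 1e-12
```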
The Maximum. As a side observation, note that, in terms of asymptotic
complexity, the maximum value of the entropy can also be retrieved in linear
time with a tree traversal, without the need to compute the values maxl for 1 ≤ l < L.
5 Experimental Analysis
In this section we present the results of the experimental analysis we performed
on the whole genome of Haemophilus influenzae, which is one of the most
extensively analyzed genomes in this context. For all the considered methods, the time
performance evaluations we show do not include the preprocessing step, i.e., the
construction of the exploited data structures (suffix trees or suffix tries), whereas
the time needed to annotate the tree is always included. All the tests were run
on a laptop with a 3.06 GHz Core 2 Duo and 8 GB of RAM.
Figure 2 shows a comparison among the original EP function computation
by Vinga et al. [18] (denoted by EP in the following), FastEP by Comin and
Antonello [5], and our approach, LinearEP. In particular, the running
times in milliseconds are shown for φ = 10, L = 10 and increasing values of n.
As is clear from the figure, LinearEP outperforms the other two methods,
thus confirming the theoretical results.
[Plot: total running time (ms) of EP, LinearEP and FastEP for n up to 2 × 10⁶]
Fig. 2. Comparison among EP, FastEP and LinearEP (φ = 10, L = 10) for increasing values of n. Total running time includes the computation of the normalizing factors, and the normalized EP values for the whole sequence.
[Plot: running time (ms) of EP and LinearEP for n up to 2 × 10⁶]
Fig. 3. Comparison between EP and LinearEP for the computation of mean and standard deviation
software of EP. We do not have such a limitation; indeed, we tested our algorithm
up to L = 100, obtaining results close to those for L = 15. Figure 4 shows the
results for the time needed to compute the mean and the standard deviation.
The first observation is that, in both cases, the performance of EP does not change
significantly for increasing values of L, whereas the running times of LinearEP
increase noticeably for increasing values of L. This is due to the fact that n is fixed
and L varies. To compute the mean and the standard deviation, LinearEP traverses a
[Plot: running time (ms) of EP and LinearEP for L from 2 to 16]
6 Concluding Remarks
The research proposed here includes two main contributions. The first contribution
is the study of possible relationships between two classes of motifs analyzed
in the literature, both effective in singling out significant biological
repetitions: entropic profiles and maximal motifs. We proved that entropic
profiles are a subset of maximal motifs and, in particular, that they are
left-maximal motifs of the input string. The second contribution of the present
manuscript is the proposal of a novel linear-time, linear-space algorithm for the
extraction of entropic profiles, according to the original normalization reported
in [7]. Experimental validations confirmed that the algorithm proposed here is
faster than the others in the literature, including a recent approach where a
different normalization was introduced [5].
From these contributions interesting considerations emerge. First of all, we
observe that entropic profiles are related to a specific length, which one can only
guess when doing de novo discovery. One could therefore extract maximal
motifs first, and then investigate entropic profiles in the regions of the maximal
motifs and for values of L around the maximal reported length, further speeding
up the discovery of entropic profiles. Other improvements in entropic profile
extraction could come from the exploitation of more efficient data structures
such as enhanced suffix arrays [1]. In this regard, we note that the preprocessing
step can also be sped up since, as already pointed out in the previous section,
a full suffix tree is not necessary for the computation. Finally, challenges remain
open concerning maximal motifs and entropic profiles. Notably, one may wonder
whether entropic profiles miss part of the information, so that maximal motifs are
more reliable when it comes to discovery problems, or whether, on the contrary,
entropic profiles cover the complete information, i.e., they are a refinement of
maximal motifs and should be preferred.
References
1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced
suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
2. Allali, A., Sagot, M.-F.: The at most k deep factor tree. Technical Report (2004)
3. Apostolico, A., Parida, L.: Incremental paradigms of motif discovery. Journal of
Computational Biology 11, 1 (2004)
4. Apostolico, A., Pizzi, C., Ukkonen, E.: Efficient algorithms for the discovery of
gapped factors. Algorithms for Molecular Biology 6, 5 (2011)
5. Comin, M., Antonello, M.: Fast Computation of Entropic Profiles for the Detection
of Conservation in Genomes. In: Ngom, A., Formenti, E., Hao, J.-K., Zhao, X.-M.,
van Laarhoven, T. (eds.) PRIB 2013. LNCS, vol. 7986, pp. 277–288. Springer,
Heidelberg (2013)
6. Federico, M., Pisanti, N.: Suffix tree characterization of maximal motifs in biolog-
ical sequences. Theoretical Computer Science 410(43), 4391–4401 (2009)
7. Fernandes, F., Freitas, A.T., Vinga, S.: Detection of conserved regions in genomes
using entropic profiles. INESC-ID Tec. Rep. 33/2007
8. Fernandes, F., Freitas, A.T., Almeida, J.S., Vinga, S.: Entropic Profiler – detection
of conservation in genomes using information theory. BMC Research Notes 2, 72
(2009)
9. Herzel, H., Ebeling, W., Schmitt, A.O.: Entropies of biosequences: The role of
repeats. Physical Review E 50, 5061–5071 (1994)
10. Grossi, R., Pietracaprina, A., Pisanti, N., Pucci, G., Upfal, E., Vandin, F.:
MADMX: a strategy for maximal dense motif extraction. Journal of Computa-
tional Biology 18(4), 535–545 (2011)
11. Grossi, R., Pisanti, N., Crochemore, M., Sagot, M.-F.: Bases of motifs for generat-
ing repeated patterns with wild cards. IEEE/ACM Transactions on Computational
Biology and Bioinformatics 2(1), 40–50 (2005)
12. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and
Computational Biology. Cambridge University Press (1997)
13. Na, J.C., Apostolico, A., Iliopoulos, C.S., Park, K.: Truncated suffix trees and their
application to data compression. Theoretical Computer Science 304(1-3), 87–101
(2003)
14. Parida, L.: Pattern Discovery in Bioinformatics: Theory & Algorithms. Chapman
& Hall/CRC (2007)
15. Parida, L., Pizzi, C., Rombo, S.E.: Characterization and Extraction of Irredun-
dant Tandem Motifs. In: Calderón-Benavides, L., González-Caro, C., Chávez, E.,
Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 385–397. Springer, Heidelberg
(2012)
16. Parida, L., Pizzi, C., Rombo, S.E.: Irredundant tandem motifs. Theoretical
Computer Science 525, 89–102 (2014)
17. Rombo, S.E.: Extracting string motif bases for quorum higher than two. Theoret-
ical Computer Science 460, 94–103 (2012)
18. Vinga, S., Almeida, J.S.: Local Renyi entropic profiles of DNA sequences. BMC
Bioinformatics 8, 393 (2007)
Appendix
Let us start from the following two inequalities:

$$\frac{N + \sum_{k=1}^{L'} 4^k\phi^k\, c([i-k+1,i])}{\sum_{k=0}^{L'}\phi^k} \;\ge\; \frac{N + \sum_{k=1}^{L'+1} 4^k\phi^k\, c([i-k+1,i])}{\sum_{k=0}^{L'+1}\phi^k}$$

$$\frac{N + \sum_{k=1}^{L'} 4^k\phi^k\, c([i-k+1,i])}{\sum_{k=0}^{L'}\phi^k} \;\ge\; \frac{N + \sum_{k=1}^{L'-1} 4^k\phi^k\, c([i-k+1,i])}{\sum_{k=0}^{L'-1}\phi^k}$$

from which:

$$\frac{N + \sum_{k=1}^{L'} 4^k\phi^k\, c([i-k+1,i])}{\sum_{k=0}^{L'}\phi^k} \;\ge\; \frac{N + \sum_{k=1}^{L'} 4^k\phi^k\, c([i-k+1,i]) + 4^{L'+1}\phi^{L'+1}\, c([i-L',i])}{\sum_{k=0}^{L'}\phi^k + \phi^{L'+1}}$$

$$\frac{N + \sum_{k=1}^{L'-1} 4^k\phi^k\, c([i-k+1,i]) + 4^{L'}\phi^{L'}\, c([i-L'+1,i])}{\sum_{k=0}^{L'-1}\phi^k + \phi^{L'}} \;\ge\; \frac{N + \sum_{k=1}^{L'-1} 4^k\phi^k\, c([i-k+1,i])}{\sum_{k=0}^{L'-1}\phi^k}$$

Let us consider $A = N + \sum_{k=1}^{L'} 4^k\phi^k\, c([i-k+1,i])$, $B = \sum_{k=0}^{L'}\phi^k$, $C = 4^{L'+1}\phi^{L'+1}\, c([i-L',i])$, $D = \phi^{L'+1}$; $A' = N + \sum_{k=1}^{L'-1} 4^k\phi^k\, c([i-k+1,i])$, $B' = \sum_{k=0}^{L'-1}\phi^k$, $C' = 4^{L'}\phi^{L'}\, c([i-L'+1,i])$ and $D' = \phi^{L'}$. Then:

$$\frac{A}{B} \ge \frac{A+C}{B+D} \implies \frac{A}{B} \ge \frac{C}{D}\,,\qquad \frac{A'+C'}{B'+D'} \ge \frac{A'}{B'} \implies \frac{A'}{B'} \le \frac{C'}{D'}\,,$$

that is:

$$N + \sum_{k=1}^{L'-1} 4^k\phi^k\, c([i-k+1,i]) \;\ge\; \frac{\sum_{k=0}^{L'}\phi^k}{\phi^{L'+1}}\, 4^{L'+1}\phi^{L'+1}\, c([i-L',i]) \;-\; 4^{L'}\phi^{L'}\, c([i-L'+1,i])$$

$$N + \sum_{k=1}^{L'-1} 4^k\phi^k\, c([i-k+1,i]) \;\le\; \frac{\sum_{k=0}^{L'-1}\phi^k}{\phi^{L'}}\, 4^{L'}\phi^{L'}\, c([i-L'+1,i])$$

Then:

$$\frac{\sum_{k=0}^{L'}\phi^k}{\phi^{L'+1}}\, 4^{L'+1}\phi^{L'+1}\, c([i-L',i]) - 4^{L'}\phi^{L'}\, c([i-L'+1,i]) \;\le\; \frac{\sum_{k=0}^{L'-1}\phi^k}{\phi^{L'}}\, 4^{L'}\phi^{L'}\, c([i-L'+1,i])$$

Dividing by $4^{L'}$:

$$4\sum_{k=0}^{L'}\phi^k\, c([i-L',i]) \;-\; \phi^{L'}\, c([i-L'+1,i]) \;-\; \sum_{k=0}^{L'-1}\phi^k\, c([i-L'+1,i]) \;\le\; 0$$

and finally:

$$\left(\phi^{L'} + \sum_{k=0}^{L'-1}\phi^k\right) c([i-L'+1,i]) \;\ge\; 4\sum_{k=0}^{L'}\phi^k\, c([i-L',i])\,.$$
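The two "mediant" implications used above can be checked numerically (a small sanity check of the algebra, not part of the paper):

```python
from fractions import Fraction
from itertools import product

def mediant_implications_hold(limit=6):
    """If A/B >= (A+C)/(B+D) then A/B >= C/D, and symmetrically for <=
    (all quantities positive)."""
    for a, b, c, d in product(range(1, limit + 1), repeat=4):
        if Fraction(a, b) >= Fraction(a + c, b + d) and Fraction(a, b) < Fraction(c, d):
            return False
        if Fraction(a, b) <= Fraction(a + c, b + d) and Fraction(a, b) > Fraction(c, d):
            return False
    return True

assert mediant_implications_hold()
```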
1 Introduction
Alignment-free methods are increasingly used for DNA and protein sequence
comparison since they are much faster than traditional alignment-based ap-
proaches [1]. Most alignment-free algorithms compare the word or k-mer compo-
sition of the input sequences [2]. They use standard metrics such as the Euclidean
or the Jensen-Shannon (JS) distance [3] on the relative word frequency vectors
of the input sequences to estimate their distances.
Recently, we proposed an alternative approach to alignment-free sequence
comparison. Instead of considering contiguous subwords of the input sequences,
our approach considers spaced words, i.e. words containing wildcard or don't
care characters at positions defined by a pre-defined pattern P, similar to the
spaced seeds that are used in database searching [4]. As in existing alignment-
free methods, the (relative) frequencies of these spaced words are compared using
standard distance measures [5]. In [6], we extended this approach by using whole
sets P = {P1 , . . . , Pm } of patterns and calculating the spaced-word frequencies
with respect to all patterns in P. In this multiple-pattern approach, the distance
between two sequences is defined as the average of the distances between the
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 161–173, 2014.
© Springer-Verlag Berlin Heidelberg 2014
162 B. Morgenstern et al.
single and multiple pattern approach. We show that the variance of N is lower
for spaced words than for contiguous words and that the variance is further
reduced in our multiple pattern approach.
we have P̂ = {1, 2, 4}, and the spaced word w = (P, w′) occurs at position 2 in
sequence S = CACGTCA since
S[2]S[3]S[5] = ACT = w′.
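For illustration, the spaced word at a given position can be extracted as follows (our own helper; 1-based indexing as in the example):

```python
def spaced_word(S, i, match_positions):
    """Characters of S at the match positions of the pattern, read from
    position i (all positions 1-based, as in the example above)."""
    return "".join(S[i + r - 2] for r in match_positions)

assert spaced_word("CACGTCA", 2, [1, 2, 4]) == "ACT"
```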
S1 [i + P̂r − 1] = S2 [j + P̂r − 1]
$$\hat{p} = \sqrt[k]{\frac{N}{m\cdot(L-\ell+1)} - (L-\ell)\cdot q^k} \qquad (3)$$
as an estimator for the match probability p for sequences without indels, and
with Jukes-Cantor [13] we obtain
$$\hat{d} = -\frac{3}{4}\cdot \ln\left(\frac{4}{3}\sqrt[k]{\frac{N}{m\cdot(L-\ell+1)} - (L-\ell)\cdot q^k} - \frac{1}{3}\right) \qquad (4)$$
as an estimator for the distance d between the sequences S1 and S2 .
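Equations (3) and (4) can be sketched directly (the function name is ours; q is the background match probability, 1/4 for DNA):

```python
import math

def jc_distance_from_matches(N, m, L, ell, k, q=0.25):
    """p-hat from the spaced-word match count N (equation (3)), then the
    Jukes-Cantor distance d-hat (equation (4))."""
    p_hat = (N / (m * (L - ell + 1)) - (L - ell) * q ** k) ** (1.0 / k)
    d_hat = -0.75 * math.log((4.0 / 3.0) * p_hat - 1.0 / 3.0)
    return p_hat, d_hat

# If N equals its expectation for identical sequences (p = 1), d-hat = 0:
m, L, ell, k = 1, 100, 10, 4
N = m * (L - ell + 1) * (1 + (L - ell) * 0.25 ** k)
p_hat, d_hat = jc_distance_from_matches(N, m, L, ell, k)
assert abs(p_hat - 1.0) < 1e-9 and abs(d_hat) < 1e-9
```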
Equation (2) for the expected number N of spaced-word matches between
two sequences S1 and S2 can be easily generalized to the case where S1 and S2
have different lengths and contain insertions and deletions. Let L1 and L2 be the
5 The Variance of N
To calculate the variance of N, we adapt results on the occurrence of words
in a sequence as outlined in [14]. First, we calculate the joint probability of
overlapping spaced-word matches for (different or equal) patterns from P at
different sequence positions. Note that an overlap between a P-match at (i, j) and
a P′-match at (i′, j′) can occur only if i − i′ = j − j′ (and for non-overlapping
matches, their joint probability is, of course, the product of their individual
probabilities). We therefore consider a P-match at (i, j) and a P′-match at
(i + s, j + s) for some s ≥ 0.
For patterns P, P′ and s ∈ N, we define n(P, P′, s) to be the number of integers
that are match positions of P or match positions of P′ shifted by s positions to
the right (or both). Formally, if
For example, for P = 101011, P′ = 111001 and s = 2, there are 6 positions that
are match positions of P or of P′ shifted by 2 positions to the right, namely
positions 1, 3, 4, 5, 6, 8:

    P :  101011
    P′:    111001

so one has n(P, P′, s) = 6. In particular, one has n(P, P, 0) = k for all patterns
P of weight k, and
n(P, P, s) = k + min{s, k}
for all contiguous patterns P of weight (or length) k. With this notation, we can
write
$$E\left[X^{P}_{i,j}\cdot X^{P'}_{i+s,j+s}\right] = \begin{cases} p^{\,n(P,P',s)} & \text{if } i = j\\ q^{\,n(P,P',s)} & \text{else} \end{cases} \qquad (5)$$

for all $X^{P}_{i,j}$, $X^{P'}_{i+s,j+s}$.
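The overlap count n(P, P′, s) appearing in these formulas can be computed directly (a small helper of ours):

```python
def n_overlap(P, Pp, s):
    """n(P, P', s): number of positions that are match positions of P or
    of P' shifted s positions to the right (1-based '1' positions)."""
    match_P = {i + 1 for i, ch in enumerate(P) if ch == "1"}
    match_Pp = {i + 1 + s for i, ch in enumerate(Pp) if ch == "1"}
    return len(match_P | match_Pp)

assert n_overlap("101011", "111001", 2) == 6   # the worked example
assert n_overlap("101011", "101011", 0) == 4   # n(P, P, 0) = k (weight 4)
assert n_overlap("1111", "1111", 2) == 4 + min(2, 4)   # contiguous case
```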
We express the variance of these sums of random variables as the sum of all of
their covariances, so for the 'homologous' random variables we can write

$$\mathrm{Var}\left(\sum_{X\in\mathcal{X}^{Hom}} X\right) = \sum_{P,P'\in\mathcal{P}}\ \sum_{i,i'=1}^{L-\ell+1} \mathrm{Cov}\left(X^{P}_{i,i},\, X^{P'}_{i',i'}\right)$$

and since the above covariances depend only on s but not on i, we can use (5)
and (7) and obtain

$$\mathrm{Var}\left(\sum_{X\in\mathcal{X}^{Hom}} X\right) \approx (L-\ell+1)\cdot \sum_{P,P'\in\mathcal{P}}\ \sum_{s=-\ell+1}^{\ell-1}\left(p^{\,n(P,P',s)} - p^{2k}\right)$$

and similarly

$$\mathrm{Var}\left(\sum_{X\in\mathcal{X}^{BG}} X\right) \approx (L-\ell+1)\cdot(L-\ell)\cdot \sum_{P,P'\in\mathcal{P}}\ \sum_{s=-\ell+1}^{\ell-1}\left(q^{\,n(P,P',s)} - q^{2k}\right)$$
Together, we get

$$\mathrm{Var}(N) \approx (L-\ell+1)\cdot\sum_{P,P'\in\mathcal{P}}\ \sum_{s=-\ell+1}^{\ell-1}\left(p^{\,n(P,P',s)} - p^{2k}\right) \;+\; (L-\ell+1)\cdot(L-\ell)\cdot\sum_{P,P'\in\mathcal{P}}\ \sum_{s=-\ell+1}^{\ell-1}\left(q^{\,n(P,P',s)} - q^{2k}\right) \qquad (8)$$
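Equation (8) can be evaluated directly; the sketch below (our own code) assumes all patterns in P have the same length ℓ and weight k, and handles negative shifts via the symmetry n(P, P′, −s) = n(P′, P, s):

```python
def var_N(patterns, p, q, L):
    """Approximate Var(N) following equation (8)."""
    def n_overlap(P, Pp, s):
        a = {i for i, ch in enumerate(P) if ch == "1"}
        b = {i + s for i, ch in enumerate(Pp) if ch == "1"}
        return len(a | b)

    ell = len(patterns[0])
    k = patterns[0].count("1")
    hom = bg = 0.0
    for P in patterns:
        for Pp in patterns:
            for s in range(-ell + 1, ell):
                n = n_overlap(P, Pp, s) if s >= 0 else n_overlap(Pp, P, -s)
                hom += p ** n - p ** (2 * k)
                bg += q ** n - q ** (2 * k)
    return (L - ell + 1) * hom + (L - ell + 1) * (L - ell) * bg

# Degenerate check: a single length-1 pattern, p = q = 0.5, L = 10
# gives 10 * 0.25 + 10 * 9 * 0.25 = 25.
assert var_N(["1"], 0.5, 0.5, 10) == 25.0
```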
6 Test Results
[Fig. 1 (plots): estimated distance vs. substitutions per site, compared to the expected distance, for the Evolutionary Distance and Jensen-Shannon measures with single and multiple patterns, for Kr, and for ACS and kmacs (k = 30)]
As a third sequence set, we used a set of 112 HIV-1 genomes from the HIV-
1/SIVcpz database at Los Alamos National Laboratory [23]. Again, we compared
the distance matrices produced by various alignment-free methods to a reference
matrix calculated with Dnadist from a trusted reference alignment from the HIV
database. Here, the Jensen-Shannon distance applied to multiple spaced-word
frequencies was slightly superior to our new distance function if the length ℓ
(and therefore the number of don't care positions of the underlying patterns)
was small. Only for larger values of ℓ was our new distance superior to Jensen-
Shannon. As in the previous example, kmacs was among the best performing
methods. On the HIV sequences, we could also apply a multiple-alignment
program. We used CLUSTAL Ω [24] and applied Dnadist to the resulting
alignment. Not surprisingly, this slow but accurate method of sequence comparison
performed better than all alignment-free approaches that we tested.
Fig. 1 shows not only striking differences in the shape of the distance functions
used by various alignment-free programs. There are also remarkable differences in
the variance of the distances calculated with the new distance measure that we
defined in equation (4). This distance is defined in terms of the number N of (spaced)
word matches between two sequences. As mentioned above, the established Jensen-
Shannon and Euclidean distances on (spaced) word frequency vectors also depend
on N, as they can be approximated by L − N and √(L − N), respectively. Thus, the
variances of these three distance measures directly depend on the variance of N.
As can be seen in Fig. 1, the variance of the distances calculated with our new
distance function increases with the frequency of substitutions. Also, the variance is
higher for the single-pattern approach than for the multiple-pattern approach. To
explain this observation, we calculated the variance of the number N of spaced-word
matches using equation (8). Fig. 4 summarizes the results for a sequence length
of L = 16,000 and mismatch frequencies of 0.7 and 0.25, respectively.

[Plots: similarity in % of the distance matrices to the reference ("Spaced Words"), and bar charts over contiguous words, a single pattern, and 10 to 100 patterns]
7 Discussion
In this paper, we proposed a new estimator for the evolutionary distance between
two DNA sequences based on the number N of spaced-word matches between
them. While most alignment-free methods use ad-hoc distance measures, the
distance function that we defined is based on a probabilistic model of evolution
and seems to be a good estimator for the number of substitutions per site that
have occurred since two sequences have evolved separately. For simplicity, we
used a model of evolution without insertions and deletions. Nevertheless, our test
results show that our distance function is still a reasonable estimator if the input
sequences contain a moderate number of insertions and deletions. Obviously, our
distance function would drastically overestimate the distances between sequence
172 B. Morgenstern et al.
pairs that share only local homologies. This seems to be a major limitation of
our approach. However, as indicated in section 4, our distance measure can be
adapted to the case of local homologies if the length of these homologies and
the number of gaps in the homologous regions can be estimated. In principle,
it should therefore be possible to apply our method to locally related sequences
by first estimating the extent of their shared homologies and then adapting our
distance measure accordingly.
The distance introduced in this paper and other distance measures that we
previously used for our spaced words approach depend on the number N of
spaced-word matches between two sequences with respect to a set P of patterns
of 'match' and 'don't care' positions. This is similar for more traditional, k-
mer based distance measures, where P consists of one single contiguous pattern
P = 1 . . . 1. Obviously, the expected number of (spaced) word matches is essentially
the same for contiguous and for spaced words of the corresponding weight.
Herein, we showed how the variance of N can be calculated and demonstrated
that this variance is considerably lower for our spaced-words approach than for
the standard approach that is based on contiguous words, and that our multiple-
pattern approach further reduces the variance of N/m where m is the number of
patterns in P. This seems to be the main reason why our multiple spaced words
approach outperforms the single-pattern approach that we previously introduced
as well as the classical k-mer approach when used for phylogeny reconstruction.
As we have shown, the variance of N depends on the number of overlapping
‘match’ positions if patterns from P are shifted against each other. Consequently,
in our single-pattern approach, the variance of N is higher for periodic patterns
than for non-periodic patterns, and if a pattern like 101010 . . . is used, the
variance is equal to the variance of the contiguous pattern with the same weight.
In our benchmark studies, we could experimentally confirm that on phylogeny
benchmark data, spaced words performs worse with periodic patterns than with
non-periodic patterns. Therefore, the theoretical results of this study may be
useful to find patterns or sets of patterns that minimize the variance of N and
thereby improve our spaced-words approach.
References
1. Vinga, S.: Editorial: Alignment-free methods in computational biology. Briefings
in Bioinformatics 15, 341–342 (2014)
2. Blaisdell, B.E.: A measure of the similarity of sets of sequences not requiring se-
quence alignment. Proceedings of the National Academy of Sciences of the United
States of America 83, 5155–5159 (1986)
3. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions on
Information Theory 37, 145–151 (1991)
4. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology
search. Bioinformatics 18, 440–445 (2002)
Estimating Evolutionary Distances from Spaced-Word Matches 173
5. Boden, M., Schöneich, M., Horwege, S., Lindner, S., Leimeister, C.-A., Morgen-
stern, B.: Alignment-free sequence comparison with spaced k-mers. In: German
Conference on Bioinformatics 2013. OpenAccess Series in Informatics (OASIcs),
vol. 34, pp. 24–34 (2013)
6. Leimeister, C.-A., Boden, M., Horwege, S., Lindner, S., Morgenstern, B.: Fast
alignment-free sequence comparison using spaced-word frequencies. Bioinformatics
30, 2000–2008 (2014)
7. Horwege, S., Sebastian, L., Boden, M., Hatje, K., Kollmar, M., Leimeister, C.-A.,
Morgenstern, B.: Spaced words and kmacs: fast alignment-free sequence comparison
based on inexact word matches. Nucleic Acids Research 42, W7–W11 (2014)
8. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Molecular Biology and Evolution 4, 406–425 (1987)
9. Haubold, B., Pierstorff, N., Möller, F., Wiehe, T.: Genome comparison without
alignment using shortest unique substrings. BMC Bioinformatics 6, 123 (2005)
10. Lippert, R.A., Huang, H., Waterman, M.S.: Distributional regimes for the number
of k-word matches between two random sequences. Proceedings of the National
Academy of Sciences 99, 13980–13989 (2002)
11. Kantorovitz, M., Robinson, G., Sinha, S.: A statistical method for alignment-free
comparison of regulatory sequences. Bioinformatics 23, 249–255 (2007)
12. Reinert, G., Chew, D., Sun, F., Waterman, M.S.: Alignment-free sequence comparison
(i): Statistics and power. Journal of Computational Biology 16, 1615–1634 (2009)
13. Jukes, T.H., Cantor, C.R.: Evolution of Protein Molecules. Academy Press (1969)
14. Robin, S., Rodolphe, F., Schbath, S.: DNA, Words and Models: Statistics of
Exceptional Words. Cambridge University Press, Cambridge (2005)
15. Haubold, B., Pfaffelhuber, P., Domazet-Loso, M., Wiehe, T.: Estimating muta-
tion distances from unaligned genomes. Journal of Computational Biology 16,
1487–1500 (2009)
16. Leimeister, C.-A., Morgenstern, B.: kmacs: the k-mismatch average common
substring approach to alignment-free sequence comparison. Bioinformatics 30,
1991–1999 (2014)
17. Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring
approach to phylogenomic reconstruction. Journal of Computational Biology 13,
336–350 (2006)
18. Sims, G.E., Jun, S.-R., Wu, G.A., Kim, S.-H.: Alignment-free genome comparison
with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the
National Academy of Sciences 106, 2677–2682 (2009)
19. Qi, J., Luo, H., Hao, B.: CVTree: a phylogenetic tree reconstruction tool based on
whole genomes. Nucleic Acids Research 32(suppl 2), W45–W47 (2004)
20. Felsenstein, J.: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5,
164–166 (1989)
21. Bonnet, E., Van de Peer, Y.: zt: A software tool for simple and partial Mantel tests.
Journal of Statistical Software 7, 1–12 (2002)
22. Didier, G., Laprevotte, I., Pupin, M., Hénaut, A.: Local decoding of sequences and
alignment-free comparison. J. Computational Biology 13, 1465–1476 (2006)
23. Kuiken, C., Leitner, T., Foley, B., Hahn, B., Marx, P., McCutchan, F., Wolinsky,
S., Korber, B.T. (eds.): HIV Sequence Compendium 2009. Theoretical Biology
and Biophysics Group, Los Alamos National Laboratory, Los Alamos, New Mexico
(2009)
24. Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R.,
McWilliam, H., Remmert, M., Söding, J., Thompson, J.D., Higgins, D.G.: Fast,
scalable generation of high-quality protein multiple sequence alignments using
Clustal Omega. Molecular Systems Biology 7, 539 (2011)
On the Family-Free DCJ Distance
1 Introduction
Genomes are subject to mutations or rearrangements in the course of evolution.
Typical large-scale rearrangements change the number of chromosomes and/or
the positions and orientations of genes. Examples of such rearrangements are
inversions, translocations, fusions and fissions. A classical problem in compara-
tive genomics is to compute the rearrangement distance, that is, the minimum
number of rearrangements required to transform a given genome into another
given genome [14].
In order to study this problem, one usually adopts a high-level view of genomes,
in which only “relevant” fragments of the DNA (e.g., genes) are taken into con-
sideration. Furthermore, a pre-processing of the data is required, so that we can
compare the content of the genomes.
One popular method, adopted for more than 20 years, is to group the genes
in both genomes into gene families, so that two genes in the same family are
said to be equivalent. This setting is said to be family-based. Without gene
duplications, that is, with the additional restriction that each family occurs
exactly once in each genome, many polynomial-time models have been proposed to
compute the genomic distance [3,4,12,17]. However, when gene duplications are
allowed, the problem is more intricate and all approaches proposed so far are
NP-hard; see for instance [1, 7, 8, 15, 16].
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 174–186, 2014.
© Springer-Verlag Berlin Heidelberg 2014
On the Family-Free DCJ Distance 175
2 Preliminaries
Let A and B be two distinct genomes and let A be the set of genes in genome
A and B be the set of genes in genome B.
Each gene g in a genome is an oriented DNA fragment that can be repre-
sented by the symbol g itself, if it has direct orientation, or by the symbol −g,
if it has reverse orientation. Furthermore, each one of the two extremities of a
linear chromosome is called a telomere, represented by the symbol ◦. Each chro-
mosome in a genome can be represented by a string that can be circular, if the
chromosome is circular, or linear and flanked by the symbols ◦ if the chromosome
is linear. For the sake of clarity, each chromosome is also flanked by parentheses.
As an example, consider the genome A = {(◦ 3 −1 4 2 ◦), (◦ 5 −6 −7 ◦)} that is
composed of two linear chromosomes.
Since a gene g has an orientation, we can distinguish its two ends, also called
its extremities, and denote them by g^t (tail) and g^h (head). An adjacency in a
genome is either the extremity of a gene that is adjacent to one of its telomeres,
or a pair of consecutive gene extremities in one of its chromosomes. If we consider
again the genome A above, the adjacencies in its first chromosome are 3^t, 3^h 1^h,
1^t 4^t, 4^h 2^t and 2^h.
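The adjacencies of a chromosome can be enumerated mechanically (an illustrative helper, not from the paper; signed integers encode orientation):

```python
def adjacencies(chromosome):
    """Adjacencies of a linear chromosome given as a list of signed genes,
    e.g. [3, -1, 4, 2] for the chromosome (o 3 -1 4 2 o)."""
    def ext(g):
        # Extremities in reading order: tail/head for direct orientation,
        # head/tail for reverse orientation.
        return (f"{g}t", f"{g}h") if g > 0 else (f"{-g}h", f"{-g}t")
    adj = []
    prev = None
    for g in chromosome:
        left, right = ext(g)
        adj.append(left if prev is None else prev + left)
        prev = right
    adj.append(prev)  # the extremity next to the right telomere
    return adj

assert adjacencies([3, -1, 4, 2]) == ["3t", "3h1h", "1t4t", "4h2t", "2h"]
```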
of the two input genomes and an edge connects the same extremities of genes in
both genomes. In other words, there is a one-to-one correspondence between the
set of edges in AG(A, B) and the set of gene extremities. Vertices have degree
one or two and thus an adjacency graph is a collection of paths and cycles. An
example of an adjacency graph is given in Figure 1.
1h 1t 3t 3h 4t 4h 2t 2h
2h 2t 1t 1h 4t 4h 3t 3h
Fig. 1. The adjacency graph for the two unichromosomal and linear genomes A =
{(◦ −1 3 4 2 ◦)} and B = {(◦ −2 1 4 3 ◦)}
The family-based DCJ distance ddcj between two genomes A and B without
duplications can be computed in linear time and is closely related to the number
of components in the adjacency graph AG(A, B) [4]:

$$d_{dcj}(A, B) = n - c - \frac{i}{2}\,,$$

where n = |A| = |B| is the number of genes in both genomes, c is the number of
cycles and i is the number of odd paths in AG(A, B).
Observe that, in Figure 1, the number of genes is n = 4 and AG(A, B) has
one cycle and two odd paths. Consequently the DCJ distance is ddcj (A, B) =
4 − 1 − 2/2 = 2.
The formula for ddcj (A, B) can also be derived using the following approach.
Given a component C in AG(A, B), let |C| denote the length, or number of
edges, of C. From [6,11] we know that each component in AG(A, B) contributes
independently to the DCJ distance, depending uniquely on its length. Formally,
the contribution d(C) of a component C in the total distance is given by:
$$d(C) = \begin{cases} \dfrac{|C|}{2} - 1\,, & \text{if } C \text{ is a cycle,}\\[4pt] \dfrac{|C|-1}{2}\,, & \text{if } C \text{ is an odd path,}\\[4pt] \dfrac{|C|}{2}\,, & \text{if } C \text{ is an even path.} \end{cases}$$
The sum of the lengths of all components in the adjacency graph is equal
to 2n. Let C, I, and P represent the sets of components in AG(A, B) that
are cycles, odd paths and even paths, respectively. Then, the DCJ distance can
be calculated as the sum of the contributions of each component:

$$d_{dcj}(A,B) = \sum_{C\in AG(A,B)} d(C) = \sum_{C\in\mathcal{C}}\left(\frac{|C|}{2}-1\right) + \sum_{C\in\mathcal{I}}\frac{|C|-1}{2} + \sum_{C\in\mathcal{P}}\frac{|C|}{2} = \frac{1}{2}\sum_{C\in AG(A,B)}|C| \;-\; \sum_{C\in\mathcal{C}} 1 \;-\; \frac{1}{2}\sum_{C\in\mathcal{I}} 1 = n - c - i/2\,.$$
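The per-component contributions and the closed formula can be cross-checked on any component multiset (a small sketch of ours):

```python
def d_component(length, kind):
    """Contribution d(C) of one component of the adjacency graph."""
    if kind == "cycle":
        return length // 2 - 1
    if kind == "odd_path":
        return (length - 1) // 2
    return length // 2  # even path

# Any component multiset with total length 2n satisfies
# sum d(C) = n - c - i/2:
comps = [(4, "cycle"), (1, "odd_path"), (3, "odd_path"), (2, "even_path")]
n = sum(l for l, _ in comps) // 2
c = sum(1 for _, k in comps if k == "cycle")
i = sum(1 for _, k in comps if k == "odd_path")
assert sum(d_component(l, k) for l, k in comps) == n - c - i // 2
```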
[Graph residue removed: genes 1–5 connected to genes 6, −7, −8, −9, 10, 11 by weighted edges]
Fig. 2. A possible gene similarity graph for the two unichromosomal linear genomes
A = {(◦ 1 2 3 4 5 ◦)} and B = {(◦ 6 −7 −8 −9 10 11 ◦)}
Let A and B be two genomes and let GSσ(A, B) be their gene similarity graph.
Now let M = {e1, e2, . . . , en} be a matching in GSσ(A, B) and denote by
w(M) = Σ_{ei∈M} σ(ei) the weight of M, that is, the sum of its edge weights. Since the
endpoints of each edge ei = (a, b) in M are not saturated by any other edge of M ,
we can unambiguously define the function s(a, M ) = s(b, M ) = i. The reduced
genome AM is obtained by deleting from A all genes that are not saturated by
M , and renaming each saturated gene a to s(a, M ), preserving its orientation.
Similarly, the reduced genome B M is obtained by deleting from B all genes
that are not saturated by M , and renaming each saturated gene b to s(b, M ),
preserving its orientation. Observe that the set of genes in AM and in B M is
G(M ) = {s(g, M ) : g is saturated by the matching M } = {1, 2, . . . , n}.
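Constructing a reduced genome can be sketched as follows (our own representation, not the authors' code: each gene is a (label, sign) pair, and matching edges are (A-gene, B-gene) pairs):

```python
def reduce_genome(genome, matching, side):
    """Reduced genome under matching M = [e_1, ..., e_n]: genes not
    saturated by M are dropped, and a saturated gene g with g in e_i is
    renamed to i (1-based), preserving its orientation (sign)."""
    rank = {edge[side]: i + 1 for i, edge in enumerate(matching)}
    return [[(rank[g], s) for g, s in chrom if g in rank]
            for chrom in genome]

A = [[("a", 1), ("c", 1), ("b", -1), ("d", 1)]]
M = [("a", "x"), ("c", "y")]  # hypothetical matching edges
assert reduce_genome(A, M, 0) == [[(1, 1), (2, 1)]]
```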
[Fig. 3 (graph residue removed): the gene similarity graph GSσ(A, B) with edge weights, and the weighted adjacency graphs AGσ(A^{M1}, B^{M1}) and AGσ(A^{M2}, B^{M2}) for two matchings M1 and M2]
Let C, I, and P represent the sets of components in AGσ(A^M, B^M) that are
cycles, odd paths and even paths, respectively. Summing the contributions of all
the components, the resulting distance for a certain matching M is computed as
follows:

$$d_\sigma(A^M, B^M) = \sum_{C\in AG_\sigma(A^M, B^M)} d_\sigma(C)$$
Proof. Using notation from [2] (Chapter 8), we give an AP-reduction (f, g, β)
from (1,2)-exdcj-distance to ffdcj-distance as follows:
Algorithm f receives as input a positive rational number δ and an instance
(A, B) of (1,2)-exdcj-distance, where A and B are genomes over a set of genes
G and each gene in G occurs at most once in A and at most twice in B, and
constructs an instance (A′, B′) = f(δ, (A, B)) of ffdcj-distance as follows.
1 2 −3 4
−5 6 7 8 9 −10
Fig. 4. Gene similarity graph GSσ(A′, B′) constructed from the input genomes A =
{(◦ a c −b d ◦)} and B = {(◦ −c d a c b −b ◦)} of (1,2)-exdcj-distance, where all
edge weights are 1. Highlighted edges represent a maximal matching in GSσ(A′, B′).
[Figure: gene similarity graph on genes 1, −2, 3, −4, 5, −6 in A and −7, 8, −9, 10, −11, 12 in B; three edges have weight 1 − ε and six edges have weight 1.]
There are several matchings in GSσ (A, B). We are interested in two particular
maximal matchings:
– M∗ is composed of all edges that have weight 1 − ε. It has weight w(M∗) = (1 − ε)|M∗|. Its corresponding weighted adjacency graph AGσ(AM∗, BM∗) has |M∗| − 1 cycles and two odd paths, thus ddcj(AM∗, BM∗) = 0. Consequently, we have dσ(AM∗, BM∗) = |M∗| − (1 − ε)|M∗| = ε|M∗|.
– M is composed of all edges that have weight 1. It is the only matching with the maximum weight w(M) = |M|. Its corresponding weighted adjacency graph AGσ(AM, BM) has two even paths, but no cycles or odd paths, giving ddcj(AM, BM) = |M|. Hence, dσ(AM, BM) = 2|M| − |M| = |M|.
Notice that dffdcj(A, B) ≤ dσ(AM∗, BM∗). Furthermore, since |M| = 2|M∗|,

dσ(AM, BM) / dσ(AM∗, BM∗) = |M| / (ε|M∗|) = 2/ε .
xe = 1 , ∀ e ∈ Ea .
We require then that, for each vertex in H, exactly one incident edge to it be
chosen:
Σ_{uv∈Em∪Es} x_uv = 1, ∀ u ∈ XA, and Σ_{uv∈Em∪Es} x_uv = 1, ∀ v ∈ XB.
Then, we require that the final solution be consistent, meaning that if one
extremity of a gene in A is assigned to an extremity of a gene in B, then the
other extremities of these two genes have to be assigned as well:
To count the number of cycles, we use the same strategy as described in [16].
We first give an arbitrary index for each vertex in H such that V (H) = {v1 , v2 ,
. . . , vk } with k = |V (H)|. For each vertex vi , we define a variable yi that labels
vi such that
0 ≤ yi ≤ i , 1≤i≤k.
We also require that all vertices in the same cycle in the solution have the same
label:
yi ≤ yj + i · (1 − xe ) , ∀ e = vi vj ∈ E(H) ,
yj ≤ yi + j · (1 − xe ) , ∀ e = vi vj ∈ E(H) .
i · zi ≤ yi , 1≤i≤k.
Notice that, by the way the variables zi were defined, they count the number of cycles in H [16].
Finally, we set the objective function as follows:
minimize 2 Σ_{e∈Em} x_e − Σ_{e∈Em} w_e x_e − Σ_{1≤i≤k} z_i ,
which is exactly the family-free DCJ distance dffdcj (A, B) as defined in Section 3.
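The labelling constraints described above can be emitted mechanically. The sketch below produces them as plain strings (an illustration only; a real implementation would add them as rows of an ILP model and hand them to a solver such as CPLEX):

```python
def cycle_label_constraints(edges, k):
    """Emit, as plain strings, the labelling constraints
    y_i <= y_j + i*(1 - x_e) and y_j <= y_i + j*(1 - x_e) for every
    edge e = v_i v_j, plus i*z_i <= y_i for every vertex index i.
    Illustrative sketch; variable naming is our own."""
    rows = []
    for i, j in edges:
        rows.append(f"y{i} <= y{j} + {i}*(1 - x_{i}_{j})")
        rows.append(f"y{j} <= y{i} + {j}*(1 - x_{i}_{j})")
    for i in range(1, k + 1):
        # z_i may only be 1 if vertex v_i keeps its own label y_i = i,
        # i.e. v_i is the largest-index vertex of its cycle.
        rows.append(f"{i}*z{i} <= y{i}")
    return rows
```

The two inequalities per edge force all vertices of a cycle in the solution to share one label, which is how the z_i variables end up counting cycles.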
We performed some initial simulated experiments of our integer linear pro-
gram formulation. We produced some datasets using the Artificial Life Simulator
(ALF) [9]. Genome sizes varied from 1000 to 3000 genes, where the gene lengths
were generated according to a gamma distribution with shape parameter k = 3
and scale parameter θ = 133. A birth-death tree with 10 leaves was generated,
with PAM distance of 100 from the root to the deepest leaf. For the amino
acid evolution, the WAG substitution model with default parameters was used,
with Zipfian indels at a rate of 0.000005. For structural evolution, gene dupli-
cations and gene losses were applied with a 0.001 rate, with a 0.0025 rate for
reversals and translocations. To test different ratios of rearrangement events, we
also simulated datasets where the structural evolution ratios had a 2- and 5-fold
increase.
To solve the ILPs, we ran the CPLEX Optimizer1 on the 45 pairwise compar-
isons of each simulated dataset. All simulations were run in parallel on a cluster
consisting of machines with an Intel(R) Xeon(R) E7540 CPU, with 48 cores and
as many as 2 TB of memory, but for each individual CPLEX run only 4 cores
and 2 GB of memory were allocated. The results are summarized in Table 1.
Table 1. ILP results for datasets with different genome sizes and evolutionary rates.
Each dataset has 10 genomes, totalling 45 pairwise comparisons. Maximum running
time was set to 20 minutes. For each dataset, we show the number of runs that found
an optimal solution in time and their average running time. For the runs that did not
finish, the last row shows the gap between the upper bound and the current solution.
Rate r = 1 means the default rate for ALF evolution, and r = 2 and r = 5 mean 2-fold
and 5-fold increase for the gene duplication, gene deletion and rearrangement rates.
Rate           r = 1   r = 2   r = 5   r = 1   r = 2   r = 5   r = 1   r = 2   r = 5
Finished       45/45   22/45   6/45    45/45   9/45    1/45    45/45   7/45    3/45
Avg. Time (s)  0.66    11.09   24.26   1.29    2.76    16.97   2.24    16.36   36.01
Avg. Gap (%)   0       1.08    3.9     0       1.93    12.4    0       3.9     6.03
6 Conclusion
In this paper, we have defined a new distance measure for two genomes that is
motivated by the double cut and join model, while not relying on gene annota-
tions in form of gene families. In case gene families are known and each family
has exactly one member in each of the two genomes, the distance equals the
family-based DCJ distance and thus can be computed in linear time. In the gen-
eral case, however, it is NP-hard and even hard to approximate. Nevertheless, we
could give an integer linear program for the exact computation of the distance
that is fast enough to be applied to realistic problem instances.
The family-free model has great potential when gene family assignments are not available or are ambiguous; in fact, it can even be used to improve family assignments [13]. The work presented in this paper is another step in this direction.
References
1. Angibaud, S., Fertin, G., Rusu, I., Thévenin, A., Vialette, S.: On the approxima-
bility of comparing genomes with duplicates. J. Graph Algorithms Appl. 13(1),
19–53 (2009)
2. Ausiello, G., Protasi, M., Marchetti-Spaccamela, A., Gambosi, G., Crescenzi, P.,
Kann, V.: Complexity and Approximation: Combinatorial Optimization Problems
and Their Approximability Properties. Springer (1999)
3. Bafna, V., Pevzner, P.: Genome rearrangements and sorting by reversals. In: Proc.
of FOCS 1993, pp. 148–157 (1993)
4. Bergeron, A., Mixtacki, J., Stoye, J.: A unifying view of genome rearrange-
ments. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS (LNBI), vol. 4175,
pp. 163–173. Springer, Heidelberg (2006)
5. Braga, M.D.V., Chauve, C., Dörr, D., Jahn, K., Stoye, J., Thévenin, A., Wittler,
R.: The potential of family-free genome comparison. In: Chauve, C., El-Mabrouk,
N., Tannier, E. (eds.) Models and Algorithms for Genome Evolution, ch. 13, pp.
287–307. Springer (2013)
6. Braga, M.D.V., Stoye, J.: The solution space of sorting by DCJ. J. Comp.
Biol. 17(9), 1145–1165 (2010)
7. Bryant, D.: The complexity of calculating exemplar distances. In: Sankoff, D.,
Nadeau, J.H. (eds.) Comparative Genomics, pp. 207–211. Springer, Netherlands
(2000)
8. Bulteau, L., Jiang, M.: Inapproximability of (1,2)-exemplar distance. IEEE/ACM
Trans. Comput. Biol. Bioinf. 10(6), 1384–1390 (2013)
9. Dalquen, D.A., Anisimova, M., Gonnet, G.H., Dessimoz, C.: ALF–a simulation
framework for genome evolution. Mol. Biol. Evol. 29(4), 1115–1123 (2012)
10. Dörr, D., Thévenin, A., Stoye, J.: Gene family assignment-free comparative ge-
nomics. BMC Bioinformatics 13(Suppl 19), S3 (2012)
11. Feijão, P., Meidanis, J.: SCJ: A breakpoint-like distance that simplifies sev-
eral rearrangement problems. IEEE/ACM Trans. Comput. Biol. Bioinf. 8(5),
1318–1329 (2011)
12. Hannenhalli, S., Pevzner, P.: Transforming men into mice (polynomial algorithm
for genomic distance problem). In: Proc. of FOCS 1995, pp. 581–592 (1995)
13. Lechner, M., Hernandez-Rosales, M., Doerr, D., Wieseke, N., Thévenin, A., Stoye,
J., Hartmann, R.K., Prohaska, S.J., Stadler, P.F.: Orthology detection combining
clustering and synteny for very large datasets (unpublished manuscript)
14. Sankoff, D.: Edit distance for genome comparison based on non-local operations.
In: Apostolico, A., Galil, Z., Manber, U., Crochemore, M. (eds.) CPM 1992. LNCS,
vol. 644, pp. 121–135. Springer, Heidelberg (1992)
15. Sankoff, D.: Genome rearrangement with gene families. Bioinformatics 15(11),
909–917 (1999)
16. Shao, M., Lin, Y., Moret, B.: An exact algorithm to compute the DCJ distance
for genomes with duplicate genes. In: Sharan, R. (ed.) RECOMB 2014. LNCS,
vol. 8394, pp. 280–292. Springer, Heidelberg (2014)
17. Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permuta-
tions by translocation, inversion and block interchanges. Bioinformatics 21(16),
3340–3346 (2005)
New Algorithms for Computing
Phylogenetic Biodiversity
Center for Massive Data Algorithmics, a Center of the Danish National Research
Foundation.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 187–203, 2014.
© Springer-Verlag Berlin Heidelberg 2014
188 C. Tsirogiannis, B. Sandel, and A. Kalvisa
1 Introduction
Researchers in the field of ecology, but also from other disciplines in biology,
are frequently confronted with the following problem: given a set of species,
they want to measure if these species are close evolutionary relatives. The most
common way to measure this is to use a phylogenetic tree T , where each leaf
of the tree corresponds to a species, and the weights of the tree edges represent
some concept of distance, e.g., time since the last speciation event. From T we
select a subset of leaves R which correspond to the species that we want to
examine. The next step is then to choose a method for computing the distance
between the leaves in R based on the structure of T . In the related literature,
such methods are referred to as phylogenetic biodiversity measures. Two measures
of this kind that are widely used are the Phylogenetic Diversity (PD) and the
Mean Nearest Taxon Distance (MNTD). For a given tree T and a subset R of its
leaves, the value of the PD is equal to the cost of the minimum-weight Steiner
tree in T that spans the nodes in R. The value of the MNTD is the average path
cost in T between any node v ∈ R and its closest neighbour in R \ {v}.
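The MNTD definition above can be evaluated directly once all pairwise path costs in T are known. A minimal sketch, where the distance-matrix representation (a dict keyed by unordered leaf pairs) is our own illustration:

```python
def mntd(dist, R):
    """Mean Nearest Taxon Distance over a leaf subset R.

    dist maps frozenset({u, v}) to the path cost between leaves u and v
    in the tree; for every v in R we take the cost to its closest
    neighbour in R \\ {v}, then average. Sketch of the definition only.
    """
    R = list(R)
    total = 0.0
    for v in R:
        total += min(dist[frozenset((v, u))] for u in R if u != v)
    return total / len(R)
```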
Whichever method we choose for computing the distance between the elements
in R, we need to know if the returned distance value is relatively small or large
compared to other sets of leaves in T . More specifically, we need to compare the
distance value that we got for R with the distance values of all possible subsets
of leaves in T that have exactly the same number of elements. In several case
studies in biology this is done by computing the mean and the variance of the
distance values among all those subsets of species [10,4,9,8]. We can then use
these to calculate a standardized index; from the distance value that we got
for R we subtract the mean and divide by the standard deviation. Depending
on the distance measure that we choose, we can use this method to produce
several indices. Some of the most widely used indices of this kind are the Net
Relatedness Index (NRI), the Nearest Taxon Index (NTI, based on the MNTD)
and the Phylogenetic Diversity Index (PDI, based on the PD) [15,17].
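The standardization step described above (subtract the mean, divide by the standard deviation) is a one-liner; the helper name is ours, and note that some indices such as the NRI additionally flip the sign of this quantity:

```python
import math

def standardized_index(observed, values):
    """Standardized effect size of an observed distance value against
    a collection of reference values: (observed - mean) / sd, using
    the population standard deviation. Illustrative helper."""
    m = sum(values) / len(values)
    var = sum((x - m) ** 2 for x in values) / len(values)
    return (observed - m) / math.sqrt(var)
```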
In a previous paper we introduced algorithms that compute the values for
the mean and the variance of the PD [15]. For a tree T that consists of n nodes in
total, and for a non-negative integer r, we introduced an algorithm that computes
in O(n) time the mean value of the PD among all possible subsets that consist
of r leaves. We also introduced an algorithm that computes the variance of
the PD in Θ(n²) time. The latter algorithm is quite inefficient since it takes Θ(n²)
time to execute, not only in the worst case but for every input tree. This makes
the use of this algorithm limited in practice, since in some applications it is
required to calculate the variance of some measure for a large number of different
trees (for example, constructed algorithmically by slightly changing the structure
of a given reference tree).
On the other hand, there are no known algorithms for computing the exact
value of the mean and the variance of the MNTD. So far, researchers try to
estimate these values using a random sampling technique; for a given subset
size r, a few subsets of exactly r leaves in T are selected at random. Then,
the mean and the variance of the MNTD is calculated using the values of this
measure only for the selected subsets. The number of the sampled subsets is
usually around a thousand. For sufficiently large values of r and n, this is a very
small number of samples compared to the number of all possible subsets of r
leaves in T . This implies that the sampling approach is inexact, and may yield
estimated values for the mean and the variance that are very different from the
original ones. Hence, there is a need to introduce exact and efficient algorithms
for computing these statistics for the MNTD, which are required to derive the
commonly used NTI [11].
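The random sampling procedure criticised above is easy to sketch; the function name and the generic `statistic` callback are our own illustration, and the paper's point is precisely that exact algorithms make this approximation unnecessary:

```python
import random
import statistics

def sample_moments(leaves, r, statistic, n_samples=1000, seed=0):
    """Estimate the mean and variance of `statistic` over random
    r-subsets of `leaves` by sampling, mirroring the randomised
    procedure used in practice (usually around a thousand samples).
    Sketch only."""
    rng = random.Random(seed)
    vals = [statistic(rng.sample(leaves, r)) for _ in range(n_samples)]
    return statistics.mean(vals), statistics.pvariance(vals)
```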
Furthermore, in some studies it is required to compute not only the mean
and the variance, but also the higher order moments of a given measure [3].
Unfortunately, for the most popular phylogenetic biodiversity measures
computing the higher order statistics appears to be a difficult task. For the PD
and the MNTD, all preliminary attempts that we made to compute the higher order moments led to algorithms whose running time scales exponentially as the order of the moment increases. Yet, to this point we have not proven
that designing more efficient algorithms is impossible; this is a conjecture. On
the other hand, the skewness of another popular measure, the Mean Pairwise
Distance (MPD), can be computed in O(n) time [14]. However, the analytical
expression that yields the value of the MPD skewness is particularly involved.
Worse than that, it appears that deriving an expression for the higher order
moments of the MPD may be overwhelmingly complicated. Therefore, there
is the need for a non-trivial biodiversity measure for which we can efficiently
compute its higher order moments.
Our Results. In this paper we present several results that have to do with
the efficient computation of the statistical moments of certain phylogenetic
biodiversity measures. Given a phylogenetic tree T and a positive integer r, we
describe an algorithm that computes the variance of the PD among all subsets
of r leaves in O(SI(T) + DSSI²(T)) time, using O(n) space. Here, we use SI(T)
to denote the Sackin's Index of T, which is equal to the sum of the numbers of leaves that appear in the subtree of each node in T [2]. We use DSSI(T) to
denote a new index that we introduce, which we call the Distinct Subtree Sizes
Index. We provide a formal definition of this new index later in this paper. The
values of both the SI(T ) and the DSSI(T ) depend on the structure of the tree T .
When T is relatively balanced, the new algorithm has a very good performance,
and is much more efficient in practice than the already known Θ(n2 ) algorithm.
It is only in the worst case, when T has Ω(n) height, that the new algorithm runs
in Θ(n2 ) time. Moreover, we present for the first time algorithms for computing
the exact value of the mean and the variance of the MNTD for ultrametric trees;
a tree is called ultrametric if any simple path from its root to a leaf node has
the same cost. Given an ultrametric tree T of n nodes and a positive integer r,
we provide an algorithm that runs in O(n) time, and computes the mean of
the MNTD among all subsets of r leaves in T . We also present an algorithm that
computes the variance of the MNTD in O(SI(T) + DSSI²(T)) time, using O(n) space. This algorithm is based on the same method as our new algorithm that computes the variance of the PD.
Related literature. The definition of the PD that we provide in this paper (that
is the cost of the min-weight Steiner tree of a subset of leaves) is known in the
related literature as the unrooted version of the PD. Steel was the first to provide
a formula for the exact computation of the mean of the PD over all subsets of r
leaves of a tree T [13]. This formula describes the value of the mean for the rooted
variant of the PD; in this variant, for a given subset R of the leaves of T, the value of
the PD is equal to the value of the unrooted PD, plus the cost of the path that
connects the root of T with the deepest common ancestor of all elements in R.
In a previous paper we introduced exact expressions for computing the mean
and the variance of the unrooted PD, and we examined issues related to their
efficient computation [15]. Nipperess and Matsen [12] provide a related result for
a more general version of the problem. They derive formulas for the mean and
the variance of the PD for subsets of nodes in T that may also include internal
nodes. They provide such formulas both for the rooted and the unrooted version
of the PD. Faller et al. [6] and O’Dwyer et al. [5] consider several probability
distributions for sampling subsets of leaves from a tree. In the version of the
problem that they examine, formulas for the mean and the variance of the PD are
derived among subsets of leaves that do not have the same number of elements.
To our knowledge, except our previous work, none of the above papers is
concerned with analysing the running time of an algorithm that evaluates the
derived formulas. Unlike these works, in the current paper we do not provide
a new formula for the variance of the PD. Instead, among other results, we
describe a novel non-trivial method for speeding up significantly the evaluation
of the existing formula.
SI(T) = Σ_{v∈V} s(v).
Alternatively, in the related literature the Sackin’s index is described as the
sum of the depths of all leaf nodes in T . Both definitions are equivalent since
they lead to exactly the same value. The Sackin’s index is mainly used in the
literature as a function for measuring how balanced a phylogenetic tree is [2].
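The leaf-depth form of the definition above translates directly into a traversal; a minimal sketch, where the child-map tree representation is our own:

```python
def sackin_index(children, root):
    """Sackin's index computed as the sum of the depths of all leaves,
    the alternative formulation described in the text. `children`
    maps each node to its list of children (leaves are absent or map
    to an empty list). Illustrative sketch."""
    total = 0
    stack = [(root, 0)]
    while stack:
        v, depth = stack.pop()
        kids = children.get(v, [])
        if not kids:          # v is a leaf: add its depth
            total += depth
        for c in kids:
            stack.append((c, depth + 1))
    return total
```

For a balanced binary tree with four leaves, every leaf sits at depth 2, so the index is 8; a caterpillar tree over the same leaves gives a larger value, which is why the index measures imbalance.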
The variance of f over all subsets of r leaves is equal to:
where:
$$
F(S,e,l,r)=\begin{cases}
F_{\mathrm{Off}}(S,e,l,r)=\dfrac{\binom{s(e)}{r}+\binom{s-s(l)}{r}-\binom{s(e)-s(l)}{r}}{\binom{s}{r}} & \text{if } l \in \mathrm{Off}(e),\\[1.5ex]
F_{\mathrm{Off}}(S,l,e,r)=\dfrac{\binom{s(l)}{r}+\binom{s-s(e)}{r}-\binom{s(l)-s(e)}{r}}{\binom{s}{r}} & \text{if } e \in \mathrm{Off}(l),\\[1.5ex]
F_{\mathrm{Ind}}(S,e,l,r)=\dfrac{\binom{s-s(e)}{r}+\binom{s-s(l)}{r}-\binom{s-s(e)-s(l)}{r}}{\binom{s}{r}} & \text{otherwise,}
\end{cases}
$$
and where μPD (T , r) is the mean value of the PD over all possible subsets of
exactly r leaves of T . In our previous paper we showed how we can compute this
mean value for a given r in O(n) time. Hence, the bottleneck for calculating the
variance of this metric is the computation of the following quantity:
$$\sum_{e\in E}\sum_{l\in E} w(e)\,w(l)\,F(S,e,l,r). \qquad (4)$$
It is easy to show that the first and the second sum in (4) consist of Θ(n)
terms, and therefore they can be computed in O(n) time. The third sum in (4)
¹ In the definition of F, all the required values that involve binomial coefficients can be precomputed in O(n) time in total in the RAM model. Each of the precomputed values can then be accessed in constant time each time we have to evaluate this expression.
consists of SI(T ) terms since for every edge e ∈ E there exist s(e) terms in this
sum. Since we can evaluate each of these terms in constant time, the expression
in (4) can be evaluated in O(SI(T )) time in total.
The two nested sums of the quantity in (5) can be analysed as follows:
Based on the same arguments as for the expression in (4), the two last sums
in (6) can be evaluated in O(SI(T )) time in total. Let α
be a positive integer such that α ∈ D(T ). Recall that D(T ) is the set of all
values s(e) that we can observe among the edges of T . Let ζ(α) denote the sum
of the weights of all the edges e ∈ E for which it holds s(e) = α, that means:
ζ(α) = Σ_{e∈E : s(e)=α} w(e).
Using this notation, the first sum in (6) can be written as:
In the last expression, we slightly abuse the notation for function FInd; for two integers α, β ∈ D(T) we imply that FInd(S, α, β, r) = FInd(S, e, l, r), where s(e) = α and s(l) = β. The sum in (7) consists of Θ(DSSI²(T)) terms. Each of these terms can be evaluated in constant time, given that we have precomputed the values ζ(α), ∀α ∈ D(T). The values ζ(α) can be precomputed trivially in Θ(n) time altogether, hence the expression in (7) can be evaluated in Θ(DSSI²(T))
time in total. Given the description that we provided for evaluating the
expressions from (4) to (7), we conclude that the variance of the PD can be
computed in O(SI(T) + DSSI²(T)) time overall. To do this, we need to store the
values of the functions FOff , and FInd , and the values ζ(α) for every α ∈ D(T ).
These require O(n) memory in total, and the theorem follows.
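The ζ(α) precomputation that underlies the speed-up is a simple grouping step; the (subtree_size, weight) edge representation below is our own illustration:

```python
from collections import defaultdict

def zeta(edges):
    """zeta[alpha] = total weight of the edges e with s(e) = alpha,
    the precomputation used when evaluating the grouped sum (7).
    `edges` is a list of (subtree_size, weight) pairs. Sketch only."""
    z = defaultdict(float)
    for size, w in edges:
        z[size] += w
    return dict(z)
```

After this O(n) pass, the grouped sum ranges over the distinct subtree sizes D(T) instead of over all edge pairs, which is where the DSSI²(T) term comes from.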
the variance takes O(n²) time. In Section 4 we present experimental results that
indicate that the new approach is much more efficient in practice. For different
tree data sets that we use there, the values of SI(T ) and DSSI(T ) are much
smaller than in the worst case scenario. In fact, we can prove a non-trivial tight
worst case bound for DSSI(T ); this bound depends on the number of nodes and
the height of T . The bound that we provide applies to trees that have a height
that is at least logarithmic to the number of tree nodes (for example, trees where
the nodes have constant maximum degree). The proof of the following lemma
appears in the full version of this paper.
Lemma 1. Let T be a phylogenetic tree that consists of n nodes and has height h(T). In the worst case, the value of DSSI(T) can be as large as $\Theta(\sqrt{n \cdot h(T)})$.
$$\mathrm{MNTD}(T,R) = \frac{1}{r}\sum_{v\in R}\ \min_{u\in R\setminus\{v\}} \mathrm{cost}(u,v). \qquad (8)$$
Like with other phylogenetic measures, in order to analyse the value of the
MNTD for a set of leaves R it is important to compute the mean and the variance
of this measure for all possible subsets of |R| leaves in T . Next we provide for
the first time formal expressions that lead to the efficient computation of the
exact value of the mean and the variance of the MNTD. The expressions that
we provide hold only for ultrametric phylogenetic trees; recall that a tree T is
ultrametric if all simple paths between the root and the leaves of T have the
same cost. Ultrametric tree datasets are very common in phylogenetic research;
for instance, ultrametric trees are produced for a given set of taxa when the
weights of the tree edges represent specific notions of distance, such as time
between speciation events. In the next lemma we show how we can simplify the
expression in (8) when we specifically consider ultrametric trees.
Lemma 2. Let T be an ultrametric phylogenetic tree and let R ⊆ S be a subset
of r leaves. The value of the MNTD for this subset is equal to:
$$\mathrm{MNTD}(T,R) = \frac{2}{r}\sum_{\substack{e\in E\\ sr(e)=1}} w(e). \qquad (9)$$
Proof. Let v be a leaf in R, and let u be the closest leaf to v in R \ {v}. That means cost(u, v) = min_{x∈R\{v}} cost(v, x). Let p(u, v) be the simple path that
Next we use the expression in (9) to obtain expressions for efficiently computing
the mean and the variance of the MNTD for ultrametric trees.
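As an aside, Equation (9) reduces the MNTD on an ultrametric tree to a single pass over the edges once the counts sr(e) are known; a minimal sketch, assuming edges are given as (sr(e), w(e)) pairs:

```python
def mntd_ultrametric(edges, r):
    """Evaluate Eq. (9): MNTD(T, R) = (2/r) * sum of w(e) over edges
    whose subtree contains exactly one leaf of the sample R, i.e.
    sr(e) = 1. `edges` is a list of (sr_e, w_e) pairs; illustrative
    representation only, valid for ultrametric trees."""
    return 2.0 / r * sum(w for sr, w in edges if sr == 1)
```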
Theorem 2. Let T be an ultrametric phylogenetic tree that has s leaves and
consists of n nodes in total. Let r be a non-negative integer with r ≤ s. The
expected value of the MNTD for a subset of exactly r leaves in T is equal to:
$$\mu_{\mathrm{MNTD}}(T,r) = \frac{2}{r}\sum_{e\in E} w(e)\cdot s(e)\cdot \frac{\binom{s-s(e)}{r-1}}{\binom{s}{r}}, \qquad (10)$$
The expected value EMNTD(T, r) of the MNTD over all subsets of r leaves in T is equal to:

$$E_{\mathrm{MNTD}}(T,r) = \mathbb{E}_{R\in \mathrm{Sub}(S,r)}\Bigg[\frac{2}{r}\sum_{\substack{e\in E\\ sr(e)=1}} w(e)\Bigg] = \mathbb{E}_{R\in \mathrm{Sub}(S,r)}\Bigg[\frac{2}{r}\sum_{e\in E} w(e)\cdot SP(e,R)\Bigg] \qquad (11)$$

$$= \frac{2}{r}\sum_{e\in E} w(e)\cdot \mathbb{E}_{R\in \mathrm{Sub}(S,r)}\big[SP(e,R)\big]. \qquad (12)$$
Considering that every subset R of exactly r leaves is picked with the same
probability, the expected value of the function SP (e, R) is equal to:
$$\mathbb{E}_{R\in \mathrm{Sub}(S,r)}\big[SP(e,R)\big] = \frac{s(e)\binom{s-s(e)}{r-1}}{\binom{s}{r}}. \qquad (13)$$
To compute the value of this expression, we first precompute the values $\binom{x}{r-1}/\binom{s}{r}$ for every integer x ∈ [r − 1, s]. This can be done altogether in O(n) time in
the RAM model. Given these values, the rest of the expression (10) can be
straightforwardly evaluated in O(n) time.
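The precomputation just described can be sketched as follows; for brevity this version calls `math.comb` directly rather than using the iterative O(n) update the text has in mind, so treat it as an illustration:

```python
from math import comb

def precompute_ratios(s, r):
    """Precompute C(x, r-1) / C(s, r) for every x in [r-1, s], the
    table used to evaluate Eq. (10). Returns a dict keyed by x.
    Sketch only; a linear-time version would derive each ratio from
    the previous one instead of recomputing binomials."""
    denom = comb(s, r)
    return {x: comb(x, r - 1) / denom for x in range(r - 1, s + 1)}
```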
The proof of the next theorem appears in the full version of this paper.
$$\mathrm{var}_{\mathrm{MNTD}}(T,r) = \frac{4}{r^2}\sum_{e\in E}\sum_{l\in E} w(e)\cdot w(l)\cdot G(S,e,l,r) - \mu^2_{\mathrm{MNTD}}(T,r), \qquad (14)$$
where:
$$
G(S,e,l,r)=\begin{cases}
G_{\mathrm{Off}}(S,e,l,r)=\dfrac{s(l)\binom{s-s(e)}{r-1}}{\binom{s}{r}} & \text{if } l \in \mathrm{Off}(e),\\[1.5ex]
G_{\mathrm{Off}}(S,l,e,r)=\dfrac{s(e)\binom{s-s(l)}{r-1}}{\binom{s}{r}} & \text{if } e \in \mathrm{Off}(l),\\[1.5ex]
G_{\mathrm{Ind}}(S,e,l,r)=\dfrac{s(e)\,s(l)\binom{s-s(e)-s(l)}{r-2}}{\binom{s}{r}} & \text{otherwise.}
\end{cases}
$$
species in R have a common ancestor which is deep in the tree, and they are
closely related. On the other hand, if CAC(T , R, 0.51) is zero then R consists
of at least two main unrelated groups of species. Early experiments that we
conducted have demonstrated that the CAC is strongly positively related to
the NRI and weakly negatively related to the PD, relationships which we intend
to explore further in a future publication. In the present paper we focus on
the computational aspects of this measure; we examine how we can compute
efficiently the CAC and the values of its statistical moments.
For a given sample of leaves R and an integer χ ∈ (0.5, 1], value CAC(T , R, χ)
can be computed in O(n) time in the following way; first, we compute bottom-up
the values sr(e) for every e ∈ E. Then, we start from the root of T and we
compute CAC(T , R, χ) by constructing incrementally the path that connects
the root with vanc (R, χ).
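The root-to-vanc walk can be sketched as follows, under our reading that vanc(R, χ) is the deepest node whose subtree contains at least χ·r leaves of the sample (check the paper's formal definition); since χ > 0.5, at most one child of any node can qualify, so the path is unique. All names and the tree representation are illustrative:

```python
def cac(children, weight, sr, root, r, chi):
    """Cost of the path from the root down to the deepest node whose
    subtree holds at least chi*r sample leaves (our reading of
    v_anc(R, chi)). `sr` maps a node to the number of sample leaves
    below it; `weight` maps (parent, child) edges to costs. Sketch."""
    cost, v = 0.0, root
    while True:
        # At most one child can hold > half of the sample when chi > 0.5.
        nxt = next((c for c in children.get(v, [])
                    if sr.get(c, 0) >= chi * r), None)
        if nxt is None:
            return cost
        cost += weight[(v, nxt)]
        v = nxt
```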
The major advantage of using the CAC in phylogenetic analysis is that, for
a given χ and size of R, we can efficiently compute in practice the value of any
statistical moment of this measure. To describe how we can do this, we define the
following quantity:
$C_\chi(T, r, k) = \mathbb{E}_{R\in \mathrm{Sub}(S,r)}\big[\mathrm{CAC}^k(T, R, \chi)\big].$
We can compute any of the moments of CAC by using the values Cχ (T , r, k).
In particular, the expectation of the CAC for r leaves is equal to Cχ(T, r, 1), and the variance is equal to Cχ(T, r, 2) − C²χ(T, r, 1). Using a standard formula from the
mathematical literature, for any integer k > 3 the k-th order moment of CAC
for r leaves can be expressed as:
$$\frac{\sum_{i=0}^{k}\binom{k}{i}\,\big(-C_\chi(T,r,1)\big)^{k-i}\,C_\chi(T,r,i)}{\big(C_\chi(T,r,2)-C_\chi^2(T,r,1)\big)^{k/2}}. \qquad (15)$$
Therefore, computing the k-th order moment of CAC boils down to calculating
values Cχ (T , r, i) for every i = 1, 2, . . . , k. In the next lemma we show that this
can be done efficiently in practice. The proof of this lemma is provided in the
full version of this paper.
Lemma 3. Let T be a phylogenetic tree that has s leaves and consists of n
nodes in total. Let r ≤ s be a positive integer and let χ be real number such
that χ ∈ (0.5, 1] . For any positive integer k it holds that:
$$C_\chi(T,r,k) = \mathbb{E}_{R\in \mathrm{Sub}(S,r)}\big[\mathrm{CAC}^k(T,R,\chi)\big] = \sum_{v\in V} \mathrm{cost}^k\big(v,\mathrm{root}(T)\big)\cdot \frac{\displaystyle\sum_{i=\lceil r\chi\rceil}^{s(v)}\binom{s(v)}{i}\binom{s-s(v)}{r-i} \;-\; \sum_{u\in\mathrm{Ch}(v)}\sum_{j=\lceil r\chi\rceil}^{s(u)}\binom{s(u)}{j}\binom{s-s(u)}{r-j}}{\binom{s}{r}}. \qquad (16)$$
ninety-nine times, each time using a different value of r, ranging from two to one
hundred. Preliminary measurements showed that the value of r does not affect in
practice the performance of any of the examined algorithms. This is also the case
with the value of the χ parameter and the performance of the CAC algorithm. In
the experiments we ran this algorithm with parameter values χ = 0.6 and k = 3.
We also calculated the values of the SI and the DSSI for each dataset. These
results are presented in Table 1.
Table 1. The results of the experiments that involve trees which represent relations
between species in the real-world. The running time of each algorithm is measured over
ninety-nine consecutive executions on the same dataset (PD Old = the old approach
for computing the PD variance, PD New = the new algorithm for computing the PD
variance, CAC = the algorithm that computes the first k moments of the CAC for k = 3
and χ = 0.6). Running times are presented in seconds.
ninety-nine times for each input tree, and we measured the total time taken
for these executions. Figure 1 illustrates the running times of the old and the
new algorithm that compute the variance of the PD, and the running times
of the algorithm that computes the first k moments of the CAC for χ = 0.6
and k = 3. Also, for each T ∈ U we measured the values of the SI and the DSSI.
Furthermore, we measured the running time of the algorithm that computes the
moments of the CAC for a fixed tree of 4,000 leaves and for different values of k (see Figure 2).
Fig. 1. The running times of three of the implemented algorithms using as input
randomly generated trees. For each algorithm, the continuous line segments connect
the median values of the measured running times for input trees that have the same
number of leaves. Left: The running times of the old and the new algorithms that
compute the variance of the PD. For each algorithm, the running times for input trees
of the same number of leaves have very small difference in value, and hence they are
almost indistinguishable. Right: The running time of the algorithm that computes the
first k moments of the CAC for k = 3 and χ = 0.6.
Again, as can be seen in Figure 1, the new algorithm for the PD variance has
a much better performance than the old one. We see also that the algorithm that
computes the moments of the CAC runs very fast, processing almost a hundred
trees of a few thousand nodes in less than 1.5 seconds. In Figure 2 we see that
the SI is evidently larger than the DSSI for the randomly generated trees. Still, the value of the SI is not much larger than the size of the input trees; given that the
total number of nodes of a binary tree is roughly at most twice the number of
its leaves, the SI in this set of experiments is not larger than ten times the size
of the input. This possibly explains the very good performance of all the new
algorithms that we introduce in this paper. Also, as expected, in Figure 2 we can
see that the running time of the algorithm that calculates the moments of the CAC scales almost linearly as the value of k increases.
[Figure 2: two scatter plots; left panel y-axis labelled "Index value", right panel y-axis labelled "Time (s)".]
Fig. 2. Left: The values of the SI and of the square of the DSSI for the trees that we
generated using a pure birth process. For each number of leaves, we illustrate only the
median of these values. The rest of the values are quite close to this median, having at
most an absolute difference of roughly two thousand units. Right: The running time of
the algorithm that computes the first k moments of the CAC for a single tree of 4,000
leaves and for k ranging from one to twenty.
References
1. Bininda-Emonds, O.R.P., Cardillo, M., Jones, K.E., MacPhee, R.D.E., Beck,
R.M.D., Grenyer, R., Price, S.A., Vos, R.A., Gittleman, J.L., Purvis, A.: The
Delayed Rise of Present-Day Mammals. Nature 446, 507–512 (2007)
2. Blum, M.G.B., François, O.: On Statistical Tests of Phylogenetic Tree Imbalance:
The Sackin and Other Indices Revisited. Mathematical Biosciences 195, 141–153
(2005)
3. Cadotte, M., Albert, C.H., Walker, S.C.: The Ecology of Differences: Assessing
Community Assembly with Trait and Evolutionary Distances. Ecology Letters 16,
1234–1244 (2013)
4. Cooper, N., Rodriguez, J., Purvis, A.: A Common Tendency for Phylogenetic
Overdispersion in Mammalian Assemblages. Proceedings of the Royal Society
B 275, 2031–2037 (2008)
5. O’Dwyer, J.P., Kembel, S.W., Green, J.L.: Phylogenetic Diversity Theory
Sheds Light on the Structure of Microbial Communities. PLoS Computational
Biology 8(12), e1002832 (2012)
6. Faller, B., Pardi, F., Steel, M.: Distribution of Phylogenetic Diversity Under
Random Extinction. Journal of Theoretical Biology 251, 286–296 (2008)
7. Goloboff, P.A., Catalano, S.A., Mirandeb, J.M., Szumika, C.A., Ariasa, J.S.,
Kallersjoc, M., Farris, J.S.: Phylogenetic Analysis of 73 060 Taxa Corroborates
Major Eukaryotic Groups. Cladistics 25, 211–230 (2009)
8. Graham, C.H., Parra, J.L., Rahbek, C., McGuire, J.A.: Phylogenetic Structure
in Tropical Hummingbird Communities. Proceedings of the National Academy of
Sciences USA 106, 19673–19678 (2009)
9. Kembel, S.W., Hubbell, S.P.: The Phylogenetic Structure of a Neotropical Forest
Tree Community. Ecology 87, S86–S99 (2006)
10. Kissling, W.D., Eiserhardt, W.L., Baker, W.J., Borchsenius, F., Couvreur, T.L.P.,
Balslev, H., Svenning, J.-C.: Cenozoic Imprints on the Phylogenetic Structure of
Palm Species Assemblages Worldwide. Proceedings of the National Academy of
Sciences USA 109, 7379–7384 (2012)
11. Kraft, N.J.B., Cornwell, W.K., Webb, C.O., Ackerly, D.D.: Trait Evolution,
Community Assembly, and the Phylogenetic Structure of Ecological Communities.
The American Naturalist 170, 271–283 (2007)
12. Nipperess, D.A., Matsen IV., F.A.: The Mean and Variance of Phylogenetic
Diversity Under Rarefaction. Methods in Ecology and Evolution 4, 566–572 (2013)
13. Steel, M.: Tools to Construct and Study Big Trees: A Mathematical Perspective.
In: Hodkinson, T., Parnell, J., Waldren, S. (eds.) Reconstructing the Tree of Life:
Taxonomy and Systematics of Species Rich Taxa, pp. 97–112. CRC Press (2007)
14. Tsirogiannis, C., Sandel, B.: Computing the skewness of the phylogenetic mean
pairwise distance in linear time. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS,
vol. 8126, pp. 170–184. Springer, Heidelberg (2013)
15. Tsirogiannis, C., Sandel, B., Cheliotis, D.: Efficient computation of popular
phylogenetic tree measures. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS,
vol. 7534, pp. 30–43. Springer, Heidelberg (2012)
16. Vellend, M., Cornwell, W.K., Magnuson-Ford, K., Mooers, A.Ø.: Measuring
Phylogenetic Biodiversity. In: Magurran, A., McGill, B. (eds.) Biological Diversity:
Frontiers in Measurement and Assessment, Oxford University Press (2010)
17. Webb, C.O., Ackerly, D.D., McPeek, M.A., Donoghue, M.J.: Phylogenies and
Community Ecology. Annual review of ecology and systematics 33, 475–505 (2002)
The Divisible Load Balance Problem
and Its Application to Phylogenetic Inference
1 Introduction
Maximizing the efficiency of parallel codes by distributing the data in such a way
as to optimize load balance is one of the major objectives in high performance
computing.
Here, we address a specific case of job scheduling (data distribution) which,
to the best of our knowledge, has not been addressed before. We have a list of N
divisible jobs, each of which consists of si atomic tasks, where 1 ≤ i ≤ N , and B
processors (or bins). All jobs have an equal, constant startup latency α, and each
task, regardless of the job it appears in, requires a constant amount of time β to
be processed. Although these times are constant, they depend on the available
hardware architecture, and hence are not known a priori. Moreover, the jobs are
independent of one another. We also assume that processors are equally fast.
Therefore, any task takes time β to execute, independently of the processor it is
scheduled to run on. Any job can be partitioned (or decomposed) into disjoint
sets of its original tasks, which can then be distributed to different processors.
However, each such set incurs its own startup latency α on the processor on
which it is scheduled to run. Thus, a job of k tasks takes time k · β + α to execute
on any processor. The tasks (even of the same job) are independent of each
other, that is, they can be executed in any order, and the sole purpose of the job
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 204–216, 2014.
© Springer-Verlag Berlin Heidelberg 2014
The Divisible Load Balance Problem and Its Application 205
configuration is to group together the tasks that require the same initialization
step and hence minimize the overall startup latency.
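The cost model can be made concrete with a small sketch (ours, not the paper's; α and β are kept as abstract time units here, since the paper stresses they are not known a priori):

```python
def chunk_time(k, alpha, beta):
    """Time for one chunk of k tasks on one processor: the startup
    latency alpha is paid once per chunk, beta once per task."""
    return k * beta + alpha

def makespan(chunks_per_processor, alpha, beta):
    """Parallel finishing time: the most loaded processor dominates.
    Each processor holds a list of chunk sizes, and every chunk pays
    its own startup latency."""
    return max(sum(chunk_time(k, alpha, beta) for k in chunks)
               for chunks in chunks_per_processor)

# A job of 12 tasks with alpha = 2, beta = 1: unsplit, it takes
# 12 * 1 + 2 = 14 time units on one processor; split into three chunks
# of 4 tasks on three processors, the makespan drops to 4 * 1 + 2 = 6,
# at the price of paying the startup latency three times instead of once.
```

This is exactly the trade-off the job configuration must balance: splitting reduces the makespan, but multiplies the startup cost.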
Our work is motivated by parallel likelihood computations in phylogenetics
(see [4,9] for an overview). There, we are given a multiple sequence alignment
that is typically subdivided into distinct partitions (e.g., gene partitions; jobs
in our context). Given the alignment and a partition scheme, the likelihood on
a given candidate tree can be calculated. To this end, transition probabilities
for the statistical nucleotide substitution model need to be calculated (start-up
cost α in our context) for each partition separately because they are typically
considered to evolve under different models. Note that all alignment sites (the job size in our context) that belong to the same partition have identical model parameters.
The partitions are the divisible jobs to be distributed among processors. Each
partition has a fixed number of sites (columns from the alignment), which de-
note the size of the partition. The sites represent the independent tasks a job
(partition) consists of. Since alignment sites are assumed to evolve independently
in the likelihood model, the calculations on a single site can be performed inde-
pendently of all other sites. Thus, a single partition can easily be split among
multiple processors. Finally, note that parallel implementations of the phylogenetic likelihood function now form part of several widely used tools, and the results presented in this paper are applicable to all of them.
2 Problem Definition
Assume we have N divisible items of sizes s1 , s2 , . . . , sN , and B available bins.
Our task is to find an assignment of the N items to the B bins, by allowing an
item to be partitioned into several sub-items whose total size is the size of the
original item, in order to achieve the following two goals:
1. The sum of the sizes of the (possibly partitioned) items assigned to each bin is well balanced.
2. The maximum number of (possibly partitioned) items assigned to any bin is minimal.
In the rest of the text we will use the term solid for the items that are not
partitioned, and fractional for those that are partitioned.
We can now formally introduce two variations of the problem; one where we
only allow items of integer sizes, and one where the sizes can be represented
by real numbers. In the case of integers, the problem can be formulated as the
following integer program.
Problem 1 (LBN). Given a sequence of positive integers s_1, s_2, …, s_N and a positive integer B,

  minimize   max{ Σ_{j=1}^{N} x_{i,j} | i = 1, 2, …, B }
  subject to
    Σ_{i=1}^{B} q_{i,j} = s_j,          1 ≤ j ≤ N
    Σ_{j=1}^{N} q_{i,j} ≥ ⌊σ/B⌋,        1 ≤ i ≤ B
    Σ_{j=1}^{N} q_{i,j} ≤ ⌈σ/B⌉,        1 ≤ i ≤ B
    σ = Σ_{i=1}^{N} s_i
    0 ≤ q_{i,j} ≤ x_{i,j} · s_j,        1 ≤ i ≤ B, 1 ≤ j ≤ N
    q ∈ N_{≥0}^{B×N}
    x ∈ {0, 1}^{B×N}
¹ Available at http://www.exelixis-lab.org/web/software/examl/index.html.
Variable x_{i,j} is a Boolean value indicating whether bin i contains part of item j; if it does, q_{i,j} denotes the amount. By removing the imposed restriction of integer sizes, and hence allowing positive real values as the sizes of both solid and fractional items, we obtain the following mixed-integer program.
Problem 2 (LBR). Given a sequence of positive real values s_1, s_2, …, s_N and a positive integer value B,

  minimize   max{ Σ_{j=1}^{N} x_{i,j} | i = 1, 2, …, B }
  subject to
    Σ_{i=1}^{B} q_{i,j} = s_j,          1 ≤ j ≤ N
    Σ_{j=1}^{N} q_{i,j} = σ/B,          1 ≤ i ≤ B
    σ = Σ_{i=1}^{N} s_i
    0 ≤ q_{i,j} ≤ x_{i,j} · s_j,        1 ≤ i ≤ B, 1 ≤ j ≤ N
    q ∈ R^{B×N}
    x ∈ {0, 1}^{B×N}
If for some bin i and element j we get a solution with 0 < q_{i,j} < s_j, we say that element j is only assigned to bin i partially, or that only a fraction of element j is assigned to bin i. If q_{i,j} = s_j, we say that element j is fully assigned to bin i.
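The LBN constraints can be checked mechanically for a candidate assignment. The helper below is our own illustration, not part of the paper; q[i][j] denotes the amount of item j placed in bin i, and the return value is the objective — the maximum number of (sub-)items any bin touches:

```python
from math import floor, ceil

def check_lbn(q, s, B):
    """Assert the LBN constraints for assignment q (B rows, N columns)
    against item sizes s, then return the objective value."""
    N = len(s)
    sigma = sum(s)
    # every item must be fully distributed over the bins
    for j in range(N):
        assert sum(q[i][j] for i in range(B)) == s[j]
    # every bin load must be balanced up to rounding
    for i in range(B):
        assert floor(sigma / B) <= sum(q[i]) <= ceil(sigma / B)
    # objective: the largest number of (sub-)items in any bin
    return max(sum(1 for amount in row if amount > 0) for row in q)

# Items (2, 2, 3, 5, 9) in three bins, splitting the item of size 9 as 5 + 4:
q = [[2, 0, 0, 5, 0],   # bin 1: items 1 and 4
     [0, 2, 0, 0, 5],   # bin 2: item 2 and five units of item 5
     [0, 0, 3, 0, 4]]   # bin 3: item 3 and four units of item 5
# check_lbn(q, [2, 2, 3, 5, 9], 3) == 2
```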
3 NP-hardness
We now show that problems LBN and LBR are NP-hard via the well-known Partition problem [6]. We reduce Partition to another decision problem, called Equal Cardinality Partition (ECP), which asks whether a set can be broken into disjoint sets of equal cardinality and equal element sum (see Def. 2) and which can be solved by the two flavors of our problem.
Definition 1 (Partition). Is it possible to partition a set S of positive integers into two disjoint subsets Q and R, such that Q ∪̇ R = S and Σ_{q∈Q} q = Σ_{r∈R} r?

Definition 2 (ECP). Let p, k be two positive integers and S a set of p · k positive integers. Can we partition S into p disjoint sets S_1, S_2, …, S_p of k elements each, such that ∪̇_{i=1}^{p} S_i = S and Σ_{s∈S_i} s = Σ_{s∈S_j} s for all 1 ≤ i, j ≤ p?
Clearly, if we can solve our original optimization problems LBN and LBR for
any S exactly, we can also answer whether ECP returns true or false for the
same set S. Thus, if we can show that ECP is NP-Complete we know that the
original problems are NP-hard.
To show that ECP is NP-Complete, it is sufficient to show that ECP is in NP,
that is the set of polynomial time verifiable problems, and some NP-Complete
problem (here Partition) reduces to it.
208 K. Kobert et al.
= a · | Σ_{q∈(a·Q)} q/a − Σ_{r∈(a·R)} r/a | ≥ a,

since the absolute difference between the two sums is an integer of value at least 1. However, Σ_{s∈S} s < a. Thus, Σ_{q∈Q̂} q ≠ Σ_{r∈R̂} r, which contradicts the assumption of Q̂, R̂ being a solution for ECP(Ŝ, 2). Therefore, Partition reduces to ECP, which means that ECP is NP-Complete.
Corollary 1. The optimization problems LBN and LBR are NP-hard.
This follows directly from Lemma 1 and the fact that an answer for ECP can be
obtained by solving the optimization problem.
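On tiny instances, ECP can still be decided by exhaustive search, which is handy for sanity-checking heuristics. The sketch below is our own illustration; its running time is exponential, as expected for an NP-complete problem:

```python
from itertools import combinations

def ecp(S, p):
    """Brute-force decision procedure for ECP: can S be split into p
    blocks of equal cardinality and equal sum? Tiny instances only."""
    S = tuple(sorted(S))
    if len(S) % p or sum(S) % p:
        return False
    k, target = len(S) // p, sum(S) // p
    if p == 1:
        return sum(S) == target
    rest = S[1:]
    # fix the first element into the first block to avoid symmetric repeats
    for combo in combinations(range(len(rest)), k - 1):
        block_sum = S[0] + sum(rest[i] for i in combo)
        if block_sum == target:
            remaining = [rest[i] for i in range(len(rest)) if i not in combo]
            if ecp(remaining, p - 1):
                return True
    return False

# ecp([1, 2, 3, 4], 2) is True: {1, 4} and {2, 3} both sum to 5.
# ecp([1, 1, 1, 5], 2) is False: no pair sums to 4.
```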
4 Algorithm
As seen in Section 3, finding an optimal solution to this problem is hard. To
overcome this hurdle, we propose an approximation algorithm running in poly-
nomial time that guarantees a near-optimal solution. For an in-depth analysis
of the complexity of the algorithm, see Section 5.
The input for the algorithm is a list S of N integer weights and the number
of bins B these elements must be assigned to. The idea of the algorithm can be
explained by the following three steps:
Fig. 1 presents the pseudocode for the first two phases, while Fig. 2 illustrates
phase 3. The output of this algorithm is an assignment, list = (list[1], . . . , list[p]),
of –possibly fractional– elements to bins. Each entry in list is a set of triplets
that specify which portion of an integer sized element is assigned to a bin. Let
(j, i, k) ∈ list[l] be one such triplet for bin number l. We interpret this triplet as
follows: bin l is assigned the fraction of element j that starts at i and ends at k
(including i and k).
For the application in phylogenetics, each triplet specifies which portion (how
many sites) of a partition is assigned to which processor. Again, let (j, i, k) ∈
list[l] be one such triplet for some processor l. We interpret this triplet as follows:
processor l is assigned sites i through k of partition j.
If i ≠ 1 or k ≠ s_j (recall that s_j is the size of element j), we say that element j is partially assigned to bin l, that is, only a fraction of element j is assigned to that bin.
LoadBalance(N, B, S)
Phase 1 — Initialization
 1. Sort S in ascending order and let S = (s_1, s_2, …, s_N)
 2. σ ← Σ_{i=1}^{N} s_i
 3. c ← ⌈σ/B⌉
 4. r ← c · B − σ
 5. for i ← 1 to B do
 6.     size[i] ← 0; items[i] ← 0; list[i] ← ∅
 7. full_bins ← 0; b ← 1
Phase 2 — Initial filling
 8. for i ← 1 to N do
 9.     if size[b] + s_i ≤ c then
10.         size[b] ← size[b] + s_i
11.         items[b] ← items[b] + 1
12.         Enqueue(list[b], (i, 1, s_i))
13.         if size[b] = c then
14.             full_bins ← full_bins + 1
15.             if full_bins = B − r then c ← c − 1
16.     else
17.         add ← s_i
18.         break
19.     b ← (b mod B) + 1
Fig. 1. The algorithm accepts three arguments N, B and S, where N is the number of
items in list S, and B is the number of bins
Example 1. Consider the set {2, 2, 3, 5, 9} and three bins. During initialization
(phase 1) we have c = 7 and r = 0. Phase 2 makes the following assignments:
list[1] = {(1, 1, 2), (4, 1, 5)}, list[2] = {(2, 1, 2)}, list[3] = {(3, 1, 3)}. Adding the
next element of size 9 is not possible since size[2] + 9 = 2 + 9 = 11 > c.
Thus, phase 2 ends. Phase 3 splits the last element of size 9 among bins 2 and
3, and the solution is list[1] = {(1, 1, 2), (4, 1, 5)}, list[2] = {(2, 1, 2), (5, 1, 5)},
list[3] = {(3, 1, 3), (5, 6, 9)}, with max{|list[1]|, |list[2]|, |list[3]|} = 2. This is
also an optimal solution.
Example 2. Consider the set {1, 1, 2, 3, 3, 6} and two bins. During the initial-
ization (phase 1) we have c = 8 and r = 0. Phase 2 generates the following
assignments: list[1] = {(1, 1, 1), (3, 1, 2), (5, 1, 3)}, list[2] = {(2, 1, 1), (4, 1, 3)}.
The last element of size 6 cannot be fully assigned to bin 2, thus phase
2 terminates. Finally, phase 3 splits the last element of size 6 among the
two bins, and the solution is list[1] = {(1, 1, 1), (3, 1, 2), (5, 1, 3), (6, 1, 2)},
list[2] = {(2, 1, 1), (4, 1, 3), (6, 3, 6)}. We get max{|list[1]|, |list[2]|} = 4.
However, an optimal solution list1 = {(1, 1, 1), (2, 1, 1), (6, 1, 6)}, list2 =
{(3, 1, 2), (4, 1, 3), (5, 1, 3)} with max{|list1 |, |list2 |} = 3 exists.
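Putting the phases together, a compact Python sketch of LoadBalance follows. Phase 3 is reconstructed from the two examples above, since its pseudocode (Fig. 2) is not reproduced here; in particular, the tie-breaking among equally preferred bins is our assumption. The sketch reproduces the solution of Example 1 and attains the same score of 4 on Example 2:

```python
from math import ceil

def load_balance(S, B):
    """Approximation for LBN: returns B bins, each a list of triplets
    (j, i, k) meaning sites i..k (1-based, inclusive) of item j."""
    order = sorted(range(len(S)), key=lambda j: S[j])  # phase 1: sort ascending
    sigma = sum(S)
    c = ceil(sigma / B)          # capacity: B - r bins get c, r bins get c - 1
    r = c * B - sigma
    size = [0] * B
    bins = [[] for _ in range(B)]
    full_bins, b = 0, 0
    pending = []

    # Phase 2: place whole items cyclically while they fit.
    for idx, j in enumerate(order):
        if size[b] + S[j] <= c:
            size[b] += S[j]
            bins[b].append((j + 1, 1, S[j]))
            if size[b] == c:
                full_bins += 1
                if full_bins == B - r:
                    c -= 1       # remaining bins take one unit less
            b = (b + 1) % B
        else:
            pending = order[idx:]  # first item that does not fit ends phase 2
            break

    # Phase 3: split the remaining items across the non-full bins,
    # preferring bins that currently hold the fewest items.
    for j in pending:
        start, left = 1, S[j]
        while left > 0:
            bb = min((i for i in range(B) if size[i] < c),
                     key=lambda i: (len(bins[i]), i))
            take = min(left, c - size[bb])
            bins[bb].append((j + 1, start, start + take - 1))
            size[bb] += take
            start, left = start + take, left - take
            if size[bb] == c:
                full_bins += 1
                if full_bins == B - r:
                    c -= 1
    return bins
```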
5 Algorithm Analysis
We now show that the score obtained by algorithm LoadBalance, for any
given set of integers and any number of bins, is at most one above the optimal
solution. We then give the asymptotic time and space complexities.
items[j] ≤ items[i] + 1.
Lemma 2. Let OPT(S, B) be the score of the optimal solution for a set S distributed to B bins. Let list be the solution produced by algorithm LoadBalance for the same set S and B bins. Then max_{1 ≤ l ≤ B} |list[l]| ≤ OPT(S, B) + 1.
Proof. Let ĵ be the bin that terminates phase 2. That is, ĵ is the last bin con-
sidered for any assignment in phase 2. After phase 2, if there exists a bin j with
items[j] = items[ĵ]+1 we get, by Observation 1 and the pigeonhole principle, that
OPT(S, B) ≥ items[ĵ]+1. Otherwise, if no such bin exists, OPT(S, B) ≥ items[ĵ].
Let K be the number of unassigned elements at the beginning of phase 3. Let
J be the number of bins j with items[j] = items[ĵ]. We distinguish between
three cases. First assume that items[j] = items[ĵ] (after phase 2) for all bins
j and K > 0. Clearly, OPT(S, B) ≥ items[ĵ] + 1. By Observation 3 we know
that items[j] ≤ items[ĵ] + 2 (after phase 3). Thus the lemma holds for this case.
Now consider K > J and items[j] = items[ĵ] for some bin j, that is, there
are more unassigned elements than there are bins with only items[ĵ] elements
assigned to them. By the pigeonhole principle, OPT(S, B) ≥ items[ĵ] + 2. By
Observation 3 we get that items[j] ≤ items[ĵ] + 1 + 2 = items[ĵ] + 3 for all j.
Thus the lemma holds for this case as well. For the last case assume K ≤ J and
items[j] = items[ĵ] for some bin j. After a bin is assigned a fractional element
that does not fill it completely, it is immediately filled up with the next element.
Since preference is given to any bin j with items[j] = items[ĵ] and there are at
least as many such bins as remaining elements to be added (K ≤ J), we get that
items[j] ≤ items[ĵ] + 2. Since we have seen above that OPT(S, B) ≥ items[ĵ] + 1,
the lemma holds. As this covers all cases, the lemma is proven.
5.2 Run-Time
The runtime analysis is straightforward. Phase 1 of the algorithm consists
of initializing variables, sorting N items by size in ascending order and com-
puting their sum. Using an algorithm such as Merge-Sort, Phase 1 requires
O(N log(N )) time. Phase 2 requires O(N ) time to consider at most N items,
and assign them to B bins in a cyclic manner. Phase 3 appends at most 2 items
to a bin (see Observation 3), and hence has a time complexity of O(B). This
yields an overall asymptotic run-time complexity of O(N log(N) + B). Note that
if we are already given a sorted list of partitions, the algorithm runs in linear
time O(N + B). Finally, LoadBalance requires O(B) space due to the arrays
items, size and list, that are each of size B.
[Figure omitted: number of characters per partition (logarithmic axis, 10–5000) plotted against the number of partitions.]
6 Practical Application
As mentioned before, the scheduling problem arises for parallel phylogenetic like-
lihood calculations on large partitioned multi-gene or whole-genome datasets.
Such partitioned analyses are common practice at present. The
number of multiple sequence alignment partitions, the number of alignment sites
per partition, and the number of available processors are the input to our algorithm. The production-level maximum-likelihood phylogenetic inference software ExaML for supercomputers implements two different data distribution approaches. The first is the cyclic data distribution scheme, which does not balance the number of unique partitions per processor but simply assigns single sites to processors in a cyclic fashion.
scheme. Here, the individual partitions are not considered divisible and are as-
signed monolithically to processors using the longest processing time heuristic
for the ’classic’ multi-processor scheduling problem [10]. This ensures that the
total and maximum number of initialization steps (substitution matrix calcula-
tions) is minimized, at the cost of not being balanced with respect to the sites
per processor. Nonetheless, using this scheme instead of the cyclic distribution
already yielded substantial performance improvements. In order to evaluate the
new distribution scheme, we compare it to these two previous schemes, in terms
of total ExaML runtime. Note that our algorithm has also been implemented
in ExaBayes2, a tool for large-scale Bayesian phylogenetic inference.
6.1 Methods
[Figure: ExaML execution time in seconds (0–25,000) against the number of partitions (0–384), for the LoadBalance algorithm, the cyclic scheme, and the whole-partition scheme; two panels.]
6.2 Results
With the new scheme, ExaML runs up to 3.6× faster than with the whole-partition distribution scheme for 24 processes; for 48 processes, the runtime improves by a factor of up to 3.9×. For large numbers of partitions, the runtime of the whole-partition distribution scheme converges to the runtime of LoadBalance. This is expected: increasing the number of partitions breaks the alignment into smaller chunks, so the chance that any heuristic attains a near-optimal load/data distribution increases. However, if the same run is executed with more processes (i.e., 48 instead of 24), this break-even point shifts towards a higher number of partitions, as shown in Fig. 4.
The results show that cyclic data distribution performs acceptably for many processes and few partitions, whereas monolithic whole-partition data distribution is on par with our new heuristic for analyses with few processes and many partitions. Both figures show that there exists a region where neither of the previous strategies achieves acceptable performance compared to LoadBalance, and that this performance gap widens as parallelism increases.
Finally, employing LoadBalance, ExaML executes roughly twice as fast with 48 processes as with 24 processes, exhibiting a scaling factor of about 2.07 in all cases. For comparison, under the cyclic data distribution, scaling factors ranged from 1.24 to 1.75, and under whole-partition distribution from 1.00 (i.e., no parallel runtime improvement) to 2.04. The slight superlinear speedups are due to increased cache efficiency.
7 Conclusion
References
1. Bharadwaj, V., Ghose, D., Robertazzi, T.: Divisible load theory: A new paradigm
for load scheduling in distributed systems. Cluster Computing 6(1), 7–17 (2003)
2. Błażewicz, J., Drozdowski, M.: Distributed processing of divisible jobs with communication startup costs. Discrete Appl. Math. 76(1-3), 21–41 (1997)
3. Cook, S.A.: The complexity of theorem-proving procedures. In: STOC 1971
Proceedings of the Third Annual ACM Symposium on Theory of Computing,
pp. 151–158 (1971)
4. Felsenstein, J.: Inferring phylogenies. Sinauer Associates (2003)
1 Introduction
Studying metabolites and other small biomolecules with mass below 1000 Da is
relevant, for example, in drug design and the search for new signaling molecules
and biomarkers [14]. Since such molecules cannot be predicted from the genome
sequence, high-throughput de novo identification of metabolites is highly sought.
Mass spectrometry (MS) in combination with a fragmentation technique is com-
monly used for this task. In liquid chromatography MS, a selected molecule
can be fragmented in a second step typically using collision-induced dissociation
(CID). The resulting fragment ions are recorded in tandem mass spectra (MS2
spectra). For metabolites, the understanding of CID fragmentation is still in its
infancy.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 217–231, 2014.
© Springer-Verlag Berlin Heidelberg 2014
218 K. Scheubert, F. Hufsky, and S. Böcker
Multiple-stage MS (MSn) allows one to select the product ions of the initial fragmentation step (manually or automatically) and to subject them to another fragmentation reaction. This reveals additional information about the dependencies
between the fragments. The resulting fragment ions can, in turn, again be se-
lected as precursor ions for further fragmentation. Typically, with each additional
fragmentation reaction, the quality of mass spectra is reduced and measuring
time increases. Thus, analysis is usually limited to a few fragmentation reac-
tions beyond MS2 .
CID mass spectra (both MS2 and MSn ) are limited in their reproducibility
on different instruments, making spectral library search a non-trivial task [16].
Furthermore, spectral libraries are vastly incomplete. Recent approaches tend to
replace searching in spectral libraries by searching in the more comprehensive
molecular structure databases [1, 9–11, 26, 31]. However, many metabolites remain uncharacterized even with respect to their structure and function [17].
For the de novo interpretation of tandem mass spectra of small molecules,
Böcker and Rasche [5] introduced fragmentation trees to identify the molecu-
lar formula of an unknown and its fragments. Moreover, fragmentation trees
are reasonable descriptions of the fragmentation process and hence can also be
used to derive further information about the unknown molecule [19]. Scheubert
et al. [23,24] adjusted the fragmentation tree concept to MSn data to reflect the
succession of fragmentation reactions.
Adjusting the fragmentation tree concept to MSn data results in the NP-hard
Colorful Subtree Closure problem [24] which has to be solved in conjunc-
tion with the original NP-hard Maximum Colorful Subtree problem [5],
resulting in the Combined Colorful Subtree problem [24]. To solve this
problem, Scheubert et al. [24] presented a fixed-parameter algorithm based on
dynamic programming (DP) with worst-case running time depending exponen-
tially on the number of peaks in the spectrum.
To compare two molecules based on their fragmentation spectra, Rasche
et al. [18] introduced fragmentation tree alignments. By this, similar fragmen-
tation cascades in the two trees are identified and scored. This allows us to use
fragmentation trees in applications such as database searching, assuming that
structural similarity is inherently coded in the CID spectra fragments. Improving
the quality of the fragmentation trees using the additional information provided
by MSn may improve this downstream analysis.
Here, we present a novel exact algorithm for solving the Combined Color-
ful Subtree problem. This Integer Linear Program (ILP) is faster than the
DP algorithm. Further, we demonstrate the impact of the additional informa-
tion from MSn data for the downstream analysis: We compute fragmentation
tree alignments [18] and find that correlation between the similarity score of
two fragmentation trees and the structural similarity score of the correspond-
ing molecules increases when using the additional information gained from the
succession of fragments in multiple MS.
MSn Fragmentation Trees Revisited 219
Fig. 1. (1) As input we use MSn spectra that contain additional information on the
succession of fragments. (2) For each peak, we compute all fragment molecular for-
mulas with mass sufficiently close to the peak mass. (3) A fragmentation graph is
constructed with vertices for all fragment molecular formulas and edges (grey) for all
possible fragmentation steps. Explanations of the same peak receive the same color.
The transitive closure of the graph is scored based on pairs of colors. To simplify the
drawing, we only show nonzero edges of the transitive closure (black). (4) The colorful
subtree with maximum combined weight of the edges and the transitive closure is the
best explanation of the observed fragments.
energy but not at low collision energy (σ3 ≤ 0). For a more detailed description
of the parameters see [24].
Now, each subtree of the fragmentation graph corresponds to a possible fragmentation tree. By restricting ourselves to trees, every fragment is explained by a unique fragmentation pathway. To avoid the case that one peak is explained by more than
one molecular formula, we limit our search to colorful trees, where each color is
used at most once. In practice, it is very rare that a peak is indeed created by two
different fragments. Searching for a colorful subtree of maximum sum of edge
weights is known as the Maximum Colorful Subtree problem, which is NP-
hard [5, 8]. Searching for a colorful subtree of maximum weight of the transitive
closure is known as the Colorful Subtree Closure problem, which is again
NP-hard (even for unit weights) [24]. In addition, both problems are hard to approximate [6, 24, 27]. The problem we are interested in combines the two problems above, that is, searching for a colorful subtree of maximum combined weight of the edges and the transitive closure, which is the best explanation of the observed fragments [24]:
Combined Colorful Subtree Problem. Given a vertex-colored DAG G = (V, E) with colors C, edge weights w : E → R, and transitive weights w⁺ : C² → R, find the induced colorful subtree T of G of maximum weight w*(T) = w(T) + w⁺(T).
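The problem statement can be made concrete with an exhaustive search over parent choices in a tiny vertex-colored DAG. This is our own illustration (exponential time, so no substitute for the DP or ILP discussed below), not an algorithm from the paper:

```python
from itertools import product

def best_colorful_subtree(color, edges, w, wplus, root):
    """Exhaustively solve the Combined Colorful Subtree problem on a
    tiny DAG. color: vertex -> color; edges: (u, v) pairs; w: edge ->
    weight; wplus: (color, color) -> transitive weight (missing pairs
    count as 0). Returns (best score, parent map of the best tree)."""
    vertices = sorted(color)
    others = [v for v in vertices if v != root]
    in_edges = {v: [u for (u, x) in edges if x == v] for v in others}
    best_score, best_parent = 0, {v: None for v in vertices}

    def reaches_root(v, par):
        while v != root:             # DAG edges: parent chains cannot cycle
            if par[v] is None:
                return False
            v = par[v]
        return True

    # choose, for every non-root vertex, an incoming tree edge or none
    for choice in product(*([None] + in_edges[v] for v in others)):
        par = dict(zip(others, choice))
        par[root] = None
        tree = [v for v in vertices if reaches_root(v, par)]
        if len({color[v] for v in tree}) < len(tree):
            continue                 # not colorful: a color used twice
        score = sum(w[(par[v], v)] for v in tree if v != root)
        for v in tree:               # transitive closure, scored on colors
            u = par[v]
            while u is not None:
                score += wplus.get((color[u], color[v]), 0)
                u = par[u]
        if score > best_score:
            best_score, best_parent = score, par
    return best_score, best_parent
```

For example, with colors {r: 0, a: 1, b: 1, c: 2}, edge weights {(r, a): 5, (r, b): 4, (a, c): 3, (b, c): 6}, and transitive weight w⁺(1, 2) = −1, the tree r → b → c wins with score 4 + 6 − 1 = 9; the colorful constraint forbids using a and b together.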
and an Integer Linear Program (ILP) [20] (see below) – both computing an exact
solution. For multiple MS data, Scheubert et al. [24] presented an exact DP algo-
rithm for the Combined Colorful Subtree problem, which is parameterized
by the number of colors k in the graph. Here, we present an ILP for solving
the Combined Colorful Subtree problem. ILPs are a classical approach for
finding exact solutions of computationally hard problems.
s.t.  Σ_{u : uv∈E} x_{uv} ≤ 1               for all v ∈ V \ {r},          (3)
      x_{vw} ≤ Σ_{u : uv∈E} x_{uv}          for all vw ∈ E with v ≠ r,    (4)
      Σ_{uv∈E : v∈V(c)} x_{uv} ≤ 1          for all c ∈ C,                (5)
Constraints (3) ensure that the feasible solution is a tree, whereas constraints
(5) make sure that there is at most one vertex of each color present in the so-
lution. Finally, constraints (4) require the solution to be connected. Note that
in general graphs, we would have to ensure, for every cut of the graph, that the solution is connected to some parent vertex; this would require an exponential number of constraints [15]. But since our graph is directed and acyclic, a linear number of constraints suffices. White et al. [30] pointed out that constraints (3) are redundant due to constraints (5). However, in the following we will refer to the
original ILP from [20].
capture the transitivity of the closure. To this end, we will introduce additional
variables that capture the edges of the transitive closure of the tree. Unfortu-
nately, this simple approach is only working for negative weights for all edges of
the transitive closure and cannot be generalized to arbitrary transitivity scores.
Let G+ = (V, E + ) be the transitive closure of the input graph G. We assume
that w+ (c(u), c(v)) ≤ 0 holds for all edges uv of the transitive closure. Let us
define binary variables xuv for each edge uv ∈ E, and zuv for each edge uv ∈ E + .
We assume xuv = 1 if and only if uv is part of the subtree; and zuv = 1 if uv is
part of the closure of the subtree. We can formulate the following ILP:
max   Σ_{uv∈E} w(u, v) · x_{uv}  +  Σ_{uv∈E⁺} w⁺(c(u), c(v)) · z_{uv}        (7)
As w+ (c(u), c(v)) ≤ 0 for all uv ∈ E + we may assume that zuv = 0 holds unless
required otherwise by (8) or (9). Constraint (8) requires that all edges of the
subtree are also edges of the closure; constraint (9) results in the transitivity of
the closure.
Unfortunately, the above ILP cannot be generalized to arbitrary transitivity scores, as demonstrated by the observation that setting z_{uv} = 1 for all uv ∈ E⁺ satisfies both constraints (8) and (9), independently of the actual assignment of the variables x_{uv}.
We may assume U = C, but for the sake of clarity we will use U whenever we
refer to the vertices of the color graph H.
Let us define binary variables xuv for each edge uv ∈ E, and zab and yab for
each edge ab ∈ F . We assume xuv = 1 if and only if uv is part of the subtree,
and yab = 1 if there exist u ∈ V (a) and v ∈ V (b) such that uv is part of the
subtree, that is, xuv = 1. Variables yab are merely helper variables that map the
subtree to the color space. Finally, we assume zab = 1 if ab is part of the closure
of the subtree in color space. The following ILP captures the maximum colorful
Constraints (13) and (14) ensure that there is an edge in the color version
of the tree if and only if there is an edge between vertices of the corresponding
colors. Constraints (15) guarantee that for each edge that is part of the solution,
also its transitive edge is part of the solution. Constraints (16) and (17) ensure
the transitivity of the transitive closure of the solution: For a given edge ybc in
the color version of the tree and an arbitrary color a, a is either an ancestor of b
(and thus also of c), or not. The first case implies that there must be transitive
edges from a to b as well as from a to c. In the second case, transitive edges
from a to b as well as from a to c are prohibited. Constraints (18) guarantee
that only the transitive closure of the solution tree is part of the solution, and
not the transitive closure of other subgraphs.
Rasche et al. [18] presented the comparison of fragmentation trees using fragmen-
tation tree alignments. One important application of this approach is searching
in a database for molecules that are similar to the measured unknown molecule.
Two structurally similar molecules have similar fragmentation trees and vice
versa [18]. Hence, the similarity of high quality fragmentation trees correlates
with the structural similarity of the corresponding molecules. We will use the
correlation coefficient to optimize the parameters of the transitivity score and
to evaluate the benefit of MSn data compared to MS2 data.
Fragmentation tree similarity is defined via edges, representing fragmentation
steps, and vertices, representing fragments. A local fragmentation tree alignment
contains those parts of the two trees where similar fragmentation cascades oc-
curred [18]. To compute fragmentation tree alignments we use the sparse DP
introduced by Hufsky et al. [12] which is very fast in practice.
For the comparison of molecular structures, many different similarity scores
have been developed [13]. Molecular structures can be represented as binary fingerprints. Here, we use two such fingerprint representations: the fingerprints from the PubChem database [29], accessed via the Chemistry Development Toolkit version 1.3.37 [28]¹, and the Molecular ACCess System (MACCS) fingerprints implemented in OpenBabel². We use Tanimoto similarity scores (Jaccard indices) [21] to compare these binary vectors.
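The Tanimoto score on binary fingerprints is simply the Jaccard index of their "on" bits. A minimal sketch, representing each fingerprint as the set of its set-bit positions:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two binary fingerprints given
    as sets of 'on' bit positions: |intersection| / |union|."""
    union = fp_a | fp_b
    if not union:
        return 1.0   # convention: two all-zero fingerprints match
    return len(fp_a & fp_b) / len(union)

# tanimoto({1, 2, 3}, {2, 3, 4}) == 0.5  (2 shared bits, 4 bits total)
```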
To assess the correlation between fragmentation tree similarity and structural
similarity, we use the well-known Pearson correlation coefficient r, which measures the linear dependence of two variables, as well as Spearman's rank correlation coefficient ρ, which is the Pearson correlation coefficient between the ranked variables. The coefficient of determination, r², measures how well a model explains and predicts future outcomes. Fragmentation tree alignment scores and structural similarity scores are two measures where one would not expect a linear dependence. This being said, we argue that any Pearson correlation coefficient r > 0.5 (r² > 0.25) can be regarded as a strong correlation.
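Both coefficients are easy to compute without external libraries. The sketch below assumes untied values for the Spearman ranks (real data would need average ranks for ties):

```python
def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed
    data (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        rank = [0] * len(v)
        for pos, i in enumerate(order):
            rank[i] = pos + 1
        return rank
    return pearson(ranks(x), ranks(y))
```

For x = (1, 2, 3, 4) and y = (1, 4, 9, 16), Spearman's ρ is exactly 1 (the relationship is monotone), while Pearson's r is below 1 (the relationship is not linear) — the distinction motivating the use of both coefficients.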
5 Results
To evaluate our work, we analyze spectra from a dataset introduced in [24]. It
contains 185 mass spectra of 45 molecules, mainly representing plant secondary
metabolites. All spectra were measured on a Thermo Scientific Orbitrap XL
instrument using direct infusion. For more details of the dataset see [24].
For the construction of the fragmentation graph, we use a relative mass er-
ror of 20 ppm and the standard alphabet – that is carbon, hydrogen, nitrogen,
oxygen, phosphorus, and sulfur – to compute the fragment molecular formulas.
For weighting the fragmentation graph, we use the scoring parameters from [19].
For scoring the transitive closure, we evaluate the influence of parameters σ1 , σ2
and σ3 on the quality of fragmentation trees. We assume the molecular formula
of the unfragmented molecule to be given (for details, see [18, 19, 24]).
For the computation of fragmentation trees from tandem MS data, we use the
DP algorithm from [5] (called DP-MS2 in the following) and the ILP from [20]
(ILP-MS2). Recall that we can convert MSn data to “pseudo MS2 ” data by
mapping all peaks into a single spectrum and ignoring the additional informa-
tion gained from the succession of fragments in MSn . For the computation of
fragmentation trees from multiple MS data, we use the DP algorithm from [24]
(DP-MSn) as well as our novel ILP (ILP-MSn). Both DP algorithms are restricted by memory and time consumption. Thus, exact calculations are limited
to the k most intense peaks. The remaining peaks are added in descending inten-
sity order by a greedy heuristic (see the tree completion heuristic from [20, 24]).
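The greedy completion step can be pictured as follows. This is a hedged sketch under assumed data structures (a parent map and an `edge_score` callback), not the actual implementation from [20, 24]:

```python
from collections import namedtuple

Peak = namedtuple("Peak", ["name", "intensity"])

def complete_tree(tree, remaining_peaks, edge_score):
    """Greedily attach each remaining peak, in descending intensity
    order, to the vertex already in the tree whose edge to the peak
    scores best. `tree` maps each vertex to its parent (root -> None);
    `edge_score(parent, child)` returns a score, or None if the edge
    is not allowed."""
    for peak in sorted(remaining_peaks, key=lambda p: p.intensity, reverse=True):
        best = None
        for v in list(tree):
            s = edge_score(v, peak)
            if s is not None and (best is None or s > best[0]):
                best = (s, v)
        if best is not None:  # peaks without any allowed edge stay unattached
            tree[peak] = best[1]
    return tree
```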
¹ https://sourceforge.net/projects/cdk/
² http://openbabel.sourceforge.net/
MSn Fragmentation Trees Revisited 225
For solving the ILPs we use Gurobi 5.6. The experiments were run on a cluster
with four nodes each containing 2x Intel XEON 6 Cores E5645 at 2.40 GHz with
48 GB RAM. Each instance is started on a single core.
For computing fragmentation tree alignments, we use the sparse DP from [12]
and the scoring from [18]. Estimation of Pearson and Spearman correlation co-
efficients was done using the programming language R.
Fig. 2. Running times for calculating fragmentation trees. Times are averaged over 10 repeated evaluations and given in seconds. Note the logarithmic y-axis. Left: Average running times for calculating one fragmentation tree with an exact solution for the k most intense peaks. The remaining peaks are attached by the tree completion heuristic.
Right: Total running times for instances of size k = 35. Again, the remaining peaks are attached heuristically. We calculate the total running time of the x instances for which the tree was computed fastest; for each algorithm, instances were sorted separately.
with the original scoring parameters from [24]. Although they were chosen ad
hoc, they seem to work very well in practice. We further find that σ1 has a larger effect on the correlation than σ2 and σ3 (see Fig. 3). This was expected, as the requirement that a fragment is placed below its parent fragment is very strong.
Further, we evaluate the effect of using more peaks for the exact fragmentation
tree computation on the correlation. We set σ1 = 3, σ2 = −0.5 and σ3 = −0.5,
and vary the number of peaks within 10 ≤ k ≤ 35. We find that the highest
PubChem/Tanimoto correlation coefficient r = 0.5643137 (r2 = 0.31844500) is
achieved for k = 21 (see Fig. 4).
Note that DP-MSn is not able to solve problems of size k = 21 with acceptable running time and memory consumption. Thus, only with the help of ILP-MSn is it possible to compute trees of the best quality.
The optimum of k remains relatively stable in a leave-one-out validation
experiment: For each compound, we delete the corresponding fragmentation tree
from the dataset and repeat the former analysis to determine the best k. For 30
of the 42 sub-datasets k = 21 achieves the best correlation. For the remaining
11 sub-datasets k = 14, k = 20 or k = 25 are optimal.
Due to the small size of the dataset, it is hard to determine the best parameters without overfitting. Hence, these analyses should not be seen as perfect parameter estimation, but rather as a rough estimate until a bigger dataset becomes available.
Comparison between Trees from MS2 , Pseudo-MS2 and MSn Data. To evaluate
the benefit of scoring the additional information from MSn data, we compare the
correlation coefficients of using only the MS2 spectra, using Pseudo-MS2 data,
and using MSn data. As mentioned above, Pseudo-MS2 data means mapping all
peaks into a single spectrum and ignoring the additional information gained from
the succession of fragments in MSn , that is not scoring the transitive closure. For
fragmentation tree computation from MS2 and Pseudo-MS2 data we use the ILP-
MS2 , for MSn data we use the ILP-MSn . For a fair evaluation, we again vary the
number of peaks from 10 ≤ k ≤ 35 to choose the k with the highest correlation
coefficient. The highest Pearson correlation coefficient with PubChem/Tanimoto
fingerprints for MS2 data is r = 0.3860874 (r2 = 0.1490635) with k = 21 and
for Pseudo-MS2 data r = 0.5477199 (r2 = 0.2999970) with k = 25 (see Fig. 4).
Further, we compare the Pearson correlation coefficients between the three
datasets MS2 , Pseudo-MS2 and MSn (see Table 1). We find that the benefit of
MSn data is huge in comparison to using only MS2 data, which is expected since
the MS2 spectra contain too few peaks. The more intriguing question is
whether scoring the transitive closure improves correlation results. Comparing
Pseudo-MS2 with MSn data, we get an increase in the coefficient of determination
r2 by up to 6.7 % for PubChem fingerprints and 6.3 % for MACCS fingerprints.
The results for Spearman correlation coefficients look similar. When restricting
the evaluation to large trees (at least three edges, five edges, seven edges), we
cannot observe an increase in correlation.
When fragmentation trees are used in database search, the relevant accuracy measure is not Pearson correlation but identification accuracy. The dataset used in this paper is small, and there is only one measurement per compound. Thus, we cannot evaluate the identification accuracy. Instead, we analyze the Tanimoto
scores T (h) of the first h hits with h ranging from one to the number of com-
pounds (see Fig. 5). We exclude the identical compound from the hitlist and then
Fig. 4. Correlation and regression line for the complete datasets. Fragmentation tree similarity (x-axis) plotted against structural similarity measured by PubChem/Tanimoto score (y-axis). (a) Fragmentation trees for MS2 data (k = 21). Pearson correlation is r = 0.386, Spearman correlation is ρ = 0.364. (b) Fragmentation trees for Pseudo-MS2 data (k = 25). Pearson correlation is r = 0.548, Spearman correlation is ρ = 0.615. (c) Fragmentation trees for MSn data (k = 21). Pearson correlation is r = 0.564, Spearman correlation is ρ = 0.624.
average over the hitlists of all compounds in the dataset. We compare the results
from MS2 , Pseudo-MS2 and MSn data with pseudo hitlists containing randomly
ordered compounds (minimum value, RANDOM) and compounds arranged in
descending order in accordance with the Tanimoto scores (upper limit, BEST).
There is a significant increase of average Tanimoto scores from MS2 data to MSn
data, and a slight increase from Pseudo-MS2 data to MSn data especially for
the first h = 5 compounds.
Fig. 5. Average Tanimoto scores T(h) between query structures and the first h structures from hitlists obtained by FT alignments (MS2, Pseudo-MS2, MSn data), pseudo hitlists containing the structures with maximum Tanimoto score to the query structure (BEST), and randomly selected pseudo hitlists (RANDOM).
6 Conclusion
In this work, we have presented an Integer Linear Program for the Combined Colorful Subtree problem that outperforms the Dynamic Programming algorithm presented before [24]. Solving this problem is relevant
for calculating fragmentation trees from multistage mass spectrometry data.
Quality of fragmentation trees is measured by correlation of tree alignment
scores with structural similarity scores of the corresponding compounds. Ex-
periments on a dataset with 45 compounds revealed that trees computed with
transitivity scores σ1 = 3, σ2 = −0.5 and σ3 = −0.5 achieve the best quality. The
highest correlation of r = 0.564 was achieved when computing exact fragmenta-
tion trees for the k = 21 most intense peaks and attaching the remaining peaks
heuristically. Using the additional information provided by multiple MS data,
the coefficient of determination r2 increases by up to 6.7 % compared to trees
computed without transitivity scores. Thus, we could show for the first time that
additional information from MSn data can improve the quality of fragmentation
trees.
For the computation of those trees with highest quality (k = 21), our ILP
needs 1.3 s on average. In contrast, the original DP is not able to solve those
instances with acceptable running time and memory consumption. The ILP for
MSn is, however, slower than the ILP for MS2 that has been presented before [20].
This is due to the number of constraints, which increases by an order of magnitude from MS2 to MSn. White et al. [30] suggested rules to speed up computations
for the ILP on MS2 data. These rules may also improve the running time of our
algorithm.
us with the test data. We thank Kai Dührkop for helpful discussions on the ILP.
F. Hufsky was funded by Deutsche Forschungsgemeinschaft, project “IDUN”.
References
1. Allen, F., Wilson, M., Pon, A., Greiner, R., Wishart, D.: CFM-ID: a web server for
annotation, spectrum prediction and metabolite identification from tandem mass
spectra. Nucleic Acids Res. (2014)
2. Böcker, S., Briesemeister, S., Klau, G.W.: On optimal comparability editing
with applications to molecular diagnostics. BMC Bioinformatics 10(suppl. 1), S61
(2009); Proc. of Asia-Pacific Bioinformatics Conference (APBC 2009)
3. Böcker, S., Lipták, Z.: Efficient mass decomposition. In: Proc. of ACM Symposium
on Applied Computing (ACM SAC 2005), pp. 151–157. ACM Press, New York
(2005)
4. Böcker, S., Lipták, Z.: A fast and simple algorithm for the Money Changing Prob-
lem. Algorithmica 48(4), 413–432 (2007)
5. Böcker, S., Rasche, F.: Towards de novo identification of metabolites by analyz-
ing tandem mass spectra. Bioinformatics 24, I49–I55 (2008); Proc. of European
Conference on Computational Biology (ECCB 2008)
6. Dondi, R., Fertin, G., Vialette, S.: Complexity issues in vertex-colored graph pat-
tern matching. J. Discrete Algorithms 9(1), 82–99 (2011)
7. Dreyfus, S.E., Wagner, R.A.: The Steiner problem in graphs. Networks 1(3), 195–
207 (1972)
8. Fellows, M.R., Gramm, J., Niedermeier, R.: On the parameterized intractability of
motif search problems. Combinatorica 26(2), 141–167 (2006)
9. Gerlich, M., Neumann, S.: MetFusion: integration of compound identification
strategies. J. Mass Spectrom 48(3), 291–298 (2013)
10. Heinonen, M., Shen, H., Zamboni, N., Rousu, J.: Metabolite identification and
molecular fingerprint prediction via machine learning. Bioinformatics 28(18), 2333–
2341 (2012); Proc. of European Conference on Computational Biology (ECCB
2012)
11. Hill, D.W., Kertesz, T.M., Fontaine, D., Friedman, R., Grant, D.F.: Mass spectral
metabonomics beyond elemental formula: Chemical database querying by match-
ing experimental with computational fragmentation spectra. Anal. Chem. 80(14),
5574–5582 (2008)
12. Hufsky, F., Dührkop, K., Rasche, F., Chimani, M., Böcker, S.: Fast alignment
of fragmentation trees. Bioinformatics 28, i265–i273 (2012); Proc. of Intelligent
Systems for Molecular Biology (ISMB 2012)
13. Leach, A.R., Gillet, V.J.: An Introduction to Chemoinformatics. Springer, Berlin
(2005)
14. Li, J.W.-H., Vederas, J.C.: Drug discovery and natural products: End of an era or
an endless frontier? Science 325(5937), 161–165 (2009)
15. Ljubić, I., Weiskircher, R., Pferschy, U., Klau, G.W., Mutzel, P., Fischetti, M.:
Solving the prize-collecting Steiner tree problem to optimality. In: Proc. of Algo-
rithm Engineering and Experiments (ALENEX 2005), pp. 68–76. SIAM (2005)
16. Oberacher, H., Pavlic, M., Libiseller, K., Schubert, B., Sulyok, M., Schuhmacher,
R., Csaszar, E., Köfeler, H.C.: On the inter-instrument and inter-laboratory trans-
ferability of a tandem mass spectral reference library: 1. Results of an Austrian
multicenter study. J. Mass Spectrom. 44(4), 485–493 (2009)
17. Patti, G.J., Yanes, O., Siuzdak, G.: Metabolomics: The apogee of the omics trilogy.
Nat. Rev. Mol. Cell Biol. 13(4), 263–269 (2012)
18. Rasche, F., Scheubert, K., Hufsky, F., Zichner, T., Kai, M., Svatoš, A., Böcker,
S.: Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84(7),
3417–3426 (2012)
19. Rasche, F., Svatoš, A., Maddula, R.K., Böttcher, C., Böcker, S.: Computing
fragmentation trees from tandem mass spectrometry data. Anal. Chem. 83(4),
1243–1251 (2011)
20. Rauf, I., Rasche, F., Nicolas, F., Böcker, S.: Finding maximum colorful subtrees
in practice. In: Chor, B. (ed.) RECOMB 2012. LNCS, vol. 7262, pp. 213–223.
Springer, Heidelberg (2012)
21. Rogers, D.J., Tanimoto, T.T.: A computer program for classifying plants. Sci-
ence 132(3434), 1115–1118 (1960)
22. Rojas-Chertó, M., Kasper, P.T., Willighagen, E.L., Vreeken, R.J., Hankemeier, T.,
Reijmers, T.H.: Elemental composition determination based on MSn . Bioinformat-
ics 27, 2376–2383 (2011)
23. Scheubert, K., Hufsky, F., Rasche, F., Böcker, S.: Computing fragmentation trees
from metabolite multiple mass spectrometry data. In: Bafna, V., Sahinalp, S.C.
(eds.) RECOMB 2011. LNCS, vol. 6577, pp. 377–391. Springer, Heidelberg (2011)
24. Scheubert, K., Hufsky, F., Rasche, F., Böcker, S.: Computing fragmentation
trees from metabolite multiple mass spectrometry data. J. Comput. Biol. 18(11),
1383–1397 (2011)
25. Sheldon, M.T., Mistrik, R., Croley, T.R.: Determination of ion structures in struc-
turally related compounds using precursor ion fingerprinting. J. Am. Soc. Mass.
Spectrom 20(3), 370–376 (2009)
26. Shen, H., Dührkop, K., Böcker, S., Rousu, J.: Metabolite identification through
multiple kernel learning on fragmentation trees. Bioinformatics (2014, accepted); Proc. of Intelligent Systems for Molecular Biology (ISMB 2014)
27. Sikora, F.: Aspects algorithmiques de la comparaison d’éléments biologiques. PhD
thesis, Université Paris-Est (2011)
28. Steinbeck, C., Hoppe, C., Kuhn, S., Floris, M., Guha, R., Willighagen, E.L.: Recent
developments of the Chemistry Development Kit (CDK) - an open-source Java
library for chemo- and bioinformatics. Curr. Pharm. Des. 12(17), 2111–2120 (2006)
29. Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J., Bryant, S.H.: PubChem: A
public information system for analyzing bioactivities of small molecules. Nucleic
Acids Res. 37(Web Server issue), W623–W633 (2009)
30. White, W.T.J., Beyer, S., Dührkop, K., Chimani, M., Böcker, S.: Speedy colorful
subtrees. Submitted to European Conference on Computational Biology, ECCB
2014 (2014)
31. Wolf, S., Schmidt, S., Müller-Hannemann, M., Neumann, S.: In silico fragmentation
for computer assisted identification of metabolite mass spectra. BMC Bioinformat-
ics 11, 148 (2010)
An Online Peak Extraction Algorithm
for Ion Mobility Spectrometry Data
1 Introduction
Ion mobility (IM) spectrometry (IMS), coupled with multi-capillary columns
(MCCs), MCC/IMS for short, has been gaining importance for biotechnologi-
cal and medical applications. With MCC/IMS, one can measure the presence
and concentration of volatile organic compounds (VOCs) in the air or exhaled
breath with high sensitivity; and in contrast to other technologies, such as mass
spectrometry coupled with gas chromatography (GC/MS), MCC/IMS works at
ambient pressure and temperature. Several diseases like chronic obstructive pul-
monary disease (COPD) [1], sarcoidosis [4] or lung cancer [20] can potentially
be diagnosed early. IMS is also used for the detection of drugs [11] and explosives [9]. Constant monitoring of VOC levels is of interest in biotechnology, e.g., for monitoring fermenters in which yeast produce desired compounds [12], and in medicine, e.g., for monitoring propofol levels in the exhaled breath of patients during surgery [15].
IMS technology is moving towards miniaturization and small mobile devices.
This creates new challenges for data analysis: The analysis should be possible
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 232–246, 2014.
c Springer-Verlag Berlin Heidelberg 2014
An Online Peak Extraction Algorithm for Ion Mobility Spectrometry Data 233
within the measuring device without requiring additional hardware like an ex-
ternal laptop or a compute server. Ideally, the spectra can be processed on a
small embedded chip or small device like a Raspberry Pi or similar hardware
with restricted resources. Algorithms on small mobile hardware face constraints, such as the need to use little energy (and hence little random access memory), while meeting prescribed time bounds.
The basis of each MCC/IMS analysis is peak extraction, by which we mean representing all high-intensity regions (peaks) in the measurement by a few descriptive parameters per peak instead of the full measurement data. State-of-the-art software (like IPHEx [3], Visual Now [2], PEAX [5]) only
extracts peaks when the whole measurement is available, which may take up
to 10 minutes because of the pre-separation of the analytes in the MCC. Our
own PEAX software in fact defines modular pipelines for fully automatic peak
extraction and compares favorably with a human domain expert doing the same
work manually when presented with a whole MCC/IMS measurement. However,
storing the whole measurement is not desirable or possible when memory and CPU power are restricted. Here we introduce a method to extract peaks and
estimate a parametric representation while the measurement is being captured.
This is called online peak extraction, and this article presents the first algorithm
for this purpose on MCC/IMS data.
Section 2 contains background on the data produced in an MCC/IMS ex-
periment, on peak modeling and on optimization methods used in this work.
A detailed description of the novel online peak extraction method is provided
in Section 3. An evaluation of our approach is presented in Section 4, while
Section 5 contains a concluding discussion.
2 Background
We primarily focus on the data generated by an MCC/IMS experiment
(Section 2.1) and related peak models. Ion mobility spectrometers and their
functions are well documented [7], and we do not go into technical details. In
Section 2.2 we describe a previously used parametric peak model, and in Sec-
tion 2.3 we review two optimization methods that are being used as subroutines
in this work.
Let R be the set of (equidistant) retention time points and let T be the set of (equidistant) inverse reduced mobilities (IRMs) at which a measurement is made. If D is the corresponding set of drift times (one sample every 1/250 000 s over 50 ms, i.e., 12 500 time points), there exists a constant fims, depending on external conditions, such that T = fims · D [7].
Then the data is an |R|×|T | matrix S = (Sr,t ) of measured ion intensities, which
we call an IM spectrum-chromatogram (IMSC). The matrix can be visualized as
a heat map (Figure 1). A row of S is a spectrum, while a column of S is a
chromatogram.
Areas of high intensity in S are called peaks, and our goal is to discover these
peaks. Comparing peak coordinates with reference databases may reveal the
identity of the corresponding compound. A peak caused by a VOC occurs over
several IM spectra. We have to mention some properties of MCC/IMS data that
complicate the analysis.
– An IM spectrometer uses a carrier gas, which is also ionized. The ions are
present in every spectrum, which is referred to as the reactant ion peak
(RIP). In the whole IMSC it is present as a high-intensity chromatogram at an IRM between 0.47 and 0.53 Vs/cm². When the MCC/IMS is in idle mode,
no analytes are injected into the IMS, and the spectra contain only the RIP.
These spectra are referred to as RIP-only spectra.
– Every spectrum contains a tailing of the RIP, meaning that it decreases
slower than it increases; see Figure 2. To extract peaks, the effect of both
RIP and its tailing must be estimated and removed.
– At higher concentrations, compounds can form dimer ions, and one may
observe both the monomer and dimer peak from one compound. This means
that there is not necessarily a one-to-one correspondence between peaks
and compounds, and our work focuses on peak detection, not compound
identification.
– An IM spectrometer may operate in positive or negative mode, depending
on which type of ions (positive or negative) one wants to detect. In either
case, signals are reported in positive units. All experiments described here
were done in positive mode.
g(x; μ, λ, o) := [x > o] · √(λ / (2π(x − o)³)) · exp(−λ((x − o) − μ)² / (2μ²(x − o))),    (1)
where o is the shift, μ + o is the mean and λ is a shape parameter. A peak is then
given as the product of two shifted Inverse Gaussians, scaled by a volume factor v, i.e., by seven parameters, namely P(r, t) := v · g(r; μr, λr, or) · g(t; μt, λt, ot) for
all r ∈ R, t ∈ T .
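Equation (1) and the two-dimensional peak model P(r, t) translate directly into code; the following is an illustrative sketch, not the authors' implementation:

```python
from math import sqrt, exp, pi

def shifted_inverse_gaussian(x, mu, lam, o):
    """Density g(x; mu, lambda, o) from Eq. (1): a shifted Inverse
    Gaussian that is zero for x <= o (the Iverson bracket [x > o])."""
    if x <= o:
        return 0.0
    d = x - o
    return sqrt(lam / (2 * pi * d ** 3)) * exp(-lam * (d - mu) ** 2 / (2 * mu ** 2 * d))

def peak(r, t, v, mu_r, lam_r, o_r, mu_t, lam_t, o_t):
    """Two-dimensional peak model P(r, t): the product of two shifted
    Inverse Gaussians, scaled by the volume factor v."""
    return v * shifted_inverse_gaussian(r, mu_r, lam_r, o_r) \
             * shifted_inverse_gaussian(t, mu_t, lam_t, o_t)
```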
Since the parameters μ, λ, o of a shifted Inverse Gaussian may be very different although the resulting distributions have a similar shape, it is more intuitive to describe the shifted Inverse Gaussian in terms of three different descriptors: the mean μ′, the standard deviation σ and the mode m. There is a bijection between (μ, λ, o) and (μ′, σ, m), described in [13] and shown in Appendix A.
In contrast to mixtures of Gaussians, here the fc are of different types (e.g., uniform and Inverse Gaussian). The goal is to determine the parameters ω and θ such that the probability
of the observed sample is maximal (maximum likelihood paradigm). Since the resulting optimization problem is non-convex in (ω, θ), the EM algorithm, an iterative method, only converges towards a local optimum that depends on the initial parameter values. The EM algorithm consists of two repeated steps: The
E-step (expectation) estimates the expected membership of each data point in
each component and then the component weights ω, given the current model
parameters θ. The M-step (maximization) estimates maximum likelihood pa-
rameters θc for each parametric component fc individually, using the expected
memberships as hidden variables that decouple the model.
E-Step. To estimate the expected membership Wi,c of data point xi in each component c, the component's relative probability at that data point is computed, such that Σc Wi,c = 1 for all i. Then the new component weight estimates ωc+ are the averages of Wi,c across all n data points:

Wi,c = ωc fc(xi | θc) / Σk ωk fk(xi | θk),    ωc+ = (1/n) Σi=1..n Wi,c,    (2)
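A single E-step as in Eq. (2) can be sketched as follows; the component densities are passed as functions with their parameters θc already bound (an illustration, not the authors' code):

```python
def e_step(points, weights, components):
    """One E-step of Eq. (2): compute the membership matrix W (rows sum
    to 1) and the updated component weights (column averages of W)."""
    n, k = len(points), len(components)
    W = []
    for x in points:
        p = [weights[c] * components[c](x) for c in range(k)]
        total = sum(p)
        W.append([pc / total for pc in p])
    new_weights = [sum(W[i][c] for i in range(n)) / n for c in range(k)]
    return W, new_weights
```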
for c ∈ {n, d}, where αt = 1/At − 1/μd+. These computations are repeated until
convergence as defined in Section 2.3.
The denoised signal vector S+ is computed as St+ := max(0, (St − μn) · (1 − Wt,n)) for all t ≤ |T|, thus setting the mean noise level to zero and erasing data points that belong exclusively to noise.
The goal is to determine the parameters θ = (v, μ, λ, o) such that fθ(t) under-fits the given data S = (St), as shown in Figure 2.
Let rθ (t) := S(t) − fθ (t) be the residual function for a given choice θ of pa-
rameters. As we want to penalize r(t) < 0 but not (severely) r(t) > 0, maximum
likelihood estimation of the parameters is not appropriate. Instead, we use a
modified version of non-linear least squares (NLLS) estimation. In the standard
NLLS method, the error function to be minimized is e(θ) = Σt et(θ), where et(θ) := rθ(t)². Instead, we use

    et(θ) := rθ(t)²/2            if rθ(t) < ρ,
    et(θ) := ρ · rθ(t) − ρ²/2    if rθ(t) ≥ ρ.
That is, the error is the residual squared when it has a negative or small positive
value less than a given ρ > 0, but becomes a linear function for larger residuals.
We refer to this modification (and corresponding algorithms) as the modified
NLLS (MNLLS) method. To estimate the tailing function,
1. we determine reasonable initial values for the parameters (v, μ, λ, o); see below,
2. we use MNLLS to estimate the scaling factor v with ρ = σR², leaving the other parameters fixed,
3. we use MNLLS to estimate all four parameters with ρ = σR²,
4. we use MNLLS to re-estimate the scaling factor v with ρ = σR²/100.
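The modified error behaves like a one-sided Huber loss: quadratic below ρ and linear above it, with matching value and slope at ρ, so that large positive residuals (data above the tailing model) are penalized only mildly. A minimal sketch:

```python
def mnlls_error(residuals, rho):
    """Modified NLLS error: e(theta) = sum over t of
    r^2/2 for r < rho, and rho*r - rho^2/2 for r >= rho.
    Negative residuals (model above data) stay fully quadratic."""
    e = 0.0
    for r in residuals:
        if r < rho:
            e += r * r / 2
        else:
            e += rho * r - rho * rho / 2
    return e
```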
The initial parameter values are determined as follows. The initial σ is set to the standard deviation of the whole RIP-only spectrum. An additional offset o′ is set to the largest IRM left of the RIP mode where the signal is below σR. Having determined the IRM of the RIP mode TR, the initial μ′ can only range within the interval ]TR, TR + 0.7σ]. To obtain appropriate model parameters, μ′ is increased in small steps within this interval, and the descriptor set (μ′, σ, TR) is recomputed into the parameter set (μ, λ, o) until o ≥ o′. This parameter set contains the initial parameter values. For the scaling factor, we initially set v = (1/2) Σt≤|T| St.
Scanning. The algorithm scans for peaks, starting at the left end of S, by sliding
a window of width ρ across S and fitting a quadratic polynomial to the data
points within the window. Assume that the window starts at index β and ends
at β + ρ, the latter not included. The value of ρ is determined by the grid opening time dgrid, the maximum drift time of the spectrum Dlast, and the number of data points in the spectrum: ρ := (dgrid/Dlast) · |D| data point units.
Let f (x; θ) = θ2 x2 + θ1 x + θ0 be the fitted polynomial. We call a window a peak
window if the following conditions are fulfilled:
– the extreme drift time Dx = −θ1/(2θ2) lies within the interval [Dβ, Dβ+ρ],
– f(Dx; θ) ≥ 2σR,
– there is a maximum at Dx, i.e., f(Dx; θ) > f(Dβ; θ).
The first condition can be made more restrictive, to achieve more reliable results, by shrinking the interval towards the center of the window. When no peak is found,
the moving window is shifted one index forward. If a peak is detected, the window
is shifted half the window length forward before the next scan begins, but first
the peak parameters and the reduced spectrum are computed.
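The window test can be sketched as follows. The quadratic is fitted by ordinary least squares via the normal equations, and the extremum of the fitted parabola is taken at −θ1/(2θ2), the vertex of f(x) = θ2x² + θ1x + θ0. This is an illustrative sketch, not the authors' implementation:

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of f(x) = t2*x^2 + t1*x + t0 via the 3x3
    normal equations, solved by Gaussian elimination with pivoting."""
    S = [sum(x ** k for x in xs) for k in range(5)]
    M = [[S[4], S[3], S[2]], [S[3], S[2], S[1]], [S[2], S[1], S[0]]]
    b = [sum(y * x ** 2 for x, y in zip(xs, ys)),
         sum(y * x for x, y in zip(xs, ys)),
         sum(ys)]
    for i in range(3):  # forward elimination
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 3):
                M[r][c] -= f * M[i][c]
            b[r] -= f * b[i]
    t = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        t[i] = (b[i] - sum(M[i][c] * t[c] for c in range(i + 1, 3))) / M[i][i]
    return t  # [t2, t1, t0]

def is_peak_window(xs, ys, sigma_r):
    """Check the peak-window conditions on a window with sorted drift
    times xs: a maximum whose vertex lies inside the window and whose
    fitted intensity exceeds twice the noise level sigma_r."""
    t2, t1, t0 = fit_quadratic(xs, ys)
    if t2 >= 0:
        return False             # parabola opens upward: no maximum
    dx = -t1 / (2 * t2)          # drift time of the extremum
    if not (xs[0] <= dx <= xs[-1]):
        return False
    fx = t2 * dx ** 2 + t1 * dx + t0
    return fx >= 2 * sigma_r
```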
4 Evaluation
We tested different properties of our online algorithm: (1) the execution time,
(2) the quality of reducing a single spectrum to peak models, (3) the correlation
between manual annotations on full IMSCs by a computer-assisted expert and
our automated online extraction method.
242 D. Kopczynski and S. Rahmann
Table 1. Average processing time of both spectrum reduction and consecutive align-
ment on two platforms with different degrees of averaging
Execution time. We tested our method on two different platforms, (1) a desk-
top PC with Intel(R) Core(TM) i5 2.80GHz CPU, 8GB memory, Ubuntu 12.04
64bit OS and (2) a Raspberry Pi type B with ARM1176JZF-S 700MHz CPU,
512 MB memory, Raspbian Wheezy 32bit OS. The Raspberry Pi was chosen
because it is a complete credit-card-sized low-cost single-board computer with
low CPU and power consumption (3.5 W). This kind of device is appropriate for
data analysis in future mobile measurement devices.
Recall that each spectrum contains 12 500 data points. It is current practice
to analyze not the full spectra, but aggregated ones, where five consecutive
values are averaged. Here we consider the full spectra, slightly aggregated ones
(av. over two values, 6 250 data points) and standard aggregated ones (av. over
five values, 2 500 data points). We measured the average execution time of the
spectrum reduction and consecutive alignment. Table 1 shows the results. At
the highest resolution (Average 1) only the desktop PC satisfies the time bound
of 100 ms between consecutive spectra. At lower resolutions, the Raspberry Pi
satisfies the time restrictions.
We found that in the steps that use the EM algorithm, on average 25–30
EM iterations were necessary for a precision of ε := 0.001 (i.e., 0.1%) (see Con-
vergence in Section 2.3). Relaxing the threshold from 0.001 to 0.01 halved the
number of iterations without noticeable difference in the results.
Fig. 3. Time series of discovered intensities of two peaks. Left: A peak with agreement
between manual and automated online annotation. Right: A peak where the online
method fails to extract the peak in several measurements. If one treated zeros as missing
data, the overall trend would still be visible.
Fig. 4. Comparison of peak intensity time series from manual annotation of full mea-
surements vs. our online algorithm. Each cross corresponds to a time series of one peak.
Ideally, the online algorithm finds the peak in all measurements where it was manually
annotated (high recall, x-axis), and the time series has a high cosine similarity (y-axis).
Similarity of online extracted peaks with manual offline annotation. The third
experiment compares extracted peaks on a time series of measurements between
expert manual annotation and our algorithm. Here 15 rats were monitored in 20
minute intervals for up to a day. Each rat resulted in 30–40 measurements (a time
series) for a total of 515 measurements, all captured in positive mode. To track
peaks within a single time series, we used the previously described EM clustering
method [14] as well as the state-of-the-art software, namely VisualNow. As an
example, Figure 3 shows time series of intensities of two peaks detected by
computer-assisted manual annotation and using our online algorithm. Clearly,
the sensitivity of the online algorithm is not perfect.
To obtain an overview over all time series, we computed the cosine similarity
γX,Y ∈ [−1, +1] between the intensities over time as discovered by manual an-
notation (X) and our online algorithm (Y ). We also computed the recall of the
online algorithm for each time series, that is, the fraction of measurements where the peak was found by the algorithm among those where it was found by manual annotation.
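Both evaluation measures are easy to state in code; here "found" is interpreted as a positive discovered intensity, which is an assumption made for illustration:

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine similarity between two intensity time series."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def recall(manual, online):
    """Fraction of measurements in which the online algorithm found the
    peak (intensity > 0), among those where manual annotation found it."""
    found_manual = [i for i, v in enumerate(manual) if v > 0]
    hits = sum(1 for i in found_manual if online[i] > 0)
    return hits / len(found_manual)
```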
In summary, we outperform VisualNow in terms of sensitivity and computa-
tion time. Almost 27% of the points extracted by the online method exceed 90%
recall and 95% cosine similarity, whereas only 7% of the time series extracted by VisualNow achieve those values. The peak detection for one measurement took about 2 seconds on average (when the whole measurement is available at once) with the online method and about 20 seconds with VisualNow on the previously described desktop computer. VisualNow only provides the position and signal
intensity of the peak’s maximum, whereas our method additionally provides
shape parameters. Figure 4 shows generally good agreement between the online
method and the manual one, and similarly good agreement between Visual-
Now and the manual annotation. Problems of our online method stem from low-intensity peaks only slightly above the detection threshold, and the resulting fragmentary or rejected peak chains.
References
1. Bessa, V., Darwiche, K., Teschler, H., Sommerwerck, U., Rabis, T., Baumbach,
J.I., Freitag, L.: Detection of volatile organic compounds (VOCs) in exhaled breath
of patients with chronic obstructive pulmonary disease (COPD) by ion mobility
spectrometry. International Journal for Ion Mobility Spectrometry 14, 7–13 (2011)
2. Bödeker, B., Vautz, W., Baumbach, J.I.: Peak finding and referencing in
MCC/IMS-data. International Journal for Ion Mobility Spectrometry 11(1), 83–87
(2008)
3. Bunkowski, A.: MCC-IMS data analysis using automated spectra processing and
explorative visualisation methods. Ph.D. thesis, University Bielefeld: Bielefeld, Ger-
many (2011)
4. Bunkowski, A., Bödeker, B., Bader, S., Westhoff, M., Litterst, P., Baumbach, J.I.:
MCC/IMS signals in human breath related to sarcoidosis – results of a feasibility
study using an automated peak finding procedure. Journal of Breath Research 3(4),
046001 (2009)
5. D’Addario, M., Kopczynski, D., Baumbach, J.I., Rahmann, S.: A modular compu-
tational framework for automated peak extraction from ion mobility spectra. BMC
Bioinformatics 15(1), 25 (2014)
6. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological), 1–38 (1977)
7. Eiceman, G.A., Karpas, Z.: Ion Mobility Spectrometry, 2 ed. Taylor & Francis
(2005)
8. Einstein, A.: Über die von der molekularkinetischen Theorie der Wärme geforderte
Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Annalen der
Physik 322(8), 549–560 (1905)
9. Ewing, R.G., Atkinson, D.A., Eiceman, G.A., Ewing, G.J.: A critical review of
ion mobility spectrometry for the detection of explosives and explosive related
compounds. Talanta 54(3), 515–529 (2001)
10. Hauschild, A.C., Kopczynski, D., D’Addario, M., Baumbach, J.I., Rahmann, S.,
Baumbach, J.: Peak detection method evaluation for ion mobility spectrometry by
using machine learning approaches. Metabolites 3(2), 277–293 (2013)
11. Keller, T., Schneider, A., Tutsch-Bauer, E., Jaspers, J., Aderjan, R., Skopp, G.:
Ion mobility spectrometry for the detection of drugs in cases of forensic and crim-
inalistic relevance. Int. J. Ion Mobility Spectrom 2(1), 22–34 (1999)
12. Kolehmainen, M., Rönkkö, P., Raatikainen, O.: Monitoring of yeast fermenta-
tion by ion mobility spectrometry measurement and data visualisation with self-
organizing maps. Analytica Chimica Acta 484(1), 93–100 (2003)
13. Kopczynski, D., Baumbach, J., Rahmann, S.: Peak modeling for ion mobility spec-
trometry measurements. In: 2012 Proceedings of the 20th European Signal Pro-
cessing Conference (EUSIPCO), pp. 1801–1805. IEEE (August 2012)
14. Kopczynski, D., Rahmann, S.: Using the expectation maximization algorithm with
heterogeneous mixture components for the analysis of spectrometry data. Pre-print
- CoRR abs/1405.5501 (2014)
15. Kreuder, A.E., Buchinger, H., Kreuer, S., Volk, T., Maddula, S., Baumbach, J.:
Characterization of propofol in human breath of patients undergoing anesthesia.
International Journal for Ion Mobility Spectrometry 14, 167–175 (2011)
16. Munteanu, A., Wornowizki, M.: Demixing empirical distribution functions. Tech.
Rep. 2, TU Dortmund (2014)
17. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for
similarities in the amino acid sequence of two proteins. Journal of Molecular Biol-
ogy 48(3), 443–453 (1970)
18. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York
(2006)
19. Spangler, G.E., Collins, C.I.: Peak shape analysis and plate theory for plasma
chromatography. Analytical Chemistry 47(3), 403–407 (1975)
20. Westhoff, M., Litterst, P., Freitag, L., Urfer, W., Bader, S., Baumbach, J.: Ion
mobility spectrometry for the detection of volatile organic compounds in exhaled
breath of lung cancer patients. Thorax 64, 744–748 (2009)
Best-Fit in Linear Time for Non-generative Population Simulation

N. Haiminen, C. Lebreton, and L. Parida

1 Introduction
In many studies, it is important to work with an artificial population to evaluate the efficacy of different methods, or simply to generate a founder population for an in-silico breeding regimen. The populations are usually specified by a set of characteristics such as minor allele frequency (MAF) and linkage disequilibrium (LD) distributions. A generative model simulates the population by evolving it over time [1, 2]. Such an approach uses different parameters such as ancestor population characteristics and sizes, mutation and recombination rates, and breeding regimens, if any. The non-generative models [3–5], on the other hand, do not evolve the population; they often start with an exemplar population and perturb it, either by a regimen of recombinations between the samples or by other local perturbations.
We present a novel non-generative approach that first breaks up the specified (global)
constraints into a series of local constraints. We map the problem onto a discrete frame-
work by identifying subproblems that use best-fit techniques to satisfy these local con-
straints. The subproblems are solved iteratively to give an integrated final solution using
techniques from linear algebra, combinatorics, basic statistics and probability. Using discrete methods, the algorithms are optimized to run in time linear in the size of the output, and are thus extremely time-efficient. In fact, for one of the problems, the algorithm completes the task in sublinear time.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 247–262, 2014.
© Springer-Verlag Berlin Heidelberg 2014
The first problem we address is that of constructing a deme (population) with pre-specified characteristics [6]. More precisely, the problem is defined as:
Problem 1 (Deme Construction). The task is to generate a population (deme) of n diploids (or 2n haploids) with m SNPs that satisfy the following characteristics: MAF distribution p, LD distribution r².
Our approach combines algebraic techniques, basic quantitative genetics and discrete algorithms to best fit the specified distributions. The second problem we address is that of simulating crossovers with interference in a population. Capturing the crossover events in an individual chromosome as it is transmitted to its offspring is a fundamental component of a population evolution simulator, where the population may be under selection or not (neutral). The expected number of crossovers is d, also called the genetic map distance (in units of Morgans). Let the recombination fraction be denoted by r. Then:
Problem 2 (F1 Population with Crossover Interference). The task is to generate an F1 hybrid population with the following crossover interference models for a pair of parents:
1. Complete interference (Morgan [7]) model, defined by the relationship d = r.
2. Incomplete interference (Kosambi [8]) model, defined by the relationship
   d = 0.25 ln((1 + 2r)/(1 − 2r)) or r = 0.5 tanh(2d).
3. No interference (Haldane [7, 8]) model, defined by the relationship
   d = −0.5 ln(1 − 2r) or r = 0.5 (1 − e^(−2d)).
Again, we use combinatorics and basic probability to design a sublinear-time algorithm to best fit the distributions of the three different crossover interference models.
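For concreteness, the three d ↔ r mappings above can be sketched in code (an illustrative sketch; the function names and interface are ours, not from the paper):

```python
import math

def r_from_d(d, model):
    """Recombination fraction r for map distance d (Morgans) under
    three crossover-interference models (illustrative sketch)."""
    if model == "morgan":    # complete interference: d = r (for d <= 0.5)
        return d
    if model == "kosambi":   # incomplete interference: r = 0.5 * tanh(2d)
        return 0.5 * math.tanh(2 * d)
    if model == "haldane":   # no interference: r = 0.5 * (1 - e^(-2d))
        return 0.5 * (1 - math.exp(-2 * d))
    raise ValueError(model)

def d_from_r(r, model):
    """Inverse mapping: map distance d from recombination fraction r."""
    if model == "morgan":
        return r
    if model == "kosambi":   # d = 0.25 * ln((1 + 2r) / (1 - 2r))
        return 0.25 * math.log((1 + 2 * r) / (1 - 2 * r))
    if model == "haldane":   # d = -0.5 * ln(1 - 2r)
        return -0.5 * math.log(1 - 2 * r)
    raise ValueError(model)
```

Note that for small d all three models agree (r ≈ d), and r saturates at 0.5 as d grows for the Kosambi and Haldane models.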
Equivalently, the LD table of the pairwise patterns 00, 01, 10, 11 of the two loci is written as:

          0                       1
  0   (1 − p1)(1 − p2) + D    (1 − p1)p2 − D     1 − p1
  1   p1(1 − p2) − D          p1 p2 + D          p1          (2)
      1 − p2                  p2                 1
With a slight abuse of notation we call D the LD between two loci, with the obvious
interpretation.
The output deme (population) is a matrix M where each row is a haplotype and each
column is a (bi-allelic) marker. Recall that the input is the MAF and LD distributions.
By convention, the MAF of marker j, pj , is the proportion of 1’s in column j of M . Our
approach to constructing the deme is to work with the markers one at a time and without
any backtracking. We identify the following subproblem, which is used iteratively to
construct the population.
Problem 3 (k-Constrained Marker Problem (k-CMP)). Given columns j0, j1, ..., jk−1 and target values r0, r1, ..., rk−1 and pk, the task is to generate column jk with MAF pk such that the pairwise LD with column jl is rl, for l = 0, 1, ..., k − 1.
Outline of our approach to solving k-CMP. The 1's in column jk are assigned at random, respecting MAF pk. Let Dl(jl, jk) denote the LD between markers jl and jk, and let Dl(·,·) be its expected value in the output matrix M. When both columns fulfill the MAF constraints pl and pk, respectively, let the observed value be Dl^obs(·,·). In other words, if Q10 is the number of times pattern 10 is seen in these two markers in M with n rows (after the random initialization), then

Dl^obs = (1/n) (n pl (1 − pk) − Q10).    (3)
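Eqn (3) translates directly into code (a sketch; the function name is ours, and columns are assumed to be 0/1 NumPy arrays):

```python
import numpy as np

def observed_ld(col_l, col_k):
    """D_l^obs of Eqn (3): observed LD between binary marker columns
    j_l and j_k, given as 0/1 NumPy arrays of equal length n."""
    n = len(col_l)
    p_l = col_l.mean()                    # observed MAF of column j_l
    p_k = col_k.mean()                    # observed MAF of column j_k
    q10 = int(np.sum((col_l == 1) & (col_k == 0)))  # count of pattern 10
    return (n * p_l * (1.0 - p_k) - q10) / n
```

Equivalently, D = P(11) − p_l p_k, since Q10/n = p_l(1 − p_k) − D.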
Next, we shuffle the 1's in column jk so that it simultaneously satisfies the k conditions, giving a best fit of Dl^obs(jl, jk) to D(jl, jk). To achieve this, we compare column jk with the columns jl, l = 0, 1, ..., k − 1, that have already been assigned. First, for each pair of markers jl, jk, we compute the target deviation

Dl^target = +rl √(pl(1 − pl) pk(1 − pk)),

based on the input p and r values. Then we shuffle the 1's in column jk of the output matrix to get a best fit to the targets D0^target, D1^target, ..., D(k−1)^target simultaneously.
Strictly speaking, the number of unknowns is 2 × 2^k, written as yi,0, yi,1, where 1 ≤ i ≤ 2^k. Let Xi denote the binary k-pattern corresponding to the binary pattern of the number i − 1. For example, when k = 3, X1 = 000 and X8 = 111. Then, in the solution, the number of rows with binary (k + 1)-pattern Xi0 is yi,0 and the number of rows with binary (k + 1)-pattern Xi1 is yi,1. Thus

yi,0 + yi,1 = #Xi,

where #Xi is the number of rows in the input M with k-pattern Xi in the given k columns. Since the right-hand sides of the above 2^k equations can be directly obtained from the existing input data, the effective number of unknowns is 2^k, rewritten as yi = yi,0.
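Computing the right-hand sides #Xi from the input matrix can be sketched as follows (illustrative only; names are ours):

```python
import numpy as np

def pattern_counts(M, cols):
    """#X_i for i = 1..2^k: the number of rows of 0/1 matrix M whose
    values in the given k columns form the binary pattern of i - 1."""
    k = len(cols)
    # interpret each row's k-bit pattern as an integer, most significant bit first
    weights = 2 ** np.arange(k - 1, -1, -1)
    idx = M[:, cols] @ weights
    return np.bincount(idx, minlength=2 ** k)  # counts[i - 1] == #X_i

# e.g. for k = 3, #X_1 (pattern 000) is counts[0] and #X_8 (pattern 111) is counts[7]
```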
For example, L^3_1 = {1, 2, 3, 4} and L^3_3 = {1, 3, 5, 6} (see Fig. 1 for the binary patterns). Then the k + 1 linear equations are:

Σ_{i ∈ L^k_l} yi = Plk,  for l = 0, 1, ..., k − 1,
Σ_{i ∈ L^k_{k−1}} yi = Q(k−1)k.
Consider the following concrete example where k = 3. We use the following convention: the given (constraint) columns are 0, 1 and 2, and the column under construction is 3. We solve for the eight variables y1, ..., y8; the conditions are derived below. Let p3 = 0.26, with the three pairwise LD tables:

r03² = 0.27, D0 = 0.0714:   r13² = 0.29, D1 = 0.1178:   r23² = 0.37, D2 = 0.133:

      0    1                     0    1                     0    1
  0  73   16 |  89           0  51    2 |  53           0  54    1 |  55
  1   1   10 |  11           1  23   24 |  47           1  20   25 |  45
     74   26 | 100              74   26 | 100              74   26 | 100    (4)
Fig. 1. The problem setup for k = 3, showing the effectively 8 unknowns y1, ..., y8 in the rightmost column. The numbers of the concrete example are shown on the right: the counts #Xi of the patterns 000, 001, ..., 111 are 27, 21, 22, 19, 4, 1, 2 and 4, respectively (summing to 100).
Recall from linear algebra that a column (or variable yi) is pivotal if it contains the leading (leftmost) non-zero element of some row (which can easily be converted to 1). In the reduced row echelon form, a column is pivotal if all elements to the left of, above, and below its leading 1 are zero. If the augmented column (the very last one, with the P's and Q's) is also pivotal, then the system has no solution in R^8. The row transformations give the following:
Input constraints (columns y1, ..., y8; augmented):
  1 0 1 0 1 0 1 0 | P23
  0 1 0 1 0 1 0 1 | Q23
  1 1 1 1 0 0 0 0 | P03
  1 1 0 0 1 1 0 0 | P13

⇒ row echelon form (pivotal columns y1..y4):
  1 0 1 0  1  0 1 0 | P23
  0 1 0 1  0  1 0 1 | Q23
  0 0 1 1 −1 −1 0 0 | P03 − P13
  0 0 0 1  0  0 1 1 | P23 − P13 + Q23

⇒ reduced row echelon form (pivotal columns y1..y4):
  1 0 0 0  2  1  2  1 | 2P23 − P03 + Q23
  0 1 0 0  0  1 −1  0 | P13 − P23
  0 0 1 0 −1 −1 −1 −1 | P03 − P23 − Q23
  0 0 0 1  0  0  1  1 | P23 − P13 + Q23
Thus, for this concrete example, the solution space is captured as:

(y1, y2, y3, y4, y5, y6, y7, y8)ᵀ = (55, −3, −1, 23, 0, 0, 0, 0)ᵀ
    + c1 (−2, 0, 1, 0, 1, 0, 0, 0)ᵀ
    + c2 (−1, −1, 1, 0, 0, 1, 0, 0)ᵀ
    + c3 (−2, 1, 1, −1, 0, 0, 1, 0)ᵀ
    + c4 (−1, 0, 1, −1, 0, 0, 0, 1)ᵀ,

subject to y1 ≤ 27, y2 ≤ 21, y3 ≤ 22, y4 ≤ 19, y5 ≤ 4, y6 ≤ 1, y7 ≤ 2, y8 ≤ 4 (equivalently, c1 ≤ 4, c2 ≤ 1, c3 ≤ 2, c4 ≤ 4).
Fig. 2 shows an example of applying the algebraic technique to a data set based on real human MAF and r² data provided by the International HapMap Project [11], from chromosome 22 in the LD data collection of the Japanese in Tokyo (JPT) population.
Fig. 2. Population construction using algebraic techniques, for the HapMap JPT population. Here k = 10. (A) LD fit, (B) MAF fit, and (C) LD for each pair of columns; the upper left triangle is the target and the lower right triangle is the constructed.
One of the shortcomings of the linear algebraic approach is that it is not obvious how to extract an approximate solution when an exact solution does not exist. Recall that in the original problem the LD values (the rlk) are drawn from a distribution with non-zero variance. A single-step hill-climbing algorithm is described here; the general hill climbing (with compound steps) will be presented in the full version of the paper.
Note that the cost function in the hill-climbing process is crucial to the algorithm. Here we derive the cost function. To keep the exposition simple, we use k = 3 in the discussion. Let the column under construction be j3, and suppose that in the pairwise comparison this column is being compared with, e.g., jl = j0. Then Q03 represents the number of rows with 1 in column j0 and 0 in column j3. A flip is defined as updating an entry of 0 to 1 (or 1 to 0). Further, in column j3, we exchange the positions of a 0 and a 1, say at rows i_from and i_to respectively. This is equivalent to a flip at row i_from and a flip at row i_to, in column j3 (called Flip1 and Flip2 in the algorithm). Two flips are essential, since the number of 1's (and 0's) in column j3 must stay the same so as not to affect the allele frequency in column j3. When row i_from of column j3 is 0 and row i_to of column j3 is 1, these two flips lead to a change in the LD between columns j0 and j3 as follows:
Scenario I: The entry in row i_from of column j0 is 0 and the entry in row i_to of column j0 is 1. Then there is a negative change in the LD value D0, since the count of the pattern 00 (and 11) went down by 1 and the count of pattern 01 (and 10) went up by 1.
Scenario II: The entry in row i_from of column j0 is 1 and the entry in row i_to of column j0 is 0. Then there is a positive change in the LD value D0, since the count of the pattern 00 (and 11) went up by 1 and the count of pattern 01 (and 10) went down by 1.
Scenario III: The entry in row i_from of column j0 is 0 and the entry in row i_to of column j0 is 0. Then there is no change in the LD value D0, since the count of the pattern 00 does not change and the count of pattern 01 does not change.
Scenario IV: The entry in row i_from of column j0 is 1 and the entry in row i_to of column j0 is 1. Then there is no change in the LD value D0, since the count of the pattern 11 does not change and the count of pattern 10 does not change.
The four scenarios and their effect on the LD are summarized below; the cost function is based on Q:

           [from]                          [to]
         0        1                      0      1
    0   P00  ←── C01  (Flip2)       0   P00    C01
    1   Q10  ──→ B11  (Flip1)       1   Q10    B11
Time complexity. The cost function is pre-computed for binary k-patterns and the data is sorted in a hash table (details in the full version of the paper). Based on this, an entry of 1 in column j is processed no more than once. The number of such 1's is npj. Thus, for the k-CMP problem, the algorithm takes O(knpj) time, and hence computing the entire matrix takes O(knmpj) time.
Fig. 3. Hill-climbing algorithm for the ASW HapMap population with k = 6. (A) LD fit, (B) MAF fit (simulated vs. target MAF; correlation 0.9997), and (C) LD for each pair of columns, where the upper left triangle is the target and the lower right triangle is the constructed. The main figure (A) shows the target "o" and constructed "*" mean r² per pairwise marker distance, while the black dots show the target and cyan dots the constructed r² distribution per distance.
ALGORITHM (k = 3):
1. Assign 1 to each entry of column j3 of M, written Mj3, with probability p3 (the rest are assigned 0).
2. At column j3, with the 3 constraints based on columns j0, j1, j2:
(a) Compute targets Dl^target using rl3, and Gl^target = n(Dl^target − Dl^obs), for l = 0, 1, 2.
(b) Initialize
    i. G0^t = G1^t = G2^t = 0.
    ii. distance = Σl (Gl^target − Gl^t)². (The goal of the LOOP is to minimize distance.)
(c) LOOP
    i. Move(Z1 → Z2, 1) in the following five steps:
       (1-from) Pick i_from in column j3, with its k-neighbor pattern Z1, such that Mj3[i_from] is 0.
       (2-to) Pick i_to in column j3, with its k-neighbor pattern Z2, such that Mj3[i_to] is 1 and the resulting distance decreases.
       (3-Update) For columns l = 0, 1, 2 (corresponding to G0, G1, G2):
           IF Z1[l] = 1 and Z2[l] = 0 THEN Gl^t goes up by 1 (Scenario II)
           IF Z1[l] = 0 and Z2[l] = 1 THEN Gl^t goes down by 1 (Scenario I)
           (IF Z1[l] = Z2[l] THEN no change) (Scenarios III & IV)
       (4-Flip1) Flip Mj3[i_from]. (to maintain pj)
       (5-Flip2) Flip Mj3[i_to]. (to maintain pj)
    ii. Update G0^t, G1^t, G2^t and the distance
    WHILE the distance decreases
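The LOOP can be sketched as follows (a simplified illustration with our own function names; we use random candidate picks and a fixed iteration budget in place of the paper's hash-table bookkeeping):

```python
import numpy as np

def hill_climb_column(M, cols, jk, G_target, max_iters=10000):
    """Sketch of the shuffling LOOP: exchange a 0 and a 1 in column jk of
    the 0/1 matrix M (preserving its MAF) whenever the move reduces the
    squared distance to the targets G_target[l], l = 0..k-1."""
    k = len(cols)
    G_target = np.asarray(G_target, dtype=float)
    G = np.zeros(k)                       # running counts G_l^t
    rng = np.random.default_rng(0)

    def dist(g):
        return float(np.sum((G_target - g) ** 2))

    for _ in range(max_iters):
        zeros = np.flatnonzero(M[:, jk] == 0)
        ones = np.flatnonzero(M[:, jk] == 1)
        if len(zeros) == 0 or len(ones) == 0:
            break
        i_from = rng.choice(zeros)        # candidate row with a 0 in column jk
        i_to = rng.choice(ones)           # candidate row with a 1 in column jk
        # Scenario bookkeeping: compare the k-neighbor patterns Z1 and Z2
        delta = np.zeros(k)
        for l in range(k):
            z1, z2 = M[i_from, cols[l]], M[i_to, cols[l]]
            if z1 == 1 and z2 == 0:
                delta[l] = 1.0            # Scenario II: G_l goes up
            elif z1 == 0 and z2 == 1:
                delta[l] = -1.0           # Scenario I: G_l goes down
        if dist(G + delta) < dist(G):     # accept only distance-decreasing moves
            M[i_from, jk], M[i_to, jk] = 1, 0  # Flip1 and Flip2: MAF preserved
            G = G + delta
    return M
```

Each accepted move exchanges a 0 and a 1 within column jk, so the column's allele frequency is invariant throughout.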
Fig. 4. Haplotype blocks in the simulated ASW HapMap population data (individuals × markers), as defined by the HapBlock software. For each of the 7 identified blocks, the three most frequent haplotype sequences are shown as white to gray horizontal lines. The remaining marker values are shown in darker gray/black.
Fig. 3 shows an example of applying the hill-climbing technique to a data set based on real human MAF and r² data provided by the International HapMap Project [11], from chromosome 22 in the LD data collection of the African ancestry in Southwest USA (ASW) population. Fig. 4 shows the haplotype blocks in the simulated population, as defined by the HapBlock [15] software. The results demonstrate reasonable haplotype block lengths in the simulated population.
An individual of a diploid population draws its genetic material from its two parents, and the interest is in studying the fragmentation and distribution of the parental material in the progeny. The difference in definition between recombinations and crossovers is subtle: the latter is what happens in reality, while the former is what is observable. Hence a simulator that reflects reality must simulate the crossover events. However, it is a well-known fact that there appears to be some interference between adjacent crossover locations. The reality of plants having large populations of offspring from the same pair of parents poses a more stringent condition on the crossover patterns seen in the offspring population than is seen in humans or animals. Thus, capturing the crossover events accurately in an individual chromosome as it is transmitted to its offspring is a fundamental component of a population evolution simulator. Since the crossover simulation often dominates a population simulator, it determines its accuracy and ultimately controls its execution speed [16].
Various crossover models have been proposed in the literature in terms of their overall statistical behavior. However, it is not obvious how they can be adapted to generating populations respecting these models, since the distributions are (indirect) non-trivial functions of the crossover frequencies. In Section 5.5, we present an utterly simple two-step probabilistic algorithm that is not only accurate but also runs in sublinear time for three interference models. Not surprisingly, the algorithm is very cryptic. The derivation of the two steps, which involves mapping the problem into a discrete framework, is discussed below. We believe that this general framework can be used to convert other such "statistical" problems into a discrete framework as well.
Two-point crosses. Let rij be the recombination fraction between loci i and j on the chromosome. Consider three loci 1, 2 and 3, occurring in that order in a closely placed, or linked, segment of the chromosome. Then:

r13 = r12 + r23 − 2 r12 r23.    (10)

Using combinatorics, and the usual interpretation of the double-crossover frequency r12 r23, we derive this as follows (1 denotes a crossover and · the absence of a crossover; the 4 possible distinct samples are marked i–iv, with frequencies x1, ..., x4):

        (1,2)   (2,3)   loci 1,3    frequency
  i       1       ·      1 (odd)       x1
  ii      1       1      · (even)      x2
  iii     ·       1      1 (odd)       x3
  iv      ·       ·      · (even)      x4
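Eqn (10) can be checked with a small Monte Carlo simulation under independent crossovers (a sketch with our own function names; independence corresponds to the no-interference case):

```python
import random

def simulate_r13(r12, r23, n=200000, seed=1):
    """Estimate r13 by simulating independent crossover indicators in the
    intervals (1,2) and (2,3); loci 1 and 3 recombine iff an odd number of
    crossovers (here: exactly one) occurred between them."""
    rng = random.Random(seed)
    odd = 0
    for _ in range(n):
        c12 = rng.random() < r12   # crossover between loci 1 and 2
        c23 = rng.random() < r23   # crossover between loci 2 and 3
        if c12 != c23:             # odd number of crossovers => recombinant
            odd += 1
    return odd / n

# theory: r13 = r12 + r23 - 2 * r12 * r23
```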
The above also suggests an algorithm for simulating N chromosome segments of expected length L cM: M can be traversed (in any order), and at each cell a 1 is introduced with probability p (or a 0 with probability 1 − p). This results in a matrix M satisfying the conditions in Definition 1. In fact, when L is large and p is small, as is the case here, a Poisson distribution with mean λ = pL can be used along a row to directly obtain the marker locations (the j's) with crossovers.
Let M[i1, j1] = x and M[i2, j2] = y; then M[i1, j1] ↔ M[i2, j2] denotes the exchange of the two values. In other words, after the operation, M[i1, j1] = y and M[i2, j2] = x. The following is the central lemma of the section:

Lemma 1 (Exchange Lemma). Let j be fixed, let I1, I2 ≠ ∅ be random subsets of rows, and for random i1 ∈ I1, i2 ∈ I2, let M be obtained after the following exchange operation:

M[i1, j] ↔ M[i2, j].

Then M is also a progeny matrix.

This follows from the fact that the value in each cell of M is independent of any other cell.
Note that, given a progeny matrix M, if the values in some random cells are toggled (from 0 to 1 or vice versa), the resulting matrix may not be a progeny matrix; but if a careful exchange is orchestrated respecting Lemma 1, then M continues to be a progeny matrix.
Fig. 5. (a) The distribution of the fragment length when t and a are fixed, compared with when they are picked at random from a small neighbourhood, in the Kosambi model. The latter gives a smoother distribution than the former (notice the jump at fragment length 20 in the fixed case). (b) Distance d (cM) versus recombination fraction r, for closed-form solutions according to the Kosambi, Haldane, and Morgan models, and for observed data in the simulations. The results show average values from constructing a 5 Morgan chromosome 2,000 times for each model.
where a is a correction factor. When the crossovers are assigned randomly and independently in M, then, due to the interference model C, the fraction of positions with M[·, j2] = 1 that are in violation is

(1 − α) × β × (1 − C).

The fraction that could potentially be exchanged with the violating i's, while respecting the interference model, is

α × (1 − β).

Thus, for the Kosambi model, C = 2r ≈ 2p:

violating fraction with M[·, j2] = 1 ≈ (1 − (1 − p)^(at)) × p × (1 − 2p) = γ,    (16)
potential exchange fraction ≈ (1 − p)^(at) × (1 − p) = η.    (17)

The correction factor a is estimated empirically. An iterative procedure can then exchange the violating fraction (γ) randomly with values in the potential exchange fraction (η). By Lemma 1, M continues to be a progeny matrix, and since the conditions of Conjecture 1 hold, the transformed matrix also respects the interference model. Thus the "interference adjustment" probability p′ in the Kosambi model is defined as:

p′ ≈ γ/η = p(1 − 2p) (1 − (1 − p)^(at)) / (1 − p)^(at+1).    (18)

For the Morgan model, C = 0:

γ = (1 − α)β,  η = α(1 − β),  p′ = p (1 − (1 − p)^(at)) / ((1 − p)^(at)(1 − p)).

For the Haldane model, C = 1: γ = 0 and hence p′ = 0. Now we are ready for the details of the algorithm, where each row of MC can be computed independently using the two probabilities p and p′.
5.5 Algorithm
We encapsulate the above into a framework that generates crossovers based on the mathematical model of Eqn 11 and a generic interference function of the form C = f(r). In this general framework, a and t of Eqn 19 are estimated empirically to match the expected r curves with ε = 0.001 (of Conjecture 1). We present these estimated values for the C = 2r, C = 1 and C = 0 models in Eqn 19.
Crossover probability p. For historical reasons, the length of a chromosome may be specified in units of Morgans, where one Morgan is the expected distance between two crossovers. Thus, in a chromosomal segment of length 1 centiMorgan (cM), i.e., one-hundredth of a Morgan, there is only a 1% chance of a crossover. In our representation, each cM is a single cell in the matrix representation of the population, leading to a crossover probability p = 0.01 at each position; this single cell may correspond to 1 nucleotide base [9, 10, 14].
INPUT: Length of chromosome: Z Morgans, or Z × 100 centiMorgans (cM).
ASSUMPTION: 1 cM is the resolution, i.e., a single cell in the representation vector/matrix.
OUTPUT: Locations of crossover events in a chromosome.
ALGORITHM: Let t = Xt and a = Xa (Eqn 19), where Xc is a random variable drawn from a uniform distribution on [b, d], for some b < d with c = (b + d)/2. For example, a uniform discrete distribution on [1, 31] for t, and a uniform continuous distribution on [1.0, 1.2] for a.
For each sampled chromosome:
Step 1. Draw the number of positions from a Poisson distribution with λ = pL. For each randomly picked position j, introduce a crossover. If there are crossovers in any of the previous t or next t positions (in cM), then the crossover at j is removed with probability q. [Interference]
Step 2. Draw the number of positions from a Poisson distribution with λ = p′L. For each randomly picked position j′, if there are no crossovers in the previous t and next t positions, then a crossover is introduced at j′. [Interference adjustment]
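The two steps can be sketched as follows (our own illustration; p, p′, q and t are treated as given inputs rather than estimated as in Eqn 19):

```python
import numpy as np

def crossover_positions(L, p, p_adj, q, t, rng=None):
    """Sketch of the two-step procedure on a chromosome of L cM (one cell
    per cM). p, p_adj (p'), q and t are assumed given; this only
    illustrates the interference (Step 1) and adjustment (Step 2) logic."""
    rng = rng or np.random.default_rng()
    chrom = np.zeros(L, dtype=int)

    def near(j):  # crossovers within the previous/next t positions, excluding j
        lo, hi = max(0, j - t), min(L, j + t + 1)
        return chrom[lo:hi].sum() - chrom[j]

    # Step 1: Poisson number of candidate crossovers; remove a candidate
    # with probability q when another crossover lies within t cM.
    for j in rng.integers(0, L, rng.poisson(p * L)):
        chrom[j] = 1
        if near(j) > 0 and rng.random() < q:
            chrom[j] = 0
    # Step 2: add crossovers only at isolated positions (interference adjustment).
    for j in rng.integers(0, L, rng.poisson(p_adj * L)):
        if near(j) == 0:
            chrom[j] = 1
    return np.flatnonzero(chrom)
```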
Fragment lengths. Note that the careful formulation above does not account for the following summary statistic of the population: the distribution of the lengths of the fragments produced by the crossovers. It turns out that the fragment length is controlled by the choice of the empirical values of Xt and Xa in Eqn 19. In Fig. 5(a), we show the fragment length distribution when the two values are fixed at t = 16 and a = 1.1 in the Kosambi model. When t and a are instead picked uniformly from a small neighborhood around these values, we obtain a distribution that is more acceptable to practitioners.
Running time analysis. The running time analysis is rather straightforward. Let cp be the time associated with a Poisson draw and cu with a uniform draw. Ignoring the initialization time and assuming t is negligible compared to L, the expected time taken by the above algorithm for each sample is 2cp + (Z + 1)cu. Since the time is proportional to the number of resulting crossovers, the running time is optimal.
Acknowledgments. We are very grateful to the anonymous reviewers for their excel-
lent suggestions which we incorporated to substantially improve the paper.
References
1. Balloux, F.: EASYPOP (Version 1.7): A Computer Program for Population Genetics Simu-
lations. Journal of Heredity 92, 3 (2001)
2. Peng, B., Amos, C.I.: Forward-time simulations of non-random mating populations using
simuPOP. Bioinformatics 24, 11 (2008)
3. Montana, G.: HapSim: a simulation tool for generating haplotype data with pre-specified
allele frequencies and LD coefficients. Bioinformatics 21, 23 (2005)
4. Yuan, X., Zhang, J., Wang, Y.: Simulating Linkage Disequilibrium Structures in a Human
Population for SNP Association Studies. Biochemical Genetics 49, 5–6 (2011)
5. Shang, J., Zhang, J., Lei, X., Zhao, W., Dong, Y.: EpiSIM: simulation of multiple epistasis,
linkage disequilibrium patterns and haplotype blocks for genome-wide interaction analysis.
Genes & Genomics 35, 3 (2013)
6. Peng, B., Chen, H.-S., Mechanic, L.E., Racine, B., Clarke, J., Clarke, L., Gillanders, E.,
Feuer, E.J.: Genetic Simulation Resources: a website for the registration and discovery of
genetic data simulators. Bioinformatics 29, 8 (2013)
7. Balding, D., Bishop, M., Cannings, C.: Handbook of Statistical Genetics, 3rd edn. John Wiley & Sons Ltd. (2007)
8. Vinod, K.: Kosambi and the genetic mapping function. Resonance 16(6), 540–550 (2011)
9. Kearsey, M.J., Pooni, H.S.: The genetical analysis of quantitative traits. Chapman & Hall
(1996)
10. Lynch, M., Walsh, B.: Genetics and Analysis of Quantitative Traits. Sinauer Associates
(1998)
11. The International HapMap Consortium: The International HapMap Project. Nature 426,
789–796 (2003)
12. Haldane, J.B.S.: The combination of linkage values, and the calculation of distance between
linked factors. Journal of Genetics 8, 299–309 (1919)
13. Kosambi, D.D.: The estimation of map distance from recombination values. Annals of Eugenics 12(3), 172–175 (1944)
14. Cheema, J., Dicks, J.: Computational approaches and software tools for genetic linkage map
estimation in plants. Briefings in Bioinformatics 10(6), 595–608 (2009)
15. Zhang, K., Deng, M., Chen, T., Waterman, M.S., Sun, F.: A dynamic programming algorithm
for haplotype block partitioning. Proceedings of the National Academy of Sciences 19(11),
7335–7339 (2002)
16. Podlich, D.W., Cooper, M.: QU-GENE: a simulation platform for quantitative analysis of
genetic models. Bioinformatics 14(7), 632–653 (1998)
GDNorm: An Improved Poisson Regression
Model for Reducing Biases in Hi-C Data
1 Introduction
Three dimensional (3D) conformation of chromosomes in nuclei plays an im-
portant role in many chromosomal mechanisms such as gene regulation, DNA
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 263–280, 2014.
© Springer-Verlag Berlin Heidelberg 2014
264 E.-W. Yang and T. Jiang
However, during the experimental steps of Hi-C, systematic biases from different sources are often introduced into contact frequencies. Several systematic biases were shown to be related to genomic features such as the number of restriction enzyme cutting sites, GC content and sequence uniqueness in the work of Yaffe and Tanay [13]. Without being carefully detected and eliminated, these systematic biases may distort many downstream analyses of chromosome spatial organization studies. To remove such systematic biases, several bias reduction methods have been proposed recently. These methods can be divided into two categories according to [2]: normalization methods and bias correction methods. The normalization methods, such as ICE [14] and the method in [15], aim at reducing the joint effect of systematic biases without making any specific assumption on the relationships between systematic biases and related genomic features. Their applications are limited to the study of equal-sized genomic loci [2]. In contrast, the bias correction methods, such as HiCNorm [16] and the method of Yaffe and Tanay (YT) [13], build explicit computational models to capture the relationships between systematic biases and related genomic features, which can then be used to eliminate the joint effect of the biases.
Although it is well known that observed contact frequencies are determined by both systematic biases and the spatial distance between genomic segments, the existing bias correction methods do not take spatial distance into account explicitly. This incomplete characterization of the causal relationships for contact frequencies is known to cause problems such as poor goodness of fit to the observed contact frequency data [16]. In this paper, we build on the work in [16] and
propose an improved Poisson regression model that corrects systematic biases
while taking spatial distance (between genomic regions) into consideration. We
also present an efficient algorithm for solving the model based on gradient de-
scent. This new bias correction method, called GDNorm, provides more accurate
normalized contact frequencies and can be combined with a distance-based chro-
mosome structure determination method such as ChromSDE [12] to obtain more
accurate spatial structures of chromosomes, as demonstrated in our simulation
study. Moreover, two recently published Hi-C datasets from human lymphoblas-
toid and mouse embryonic stem cell lines are used to compare the performance
of GDNorm with the other state-of-the-art bias reduction methods including
HiCNorm, YT and ICE at 40kb and 1M resolutions. Our experiments on the
real data show that GDNorm outperforms the existing bias reduction methods in
terms of the reproducibility of normalized contact frequencies between biological
replicates. The normalized contact frequencies by GDNorm are also found to be
highly correlated to the corresponding FISH distance values in the literature.
With regard to time efficiency, GDNorm achieves the shortest running time on
the two real datasets and the running time of GDNorm increases linearly with
the resolution of data. Since more and more high resolution (e.g., 5 to 10kb)
data are being used in the studies of chromosome structures [17], the time effi-
ciency of GDNorm makes it a valuable bias reduction tool, especially for studies
involving high resolution data.
266 E.-W. Yang and T. Jiang
The rest of this paper is organized as follows. Section 2.1 presents several
genomic features that are used in our improved Poisson regression model. The
details of the model as well as the gradient descent algorithm are described
in Section 2.2. Several experimental results on simulated and real human and
mouse data are presented in Section 3. Section 4 concludes the paper.
2 Methods
2.1 Genomic Features
A chromosome g can be binned into several disjoint and consecutive genomic seg-
ments. Given an ordering to concatenate the chromosomes, let S = {s1 , s2 , ..., sn }
be a linked list representing all n genomic segments of interest such that the lin-
ear order of the segments in S is consistent with the sequential order in the
concatenation of the chromosomes. For each genomic segment si , the number of
restriction enzyme cutting sites (RECSs) within si is represented as Ri . The GC
content Gi of segment si is the average GC content within the 200 bps region
upstream of each RECS in the segment. The sequence uniqueness Ui of segment
si is the average sequence uniqueness of the 500 bps regions upstream or downstream
of each RECS. To calculate the sequence uniqueness for a 500 bps region, we
use a sliding window of 36bps to synthesize 55 reads of 35 bps by taking steps
of 10bps from 5' to 3', as done in [16]. After using the BWA algorithm [18] to
align the 55 reads back to the genome, the percentage of the reads that are still
uniquely mapped in the 500 bps region is taken as the sequence uniqueness
for the 500 bps region. These three major genomic features have been shown to
be either positively or negatively correlated to contact frequencies in the litera-
ture [13]. In the following, we will present a new bias correction method based on
gradient search to eliminate the joint effect of the systematic biases correlated to
the three genomic features, building on the Poisson regression model introduced
in [16].
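For illustration, the three segment-level features can be assembled from per-RECS annotations along these lines (a minimal sketch; the input layout is hypothetical, and in the actual pipeline GC content and uniqueness are derived from the genome sequence and BWA alignments):

```python
# Sketch: aggregate per-RECS annotations into segment-level features.
# The (gc, uniqueness) pair layout per RECS is our own illustrative choice.

def segment_features(recs_per_segment):
    """recs_per_segment: one list per segment; each RECS is a
    (gc_content, uniqueness) pair for its flanking windows.
    Returns (R_i, G_i, U_i) per segment."""
    features = []
    for recs in recs_per_segment:
        R = len(recs)                                    # number of RECSs, R_i
        G = sum(gc for gc, _ in recs) / R if R else 0.0  # mean GC content, G_i
        U = sum(u for _, u in recs) / R if R else 0.0    # mean uniqueness, U_i
        features.append((R, G, U))
    return features
```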
In HiCNorm [16], the observed contact frequency f_{i,j} between segments s_i and s_j is assumed to follow a Poisson distribution with rate θ_{i,j}, modeled by the log-linear equation

log(θ_{i,j}) = β0 + βrecs log(R_i R_j) + βgcc log(G_i G_j) + βseq log(U_i U_j),   (1)

where β0 is a global constant, and βrecs, βgcc and βseq are coefficients for the sys-
tematic biases correlated with RECS, GC content and sequence uniqueness, respectively.
An Improved Poisson Regression Model for Reducing Biases 267
Taking spatial distance into account, Eq. 1 can be extended (as in BACH [10]) to

log(θ_{i,j}) = β0 + βdist log(d_{i,j}) + βrecs log(R_i R_j) + βgcc log(G_i G_j) + βseq log(U_i U_j),   (2)

where β = {βrecs, βgcc, βseq} again represents the systematic biases, βdist rep-
resents the conversion factor, and D = {d_{i,j} | 1 ≤ i < j ≤ n} are variables
representing the spatial distance values to be estimated. However, without any
constraint or assumption on spatial distance, the model represented by Eq. 2 is
non-identifiable, because for any constant k, βdist log(d_{i,j}) = (k βdist) log(d_{i,j}^{1/k}).
BACH solved this issue by introducing spatial constraints from previously
predicted chromosome structures. (Eq. 2 was used by BACH to iteratively refine
the predicted chromosome structure.) Hence, Eq. 2 is infeasible for bias correc-
tion methods that do not rely on any spatial constraint. To get around this, we
introduce a new variable z_{i,j} = β0 + βdist log(d_{i,j}) and rewrite Eq. 2 as

log(θ_{i,j}) = z_{i,j} + βrecs log(R_i R_j) + βgcc log(G_i G_j) + βseq log(U_i U_j),   (3)
where the systematic biases β and Z = {zi,j |1 ≤ i ≤ n, i < j} are the vari-
ables to be estimated. Note that applying a Poisson distribution on read count
data sometimes leads to the overdispersion problem, i.e., underestimation of the
variance [19], which is generally solved by using a negative binomial distribution
instead. However, the results in [16] suggest that there is usually no significant
difference in the performance of bias correction methods when a negative bi-
nomial distribution or a Poisson distribution is applied to Hi-C data. For the
mathematical simplicity of our model, we use Poisson distributions.
Let θ denote the set of θi,j , 1 ≤ i ≤ n, 1 ≤ j ≤ n. Given the observed contact
frequency matrix F and genomic features of S, the log-likelihood function of the
observed contact frequencies over the Poisson distribution rates can be written
as:
log(Pr(F | β, Z)) = log(Pr(F | θ)) = log( ∏_{1≤i<j≤n} Pr(f_{i,j} | θ_{i,j}) ) = log( ∏_{1≤i<j≤n} e^{−θ_{i,j}} θ_{i,j}^{f_{i,j}} / f_{i,j}! )

= Σ_{1≤i<j≤n} [ −θ_{i,j} + f_{i,j} log(θ_{i,j}) − log(f_{i,j}!) ].   (4)
However, without any constraint on the variables Z, the above model is still
generally non-identifiable, since for any β we can always choose z_{i,j} such that
f_{i,j} = θ_{i,j} and the likelihood function is maximized. Therefore, we require that
for any i and j, |z_{i,i+1} − z_{j,j+1}| ≤ ε for some threshold ε, since we expect the
distance between neighboring segments to be roughly the same across a chromosome.
This gives the constrained maximum likelihood problem

max_{β,Z} log(Pr(F | β, Z))  subject to  |z_{i,i+1} − z_{j,j+1}| ≤ ε for all i, j.   (5)
Observe that Eq. 5 cannot be solved by using the same Poisson regression
fitting method as in HiCNorm, because Eq. 5 is no longer a standard log-linear
model like Eq. 1. A popular technique for solving multivariate optimization prob-
lems is gradient descent, which searches for the optimum of a minimization
problem with objective function g(x) from a given initial point x_1 and then
iteratively moves toward a local minimum by following the negative of the
gradient, −∇g(x). In other words, at every iteration i, we compute
x_i ← x_{i−1} − α∇g(x_{i−1}), where α is a step size. In our case,
the objective function to be minimized is the negative of the above log-likelihood
function, g(x) = g(β, Z) = −log(Pr(F | β, Z)). Taking partial derivatives of the
objective function with respect to the variables β and Z, we obtain the gradient
∇g(x) = {∂g(x)/∂β, ∂g(x)/∂Z} as
∂g(β, Z)/∂z_{i,j} = θ_{i,j} − f_{i,j}

∂g(β, Z)/∂βrecs = Σ_{1≤i<j≤n} log(R_i R_j)(θ_{i,j} − f_{i,j})

∂g(β, Z)/∂βgcc = Σ_{1≤i<j≤n} log(G_i G_j)(θ_{i,j} − f_{i,j})

∂g(β, Z)/∂βseq = Σ_{1≤i<j≤n} log(U_i U_j)(θ_{i,j} − f_{i,j})
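A single iteration of the resulting gradient descent loop can be sketched as follows (an illustrative simplification, not the released implementation: the step size is fixed, the ε-constraint on Z is omitted, and the variable names are our own):

```python
import numpy as np

def gdnorm_step_sketch(f, R, G, U, Z, beta, alpha=1e-4):
    """One gradient descent step for g(beta, Z) = -log Pr(F | beta, Z).
    f: observed contact frequency matrix (n x n); R, G, U: segment features;
    Z: per-pair variables z_{i,j} (n x n); beta: [b_recs, b_gcc, b_seq]."""
    n = len(R)
    iu = np.triu_indices(n, k=1)               # pairs with i < j
    lR = np.log(np.outer(R, R))[iu]            # log(R_i R_j)
    lG = np.log(np.outer(G, G))[iu]            # log(G_i G_j)
    lU = np.log(np.outer(U, U))[iu]            # log(U_i U_j)
    z = Z[iu]
    theta = np.exp(z + beta[0]*lR + beta[1]*lG + beta[2]*lU)   # Eq. 3
    resid = theta - f[iu]
    # Gradient of the negative log-likelihood (from Eq. 4)
    grad_beta = np.array([lR @ resid, lG @ resid, lU @ resid])
    Z2 = Z.copy()
    Z2[iu] = z - alpha * resid                 # dg/dz_{i,j} = theta - f
    beta2 = beta - alpha * grad_beta
    nll = np.sum(theta - f[iu] * np.log(theta))  # up to the log(f!) constant
    return beta2, Z2, nll
```

Each iteration costs time linear in the number of segment pairs, which is the source of the running-time behavior reported in Section 3.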
3 Experimental Results
We assess the performance of GDNorm in terms of (i) the accuracy of its normal-
ized contact frequencies and (ii) the accuracy of structure determination using
the normalized contact frequencies. The latter will be done by simulating bi-
ased Hi-C read count data from some simple reference chromosome structures
and then trying to recover the reference structures from normalized contact
frequencies in combination with the most recent chromosome structure determi-
nation algorithm, ChromSDE [12]. In other words, we will consider the impact
of normalized contact frequencies on the chromosome structures predicted by
ChromSDE. To measure the quality of bias correction, we consider the repro-
ducibility of normalized contact frequencies between biological replicates of an
mESC line [9] and the correlation between normalized contact frequencies and
FISH distance values in the literature. The performance of GDNorm will be
compared with the state-of-the-art bias reduction algorithms HiCNorm [16], YT
[13] and ICE [14].
and the last segments to be 1 while ChromSDE does not perform this normal-
ization. To obtain a fair comparison, we calibrate the predicted structure sizes in
GDNormsde and HiCNormsde such that the distance between the first and last
segment is fixed at 100. Finally, the accuracy of structure prediction is assessed
using the root mean square difference (RMSD) measure after optimally aligning
a predicted structure to the reference structure by Kabsch’s algorithm [21].
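The RMSD computation after Kabsch superposition can be sketched as follows (a standard formulation of the algorithm; the function name is ours):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between point sets P and Q (n x 3) after optimal superposition
    by Kabsch's algorithm [21]; translation is removed by centering."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)        # SVD of the covariance matrix
    d = np.sign(np.linalg.det(V @ Wt))       # guard against reflections
    D = np.diag([1.0, 1.0, d])
    Rot = V @ D @ Wt                         # optimal rotation
    diff = P @ Rot - Q
    return np.sqrt((diff ** 2).sum() / len(P))
```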
GDNorm Provides the Most Accurate Chromosome Structure Pre-
diction on Noise-Free Data. The optimal alignments of the predicted and
reference chromosome structures are shown together with their RMSD values
in Figure 1. In the structure predictions for both the helix and random walk,
GDNormsde predicted the chromosome structures with the minimum RMSDs.
In the structure prediction for the helix, GDNormsde obtained a structure that
can be almost perfectly aligned with the reference structure with a very small
RMSD value of 0.3. This is because GDNorm was able to significantly reduce the
effect of systematic bias and the semi-definite programming method employed
by ChromSDE can guarantee perfect recovery of a chromosome structure when
the given distance values between segments are noise-free.
GDNorm Reduces Systematic Biases Significantly in Noise-Free Data.
To examine how much the effect of systematic biases can be reduced by the se-
lected bias reduction methods, we further analyze the predicted spatial distance
values between neighboring segments in the structure prediction for the helix.
Because the spatial distance between neighboring segments si and si+1 in the
reference structure of the helix is the same for all i, the difference in the ob-
served contact frequency between si and si+1 , for different i, is mainly a result
of the systematic biases. If the systematic biases are correctly estimated and
eliminated, the distance between any two consecutive segments in the predicted
structure is expected to be the same. The spatial distance values between 10
pairs of consecutive segments with the greatest systematic biases are compared
with the distance values between 10 pairs with the smallest systematic biases for
each of the chromosome structures predicted by GDNormsde , HiCNormsde and
BACH. The box plots in Figure 2 summarize the comparison results. The absolute
differences between the means of the two sets of 10 distance values obtained
by GDNormsde , HiCNormsde and BACH are 0.045, 3.47 and 2.61, respectively.
The statistical significance of the difference between two sets of 10 distance val-
ues obtained by each method is also examined by a two-tailed t-Test [22], which
yielded a non-significant p-value of 0.42 for GDNormsde and significant p-values
of 1.3 × 10−12 and 1.56 × 10−6 for HiCNormsde and BACH, respectively.
GDNorm Provides the Most Accurate Chromosome Prediction on
Noisy Data. We have demonstrated the superior performance of GDNormsde on
Hi-C data without noise (but with systematic biases). To test its performance on
noisy data, a uniformly random noise δi,j is injected into every contact frequency
f_{i,j} such that the noisy frequency f̃_{i,j} = f_{i,j}(1 + δ_{i,j}). In this test, we consider two
noise levels, 30% and 50%. Table 1 summarizes the RMSD values of the optimal
alignments between the predicted structures and the reference structures. The
results show GDNormsde still outperforms the other two methods by achieving
the overall smallest RMSD values at both noise levels. Note that BACH failed
to predict the helix structure at both noise levels in this test, perhaps because
its MCMC algorithm could sometimes be trapped in a local optimum when the
input data contains a significant level of noise.
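The noise injection step can be sketched as follows (drawing δ from U(−level, level) and treating the matrix symmetrically are our assumptions about the protocol):

```python
import numpy as np

def inject_noise(f, level, rng):
    """Perturb each contact frequency multiplicatively:
    f~_{i,j} = f_{i,j} * (1 + delta_{i,j}), delta ~ U(-level, level)."""
    delta = rng.uniform(-level, level, size=f.shape)
    delta = np.triu(delta, 1)
    delta = delta + delta.T        # keep the contact matrix symmetric
    return f * (1.0 + delta)
```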
Fig. 2. Comparison of the predicted spatial distance values with the 10 greatest and 10
smallest systematic biases. For each structure prediction method studied, two sets of
10 distance values form the two boxes in a comparison group. The left box depicts the
distribution of the distance values for contacts with the greatest systematic biases while
the right shows the distribution of the distance values for contacts with the smallest
systematic biases. Clearly, GDNormsde produced the most consistent distance values
and HiCNormsde the least.
In addition to the simulation study, several experiments on real Hi-C data are
conducted to evaluate the bias reduction capability of GDNorm, in comparison
with other state-of-the-art bias reduction methods, HiCNorm, YT and ICE. Un-
like the assessment in the previous simulation study, the reference structures
for real Hi-C datasets are hardly obtainable because of the complexity of chro-
mosome structures. To compare the performance of the studied bias reduction
methods on real Hi-C data, a commonly used evaluation criterion is the simi-
larity (or reproducibility) between normalized contact frequency matrices from
biological replicates using different enzymes. Since these replicates are derived
from the same chromosomal structures in the cell line, the contact frequencies
normalized by a robust bias reduction algorithm using one enzyme are expected
to be similar to those using another enzyme. However, a high reproducibility is
a necessary but not sufficient condition for robust bias reduction algorithms. As
suggested in [2], we further compare the correlation between normalized contact
frequencies and the corresponding spatial distance values measured by FISH
experiments. Both the similarity between the normalized contact frequency ma-
trices and the correlation to FISH data will be measured in terms of Spearman's
rank correlation coefficient, which is independent of the conversion between nor-
malized contact frequencies and spatial distance values.
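A minimal sketch of Spearman's rank correlation, illustrating why it is unaffected by any strictly monotone conversion between frequencies and distances (tie handling is omitted for brevity):

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes no ties, which suffices for this illustration."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks 0..n-1
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because only ranks enter the computation, applying any strictly increasing transform to either argument (e.g., converting frequency to distance) leaves the coefficient unchanged.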
To prepare benchmark datasets for the performance assessment, we use two
recently published Hi-C data from human lymphoblastoid cells (GM06990) [3]
and mouse stem cells (mESC) [9]. For the GM06990 dataset, the Hi-C raw reads,
SRR027956 and SRR027960, of two biological replicates using restriction en-
zymes HindIII and NcoI, respectively, were downloaded from NCBI (GSE18199).
Each of the chromosomes in the GM06990 cell line is binned into 1M bps seg-
ments and the pre-computed observed frequency matrices at 1M resolution were
obtained from the publication website of [13]. For the mESC dataset, the mapped
reads, uniquely aligned by the BWA algorithm [18], were downloaded from NCBI
(GSE35156). Because of the enhanced sequencing depth in the mESC dataset,
the Hi-C data can be analyzed at a higher resolution, i.e., 40kb. In other words,
the 20 chromosomes in the mESC cell line are binned into 40kb segments.
To calculate observed contact frequencies from the mapped reads, the prepro-
cessing protocols used in the literature [3,13] are followed. For every paired-end
read, its total distance to the two closest RECSs is calculated. Any read with
a total distance greater than 500 bps is defined as a non-specific ligation and
thus removed to prevent reads from random ligation being used, as suggested
in [13]. Reads from RECSs with low sequence uniqueness (smaller than 0.5) are
also discarded. The remaining paired-end reads over the 20 chromosomes, chr1
to chr20 (chrX), are used for calculating the observed contact frequencies.
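The read-filtering step of this protocol can be sketched as follows (the tuple layout for paired-end reads is our own illustrative choice):

```python
def filter_reads(read_pairs, recs_uniqueness, max_total_dist=500, min_uniq=0.5):
    """Filter paired-end Hi-C reads as in the preprocessing protocol:
    drop non-specific ligations (total distance to the two closest RECSs
    exceeding 500 bps) and reads from RECSs with uniqueness below 0.5.
    read_pairs: list of (dist_to_recs_1, recs_id_1, dist_to_recs_2, recs_id_2).
    recs_uniqueness: dict mapping RECS id -> sequence uniqueness."""
    kept = []
    for d1, r1, d2, r2 in read_pairs:
        if d1 + d2 > max_total_dist:
            continue                   # non-specific (random) ligation
        if recs_uniqueness[r1] < min_uniq or recs_uniqueness[r2] < min_uniq:
            continue                   # low-uniqueness RECS
        kept.append((d1, r1, d2, r2))
    return kept
```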
The contact frequencies are derived from a cell population that may consist of
several subpopulations of different chromosome structures. Without fully under-
standing the structural variations in a cell population, any structural inference
from the Hi-C data can be distorted [10]. A recent single-cell sequencing study
found that interchromosome (or trans) contacts have much higher variability
among cells of the same cell line than intra-chromosome (or cis) contacts [23].
To avoid potential uncertainty that may be caused by significant variations in
a cell line, we follow suggestions in the literature [9,2] and focus on cis contacts
within a chromosome.
To obtain normalized frequencies of the bias reduction methods, we run both
GDNorm and HiCNorm on the contact frequencies and ICE on the raw Hi-C
reads. The normalized frequencies by the YT method are downloaded from the
publication websites of the literature [9,13]. Note that although the primary
objective of BACH is to predict chromosome structures, it also estimates sys-
tematic biases in the prediction of chromosome structures, using the log-linear
regression model given in Eq. 2.
Hence, BACH can be regarded as a bias reduction method if we divide each
observed contact frequency by its estimated systematic biases and use the quo-
tient as the normalized frequency. To study the accuracy of bias estimation
by BACH, we also include BACH in the comparison of bias correction meth-
ods. The reproducibility between the two biological replicates and correlation to
FISH data achieved by the compared methods are discussed below.
Fig. 4. Comparison of the reproducibility in the mESC dataset. Plots (a) and (b)
illustrate the overall reproducibility and RHCF of GDNorm, HiCNorm, YT, and ICE
on the 20 chromosomes, chr1 to chr20 (chrX), in the mESC cell line at 40kb resolution,
respectively. Here, the distribution of Spearman’s correlation coefficients achieved by
each bias reduction method is represented as a solid curve over the 20 chromosomes.
Plots (c) and (d) show the overall reproducibility and RHCF of GDNorm and BACH
at 1M resolution, respectively.
studies concerning gene promoter-enhancer contacts [17] and spatial gene-gene in-
teraction networks [24]. To assess the capability of reducing systematic biases in
high contact frequencies, we calculate another Spearman’s correlation coefficient,
called the reproducibility of high contact frequencies (RHCF), by using only the
top 20% of bins with the highest observed contact frequencies.
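The RHCF computation can be sketched as follows (selecting bins on the flattened upper triangle and the rank-based correlation are our reading of the protocol):

```python
import numpy as np

def rhcf(obs, norm1, norm2, top_frac=0.20):
    """Reproducibility of high contact frequencies: Spearman correlation of
    two normalized matrices restricted to the top 20% of bins ranked by
    observed contact frequency."""
    n = obs.shape[0]
    iu = np.triu_indices(n, k=1)
    o, a, b = obs[iu], norm1[iu], norm2[iu]
    k = max(2, int(round(top_frac * len(o))))
    top = np.argsort(o)[-k:]               # highest observed frequencies
    # Spearman correlation on the selected bins (no ties assumed)
    ra = np.argsort(np.argsort(a[top])).astype(float)
    rb = np.argsort(np.argsort(b[top])).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```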
The Spearman’s correlation coefficients over the 23 chromosomes in the
GM06990 dataset are summarized in Figure 3. The average overall reproducibil-
ity of the observed (i.e., raw) contact frequencies is 0.711 and GDNorm achieves
the best overall reproducibility 0.811 on average while HiCNorm, YT, BACH,
and ICE obtain 0.799, 0.789, 0.761, and 0.721, respectively. GDNorm improves
the average overall reproducibility by up to 0.04 on an individual chromosome,
over the second best method, HiCNorm. In terms of RHCF, the improvement
by GDNorm over the second best method (HiCNorm) is more striking, 0.02 on
average and up to 0.13 on an individual chromosome.
In the experiments on the mESC dataset, all the selected methods are run
on the data at 40kb resolution except for BACH. The running time of BACH
is prohibitive for performing chromosome-wide bias correction on the mESC
dataset at the 40kb resolution, because it requires 5000 iterations to refine the
predicted structure by default and each iteration takes about 30 minutes on av-
erage on our computer. So, we excluded BACH from the experiments at 40kb
resolution, but will compare it with GDNorm at 1M resolution separately. The
comparisons over the 20 chromosomes in the mESC dataset at 40kb resolution
are summarized in Figure 4 (a) and (b). The average overall reproducibility
of the observed (raw) contact frequencies is 0.734. The average overall repro-
ducibility provided by GDNorm is 0.865, which is about 0.02 higher than the
average overall reproducibility (0.846) obtained by HiCNorm and 0.03 higher
than the third best (0.83) obtained by YT. Although ICE can eliminate system-
atic biases without assuming their specific sources, it achieves the lowest average
overall reproducibility, 0.783, which is significantly lower than the average re-
producibilities obtained by the other three methods. GDNorm achieves similar
improvements in terms of RHCF, which is also 0.02 higher than the second best
by HiCNorm on average and up to 0.04 on an individual chromosome. The com-
parisons between BACH and GDNorm at 1M resolution are shown in Figure 4
(c) and (d). GDNorm significantly outperforms BACH on both average overall
reproducibility (0.02) and average RHCF (0.07). In the tests on individual chro-
mosomes, the maximum improvement on RHCF by GDNorm is up to 0.15. This
result shows that, although GDNorm and BACH both include spatial distance
explicitly in their models, the gradient descent method of GDNorm can estimate
the systematic biases more accurately than the MCMC based optimization pro-
cedure of BACH. These experimental results demonstrate that GDNorm is able
to consistently improve on the reproducibility between biological replicates at
both high (40kb) and low (1M) resolutions.
The Normalized Contact Frequencies Obtained by GDNorm Are Well
Correlated to the FISH Data. To further validate the quality of normalized
contact frequencies, we use an mESC 2d-FISH dataset that contains distance
measurement for six pairs of genomic loci as our benchmark data. The six pairs of
genomic loci are distributed on chromosomes 2 and 11 of the mESC genome, with
three pairs on chromosome 2 and the other three on chromosome 11. The distance
between each pair of the genomic loci is measured by inter-probe distance on 100
cell images from 2d-FISH experiments and normalized by the size of cell nucleus
such that any change in the distance measurement is attributed solely to altered
nucleus size on the images as described in the literature [4]. The average of
the 100 normalized distance values for each pair of the genomic segments is
used to correlate with the normalized contact frequency corresponding to the
pair. The normalized frequencies are expected to be inversely correlated to the
corresponding spatial distance values. Table 2 compares Spearman’s correlation
coefficients obtained by all four methods. The correlation coefficient between
the 2d-FISH distance values and observed contact frequencies is low, −0.45 and
−0.25 in the HindIII and NcoI replicates, respectively. YT and GDNorm are
able to improve both correlation coefficients and achieve a strong correlation
(smaller than −0.6) in the HindIII replicate while HiCNorm and ICE fail to
deliver strongly correlated normalized frequencies in either replicate.
Fig. 5. The running time of GDNorm and HiCNorm on the mESC data at four different
resolutions. The Y-axis shows the running time in seconds and the X-axis indicates the
number of genomic segments at each resolution.
However, in our gradient descent method, the execution time of each iteration is
only linear in the number of segment pairs, which makes GDNorm faster than
HiCNorm. As illustrated in Figure 5, a simple experiment on the mESC data
with resolutions at 40kb, 80kb, 200kb, and 1M shows that, when the number of
genomic segments increases, the running time of HiCNorm grows much faster
than that of GDNorm.
4 Conclusion
The reduction of systematic biases in Hi-C data is a challenging computational
biology problem. In this paper, we proposed an accurate bias reduction method
that takes advantage of a more comprehensive model of causal relationships
among observed contact frequency, systematic biases and spatial distance. In
our simulation study, GDNorm was able to provide more accurate normalized
contact frequencies that resulted in improved chromosome structure prediction.
Our experiments on two real Hi-C datasets demonstrated that GDNorm achieved
a better reproducibility between biological replicates consistently at both high
and low resolutions than the other state-of-the-art bias reduction methods and
provided stronger correlation to published 2d-FISH data. The experiments also
showed GDNorm’s high time efficiency. With the rapid accumulation of high
throughput genome-wide chromatin interaction data, the method could become
a valuable tool for understanding the higher order architecture of chromosome
structures.
References
1. Dekker, J., Marti-Renom, M.A., Mirny, L.A.: Exploring the three-dimensional or-
ganization of genomes: interpreting chromatin interaction data. Nature Reviews.
Genetics 14(6), 390–403 (2013)
2. Hu, M., Deng, K., Qin, Z., Liu, J.S.: Understanding spatial organizations of chro-
mosomes via statistical analysis of Hi-C data. Quantitative Biology 1(2), 156–174
(2013)
3. Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T.,
Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., Sandstrom, R.,
Bernstein, B., Bender, M.A., Groudine, M., Gnirke, A., Stamatoyannopoulos, J.,
Mirny, L.A., Lander, E.S., Dekker, J.: Comprehensive mapping of long-range inter-
actions reveals folding principles of the human genome. Science 326(5950), 289–293
(2009)
4. Eskeland, R., Leeb, M., Grimes, G.R., Kress, C., Boyle, S., Sproul, D., Gilbert,
N., Fan, Y., Skoultchi, A.I., Wutz, A., Bickmore, W.A.: Ring1B compacts chro-
matin structure and represses gene expression independent of histone ubiquitina-
tion. Molecular Cell 38(3), 452–464 (2010)
5. Dekker, J., Rippe, K., Dekker, M., Kleckner, N.: Capturing chromosome confor-
mation. Science 295(5558), 1306–1311 (2002)
6. Simonis, M., Klous, P., Splinter, E., Moshkin, Y., Willemsen, R., de Wit, E., van
Steensel, B., de Laat, W.: Nuclear organization of active and inactive chromatin
domains uncovered by chromosome conformation capture-on-chip (4C). Nature
Genetics 38(11), 1348–1354 (2006)
7. Zhao, Z., Tavoosidana, G., Sjölinder, M., Göndör, A., Mariano, P., Wang, S., Kan-
duri, C., Lezcano, M., Sandhu, K.S., Singh, U., Pant, V., Tiwari, V., Kurukuti,
S., Ohlsson, R.: Circular chromosome conformation capture (4C) uncovers exten-
sive networks of epigenetically regulated intra- and interchromosomal interactions.
Nature Genetics 38(11), 1341–1347 (2006)
8. Dostie, J., Richmond, T.A., Arnaout, R.A., Selzer, R.R., Lee, W.L., Honan, T.A.,
Rubio, E.D., Krumm, A., Lamb, J., Nusbaum, C., Green, R.D., Dekker, J.: Chro-
mosome Conformation Capture Carbon Copy (5C): a massively parallel solution
for mapping interactions between genomic elements. Genome Research 16(10),
1299–1309 (2006)
9. Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J.S., Ren,
B.: Topological domains in mammalian genomes identified by analysis of chromatin
interactions. Nature 485(7398), 376–380 (2012)
10. Hu, M., Deng, K., Qin, Z., Dixon, J., Selvaraj, S., Fang, J., Ren, B., Liu, J.S.:
Bayesian inference of spatial organizations of chromosomes. PLoS Computational
Biology 9(1), e1002893 (2013)
11. Marti-Renom, M.A., Mirny, L.A.: Bridging the resolution gap in structural model-
ing of 3D genome organization. PLoS Computational Biology 7(7), e1002125 (2011)
12. Zhang, Z., Li, G., Toh, K.-C., Sung, W.-K.: Inference of spatial organizations of
chromosomes using semi-definite embedding approach and Hi-C data. In: Deng, M.,
Jiang, R., Sun, F., Zhang, X. (eds.) RECOMB 2013. LNCS, vol. 7821, pp. 317–332.
Springer, Heidelberg (2013)
13. Yaffe, E., Tanay, A.: Probabilistic modeling of Hi-C contact maps eliminates sys-
tematic biases to characterize global chromosomal architecture. Nature Genet-
ics 43(11), 1059–1065 (2011)
14. Imakaev, M., Fudenberg, G., McCord, R.P., Naumova, N., Goloborodko, A., Lajoie,
B.R., Dekker, J., Mirny, L.A.: Iterative correction of Hi-C data reveals hallmarks
of chromosome organization. Nature Methods (2012)
15. Cournac, A., Marie-Nelly, H., Marbouty, M., Koszul, R., Mozziconacci, J.: Nor-
malization of a chromosomal contact map. BMC Genomics 13, 436 (2012)
16. Hu, M., Deng, K., Selvaraj, S., Qin, Z., Ren, B., Liu, J.S.: HiCNorm: removing
biases in Hi-C data via Poisson regression. Bioinformatics 28(23), 3131–3133 (2012)
17. Jin, F., Li, Y., Dixon, J.R., Selvaraj, S., Ye, Z., Lee, A.Y., Yen, C.A., Schmitt,
A.D., Espinoza, C.A., Ren, B.: A high-resolution map of the three-dimensional
chromatin interactome in human cells. Nature 503(7475), 290–294 (2013)
18. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics (Oxford, England) 25(14), 1754–1760 (2009)
19. Lindsey, J.K., Altham, P.M.E.: Analysis of the human sex ratio by using overdis-
persion models. Journal of the Royal Statistical Society. Series C (Applied Statis-
tics) 47(1), 149–157 (1998)
20. Rousseau, M., Fraser, J., Ferraiuolo, M.A., Dostie, J., Blanchette, M.: Three-
dimensional modeling of chromatin structure from interaction frequency data using
Markov chain Monte Carlo sampling. BMC Bioinformatics 12(1), 414 (2011)
21. Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Acta
Crystallographica Section A 32(5), 922–923 (1976)
22. Goulden, C.H.: Methods of Statistical Analysis, 2nd edn. Wiley, New York (1956)
23. Nagano, T., Lubling, Y., Stevens, T.J., Schoenfelder, S., Yaffe, E., Dean, W., Laue,
E.D., Tanay, A., Fraser, P.: Single-cell Hi-C reveals cell-to-cell variability in chro-
mosome structure. Nature 502(7469), 59–64 (2013)
24. Wang, Z., Cao, R., Taylor, K., Briley, A., Caldwell, C., Cheng, J.: The properties
of genome conformation and spatial gene interaction and regulation networks of
normal and malignant human cell types. PLoS ONE 8(3), e58793 (2013)
25. Dobson, A.J.: An Introduction to Generalized Linear Models. Chapman and Hall,
London (1990)
Pacemaker Partition Identification
Sagi Snir
1 Introduction
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 281–295, 2014.
c Springer-Verlag Berlin Heidelberg 2014
282 S. Snir
each evolutionary lineage adhere to the pace of a pacemaker (PM), and change
their evolutionary rate (approximately) in unison although the pacemaker’s pace
at different lineages may differ. The UPM model is compatible with the large
amount of data on fast-evolving and slow-evolving organismal lineages, primar-
ily different groups of mammals [5]. An obvious alternative to the UPM is the
Molecular Clock (MC) model of evolution, under which genes evolve at roughly
constant albeit different (gene-specific) rates [32], implying the constancy of
gene-specific relative evolutionary rates.
In a line of works [26,30,25] we established the superiority of the UPM model
over the MC by explaining a larger fraction of the variance in the branch lengths
of thousands of gene trees spanning the entire tree of life. Although highly statistically
significant, the advantage of the UPM over the MC was small in absolute terms,
and both models exhibited considerable evolutionary rate overdispersion.
A plausible explanation for the latter is that, instead of a single, apparently weak
(overdispersed) PM, there are multiple independent pacemakers, each affecting
a (different) subset of genes and each less dispersed than the single pacemaker.
Throughout, we use the notation UPM to refer to the model and the PM term
for the pacemaker as an object.
Primarily, we investigate the requirements for the identification of distinct
PMs and assignment of each gene to the appropriate PM. Such an assignment
forms a partition over the set of genes and hence we denote this task as the
PM partition identification (PMPI) problem. PM identification depends on the
number of analyzed genes, the number of target PMs, the intrinsic variability of
the evolutionary rate for each gene and the intrinsic variability of each PM. The
PMPI problem is hard both in theory and in practice, as it involves large
amounts of data obscured by massive noise. A possible direction to
pursue is to exploit the signal in the data themselves in order to reduce the
search space and focus only on relevant partitions.
In this work, a first attempt in this direction is made by devising and employing
a novel technique, using a series of analytic tools, to solve the PMPI
problem and assess the quality of the derived solution. We tackle theoretical
computational and statistical issues, as well as challenging engineering obstacles
that arise along the way. These include guaranteeing homoscedasticity [29] by
working in the log space, removing gene order dependency [1] by employing
Deming regression [6,10], and graph completion through most reliable paths.
The result is a partial gene correlation graph whose edge lengths (inversely)
represent correlation, which we subsequently embed into Euclidean space
while preserving the distances. We apply standard clustering tools to these data
and assess the significance of the result. We next formulate the PMPI problem
as a recoloring problem [22,21], where a gene's PM is perceived as its color and
the (set of) genes associated with a certain PM form a color class. To measure
the quality of the partition reconstruction, one may look for the minimum number
of genes that need to be recolored so that every part in the reconstructed
partition is monochromatic. This number (of recolored genes) is called the
partition distance [14] and can be computed by a matching algorithm. We however
Pacemaker Partition Identification 283
associated with PM P_k has actual rate at time t_j: r_{i,j} = r_i e^{α_{i,j}} e^{β_{k,j}}. Hence, for
β < 0 the PM slows down its associated genes, for β > 0 genes are accelerated
by their PM, and for β = 0 the PM is neutral. Assume every gene is associated
with some PM and let PM(g_i) be the PM of gene g_i. Then the latter defines a
partition over the set of genes G, where genes g_i and g_{i′} are in the same part if
PM(g_i) = PM(g_{i′}).
Comment 2. The presence of two genes in the same part (PM) does not imply
anything about the magnitudes of their rates, only that their rates diverge in unison.
Observation 1. Assume gene g_i has error factor α_{i,j} = 0 for all time periods
t_j, 1 ≤ j ≤ τ, and let P = PM(g_i) be the pacemaker of gene g_i with relative
paces e^{β_j}. Then at all periods t_j, r_{i,j} = r_i e^{β_j}.
Observation 1 implies that if genes g_i and g_{i′} belong to the same pacemaker,
and both genes have zero error factor at all periods, then the ratio
between the edge lengths at each period is constant and equal to r_i / r_{i′}. This
however is not necessarily true if one of the error factors is nonzero, or if genes g_i
and g_{i′} do not belong to the same pacemaker. Recall that we do not see the
gene intrinsic rates (and hence also the ratio between them). However, if we see
the same ratio between edge lengths across all time periods, we can draw conclusions
about the error factors and, possibly, about the genes' belonging to the same PM.
In order to tackle the PM identification problem, we impose some statistical
structure (as observed in real data [12]) on the given setting. We assume
that the error factor of each gene is small enough at every period, so
that all genes belonging to the same PM change their actual rate in unison.
Similarly, we assume that β_k varies enough that genes from different PMs (parts)
can be distinguished (otherwise, no difference exists except their random error
factors).
Assumption 1.
1. For all genes g_i and periods t_j, the gene error factors α_{i,j} follow a normal
distribution α_{i,j} ∼ N(0, σ_G²),
2. For all PMs P_k and periods t_j, the PM paces β_{k,j} follow a normal distribution
β_{k,j} ∼ N(0, σ_P²).
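The rate model r_{i,j} = r_i e^{α_{i,j}} e^{β_{k,j}} under Assumption 1 can be simulated directly. The following is a minimal sketch (function and parameter names are ours, not from the paper); genes assigned to the same pacemaker share one pace draw per period, while each gene gets its own error factor:

```python
import math
import random

def simulate_rates(intrinsic_rates, pm_of_gene, num_periods,
                   sigma_gene, sigma_pm, seed=0):
    """Simulate actual rates r_{i,j} = r_i * exp(alpha_{i,j}) * exp(beta_{k,j}),
    where alpha_{i,j} ~ N(0, sigma_gene^2) is the gene error factor and
    beta_{k,j} ~ N(0, sigma_pm^2) is the shared pace of pacemaker k at period j."""
    rng = random.Random(seed)
    # One shared pace per pacemaker and period (Assumption 1, item 2).
    beta = {(k, j): rng.gauss(0.0, sigma_pm)
            for k in set(pm_of_gene) for j in range(num_periods)}
    return [[r_i * math.exp(rng.gauss(0.0, sigma_gene)) *
             math.exp(beta[(pm_of_gene[i], j)])
             for j in range(num_periods)]
            for i, r_i in enumerate(intrinsic_rates)]

# Three genes, two pacemakers (genes 0 and 1 share pacemaker 0):
rates = simulate_rates([1.0, 2.0, 0.5], [0, 0, 1],
                       num_periods=4, sigma_gene=0.05, sigma_pm=0.5)
```

With sigma_gene set to zero, two genes on the same pacemaker keep a constant edge-length ratio across all periods, which is exactly the content of Observation 1.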
We denote this as the log transformation and also observe the following:
Observation 3. Under the log transformation, the trend line log ℓ_{i′,j} =
a log ℓ_{i,j} + b has slope one (a = 1) and intercept b = log ρ_{i,i′}.
We will use these properties in our calculations.
3. Gene Order Independence: The final problem with the linear regression
has to do with the basic assumptions of least squares analysis. In standard
least squares, the assumption is that the independent variable x is error-free,
while only the dependent variable y deviates from its expected values.
In our case, however, the choice between the variables is arbitrary and
both are subject to deviation according to their characteristic variance
σ_G². Handling this case with standard least squares would introduce an arbitrary
bias due to the selection of the variables [1]. To handle it, we apply
Deming regression [6,10]. This approach assumes an explicit probabilistic
model for the variables and extracts closed-form expressions (in the observed
variables) for the sought expected values. To adjust it to our specific
case, we will use the observations drawn above. The linear model assumed is
of the type η = α + βξ, where the observations of ξ and η, (x_1, ..., x_n) and
(y_1, ..., y_n) respectively, have normally distributed errors: (i) x_i = ξ_i + ε_{x_i},
and (ii) y_i = η_i + ε_{y_i} = α + βξ_i + ε_{y_i}. As can be seen, this is exactly our
case. The likelihood function of this model is:

f = ∏_{i=1}^{n} (2πσ²)^{-1/2} exp(−(x_i − ξ_i)² / (2σ²)) · (2πσ²)^{-1/2} exp(−(y_i − α − βξ_i)² / (2σ²))    (1)
Under the general formulation, the ML value for α is α = ȳ − x̄β, where x̄
and ȳ are the average values of the x_i and y_i. However, in our formulation we have
β = 1 and hence α = ȳ − x̄. Having α at hand, we can reconstruct the trend line
and obtain the deviation of every point from it. Finally, by our formulation, ρ_{i,i′}
is given by exp(α), and the correlation between the rates is the standard sample
Pearson correlation coefficient r(X, Y) [29]:
r = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / √( Σ_{i=1}^{n} (X_i − X̄)² · Σ_{i=1}^{n} (Y_i − Ȳ)² ).    (2)
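The slope-one Deming fit and the Pearson coefficient of Eq. (2) are simple enough to sketch directly. The following assumes the log-transformed branch lengths are already at hand; function names are ours:

```python
from statistics import mean

def slope_one_deming(xs, ys):
    """Fit the trend line y = x + b assumed by the log transformation
    (Observation 3: slope a = 1, intercept b = log rho). With the slope
    fixed at 1, the ML intercept is ybar - xbar; each point's orthogonal
    deviation from the line is (y - x - b) / sqrt(2)."""
    b = mean(ys) - mean(xs)
    deviations = [(y - x - b) / 2 ** 0.5 for x, y in zip(xs, ys)]
    return b, deviations

def pearson_r(xs, ys):
    """Standard sample Pearson correlation coefficient (Eq. 2)."""
    xbar, ybar = mean(xs), mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = (sum((x - xbar) ** 2 for x in xs)
           * sum((y - ybar) ** 2 for y in ys)) ** 0.5
    return num / den
```

On perfectly correlated log-lengths the deviations vanish and r = 1, which is the idealized same-PM, zero-error case of Observation 1.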
that aims at minimizing the within-cluster sum of squares (WCSS). These
techniques operate in Euclidean space, and hence some distance-preserving
technique is required to embed the correlation graph G in that space. Multidimensional
Scaling [17] (also called Euclidean embedding) is a family of approaches
for this task. Kruskal's iterative algorithm [16] for non-metric multidimensional
scaling (MDS) receives as input a (possibly partial) set of distances, and the desired
embedding should preserve the order of the original distances. It requires,
however, a full matrix as a starting guess.
Our approach here is to join every two nodes by the most reliable connection
with the highest correlation. This translates to finding a path with the
minimum number of nodes (hops) such that the product of the corresponding
weights is minimal. This distance measure, min hop min weight (MHMW),
is also useful in communication networks, where hop distance corresponds to
reliability [13]. While the naive algorithm for the latter runs in time O(n³), it can
easily be seen that the problem can be solved in time O(n² log diam(G)), where
diam(G) is the diameter of G. The completed graph Ĝ serves as input to
classical multidimensional scaling (CMDS) [4], whose output serves as the initial
guess for Kruskal's non-metric MDS. Once we have the embedding, we can
apply k-means and obtain the desired clustering.
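The MHMW completion can be sketched with one layered BFS per source node. This is a simplification (it realizes the naive per-source scheme rather than the O(n² log diam(G)) algorithm mentioned above), and it assumes edge weights are the correlations, with "min weight" meaning the minimum product over min-hop paths; all names are ours:

```python
from collections import deque
from math import inf

def mhmw_complete(nodes, weight):
    """Complete a weighted graph under the min-hop-min-weight (MHMW) measure:
    for each pair, consider only paths with the minimum number of hops and,
    among those, take the minimum product of edge weights.
    `weight` maps frozenset({u, v}) -> weight of an existing edge."""
    adj = {u: [] for u in nodes}
    for e, w in weight.items():
        u, v = tuple(e)
        adj[u].append((v, w))
        adj[v].append((u, w))
    completed = {}
    for s in nodes:
        hops = {s: 0}
        best = {s: 1.0}  # min product of weights over min-hop paths from s
        q = deque([s])
        while q:
            u = q.popleft()   # FIFO BFS pops nodes in nondecreasing hop order
            for v, w in adj[u]:
                if v not in hops:               # first discovery: min-hop layer
                    hops[v] = hops[u] + 1
                    best[v] = best[u] * w
                    q.append(v)
                elif hops[v] == hops[u] + 1:    # same layer: keep min product
                    best[v] = min(best[v], best[u] * w)
        for t in nodes:
            if t != s:
                completed[frozenset({s, t})] = best.get(t, inf)
    return completed
```

Because a FIFO BFS pops all layer-h nodes before any layer-(h+1) node, every predecessor of a vertex has contributed its product before that vertex is expanded, so the per-source pass is correct in O(n + m) time.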
Below is the complete formal procedure PMPI:
Procedure PMPI(G, δ_r):
1. Set the correlation graph G = (V, E) with V = ∅, E = ∅
2. V = {g | g is a gene in G}
3. for all g_i, g_j ∈ G:
   – apply the Deming regression between g_i and g_j to determine r(g_i, g_j)
   – if r(g_i, g_j) ≥ δ_r, then add (g_i, g_j) to E and set w(g_i, g_j) ← r(g_i, g_j)
4. Ĝ ← MHMW(G)
5. apply classical multidimensional scaling (cmdscale) to the full graph Ĝ
6. apply Kruskal's iterative algorithm (isoMDS) to the original distance
matrix, starting from the cmdscale output
7. apply k-means to the resulting embedding
4 Simulation Analysis
f in the first approach is simply the identity function f(c) = c for every c ∈ C.
This essentially defines a recoloring problem [22] where the goal is to recolor the
least number of elements in P (or C′) such that f(C(x)) = C′(x) for every
element. Hence the cost of f is the number of elements x s.t. f(C(x)) ≠ C′(x).
Now, since the mapping is from C′ to C, f is a bijection, or simply a matching
between the sets of colors. In [14], Gusfield noted that the partition distance
problem can be cast as an assignment problem [18] and hence solved by
a maximum flow in a bipartite graph in time O(mn + n² log n) [2]. Matching
problems are among the most classical and best investigated in theoretical, as
well as practical, computer science [28]. Although exact polynomial-time
algorithms of many flavors exist [2], a host of works on approximate solutions
have been introduced. For its very simple implementation and empirically accurate
results, based on theoretical properties we show below, we chose to
use a very simple greedy algorithm, named Greedy PartDist. The algorithm
works recursively and, at each recursion, chooses the heaviest edge (u, v) in the
graph, adds it to the matching M, and removes from the graph all other edges
(u, v′) and (u′, v) for u′, v′ ∈ V. It is easy to see that the algorithm runs in
time O(m log n), where the complexity of the sorting operation dominates. This
algorithm provides a 1/2-approximation guarantee [23] for a general input and
the same approximation guarantee can be obtained by the generic recursive analysis
of the local ratio technique [3].
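Greedy PartDist fits in a few lines. This sketch sorts the edges once instead of repeatedly extracting the heaviest one, which is equivalent and preserves the O(m log n) bound; names are ours:

```python
def greedy_partdist(edges):
    """Greedy 1/2-approximation for the maximum-weight bipartite matching
    underlying the partition distance: repeatedly take the heaviest remaining
    edge and discard all edges sharing an endpoint with it.
    `edges` is a list of (weight, u, v) with u a PM node and v a cluster node."""
    matching = []
    used_u, used_v = set(), set()
    for w, u, v in sorted(edges, reverse=True):  # O(m log m) sort dominates
        if u not in used_u and v not in used_v:
            matching.append((u, v, w))
            used_u.add(u)
            used_v.add(v)
    return matching
```

When every color is correctly clustered, the heaviest edge at each PM node points to its own cluster, so the greedy choice is exact; this is the content of the claim below.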
Claim. Assume every color is correctly clustered. Then Algorithm Greedy Part-
Dist returns the correct result.
Proof. The proof follows by induction on the number of PMs |P|. For a single
PM, there is a single edge in the bipartite graph and this edge is chosen. For
|P| > 1, note that by the assumption, the heaviest edge emanating from each
PM (node) in P leads to its corresponding color class in the reconstructed partition P′.
In particular, this is true for the heaviest edge in the bipartite graph, which links the
nodes corresponding to some PM P. The algorithm chooses that edge and
removes all edges adjacent to it. Therefore, PM P was correctly matched, and by
the induction hypothesis the algorithm returns the correct result.
was derived by the Deming regression. This defines the correlation graph
described above.
In order to apply clustering algorithms to the elements, the elements need to
be embedded in some Euclidean space. Multidimensional scaling takes a set of
dissimilarities (over a set of elements) and returns a set of points in a Euclidean
space such that the distances between the points are approximately equal to
the dissimilarities. A set of Euclidean distances on n points can be represented
exactly in at most n − 1 dimensions. The procedure cmdscale follows the analysis
of Mardia [20] and returns the best-fitting k′-dimensional representation, where
k′ may be less than the requested dimension k (and is by definition smaller than n). In our
implementation, in order to avoid any distortion, we set k to the maximum value
determined by the data (which is found and returned by the method). We used
the version of cmdscale implemented in R. As cmdscale requires a complete
graph, we used the min-hop-min-weight (MHMW) algorithm. The output of
MHMW is a complete graph where the weight between any two points is the
lightest (min-weight) path among all min-hop reliable paths (paths between trees
for which a correlation was derived). At this point we can use cmdscale to map
this graph into Euclidean space. Note, however, that this mapping corresponds
not to the original graph but to some approximation of it derived from the output
of the MHMW algorithm. This mapping, however, serves as an initial guess
for the iterative mapping of the original, partial, distance matrix. This iterative
Fig. 1. Partition distance obtained by applying the PMPI technique on simulated data
versus the gene/pacemaker variance ratio; the plots are shown for 2, 4, 6 and 8 clusters
(PMs)
Working with real data poses other serious problems that require solutions.
The first is that here we do not have exactly τ periods with an edge length ℓ_{i,j} for
every gene g_i, but rather a set of trees with loose pairwise agreement. This loose
agreement is due to vast discordance between the histories of the various genes,
a result of phenomena such as horizontal gene transfer (HGT) or incomplete
lineage sorting (ILS; see more details below). However, discordance can arise
even from the simple fact that some gene is missing in some specific species,
resulting in a contraction of internal nodes.
To cope with this problem, we employ the idea of Maximum Agreement Subtrees
(MAST) [9], which seeks the largest subset of species on which the
two trees agree. Under MAST (or, in general, any subset of the leaf set),
edges not connecting any species in the induced tree are removed, and internal
nodes with degree two are contracted, while maintaining the original lengths of
the paths. Hence, for every pair of genes (trees), we need to find the MAST and
compare the lengths of corresponding edges.
Additionally, here, as opposed to a simulation study, we do not know the "real"
partition and cannot compare the resulting clustering to it. Therefore, another
method for assessing the results must be employed: we compare the result to
what would be obtained under a random model. Recall
that at the final stage of the PMPI procedure we employ the k-means algorithm,
which seeks to minimize an error measure W_K. This error measure is the
sum of all pairwise distances between members of the same cluster, across all
clusters in the partition. Clearly, the more clusters, the smaller W_K is.
292 S. Snir
Fig. 2. The deltaGap function for 2755 analyzed genes, for k from 1 to 10. According to
Tibshirani et al. [27], the smallest k producing a non-negative value of deltaGap[k] =
Gap[k] − Gap[k+1] + sigma[k+1] indicates the optimal number of clusters.
However, the decrease in W_K is largest near the real number of
clusters k = K, and vanishes slowly for k > K. Therefore, a threshold for the
improvement (decrease) in W_K must be defined as a stopping condition, beyond
which we do not increase the number of clusters k. The gap statistic analysis [27]
compares the improvement in W_K under the real data to that under a random
model. The gap (between the improvements) forms an "elbow" at the optimal
(real) K, and this provides the stopping condition.
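The stopping rule quoted in the caption of Figure 2 can be sketched as a one-pass scan, assuming the arrays Gap[k] and sigma[k] have already been computed for k = 1..K (index 0 unused); the function name is ours:

```python
def optimal_k_by_gap(gap, sigma):
    """Stopping rule of Tibshirani et al.: return the smallest k for which
    deltaGap[k] = Gap[k] - Gap[k+1] + sigma[k+1] is non-negative.
    `gap[k]` and `sigma[k]` are indexed by k = 1..K (index 0 is unused)."""
    K = len(gap) - 1
    for k in range(1, K):
        if gap[k] - gap[k + 1] + sigma[k + 1] >= 0:
            return k
    return K  # no elbow found: fall back to the largest k examined
```

For the real data of Section 5, this rule fires at k = 2, the "elbow" visible in Figure 2.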
The real data we chose to analyze are those we used previously [26]: a set
of gene trees covering 2755 orthologous families from 100 prokaryotic
genomes [24]. Prokaryotic evolution is characterized by the pervasive phenomenon
of horizontal gene transfer (HGT) [7,11], resulting in different topologies for almost
any two gene trees. To account for this, we employed the MAST procedure
for every gene pair and considered a pair only if its MAST contained at least
10 leaves (species). Branch lengths of the original trees were used to compute
the branch lengths of the corresponding MAST components (by computing path
lengths). The variant of Deming regression in the log space as described in Sec-
tion 4 was performed on the logarithms of the lengths of equivalent branches in
both MAST components. The standard sample Pearson correlation coefficient
was used as the measure of correlation between the branch lengths. The graph
of correlations between the gene trees contained a giant connected component
containing 2755 genes and 1,250,972 edges, 33% of the maximum possible num-
ber (an edge in the graph exists only when the MAST for the corresponding pair
of trees consists of at least 10 species). To cluster these genes according to the
correlation between their branch lengths, the data were projected using isoMDS
into a 30-dimensional space based on the sparse matrix where 1 − r (correlation
coefficient) was used as a distance. We ran k-means for k spanning the range from
2 to 30. The random model we chose to consider is the fully random uniform
model (i.e., α = 0, no advantage to source PM) and we compared the results
to this model. Grouping these 2755 genes in two clusters containing 1550 and
1205 members, respectively, yields the optimal partitioning according to the gap
function statistics (Figure 2). We see the typical “elbow” at the value of k = 2.
The absolute results were 5,587,960 for the total graph weight, 2,686,914 and
2,285,921 within each of the clusters, and 615,125 between them. Analysis
of the cluster membership reveals small albeit significant differences in the
representation of functional categories of genes, but no outstanding biologically
relevant trends were detected. Therefore, we can hypothesize that if the
data indeed give rise to multiple PMs, this signal is completely obscured by the
noise produced by the genes themselves (i.e., loose adherence to the associated
PM) and by noise introduced by artificial factors such as MAST, multiple sequence
alignment, and phylogenetic reconstruction.
6 Conclusions
between two PMs, and the improvement in the statistical explanation is small albeit
highly significant. The partition of the different functional gene groups between
the two PMs is also statistically significant (with respect to random partitioning of each
group); however, the biological interpretation of this partitioning is challenging
and remains for future research.
Acknowledgments. We thank Eugene Koonin and Yuri Wolf for helpful discussions,
in particular on the interpretation of the biological significance of the resulting
clustering of the real data in Section 5.
References
1 Introduction
The de Bruijn graphs are the key algorithmic technique in genome assembly [1–3]
and have resulted in dozens of software tools [4–10]. In addition, the de Bruijn graphs
have been used for repeat classification [11], de novo protein sequencing [12],
synteny block construction [13], multiple sequence alignment [14], and other
applications in genomics and proteomics. In fact, the de Bruijn graphs have
become so ubiquitous in bioinformatics that one rarely questions the
intrinsic limitations of this approach.
We argue that the original definition of the de Bruijn graph is far from being
optimal for the challenges posed by the assembly problem. We further propose
a new notion of the Manifold de Bruijn (M-Bruijn) graph (that generalizes the
concept of the de Bruijn graph) and show that it has advantages over the classical
de Bruijn graph in assembly applications.
The disadvantages of the de Bruijn graphs became apparent when bioinformaticians
moved from assembling cultivated bacterial genomes (with rather uniform
read coverage) to assembling genomes from single cells (with four orders of
magnitude variation in coverage [15]). In such projects, selecting a fixed k-mer
size is detrimental, since k should be small in low-coverage regions (otherwise the
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 296–310, 2014.
c Springer-Verlag Berlin Heidelberg 2014
Manifold de Bruijn Graphs 297
Fig. 1. (A) A circular string String =CATCAGATAGGA. The de Bruijn graphs (B)
DB(Reads, 3) and (C) DB(Reads, 4) on a set of Reads = {CATC, ATCA, TCAG,
CAGA, AGAT, GATA, TAGG, GGAC, ACAT } drawn from that circular genome.
The small value k = 3 "glues" many repeats and makes DB(Reads, 3) tangled, while
the larger value k = 4 fails to detect overlaps and makes DB(Reads, 4) fragmented.
a more accurate but impractical strategy. The question thus arises whether one
can provide benefits similar to the IDBA approach in a single iteration that
considers the entire range of possible k-mer sizes, rather than varying it from one
iteration to another. The Manifold de Bruijn (M-Bruijn) graph achieves this
goal by automatically varying k-mer sizes according to the input data.
Fig. 4. (A) A set of Reads = {CATC, ATCA, TCAG, CAGA, AGAT, GATA, TAGG,
GGAC, ACAT }. (B) All paths corresponding to reads from Reads and V = {CA, AC,
TC, AGA, AT, TA, AGG } (C) AB(Reads, V ).
Below we address the question of how to choose V so that the resulting assembly
AB(Reads, V) improves on the classical assembly approach represented
by AB(Reads, Σ^{k−1}) = DB(Reads, k).
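The classical side of this identity, DB(Reads, k), is easy to construct explicitly. A minimal Python sketch (the function name is ours, not from the paper), run on the reads of Figure 1:

```python
def de_bruijn(reads, k):
    """Classical de Bruijn graph DB(Reads, k): every (k-1)-mer occurring in a
    read is a vertex, and every k-mer contributes an edge from its prefix
    (k-1)-mer to its suffix (k-1)-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    nodes = {u for e in edges for u in e}
    return nodes, edges

reads = ["CATC", "ATCA", "TCAG", "CAGA", "AGAT",
         "GATA", "TAGG", "GGAC", "ACAT"]
nodes3, edges3 = de_bruijn(reads, 3)   # the tangled graph of Figure 1(B)
nodes4, edges4 = de_bruijn(reads, 4)   # the fragmented graph of Figure 1(C)
```

At k = 3 the nine reads collapse onto only 8 dinucleotide vertices (the "gluing" of repeats), while at k = 4 each read contributes a single isolated edge and the graph falls apart, illustrating the fixed-k dilemma the M-Bruijn graph is designed to avoid.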
The corollary below reduces the search for irreducible words to the search for
r-irreducible words:
Thus a simple linear time algorithm that scans the set of pairs of indices of
all r-irreducible words reveals the set of irreducible words.
Fig. 6. The suffix tree (right) for a string String =CAGGCA. The four outposts (GG,
GC, CAG, and AG) end in symbols colored in red in the suffix tree. While CAG and AG
are two r-irreducible words ending at position 3, only one of them (AG) is irreducible
with respect to String.
Proof. The suffix tree for String can be constructed in linear time [21], the set of
r-irreducible words (outposts, represented by pairs of indices) can be computed
in linear time by a depth-first search of this suffix tree, and all irreducible words
can be derived from the r-irreducible words in linear time (Corollary 1).
Fig. 7. Illustration of a consistent word (i) and inconsistent words (ii), (iii), and (iv)
with respect to Reads. Perfectly aligned symbols are shown in the same color (marked
by dotted lines), while misalignments are shown in different colors (marked with a ≠ sign).
for Reads, and define Suf[x, i] as the suffix starting at position i of read_x$_x.
In T(Reads), each Suf[x, i] corresponds to a root-to-leaf path. An edge in
T(Reads) is called trivial if it is labeled by $_x for 1 ≤ x ≤ m (see Figure 8). A
vertex in T(Reads) is called a branching vertex if it has at least two non-trivial
outgoing edges in T(Reads). Given an edge from vertex u to vertex v, we say
that all suffixes (leaves) in the subtree rooted at v are after v in T(Reads).
Fig. 8. The generalized suffix tree T (Reads) (right) for Reads = {CAGCA, AGATT,
ATTGC } (left). Trivial edges are shown by dotted lines in T (Reads). If an outpost
appears only in one read, the outpost ends in a symbol colored in the color of that read;
if an outpost appears in multiple reads, the outpost ends in a symbol colored in red.
The three outposts (CAG, AGC, GC) in Read1 end in symbols colored in brown (or
red), the four outposts (AGA, GA, AT, TT) in Read2 end in symbols colored in blue
(or red), and the four outposts (AT, TT, TG, GC) in Read3 end in symbols colored in
green (or red) in T (Reads).
Fig. 9. The generalized suffix tree T (Reads) (left) for Reads = {ACGAC, TTAGA,
CGTTA } (top). Trivial edges are shown by dotted lines in T (Reads). The three
outposts (ACG, CGA, GA) in read1 end in symbols colored in brown (or red), the
four outposts (TT, TA, AG, GA) in read2 end in symbols colored in blue (or red), and
the four outposts (CGT, GT, TT, TA) in read3 end in symbols colored in green (or
red) in T (Reads).
Proof. The generalized suffix trees for Reads and the reversed reads can be constructed
in linear time [21]. We can then compute all Right_x(i) and Left_x(i) for every i
(1 ≤ i ≤ |read_x|) of read_x in linear time through a depth-first search of each
tree. Algorithm 1 then derives all the irreducible words in linear time.
[Figure 10 data: read1 = CAGCA; Right1(1..5) = 3, 4, 4, 6, 6; Left1(1..5) = 0, 0, 2, 2, 3.]
Fig. 10. Algorithm 1 identifies three irreducible words in read1 : w(1, 1, 3) = CAG,
w(1, 2, 4) = AGC and w(1, 3, 5) = GCA (shown as red points)
[Figure 11 data: read2 = AGATT; Right2(1..5) = 3, 3, 4, 5, 6; Left2(1..5) = 0, 1, 2, 3, 4.]
Fig. 11. Algorithm 1 identifies three irreducible words in read2 : w(2, 2, 3) = GA,
w(2, 3, 4) = AT and w(2, 4, 5) = TT (shown as red points)
[Figure 12 data: read3 = ATTGC; Right3(1..5) = 2, 3, 4, 5, 6; Left3(1..5) = 0, 1, 2, 3, 3.]
Fig. 12. Algorithm 1 identifies three irreducible words in read3 : w(3, 1, 2) = AT,
w(3, 2, 3) = TT and w(3, 3, 4) = TG (shown as red points)
308 Y. Lin and P.A. Pevzner
"parallel" edges, all these edges have the same shift and the same shift tag. We thus
substitute all such edges by a single edge. It is easy to see that MB(Reads) is a set
of paths (including paths consisting of a single vertex) or cycles. Each cycle spells
a sequence that we refer to as a cyclic contig. Each path spells a sequence that we
refer to as a linear contig, after concatenating it with the longest prefix tag of its
first vertex and the longest suffix tag of its last vertex.
Figure 13 shows an example of an M-Bruijn graph on a set of reads.
Fig. 13. The M-Bruijn graph M B(Reads), where Reads = {CATC, ATCA, TCAG,
CAGA, AGAT, GATA, TAGG, GGAC, ACAT } drawn from a circular string
String =CATCAGATAGGA, and Irreducible(Reads) = {CAT, TC, CAG, AGA,
GAT, TA, GG, AC}. Compared to Figure 5, M B(Reads) reconstructs M B(String)
and thus the circular string String. M B(Reads) is neither tangled (like DB(Reads, 3)
in Figure 1(B)) nor fragmented (like DB(Reads, 4) in Figure 1(C)).
5 Conclusion
The Iterative de Bruijn graph Assembly (IDBA) approach starts from small k,
uses contigs from the de Bruijn graph on k-mers as pseudoreads, and mixes
pseudoreads with original reads to construct the de Bruijn graph for larger k.
The key step in IDBA is to maintain the accumulated de Bruijn graph (Hk ) to
carry the contigs forward as k increases [6].
We have proposed a notion of the Manifold de Bruijn (M-Bruijn) graph that
does not require any parameter setup, e.g., it does not require one to specify
the k-mer size. The M-Bruijn graph provides an alternative way to generate
pseudoreads (as its contigs) that incorporate information for k-mers of varying
sizes.
The M-Bruijn graph introduced here is merely a preliminary theoretical concept
that may seem impractical, since we have not addressed various challenges
posed by real datasets in genome assembly. When Idury and Waterman [2]
introduced the de Bruijn graph approach for genome assembly, the high error rates
in Sanger reads also made that approach seem impractical. Pevzner et al. [3] later
removed this obstacle by introducing an error correction procedure that made the
vast majority of reads error-free. Thus, our ability to handle errors in reads is crucial
for future applications of the M-Bruijn graph approach.
References
18. Magoc, T., Pabinger, S., Canzar, S., et al.: GAGE-B: an evaluation of genome
assemblers for bacterial organisms. Bioinformatics 29(14), 1718–1725 (2013)
19. Compeau, P.E.C., Pevzner, P.A.: Bioinformatics Algorithms: An Active-Learning
Approach. Active Learning Publishers (2014)
20. Ilie, L., Smyth, W.F.: Minimum unique substrings and maximum repeats. Fundamenta Informaticae 110(1), 183–195 (2011)
21. Gusfield, D.: Algorithms on strings, trees and sequences: computer science and
computational biology. Cambridge University Press (1997)
Constructing String Graphs in External Memory
1 Introduction
De novo sequence assembly is a fundamental step in analyzing data from Next-Generation Sequencing (NGS) technologies. NGS technologies produce, from a
given (genomic or transcriptomic) sequence, a huge number of short sequences,
called reads; the most widely used current technology can produce 10^9 reads
with average length 150. The large majority of the available assemblers [1,10,15]
are built upon the notion of de Bruijn graphs where each k-mer is a vertex and
an arc connects two k-mers that have a k − 1 overlap in some input read. Also
in transcriptomics, assembling reads is a crucial task, especially when analyzing
RNA-seq in absence of a reference genome.
Alternative approaches to assemblers based on de Bruijn graphs have been
developed recently, mostly based on the idea of string graph, initially proposed
by Myers [9] before the advent of NGS technologies and further developed [13,14]
to incorporate some advances in text indexing, such as the FM-index [7]. This
method builds an overlap graph whose vertices are the reads and where an arc
connects two reads with a sufficiently large overlap. For the purpose of assembling
a genome some arcs might be uninformative. In fact an arc (r1 , r2 ) is called
reducible if its removal does not change the strings that we can assemble from
the graph, therefore reducible arcs can be discarded. The final graph, where all
reducible arcs are removed, is called the string graph. More precisely, an arc
(r1 , r2 ) of the overlap graph is labeled by a suffix of r2 so that traversing a path
r1 , · · · , rk and concatenating the first read r1 with the labels of the arcs of the
path gives the assembly of the reads along the path [9].
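The labeling convention above can be illustrated with a small sketch. The reads, overlap lengths, and function names below are hypothetical, chosen only to show how concatenating the first read with the arc labels spells the assembly:

```python
def arc_label(r1, r2, overlap_len):
    """Label of arc (r1, r2): the suffix of r2 that extends past its overlap
    with r1 (the overlap is a suffix of r1 and a prefix of r2)."""
    assert r1[len(r1) - overlap_len:] == r2[:overlap_len], "reads do not overlap"
    return r2[overlap_len:]

def assemble(path, overlaps):
    """Spell the string of a path r_1, ..., r_k: the first read concatenated
    with the labels of the arcs along the path."""
    out = path[0]
    for r1, r2, ov in zip(path, path[1:], overlaps):
        out += arc_label(r1, r2, ov)
    return out

# Hypothetical reads with 3-symbol overlaps:
contig = assemble(["CATCA", "TCAGA", "AGATA"], [3, 3])
```

An arc whose removal leaves this spelled set of strings unchanged is exactly what the text calls reducible.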
The naïve way of computing all overlaps consists of pairwise comparisons of all
input reads, which is quadratic in the number of reads. A main contribution of [13]
312 P. Bonizzoni et al.
is the use of the notion of Q-interval to avoid such pairwise comparisons. More precisely,
for each read r in the collection R, the portion of the BWT (the Q-interval)
identifying all reads whose overlap with r is a string Q is computed in time linear in
the length of r. In a second step, Q-intervals are extended to discover irreducible
arcs. Both steps require keeping the whole FM-index and BWT for R, and for the
collection of reversed reads, in main memory, since the Q-intervals considered cover
different positions of the whole BWT. Notice that the algorithm of [13] recomputes
each Q-interval a number of times equal to the number of different
reads in R whose suffix is Q; therefore that approach cannot be immediately
translated into an external memory algorithm. For this reason, an open problem of [13]
is to reduce the space requirements by developing an external memory algorithm
to compute the string graph.
Recently, investigations of external memory construction of the Burrows-Wheeler
Transform (BWT) and of related text indices (such as the FM-index)
and data structures (such as the LCP array) have flourished [2,3,6], greatly reducing
the amount of RAM necessary. In this paper, we show that two scans of the BWT,
the LCP array, and the generalized suffix array (GSA) for the collection of reads are
sufficient to build a compact representation of the overlap graph, consisting mainly
of the Q-intervals for each overlap Q.
Since each arc label is a prefix of some reads and a Q-interval can be used
to represent any substring of a read, we exploit the above representation of
arcs also for encoding labels. The construction of Q-intervals corresponding to
labels is done by iterating the operation of backward σ-extension of a Q-interval,
that is computing the σQ-interval on the BWT starting from a Q-interval. The
idea of backward extension is loosely inspired by the pattern matching algorithm
using the FM-index [7]. A secondary memory implementation of the operation of
backward extension is a fundamental contribution of [5]. They give an algorithm
that, with a single scan of the BWT, reads a lexicographically sorted set of
disjoint Q-intervals and computes all possible σQ-intervals, for every symbol σ
(the original algorithm extends all Q-intervals where all Qs have the same length,
but it is immediate to generalize that algorithm to an input set of disjoint Q-
intervals). Our approach requires backward extending generic sets of Q-intervals.
For this purpose, we develop a procedure (ExtendIntervals) that will be a crucial
component of our algorithm to build the overlap and string graph.
Our main result is an efficient external memory algorithm to compute the
string graph of a collection of reads. The algorithm consists of three different
phases, where the second phase consists of several iterations. Each part will be
described as linear scans and/or writes of the files containing the BWT, the
GSA and the LCP array, as well as some other intermediate files. We strive to
minimize the number of passes over those files, as a simpler adaptation of the
algorithm of [13] would require a number of passes equal to the number of input
reads in the worst case, which would clearly be inefficient.
After building the overlap graph, where each arc consists of two reads with
a sufficiently large overlap, the second phase iteratively extends the Q-intervals
found in the first phase and in the previous iterations, in order to compute
an additional symbol of some arc labels (all labels are empty at the end of the
first phase). At the end of the second phase, those labels allow the reconstruction of
the entire assembly (i.e., the genome/transcriptome from which the reads have
been extracted). Finally, the third phase is devoted to testing whether an arc is
reducible, in order to obtain the final string graph, using a new characterization
of reducible arcs in terms of arc labels, i.e. prefixes of reads.
The algorithm has O(dℓ²n) time complexity, where ℓ and n are the length and
the number of input reads, respectively, and d is the maximum indegree of the
string graph. We have
developed an open source implementation of the algorithm, called LightString-
Graph (LSG), available at http://lsg.algolab.eu/. We have compared LSG
with SGA [13] on a dataset of 37M reads, showing that LSG is competitive (its
running time is 5h 28min while SGA needed 2h 19min) even if disk accesses are
much slower than those in main memory (SGA is an in-memory algorithm).
2 Preliminaries
We briefly recall the standard definitions of Generalized Suffix Array and
Burrows-Wheeler Transform on a set of strings. Let Σ be an ordered finite
alphabet and let S be a string over Σ. We denote by S[i] the i-th symbol of S, by
ℓ = |S| the length of S, and by S[i : j] the substring S[i]S[i + 1] · · · S[j] of S.
The reverse of S is the string S rev = S[ℓ]S[ℓ − 1] · · · S[1]. The suffix and prefix
of S of length k are the substrings S[ℓ − k + 1 : ℓ] and S[1 : k], respectively. The
k-suffix of S is the suffix of length k. Given two strings Si and Sj, we say that Si
overlaps Sj iff a nonempty suffix Z of Si is also a prefix of Sj, that is Si = XZ
and Sj = ZY. In that case we say that Sj extends Si by |Y| symbols, that Z is
the overlap of Si and Sj, denoted as ov_{i,j}, that Y is the extension of Si with Sj,
denoted as ex_{i,j}, and X is the prefix-extension of Si with Sj, denoted as pe_{i,j}.
In the following of the paper we will consider a collection R = {r1 , . . . , rn }
of n reads (i.e., strings) over Σ. As usual, we append a sentinel symbol $ ∉ Σ
to the end of each string ($ lexicographically precedes all symbols in Σ). Then,
let R = {r1 $, . . . , rn $} be a collection of n strings (or reads), where each ri is a
string over Σ; we denote by Σ$ the extended alphabet Σ ∪ {$}. Moreover, we
assume that the sentinel symbol $ is not taken into account when computing
overlaps between two strings.
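As a concrete illustration of these definitions, the decomposition of an overlapping pair into prefix-extension, overlap, and extension can be sketched as follows (a naive quadratic check written for this text only; the function name is ours, not the paper's):

```python
def overlap_parts(s_i, s_j):
    """Decompose an overlapping pair: find the longest nonempty suffix of
    s_i that is also a prefix of s_j (the overlap Z), so that
    s_i = X Z (X is the prefix-extension pe) and s_j = Z Y (Y is the
    extension ex of s_i with s_j)."""
    for k in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i[-k:] == s_j[:k]:
            return s_i[:-k], s_i[-k:], s_j[k:]
    return None  # s_i does not overlap s_j

pe, ov, ex = overlap_parts("GATTACA", "ACAGGT")
# GATTACA = GATT + ACA and ACAGGT = ACA + GGT
assert (pe, ov, ex) == ("GATT", "ACA", "GGT")
```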
The Generalized Suffix Array (GSA) [12] of R is the array SA where each
element SA[i] is equal to (k, j) if and only if the k-suffix of string rj is the i-th
smallest element in the lexicographic order of the set of all the suffixes of the
strings in R. In the literature (as in [2]), the relative order of two elements (k, i)
and (k, j) of the GSA such that reads ri and rj share their k-suffix is usually
determined by the order in which the two reads appear in the collection R (i.e.,
their indices). However, starting from the usual definition of the order of the
elements of the GSA, it is possible to compute the GSA with the order of its
elements determined by the lexicographic order of the reads with two sequential
scans of the GSA itself. The first scan extracts the sequence of pairs (k, j) where
k is equal to the length of rj , hence obtaining the reads of R sorted lexicograph-
ically. The second scan uses the sorted R to reorder consecutive entries of the
GSA sharing the same suffix. This ordering will be essential in the following since
a particular operation (namely, the backward $-extension, as defined below) is
possible only if this particular order is assumed. The Longest Common Prefix of
R, denoted by LCP , is an array of size equal to the total length of the strings
in R and such that LCP [i] is equal to the length of the longest prefix shared
by the suffixes pointed to by GSA[i] and GSA[i − 1] (excluding the sentinel $).
For convenience, we assume that LCP [1] = 0. Notice that no element of LCP is
larger than the maximum length of a read of R.
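To make these definitions concrete, a direct quadratic construction of the GSA and LCP for a toy collection can be sketched as below (for illustration only; the paper assumes these structures are built with external-memory tools such as BEETL [2,3], and ties between equal suffixes are broken here by read index, as in [2]):

```python
def gsa_lcp(reads):
    """Naive GSA/LCP of a '$'-terminated collection (quadratic, in-memory).
    Each GSA entry (k, j) says: the k-suffix of reads[j] + '$' is the next
    smallest suffix; '$' is treated as smaller than every other symbol."""
    suffixes = []
    for j, r in enumerate(reads):
        s = r + "$"
        for k in range(1, len(s) + 1):
            suffixes.append((s[-k:], j, k))
    # '\0' stands in for '$' so the sentinel sorts first for any alphabet
    suffixes.sort(key=lambda t: (t[0].replace("$", "\0"), t[1]))
    gsa = [(k, j) for _, j, k in suffixes]
    lcp = [0]  # by convention LCP[1] = 0
    for i in range(1, len(suffixes)):
        a = suffixes[i - 1][0].rstrip("$")  # the sentinel is excluded
        b = suffixes[i][0].rstrip("$")
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return gsa, lcp

gsa, lcp = gsa_lcp(["TACA", "ACAG"])
assert gsa[3] == (4, 0)   # 'ACA$', the 4-suffix of read 0
assert lcp[4] == 3        # 'ACA$' and 'ACAG$' share the prefix 'ACA'
```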
The Burrows-Wheeler Transform (BWT) of R is the sequence B such that
B[i] = rj [|rj |−k], if SA[i] = (k, j) and k < |rj |, or B[i] = $, otherwise. Informally,
B[i] is the symbol that precedes the k-suffix of string rj where such suffix is the
i-th smallest suffix in the ordering given by SA. Given a string Q, all suffixes of
the GSA whose prefix is Q appear consecutively in GSA, therefore they induce
an interval [b, e) which is called Q-interval [2] and denoted by q(Q). We define
the length and width of the Q-interval [b, e) as |Q| and the difference (e − b),
respectively. Notice that the width of the Q-interval is equal to the number of
occurrences of Q as a substring of some string r ∈ R. Whenever the string Q
is not specified, we will use the term string-interval to point out that it is the
interval on the GSA of all suffixes having a common prefix. Since the BWT
and the GSA are closely related, we also say that [b, e) is a string-interval (or
Q-interval for some string Q) on the BWT. Let B rev be the BWT of the set
Rrev = {r rev | r ∈ R}, let [b, e) be the Q-interval on B for some string Q, and let
[b′, e′) be the Qrev-interval on B rev. Then, [b, e) and [b′, e′) are called linked. The
linking relation is a 1-to-1 correspondence, and two linked intervals have the same
width and length, hence (e − b) = (e′ − b′).
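Continuing the toy setting, the BWT of a collection and a Q-interval can be read off the sorted suffixes directly (a naive sketch using 0-based intervals, whereas the paper indexes from 1):

```python
def bwt_and_interval(reads, q):
    """Return the BWT of the '$'-terminated collection and the half-open
    q-interval [b, e) of consecutive sorted suffixes whose prefix is q."""
    entries = []
    for j, r in enumerate(reads):
        s = r + "$"
        for k in range(1, len(s) + 1):
            # symbol preceding the k-suffix ('$' if the suffix is the whole read)
            prev = s[-k - 1] if k < len(s) else "$"
            entries.append((s[-k:], prev, j))
    entries.sort(key=lambda t: (t[0].replace("$", "\0"), t[2]))
    bwt = "".join(prev for _, prev, _ in entries)
    hits = [i for i, (suf, _, _) in enumerate(entries) if suf.startswith(q)]
    return bwt, ((hits[0], hits[-1] + 1) if hits else None)

bwt, interval = bwt_and_interval(["TACA", "ACAG"], "CA")
assert interval == (6, 8)      # width 2: 'CA' occurs twice in the collection
assert bwt == "AGCT$CAAA$"
```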
Given a Q-interval and a symbol σ ∈ Σ, the backward σ-extension of the Q-
interval is the σQ-interval (that is, the interval on the GSA of the suffixes sharing
the common prefix σQ). We say that a Q-interval has a nonempty (empty, re-
spectively) backward σ-extension if the resulting interval has width greater than
0 (equal to 0, respectively). Conversely, the forward σ-extension of a Q-interval
is the Qσ-interval. Given the BWT B, the FM-index [7] is essentially composed
of two functions C and Occ: C(σ), with σ ∈ Σ, is the number of occurrences
in B of symbols that are alphabetically smaller than σ, while Occ(σ, i) is the
number of occurrences of σ in the prefix B[1 : i − 1] (hence Occ(·, 1) = 0). These
two functions can be used to efficiently compute a backward σ-extension on B
of any Q-interval [7] and the corresponding forward σ-extension of the linked
Qrev -interval on B rev [8]. The same procedure can be used also for computing
backward σ-extensions only thanks to the property that the first |R| elements
of the GSA correspond to R in lexicographical order. Notice that the order we
assumed on the elements of the GSA allows us to compute also the backward $-
extension of a Q-interval (hence determining the set of reads sharing a common
prefix Q), whereas this operation is not possible according to the usual order
of the elements of the GSA. The backward $-extension will be used in several
parts of our algorithms in order to compute and represent such a set of reads.
Moreover, for the purpose of computing σ-extensions, notice that the BWT can
be obtained assuming any order for equal suffixes in different reads, since there
does not exist any string-interval including only some of them.
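The backward σ-extension through C and Occ can be sketched as follows (a self-contained toy with 0-based indices and with C and Occ computed on the fly; the paper's external-memory setting instead streams the BWT from disk):

```python
def backward_extend(bwt, interval, sigma):
    """Map a Q-interval [b, e) on the BWT to the sigma-Q-interval, i.e.
    [C(sigma) + Occ(sigma, b), C(sigma) + Occ(sigma, e)).
    Returns None when the extension is empty."""
    b, e = interval
    smaller = [c for c in set(bwt) if c < sigma]   # '$' < 'A' < ... in ASCII
    C = sum(bwt.count(c) for c in smaller)         # symbols smaller than sigma
    occ = lambda i: bwt[:i].count(sigma)           # Occ(sigma, i), 0-based
    b2, e2 = C + occ(b), C + occ(e)
    return (b2, e2) if e2 > b2 else None

# with the BWT 'AGCT$CAAA$' of {TACA$, ACAG$}, the 'CA'-interval is [6, 8)
assert backward_extend("AGCT$CAAA$", (6, 8), "A") == (3, 5)  # the 'ACA'-interval
assert backward_extend("AGCT$CAAA$", (6, 8), "T") is None    # 'TCA' does not occur
```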
3 The Algorithm
Since short overlaps are likely to appear by chance, they are not meaningful for
assembling the original sequence. Hence, we will consider only overlaps at least
τ long, where τ is a positive constant. For simplicity, we assume that the set R
of the reads is substring-free, that is, there are no two reads r1 , r2 ∈ R such that
r1 is a substring of r2 . The overlap graph of R is the directed graph GO = (R, A)
whose vertices are the strings in R, and two reads ri, rj form the arc (ri, rj) if
they overlap. Moreover, each arc (ri, rj) of GO is labeled by the extension ex_{i,j}
of ri with rj. Each path (r1, . . . , rk) in GO represents a string that is obtained by
assembling the reads of the path. More precisely, such a string is the concatenation
r1 ex_{1,2} ex_{2,3} · · · ex_{k−1,k} [9,14]. An arc (ri, rj) of GO is called reducible if there
exists another path from ri to rj representing the same string as the path (ri, rj)
(i.e., the string ri ex_{i,j}). Notice that reducible arcs are not helpful in assembling
reads, therefore we are interested in removing (or in avoiding computing) them.
The resulting graph is called string graph [9].
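As a toy counterpart of these definitions, the overlap graph and the removal of reducible arcs can be sketched by pairwise comparison (exactly the quadratic in-memory approach the paper avoids; only two-arc paths are checked, in the spirit of the classical transitive-reduction test [9], and the reads are hypothetical):

```python
def naive_string_graph(reads, tau):
    """Build the overlap graph by pairwise comparison (overlaps of length
    at least tau on a substring-free set), then drop reducible arcs."""
    arcs = {}  # (i, j) -> extension ex_{i,j}
    for i, ri in enumerate(reads):
        for j, rj in enumerate(reads):
            if i == j:
                continue
            for k in range(min(len(ri), len(rj)) - 1, tau - 1, -1):
                if ri[-k:] == rj[:k]:        # longest overlap wins
                    arcs[(i, j)] = rj[k:]
                    break
    reducible = set()
    for (i, h), ex1 in arcs.items():
        for (h2, j), ex2 in arcs.items():
            # (i, j) is reducible if the path (i, h, j) spells the same string
            if h2 == h and arcs.get((i, j)) == ex1 + ex2:
                reducible.add((i, j))
    return {a: ex for a, ex in arcs.items() if a not in reducible}

# reads sampled from the hypothetical genome AGATTACA
g = naive_string_graph(["AGATT", "GATTAC", "TTACA"], tau=2)
assert set(g) == {(0, 1), (1, 2)}    # the arc (0, 2) was reducible
assert g[(0, 1)] + g[(1, 2)] == "ACA"
```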
Let us denote by Rs(Q) and Rp(Q) the sets of reads whose suffix (prefix,
resp.) is a given string Q. If |Q| ≥ τ, then each pair of reads rs ∈ Rs(Q),
rp ∈ Rp(Q) forms an arc (rs, rp) of GO. Conversely, given an arc (rs, rp) of GO,
then rs ∈ Rs(ov_{s,p}) and rp ∈ Rp(ov_{s,p}). Therefore, the arc set of the overlap
graph is the union of Rs (Q) × Rp (Q) for each Q at least τ characters long.
Observe that a $Q-interval represents the set Rp (Q) of the reads with prefix Q,
while a Q$-interval represents the set Rs (Q) of the reads with suffix Q. As a
consequence, we can represent the sets Rs (Q) and Rp (Q) as two string-intervals.
Our algorithm for building the string graph is composed of three steps. The
first step computes a compact representation of the overlap graph in secondary
memory, the second step computes the prefix-extensions of each arc of the overlap
graph that will be used in the third step for removing the reducible arcs from the
compact representation of the overlap graph (hence obtaining the string graph).
In the first step, since the cartesian product Rs (S) × Rp (S) represents all arcs
whose overlap is S, we compute the (unlabeled) arcs of the overlap graph by
computing all S-intervals (|S| ≥ τ ) such that the two sets Rs (S), Rp (S) are
both nonempty. We compactly represent the set of arcs whose overlap is S as
a tuple (q(S$), q($S), 0, |S$|), that we call basic arc-interval. We will use S for
denoting a string that is an overlap among some reads.
The three steps of the algorithm work on the three files—B, SA and L—
containing the BWT, the GSA, and the LCP of the set R, respectively. We first
discuss the ideas used to compute the overlap graph, while we will present the
other steps in the following parts of the section. Observe that the arcs of the over-
lap graph correspond to nonempty S$-intervals and $S-intervals for every overlap
S of length at least τ . As a consequence, the computation of the overlap graph
reduces to the task of computing the set of S-intervals that have a nonempty
efficiently deal with the inclusion between string-intervals (in fact, any two string-
intervals are either nested or disjoint). Each Q-interval [b, e) in I is associated to
a record (Q, [b, e), [b′, e′)) such that [b, e) is the Q-interval on B and [b′, e′) is the
Qrev-interval on B rev, that is, the intervals in each record are linked. Moreover,
a set x([b, e)) of symbols is associated to each string-interval [b, e) in I, and
x([b, e)) contains the symbols that must be used to extend the record. For each
string-interval and for each character σ in the associated set of symbols, the
result must contain a record (σQ, [bσ, eσ), [b′σ, e′σ)) where [bσ, eσ) is the backward
σ-extension of [b, e) on B and [b′σ, e′σ) is the forward σ-extension of [b′, e′) on
B rev. Notice that the intervals in the output records are also linked.
The algorithm ExtendIntervals performs only a single pass over the BWT
B and the LCP L, and maintains an array Π[·] which stores for each symbol in
Σ$ the number of its occurrences in the prefix of the BWT preceding the
current position. In other words, when the first p symbols of B have been read, the
array Π gives the number of occurrences of each symbol in Σ$ in the first p − 1
characters of B. The procedure also maintains some arrays EΠj [·] so that, for
each symbol σ and each integer j, EΠj [σ] = Occ(σ, pj ) where pj is the starting
position of the Q-interval containing the current position of the BWT such that
(1) |Q| = j and (2) the width of the Q-interval is larger than 1. Notice that,
for each position p and integer j, at most one such Q-interval exists. If no such
Q-interval exists, then the value of EΠj is undefined. We recall that Occ(σ, p)
is the number of occurrences of σ in B[1 : p − 1] [7]. Since ExtendIntervals
accesses the arrays B and LCP sequentially, the procedure can be immediately
viewed as an external memory algorithm where B and LCP are two files. Notice
also that line 3, that is finding all Q-intervals whose end boundary is p, can be
executed most efficiently if the intervals are already ordered by end boundary.
Lemmas 1 and 2 show the correctness of Alg. 1.
Lemma 1. At line 3 of Algorithm 1, for each c ∈ Σ (1) Π[c] is equal to the
number of occurrences of c in B[1 : p − 1] and (2) EΠk [c] = Occ(c, pk ) for each
Q-interval [pk , ek ) of width larger than 1 which contains p and such that |Q| = k.
Proof. We prove the lemma by induction on p. When p = 1, there is no symbol
before position p, therefore Π must be made of zeroes, and the initialization
of line 1 is correct. Moreover all string-intervals containing the position 1 must
start at 1 (as no position precedes 1), therefore line 1 sets the correct values of
EΠk .
Assume now that the property holds up to step p − 1 and consider step p.
The array Π is updated only at line 16, hence its correctness is immediate. Let
[pk , ek ) be the generic Q-interval [pk , ek ) containing p and such that (1) |Q| = k,
and (2) the width of the Q-interval is larger than 1, that is ek − pk ≥ 2. Since
all suffixes in the interval [pk , ek ) of the GSA have Q as a common prefix and
|Q| = k, LCP [i] ≥ k for pk < i ≤ ek .
If pk < p, then [pk , ek ) contains also p − 1, that is Q is a prefix of the suffix
pointed to by SA[p − 1]. Hence LCP [p] ≥ k and the value of EΠk at iteration p
is the same as at iteration p − 1. By inductive hypothesis EΠk = Occ(c, pk ). The
value of EΠk is correct, since the line 15 of the algorithm is not executed.
Algorithm 1. ExtendIntervals
Input : The BWT B and the LCP array L of a set R of strings. A set I of
Q-intervals, each one associated with a record and with a set x(·) of
characters driving the extension.
Output : The set of extended Q-intervals.
1 Initialize Π and each EΠj (for 1 ≤ j ≤ max_i {L[i]}) to be |Σ|-long vectors 0̄;
2 for p ← 1 to |B| do
3 foreach Q-interval [b, e) in I such that e = p do
4 (Q, [b, e), [b′, e′)) ← the record associated to [b, e) // p = e
5 foreach character c ∈ x([b, e)) do
6 if b = e − 1 then
7 if B[p − 1] = c then
8 t←0
9 else
10 t ← 1;
11 Output (cQ, [C[c] + Π[c] + t, C[c] + Π[c] + 1),
[b′ + Σ_{σ<c} (Π(σ) − EΠ|Q| (σ)), b′ + Σ_{σ<c} (Π(σ) − EΠ|Q| (σ)) + (1 − t)));
12 else
13 Output (cQ, [C[c] + EΠ|Q| [c] + 1, C[c] + Π[c] + 1),
[b′ + Σ_{σ<c} (Π(σ) − EΠ|Q| (σ)), b′ + Σ_{σ<c} (Π(σ) − EΠ|Q| (σ)) +
Π[c] − EΠ|Q| [c]));
14 foreach j such that L[p] ≤ j < L[p + 1] do
15 EΠj ← Π // a Q-interval with |Q| = j begins at position p
16 Π[B[p]] ← Π[B[p]] + 1;
Lemma 2. Let (Q, [b, e), [b′, e′)) be a record and let c ∈ x([b, e)) be a character.
Then Algorithm 1 outputs the correct c-extension of such record.
Computing Arc Labels. In this part, we describe how the compact represen-
tation of the overlap graph computed in the first step can be further processed
in order to easily remove reducible arcs without resorting to (computationally
expensive) string comparisons. First, we give an easy-to-test characterization of
reducible arcs of overlap graphs in terms of string-intervals (Lemmas 3 and 4).
Then, we show how such string-intervals (that we call arc-labels) can be efficiently
computed in external memory starting from the collection of basic arc-intervals
computed in the first step.
Lemma 3. Let GO be the overlap graph for R and let (r_{i1}, r_{i2}, . . . , r_{ik}) be a path
of GO. Then, such a path represents the string pe_{i1,i2} pe_{i2,i3} · · · pe_{ik−1,ik} r_{ik}.
Proof. We prove the lemma by induction on k. Let (rh, rj) be an arc of
GO. Notice that the string represented by such an arc is pe_{h,j} ov_{h,j} ex_{h,j}. Since
rh = pe_{h,j} ov_{h,j} and rj = ov_{h,j} ex_{h,j}, applying the property to the arc (r_{i1}, r_{i2})
settles the case k = 2. Assume now that the lemma holds for paths of length
smaller than k and consider the path (r_{i1}, . . . , r_{ik}). By definition, the string rep-
resented by such a path is r_{i1} ex_{i1,i2} · · · ex_{ik−1,ik} which, by inductive hypothesis
on the path (r_{i1}, r_{i2}, . . . , r_{ik−1}), is equal to pe_{i1,i2} · · · pe_{ik−2,ik−1} r_{ik−1} ex_{ik−1,ik}. But
r_{ik−1} ex_{ik−1,ik} = pe_{ik−1,ik} ov_{ik−1,ik} ex_{ik−1,ik}, which can be rewritten as pe_{ik−1,ik} r_{ik}.
Hence pe_{i1,i2} · · · pe_{ik−2,ik−1} r_{ik−1} ex_{ik−1,ik} = pe_{i1,i2} · · · pe_{ik−2,ik−1} pe_{ik−1,ik} r_{ik}.
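The identity proved in Lemma 3 can be checked on a concrete three-read path (hypothetical reads sampled from the string AGATTACAG, used only as a sanity check; `parts` is a local helper, not the paper's notation):

```python
def parts(s_i, s_j):
    # longest proper overlap: s_i = pe + ov and s_j = ov + ex
    for k in range(min(len(s_i), len(s_j)) - 1, 0, -1):
        if s_i[-k:] == s_j[:k]:
            return s_i[:-k], s_i[-k:], s_j[k:]

path = ["AGATT", "GATTAC", "TTACAG"]
pe01, _, ex01 = parts(path[0], path[1])   # pe01 = 'A',  ex01 = 'AC'
pe12, _, ex12 = parts(path[1], path[2])   # pe12 = 'GA', ex12 = 'AG'
# the two equivalent expressions for the string represented by the path
assert path[0] + ex01 + ex12 == pe01 + pe12 + path[2] == "AGATTACAG"
```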
Lemma 4. Let GO be the overlap graph for a substring-free set R of reads and
let (ri, rj) be an arc of GO. Then, (ri, rj) is reducible iff there exists another arc
(rh, rj) such that pe_{h,j} is a proper suffix of pe_{i,j} (or, equivalently, such that
pe_{h,j}^rev is a proper prefix of pe_{i,j}^rev).
Algorithm 2. ExtendArcIntervals
Input : Two files B and SA containing the BWT and the GSA of the set R,
respectively. A set of files AI(·, ℓ1) containing the arc-intervals of
length ℓ1. A set of files BAI(·, ℓ1) containing the basic arc-intervals of
length ℓ1.
Output : A set of files AI(·, ℓ1 + 1) containing the arc-intervals of length ℓ1 + 1.
The arcs of the overlap graph coming out from reads of length ℓ1 − 1.
1 Π[σ] ← 0, for each σ ∈ Σ;
2 π[σ] ← 0, for each σ ∈ Σ;
3 [bprev , eprev ) ← null;
4 foreach σ ∈ Σ do
5 foreach ([b, e), q($S), ℓe, ℓ1) ∈ SortedMerge(AI(σ, ℓ1), BAI(σ, ℓ1)) do
// If the PS$-interval [b, e) is different from the one previously processed,
then vectors Π and π must be updated, otherwise [b, e) is extended using the
values previously computed.
6 if [b, e) ≠ [bprev, eprev) then
7 Π[σ] ← Π[σ] + π[σ], for each σ ∈ Σ;
8 Update Π while reading B until the BWT position b − 1;
9 π[σ] ← 0, for each σ ∈ Σ;
10 r ← null;
11 while reading B from the BWT position b to e − 1 do
12 σ ← symbol of the BWT at the current position p;
13 if σ ≠ $ then
14 π[σ] ← π[σ] + 1;
15 else
// The arc-interval is terminal and r is the read equal to PS
16 r ← the read pointed to by GSA at position p;
17 if r = null then
// Update the file A of the output arcs, since the arc-interval is terminal
18 Append {r} × Rp(S) to A_{ℓe};
19 else
20 foreach σ ∈ Σ do
21 if π[σ] > 0 then
22 b′ ← C[σ] + Π[σ] + 1;
23 e′ ← b′ + π[σ];
24 Append ([b′, e′), q($S), ℓe + 1, ℓ1 + 1) to AI(σ, ℓ1 + 1);
25 [bprev , eprev ) ← [b, e);
ℓ2-long prefix-extension P begins with the symbol σ. These files are tightly
coupled, since there is a 1-to-1 correspondence between records of AI(σ, ℓ1) and
records of AL(σ, ℓ1, ·), where those records refer to the same pair (P, S) of prefix-
extension (of length ℓ2) and overlap-string. Each of the BAI(·, ·), AI(·, ·) and
AL(·, ·, ·) files contains string-intervals of the same length and ordered by start
boundary, hence those intervals are also sorted by end boundary.
Computing Terminal Arc-Intervals and Arc-Labels. After the first step,
the algorithm computes terminal arc-intervals and arc-labels. The first
fundamental observation is that an arc-interval of length ℓ1 + 1 (that is, an arc-interval
that will be stored in BAI(·, ℓ1 + 1)), and corresponding to a pair (P, S) with
|PS$| = ℓ1 + 1, can be obtained by extending a basic arc-interval of length
ℓ1 (taken from BAI(·, ℓ1)) or a non-basic arc-interval of length ℓ1 (taken from
AI(·, ℓ1)). Since all those files are sorted, we can assume we have a SortedMerge
procedure which receives two sorted files and returns their sorted union. No-
tice that we do not actually need to write a new file, as SortedMerge basically
amounts to choosing the file from which to read the next record.
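The SortedMerge procedure can be sketched as a lazy merge of the two sorted record streams, reading one record at a time from whichever file comes next (the record layout below, with the interval as the first component, is our assumption):

```python
import heapq

def sorted_merge(stream_a, stream_b):
    """Lazily merge two streams of records sorted by interval start
    boundary; nothing is written back to disk, matching the remark that
    SortedMerge only chooses which file to read next."""
    return heapq.merge(stream_a, stream_b, key=lambda rec: rec[0][0])

ai  = [((3, 5), "arc-interval"), ((9, 10), "arc-interval")]
bai = [((1, 2), "basic"), ((6, 8), "basic")]
merged = list(sorted_merge(iter(ai), iter(bai)))
assert [rec[0] for rec in merged] == [(1, 2), (3, 5), (6, 8), (9, 10)]
```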
The algorithm performs a sequence of extension steps, each mainly backward
σ-extending string-intervals. In fact, at each extension step i (to simplify some
formulae, the first step is set to i = τ + 1), the algorithm scans all files BAI(·, i),
AI(·, i), AL(·, i, j) and computes the files AI(·, i + 1), AL(·, i + 1). At iteration
i, for each σ1 ∈ Σ, all records in SortedMerge(BAI(σ1 , i), AI(σ1 , i)) are σ2 -
extended, for each σ2 ∈ Σ, via the procedure ExtendArcIntervals, outputting
the results in the file AI(·, i + 1). We recall that σ-extending a record means, in
this case of the procedure ExtendArcIntervals, to backward σ-extend the q(P S$)
of the arc-interval (or the q(S$) of the basic arc-interval). If the record to σ2 -
extend is read from a file BAI(·, i) (i.e., it is a basic arc-interval), when the
algorithm writes a record of AI(·, i + 1) (i.e., the σ2-extension of that record),
it also writes the corresponding record of AL(·, i + 1), that is an arc-label where
the prefix-extension is equal to the symbol σ2 . On the other hand, if the current
record to σ2 -extend is read from a file AI(·, i), we consider also the correspond-
ing record of AL(·, i) to write a record of AI(·, i + 1) and the corresponding
record of AL(·, i + 1) which is the σ2 -extension of the record in AL(·, i). Each
time a terminal arc-interval associated to (P, S) is found, the arcs {r} × Rp (S),
where r = P S, are written in the file A|P | .
Testing Irreducible Arcs. The algorithm reads the arcs of the overlap graph,
stored in the files Ai , for increasing values of i. Each arc a is added to the set A of
the arcs of the string graph if there is no arc already in A reducing a. Notice that
A is stored in main memory in the current implementation. Lemma 4 implies
that an arc (of the overlap graph) associated to a pair (P1 , S1 ) can be reduced
only by an arc associated to a pair (P2 , S2 ), such that |P1 | > |P2 |. Hence, an arc
in Ai can be reduced by an arc in Aj only if j < i. Since we examine the files
Ai by increasing values of i, either an arc a is reduced by an arc that is already
in A, or no subsequently read arc of the overlap graph can reduce a. Notice also
that, by the reducibility test of Lemma 4, an arc associated to a pair (P1 , S1 )
is reduced by an arc associated to (P2 , S2 ) if and only if P2rev is a proper prefix
of P1rev . Thus, the test is equivalent to determine whether the P2rev -interval on
B rev properly contains the P1rev -interval. The latter test can be easily performed
by outputting in the files Aj a representation of the prefix-interval of each arc.
On the Complexity. Notice that Algorithm 1 scans B and L once, and recall that
the total length of B is ℓn, where ℓ is the length of the reads and n is the number
of reads. Since the input Q-intervals are nested or disjoint, there are at most O(ℓn)
distinct Q-intervals, that is, the block at lines 6–10 and line 12 are executed O(ℓn)
times. A stack-based data structure allows storing the distinct EΠ arrays while
requiring O(1) time for each iteration, hence the time complexity of Algorithm 1
is O(ℓn), while its space complexity is O(ℓ|Σ|), since it stores at most ℓ arrays
of |Σ| elements (plus a constant space). The second phase of our algorithm
consists of (ℓ − τ) iterations, each requiring a call to ExtendIntervals and
ExtendArcIntervals, therefore the overall time complexity to compute the
overlap graph is O(ℓ²n), which is also an upper bound on the number of arcs of
the overlap graph. The time complexity of the third phase is O(de), where e and
d are, respectively, the number of arcs of the overlap graph and the maximum
indegree of the resulting string graph, as each arc must be tested for reducibility
against each adjacent vertex.
4 Experimental Analysis
We performed a preliminary experimental comparison of LSG with SGA, a state-
of-the-art assembler based on string-graphs [13], on the dataset of the Human
chromosome 14 used in the recent Genome Assembly Gold-standard Evaluation
(GAGE) project [11]. We used a slightly modified version of BEETL 0.9.0 [2] to
construct the BWT, the GSA, and the LCP of the reads needed by LSG. Note
that BEETL is able to compute the GSA by setting a particular compilation flag.
Since the current version of BEETL requires all input reads to have the same
length, we harmonized the lengths of the reads (∼36M) of the GAGE dataset
to 90bp: we discarded shorter reads (∼6M), whereas we split longer reads into
overlapping substrings with a minimum overlap of 70bp. We further preprocessed
and filtered the resulting ∼50M reads according to the workflow used for SGA in
GAGE [11]: no reads were discarded by the preprocessing step, while ∼13M reads
were filtered out as duplicated. As a result, the final dataset was composed of
∼37M reads of length 90bp.
We generated the index of the dataset using sga-index and beetl-bwt and we
gave them as input to SGA and LSG requiring a minimum overlap (τ ) of 65. We
performed the experimental analysis on a workstation with 12GB of RAM and
standard mechanical hard drives, as our tool is designed to cope with a limited
amount of main memory. The workstation has a quad-core Intel Xeon W3530
2.80GHz CPU running Ubuntu Linux 12.04. To perform a fair comparison, we
slightly modified SGA to disable the computation of overlaps on different strands
(i.e., when one read is reversed and complemented w.r.t. the other).
For the comparison, we focused on running times and main memory allocation.
During the evaluation of the tools we do not consider the index generation step
because that step is outside the scope of this paper. Regarding the running
times, SGA built the string graph in 2 hours and 19 minutes, whereas LSG built
the string graph in a total time of 5 hours and 28 minutes (9min were required
for computing the basic arc-intervals with 98% CPU time, 5h and 13min for arc
labeling with 76% CPU time, 4min for graph reduction with 60% CPU time,
and 2min for producing the final output with 59% CPU time). Regarding the
main memory usage, SGA had a peak memory allocation of 3.2GB whereas LSG
required less than 0.09GB for basic arc-interval computation and for arc labeling,
and less than 0.25GB for graph reduction.
We chose to write the output in ASQG format (the format used by SGA)
to allow processing the results obtained by LSG by the subsequent steps of the
SGA workflow (such as the assembly and the alignment steps). In the current
straightforward implementation of this part, we store the whole set of read IDs
in main memory, which pushes the peak memory usage of this part to 2.5GB.
However, more refined implementations (that, for example, store only part of
the read IDs in main memory) could easily reduce the memory usage for this
(non-essential) part. We also want to point out that the memory required by the
reduction step can be arbitrarily lowered by iteratively reducing arcs incident
to subsets of nodes, with only a small penalty in running times.
Furthermore, we point out that this experimental part was performed on
commodity hardware equipped with mechanical hard disks. As a consequence,
the execution of LSG on systems equipped with faster disks (e.g., SSDs) will
significantly decrease its running time, especially when compared with SGA.
Acknowledgments. The authors thank the reviewers for their detailed and
insightful comments. The authors acknowledge the support of the MIUR PRIN
2010-2011 grant 2010LYA9RH (Automi e Linguaggi Formali: Aspetti Matematici
e Applicativi), of the Cariplo Foundation grant 2013-0955 (Modulation of anti
cancer immune response by regulatory non-coding RNAs), of the FA 2013 grant
(Metodi algoritmici e modelli: aspetti teorici e applicazioni in bioinformatica).
References
1. Bankevich, A., Nurk, S., Antipov, D., et al.: SPAdes: A new genome assembly
algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5),
455–477 (2012)
2. Bauer, M., Cox, A., Rosone, G.: Lightweight algorithms for constructing and in-
verting the BWT of string collections. Theor. Comput. Sci. 483, 134–148 (2013)
3. Bauer, M.J., Cox, A.J., Rosone, G., Sciortino, M.: Lightweight LCP construction
for next-generation sequencing datasets. In: Raphael, B., Tang, J. (eds.) WABI
2012. LNCS, vol. 7534, pp. 326–337. Springer, Heidelberg (2012)
4. Beretta, S., Bonizzoni, P., Della Vedova, G., Pirola, Y., Rizzi, R.: Modeling al-
ternative splicing variants from RNA-Seq data with isoform graphs. J. Comput.
Biol. 16(1), 16–40 (2014)
5. Cox, A.J., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.B.: Comparing DNA sequence
collections by direct comparison of compressed text indexes. In: Raphael, B., Tang,
J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 214–224. Springer, Heidelberg (2012)
6. Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression
in external memory. Algorithmica 63(3), 707–730 (2012)
7. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581
(2005)
8. Lam, T., Li, R., Tam, A., Wong, S., Wu, E., Yiu, S.: High throughput short read
alignment via bi-directional BWT. In: BIBM 2009, pp. 31–36 (2009)
9. Myers, E.: The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005)
10. Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA – A practical iterative
de bruijn graph de novo assembler. In: Berger, B. (ed.) RECOMB 2010. LNCS,
vol. 6044, pp. 426–440. Springer, Heidelberg (2010)
11. Salzberg, S.L., et al.: GAGE: A critical evaluation of genome assemblies and as-
sembly algorithms. Genome Res. 22(3), 557–567 (2012)
12. Shi, F.: Suffix arrays for multiple strings: A method for on-line multiple string
searches. In: Jaffar, J., Yap, R.H.C. (eds.) ASIAN 1996. LNCS, vol. 1179,
pp. 11–22. Springer, Heidelberg (1996)
13. Simpson, J., Durbin, R.: Efficient construction of an assembly string graph using
the FM-index. Bioinformatics 26(12), i367–i373 (2010)
14. Simpson, J., Durbin, R.: Efficient de novo assembly of large genomes using com-
pressed data structures. Genome Res. 22, 549–556 (2012)
15. Simpson, J., Wong, K., Jackman, S., et al.: ABySS: a parallel assembler for short
read sequence data. Genome Res. 19(6), 1117–1123 (2009)
Topology-Driven Trajectory Synthesis
with an Example on Retinal Cell Motions
1 Introduction
The work presented in this paper is motivated by the investigation of a retinal
disease called retinitis pigmentosa [18]. In this disease, a mutation kills the rod
photoreceptors in the retina. A consequence of this death is that the geometry
of the mosaic of cone photoreceptors deforms in an interesting way. Normally,
cones form a relatively homogeneous distribution. But after the death of rods,
the cones migrate to form an exquisitely regular array of holes.
Our central goal is to build a dynamic evolution model for the point distributions
that arise from the cone mosaic in retinitis pigmentosa. In physics, the
most classical method for modeling cell motions is to solve a system of differential
equations derived from Newton's laws of motion with some predefined force field
which specifies cell-to-cell interactions. However, in many cases it is difficult to
understand how different types of cells (for example, cones and rods) interact
with each other. There are also mathematical models that do not presume much
prior biological knowledge, such as flocking, which has been widely used to simulate
coordinated animal motions [16]. But as with all model-based approaches,
the method is limited by the model chosen in the first place.
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 326–339, 2014.
© Springer-Verlag Berlin Heidelberg 2014
The retina is a light-sensitive layer of tissue that lines the inner surface of the
eye. It contains photoreceptor cells that capture light rays and convert them
into electrical impulses. These impulses travel along the optic nerve to the brain
where they are turned into images of the visual world.
There are two types of photoreceptors in the retina: cones and rods. In adult
humans, the entire retina contains about 6 million cones and 120 million rods.
Cones are contained in the macula, the portion of the retina responsible for
central vision. They are most densely packed within the fovea, the very center
portion of the macula. Cones function best in bright light and support color
perception. In contrast, rods are spread throughout the peripheral retina and
function best in dim light. They are responsible for peripheral and night vision.
Retinitis pigmentosa is one of the most common forms of inherited retinal
degeneration. This disorder is characterized by the progressive loss of photoreceptor
cells and may lead to night blindness or tunnel vision. Typically, rods
are affected earlier in the course of the disease, and cone deterioration occurs
later. In the progressive degeneration of the retina, the peripheral vision slowly
constricts and the central vision is usually retained until late in the disease.
328 C. Gu, L. Guibas, and M. Kerber
Fig. 2. Distribution of cones in normal (7523 cells in 1 mm²) and retinitis pigmentosa
(6509 cells in 1 mm²) retinas at day 25
3 Synthesis Algorithm
Suppose we are given a point set X = {x1 , x2 , . . . , xn } at time t and we want
to simulate the time evolution from t to t + Δt. Since we do not presume any
biological knowledge about the system, in each step we simply move a point xi to
some random location x′i within its neighborhood. We then compare both the old
configuration {x1, . . . , xi, . . . , xn} and the new configuration {x1, . . . , x′i, . . . , xn}
after this point update to the real data at time t + Δt. If the new configuration
is closer to the data than the old configuration, we accept this movement for xi ,
otherwise we accept it with some probability which depends on their difference.
We iteratively repeat this process for each point in X until the result converges.
The details of the trajectory synthesis algorithm are shown in Algorithm 1. It
can be seen as a variant of the simulated annealing algorithm [9], in which the
acceptance probability also depends on a temperature parameter to avoid local
minima in optimization. There are two questions we have not addressed:
g_X(r) = \frac{1}{S_{d-1}(r)\,\rho\, n} \sum_{i=1}^{n} \sum_{j=1}^{n} G(\|x_i - x_j\| - r), \quad \forall r \ge 0 \qquad (1)

g_{X'}(r) = g_X(r) + \frac{2}{S_{d-1}(r)\,\rho\, n} \sum_{j \ne i} \left( G(\|x'_i - x_j\| - r) - G(\|x_i - x_j\| - r) \right) \qquad (2)
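As a concrete illustration, the PCF estimator of Eq. (1) and the incremental update of Eq. (2) can be sketched as follows. This is a minimal 2D sketch assuming a Gaussian smoothing kernel G with bandwidth sigma and unit intensity ρ = 1; the function names and parameter values are ours, not from the paper.

```python
import math

def G(x, sigma=0.05):
    """Gaussian smoothing kernel used in the PCF estimator (sigma is our choice)."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)

def pcf(points, r, rho=1.0):
    """Full O(n^2) evaluation of g_X(r) as in Eq. (1), for points in R^2.
    S_{d-1}(r) is the perimeter of a circle of radius r."""
    n = len(points)
    s = 2 * math.pi * r
    total = sum(G(math.hypot(xi - xj, yi - yj) - r)
                for (xi, yi) in points for (xj, yj) in points)
    return total / (s * rho * n)

def pcf_update(points, i, new_pos, r, g_old, rho=1.0):
    """O(n) incremental update of g_X(r) after moving point i to new_pos, Eq. (2)."""
    n = len(points)
    s = 2 * math.pi * r
    delta = 0.0
    for j, (xj, yj) in enumerate(points):
        if j == i:
            continue
        d_old = math.hypot(points[i][0] - xj, points[i][1] - yj)
        d_new = math.hypot(new_pos[0] - xj, new_pos[1] - yj)
        delta += G(d_new - r) - G(d_old - r)
    return g_old + 2 * delta / (s * rho * n)
```

The point of Eq. (2) is cost: recomputing the full double sum after each point update would take O(n²), whereas the update touches only the 2(n − 1) affected terms.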
Data Interpolation. Now we answer the two questions posed at the end
of Section 3. Suppose we have a set of observed samples {X^{t_0}, X^{t_1}, . . . , X^{t_M}}.
Without loss of generality, we can assume the observation times satisfy t_0 = 0 < t_1 <
. . . < t_{M−1} < t_M = 1. We start with Y^{t_0} = X^{t_0} as the initial point set, and run
the synthesis algorithm to simulate its time evolution. By matching the PCFs of
{X^{t_1}, X^{t_2}, . . . , X^{t_M}}, we can obtain a sequence of point sets {Y^{t_1}, Y^{t_2}, . . . , Y^{t_M}}
at all observation times. For each sample X^{t_i}, the goal is to minimize the distance
between g_{X^{t_i}} and g_{Y^{t_i}} defined in (2). Furthermore, if there is more than one
sample observed at time ti , we can extend the objective function in standard
ways, by taking the minimum or average distance from the synthetic point set
to all samples at that time.
Note that in the above approach we can only synthesize point sets at the
observation times {t_0, t_1, . . . , t_M}. But how do we simulate during the time intervals
between successive observations? Suppose we want to generate a point distribution
at time t_i < t < t_{i+1}. Although there is no real data X^t, it is possible to
approximate the PCF g_{X^t} by linear interpolation
g_{X^t} = \frac{t_{i+1} - t}{t_{i+1} - t_i}\, g_{X^{t_i}} + \frac{t - t_i}{t_{i+1} - t_i}\, g_{X^{t_{i+1}}}
It has been shown that such a simple linear interpolation can generate valid
PCFs from which distributions can be synthesized [14]. Thus, we can use the
synthesis algorithm to generate data at any time t0 ≤ t ≤ tM .
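Sampled on a common grid of r values, the interpolation above is a one-liner (a sketch; the function name and argument conventions are ours):

```python
def interpolate_pcf(g_i, g_ip1, t_i, t_ip1, t):
    """Linearly interpolate two PCFs sampled on the same r-grid
    between observation times t_i and t_{i+1}, for t_i <= t <= t_{i+1}."""
    w = (t_ip1 - t) / (t_ip1 - t_i)   # weight of the earlier observation
    return [w * a + (1 - w) * b for a, b in zip(g_i, g_ip1)]
```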
In Section 4, we have seen that the PCF can be used to characterize the distributions
of photoreceptor point sets. However, this function only considers pairwise
correlations and misses higher-order information in the data. As we will show in
Section 6, there are point sets with almost the same PCF but very different shape
features. In this section, we present another way to summarize point distributions,
without correspondences, from a topological perspective.
Alpha Shapes. Suppose we are given a point set and we want to understand the
shape formed by these points. Of course there are many possible interpretations
for the notion of shape, the α-shape being one of them [4]. In geometry, α-shapes
are widely used for shape reconstruction, as they give linear approximations of
the original shape.
The concept of α-shapes applies to point sets in any Euclidean
space R^d, but for our application we illustrate the 2D case. Given a point
set S in R², the α-shape of S is a straight-line graph whose vertices are points of
S and whose edges connect pairs of points that can be touched by the boundary
of an open disk of radius α containing no points of S. The parameter α controls
the desired level of detail in shape reconstruction. For any value of α, the α-
shape is a subgraph of the Delaunay triangulation, and thus it can be computed
in O(n log n) time.
Figure 4 shows the α-shapes for the photoreceptor point sets in Figure 2 with
different values of α. As α increases, we see that edges appear in the graph
and some of them form cycles. For the normal point set, these edges and cycles
disappear very quickly since there is no space for empty disks of large radius α.
In contrast, for the retinitis pigmentosa point set, some cycles persist for a long
time in the large empty regions. Therefore, α-shapes can successfully capture
the hole structures formed by cone migration.
In Figure 4, we see that α = 0.02 mm gives a nice example to distinguish
between the two photoreceptor point sets. However, how do we choose
the right value of α in general? Indeed, what we are really interested in is summarizing
information about α-shapes at different scale levels. So, we next turn to its
topological definition, the α-complex. Given a point set in R^d, the α-complex is
a simplicial subcomplex of its Delaunay triangulation. A simplex of the
Delaunay triangulation appears in the α-complex K(α) if its circumsphere is
empty and has a radius less than α, or if it is a face of another simplex in K(α).
Although we can choose infinitely many values for α, there are only finitely many
α-complexes for a point set S. They are totally ordered by inclusion, giving rise to a
filtration of the Delaunay triangulation K0 = ∅ ⊂ K1 = S ⊂ . . . ⊂ Km = Del(S).
For a point set in R2 , α-complexes consist of vertices, edges, and triangles.
The first non-empty complex K1 is the point set S itself. As α increases, edges
and triangles are added into K(α) until we eventually arrive at the Delaunay
triangulation. The relation between α-shape and α-complex is that the edges in
the α-shape make up the boundary of the α-complex.
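To make the edge condition concrete, here is a brute-force O(n³) sketch that tests, for each pair of points, whether one of the two disks of radius α through both points is empty. It follows the definition directly; the paper's O(n log n) route goes through the Delaunay triangulation instead. All names are ours.

```python
import math

def alpha_shape_edges(points, alpha):
    """Brute-force alpha-shape edge test in R^2: {p, q} is an edge iff some
    open disk of radius alpha whose boundary passes through p and q
    contains no other point of S."""
    edges = []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            (px, py), (qx, qy) = points[i], points[j]
            d = math.hypot(px - qx, py - qy)
            if d == 0 or d > 2 * alpha:
                continue                        # no radius-alpha disk through both
            mx, my = (px + qx) / 2, (py + qy) / 2
            h = math.sqrt(alpha * alpha - (d / 2) ** 2)
            ux, uy = -(qy - py) / d, (qx - px) / d   # unit normal to pq
            # The two candidate disk centers on the perpendicular bisector.
            for cx, cy in ((mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)):
                if all(math.hypot(x - cx, y - cy) >= alpha - 1e-12
                       for k, (x, y) in enumerate(points) if k not in (i, j)):
                    edges.append((i, j))
                    break
    return edges
```

On the unit square, for example, the four sides appear once α exceeds half a side length, while the diagonals never do; shrinking α below 0.5 makes all edges disappear, mirroring the scale behavior described above.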
Persistence. In Figure 4, we have seen that cycles appear and disappear in the
α-complexes during the filtration. The cycles that stay for a while are important
ones since they characterize major shape features of the data set. In algebraic
topology, the cycles are defined based on homology groups: there is one group of
cycles Hd per dimension d, and the rank of Hd is called the d-th Betti number βd
which can be considered as the number of d-dimensional holes in the space [5].
For example in the 2D case, β0 is the number of connected components and β1
is the number of holes in the plane. In the evolution from K0 to Km , adding
an edge will create a new hole (except for n − 1 edges in a spanning tree which
change β0 by merging connected components), while adding a triangle will fill a
hole. The persistence of a hole is the difference between its death time and its
birth time, which are paired following the elder rule.
Given a point set S, the information about persistence of holes can be encoded
into a two-dimensional persistence diagram PS . As depicted in Figure 5, each
point in the diagram represents a hole (or a class of cycles) during the filtration,
where the x and y coordinates are the birth time and death time respectively. In
the normal case all cycles have short persistence, while in the retinitis pigmentosa
case some cycles have very long persistence and they capture the large hole
features in the point set. Note that there are also some cycles with large birth
time and very short persistence (the points near the diagonal). This is because
the holes in the point set may not be perfectly round (such as ellipses), and thus
some cycles can be split by adding long edges at large α. These cycles of short
persistence can be considered as noise and ignored in the analysis of the data.
For a Delaunay triangulation with m simplices, the persistence diagram can
be computed using matrix reduction in O(m³) time. In the 2D case, m = O(n)
and the running time can be reduced to O(nα(n)) using the union-find data
structure [5], where α(n) is the inverse Ackermann function, which grows very
slowly with n. We also apply periodic boundary conditions by computing the
periodic Delaunay triangulation of a point set [2,10].
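For intuition, the 0-dimensional (connected-component) part of this computation can be sketched with union-find: process the edges in order of their filtration value, and each merge kills one class (the elder rule; here all vertices are born at α = 0, as in the filtration above). The input convention and names are ours.

```python
def persistence_0d(n, edges):
    """0-dimensional persistence via union-find. edges is a list of
    (alpha, u, v); all n vertices (components) are born at alpha = 0.
    Returns the death times of the merged classes and the number of
    classes that never die."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    deaths = []
    for alpha, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # one component dies at this alpha
            deaths.append(alpha)
    return deaths, n - len(deaths)
```

Edges that close a cycle (like the third edge below) do not merge components; in the full computation they would instead give birth to 1-dimensional classes.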
There are two distances often used to measure the similarity between persistence
diagrams: the bottleneck and Wasserstein distances [5]. Computing both
distances reduces to the problem of finding an optimal matching in a bipartite
graph. With the optimal matching, we can also interpolate between two persistence
diagrams by linearly interpolating between the matched pairs of points.
However, solving a minimum-cost perfect matching problem in non-Euclidean
spaces takes O(n³) time [11], so we should avoid recomputing this matching
distance after each point update in the synthesis algorithm.
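A sketch of the 1-Wasserstein distance with diagonal matching is below; it brute-forces the optimal matching over permutations, which is only viable for tiny diagrams (real implementations use the O(n³) Hungarian algorithm [11]). The L1 ground metric and all names are our choices.

```python
import itertools

def wasserstein_1(D1, D2):
    """1-Wasserstein distance between persistence diagrams D1, D2
    (lists of (birth, death) points), allowing matches to the diagonal.
    Brute force over permutations; only viable for tiny diagrams."""
    def diag(p):                              # closest diagonal point to p
        m = (p[0] + p[1]) / 2
        return (m, m)
    # Augment: each side gains the diagonal projections of the other side,
    # so every point can be matched either to a real point or to the diagonal.
    A = [(p, False) for p in D1] + [(diag(q), True) for q in D2]
    B = [(q, False) for q in D2] + [(diag(p), True) for p in D1]
    def cost(a, b):
        (p, p_on_diag), (q, q_on_diag) = a, b
        if p_on_diag and q_on_diag:
            return 0.0                        # diagonal-to-diagonal is free
        return abs(p[0] - q[0]) + abs(p[1] - q[1])   # L1 ground metric
    return min(sum(cost(A[i], B[pi]) for i, pi in enumerate(perm))
               for perm in itertools.permutations(range(len(B))))
```

The optimal permutation found here is exactly the matching one would interpolate along when morphing one diagram into the other.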
already know the hole positions. After filling the holes we end up with a blue-noise
pattern. Reversing the sequence of snapshots in Figure 6(c) gives
another example of retinal cell motions in retinitis pigmentosa. Furthermore,
we can start with a point set at any time t and run bidirectional simulations to
synthesize trajectories for the time evolution of this sample.
Running Time. There are four main components of the trajectory synthesis
algorithm (see Table 1). For the PCF, we only need to compute it for the initial and
target point sets, which takes O(n²) time. After that, it takes O(n) time per
point update. For α-shapes, it takes O(n log n) time for the Delaunay triangulation
and persistence matching, as well as O(nα(n)) time for the persistence diagram.
Therefore, the running time of all four parts is almost linear per point
update, and hence the algorithm runs in O(n² log n) time per iteration.
We have also tested the actual running time of each part of the synthesis
algorithm on the photoreceptor data set. The experiment was performed on a
computer with an Intel Core 2 Quad processor Q6600 and 4 GB of memory. In the
current implementation, the periodic Delaunay triangulation is the slowest part
which takes about half of the computation time. However, for each point update
there is no need to recompute the whole Delaunay triangulation, and indeed it
can be maintained in O(log n) expected time per point update [3]. So, by using a
dynamic Delaunay triangulation we can improve the actual running time by almost
a factor of 2, but theoretically the algorithm still takes O(n log n) time per point
update: for persistence matching, the input is the persistence diagram, and it
is not clear how to bound its change after we move a point.
Note that in the initialization part, we may need to interpolate the target
persistence diagram at time t if we do not have the real data at that time. As
mentioned in Section 5, this would take O(n3 ) time. Therefore, if we synthesize
N frames and run L iterations per frame, the total running time is bounded by
O(N(n³ + Ln² log n)). Although the initialization part has a larger theoretical
cost O(n³), in practice the main synthesis part O(Ln² log n) may take longer
because its constant factors are larger. The simulation results
shown in Figure 6(b–c) take about 3450 seconds for initialization and
290 seconds per iteration, with L = 20 iterations per frame. Furthermore,
since the synthesis algorithm is probabilistic, we can use it to generate multiple
trajectories from a data set, while the initialization can be considered a
preprocessing step that only needs to be computed once.
References
1. Carlsson, G.E.: Topology and data. Bulletin of the American Mathematical
Society 46, 255–308 (2009)
2. Caroli, M., Teillaud, M.: Computing 3D periodic triangulations. In: Proceedings
of the European Symposium on Algorithms, pp. 59–70 (2009)
3. Devillers, O., Meiser, S., Teillaud, M.: Fully dynamic Delaunay triangulation in
logarithmic expected time per operation. Computational Geometry: Theory and
Applications 2(2), 55–80 (1992)
4. Edelsbrunner, H.: Alpha shapes — a survey. Tessellations in the Sciences (2011)
5. Edelsbrunner, H., Harer, J.L.: Computational Topology. An Introduction. American
Mathematical Society (2010)
6. Edelsbrunner, H., Morozov, D.: Persistent homology: theory and practice. In:
Proceedings of the European Congress of Mathematics, pp. 31–50 (2012)
7. Illian, J., Penttinen, A., Stoyan, H., Stoyan, D.: Statistical Analysis and Modelling
of Spatial Point Patterns. Wiley Interscience (2008)
8. Ji, Y., Zhu, C.L., Grzywacz, N.M., Lee, E.-J.: Rearrangement of the cone mosaic
in the retina of the rat model of retinitis pigmentosa. The Journal of Comparative
Neurology 520(4), 874–888 (2012)
9. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing.
Science 220(4598), 671–680 (1983)
10. Kruithof, N.: 2D periodic triangulations. CGAL User and Reference Manual (2009)
11. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research
Logistics 2(1-2), 83–97 (1955)
12. Lagae, A., Dutré, P.: A comparison of methods for generating Poisson disk
distributions. Computer Graphics Forum 27(1), 114–129 (2008)
13. Lee, E.-J., Ji, Y., Zhu, C.L., Grzywacz, N.M.: Role of Müller cells in cone mosaic
rearrangement in a rat model of retinitis pigmentosa. Glia 59(7), 1107–1117 (2011)
14. Öztireli, A.C., Gross, M.: Analysis and synthesis of point distributions based on
pair correlation. Transactions on Graphics 31(6), 170 (2012)
15. Schlömer, T., Deussen, O.: Accurate spectral analysis of two-dimensional point
sets. Journal of Graphics, GPU, and Game Tools 15(3), 152–160 (2011)
16. Vicsek, T., Czirók, A., Ben-Jacob, E., Cohen, I., Shochet, O.: Novel type of
phase transition in a system of self-driven particles. Physical Review Letters 75(6),
1226–1229 (1995)
17. Wilkinson, L., Anand, A., Grossman, R.: Graph-theoretic scagnostics. In:
Proceedings of the Symposium on Information Visualization, pp. 157–164 (2005)
18. Yanoff, M., Duker, J.S.: Ophthalmology. Elsevier (2009)
19. Yellott, J.I.: Spectral consequences of photoreceptor sampling in the rhesus retina.
Science 221(4608), 382–385 (1983)
A Graph Modification Approach for Finding
Core–Periphery Structures in Protein Interaction
Networks
1 Introduction
A fundamental task in the analysis of PPI networks is the identification of protein
complexes and functional modules. Herein, a basic assumption is that complexes
in a PPI network are strongly connected among themselves and weakly connected
to other complexes [22]. This assumption is usually too strict. To obtain a more
realistic network model of protein complexes, several approaches incorporate the
core–attachment model of protein complexes [12]: In this model, a complex is
conjectured to consist of a stable core plus some attachment proteins, which
interact with the core only transiently. In graph-theoretic terms, the core thus is a
dense subnetwork of the PPI network. The attachment (or: periphery) is less
dense, but has edges to one or more cores.
Current methods employing this type of modeling are based on seed growing
[18, 19, 23]. Here, an initial set of promising small subgraphs is chosen as
cores. Then, each core is separately and greedily expanded into cores and attachments
so as to satisfy some objective function. The aim of these approaches is to
predict protein complexes [18, 23] or to reveal biological features that are correlated
with topological properties of core–periphery structures in networks [19].
Supported by DFG project ALEPH (HU 2139/1).
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 340–351, 2014.
© Springer-Verlag Berlin Heidelberg 2014
Further related work. The related optimization problem Split Editing asks to
transform a graph into a (single) split graph by at most k edge modifications.
Split Editing is, somewhat surprisingly, solvable in polynomial time [13]. Another
approach to fitting PPI networks to specific graph classes was proposed
by Zotenko et al. [25], who find for a given PPI network a close chordal graph,
that is, a graph without induced cycles of length four or more. The modification
operation there is insertion of edges.
Fig. 1. The forbidden induced subgraphs for split graphs (2K2 , C4 , and C5 ) and for
split cluster graphs (C4 , C5 , P5 , necktie, and bowtie)
Proof. Let G contain the 2K2 {x1, x2}, {y1, y2} as an induced subgraph. Without
loss of generality, let the shortest path between any xi, yj be P = (x1 =
p1, p2, . . . , pk = y1). Clearly, k > 2. If k = 3, then x1 and y1 are both adjacent
to p2. Otherwise, if k = 4, then {x2, x1 = p1}, {p3, p4 = y1} is a 2K2 and x1
and p3 are both adjacent to p2. Finally, if k > 4, then P contains a P5 as an induced
subgraph. The four outer vertices of this P5 induce a 2K2 whose K2's each
contain a neighbor of the middle vertex.
Proof. Let G be a split cluster graph, that is, every connected component is a
split graph. Clearly, G does not contain a C4 or C5 . If a connected component
of G contains a P5 , then omitting the middle vertex of the P5 yields a 2K2 , which
contradicts that the connected component is a split graph. The same argument
shows that the graph cannot contain a necktie or bowtie.
Conversely, let G be (C4 , C5 , P5 , necktie, bowtie)-free. Clearly, no connected
component contains a C4 or C5 . Assume for a contradiction that a connected
344 S. Bruckner, F. Hüffner, and C. Komusiewicz
component contains a 2K2 consisting of the K2 ’s {a, b} and {c, d}. Then accord-
ing to Lemma 2 there is a vertex v which is, without loss of generality, adjacent
to a and c. If no other edges between the 2K2 and v exist, then {a, b, v, c, d} is
a P5 . Adding exactly one of {b, v} or {d, v} creates a necktie, and adding both
edges results in a bowtie. No other edges are possible, since there are no edges
between {a, b} and {c, d}.
3 Solution Approaches
Integer Linear Programming. We experimented with a formulation based directly
on the forbidden subgraphs for split cluster graphs (Theorem 1). However,
we found a formulation based on the following observation to be faster in
practice, and moreover also applicable to Monopolar Editing: if we correctly
guess the partition into clique and independent-set vertices, we obtain a simpler
characterization of split cluster graphs by forbidden subgraphs.
Lemma 3. Let G = (V, E) be a graph and C ∪˙ I = V a partition of the vertices.
Then G is a split cluster graph with core vertices C and periphery vertices I if
and only if it does not contain an edge with both endpoints in I, nor an induced
P3 with both endpoints in C.
Proof. “⇒”: Clearly, if there is an edge with both endpoints in I or an induced
P3 with both endpoints in C, then I is not an independent set or C does not
form a clique in each connected component, respectively.
“⇐”: We again use contraposition. If G is not a split cluster graph with core
vertices C and periphery vertices I, then it must contain an edge with both
endpoints in I, or C ∩ H does not induce a clique for some connected component
H of G. In the first case we are done; in the second case, there are two
vertices u, v ∈ C in the same connected component with {u, v} ∉ E. Consider
a shortest path u = p1, . . . , pl = v from u to v. If it contains a periphery vertex
pi ∈ I, then pi−1, pi, pi+1 forms a forbidden subgraph. Otherwise, p1, p2, p3
is one.
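Lemma 3 translates directly into a simple check, given the partition; below is a sketch using adjacency sets (all names are ours).

```python
from itertools import combinations

def is_split_cluster(adj, C, I):
    """Lemma 3 check: G (given as adjacency sets) is a split cluster graph
    with core vertices C and periphery vertices I iff there is no edge
    inside I and no induced P3 whose two endpoints both lie in C."""
    if any(adj[u] & I for u in I):              # edge with both endpoints in I
        return False
    for w in adj:                               # middle vertex of a potential P3
        for u, v in combinations(sorted(adj[w] & C), 2):
            if v not in adj[u]:                 # u - w - v induced, u, v in C
                return False
    return True
```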
With a very similar proof, we can get a simpler set of forbidden subgraphs for
annotated monopolar graphs.
Lemma 4. Let G = (V, E) be a graph and C ∪˙ I = V a partition of the vertices.
Then G is a monopolar graph with core vertices C and periphery vertices I if
and only if it does not contain an edge with both endpoints in I, nor an induced
P3 whose vertices are contained in C.
Proof. “⇒”: Easy to see as in Lemma 3.
“⇐”: If G is not monopolar with core vertices C and periphery vertices I, then
it must contain an edge with both endpoints in I, or C does not induce a cluster
graph. In the first case we are done; in the second case, there is a P3 with all
vertices in C, since that is the forbidden subgraph for cluster graphs.
Data Reduction. Data reduction (preprocessing) proved very effective for solving
Cluster Editing optimally [3, 4]. Indeed, any instance can be reduced
to one with at most 2k vertices [7], where k is the number of edge modifications.
Unfortunately, the data reduction rules we devised for Split Cluster Editing
were not applicable to our real-world test instances. However, a simple observation
allows us to fix the values of some variables of Eqs. (1) to (3) in the Split
Cluster Editing ILP: if a vertex u has only one neighbor v and
deg(v) > 1, then set cu = 0 and euw = 0 for all w ≠ v. Since our instances have
many degree-one vertices, this considerably reduces the size of the ILPs.
Heuristics. The integer linear programming approach is not able to solve the
hardest of our instances. Thus, we employ the well-known simulated annealing
heuristic, a local search method. For Split Cluster Editing, we start with a clustering
in which each vertex is a singleton. As random modification, we move a vertex to a
cluster that contains one of its neighbors. Since this allows only a decrease in the
number of clusters, we also allow moving a vertex into an empty cluster. For a fixed
clustering, the optimal number of modifications can be computed in linear time by
counting the edges between clusters and computing for each cluster a solution for
Split Editing in linear time [13]. For Monopolar Editing, we also allow moving
a vertex into the independent set. Here, the optimal number of modifications
for a fixed clustering can also be calculated in linear time: all edges in the
independent set are deleted, all edges between clusters are deleted, and all missing
edges within clusters are added.
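The cost computation for a fixed Monopolar Editing assignment can be sketched as follows; missing intra-cluster edges are counted arithmetically rather than pair by pair, keeping the computation near-linear. Names and input conventions are ours.

```python
def monopolar_editing_cost(edges, clusters, independent):
    """Optimal number of edge modifications for a fixed assignment
    (cf. the text): delete edges inside the independent set, delete
    edges between clusters, insert the missing edges within clusters."""
    where = {}
    for idx, cl in enumerate(clusters):
        for v in cl:
            where[v] = idx
    deletions = 0
    internal = [0] * len(clusters)      # existing edges inside each cluster
    for u, v in edges:
        if u in independent and v in independent:
            deletions += 1              # edge inside the independent set
        elif u in where and v in where:
            if where[u] == where[v]:
                internal[where[u]] += 1
            else:
                deletions += 1          # edge between two different clusters
        # edges between a cluster and the independent set are allowed
    insertions = sum(len(cl) * (len(cl) - 1) // 2 - internal[i]
                     for i, cl in enumerate(clusters))
    return deletions + insertions
```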
4 Experimental Results
We test exact algorithms and heuristics for Split Cluster Editing (SCE) and
Monopolar Editing (ME) on several PPI networks, and perform a biological
Table 1. Network statistics. Here, n is the number of proteins, without singletons, and m
is the number of interactions; nlcc and mlcc are the number of proteins and interactions in
the largest connected component; C is the number of CYC2008 complexes with at least
50% and at least three proteins in the network, p is the number of network proteins that
do not belong to these complexes, and AC is the average complex size. Finally, ig is the
number of genetic interactions between proteins without physical interaction.
                 n     m  nlcc  mlcc   C    p    AC    ig
cell cycle     196   797   192   795   7  148  21.8  1151
transcription  215   786   198   776  11   54  28.0  1479
translation    236  2352   186  2351   5   88  29.8   174
evaluation of the modules found. We use two known methods for comparison. The
algorithm by Luo et al. [19] (“Luo” for short) produces clusters with core and
periphery, like SCE, but the clusters may overlap and might not cover the whole graph.
The SCAN algorithm [24], like ME, partitions the graph vertices into “clusters”,
which we interpret as cores, and “hubs” and “outliers”, which we interpret as
periphery.
Biological evaluation. We evaluate our results using the following measures. First,
we examine the coherence of the GO terms in our modules using the semantic
similarity score calculated by G-SESAME [8]. We use this score to test the hypothesis
that the cores are more stable than the peripheries. If the hypothesis is true, then
the pairwise similarity score within the core should be higher than in the periphery.
We test only terms relating to process, not function, since proteins in the same
complex play a role in the same biological process. Since Monopolar Editing
and SCAN return multiple cores and only a single periphery, we assign to each
cluster C its neighborhood N(C) as periphery. We consider only clusters with at
least two core vertices and one periphery vertex.
Next, we compare the resulting clusters with known protein complexes from
the CYC2008 database [20]. Since the networks we analyze are subnetworks of the
larger yeast network, we discard for each network the CYC2008 complexes that
have less than 50% of their vertices in the current subnetwork, restrict them to
proteins contained in the current subnetwork, and then discard those with fewer
than three proteins. We expect that the cores mostly correspond to complexes and
that the periphery may contain complex vertices plus further vertices.
Finally, we analyze the genetic interactions between and within modules. Ideally,
we would obtain significantly more genetic interactions outside of cores than within
them. This is supported by the between pathways model [16], which proposes that
different complexes can back one another up, thus disabling one would not harm
the cell, but disabling both complexes would reduce its fitness or kill it. Here, when
counting genetic interactions, we are interested only in genetic interactions that
occur between proteins that do not physically interact.
4.2 Results
Our results are summarized in Table 2. For Split Cluster Editing, the ILP
approach failed to solve the cell cycle and transcription networks, and for Monopolar
Editing, it failed to solve the transcription network, with CPLEX running out of
memory in each case. The fact that more instances were solved for the “harder”
problem ME could be explained by the fact that the number k of necessary
modifications is much lower, which could reduce the size of the branch-and-bound tree.
For the three optimally solved instances, the heuristic also finds the optimal solution
within one minute for two of them, but for the last one (ME transcription) only after
several hours; after one minute, its solution is 2.9% too large. This indicates that the
heuristic gives good results, and in the following, we use the heuristic solution for the
three instances not solvable by ILP.
Table 3 gives an overview of the results. We say that a cluster is interesting if it
contains at least two vertices in the core and at least one in the periphery. In the
cell cycle network (see Fig. 2), the SCE solution identifies ten interesting clusters,
along with four clusters containing only cores, and some singletons. As expected,
the GO term coherence is higher in the periphery than in the core for only one of
the ten clusters (for two more the scoring tool does not return a result).
Following our hypothesis, we say that a complex is detected by a cluster if at least
50% of the core belongs to the complex and at least 50% of the complex belongs to
the cluster. Out of the seven complexes, three are detected without any error, and
one is detected with an error of two additional proteins in the core that are not in
the complex. The periphery contains between one and eight extra proteins that are
not in the complex (which is allowed by our hypothesis).
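The detection criterion above is a two-line predicate (a sketch; the protein names in any example are hypothetical):

```python
def detects(core, cluster, complex_):
    """A cluster detects a complex if at least 50% of its core lies in the
    complex and at least 50% of the complex lies in the cluster.
    Integer arithmetic avoids floating-point comparison."""
    core, cluster, complex_ = set(core), set(cluster), set(complex_)
    return (2 * len(core & complex_) >= len(core)
            and 2 * len(complex_ & cluster) >= len(complex_))
```

Note the asymmetry: the core must mostly consist of complex proteins, while the whole cluster (core plus periphery) only needs to cover the complex, which is what allows extra periphery proteins without counting as errors.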
Table 2. Experimental results. Here, K is the number of clusters with at least two vertices
in the core and at least one in the periphery, p is the size of the periphery, k is the number
of edge modifications, and ct, cc, and cp are the average coherence within the cluster,
core, and periphery, respectively.
Table 3. Experimental results for the complex test. Here, D is the number of detected
complexes (≥ 50% of the core contained in the complex and ≥ 50% of the complex
contained in the cluster); core% is, among the detected complexes, the median percentage
of core vertices that are in the complex; and comp% is the median percentage of complex
proteins that are in the cluster.
The Monopolar Editing result contains more interesting clusters (24) than SCE.
Compared to SCE, clusters are on average smaller and have a smaller core, but
about the same periphery size (recall that a periphery vertex may occur in more than
one cluster). ME detects the same complexes as SCE, plus one additional complex.
SCAN identifies 7 hubs and 41 outliers, which then comprise the periphery.
SCAN fails to detect one of the complexes ME finds. It also has slightly more errors,
for example three extra proteins in the core of the anaphase-promoting
complex plus one missing. Luo identifies only large clusters (this is true for all
subnetworks we tested). It detects the same complexes as ME, but also finds more extra
vertices in the cores.
In the transcription network, the GO-term analysis shows a similar pattern:
Luo has worse coherence, but all methods show less coherence in the
peripheries than in the cores. The ME method is the clear winner here,
detecting all 11 complexes with generally fewer errors.
In the translation network, SCE and ME find about the same number of interesting
clusters (22 and 24) and detect the same four complexes. The SCAN algorithm
does not seem to handle this network well, since it finds only two interesting
clusters and does not detect any complex. Luo finds only four interesting clusters,
corresponding to the four complexes also detected by SCE and ME; this might also
explain why it has the best coherence values here.
(a) Complexes (b) SCE cores (c) Monopolar (d) SCAN (e) Luo
Fig. 2. Results of the four algorithms on the cell-cycle network. The periphery is in white,
remaining vertices are colored according to their clusters.
Experiments conclusion. The coherence values for cores and peripheries indicate
that a division of clusters into core and periphery makes sense. In detecting
complexes, the ME method does best (20 detected), followed by SCE and Luo (15 each),
and finally SCAN (12). This indicates that the model in which peripheries are shared
is superior. Note, however, that SCE is at a disadvantage in this evaluation, since it
can use each protein as periphery only once, while having large peripheries makes
it easier to count a complex as detected.
5 Outlook
There are many further variants of our models that could possibly yield better bio-
logical results or have algorithmic advantages. For instance, one could restrict the
cores to have a certain minimum size. Also, instead of using split graphs as a core–
periphery model, one could resort to dense split graphs [5] in which every periphery
vertex is adjacent to all core vertices. Finally, one could allow some limited amount
of interaction between periphery vertices.
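The dense split graph condition can be checked directly from an adjacency-set representation. This is a small illustrative sketch (the representation and the function name are our own, not from [5]):

```python
from itertools import combinations

def is_dense_split(adj, core, periphery):
    """Dense split graph check: the core induces a clique, the periphery
    an independent set, and every periphery vertex is adjacent to all
    core vertices.  `adj` maps each vertex to its set of neighbors."""
    clique = all(v in adj[u] for u, v in combinations(core, 2))
    independent = all(v not in adj[u] for u, v in combinations(periphery, 2))
    covered = all(c in adj[p] for p in periphery for c in core)
    return clique and independent and covered
```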
References
[1] Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. Journal
of Computational Biology 6(3-4), 281–297 (1999)
[2] Berger, A.J.: Minimal forbidden subgraphs of reducible graph properties. Discus-
siones Mathematicae Graph Theory 21(1), 111–117 (2001)
[3] Böcker, S., Baumbach, J.: Cluster editing. In: Bonizzoni, P., Brattka, V., Löwe, B.
(eds.) CiE 2013. LNCS, vol. 7921, pp. 33–44. Springer, Heidelberg (2013)
A Graph Modification Approach for Finding Core–Periphery Structures 351
[4] Böcker, S., Briesemeister, S., Klau, G.W.: Exact algorithms for cluster editing: Eval-
uation and experiments. Algorithmica 60(2), 316–334 (2011)
[5] Borgatti, S.P., Everett, M.G.: Models of core/periphery structures. Social Net-
works 21(4), 375–395 (1999)
[6] Chatr-aryamontri, A., et al.: The BioGRID interaction database: 2013 update. Nu-
cleic Acids Research 41(D1), D816–D823 (2013)
[7] Chen, J., Meng, J.: A 2k kernel for the cluster editing problem. Journal of Computer
and System Sciences 78(1), 211–220 (2012)
[8] Du, Z., Li, L., Chen, C.-F., Yu, P.S., Wang, J.Z.: G-SESAME: web tools for GO-
term-based gene similarity analysis and knowledge discovery. Nucleic Acids Re-
search 37(suppl. 2), W345–W349 (2009)
[9] Farrugia, A.: Vertex-partitioning into fixed additive induced-hereditary properties
is NP-hard. The Electronic Journal of Combinatorics 11(1), R46 (2004)
[10] Foldes, S., Hammer, P.L.: Split graphs. Congressus Numerantium 19, 311–315 (1977)
[11] Fomin, F.V., Kratsch, S., Pilipczuk, M., Pilipczuk, M., Villanger, Y.: Subexponen-
tial fixed-parameter tractability of cluster editing. CoRR, abs/1112.4419 (2011)
[12] Gavin, A.-C., et al.: Proteome survey reveals modularity of the yeast cell machinery.
Nature 440(7084), 631–636 (2006)
[13] Hammer, P.L., Simeone, B.: The splittance of a graph. Combinatorica 1(3), 275–284
(1981)
[14] Heggernes, P., Kratsch, D.: Linear-time certifying recognition algorithms and for-
bidden induced subgraphs. Nordic Journal of Computing 14(1-2), 87–108 (2007)
[15] Impagliazzo, R., Paturi, R., Zane, F.: Which problems have strongly exponential
complexity? Journal of Computer and System Sciences 63(4), 512–530 (2001)
[16] Kelley, R., Ideker, T.: Systematic interpretation of genetic interactions using protein
networks. Nature Biotechnology 23(5), 561–566 (2005)
[17] Komusiewicz, C., Uhlmann, J.: Cluster editing with locally bounded modifications.
Discrete Applied Mathematics 160(15), 2259–2270 (2012)
[18] Leung, H.C., Xiang, Q., Yiu, S.-M., Chin, F.Y.: Predicting protein complexes from
PPI data: a core-attachment approach. Journal of Computational Biology 16(2),
133–144 (2009)
[19] Luo, F., Li, B., Wan, X.-F., Scheuermann, R.: Core and periphery structures in pro-
tein interaction networks. BMC Bioinformatics 10(Suppl. 4), S8 (2009)
[20] Pu, S., Wong, J., Turner, B., Cho, E., Wodak, S.J.: Up-to-date catalogues of yeast
protein complexes. Nucleic Acids Research 37(3), 825–831 (2009)
[21] Shamir, R., Sharan, R., Tsur, D.: Cluster graph modification problems. Discrete
Applied Mathematics 144(1-2), 173–182 (2004)
[22] Spirin, V., Mirny, L.A.: Protein complexes and functional modules in molecular net-
works. PNAS 100(21), 12123–12128 (2003)
[23] Wu, M., Li, X., Kwoh, C.-K., Ng, S.-K.: A core-attachment based method to detect
protein complexes in PPI networks. BMC Bioinformatics 10(1), 169 (2009)
[24] Xu, X., Yuruk, N., Feng, Z., Schweiger, T.A.J.: SCAN: a structural clustering algo-
rithm for networks. In: Proc. 13th KDD, pp. 824–833. ACM (2007)
[25] Zotenko, E., Guimarães, K.S., Jothi, R., Przytycka, T.M.: Decomposition of overlap-
ping protein complexes: a graph theoretical method for analyzing static and dynamic
protein associations. Algorithms for Molecular Biology 1(7) (2006)
Interpretable Per Case Weighted Ensemble
Method for Cancer Associations
Abstract. Over the past decades, biology has transformed into a high-
throughput research field, both in terms of the number of different
measurement techniques and the number of variables measured by each
technique (e.g., from Sanger sequencing to deep sequencing), and is
increasingly targeted at individual cells [3]. This has led to an un-
precedented growth of biological information. Consequently, techniques
that help researchers find the important insights in the data are be-
coming more and more important. Molecular measurements from cancer
patients, such as gene expression and DNA methylation, are usually very
noisy. Furthermore, cancer types can be very heterogeneous. Therefore,
one of the main assumptions of machine learning, that the underlying
unknown distribution is the same for all samples in training and test
data, might not be completely fulfilled.
In this work, we introduce a method that is aware of this potential bias
and utilizes an estimate of the differences during the generation of the
final prediction method. For this, we introduce a set of sparse classifiers
based on L1-SVMs [1], under the constraint that the classifiers use
disjoint sets of features. Furthermore, for each feature chosen by one of
the classifiers, we introduce a regression model based on Gaussian process
regression that uses additional features. For a given test sample, we can
then use these regression models to estimate, for each classifier, how
well its features are predictable by the corresponding Gaussian process
regression model. This information is then used for a confidence-based
weighting of the classifiers for the test sample. Schapire and Singer
showed that incorporating confidences of classifiers can improve the
performance of an ensemble method [2]. However, in their setting the
confidences of the classifiers are estimated on the training data and are
thus fixed for all test samples, whereas in our setting we estimate the
confidence of each individual classifier per given test sample.
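The per-sample weighting step can be illustrated schematically. In the sketch below (our own simplification, not the authors' implementation), each classifier casts a ±1 vote, and its weight decays with the reconstruction error of its features under the corresponding regression model for this particular test sample; the temperature `tau` is a hypothetical parameter:

```python
import math

def per_sample_ensemble(predictions, recon_errors, tau=1.0):
    """Per-test-sample weighted vote.  `predictions[i]` is classifier i's
    label in {-1, +1}; `recon_errors[i]` is how poorly classifier i's
    features are predicted by its regression model for THIS sample
    (larger error -> lower confidence)."""
    weights = [math.exp(-e / tau) for e in recon_errors]
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score >= 0 else -1
```

Classifiers whose features are well explained for the test sample at hand dominate the vote, so the effective ensemble changes from sample to sample.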
In our evaluation, the new method achieved state-of-the-art perfor-
mance on many different cancer data sets with measured DNA methy-
lation or gene expression. Moreover, we developed a method to visualize
our learned classifiers to find interesting associations with the target la-
bel. Applied to a leukemia data set, we found several ribosomal proteins
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 352–353, 2014.
© Springer-Verlag Berlin Heidelberg 2014
Interpretable Per Case Weighted Ensemble Method for Cancer Associations 353
References
1. Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and
support vector machines. In: ICML, vol. 98, pp. 82–90 (1998)
2. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated
predictions. Machine Learning 37(3), 297–336 (1999)
3. Shapiro, E., Biezuner, T., Linnarsson, S.: Single-cell sequencing-based technologies
will revolutionize whole-organism science. Nat. Rev. Genet. 14(9), 618–630 (2013)
Reconstructing Mutational History in Multiply Sampled Tumors Using Perfect Phylogeny Mixtures
1 Introduction
Cancer is an evolutionary process, where somatic mutations accumulate in a
population of cells during the lifetime of an individual. The clonal theory of
cancer posits that all cells in a tumor are descended from a single founder cell,
and that selection for advantageous mutations and clonal expansions of cells
containing these mutations leads to uncontrolled growth of a tumor [16]. Tradi-
tional genome-wide profiling, using comparative genomic hybridisation (CGH),
initially shed light on cancer progression. Based on CGH data, mathematical
models such as oncogenetic tree models [2] were developed to describe the
pathways and the order of somatic alterations in cancer genomes. In recent
years, high-throughput DNA sequencing technologies have enabled large-scale
measurement of somatic mutations in many cancer genomes [4,12,14,22].
These new data have led to much interest in modeling the mutational process
within a tumor and reconstructing the history of somatic mutations [10,20].
At first glance, the problem of reconstructing the history of somatic mutations
is a phylogenetic problem, where the “species” are the individual cells in the
D. Brown and B. Morgenstern (Eds.): WABI 2014, LNBI 8701, pp. 354–367, 2014.
© Springer-Verlag Berlin Heidelberg 2014
tumor, and the “characters” are the somatic mutations. Since the number of
single-nucleotide mutations in a tumor is a very small percentage of the number
of positions in the genome, one may readily assume that somatic mutations
follow the infinite sites assumption whereby a mutation occurs at a particular
locus only once during the course of tumor evolution. Moreover, we may assume
that mutations have one of two possible states: 0 = normal, and 1 = mutated.
Under these assumptions the mutational process in a tumor follows a perfect
phylogeny (Figure 1(a)). Reconstructing perfect phylogenies has been extensively
studied, and many algorithms to construct perfect phylogenies from data have
been developed. See [6] for a survey.
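For binary characters with an all-zeros ancestral state, the standard characterization (see, e.g., Gusfield [8,9]) is that a mutation matrix admits a perfect phylogeny if and only if no pair of columns exhibits all three of the row-patterns (0,1), (1,0) and (1,1). A minimal sketch of this pairwise conflict test (an illustrative helper, not code from the paper):

```python
from itertools import combinations

def is_conflict_free(M):
    """Return True iff the binary matrix M (list of 0/1 rows) admits a
    perfect phylogeny with an all-zeros root: no two columns may contain
    all three of the row-pairs (0,1), (1,0) and (1,1)."""
    n_cols = len(M[0])
    for i, j in combinations(range(n_cols), 2):
        pairs = {(row[i], row[j]) for row in M}
        if {(0, 1), (1, 0), (1, 1)} <= pairs:
            return False
    return True
```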
Nearly all cancer sequencing studies to date do not measure somatic muta-
tions in single cells, but rather sequence DNA from a tumor sample containing
thousands to millions of tumor cells, because of technological and cost issues
in single-cell sequencing [23,11,5]. Reconstructing the history of somatic
mutations from a single sample is very different from a traditional phylogenetic
problem, as the data come not from individual species, but from a mixture of
all species. Thus, researchers have instead focused on identifying subpopulations
of tumor cells that share somatic mutations by clustering mutations according to
their inferred frequencies within the tumor [3,15,19,10]. In contrast with the tra-
ditional perfect phylogeny problems, these works instead solve a deconvolution
problem.
A few recent studies have sequenced multiple spatially distinct samples from
the same tumor [7,18]. This data presents an interesting intermediate between
the perfect phylogeny problem – where characters are measured in individual
species – and the single sample problem – where the goal is to deconvolve a
mixture. Other related recent studies include computational methods to infer
tumor phylogenies, with the goal of improving single nucleotide variant (SNV)
calling [17].
In this paper, we formulate a hybrid problem, the Perfect Phylogeny Mixture
problem. In this problem, we are given a collection of samples, each of which
is a superposition of the characters from a subset of species. The problem is
to reconstruct the state of each character in each species so that the resulting
species satisfy a perfect phylogeny. Note that, similar to some of the studies we
mentioned above (e.g. [3,15,19,10]), we also restrict our problem formulation to
single nucleotide variant (SNV) markers. Although consideration of additional
somatic events such as copy number variations (CNVs) might be informative, the
perfect phylogeny assumptions do not apply to CNVs. We demonstrate that one
instance of this problem, using a cost function based on parsimony, is NP-hard,
by using a reduction from the problem of finding the minimum vertex coloring
of a graph. Finally, we develop an algorithm to solve the problem.
In this section, we first define the Perfect Phylogeny Mixture problem, and then
formulate the Minimum-Split-Row problem as a specific optimization version of it.
356 I. Hajirasouliha and B.J. Raphael
Fig. 1. (a) A perfect phylogeny with 3 mutations m1, m2 and m3. The leaves are
assigned binary representations 110, 010 and 001, representing the mutation sets
{m1, m2}, {m2} and {m3}, respectively. (b) Taking 4 different samples from the
tumor phylogeny; the binary representations of the mutation sets in samples 1, 2, 3
and 4 are 111, 110, 010 and 001, respectively. Multiple samples taken from the tumor
may overlap and create conflicts in observed mutations.
Note that for any given binary matrix M with m rows and n columns, the n × n
identity matrix In is always a trivial solution for the Perfect Phylogeny Mix-
ture problem. In what follows, we consider optimization versions of the Perfect
Phylogeny Mixture problem and define the Minimum-Split-Row problem.
We use Split-Row operations to distinguish distinct leaves in a perfect phy-
logeny model whose corresponding subpopulations were mixed with each other in
one sample. This mixture may cause conflicts in the input mutation matrix, and
we ask to perform Split-Row operations that convert the input mutation matrix
into a conflict-free one. For a Split-Row operation Sr that replaces row r by k
new rows, we define the cost function γ(M, r) = k − 1, the number of additional
rows added to M. An alternative cost function η(M, r) is the number of additional
rows that are not identical to any of the original rows in M after the Split-Row
operation Sr. Note that we only discuss the problem under the cost function γ;
we leave the study of the problem under the cost function η to future work.
The following matrix on the left shows an example where a conflict exists
between columns m2 and m3. After performing a Split-Row operation on the
first row of the matrix, two new rows are created. The resulting matrix on the
right is conflict-free and thus corresponds to a perfect phylogeny tree. In this
example, γ(M, 1) = 1, while η(M, 1) = 0.
      m1 m2 m3                            m1 m2 m3
    [ 1  1  1 ]   Split-Row operation   [ 1  1  0 ]  (first new row)
    [ 1  1  0 ]   on the first row      [ 0  0  1 ]  (second new row)
    [ 0  0  1 ]          ⇒              [ 1  1  0 ]
    [ 0  1  0 ]                         [ 0  0  1 ]
                                        [ 0  1  0 ]
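The example can be replayed programmatically. The sketch below (illustrative helper names, 0-indexed rows) performs the split and evaluates both cost functions:

```python
def split_row(M, r, replacements):
    """Replace row r (0-indexed) of the mutation matrix M by the given
    replacement rows."""
    return M[:r] + replacements + M[r + 1:]

def gamma(replacements):
    """Cost gamma: number of additional rows (k - 1 for a split into k rows)."""
    return len(replacements) - 1

def eta(M, replacements):
    """Cost eta: replacement rows not identical to any original row of M."""
    return sum(1 for row in replacements if row not in M)

M = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]   # matrix on the left
new_rows = [[1, 1, 0], [0, 0, 1]]                  # split of the first row
```

Here gamma is 1 (one additional row) and eta is 0 (both new rows already occur in M), matching the values in the text.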
Proof. See the following example in which the Split-Row operation on the first
row creates new conflicts between the first and second columns, if (1, 1, 1, 1) is
replaced by (1, 0, 1, 0) and (0, 1, 0, 1). In this case, to avoid creating new conflicts,
the Split-Row operation must replace (1, 1, 1, 1) with (0, 0, 1, 1) and (1, 1, 0, 0).
        [ 1 1 1 1 ]
        [ 1 1 0 0 ]
    M = [ 0 0 1 1 ]
        [ 1 1 1 0 ]
        [ 0 1 1 1 ]
Proof. First note that, in the given mutation matrix, if the conflict graph of
a particular row r is not empty, then we have to perform the Split-Row
operation on r. If two columns of r are in conflict, then performing the Split-
Row operation on other rows does not affect the conflicts in r. Moreover, if r is
replaced by fewer than χ(GM,r ) rows, then by the pigeonhole principle there
would still be a row with conflicting columns.
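The lower bound can be explored concretely by building the conflict graph of a row. The sketch below follows our reading of the definition (vertices are the columns carrying a 1 in row r; edges join pairs of those columns that are in conflict in M); it is an illustrative helper, not the authors' code:

```python
from itertools import combinations

def in_conflict(M, i, j):
    """Columns i and j are in conflict if the rows exhibit all three of
    the pairs (0,1), (1,0) and (1,1)."""
    pairs = {(row[i], row[j]) for row in M}
    return {(0, 1), (1, 0), (1, 1)} <= pairs

def conflict_graph(M, r):
    """Conflict graph of row r (0-indexed): vertices are the columns with
    a 1-entry in row r, edges join such columns that are in conflict."""
    verts = [j for j, x in enumerate(M[r]) if x == 1]
    edges = {(i, j) for i, j in combinations(verts, 2) if in_conflict(M, i, j)}
    return verts, edges
```

On the matrix M from the example above, the conflict graph of the all-ones first row is bipartite ({columns 1, 2} vs. {columns 3, 4}), so its chromatic number is 2, matching the split into (0, 0, 1, 1) and (1, 1, 0, 0).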
Theorem 4. Given a simple graph G with n vertices and m edges, there exists
a binary matrix M with n columns and 2m + 1 rows, such that all entries of its
first row are equal to 1 and GM,1 = G.
Proof. Let V (G) = {v1 , · · · , vn } be the vertex set of G, and E(G) = {e1 , · · · , em }
be the edge set of G. Assume for any 1 ≤ k ≤ m that ek = (xk , yk ), that is, the
edge ek connects vertices xk and yk (see Figure 2(a) for an example of such
a conflict graph with 5 vertices and 4 edges). We construct a mutation matrix
with n columns and 2m + 1 rows such that each column corresponds to a vertex
in G. All entries in the first row are equal to 1, and corresponding to each edge
ek (1 ≤ k ≤ m), we have two rows (i.e. rows 2k and 2k + 1) in M . For each
k (1 ≤ k ≤ m), the entries of rows 2k and 2k + 1 are determined as follows. For
row 2k of the matrix, we place a 0 at entry xk and a 1 at entry yk ,
while for row 2k + 1, we place a 1 at entry xk and a 0 at entry yk .
Since all the entries of the first row are equal to 1, this configuration leads to a
conflict between any pair of columns xk and yk that corresponds to an edge ek
in G. We name the above Step 1 of our construction.
Now, we fill the rest of the matrix with 0’s and 1’s such that we do not create
new conflicts among the columns that were not originally in conflict. In order to
guide filling of the other entries of the matrix so that no new conflict is created,
we maintain a principle that if two columns are not in conflict (i.e. there is no
edge between the corresponding vertices in G) then the column on the right must
contain the column on the left. That is, for each row, the entry of the column
on the right is greater or equal to the entry of the column on the left. In other
Fig. 2. (a) A conflict graph G on vertices c1, . . . , c5. (b) An underlying containment
graph, all of whose edges are oriented from left to right, corresponding to a row of a
mutation matrix that is known to be all 1's. Its edge set is the complement of the edge
set of G.
           c1 c2 c3 c4 c5                 c1 c2 c3 c4 c5
  Step 1 [ 1  1  1  1  1 ]      Step 2 [ 1  1  1  1  1 ]
         [ 1  0  ·  ·  · ]             [ 1  0  1  1  1 ]
         [ 0  1  ·  ·  · ]             [ 0  1  0  1  1 ]
  M =    [ ·  1  0  ·  · ]        ⇒    [ 0  1  0  1  1 ]
         [ ·  0  1  ·  · ]             [ 0  0  1  0  0 ]
         [ ·  ·  1  0  · ]             [ 0  0  1  0  0 ]
         [ ·  ·  0  1  · ]             [ 0  0  0  1  1 ]
         [ ·  ·  1  ·  0 ]             [ 0  0  1  0  0 ]
         [ ·  ·  0  ·  1 ]             [ 0  0  0  0  1 ]
Fig. 3. The partially filled matrix on the left shows the mutation matrix after Step 1
of the construction described in the proof of Theorem 4 (entries marked "·" are not
yet filled), while the matrix on the right shows the mutation matrix after Step 2.
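The outcome of the construction can be verified mechanically: the conflict graph of the all-ones first row of the completed (Step 2) matrix should be exactly the 4-edge conflict graph G of Fig. 2. A small checking sketch (our own verification code; columns c1–c5 are indices 0–4, and the matrix is transcribed from Fig. 3):

```python
from itertools import combinations

def in_conflict(M, i, j):
    """Columns i and j are in conflict iff the rows exhibit all three of
    the pairs (0,1), (1,0) and (1,1)."""
    pairs = {(row[i], row[j]) for row in M}
    return {(0, 1), (1, 0), (1, 1)} <= pairs

# Step-2 matrix of Fig. 3 (columns c1..c5 as indices 0..4)
M = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
edges = {(i, j) for i, j in combinations(range(5), 2) if in_conflict(M, i, j)}
```

The resulting edge set is {(c1,c2), (c2,c3), (c3,c4), (c3,c5)}, i.e. exactly the edges of G, and no pair of non-adjacent columns is in conflict.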
to the (0/1) entries shown in black. Because the entry (r, i) is equal to 1
and the vertices corresponding to i and j are not connected in the conflict graph
(and thus connected in the underlying containment graph, oriented from i to j),
filling (r, j) with a 0 would be a contradiction. Now, if the 0 was originally
planted in the entry (r, j), then the value 1 at the entry (r, i) cannot also be an
originally planted 1 in row r, because in this scenario the pair would already have
been a conflict in the graph. Thus, the entry (r, i) must have been filled with a 1
during Step 2 of the algorithm, essentially from a chain of containment edges
which starts from an originally planted 1 in row r; note that the containment
relationships are transitive. This is also a contradiction, and thus no new
conflicting pairs of columns exist in the filled mutation matrix.
Using this result, we now prove Theorem 3.
Proof (of Theorem 3). We use a reduction from the minimum vertex coloring
problem (whose decision version is one of Karp's 21 NP-complete problems [13]).
Given a simple graph G, by Theorem 4 there exists a mutation matrix M whose
conflict graph of the first row is G and whose size is polynomially bounded. If we
solve the Minimum-Split-Row problem for M , then by Lemma 1, the first row of
M is replaced by at least χ(G) rows, each of which defines a color for a vertex of
G. Thus, we obtain a vertex coloring for G. Furthermore, given an instance of the
Minimum-Split-Row problem, we can show that minimum vertex colorings of the
conflict graphs corresponding to each row lead to a solution for the Minimum-
Split-Row problem.
4 A Graph-Theoretic Algorithm
In this section, we provide an algorithm for the Minimum-Split-Row problem.
Our algorithm achieves the lower bound given in Lemma 1: for a mutation
matrix M , we replace each row r by exactly χ(GM,r ) rows, where χ(GM,r ) is
the chromatic number of the conflict graph GM,r .
Note that HM is a directed acyclic graph (DAG). Further note that if column
i contains column j then there is no conflict between i and j in the original
mutation matrix [9].
As we discussed earlier (Observation 2), performing a Split-Row operation on
a row may introduce new conflicts. Informally, a Split-Row operation may affect
pairs of columns i and j, where j contains i, in a way that a new conflict arises.
Thus, any series of Split-Row operations that aims to make the mutation matrix
M conflict-free must be aware of the underlying containment graph of M . In
what follows, we describe an algorithm that uses what we call Containment-
Aware series of Split-Row operations and achieves the lower bound of χ(GM,r )
additional rows.
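The containment graph itself is easy to compute. The sketch below follows our reading of HM (an arc i → j whenever column i strictly contains column j; identical columns are skipped so the graph stays acyclic), and is an illustrative helper rather than the authors' implementation:

```python
def contains(M, i, j):
    """Column i contains column j: every row with a 1 in column j also
    has a 1 in column i."""
    return all(row[i] >= row[j] for row in M)

def containment_dag(M):
    """Arcs of the containment graph H_M: i -> j whenever column i
    strictly contains column j (equal columns are skipped, which keeps
    the graph acyclic)."""
    n = len(M[0])
    cols = [tuple(row[c] for row in M) for c in range(n)]
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and cols[i] != cols[j] and contains(M, i, j)}
```

On the matrix of Fig. 1 (rows 110, 010, 001), the only containment is column m2 containing column m1, and indeed no conflict exists between those two columns.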
References
1. Buneman, P.: A characterization of rigid circuit graphs. Discrete Mathematics 9,
205–212 (1974)
2. Desper, R., Jiang, F., Kallioniemi, O.-P., Moch, H., Papadimitriou, C.H., Schäffer,
A.A.: Inferring tree models for oncogenesis from comparative genome hybridization
data. Journal of Computational Biology 6(1), 37–51 (1999)
3. Ding, L., Ley, T.J., Larson, D.E., Miller, C.A., Koboldt, D.C., Welch, J.S., Ritchey,
J.K., Young, M.A., Lamprecht, T., McLellan, M.D., McMichael, J.F., Wallis, J.W.,
Lu, C., Shen, D., Harris, C.C., Dooling, D.J., Fulton, R.S., Fulton, L.L., Chen, K.,
Schmidt, H., Kalicki-Veizer, J., Magrini, V.J., Cook, L., McGrath, S.D., Vickery,
T.L., Wendl, M.C., Heath, S., Watson, M.A., Link, D.C., Tomasson, M.H., Shan-
non, W.D., Payton, J.E., Kulkarni, S., Westervelt, P., Walter, M.J., Graubert,
T.A., Mardis, E.R., Wilson, R.K., DiPersio, J.F.: Clonal evolution in relapsed
acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481(7382),
506–510 (2012)
4. Ding, L., Raphael, B.J., Chen, F., Wendl, M.C.: Advances for studying clonal
evolution in cancer. Cancer Lett. (January 2013)
5. Eberwine, J., Sul, J.-Y., Bartfai, T., Kim, J.: The promise of single-cell sequencing.
Nat. Methods 11(1), 25–27 (2014)
6. Fernandez-Baca, D.: The Perfect Phylogeny Problem (retrieved September 30,
2012)
7. Gerlinger, M., Rowan, A.J., Horswell, S., Larkin, J., Endesfelder, D., Gronroos,
E., Martinez, P., Matthews, N., Stewart, A., Tarpey, P., Varela, I., Phillimore, B.,
Begum, S., McDonald, N.Q., Butler, A., Jones, D., Raine, K., Latimer, C., San-
tos, C.R., Nohadani, M., Eklund, A.C., Spencer-Dene, B., Clark, G., Pickering, L.,
Stamp, G., Gore, M., Szallasi, Z., Downward, J., Futreal, P.A., Swanton, C.: Intra-
tumor heterogeneity and branched evolution revealed by multiregion sequencing.
N. Engl. J. Med. 366(10), 883–892 (2012)
8. Gusfield, D.: Efficient algorithms for inferring evolutionary trees. Networks 21,
19–28 (1991)
9. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and
Computational Biology (1997)
10. Hajirasouliha, I., Mahmoody, A., Raphael, B.J.: A combinatorial approach for an-
alyzing intra-tumor heterogeneity from high-throughput sequencing data. Bioin-
formatics 30(12), 78–86 (2014)
11. Hou, Y., Song, L., Zhu, P., Zhang, B., Tao, Y., Xu, X., Li, F., Wu, K., Liang, J.,
Shao, D., Wu, H., Ye, X., Ye, C., Wu, R., Jian, M., Chen, Y., Xie, W., Zhang, R.,
Chen, L., Liu, X., Yao, X., Zheng, H., Yu, C., Li, Q., Gong, Z., Mao, M., Yang, X.,
Yang, L., Li, J., Wang, W., Lu, Z., Gu, N., Laurie, G., Bolund, L., Kristiansen, K.,
Wang, J., Yang, H., Li, Y., Zhang, X., Wang, J.: Single-cell exome sequencing and
monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148(5),
873–885 (2012)
12. Kandoth, C., McLellan, M.D., Vandin, F., Ye, K., Niu, B., Lu, C., Xie, M.,
Zhang, Q., McMichael, J.F., Wyczalkowski, M.A., Leiserson, M.M., Miller, C.A.,
Welch, J.S., Walter, M.J., Wendl, M.C., Ley, T.J., Wilson, R.K., Raphael, B.J.,
Ding, L.: Mutational landscape and significance across 12 major cancer types. Na-
ture 502(7471), 333–339 (2013)
13. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Com-
puter Computations, pp. 85–103 (1972)
14. Lawrence, M.S., Stojanov, P., Polak, P., Kryukov, G.V., Cibulskis, K., Sivachenko,
A., Carter, S.L., Stewart, C., Mermel, C.H., Roberts, S.A., Kiezun, A., Hammer-
man, P.S., McKenna, A., Drier, Y., Zou, L., Ramos, A.H., Pugh, T.J., Stransky,
N., Helman, E., Kim, J., Sougnez, C., Ambrogio, L., Nickerson, E., Shefler, E.,
Cortés, M.L., Auclair, D., Saksena, G., Voet, D., Noble, M., DiCara, D., Lin, P.,
Lichtenstein, L., Heiman, D.I., Fennell, T., Imielinski, M., Hernandez, B., Hodis,
E., Baca, S., Dulak, A.M., Lohr, J., Landau, D.-A., Wu, C.J., Melendez-Zajgla, J.,
Hidalgo-Miranda, A., Koren, A., McCarroll, S.A., Mora, J., Lee, R.S., Crompton,
B., Onofrio, R., Parkin, M., Winckler, W., Ardlie, K., Gabriel, S.B., Roberts, C.M.,
Biegel, J.A., Stegmaier, K., Bass, A.J., Garraway, L.A., Meyerson, M., Golub, T.R.,
Gordenin, D.A., Sunyaev, S., Lander, E.S., Getz, G.: Mutational heterogeneity in
cancer and the search for new cancer-associated genes. Nature 499(7457), 214–218
(2013)
15. Nik-Zainal, S., Van Loo, P., Wedge, D.C., Alexandrov, L.B., Greenman, C.D., Lau,
K.W., Raine, K., Jones, D., Marshall, J., Ramakrishna, M., Shlien, A., Cooke, S.L.,
Hinton, J., Menzies, A., Stebbings, L.A., Leroy, C., Jia, M., Rance, R., Mudie, L.J.,
Gamble, S.J., Stephens, P.J., McLaren, S., Tarpey, P.S., Papaemmanuil, E., Davies,
H.R., Varela, I., McBride, D.J., Bignell, G.R., Leung, K., Butler, A.P., Teague,
J.W., Martin, S., Jonsson, G., Mariani, O., Boyault, S., Miron, P., Fatima, A.,
Langerod, A., Aparicio, S.A., Tutt, A., Sieuwerts, A.M., Borg, A., Thomas, G.,
Salomon, A.V., Richardson, A.L., Borresen-Dale, A.L., Futreal, P.A., Stratton,
M.R., Campbell, P.J.: The life history of 21 breast cancers. Cell 149(5), 994–1007
(2012)
16. Nowell, P.C.: The clonal evolution of tumor cell populations. Science 194(4260),
23–28 (1976)
17. Salari, R., Saleh, S.S., Kashef-Haghighi, D., Khavari, D., Newburger, D.E., West,
R.B., Sidow, A., Batzoglou, S.: Inference of tumor phylogenies with improved so-
matic mutation discovery. In: Deng, M., Jiang, R., Sun, F., Zhang, X. (eds.) RE-
COMB 2013. LNCS, vol. 7821, pp. 249–263. Springer, Heidelberg (2013)
18. Schuh, A., Becq, J., Humphray, S., Alexa, A., Burns, A., Clifford, R., Feller, S.M.,
Grocock, R., Henderson, S., Khrebtukova, I., Kingsbury, Z., Luo, S., McBride,
D., Murray, L., Menju, T., Timbs, A., Ross, M., Taylor, J., Bentley, D.: Monitor-
ing chronic lymphocytic leukemia progression by whole genome sequencing reveals
heterogeneous clonal evolution patterns. Blood 120(20), 4191–4196 (2012)
19. Shah, S.P., Roth, A., Goya, R., Oloumi, A., Ha, G., Zhao, Y., Turashvili, G., Ding,
J., Tse, K., Haffari, G., Bashashati, A., Prentice, L.M., Khattra, J., Burleigh, A.,
Yap, D., Bernard, V., McPherson, A., Shumansky, K., Crisan, A., Giuliany, R.,
Heravi-Moussavi, A., Rosner, J., Lai, D., Birol, I., Varhol, R., Tam, A., Dhalla,
N., Zeng, T., Ma, K., Chan, S.K., Griffith, M., Moradian, A., Cheng, S.W., Morin,
G.B., Watson, P., Gelmon, K., Chia, S., Chin, S.F., Curtis, C., Rueda, O.M.,
Pharoah, P.D., Damaraju, S., Mackey, J., Hoon, K., Harkins, T., Tadigotla, V.,
Sigaroudinia, M., Gascard, P., Tlsty, T., Costello, J.F., Meyer, I.M., Eaves, C.J.,
Wasserman, W.W., Jones, S., Huntsman, D., Hirst, M., Caldas, C., Marra, M.A.,
Aparicio, S.: The clonal and mutational evolution spectrum of primary triple-
negative breast cancers. Nature 486(7403), 395–399 (2012)
20. Strino, F., Parisi, F., Micsinai, M., Kluger, Y.: TrAp: a tree approach for finger-
printing subclonal tumor composition. Nucleic Acids Res. 41(17), e165 (2013)
21. Warnow, T.: Some combinatorial problems in phylogenetics. In: Invited paper,
Proceedings of the International Colloquium on Combinatorics and Graph Theory,
Balatonlelle, Hungary (1999)
22. Vogelstein, B., Papadopoulos, N., Velculescu, V.E., Zhou, S., Diaz Jr., L.A., Kin-
zler, K.W.: Cancer genome landscapes. Science 339(6127), 1546–1558 (2013)
23. Xu, X., Hou, Y., Yin, X., Bao, L., Tang, A., Song, L., Li, F., Tsang, S., Wu, K.,
Wu, H., He, W., Zeng, L., Xing, M., Wu, R., Jiang, H., Liu, X., Cao, D., Guo, G.,
Hu, X., Gui, Y., Li, Z., Xie, W., Sun, X., Shi, M., Cai, Z., Wang, B., Zhong, M.,
Li, J., Lu, Z., Gu, N., Zhang, X., Goodman, L., Bolund, L., Wang, J., Yang, H.,
Kristiansen, K., Dean, M., Li, Y., Wang, J.: Single-cell exome sequencing reveals
single-nucleotide mutation characteristics of a kidney tumor. Cell 148(5), 886–895
(2012)
Author Index