Abstract

Summary

We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one that practically works for reads of averaged kilobases in length.

Availability and implementation

https://github.com/lh3/ropebwt2 CONTACT: hengli@broadinstitute.org.

Free full text

Bioinformatics. 2014 Nov 15; 30(22): 3274–3275.

Published online 2014 Aug 8. https://doi.org/10.1093/bioinformatics/btu541

PMCID: PMC4221129

PMID: 25107872

Fast construction of FM-index for long sequence reads

Heng Li

Author information Article notes Copyright and License information Disclaimer

This article has been cited by other articles in PMC.

Go to:

Abstract

Summary: We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one that practically works for reads of averaged kilobases in length.

Availability and implementation: https://github.com/lh3/ropebwt2

Contact: gro.etutitsnidaorb@ilgneh

Go to:

1 INTRODUCTION

FM-index plays an important role in DNA sequence alignment, de novo assembly (Simpson and Durbin, 2012) and compression (Cox et al., 2012). Fast and lightweight construction of FM-index for a large dataset is the key to these applications. In this context, a few algorithms (Bauer et al., 2013; Liu et al., 2014) have been developed that substantially outperform earlier algorithms. However, they are only efficient for short reads. A fast and practical algorithm for long sequence reads is still lacking. This work aims to fill this gap.

Go to:

2 METHODS

Let Σ = {A, C, G, T, N} be the alphabet of DNA with a lexicographical order A < C < G < T < N. Each element in Σ is called a symbol and a sequence of symbols called a string over Σ. Given a string P, |P| is its length and P[i] the symbol at position i. A sentinel $ is smaller than all the other symbols. For simplicity, we let P[−1] = P[|P|] = $. We also introduce $\tilde{P}$ as the reverse of P and $\bar{P}$ as the reverse complement of P.

Given a list of strings over Σ, (P_i)_0≤i<m, let T = P₀$₀ … P_{m − 1}$_{m − 1} with $₀ < ⋯ < $_m−1 < A < C < G < T < N. The suffix array of T is an integer array S such that S(i), 0 ≤ i < |T|, is the starting position of the i-th smallest suffix in the collection T. The Burrows-Wheeler Transform, or BWT, of T can be computed as B[i] = T[S(i)−1]. For the description of the algorithm, we segment B into B = B_$B_AB_CB_GB_TB_N, where B_a[i] = B[i + C(a)] with C(a) = |{j:T[j] < a}| being the array of accumulative counts. By the definition of suffix array and BWT, B_a consists of all the symbols with their next symbol in T being a.

The above defines BWT for an ordered list of strings. We next seek to define BWT for an unordered set of strings 𝒞 by imposing an arbitrary sorting order on 𝒞. We say list (P_i)_i is in the reverse lexicographical order or RLO, if ${\tilde{P}}_{i} \leq {\tilde{P}}_{j}$ for any i < j; say it is in the reverse-complement lexicographical order or RCLO, if ${\bar{P}}_{i} \leq {\bar{P}}_{j}$ for any i < j. The RLO-BWT of 𝒞, denoted by B^RLO(𝒞), is constructed by sorting strings in 𝒞 in RLO and then applying the procedure in the previous paragraph on the sorted list. RCLO-BWT B^RCLO(𝒞) can be constructed in a similar way. In $B^{RCLO} ({P_{i}}_{i} \cup {{\bar{P}}_{j}}_{j})$ , the k-th smallest sequence is the reverse complement of the k-th sequence in the FM-index. This property removes the necessity of keeping an extra array to link the rank and the position of a sequence in the FM-index, and thus helps to reduce the memory of some FM-index–based algorithms (Simpson and Durbin, 2012). For short reads, RLO/RCLO-BWT is also more compressible (Cox et al., 2012).

As a preparation, we further define two string operations: rank(c, k; B) and insert(c, k; B), where rank(c, k; B) = |{i < k:B[i] = c}| gives the number of symbols c before the position k in B, and insert(c, k; B) inserts symbol c after k symbols in B with all the symbols after position k shifted to make room for c. We implemented the two operations by representing each B_c in a B+-tree in memory, where a leaf keeps a run-length encoded string and an internal node keeps the count of each symbol in the leaves descended from the node.

Algorithm 1 appends a string to an existing index by inserting each of its symbol from the end of P. It was first described by Chan et al. (2004). Algorithm 2 constructs RLO/RCLO-BWT in a similar manner to Algorithm 1 except that it inserts P[i] to [l, u), the suffix array interval of P’s suffix starting at i + 1, and that BWT symbols in this interval are already sorted. This process implicitly applies a radix sort from the end of P, sorting it into the existing strings in the BWT in RLO/RCLO. Note that if we change line 1 to “l ← u ← |{i:B[i] = $}|”, Algorithm 2 will be turned into Algorithm 1. Recall that the BCR algorithm (Bauer et al., 2013) is, to some extent, the multi-string version of Algorithm 1. Following similar reasoning, we can extend Algorithm 2 so as to insert multiple strings at the same time, which gives Algorithm 3. We use an array A(j) to keep the state of the j-th sequence after inserting its d-long suffix. At line 2, A(j).c is the previously inserted symbol and [A(j).l, A(j).u) is the interval to which the new symbol is inserted. In implementation, we may speed up the sorting mode by inserting multiple symbols at line 3.

When B is represented by a balanced tree structure, the time complexity of all three algorithms is O(n log⁡n), where n is the total number of symbols in the input. However, we will see later that for short strings, Algorithm 3 is substantially faster than the first two algorithms, due to the locality of memory accesses, the possibility of cached B+-tree update and the parallelization of the ‘for’ loop at line 1. These techniques are more effective for a larger batch of shorter strings.

Disregarding RLO/RCLO, Algorithm 3 is similar to BCR except that BCR keeps B in monolithic arrays. As a result, the time complexity of BCR is O(nl), where l is the maximum length of reads, not scaling well to l.

An external file that holds a picture, illustration, etc.
Object name is btu541ilf1.jpg

An external file that holds a picture, illustration, etc.
Object name is btu541ilf2.jpg

An external file that holds a picture, illustration, etc.
Object name is btu541ilf3.jpg

Go to:

3 RESULTS AND DISCUSSION

We implemented the algorithm in ropeBWT2 and evaluated its performance together with BEETL (http://bit.ly/beetlGH), the original on-disk implementation of BCR and BCRext, ropeBWT-BCR (https://github.com/lh3/ropebwt), an in-memory reimplementation of BCR by us, and NVBio (http://bit.ly/nvbioio), a GPU-based algorithm inspired by CX1 (Liu et al., 2014). Table 1 shows that for ~100 bp reads, ropeBWT2 has comparable performance to others. For the ~875 bp Venter dataset, NVBio aborted due to insufficient memory under various settings. We did not apply BCR because it is not designed for long reads of unequal lengths. Only ropeBWT2 works with this data set and the even longer moleculo reads.

Table 1.

Performance of BWT construction

Data^a	Algorithm	RCLO	Real	CPU%	RAM^b (GB)	Comments
worm	nvbio	–	316 s	138	12.9	See note^c
worm	ropebwt-bcr	–	480 s	223	2.2	-btORf
worm	Algorithm 3	Yes	506 s	250	10.5	-brRm10g
worm	Algorithm 3	No	647 s	249	11.8	-bRm10g
worm	beetl-bcr	–	965 s	259	1.8	RAM disk^d
worm	beetl-bcr	–	2092 s	122	1.8	Network^e
worm	Algorithm 1	–	5125 s	100	2.5	-bRm0
worm	beetl-bcrext	–	5900 s	48	0.1	Network^e
12 878	ropebwt-bcr	–	3.3 h	210	39.3	-btORf
12 878	nvbio	–	4.1 h	471	63.8	See note^f
12 878	Algorithm 3	Yes	5.0 h	261	34.0	-brRm10g
12 878	Algorithm 3	No	5.1 h	248	60.9	-bRm10g
12 878	beetl-bcr	–	11.2 h	131	31.6	Network^e
Venter	Algorithm 3	Yes	1.4 h	274	22.2	-brRm10g
Venter	Algorithm 3	No	1.5 h	274	22.8	-bRm10g
mol	Algorithm 3	No	6.8 h	285	20.0	-bRm10g

Open in a separate window

^aDatasets—worm: 66M × 100 bp Caenorhabditis elegans reads from SRR065390; 12878: 1206M × 101 bp human reads for sample NA12878 (Depristo et al., 2011). Venter: 32M × 875 bp (in average) human reads by Sanger sequencing (Levy et al. 2007; http://bit.ly/levy2007); mol: 23M × 4026 bp (in average) human reads by Illumina’s Moleculo sequencing (http://bit.ly/mol12878).

^bHardware—CPU: 48 cores of Xeon E5-2697v2 at 2.70 GHz; GPU: one Nvidia Tesla K40; RAM: 128 GB; Storage: Isilon IQ 72000x and X400 over network. CPU time, wall-clock time and peak memory are measured by GNU time.

^cRun with option ‘-R -cpu-mem 4096 -gpu-mem 4096’. NVBio uses more CPU and GPU RAM than the specified.

^dResults and temporary files created on in-RAM virtual disk ‘/dev/shm’.

^eResults and temporary files created on Isilon’s network file system.

^fRun with option ‘-R -cpu-mem 48000 -gpu-mem 4096’.

Funding: NHGRI U54HG003037; NIH GM100233.

Conflict of Interest: none declared.

Go to:

REFERENCES

Bauer MJ, et al. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci. 2013;483:134–148. [Google Scholar]
Chan H-L, et al. Compressed index for a dynamic collection of texts. In: Sahinalp SC, Muthukrishnan S, Dogrusöz U, editors. CPM, Volume 3109 of Lecture Notes in Computer Science. Berlin Heidelberg: Springer; 2004. pp. 445–456. [Google Scholar]
Cox AJ, et al. Large-scale compression of genomic sequence databases with the burrows-wheeler transform. Bioinformatics. 2012;28:1415–1419. [Abstract] [Google Scholar]
Depristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. [Europe PMC free article] [Abstract] [Google Scholar]
Levy S, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. [Abstract] [Google Scholar]
Liu C-M, et al. 2014. GPU-accelerated BWT construction for large collection of short reads. arXiv:1401.7457. [Google Scholar]
Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012;22:549–556. [Europe PMC free article] [Abstract] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

Full text links

Read article at publisher's site: https://doi.org/10.1093/bioinformatics/btu541

Read article for free, from open access legal sources, via Unpaywall: https://academic.oup.com/bioinformatics/article-pdf/30/22/3274/7252262/btu541.pdf

Citations & impact

Impact metrics

26

Citations

Jump to Citations

Citations of article over time

Alternative metrics

Altmetric item for https://www.altmetric.com/details/2583322

Altmetric
Discover the attention surrounding your research
https://www.altmetric.com/details/2583322

Smart citations by scite.ai
Explore citation contexts and check if this article has been supported or disputed.
https://scite.ai/reports/10.1093/bioinformatics/btu541

Supporting

Mentioning

Contrasting

0

50

0

Article citations

Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index.
Guguchkin E, Kasianov A, Belenikin M, Zobkova G, Kosova E, Makeev V, Karpulevich E
BMC Bioinformatics, 25(1):238, 13 Jul 2024
Cited by: 1 article | PMID: 39003441 | PMCID: PMC11246581
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
A survey of BWT variants for string collections.
Cenzato D, Lipták Z
Bioinformatics, btae333, 24 May 2024
Cited by: 1 article | PMID: 38788221 | PMCID: PMC11286279
Review
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification.
Song L, Langmead B
Genome Biol, 25(1):106, 25 Apr 2024
Cited by: 1 article | PMID: 38664753 | PMCID: PMC11046777
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Complete genome sequence of Pseudomonas entomophila strain TVIN-A01.
Shelke T, Mahar NS, Gupta V
Microbiol Resour Announc, 13(2):e0097723, 18 Jan 2024
Cited by: 0 articles | PMID: 38236041 | PMCID: PMC10868157
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Recursive Prefix-Free Parsing for Building Big BWTs.
Oliva M, Gagie T, Boucher C
Proc Data Compress Conf, 2023:62-70, 01 Mar 2023
Cited by: 2 articles | PMID: 39157001 | PMCID: PMC11328891
Free full text in Europe PMC

Go to all (26) article citations

Funding

Funders who supported this work.

NHGRI NIH HHS (1)

Grant ID: U54HG003037
5 publications

NIGMS NIH HHS (1)

Grant ID: GM100233
34 publications

Search life-sciences literature (45,090,497 articles, preprints and more)

Fast construction of FM-index for long sequence reads.

Affiliations

ORCIDs linked to this article

Abstract

Summary

Availability and implementation

Free full text

Fast construction of FM-index for long sequence reads

Abstract

1 INTRODUCTION

2 METHODS

3 RESULTS AND DISCUSSION

Table 1.

REFERENCES

Full text links

Citations & impact

Impact metrics

Citations of article over time

Alternative metrics

Smart citations by scite.ai
Explore citation contexts and check if this article has been supported or disputed.
https://scite.ai/reports/10.1093/bioinformatics/btu541

Article citations

Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index.

A survey of BWT variants for string collections.

Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification.

Complete genome sequence of Pseudomonas entomophila strain TVIN-A01.

Recursive Prefix-Free Parsing for Building Big BWTs.

Similar Articles

BFC: correcting Illumina sequencing errors.

FermiKit: assembly-based variant calling for Illumina resequencing data.

Minimap2: pairwise alignment for nucleotide sequences.

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform.

Funding

NHGRI NIH HHS (1)

NIGMS NIH HHS (1)

Partnerships & funding

Search life-sciences literature (45,090,497 articles, preprints and more)

Fast construction of FM-index for long sequence reads.

Author information

Affiliations

ORCIDs linked to this article

Abstract

Summary

Availability and implementation

Free full text

Fast construction of FM-index for long sequence reads

Abstract

1 INTRODUCTION

2 METHODS

3 RESULTS AND DISCUSSION

Table 1.

REFERENCES

Full text links

Citations & impact

Impact metrics

Citations of article over time

Alternative metrics

Article citations

Similar Articles

Funding

NHGRI NIH HHS (1)﻿

NIGMS NIH HHS (1)﻿

Partnerships & funding

NHGRI NIH HHS (1)

NIGMS NIH HHS (1)