article

RCSI: scalable similarity search in thousand(s) of genomes

Authors:

Sebastian Wandelt,

Johannes Starlinger,

Ulf LeserAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 6, Issue 13

Pages 1534 - 1545

https://doi.org/10.14778/2536258.2536265

Published: 01 August 2013 Publication History

Abstract

Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes which are within edit distance k to a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem only for one to-be-searched genome and compute only approximate solutions, i.e., they miss some k- approximate occurrences.

We present RCSI, Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average even for comparably large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that was never studied before at this scale.

References

[1]

An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56-65, Oct. 2012.

[2]

1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, Oct. 2010.

[3]

S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, Oct. 1990.

[4]

R. A. Baeza-Yates and C. H. Perleberg. Fast and practical approximate string matching. Information Processing Letters, 59(1):21-27, 1996.

[5]

S. Bao, R. Jiang, W. Kwan, B. Wang, X. Ma, and Y. Song. Evaluation of next-generation sequencing software in mapping and assembly. Journal of human genetics, 56(6):406-414, 2011.

[6]

S. Christley, Y. Lu, C. Li, and X. Xie. Human genomes as email attachments. Bioinformatics (Oxford, England), 25(2):274-275, Jan. 2009.

[7]

R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don-t cares. In Proceedings of STOC, pages 91-100, New York, NY, USA, 2004. ACM.

[8]

I. H. G. S. Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921, February 2001.

[9]

Consortium ICG. International network of cancer genome projects. Nature, 464(7291):993-998, Apr. 2010.

[10]

P. Danecek, A. Auton, G. Abecasis, and 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics (Oxford, England), 27(15):2156-2158, Aug. 2011.

[11]

A. P. J. de Koning, W. Gu, T. A. Castoe, M. A. Batzer, and D. D. Pollock. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genetics, 7(12):e1002384, 12 2011.

[12]

S. Deorowicz and S. Grabowski. Robust Relative Compression of Genomes with Random Access. Bioinformatics (Oxford, England), Sept. 2011.

[13]

H. H. Do, J. Jansson, K. Sadakane, and W.-K. Sung. Fast relative lempel-ziv self-index for similar sequences. In Proceedings of FAW-AAIM 2012, Beijing, China, 2012., volume 7285 of LNCS, pages 291-302. Springer, 2012.

[14]

P. Ferragina. String algorithms and data structures. CoRR, abs/0801.2378, 2008.

[15]

J. Fischer, V. Mäkinen, and G. Navarro. An(other) entropy-bounded compressed suffix tree. In Proceedings of 19th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 5029, pages 152-165, 2008.

[16]

T. Gagie, P. Gawrychowski, J. Kärkkäinen, and Y. Nekrich. A faster grammar-based self-index. In LATA-12, pages 240-251, Berlin, Heidelberg, 2012. Springer.

[17]

X. Ge and P. Smyth. Deformable markov model templates for time-series pattern matching. In Proceedings of SIGKDD, pages 81-90, New York, NY, USA, 2000. ACM.

[18]

L. A. Goldberg. Efficient algorithms for listing combinatorial structures, volume 5, page 7. Cambridge University Press, 2009.

[19]

D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.

[20]

O. Harismendy, P. Ng, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology, 10(3):R32+, 2009.

[21]

W. J. Kent. BLAT-The BLAST-Like Alignment Tool. Genome Research, 12(4):656-664, Apr. 2002.

[22]

W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Research, 12(6):996-1006, 2002.

[23]

Y. Kim, K.-G. Woo, H. Park, and K. Shim. Efficient processing of substring match queries with inverted q-gram indexes. In Proceedings of ICDE 2010, Long Beach, California, USA, pages 721-732.

[24]

S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013.

[25]

B. Langmead and S. L. Salzberg. Fast gapped-read alignment with bowtie 2. Nat Meth, 9(4):357-359, Apr. 2012.

[26]

B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25-10, Mar. 2009.

[27]

H. Li and R. Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics (Oxford, England), 25(14):1754-1760, 2009.

[28]

Y. Li, A. Terrell, and J. M. Patel. Wham: a high-throughput sequence alignment method. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Athens, Greece, June 12-16, pages 445-456. ACM, 2011.

[29]

Z. Li and T. Ge. Online windowed subsequence matching over probabilistic sequences. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 277-288, New York, NY, USA, 2012. ACM.

[30]

P.-R. Loh, M. Baym, and B. Berger. Compressive genomics. Nature Biotechnology, 30(7):627-630, July 2012.

[31]

U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993.

[32]

G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88, 2001.

[33]

G. Navarro. Indexing highly repetitive collections. In Proceedings 23rd International Workshop on Combinatorial Algorithms (IWOCA), LNCS 7643, pages 274-279, 2012.

[34]

G. Navarro and M. Raffinot. Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, New York, NY, USA, 2002.

[35]

E. Ohlebusch, J. Fischer, and S. Gog. Cst++. In SPIRE-10, pages 322-333, 2010.

[36]

P. Papapetrou, V. Athitsos, G. Kollios, and D. Gunopulos. Reference-based alignment in large sequence databases. Proceedings of the VLDB Endowment, 2(1):205-216, Aug. 2009.

[37]

E. Pennisi. Will Computers Crash Genomics? Science, 331(6018):666-668, Feb. 2011.

[38]

D. E. Reich, S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin, J. M. Higgins, D. J. Richter, E. S. Lander, and D. Altshuler. Human genome sequence variation and the influence of gene history, mutation and recombination. Nature Genetics, 32(1):135-142, Aug. 2002.

[39]

A. Rheinländer, M. Knobloch, N. Hochmuth, and U. Leser. Prefix tree indexing for similarity search and similarity joins on genomic data. In Proceedings of the 22nd SSDBM, pages 519-536, Berlin, Heidelberg, 2010. Springer.

[40]

E. E. Schadt, S. Turner, and A. Kasarskis. A window into third-generation sequencing. Human molecular genetics, 19(R2):R227-R240, Oct. 2010.

[41]

K. Schneeberger, J. Hagmann, S. Ossowski, N. Warthmann, S. Gesing, O. Kohlbacher, and D. Weigel. Simultaneous alignment of short reads against multiple genomes. Genome biology, 10(9):R98+, Sept. 2009.

[42]

J. Sirén, N. Välimäki, and V. Mäkinen. Indexing finite language representation of population genotypes. In Proceedings of the 11th international conference on Algorithms in bioinformatics, WABI-11, pages 270-281, Berlin, Heidelberg, 2011. Springer.

[43]

E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In Proceedings of the Third Annual European Symposium on Algorithms, ESA -95, pages 327-340, London, UK, 1995. Springer.

[44]

N. Vaelimaeki, V. Maekinen, W. Gerlach, and K. Dixit. Engineering a compressed suffix tree implementation. ACM Journal of Experimental Algorithmics, 14, 2009.

[45]

S. Wandelt and U. Leser. Adaptive efficient compression of genomes. Algorithms for Molecular Biology, 7:30, 2012.

[46]

S. Wandelt and U. Leser. String searching in referentially compressed genomes. In Proceedings of the 4th Int. Conf. on Knowledge Discovery and Information Retrieval, Barcelona, Spain, 2012.

[47]

J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85-96, 2012.

[48]

W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In Proceedings of the ACM SIGMOD International Conference on Management of data, pages 759-770, New York, NY, USA, 2009. ACM.

[49]

X. Yang, B. Wang, C. Li, J. Wang, and X. Xie. Efficient direct search on compressed genomic data. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), Australia (to appear; preprint at http://www.ics.uci.edu/xhx/publications/genomecompress.pdf).

[50]

H. Zhu, G. Kollios, and V. Athitsos. A generic framework for efficient and effective subsequence retrieval. Proceedings VLDB Endow., 5(11):1579-1590, July 2012.

Cited By

Zhang WLin MRoss K(2020)Efficient Search over Genomic Short Read DataProceedings of the 32nd International Conference on Scientific and Statistical Database Management10.1145/3400903.3400907(1-12)Online publication date: 7-Jul-2020
https://dl.acm.org/doi/10.1145/3400903.3400907
Hu HWang HLi JGao H(2016)An efficient pruning strategy for approximate string matching over suffix treeKnowledge and Information Systems10.1007/s10115-015-0896-649:1(121-141)Online publication date: 1-Oct-2016
https://dl.acm.org/doi/10.1007/s10115-015-0896-6
Wandelt SLeser U(2015)MRCSIProceedings of the VLDB Endowment10.14778/2735479.27354808:5(461-472)Online publication date: 1-Jan-2015
https://dl.acm.org/doi/10.14778/2735479.2735480
Show More Cited By

RCSI: scalable similarity search in thousand(s) of genomes
1. Applied computing
  1. Life and medical sciences
2. Information systems
  1. Information retrieval

Recommendations

Identification of human-specific transcript variants induced by DNA insertions in the human genome

Motivation: Many genes in the human genome produce a wide variety of transcript variants resulting from alternative exon splicing, differential promoter usage, or altered polyadenylation site utilization that may function differently in human cells. ...
A probabilistic method for the detection and genotyping of small indels from population-scale sequence data

Motivation: High-throughput sequencing technologies have made population-scale studies of human genetic variation possible. Accurate and comprehensive detection of DNA sequence variants is crucial for the success of these studies. Small insertions ...
sRNA associated genomic islands in Salmonella spp.
ISB '10: Proceedings of the International Symposium on Biocomputing

Genomic Islands are parts of a genome that has evidence of horizontal origins. The present work is a continuation of our earlier work that identified 25 regions downstream of the small RNAs as hotspots of genomic island integration by analyzing three ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 6, Issue 13

August 2013

180 pages

ISSN:2150-8097

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 August 2013

Published in PVLDB Volume 6, Issue 13

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
172
Total Downloads

Downloads (Last 12 months)5
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhang WLin MRoss K(2020)Efficient Search over Genomic Short Read DataProceedings of the 32nd International Conference on Scientific and Statistical Database Management10.1145/3400903.3400907(1-12)Online publication date: 7-Jul-2020
https://dl.acm.org/doi/10.1145/3400903.3400907
Hu HWang HLi JGao H(2016)An efficient pruning strategy for approximate string matching over suffix treeKnowledge and Information Systems10.1007/s10115-015-0896-649:1(121-141)Online publication date: 1-Oct-2016
https://dl.acm.org/doi/10.1007/s10115-015-0896-6
Wandelt SLeser U(2015)MRCSIProceedings of the VLDB Endowment10.14778/2735479.27354808:5(461-472)Online publication date: 1-Jan-2015
https://dl.acm.org/doi/10.14778/2735479.2735480
Wandelt SXiaoqian Sun (2015)Efficient Compression of 4D-Trajectory Data in Air Traffic ManagementIEEE Transactions on Intelligent Transportation Systems10.1109/TITS.2014.234505516:2(844-853)Online publication date: 27-Mar-2015
https://dl.acm.org/doi/10.1109/TITS.2014.2345055

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents