Nothing Special   »   [go: up one dir, main page]

skip to main content
article

RCSI: scalable similarity search in thousand(s) of genomes

Published: 01 August 2013 Publication History

Abstract

Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes which are within edit distance k to a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem only for one to-be-searched genome and compute only approximate solutions, i.e., they miss some k- approximate occurrences.
We present RCSI, Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average even for comparably large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that was never studied before at this scale.

References

[1]
An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56-65, Oct. 2012.
[2]
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, Oct. 2010.
[3]
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, Oct. 1990.
[4]
R. A. Baeza-Yates and C. H. Perleberg. Fast and practical approximate string matching. Information Processing Letters, 59(1):21-27, 1996.
[5]
S. Bao, R. Jiang, W. Kwan, B. Wang, X. Ma, and Y. Song. Evaluation of next-generation sequencing software in mapping and assembly. Journal of human genetics, 56(6):406-414, 2011.
[6]
S. Christley, Y. Lu, C. Li, and X. Xie. Human genomes as email attachments. Bioinformatics (Oxford, England), 25(2):274-275, Jan. 2009.
[7]
R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don-t cares. In Proceedings of STOC, pages 91-100, New York, NY, USA, 2004. ACM.
[8]
I. H. G. S. Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921, February 2001.
[9]
Consortium ICG. International network of cancer genome projects. Nature, 464(7291):993-998, Apr. 2010.
[10]
P. Danecek, A. Auton, G. Abecasis, and 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics (Oxford, England), 27(15):2156-2158, Aug. 2011.
[11]
A. P. J. de Koning, W. Gu, T. A. Castoe, M. A. Batzer, and D. D. Pollock. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genetics, 7(12):e1002384, 12 2011.
[12]
S. Deorowicz and S. Grabowski. Robust Relative Compression of Genomes with Random Access. Bioinformatics (Oxford, England), Sept. 2011.
[13]
H. H. Do, J. Jansson, K. Sadakane, and W.-K. Sung. Fast relative lempel-ziv self-index for similar sequences. In Proceedings of FAW-AAIM 2012, Beijing, China, 2012., volume 7285 of LNCS, pages 291-302. Springer, 2012.
[14]
P. Ferragina. String algorithms and data structures. CoRR, abs/0801.2378, 2008.
[15]
J. Fischer, V. Mäkinen, and G. Navarro. An(other) entropy-bounded compressed suffix tree. In Proceedings of 19th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 5029, pages 152-165, 2008.
[16]
T. Gagie, P. Gawrychowski, J. Kärkkäinen, and Y. Nekrich. A faster grammar-based self-index. In LATA-12, pages 240-251, Berlin, Heidelberg, 2012. Springer.
[17]
X. Ge and P. Smyth. Deformable markov model templates for time-series pattern matching. In Proceedings of SIGKDD, pages 81-90, New York, NY, USA, 2000. ACM.
[18]
L. A. Goldberg. Efficient algorithms for listing combinatorial structures, volume 5, page 7. Cambridge University Press, 2009.
[19]
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
[20]
O. Harismendy, P. Ng, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology, 10(3):R32+, 2009.
[21]
W. J. Kent. BLAT-The BLAST-Like Alignment Tool. Genome Research, 12(4):656-664, Apr. 2002.
[22]
W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Research, 12(6):996-1006, 2002.
[23]
Y. Kim, K.-G. Woo, H. Park, and K. Shim. Efficient processing of substring match queries with inverted q-gram indexes. In Proceedings of ICDE 2010, Long Beach, California, USA, pages 721-732.
[24]
S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013.
[25]
B. Langmead and S. L. Salzberg. Fast gapped-read alignment with bowtie 2. Nat Meth, 9(4):357-359, Apr. 2012.
[26]
B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25-10, Mar. 2009.
[27]
H. Li and R. Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics (Oxford, England), 25(14):1754-1760, 2009.
[28]
Y. Li, A. Terrell, and J. M. Patel. Wham: a high-throughput sequence alignment method. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Athens, Greece, June 12-16, pages 445-456. ACM, 2011.
[29]
Z. Li and T. Ge. Online windowed subsequence matching over probabilistic sequences. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 277-288, New York, NY, USA, 2012. ACM.
[30]
P.-R. Loh, M. Baym, and B. Berger. Compressive genomics. Nature Biotechnology, 30(7):627-630, July 2012.
[31]
U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993.
[32]
G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88, 2001.
[33]
G. Navarro. Indexing highly repetitive collections. In Proceedings 23rd International Workshop on Combinatorial Algorithms (IWOCA), LNCS 7643, pages 274-279, 2012.
[34]
G. Navarro and M. Raffinot. Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, New York, NY, USA, 2002.
[35]
E. Ohlebusch, J. Fischer, and S. Gog. Cst++. In SPIRE-10, pages 322-333, 2010.
[36]
P. Papapetrou, V. Athitsos, G. Kollios, and D. Gunopulos. Reference-based alignment in large sequence databases. Proceedings of the VLDB Endowment, 2(1):205-216, Aug. 2009.
[37]
E. Pennisi. Will Computers Crash Genomics? Science, 331(6018):666-668, Feb. 2011.
[38]
D. E. Reich, S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin, J. M. Higgins, D. J. Richter, E. S. Lander, and D. Altshuler. Human genome sequence variation and the influence of gene history, mutation and recombination. Nature Genetics, 32(1):135-142, Aug. 2002.
[39]
A. Rheinländer, M. Knobloch, N. Hochmuth, and U. Leser. Prefix tree indexing for similarity search and similarity joins on genomic data. In Proceedings of the 22nd SSDBM, pages 519-536, Berlin, Heidelberg, 2010. Springer.
[40]
E. E. Schadt, S. Turner, and A. Kasarskis. A window into third-generation sequencing. Human molecular genetics, 19(R2):R227-R240, Oct. 2010.
[41]
K. Schneeberger, J. Hagmann, S. Ossowski, N. Warthmann, S. Gesing, O. Kohlbacher, and D. Weigel. Simultaneous alignment of short reads against multiple genomes. Genome biology, 10(9):R98+, Sept. 2009.
[42]
J. Sirén, N. Välimäki, and V. Mäkinen. Indexing finite language representation of population genotypes. In Proceedings of the 11th international conference on Algorithms in bioinformatics, WABI-11, pages 270-281, Berlin, Heidelberg, 2011. Springer.
[43]
E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In Proceedings of the Third Annual European Symposium on Algorithms, ESA -95, pages 327-340, London, UK, 1995. Springer.
[44]
N. Vaelimaeki, V. Maekinen, W. Gerlach, and K. Dixit. Engineering a compressed suffix tree implementation. ACM Journal of Experimental Algorithmics, 14, 2009.
[45]
S. Wandelt and U. Leser. Adaptive efficient compression of genomes. Algorithms for Molecular Biology, 7:30, 2012.
[46]
S. Wandelt and U. Leser. String searching in referentially compressed genomes. In Proceedings of the 4th Int. Conf. on Knowledge Discovery and Information Retrieval, Barcelona, Spain, 2012.
[47]
J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85-96, 2012.
[48]
W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In Proceedings of the ACM SIGMOD International Conference on Management of data, pages 759-770, New York, NY, USA, 2009. ACM.
[49]
X. Yang, B. Wang, C. Li, J. Wang, and X. Xie. Efficient direct search on compressed genomic data. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), Australia (to appear; preprint at http://www.ics.uci.edu/xhx/publications/genomecompress.pdf).
[50]
H. Zhu, G. Kollios, and V. Athitsos. A generic framework for efficient and effective subsequence retrieval. Proceedings VLDB Endow., 5(11):1579-1590, July 2012.

Cited By

View all
  1. RCSI: scalable similarity search in thousand(s) of genomes

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 6, Issue 13
      August 2013
      180 pages

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 August 2013
      Published in PVLDB Volume 6, Issue 13

      Qualifiers

      • Article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)5
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 13 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2020)Efficient Search over Genomic Short Read DataProceedings of the 32nd International Conference on Scientific and Statistical Database Management10.1145/3400903.3400907(1-12)Online publication date: 7-Jul-2020
      • (2016)An efficient pruning strategy for approximate string matching over suffix treeKnowledge and Information Systems10.1007/s10115-015-0896-649:1(121-141)Online publication date: 1-Oct-2016
      • (2015)MRCSIProceedings of the VLDB Endowment10.14778/2735479.27354808:5(461-472)Online publication date: 1-Jan-2015
      • (2015)Efficient Compression of 4D-Trajectory Data in Air Traffic ManagementIEEE Transactions on Intelligent Transportation Systems10.1109/TITS.2014.234505516:2(844-853)Online publication date: 27-Mar-2015

      View Options

      Get Access

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media