Fast Spaced Seed Hashing

File

Author Details

Samuele Girotto

Matteo Comin

Cinzia Pizzi

Cite As Get BibTex

Samuele Girotto, Matteo Comin, and Cinzia Pizzi. Fast Spaced Seed Hashing. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.WABI.2017.7

Abstract

Hashing k-mers is a common function across many bioinformatics applications and it is widely used for indexing, querying and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, are routinely used instead of k-mers. Spaced seeds allow to improve the sensitivity, with respect to k-mers, in many applications, however the hashing of spaced seeds increases substantially the computational time. Hence, the ability to speed up hashing operations of spaced seeds would have a major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient. In this paper we address the problem of efficient spaced seed hashing. The proposed algorithm exploits the similarity of adjacent spaced seed hash values in an input sequence in order to efficiently compute the next hash. We report a series of experiments on NGS reads hashing using several spaced seeds. In the experiments, our algorithm can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.6x to 5.3x, depending on the structure of the spaced seed.

Subject Classification

Keywords

k-mers
spaced seeds
efficient hashing

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990. URL: http://dx.doi.org/10.1016/S0022-2836(05)80360-2.
Daniel G. Brown, Ming Li, and Bin Ma. A tutorial of recent developments in the seeding of local alignment. Journal of Bioinformatics and Computational Biology, 02(04):819-842, 2004. URL: http://dx.doi.org/10.1142/S0219720004000983.
Jeremy Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17(5):419, 2001. URL: http://dx.doi.org/10.1093/bioinformatics/17.5.419.
Karel Břinda, Maciej Sykulski, and Gregory Kucherov. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics, 31(22):3584, 2015. URL: http://dx.doi.org/10.1093/bioinformatics/btv419.
Matteo Comin and Morris Antonello. Fast entropic profiler: An information theoretic approach for the discovery of patterns in genomes. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 11(3):500-509, May 2014. URL: http://dx.doi.org/10.1109/TCBB.2013.2297924.
Matteo Comin, Andrea Leoni, and Michele Schimd. Clustering of reads with alignment-free measures and quality values. Algorithms for Molecular Biology, 10(1):4, 2015. URL: http://dx.doi.org/10.1186/s13015-014-0029-x.
Aaron E. Darling, Todd J. Treangen, Louxin Zhang, Carla Kuiken, Xavier Messeguer, and Nicole T. Perna. Procrastination leads to efficient filtration for local multiple alignment. In Philipp Bücher and Bernard M. E. Moret, editors, Algorithms in Bioinformatics: 6th International Workshop, WABI 2006, Zurich, Switzerland, September 11-13, 2006. Proceedings, pages 126-137. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. URL: http://dx.doi.org/10.1007/11851561_12.
Samuele Girotto, Matteo Comin, and Cinzia Pizzi. Binning metagenomic reads with probabilistic sequence signatures based on spaced seeds. To Appear, 2017.
Samuele Girotto, Cinzia Pizzi, and Matteo Comin. MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics, 32(17):i567-i575, September 2016. URL: http://dx.doi.org/10.1093/bioinformatics/btw466.
Lars Hahn, Chris-André Leimeister, Rachid Ounit, Stefano Lonardi, and Burkhard Morgenstern. Rasbhari: Optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison. PLOS Computational Biology, 12(10):1-18, 10 2016. URL: http://dx.doi.org/10.1371/journal.pcbi.1005107.
Lucian Ilie, Silvana Ilie, and Anahita Mansouri Bigvand. SpEED: fast computation of sensitive spaced seeds. Bioinformatics, 27(17):2433, 2011. URL: http://dx.doi.org/10.1093/bioinformatics/btr368.
Uri Keich, Ming Li, Bin Ma, and John Tromp. On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3):253-263, 2004. URL: http://dx.doi.org/10.1016/S0166-218X(03)00382-2.
Chris-Andre Leimeister, Marcus Boden, Sebastian Horwege, Sebastian Lindner, and Burkhard Morgenstern. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics, 30(14):1991, 2014. URL: http://dx.doi.org/10.1093/bioinformatics/btu177.
Stinus Lindgreen, Karen L. Adair, and Paul Gardner. An evaluation of the accuracy and speed of metagenome analysis tools. Scientific Reports, 6, 2016. Article No. 19233. URL: http://dx.doi.org/10.1038/srep19233.
Bin Ma and Ming Li. On the complexity of the spaced seeds. Journal of Computer and System Sciences, 73(7):1024-1034, 2007. Bioinformatics III. URL: http://dx.doi.org/10.1016/j.jcss.2007.03.008.
Bin Ma, John Tromp, and Ming Li. Patternhunter: faster and more sensitive homology search. Bioinformatics, 18(3):440, 2002. URL: http://dx.doi.org/10.1093/bioinformatics/18.3.440.
Hamid Mohamadi, Justin Chu, Benjamin P. Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics, page btw397, July 2016. URL: http://dx.doi.org/10.1093/bioinformatics/btw397.
Taku Onodera and Tetsuo Shibuya. The gapped spectrum kernel for support vector machines. In Proceedings of the 9th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM'13, pages 1-15, Berlin, Heidelberg, 2013. Springer-Verlag. URL: http://dx.doi.org/10.1007/978-3-642-39712-7_1.
Rachid Ounit and Stefano Lonardi. Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics, 32(24):3823, 2016. URL: http://dx.doi.org/10.1093/bioinformatics/btw542.
Rachid Ounit, Steve Wanamaker, Timothy J. Close, and Stefano Lonardi. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16(1):1-13, 2015. URL: http://dx.doi.org/10.1186/s12864-015-1419-2.
Laxmi Parida, Cinzia Pizzi, and Simona E. Rombo. Irredundant tandem motifs. Theoretical Computer Science, 525:89-102, 2014. Advances in Stringology. URL: http://dx.doi.org/10.1016/j.tcs.2013.08.012.
C. Pizzi, P. Rastas, and E. Ukkonen. Finding significant matches of position weight matrices in linear time. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1):69-79, Jan 2011. URL: http://dx.doi.org/10.1109/TCBB.2009.35.
Stephen M. Rumble, Phil Lacroute, Adrian V. Dalca, Marc Fiume, Arend Sidow, and Michael Brudno. Shrimp: Accurate mapping of short color-space reads. PLOS Computational Biology, 5(5):1-11, 05 2009. URL: http://dx.doi.org/10.1371/journal.pcbi.1000386.
Ariya Shajii, Deniz Yorukoglu, Yun William Yu, and Bonnie Berger. Fast genotyping of known snps through approximate k-mer matching. Bioinformatics, 32(17):i538, 2016. URL: http://dx.doi.org/10.1093/bioinformatics/btw460.
Derrick E. Wood and Steven L. Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15:R46, 2014. URL: http://dx.doi.org/10.1186/gb-2014-15-3-r46.

Fast Spaced Seed Hashing

Authors Samuele Girotto, Matteo Comin, Cinzia Pizzi