Abstract
We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing the DNA sequences efficiently based on q-grams to facilitate similarity search in a DNA database and sidestep the need for linear scan of the entire database. Two level index – hash table and c-trees – are proposed based on the q-grams of DNA sequences. The proposed data structures allow the quick detection of sequences within a certain distance to the query sequence. Experimental results show that our method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: A basic local alignment search tool. Journal of Molecular Biology (1990)
Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.P., Vingron, M.: q-gram based database searching using a suffix array (quasar). In: Int. Conf. RECOMB, Lyon (April 1999)
Cao, X., Li, S.C., Ooi, B.C., Tung, A.: Piers: An efficient model for similarity search in dna sequence databases. ACM Sigmod Record 33 (2004)
Giladi, E., Walker, M., Wang, J., Volkmuth, W.: Sst: An algorithm for searching sequence databases in time proportional to the logarithm of the database size. In: Int. Conf. RECOMB, Japan (2000)
Hunt, E., Atkinson, M.P., Irving, R.W.: A database index to large biological sequences. International Journal on VLDB, 139–148 (September 2001)
Jokinen, P., Ukkonen, E.: Two algorithm for approximate string matching in static texts. In: Proc. of the 16th Symposium on Mathematical Foundataions of Computer Science, pp. 240–248 (1991)
Kahveci, T., Singh, A.: An efficient index structure for string databases. In: Proc. 2001 Int. Conf. Very Large Data Bases (VLDB 2001), Roma, Italy (2001)
Ma, B., Tromp, J., Li, M.: Patternhunter: faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string search. SIAM Journal on Computing 22, 935–948 (1993)
Meek, C., Patel, J.M., Kasetty, S.: Oasis: An online and accurate technique for local-alignment searches on biological sequences. In: Proc. 2003 Int. Conf. Very Large Data Bases (VLDB 2003), Berlin, Germany, September 2003, pp. 910–921 (2003)
Muthukrishnan, S., Sahinalp, S.C.: Approximate nearest neighbors and sequence comparison with block operation. In: STOC, Portland, Or (2000)
Ozturk, O., Ferhatosmanoglu, H.: Effective indexing and filtering for similarity search in large biosequence datasbases. In: Third IEEE Symposium on BioInformatics and BioEngineering (BIBE 2003), Bethesda, Maryland (2003)
Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. Proceedings Natl. Acad. Sci. USA 85, 2444–2448 (1988)
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Molecular Biology 147, 195–197 (1981)
Tan, Z., Cao, X., Ooi, B.C., Tung, A.: The ed-tree: an index for large dna sequence databases. In: Proc. 15th Int. Conf. on Scientific and Statistical Database Management, pp. 151–160 (2003)
Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th IEEE Symp. On Switching and Automata Theory, pp. 1–11 (1973)
Williams, H.E., Zobel, J.: Indexing and retrieval for genomic databases. IEEE Transactions on Knowledge and Data Engineering 14, 63–78 (2002)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cao, X., Li, S.C., Tung, A.K.H. (2005). Indexing DNA Sequences Using q-Grams. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_4
Download citation
DOI: https://doi.org/10.1007/11408079_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer ScienceComputer Science (R0)