Abstract
We present a new robust solution to short peptide searching, tested on a relational platform, with a set of biological queries. Our algorithm is appropriate for large scale scientific data analysis, and has been tested with 1.4 GB of amino-acids. Protein sequences are indexed as short overlapping string windows, and stored in a relation. To find approximate matches, we use a neighbourhood generation algorithm. The words in the neighbourhood are then fetched and stored in a relation. We measure execution time and compare the matches found to those delivered by BLAST. We report some performance gains in exact matching and searching within edit distance 1, and very significant quality improvements over heuristics, as we guarantee to deliver all relevant matches.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
Altschul, S.F., Madden, T.L., Schaeffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)
Baeza-Yates, R., Navarro, G.: A Hybrid Indexing Method for Approximate String Matching. JDA 1, 205–239 (2001)
Burkhardt, S., et al.: q-gram Based Database Searching Using a Suffix Array. In: RECOMB, pp. 77–83. ACM Press, New York (1999)
Eckman, B.A., Kaufmann, A.: Querying BLAST within a Data Federation. IEEE Data Eng. Bull. 27(3), 12–19 (2004)
Eidhammer, I., Jonassen, I., Taylor, W.R.: Protein Bioinformatics. Wiley, Chichester (2003)
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., McKenney, K., Sutton, G., FitzHugh, W., Fields, C., Gocayne, J.D., Scott, J., Shirley, R., Liu, L.-I., Glodek, A., Kelley, J.M., Weidman, J.F., Phillips, C.A., Spriggs, T., Hedblom, E., Cotton, M.D., Utterback, T.R., Hanna, M.C., Nguyen, D.T., Saudek, D.M., Brandon, R.C., Fine, L.D., Fritchman, J.L., Fuhrmann, J.L., Geoghagen, N.S.M., Gnehm, C.L., McDonald, L.A., Small, K.V., Fraser, C.M., Smith, H.O., Venter, J.C.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223), 496–512 (1995)
Gautier, L., et al.: Alternative mapping of probes to genes for Affymetrix chips. BMC Bioninformatics, p. 111 (2004)
Guccione, S.A., Keller, E.: Gene Matching using JBits. In: Glesner, M., Zipf, P., Renovell, M. (eds.) FPL 2002. LNCS, vol. 2438, pp. 1168–1171. Springer, Heidelberg (2002)
Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997)
Hyyrö, H., Navarro, G.: A Practical Index for genome Searching. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 341–349. Springer, Heidelberg (2003)
Hunt, E.: Indexed Searching on Proteins Using a Suffix Sequoia. IEEE Data Eng. Bulletin 27(3), 24–31 (2004)
Hunt, E., Atkinson, M.P., Irving, R.W.: A database index to large biological sequences. In: VLDB, pp. 139–148. Morgan Kaufmann, San Francisco (2001)
Hunt, E., Atkinson, M.P., Irving, R.W.: Database Indexing for Large DNA and Protein Sequence Collections. The. VLDB Journal 11, 256–271 (2002)
Kahveci, T., Singh, A.K.: An Efficient Index Structure for String Databases. In: VLDB, pp. 351–360. Morgan and Kaufmann, Washington (2001)
Kahveci, T., Singh, A.K.: Progressive searching of biological sequences. IEEE Data Eng. Bull. 27(3), 32–39 (2004)
Karakoç, E., Özsoyoglu, Z.M., Sahinalp, S.C., Tasan, M., Zhang, X.: Novel approaches to biomolecular sequence indexing. IEEE Data Eng. Bull. 27(3), 40–47 (2004)
Kent, W.J.: BLAT: The BLAST-like Alignment Tool. Genome Res. 12(4), 656–664 (2002)
Kim, Y.J., Boyd, A., Athey, B.D., Patel, J.M.: miBLAST: Scalable Evaluation of a Batch of Nucleotide Sequence Queries with BLAST. Nucleic Acids Research 33, 4335–4344 (2005)
Levenstein, V.I.: Binary codes capable of correcting insertions and reversals. Sov. Phys. Dokl. 10, 707–710 (1966)
Meek, C., Patel, J.M., Kasetty, S.: OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences. In: VLDB 2003, pp. 910–921 (2003)
Mewes, H.W., Hani, J., Pfeiffer, F., Frishman, D.: MIPS: a database for protein sequences and complete genomes. Nucleic Acids Research 26, 33–37 (1998)
Miller, C., Gurd, J., Brass, A.: A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases. Bioinformatics 15, 111–121 (1999)
Miranker, D.P., Briggs, W.J., Mao, R., Ni, S., Xu, W.: Biosequence Use Cases in MoBIoS SQL. IEEE Data Eng. Bull. 27(3), 3–11 (2004)
Myers, E.W.: A sublinear algorithm for approximate key word searching. Algorithmica 12(4/5), 345–374 (1994)
Navarro. G.: NR-grep: A Fast and Flexible Pattern Matching Tool. Technical report (2000). TR/DCC-2000-3. University of Chile, Departmento de Ciencias de la Computacion, www.dcc.uchile.cl/~gnavarro
Needleman, S.B., Wunsch, C.D.: A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins. J. Mol. Biol. 48, 443–453 (1970)
Ning, Z., Cox, A.J., Mullikin, J.C.: SSAHA: A Fast Search Method for Large DNA Databases. Genome Res. 11(10), 1725–1729 (2001)
Sethupathy, P., et al.: A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 3(11), 881–886 (2006)
Sidhu, S.S.: Phage Display In Biotechnology and Drug Discovery. Taylor and Francis, Abington (2005)
Smith, T.A., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
Stephens, S., Chen, J.Y., Thomas, S.: ODM BLAST: Sequence Homology Search in the RDBMS. IEEE Data Eng. Bull. 27(3), 20–23 (2004)
Ukkonen, E.: Approximate string matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) Combinatorial Pattern Matching. LNCS, vol. 684, pp. 228–242. Springer, Heidelberg (1993)
Wall, L., Schwartz, R.L., Christiansen, T., Potter, S.: Programming Perl. Nutshell Handbook, 2nd edn. O’Reilly & Associates (1996)
Work, L.M., Bining, H., Hunt, E., et al.: Vascular Bed-Targeted in vivo Gene Delivery Using Tropism-Modified Adeno-associated Viruses. Molecular Therapy 13(4), 683–693 (2006)
Yamaguchi, Y., Miyajima, Y., Maruyama, T., Konagaya, A.: High Speed Homology Search Using Run-Time Reconfiguration. In: Glesner, M., Zipf, P., Renovell, M. (eds.) FPL 2002. LNCS, vol. 2438, pp. 281–291. Springer, Heidelberg (2002)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hunt, E. (2007). Exhaustive Peptide Searching Using Relations. In: Cooper, R., Kennedy, J. (eds) Data Management. Data, Data Everywhere. BNCOD 2007. Lecture Notes in Computer Science, vol 4587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73390-4_3
Download citation
DOI: https://doi.org/10.1007/978-3-540-73390-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73389-8
Online ISBN: 978-3-540-73390-4
eBook Packages: Computer ScienceComputer Science (R0)