Abstract
Database preprocessing in order to create an index often permits considerable speedup in search compared to the iterated query of an unprocessed database. In this paper we apply index-based database lookup to a range search problem that arises in mass spectrometry-based proteomics: given a large collection of sparse integer sets and a sparse query set, find all the sets from the collection that have at least k integers in common with the query set. This problem arises when searching for a mass spectrum in a database of theoretical mass spectra using the shared peaks count as similarity measure. The algorithms can easily be modified to use the more advanced shared peaks intensity measure instead of the shared peaks count. We introduce three different algorithms solving these problems. We conclude by presenting some experiments using the algorithms on realistic data showing the advantages and disadvantages of the algorithms.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Aebersold, R., Mann, M.: Mass spectrometry-based proteomics. Nature 422(6928), 198–207 (2003)
Dutta, D., Chen, T.: Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search. Bioinformatics 23(5), 612–618 (2007)
Elias, J.E., Gibbons, F.D., King, O.D., Roth, F.P., Gygi, S.P.: Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22(2), 214–219 (2004)
Frank, A., Tanner, S., Pevzner, P.A.: Peptide sequence tags for fast database search in mass-spectrometry. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2005. LNCS (LNBI), vol. 3500, pp. 326–341. Springer, Heidelberg (2005)
Havilio, M., Haddad, Y., Smilansky, Z.: Intensity-based statistical scorer for tandem mass spectrometry. Anal. Chem. 75(3), 435–444 (2003)
Izumi, T., Yokomaru, T., Takahashi, A., Kajitani, Y.: Computational complexity analysis of set-bin-packing problem. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E81-A(5), 842–849 (1998)
Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R., Shoemaker, D.D.: Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302(5653), 2141–2144 (2003)
Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., Apweiler, R.: The international protein index: An integrated database for proteomics experiments. proteomics 4(7), 1985–1988 (2004)
Mann, M., Jensen, O.N.: Proteomic analysis of post-translational modifications. Nature Biotechnol. 21(3), 255–261 (2003)
Mann, M., Wilm, M.: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66(24), 4390–4399 (1994)
Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L., Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Irzyk, G.P., Jando, S.C., Alenquer, M.L., Jarvie, T.P., Jirage, K.B., Kim, J.B., Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J., Lohman, K.L., Lu, H., Makhijani, V.B., McDade, K.E., McKenna, M.P., Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth, G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., Tomasz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P., Yu, P., Begley, R.F., Rothberg, J.M.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057), 376–380 (2005)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Palagi, P.M., Hernandez, P., Walther, D., Appel, R.D.: Proteome informatics I: Bioinformatics tools for processing experimental data. Proteomics 6(20), 5435–5444 (2006)
Ramakrishnan, S.R., Mao, R., Nakorchevskiy, A.A., Prince, J.T., Willard, W.S., Xu, W., Marcotte, E.M., Miranker, D.P.: A fast coarse filtering method for peptide identification by mass spectrometry. Bioinformatics 22(12), 1524–1531 (2006)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11. IEEE Press, Los Alamitos (1973)
Whitfield, E.J., Pruess, M., Apweiler, R.: Bioinformatics database infrastructure for biotechnology research. J. Biotechnol. 124(4), 629–639 (2006)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Besenbacher, S., Schwikowski, B., Stoye, J. (2010). Indexing and Searching a Mass Spectrometry Database. In: Elomaa, T., Mannila, H., Orponen, P. (eds) Algorithms and Applications. Lecture Notes in Computer Science, vol 6060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12476-1_4
Download citation
DOI: https://doi.org/10.1007/978-3-642-12476-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12475-4
Online ISBN: 978-3-642-12476-1
eBook Packages: Computer ScienceComputer Science (R0)