Abstract
Protein identification is a key step in metaproteomics research. This task should be migrated to a fast data streaming architecture to increase horizontal scalability and performance. A protein database search involves two steps: the pairwise matching of experimental spectra against protein sequences, which creates peptide-spectrum matches (PSMs), and the statistical validation of those PSMs. Peptide-spectrum matching is inherently parallelizable because each match is independent. However, measurement errors and artifacts make false positive matches unavoidable, so statistical validation is required. State-of-the-art validation uses the target-decoy method, which estimates the false discovery rate (FDR) by searching against a shuffled version of the original protein database. In contrast to the database search itself, target-decoy validation is not parallelizable, because the FDR approximation requires all experimental data at once. Consequently, the target-decoy approach is no longer feasible in a fast data architecture, and a novel approach is required to avoid false discovery of PSMs on streaming, single-pass experimental data. The recently proposed nokoi classifier is a promising candidate for this problem. In this paper, we present a general nokoi pipeline for creating such a decoy-free classifier, which reaches over 95% accuracy on general metaproteomics data.
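To illustrate why target-decoy validation needs the complete data set, the following minimal Python sketch estimates the FDR at each score cutoff as the ratio of decoy hits to target hits. It is not the authors' implementation: the `PSM` record, the `fdr_threshold` function, the `max_fdr` parameter, and the convention that higher scores are better are all assumptions made for illustration, and ties in score are not treated specially.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """Hypothetical peptide-spectrum match record (illustrative only)."""
    score: float    # search-engine score; higher is assumed to be better
    is_decoy: bool  # True if the match came from the shuffled (decoy) database


def fdr_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    The FDR at a cutoff is estimated as
    (number of decoy PSMs at or above it) / (number of target PSMs at or above it).
    Returns None if no cutoff satisfies the bound.
    """
    # The global sort is the step that requires every PSM to be available,
    # which is what prevents a single-pass streaming evaluation.
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    cutoff = None
    for psm in ranked:
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = psm.score
    return cutoff
```

Because the cutoff depends on the ranking of all target and decoy matches, adding further PSMs later can change which earlier matches are accepted. A decoy-free classifier, by contrast, is meant to judge each PSM independently as it arrives.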
Supported by de.NBI.
Notes
1. Data sizes: BIOGAS1 (5984 PSMs), BIOGAS2 (8367 PSMs), BIOGAS3 (8921 PSMs).
2. Data sizes: GUT1 (4819 PSMs), GUT2 (2317 PSMs), GUT3 (2685 PSMs).
Acknowledgment
The authors sincerely thank Xiao Chen, Sebastian Krieter, Andreas Meister and Marcus Pinnecke for their support and advice. This work is partly funded by the de.NBI Network (031L0103), the European Regional Development Fund (grant no.: 11.000sz00.00.0 17 114347 0), the DFG (grant no.: SA 465/50-1), and the German Federal Ministry of Food and Agriculture (grant no.: 22404015). It is dedicated to the memory of Mikhail Zoun.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zoun, R. et al. (2018). Streaming FDR Calculation for Protein Identification. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_10
DOI: https://doi.org/10.1007/978-3-030-00063-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00062-2
Online ISBN: 978-3-030-00063-9
eBook Packages: Computer Science (R0)