DOI: 10.1145/3632366.3632384
Research article, open access

EPSAPG: A Pipeline Combining MMseqs2 and PSI-BLAST to Quickly Generate Extensive Protein Sequence Alignment Profiles

Published: 03 April 2024

Abstract

Many machine learning (ML) models for protein function and structure prediction depend on evolutionary information, captured as multiple-sequence alignments (MSA) or position-specific scoring matrices (PSSM) generated by PSI-BLAST. As a result, these predictive methods carry substantial computational cost and long run times. The main bottleneck is that PSI-BLAST must load large sequence databases sequentially in batches and then search them for sequences that align to a given query. For batch queries, the runtime scales linearly with the number of queries. The problem grows worse over time: bio-sequence repositories are growing exponentially, and the runtime of PSI-BLAST increases in proportion. A promising remedy is MMseqs2, which can speed up the search by a factor of 100. However, MMseqs2 cannot directly produce the final output in the desired format of PSI-BLAST alignments and PSSM profiles. In this work, I developed a pipeline that integrates MMseqs2 and PSI-BLAST into an optimized, efficient hybrid alignment pipeline. The hybrid tool generates sequence alignment profiles two orders of magnitude faster than PSI-BLAST alone. It is implemented in C++ and is freely available under the MIT license at https://github.com/issararab/EPSAPG.
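
The abstract does not spell out the individual steps of EPSAPG, so the following is only a minimal sketch of how a two-stage search of this kind could be chained, not the author's implementation (which is in C++). It assumes that MMseqs2 first narrows the large database to a set of hits and that PSI-BLAST then runs against a small database built from those hits. The function names, file names, and the hit-extraction step are hypothetical; the command-line tools and flags shown (`mmseqs easy-search`, `makeblastdb`, `psiblast -out_ascii_pssm`) are the standard ones. The commands are only constructed here, not executed.

```python
from pathlib import Path


def hit_ids(m8_text):
    """Collect unique target IDs (2nd column) from BLAST-tab style hits, in order."""
    seen = []
    for line in m8_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] not in seen:
            seen.append(fields[1])
    return seen


def build_hybrid_commands(query_fasta, target_fasta, workdir, iterations=3):
    """Return command lines for an assumed two-stage MMseqs2 + PSI-BLAST run.

    Assumption: a separate (hypothetical) step writes the sequences of the
    MMseqs2 hits to `reduced.fasta` before makeblastdb is called.
    """
    work = Path(workdir)
    hits_m8 = work / "hits.m8"              # MMseqs2 tabular output
    reduced_fasta = work / "reduced.fasta"  # hypothetical: sequences of the hits
    reduced_db = work / "reduced_db"        # small BLAST database built from them
    return [
        # Stage 1: fast search of the full database with MMseqs2.
        ["mmseqs", "easy-search", str(query_fasta), str(target_fasta),
         str(hits_m8), str(work / "tmp")],
        # Stage 2a: build a BLAST protein database from the reduced hit set.
        ["makeblastdb", "-in", str(reduced_fasta), "-dbtype", "prot",
         "-out", str(reduced_db)],
        # Stage 2b: PSI-BLAST on the reduced database, writing alignments and a PSSM.
        ["psiblast", "-query", str(query_fasta), "-db", str(reduced_db),
         "-num_iterations", str(iterations),
         "-out_ascii_pssm", str(work / "query.pssm"),
         "-out", str(work / "query.aln")],
    ]


if __name__ == "__main__":
    for cmd in build_hybrid_commands("query.fasta", "uniref.fasta", "work"):
        print(" ".join(cmd))
```

Under this assumption, the saving would come from PSI-BLAST scanning only the reduced database rather than the full one, while its output format stays unchanged for downstream predictors.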


Published In

BDCAT '23: Proceedings of the IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies
December 2023
187 pages
ISBN: 9798400704734
DOI: 10.1145/3632366

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. protein sequences
2. sequence alignment
3. MSA methods
4. homologs
5. evolutionary information
6. PSI-BLAST
7. MMseqs2
8. PSSM

Conference

BDCAT '23
Overall acceptance rate: 27 of 93 submissions, 29%
