Abstract
Archaea are single-celled organisms found in practically every habitat, where they serve essential ecosystem functions such as carbon fixation and nitrogen cycling. Classifying these organisms is challenging because most have not been isolated in a laboratory and are known only from gene sequences recovered from environmental samples. This paper presents an automated approach to classification at any taxonomic level, based on an ensemble method using non-comparative features. The methodology avoids the problems of reference-based classification because it does not compare sequences directly against reference genomes; it relies instead on features computed from the biological sequences themselves. We obtained high accuracy at different taxonomic levels: for example, 96% in the phylum classification task and 91% in the genus identification task over a pool of 55 different genera. These results show that the proposed methodology is a fast, highly accurate solution for archaea identification and classification, which is particularly relevant given how difficult these organisms are to classify. The method and complete study are freely available, under the GPLv3 license, at https://github.com/jorgeMFS/Archaea2.
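As a rough illustration of what "non-comparative" features can look like, the sketch below computes a few per-sequence descriptors without consulting any reference genome. It is not the authors' pipeline: the function names are hypothetical, the choice of features (length, GC content, compressibility) is an assumption, and zlib is used only as a general-purpose stand-in for a specialized DNA compressor.

```python
import zlib


def gc_content(seq: str) -> float:
    """Fraction of G and C among the A, C, G, T symbols of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def bits_per_base(seq: str) -> float:
    """Compressed size in bits per symbol (zlib is an assumed stand-in)."""
    data = seq.upper().encode("ascii")
    if not data:
        return 0.0
    return 8 * len(zlib.compress(data, 9)) / len(data)


def sequence_features(seq: str) -> dict:
    """Reference-free feature vector for a single sequence."""
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "bits_per_base": bits_per_base(seq),
    }


if __name__ == "__main__":
    # A highly repetitive toy sequence compresses to far fewer bits per base.
    print(sequence_features("ACGT" * 500))
```

Feature vectors of this kind could then be passed to any standard classifier, one model per taxonomic level; which features and classifiers the study actually uses is described in the full paper and repository.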
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Silva, J.M., Pratas, D., Caetano, T., Matos, S. (2022). Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_25
DOI: https://doi.org/10.1007/978-3-031-04881-4_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04880-7
Online ISBN: 978-3-031-04881-4
eBook Packages: Computer Science (R0)