Abstract
The rapidly growing volume of genomic data, including pathogen genomes, both invites exploration of possible phylogenetic relationships among unclassified organisms and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure, e.g. among viral outbreaks, is important for characterizing the life of a virus in its biological reservoir.
In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score, in order to quantify the probability that a given alignment was no better than chance, and applied it to Human Papillomavirus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences fell from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers. This indicates that k-mers provide sufficient specificity to characterize differences between sequences by their k-mer frequencies, so that distances based on k-mer frequencies can serve as a proxy for evolutionary distance. We constructed phylogenies from primate mtDNA coding sequences and from Ebola samples. UPGMA phylogenies built from k-mer frequencies of the primate mtDNA coding region reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant (\(p \le 1\times 10^{-5}\)). We also characterized differences in selection pressure between coding and non-coding regions, and between early and late cell cycle genes in Ebola. The comparison of coding and non-coding regions showed evidence of purifying selection, while early and late cell cycle proteins differed, with late cycle proteins showing a pattern resembling an influenza-like immunological response; notably, the g-proteins are among the late genes.
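As a rough illustration of the alignment-free idea summarized above, the sketch below computes k-mer frequency vectors, a pairwise distance matrix, and a permutation-based Mantel test comparing two distance matrices. It is a minimal sketch, not the authors' implementation: the abstract does not state which distance was applied to the k-mer frequencies, so Euclidean distance is an assumption here, and all function names (`kmer_frequencies`, `distance_matrix`, `mantel`) are hypothetical.

```python
from collections import Counter
from itertools import product
import math
import random


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing non-ACGT symbols are ignored in the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def euclidean(u, v):
    # ASSUMPTION: Euclidean distance stands in for the paper's distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(seqs, k):
    """Symmetric matrix of pairwise distances between k-mer frequency vectors."""
    freqs = [kmer_frequencies(s, k) for s in seqs]
    n = len(freqs)
    return [[euclidean(freqs[i], freqs[j]) for j in range(n)] for i in range(n)]


def _upper(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test: correlation of two distance matrices, with a
    p-value from jointly permuting the rows and columns of d2."""
    rng = random.Random(seed)
    n = len(d1)
    x = _upper(d1)
    observed = _pearson(x, _upper(d2))
    hits = 0
    idx = list(range(n))
    for _ in range(permutations):
        rng.shuffle(idx)
        permuted = [[d2[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if _pearson(x, _upper(permuted)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    toy = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTTTTTGGGG"]  # toy sequences only
    d = distance_matrix(toy, 2)
    print(d[0][1] < d[0][2])
```

A distance matrix of this kind could then be passed to a clustering method such as UPGMA, or compared against tree-derived distances with `mantel`. The smallest p-value the permutation test can report is 1/(permutations + 1), so resolving significance at the \(10^{-5}\) level would need on the order of \(10^{5}\) permutations.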
F. Utro and D.E. Platt—Contributed equally to this work.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Utro, F., Platt, D.E., Parida, L. (2019). A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction. In: Bartoletti, M., et al. Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2017. Lecture Notes in Computer Science, vol 10834. Springer, Cham. https://doi.org/10.1007/978-3-030-14160-8_3
DOI: https://doi.org/10.1007/978-3-030-14160-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14159-2
Online ISBN: 978-3-030-14160-8
eBook Packages: Computer Science, Computer Science (R0)