DOI: 10.1145/3584371.3612996
research-article
Open access

Phylogenomics using Compression Distances: Incorporating Rate Heterogeneity and Amino Acid Properties

Published: 04 October 2023

Abstract

Efforts to reconstruct the tree of life have a long history, but the field has changed fundamentally in the genomic era. Phylogenomics examines evolutionary relationships using very large datasets, so a major problem in the field is the development of unbiased computational methods for tree inference. Sources of bias include sequence alignment errors, discordance among gene trees, and long branch attraction. Distances based on data compression can address sequence alignment errors, and analyses of distances may be robust to a major source of discordance among gene trees (incomplete lineage sorting). However, compression distances appear to be susceptible to long branch attraction. This study tested the hypothesis that compression distances can be modified to be more resistant to long branch attraction and found that correcting compression distances for multiple substitutions improved their behavior. Calculating distances after grouping amino acids based on their physicochemical properties incorporated more biological information. The modified compression distances used in this study also made it possible to estimate tree support using a method that closely resembles the bootstrap, the most popular support metric in phylogenomics.
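The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation, and every specific choice in it is an assumption made for illustration: the normalized compression distance is the standard formulation from the compression-clustering literature, zlib stands in for whatever compressor the study used, the six amino acid groups are the widely used Dayhoff-style recoding, and the gamma-style transform (with an arbitrary shape parameter `alpha`) is a generic correction for multiple substitutions and rate heterogeneity borrowed from p-distance practice. The function and constant names are hypothetical.

```python
import zlib

# ASSUMPTION: Dayhoff-style six-state grouping by physicochemical similarity.
# The abstract does not say which grouping the study used.
DAYHOFF6_GROUPS = ("AGPST", "DENQ", "HKR", "ILMV", "FWY", "C")
_RECODE_TABLE = str.maketrans(
    {aa: group[0] for group in DAYHOFF6_GROUPS for aa in group}
)


def recode(seq: str) -> str:
    """Replace each residue with the first letter of its group.

    Characters outside the 20 standard residues are left unchanged.
    """
    return seq.upper().translate(_RECODE_TABLE)


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed text (ASSUMPTION: zlib, level 9)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def gamma_correct(d: float, alpha: float = 1.0) -> float:
    """Generic gamma-style correction: alpha * ((1 - d) ** (-1 / alpha) - 1).

    ASSUMPTION: illustrative only; the study's actual correction is not
    given in the abstract. The raw distance is clipped to [0, 0.99] because
    real compressors can return values slightly outside [0, 1).
    """
    d = min(max(d, 0.0), 0.99)
    return alpha * ((1.0 - d) ** (-1.0 / alpha) - 1.0)


def corrected_distance(x: str, y: str, alpha: float = 1.0) -> float:
    """Recode both sequences, compute the NCD, then apply the correction."""
    return gamma_correct(ncd(recode(x), recode(y)), alpha)
```

Because the distance is computed on unaligned sequences, a pairwise matrix built with `corrected_distance` could be passed to any distance-based tree method such as neighbor joining. The transform stretches large raw distances more than small ones, which is the general mechanism by which corrections for multiple substitutions reduce long branch attraction.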

        Published In

        BCB '23: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
        September 2023
        626 pages
        ISBN:9798400701269
        DOI:10.1145/3584371
        This work is licensed under a Creative Commons Attribution International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. phylogenetics
        2. protein evolution
        3. data compression
        4. kolmogorov complexity
        5. multispecies coalescent
        6. incomplete lineage sorting
        7. long branch attraction

        Acceptance Rates

        Overall Acceptance Rate 254 of 885 submissions, 29%
