Nothing Special   »   [go: up one dir, main page]


Co-Linear Chaining on Pangenome Graphs

Authors Jyotshna Rajput , Ghanshyam Chandra , Chirag Jain



PDF
Thumbnail PDF

File

LIPIcs.WABI.2023.12.pdf
  • Filesize: 1.76 MB
  • 18 pages

Document Identifiers

Author Details

Jyotshna Rajput
  • Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
Ghanshyam Chandra
  • Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
Chirag Jain
  • Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India

Acknowledgements

The authors thank Manuel Cáceres, Shravan Mehra and Sunil Chandran for providing useful feedback.

Cite AsGet BibTex

Jyotshna Rajput, Ghanshyam Chandra, and Chirag Jain. Co-Linear Chaining on Pangenome Graphs. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 12:1-12:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.WABI.2023.12

Abstract

Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width [Makinen et al., TALG'19] and how incorporating gap cost in the scoring function improves alignment accuracy [Chandra and Jain, RECOMB'23]. However, it remains open on how to effectively generalize these techniques for general pangenome graphs which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy.

Subject Classification

ACM Subject Classification
  • Applied computing → Computational genomics
Keywords
  • Sequence alignment
  • variation graph
  • genome sequencing
  • path cover

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Mohamed Abouelhoda and Enno Ohlebusch. Chaining algorithms for multiple genome comparison. Journal of Discrete Algorithms, 3(2-4):321-341, 2005. Google Scholar
  2. Jasmijn A Baaijens, Paola Bonizzoni, Christina Boucher, Gianluca Della Vedova, Yuri Pirola, Raffaella Rizzi, and Jouni Sirén. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing, pages 1-28, 2022. Google Scholar
  3. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 51-58, 2015. Google Scholar
  4. Manuel Cáceres, Massimo Cairo, Brendan Mumey, Romeo Rizzi, and Alexandru I Tomescu. Sparsifying, shrinking and splicing for minimum path cover in parameterized linear time. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 359-376. SIAM, 2022. Google Scholar
  5. Ghanshyam Chandra and Chirag Jain. Sequence to graph alignment using gap-sensitive co-linear chaining. In Research in Computational Molecular Biology: 27th Annual International Conference, RECOMB 2023, Istanbul, Turkey, April 16-19, 2023, Proceedings, pages 58-73. Springer, 2023. Google Scholar
  6. Haoyu Cheng, Mobin Asri, Julian Lucas, Sergey Koren, and Heng Li. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. arXiv preprint, 2023. URL: https://arxiv.org/abs/2306.03399.
  7. Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in bioinformatics, 19(1):118-135, 2018. Google Scholar
  8. Egor Dolzhenko, Viraj Deshpande, Felix Schlesinger, Peter Krusche, Roman Petrovski, Sai Chen, Dorothea Emig-Agius, Andrew Gross, Giuseppe Narzisi, Brett Bowman, et al. Expansionhunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics, 35(22):4754-4756, 2019. Google Scholar
  9. Tatiana Dvorkina, Dmitry Antipov, Anton Korobeynikov, and Sergey Nurk. Spaligner: alignment of long diverged molecular sequences to assembly graphs. BMC bioinformatics, 21(12):1-14, 2020. Google Scholar
  10. Peter Eades, Xuemin Lin, and W.F. Smyth. A fast and effective heuristic for the feedback arc set problem. Information Processing Letters, 47(6):319-323, October 1993. Google Scholar
  11. Hannes P Eggertsson, Hakon Jonsson, Snaedis Kristmundsdottir, et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nature genetics, 49(11):1654-1660, 2017. Google Scholar
  12. David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F Italiano. Sparse dynamic programming i: linear cost functions. Journal of the ACM, 39(3):519-545, 1992. Google Scholar
  13. David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F Italiano. Sparse dynamic programming ii: convex and concave cost functions. Journal of the ACM, 39(3):546-567, 1992. Google Scholar
  14. Yang Gao, Xiaofei Yang, Hao Chen, Xinjiang Tan, Zhaoqing Yang, Lian Deng, Baonan Wang, Shuang Kong, Songyang Li, Yuhang Cui, et al. A pangenome reference of 36 chinese populations. Nature, pages 1-10, 2023. Google Scholar
  15. Shilpa Garg, Mikko Rautiainen, Adam M Novak, et al. A graph-based approach to diploid genome assembly. Bioinformatics, 34(13):i105-i114, 2018. Google Scholar
  16. Erik Garrison, Andrea Guarracino, Simon Heumos, Flavia Villani, Zhigui Bao, Lorenzo Tattini, Jörg Hagmann, Sebastian Vorbrugg, Santiago Marco-Sola, Christian Kubica, et al. Building pangenome graphs. bioRxiv, pages 2023-04, 2023. Google Scholar
  17. Erik Garrison, Jouni Sirén, Adam M Novak, Glenn Hickey, Jordan M Eizenga, Eric T Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F Lin, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 36(9):875-879, October 2018. URL: https://doi.org/10.1038/nbt.4227.
  18. Daniel Gibney, Sharma V Thankachan, and Srinivas Aluru. The complexity of approximate pattern matching on de Bruijn graphs. In Research in Computational Molecular Biology: 26th Annual International Conference, RECOMB 2022, San Diego, CA, USA, May 22-25, 2022, Proceedings, pages 263-278. Springer, 2022. Google Scholar
  19. Glenn Hickey, Jean Monlong, Jana Ebler, Adam M Novak, Jordan M Eizenga, Yan Gao, Tobias Marschall, Heng Li, and Benedict Paten. Pangenome graph construction from genome alignments with minigraph-cactus. Nature Biotechnology, pages 1-11, 2023. Google Scholar
  20. Chirag Jain, Daniel Gibney, and Sharma V Thankachan. Algorithms for colinear chaining with overlaps and gap costs. Journal of Computational Biology, 29(11):1237-1251, 2022. Google Scholar
  21. Chirag Jain, Arang Rhie, Nancy F Hansen, Sergey Koren, and Adam M Phillippy. Long-read mapping to repetitive reference sequences using winnowmap2. Nature Methods, pages 1-6, 2022. Google Scholar
  22. Chirag Jain, Haowen Zhang, Yu Gao, and Srinivas Aluru. On the complexity of sequence-to-graph alignment. Journal of Computational Biology, 27(4):640-654, April 2020. URL: https://doi.org/10.1089/cmb.2019.0066.
  23. Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, May 2018. URL: https://doi.org/10.1093/bioinformatics/bty191.
  24. Heng Li, Xiaowen Feng, and Chong Chu. The design and construction of reference pangenome graphs with minigraph. Genome Biology, 21(1), October 2020. Google Scholar
  25. Heng Li, Jue Ruan, and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research, 18(11):1851-1858, 2008. Google Scholar
  26. Wen-Wei Liao, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness, Glenn Hickey, Shuangjia Lu, Julian K Lucas, Jean Monlong, Haley J Abel, et al. A draft human pangenome reference. Nature, 617(7960):312-324, 2023. Google Scholar
  27. Tsung-Yu Lu, Human Genome Structural Variation Consortium, et al. Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs. Nature Communications, 12(1):4250, 2021. Google Scholar
  28. Xiao Luo, Xiongbin Kang, and Alexander Schönhuth. Vechat: correcting errors in long reads using variation graphs. Nature Communications, 13(1):6657, 2022. Google Scholar
  29. Jun Ma. Co-linear chaining on graphs with cycles. Master’s thesis, University of Helsinki, Faculty of Science, 2021. URL: http://hdl.handle.net/10138/330781.
  30. Jun Ma, Manuel Cáceres, Leena Salmela, Veli Mäkinen, and Alexandru I Tomescu. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. bioRxiv, 2022. URL: https://doi.org/10.1101/2022.01.07.475257.
  31. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I Tomescu. Genome-scale algorithm design. Cambridge University Press, 2015. Google Scholar
  32. Veli Mäkinen and Kristoffer Sahlin. Chaining with overlaps revisited. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020. Google Scholar
  33. Veli Mäkinen, Alexandru I Tomescu, Anna Kuosmanen, Topi Paavilainen, Travis Gagie, and Rayan Chikhi. Sparse dynamic programming on DAGs with small width. ACM Transactions on Algorithms (TALG), 15(2):1-21, 2019. Google Scholar
  34. Tom Mokveld, Jasper Linthorst, Zaid Al-Ars, Henne Holstege, and Marcel Reinders. Chop: haplotype-aware path indexing in population graphs. Genome biology, 21:1-16, 2020. Google Scholar
  35. Gene Myers and Webb Miller. Chaining multiple-alignment fragments in sub-quadratic time. In SODA, volume 95, pages 38-47, 1995. Google Scholar
  36. Gonzalo Navarro. Improved approximate pattern matching on hypertext. Theoretical Computer Science, 237(1-2):455-463, 2000. Google Scholar
  37. Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. The complete sequence of a human genome. Science, 376(6588):44-53, 2022. Google Scholar
  38. Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, et al. The complete sequence of a human genome. Science, 376(6588):44-53, April 2022. URL: https://doi.org/10.1126/science.abj6987.
  39. Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. Pbsim2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics, 37(5):589-595, 2021. Google Scholar
  40. Christian Otto, Steve Hoffmann, Jan Gorodkin, and Peter F Stadler. Fast local fragment chaining using sum-of-pair gap costs. Algorithms for Molecular Biology, 6(1):4, 2011. Google Scholar
  41. Benedict Paten, Adam M Novak, Jordan M Eizenga, and Erik Garrison. Genome graphs and the evolution of genome inference. Genome research, 27(5):665-676, 2017. Google Scholar
  42. Mikko Rautiainen and Tobias Marschall. Graphaligner: rapid and versatile sequence-to-graph alignment. Genome biology, 21(1):253, 2020. Google Scholar
  43. Mikko Rautiainen, Sergey Nurk, Brian P Walenz, Glennis A Logsdon, David Porubsky, Arang Rhie, Evan E Eichler, Adam M Phillippy, and Sergey Koren. Telomere-to-telomere assembly of diploid chromosomes with verkko. Nature Biotechnology, pages 1-9, 2023. Google Scholar
  44. Nicola Rizzo, Manuel Cáceres, and Veli Mäkinen. Chaining of maximal exact matches in graphs. arXiv preprint, 2023. URL: https://arxiv.org/abs/2302.01748.
  45. Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004. Google Scholar
  46. Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet. A survey of mapping algorithms in the long-reads era. bioRxiv, 2022. Google Scholar
  47. Leena Salmela and Eric Rivals. Lordec: accurate and efficient long read error correction. Bioinformatics, 30(24):3506-3514, 2014. Google Scholar
  48. Jouni Sirén, Erik Garrison, Adam M Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400-407, 2020. Google Scholar
  49. Jouni Sirén, Jean Monlong, Xian Chang, et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science, 374(6574):abg8871, 2021. Google Scholar
  50. Robert Tarjan. Depth-first search and linear graph algorithms. SIAM journal on computing, 1(2):146-160, 1972. Google Scholar
  51. Ting Wang, Lucinda Antonacci-Fulton, Kerstin Howe, et al. The human pangenome project: a global resource to map genomic diversity. Nature, 604(7906):437-446, April 2022. Google Scholar
  52. Haowen Zhang, Shiqi Wu, Srinivas Aluru, and Heng Li. Fast sequence to graph alignment using the graph wavefront algorithm. arXiv preprint, 2022. URL: https://arxiv.org/abs/2206.13574.
  53. Yao Zhou, Zhiyang Zhang, Zhigui Bao, Hongbo Li, Yaqing Lyu, Yanjun Zan, Yaoyao Wu, Lin Cheng, Yuhan Fang, Kun Wu, et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature, 606(7914):527-534, 2022. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail