Qucosa - Leipzig: New Algorithms for Fast and Economic Assembly

Author

Thomas Gatter

Title

New Algorithms for Fast and Economic Assembly

Subtitle

Advances in Transcriptome and Genome Assembly

Citable URL:

https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-780992

Date of submission

01.10.2021

Date of defense

09.02.2022

Abstract (EN)

Great efforts have been devoted to deciphering the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular, genome studies have undergone a fundamental paradigm shift: genome projects are no longer limited by sequencing costs, but rather by the computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, “hybrid” methods that integrate short- and long-read data have been devised to address this need.

LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only a few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, in multiple steps, subgraphs whose global properties approach those of a disjoint union of paths, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands.

Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive. RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. The computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via Linear Programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets.

Despite ongoing research and our own contributions, progress on traditional single-sample assembly has brought no major breakthrough. Multi-sample RNA-seq experiments provide more information, which, however, is challenging to utilize due to the large number of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at the level of low-level features. Benchmarks show stable improvements with as few as 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, which is also demonstrable when mixing conditions or time series and for differential expression analysis. Ryūtō's approach to guided assembly is equally unique: it allows users to adjust results based on the quality of the guide, even for multi-sample assembly.
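
Illustrative sketch (not part of the original record): the abstract states that LazyB translates a bipartite overlap graph between long reads and unitigs into a long-read overlap graph. The Python snippet below shows one simple way such a translation could be pictured, namely as a projection in which two long reads become adjacent when they share at least one unitig. The input format, the function name project_to_long_read_graph, and the min_shared parameter are assumptions made for this example and are not taken from LazyB or MuCHSALSA, which additionally use alignment positions and filtering not modeled here.

```python
from collections import defaultdict
from itertools import combinations


def project_to_long_read_graph(anchors, min_shared=1):
    """Project a bipartite (long read, unitig) graph onto the long reads.

    anchors: iterable of (long_read_id, unitig_id) pairs, meaning that the
             unitig aligns to the long read (hypothetical input format).
    min_shared: minimum number of shared unitigs required to keep an edge
                (hypothetical parameter for this sketch).

    Returns a dict mapping a sorted pair (read_a, read_b) to the number of
    unitigs both reads have in common.
    """
    # Group long reads by the unitig they are anchored to.
    reads_by_unitig = defaultdict(set)
    for read_id, unitig_id in anchors:
        reads_by_unitig[unitig_id].add(read_id)

    # Every pair of reads anchored to the same unitig gains one shared anchor.
    shared = defaultdict(int)
    for reads in reads_by_unitig.values():
        for read_a, read_b in combinations(sorted(reads), 2):
            shared[(read_a, read_b)] += 1

    # Keep only pairs supported by enough shared unitigs.
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy data: read1-read2 share u2, read2-read3 share u3, read4 is isolated.
anchors = [
    ("read1", "u1"), ("read1", "u2"),
    ("read2", "u2"), ("read2", "u3"),
    ("read3", "u3"),
    ("read4", "u9"),
]
overlaps = project_to_long_read_graph(anchors)
print(overlaps)
```

On this toy input the projected graph is a single path read1 - read2 - read3, which is the kind of path-like structure the abstract describes as the target of the subsequent subgraph extraction. Raising min_shared is one conceivable way to suppress edges supported by a single, possibly spurious, unitig.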

Free keywords (EN)

Assembly, Transcriptome, Genome, Sequencing, Graph Theory

Classification (DDC)

000

Degree-granting / examining institution

Universität Leipzig, Leipzig

Version / review status

updated version

URN Qucosa

urn:nbn:de:bsz:15-qucosa2-780992

Date of publication on Qucosa

18.02.2022

Document type

Dissertation

Language of the document

English

License / rights notice

CC BY-NC-ND 4.0

Table of contents

1 Preface
1.1 Assembly: A vast and fast evolving field 
1.2 Structure of this Work 
1.3 Available 
2 Introduction
2.1 Mathematical Background 
2.2 High-Throughput Sequencing
2.3 Assembly 
2.4 Transcriptome Expression 

3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
3.1 Background 
3.2 Strategy 
3.3 Data preprocessing 
3.4 Processing of the overlap graph 
3.5 Post Processing of the Path Decomposition
3.6 Benchmarking 
3.7 MuCHSALSA – Moving towards the future

4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
4.1 Background 
4.2 Strategy 
4.3 The Ryūtō core algorithm 
4.4 Improved Multi-sample transcript assembly with Ryūtō 

5 Conclusion & Future Work
5.1 Discussion and Outlook 
5.2 Summary and Conclusion

Full text (PDF)

Terms of use for the digital objects in Qucosa