Bermuda: de novo assembly of transcripts with new insights for handling uneven coverage

Authors:

Jinbo Xu

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics

Pages 166 - 175

https://doi.org/10.1145/2808719.2808736

Published: 09 September 2015

Abstract

Motivation: RNA-seq has made it feasible to analyze the whole set of expressed mRNAs. Mapping-based assembly of RNA-seq reads is sometimes infeasible because high-quality references are lacking. De novo assembly, however, is very challenging because expression levels are uneven among transcripts and read coverage varies within a single transcript. Existing methods either apply de Bruijn graphs of a single k-mer size to assemble the full set of transcripts, or conduct multiple assembly runs but still use a graph of a single k-mer size in each run. A single k-mer size, however, is not suitable for all regions of transcripts with varied coverage.
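To make the k-mer size trade-off concrete, here is a minimal sketch of a de Bruijn graph built from reads at one fixed k. It is an illustration only, not code from the paper; the function names `de_bruijn_edges` and `branching_nodes` and the toy read are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns {prefix_node: {suffix_node: k-mer count}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def branching_nodes(graph):
    """Nodes with more than one distinct successor (ambiguous extensions)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Toy read (hypothetical) containing the repeat "ACG".
reads = ["ACGTACGA"]
small_k = branching_nodes(de_bruijn_edges(reads, 3))
large_k = branching_nodes(de_bruijn_edges(reads, 5))
print(small_k, large_k)
```

With the smaller k the repeat collapses into one node and the graph branches; with the larger k the path is unambiguous, but a larger k also needs longer overlaps between reads, which low-coverage regions may not provide.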

Contribution: This paper presents Bermuda, a de novo assembler with new insights for handling uneven coverage. In contrast to existing methods that use a single k-mer size for all transcripts in each assembly run, Bermuda self-adaptively uses a few k-mer sizes to assemble different regions of a single transcript according to their local coverage. As a result, Bermuda can deal with uneven expression levels and coverage not only among transcripts but also within a single transcript. Extensive tests show that Bermuda outperforms popular de novo assemblers in reconstructing unevenly expressed transcripts, with longer length, better contiguity, and lower redundancy. Bermuda is also computationally efficient, with moderate memory consumption.
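The idea of choosing a k-mer size from local coverage can be sketched as follows. This is not Bermuda's actual algorithm, which the abstract does not specify; the candidate k-mer sizes, the coverage thresholds, and the names `choose_k` and `regions_by_k` are all hypothetical and chosen only to illustrate the principle.

```python
from itertools import groupby


def choose_k(coverage, k_sizes=(21, 31, 41), thresholds=(10, 30)):
    """Pick a k-mer size for one local coverage value.

    Assumed rule (hypothetical): low coverage gets a small k so that
    sparse reads still overlap; high coverage gets a large k so that
    repeats are resolved.
    """
    for k, limit in zip(k_sizes, thresholds):
        if coverage < limit:
            return k
    return k_sizes[-1]


def regions_by_k(local_coverage):
    """Collapse per-position coverage into (start, end, k) runs, end exclusive."""
    ks = [choose_k(c) for c in local_coverage]
    regions, start = [], 0
    for k, run in groupby(ks):
        length = len(list(run))
        regions.append((start, start + length, k))
        start += length
    return regions


# Made-up coverage profile along one transcript.
profile = [3, 5, 8, 12, 25, 40, 60, 55, 9]
print(regions_by_k(profile))
```

In this toy profile the transcript is split into four regions, each assigned its own k, which is the within-transcript adaptivity the paragraph above describes.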

Availability: Supplementary materials are available through http://ttic.uchicago.edu/~qmtang/

Index Terms

1. Applied computing
  1. Life and medical sciences
2. Mathematics of computing
  1. Mathematical software

Published In

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics

September 2015

683 pages

ISBN:9781450338530

DOI:10.1145/2808719

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGBio: ACM Special Interest Group on Bioinformatics

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Science Foundation CAREER award
Alfred P. Sloan Fellowship

Conference

BCB '15

Sponsor: SIGBio

BCB '15: ACM International Conference on Bioinformatics, Computational Biology and Biomedicine

September 9 - 12, 2015

Atlanta, Georgia

Acceptance Rates

BCB '15 Paper Acceptance Rate 48 of 141 submissions, 34%;

Overall Acceptance Rate 254 of 885 submissions, 29%
