Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2975167.2985691acmconferencesArticle/Chapter ViewAbstractPublication PagesbcbConference Proceedingsconference-collections
research-article

High-performance data structures for de novo assembly of genomes: cache oblivious generic programming

Published: 02 October 2016 Publication History

Abstract

Reconstructing genomes of organisms from high-throughput sequencing experiments without a reference genome available (de novo assembly) is a challenging problem which has been approached in several ways in the past decade. Although numerous methods are available and many offer fair performance in reconstruction, there is a lack of generalized template libraries and interchangeable data structures/methods for serial, multithreaded and distributed processing. In this work we propose a novel set of cache oblivious generic data structures for serial, multithreaded and distributed processing of high-throughput sequencing data for the creation of de Bruijn or k-mer graphs towards their usage in de novo assembly and related HTS data analytics problems.

References

[1]
Altman, R.B., Prabhu, S., Sidow, A., Zook, J.M., Goldfeder, R., Litwack, D., Ashley, E., Asimenos, G., Bustamante, C.D., Donigan, K., et al. A research roadmap for next-generation sequencing informatics. Sci Transl Med, 8 (335). 335ps310.
[2]
Spjuth, O., Bongcam-Rudloff, E., Dahlberg, J., Dahlo, M., Kallio, A., Pireddu, L., Vezzi, F. and Korpelainen, E. Recommendations on e-infrastructures for next-generation sequencing. Gigascience, 5. 26.
[3]
Milicchio, F., Rose, R., Bian, J., Min, J. and Prosperi, M. Visual programming for next-generation sequencing data analytics. BioData Min, 9. 16.
[4]
Langmead, B. and Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat Methods, 9 (4). 357--359.
[5]
Hauswedell, H., Singer, J. and Reinert, K. Lambda: the local aligner for massive biological data. Bioinformatics, 30 (17). i349--355.
[6]
Miller, J.R., Koren, S. and Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics, 95 (6). 315--327.
[7]
Li, Y., Hu, Y., Bolund, L. and Wang, J. State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum Genomics, 4 (4). 271--277.
[8]
Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res, 20 (2). 265--272.
[9]
Earl, D., Bradnam, K., St John, J., Darling, A., Lin, D., Fass, J., Yu, H.O., Buffalo, V., Zerbino, D.R., Diekhans, M., et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res, 21 (12). 2224--2241.
[10]
Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29 (8). 1072--1075.
[11]
Bradnam, K.R., Fass, J.N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J.A., Chapuis, G., Chikhi, R., et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience, 2 (1). 10.
[12]
Chaisson, M.J. and Pevzner, P.A. Short read fragment assembly of bacterial genomes. Genome Res, 18 (2). 324--330.
[13]
Li, D., Liu, C.M., Luo, R., Sadakane, K. and Lam, T.W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31 (10). 1674--1676.
[14]
Li, D., Luo, R., Liu, C.M., Leung, C.M., Ting, H.F., Sadakane, K., Yamashita, H. and Lam, T.W. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods, 102. 3--11.
[15]
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1 (1). 18.
[16]
Conway, T.C. and Bromage, A.J. Succinct data structures for assembling large genomes. Bioinformatics, 27 (4). 479--486.
[17]
Li, H. Fast construction of FM-index for long sequence reads. Bioinformatics, 30 (22). 3274--3275.
[18]
Chapman, J.A., Ho, I., Sunkara, S., Luo, S., Schroth, G.P. and Rokhsar, D.S. Meraculous: de novo genome assembly with short paired-end reads. PLoS One, 6 (8). e23501.
[19]
Stroustrup, B. The C++ Programming Language. Addison-Wesley Professional, 2013.
[20]
Pataki, N., Porkolab, Z. Extension of iterator traits in the C++ Standard Template Library. in Computer Science and Information Systems (FedCSIS), 2011 Federated Conference on, (2011), 911--914.
[21]
Vandevoorde, D. and Josuttis, N.M. C++ templates: the complete guide. Addison-Wesley, Boston, MA, 2003.
[22]
Langtangen, H.P., Bruaset, A.M. and Quak, E. Advances in software tools for scientific computing. Springer, Berlin; New York, 2000.
[23]
Veldhuizen, T.L. Arrays in Blitz plus. Computing in Object-Oriented Parallel Environments, 1505. 223--230.
[24]
Pflaum, C. Expression templates for partial differential equations. Computing and Visualization in Science, 4 (1). 1--8.
[25]
Klaeren, H., Pulvermueller, E., Rashid, A. and Speck, A. Aspect Composition Applying the Design by Contract Principle Proceedings of the Second International Symposium on Generative and Component-Based Software Engineering-Revised Papers, Springer-Verlag, 2001, 57--69.
[26]
Kamp, P.H. You're Doing It Wrong. Communications of the Acm, 53 (7). 55--59.
[27]
Frigo, M., Leiserson, C.E., Prokop, H. and Ramachandran, S. Cache-Oblivious Algorithms. Acm Transactions on Algorithms, 8 (1).
[28]
Comer, D. Ubiquitous B-Tree. Computing Surveys, 11 (2). 121--137.
[29]
Stevens, A. C-Programming - the B-Tree Again. Dr Dobbs Journal, 15 (12). 121.
[30]
Vanemdeboas, P. Preserving Order in a Forest in Less Than Logarithmic Time and Linear-Space. Information Processing Letters, 6 (3). 80--82.
[31]
Vanemdeboas, P., Kaas, R. and Zijlstra, E. Design and Implementation of an Efficient Priority Queue. Mathematical Systems Theory, 10 (2). 99--127.
[32]
Tsakalidis, A.K. Maintaining Order in a Generalized Linked List. Acta Informatica, 21 (1). 101--112.
[33]
Dietz, P.F. Maintaining order in a linked list Proceedings of the fourteenth annual ACM symposium on Theory of computing, ACM, San Francisco, California, USA, 1982, 122--127.
[34]
Bender, M.A., Demaine, E.D. and Farach-Colton, M. Cache-oblivious B-trees. Siam Journal on Computing, 35 (2). 341--358.
[35]
Brodal, G.S. and Fagerberg, R. Cache oblivious distribution sweeping. Automata, Languages and Programming, 2380. 426--438.
[36]
Bender, M.A., Fineman, J.T., Gilbert, S. and Kuszmaul, B.C. Concurrent cache-oblivious b-trees Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, ACM, Las Vegas, Nevada, USA, 2005, 228--237.
[37]
Doring, A., Weese, D., Rausch, T. and Reinert, K. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics, 9. 11.

Cited By

View all
  • (2021)Experimental Survey on Power Dissipation of k-mer-Handling Data Structures for Mobile Bioinformatics2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)10.1109/BIBM52615.2021.9669768(3201-3206)Online publication date: 9-Dec-2021
  • (2018)Third-generation sequencing data analytics on mobile devices: cache oblivious and out-of-core approaches as a proof-of-conceptProcedia Computer Science10.1016/j.procs.2018.07.164134(219-226)Online publication date: 2018

Recommendations

Reviews

David E. Goldfarb

This short paper presents a C++ library, in a generic programming style, for more efficient de novo genome assembly. The basic idea is to use a cache oblivious data structure. The authors do not present any concrete performance improvement numbers. Their focus, rather, is on creating a basis or framework for future experimentation. The paper does not carry the work forward to a level that would be completely and immediately useful, nor do the authors make any such claims. Rather, the authors are offering the current source code at beta level. They stress that the code does not yet handle parallel access or concurrent lock-free operations. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
BCB '16: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
October 2016
675 pages
ISBN:9781450342254
DOI:10.1145/2975167
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 October 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. High-throughput sequencing
  2. bioinformatics
  3. cache oblivious
  4. de Bruijn graph
  5. de novo assembly
  6. generic programming
  7. genome assembly
  8. parallel programming

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

BCB '16
Sponsor:

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Experimental Survey on Power Dissipation of k-mer-Handling Data Structures for Mobile Bioinformatics2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)10.1109/BIBM52615.2021.9669768(3201-3206)Online publication date: 9-Dec-2021
  • (2018)Third-generation sequencing data analytics on mobile devices: cache oblivious and out-of-core approaches as a proof-of-conceptProcedia Computer Science10.1016/j.procs.2018.07.164134(219-226)Online publication date: 2018

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media