Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3107411.3107444acmconferencesArticle/Chapter ViewAbstractPublication PagesbcbConference Proceedingsconference-collections
research-article
Public Access

Fleximer: Accurate Quantification of RNA-Seq via Variable-Length k-mers

Published: 20 August 2017 Publication History

Abstract

The advent of RNA-Seq has made it possible to quantify transcript expression on a large scale simultaneously. This technology generates small fragments of each transcript sequence, known as sequencing reads. As the first step of data analysis towards expression quantification, most of the existing methods align these reads to a reference genome or transcriptome to establish their origins. However, read alignment is computationally costly. Recently, a series of methods have been proposed to perform a lightweight quantification analysis in an alignment-free manner. These methods utilize the notion of k-mers, which are short consecutive sequences representing the signatures of each transcript, to estimate the relative abundance from RNA-Seq reads. Current k-mer based approaches make use of a set of fixed size k-mers; however, the true signatures of each transcript may not exist in a fixed size. In this paper, we demonstrate the importance of k-mers selection in transcript abundance estimation. We propose a novel method, Fleximer, to efficiently discover and select an optimal set of k-mers with flexible lengths. Using both simulated and real datasets, we show that, with fewer k-mers, Fleximer is able to cover the similar amount of reads as Sailfish and Kallisto. The selected k-mers own more distinguishing features, and thus substantially reduce the errors in transcript abundance estimation.

References

[1]
Alfred V. Aho and Margaret J. Corasick. 1975. Efficient string matching: an aid to bibliographic search. Commun. ACM 18, 6 (jun 1975), 333--340.
[2]
Kin Fai Au, Hui Jiang, Lan Lin, Yi Xing, and Wing Hung Wong. 2010. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic acids research 38, 14 (aug 2010), 4570--8.
[3]
Paul Bieganski, John Riedl, John V. Carlis, and Ernest F. Retzel. 1994. Generalized suix trees for biological sequence data: applications and implementation. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences HICSS-94, Vol. 5. IEEE Comput. Soc. Press, 35--44.
[4]
Nicolas L Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. Near-optimal probabilistic RNA-seq quantification. Nature biotechnology 34, 5 (may 2016), 525--527. arXiv:1505.02710
[5]
Fiona Cunningham, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, et al. 2015. Ensemble 2015. Nucleic Acids Research 43, D1 (jan 2015), D662--D669.
[6]
Martin Farach. 1997. Optimal Suix Tree Construction with Large Alphabets. 38th IEEE Symp. on Foundations of Computer Science (1997), 137--143.
[7]
Alyssa C Frazee, Andrew E Jafe, Ben Langmead, and Jefrey T Leek. 2015. Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics 31, 17 (jun 2015), 2778--84.
[8]
Manuel Garber, Manfred G Grabherr, Mitchell Guttman, and Cole Trapnell. 2011. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods 8, 6 (jun 2011), 469--77.
[9]
Stefen Heber, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, and Pavel A Pevzner. 2002. Splicing graphs and EST assembly problem. Bioinformatics 18 Suppl 1, 1 (2002), S181--S188.
[10]
Michael Höhl, Stefan Kurtz, and Enno Ohlebusch. 2002. Efficient multiple genome alignment. Bioinformatics 18, Suppl 1 (jul 2002), S312--S320.
[11]
Lars Kaderali and Alexander Schliep. 2002. Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics 18, 10 (oct 2002), 1340--1349.
[12]
Richard M. Karp and Michael O. Rabin. 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31, 2 (mar 1987), 249--260.
[13]
Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, and Steven L Salzberg. 2013. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology 14, 4 (apr 2013), R36.
[14]
Hinanit Koltai and Carmiya Weingarten-Baror. 2008. Speciicity of DNA microarray hybridization: characterization, effectors and approaches for data correction. Nucleic acids research 36, 7 (apr 2008), 2395--405.
[15]
Stefan Kurtz and Chris Schleiermacher. 1999. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15, 5 (may 1999), 426--427.
[16]
Edward M. McCreight. 1976. A Space-Economical Suix Tree Construction Algorithm. J. ACM 23, 12 (1976), 262--272. http://libeccio.di.unisa.it/TdP/suix.pdf
[17]
J. Ian Munro and Venkatesh Raman. 2001. Succinct Representation of Balanced Parentheses and Static Trees. SIAM J. Comput. 31, 3 (jan 2001), 762--776.
[18]
Herve Pagès, Daniel Bindreither, Marc Carlson, and Martin Morgan. 2013. SplicingGraphs: Create, manipulate, visualize splicing graphs, and assign RNA-seq reads to them. R package version 1.14.0 (2013).
[19]
Rob Patro, Stephen M Mount, and Carl Kingsford. 2014. Sailish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature biotechnology 32, 5 (may 2014), 462--4.
[20]
Robert Petryszak, Tony Burdett, Benedetto Fiorelli, Nuno A Fonseca, Mar Gonzalez-Porta, Emma Hastings, Wolfgang Huber, Simon Jupp, Maria Keays, Nataliya Kryvych, et al. 2016. Expression Atlas update - An integrated database of gene and protein expression in humans, animals and plants. Nucleic Acids Research 44, D1 (jan 2016), D746--D752.
[21]
Kunihiko Sadakane. 2007. Compressed Suix Trees with Full Functionality. Theory of Computing Systems 41, 4 (2007), 589--607. http://web.stanford.edu/
[22]
Michael Sammeth. 2009. Complete Alternative Splicing Events Are Bubbles in Splicing Graphs. Journal of computational biology 16, 8 (2009), 1117--1140.
[23]
Avi Srivastava, Hirak Sarkar, Nitish Gupta, and Rob Patro. 2016. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics 32, 12 (oct 2016), i192--i200.
[24]
Esko Ukkonen. 1995. On-line construction of suix trees. Algorithmica 14, 3 (sep 1995), 249--260.
[25]
Niko Välimäki, Wolfgang Gerlach, Kashyap Dixit, and Veli Mäkinen. 2007. Compressed suix tree Ða basis for genome-scale sequence analysis. Bioinformatics 23, 5 (mar 2007), 629--30.
[26]
Kai Wang, Darshan Singh, Zheng Zeng, Stephen J Coleman, Yan Huang, Gleb L Savich, Xiaping He, Piotr Mieczkowski, Sara a Grimm, Charles M Perou, James N MacLeod, Derek Y Chiang, Jan F Prins, and Jinze Liu. 2010. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research 38, 18 (oct 2010), e178.
[27]
Brian T Wilhelm and Josette-Renée Landry. 2009. RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods 48, 3 (jul 2009), 249--57.
[28]
Zhaojun Zhang and Wei Wang. 2014. RNA-Skim: a rapid method for RNA-Seq quantification at transcript level. Bioinformatics 30, 12 (jun 2014), i283--i292.

Cited By

View all
  • (2022)TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hashMedical Review10.1515/mr-2021-00161:2(114-125)Online publication date: 14-Feb-2022
  • (2019)Alevin efficiently estimates accurate gene abundances from dscRNA-seq dataGenome Biology10.1186/s13059-019-1670-y20:1Online publication date: 27-Mar-2019
  • (2019)RNA Sequencing Data: Hitchhiker's Guide to Expression AnalysisAnnual Review of Biomedical Data Science10.1146/annurev-biodatasci-072018-0212552:1(139-173)Online publication date: 20-Jul-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics
August 2017
800 pages
ISBN:9781450347228
DOI:10.1145/3107411
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 August 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Aho-Corasick algorithm
  2. EM algorithm
  3. RNA-seq
  4. suffix tree
  5. transcript quantification

Qualifiers

  • Research-article

Funding Sources

Conference

BCB '17
Sponsor:

Acceptance Rates

ACM-BCB '17 Paper Acceptance Rate 42 of 132 submissions, 32%;
Overall Acceptance Rate 254 of 885 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)88
  • Downloads (Last 6 weeks)15
Reflects downloads up to 22 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2022)TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hashMedical Review10.1515/mr-2021-00161:2(114-125)Online publication date: 14-Feb-2022
  • (2019)Alevin efficiently estimates accurate gene abundances from dscRNA-seq dataGenome Biology10.1186/s13059-019-1670-y20:1Online publication date: 27-Mar-2019
  • (2019)RNA Sequencing Data: Hitchhiker's Guide to Expression AnalysisAnnual Review of Biomedical Data Science10.1146/annurev-biodatasci-072018-0212552:1(139-173)Online publication date: 20-Jul-2019
  • (2018)Towards Selective-AlignmentProceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics10.1145/3233547.3233589(27-36)Online publication date: 15-Aug-2018

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media