Abstract
Motivation
Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data.Results
Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user.Availability and implementation
Polyester is freely available from Bioconductor (http://bioconductor.org/).Contact
jtleek@gmail.comSupplementary information
Supplementary data are available at Bioinformatics online.Free full text
Polyester: simulating RNA-seq datasets with differential transcript expression
Abstract
Motivation: Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data.
Results: Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user.
Availability and implementation: Polyester is freely available from Bioconductor (http://bioconductor.org/).
Contact: moc.liamg@keeltj
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
RNA sequencing (RNA-seq) experiments have become increasingly popular as a means to study gene expression. There are a range of statistical methods for differential expression analysis of RNA-seq data (Oshlack et al., 2010). The developers of statistical methodology for RNA-seq need to test whether their tools are performing correctly. Often, accuracy tests cannot be performed on real datasets because true gene expression levels and expression differences between populations are usually unknown, and spike-in experiments are costly in terms of both time and money.
Instead, researchers often use computational simulations to create datasets with a known signal and noise structure. Typically, simulated expression measurements used to evaluate differential expression tools are generated as gene counts from a statistical model like those used in common differential expression tools (Anders and Huber, 2010; Robinson et al., 2010). But these simulated scenarios do not account for variability in expression measurements that arises during upstream steps in RNA-seq data analysis, such as read alignment or read counting. Polyester is a new R package for simulating RNA-seq reads. Polyester ’s main advantage is that users can simulate sequencing reads with specified differential expression signal for either genes or isoforms. This allows users to investigate sources of variability at multiple points in RNA-seq pipelines.
Existing RNA-seq simulators that generate sequencing reads are not designed for simulating experiments with biological replicates and specified differential expression signal. For example, the rsem-simulate-reads utility shipped with RSEM (Li and Dewey, 2011) requires a time-consuming first step of aligning real sequencing reads to develop a sequencing model before reads can be simulated, and differential expression simulation is not built-in. Neither FluxSimulator (Griebel et al., 2012) nor Benchmarker for Evaluating the Effectiveness of RNA-Seq (BEERS) (Grant et al., 2011) have a built-in mechanism for inducing differential expression. These simulators also do not provide methods for defining a model for biological variability across replicates or specifying the exact expression level of specific transcripts. TuxSim has been used to simulate RNA-seq datasets with differential expression (Trapnell et al., 2013), but it is not publicly available.
Polyester was created to fulfill the need for a tool to simulate RNA-seq reads for an experiment with replicates and well-defined differential expression. Users can easily simulate small experiments from a few genes or a single chromosome. This can reduce computational time in simulation studies when computationally intensive steps such as read alignment must be performed as part of the simulation. Polyester is open-source, cross-platform and freely available for download from Bioconductor at http://www.bioconductor.org/packages/release/bioc/html/polyester.html.
2 Methods
2.1 Input
Polyester takes annotated transcript nucleotide sequences as input. These can be provided as cDNA sequences in FASTA format, labeled by transcript. Alternatively, users can simulate from a GTF file denoting exon, transcript and gene structure paired with full-chromosome DNA sequences. The flexibility of this input makes it possible to design small, manageable simulations by simply passing Polyester a FASTA or GTF file consisting of feature sets of different sizes. Efficient functions for reading, subsetting and writing FASTA files are available in the Biostrings package (Pages et al., 2014), which is a dependency of Polyester.
2.2 RNA-seq data as basis for model parameters
Several components of Polyester, described later in this section, require parameters estimated from RNA-seq data. To get these parameter estimates, we analyzed RNA-seq reads from seven biological replicates in the public Genetic European Variation in health and Disease (GEUVADIS) RNA-seq dataset (AC’t Hoen et al., 2013; Lappalainen et al., 2013). The seven replicates were chosen by randomly selecting one replicate from each of the seven laboratories that sequenced samples in the study. These replicates represented seven people from three different HapMap populations: CEU (Utah residents with Northern and Western European ancestry), TSI (Tuscani living in Italy) and YRI (Yoruba living in Ibadan, Nigeria). Data from the GEUVADIS study is available from the ArrayExpress database (www.ebi.ac.uk/arrayexpress) under accession numbers E-GEUV-1 through E-GEUV-6. We specifically used TopHat read alignments for these seven replicates, under accession number E-GEUV-6. The reads were 75bp, paired-end reads.
Also available for the GEUVADIS dataset is a fully processed transcriptome assembly, created based on the RNA-seq reads from all 667 replicates in the GEUVADIS study without using a reference transcriptome. This assembly was built using Cufflinks and processed with the Ballgown R package (Frazee et al., 2015), and it is available for direct download as an R object (http://dx.doi.org/10.6084/m9.figshare.1130849).
2.3 Expression models
A key feature of Polyester is that the analyst has full control over the number of reads that are generated from each transcript in the input file, for each replicate in the experiment. Polyester ships with a built-in model for these read numbers, or the model can be explicitly specified by the end user.
2.3.1 Built-in negative binomial read count model
The built-in transcript read count model assumes that the number of reads to simulate from each transcript is drawn from the negative binomial distribution, across biological replicates. The negative binomial model for read counts has been shown to satisfactorily capture biological and technical variability (Anders and Huber, 2010; Robinson et al., 2010). In Polyester, differential expression between experimental groups is defined by a multiplicative change in the mean of the negative binomial distribution generating the read counts.
Specifically, define Yijk as the number of reads simulated from replicate i, experimental condition j and transcript k (i = 1, …, nj; j = 1, …, J; and k = 1, …, N; where nj is the number of replicates in condition j, J is the total number of conditions, and N is the total number of transcripts provided). The built-in model in Polyester assumes:
In this negative binomial parameterization, E(Yijk) = μjk and Var
When J=2, differential expression is set by providing a fold change λ between the two conditions for each transcript. Initially, a baseline mean μk is provided for each transcript, and μ1k and μ2k are set to μk. Then, if fold change λ is provided, μ1k and μ2k are adjusted: if λ > 1, μ1k = λμk, and if
2.3.2 Options for adjusting read counts
Users can optionally provide multiplicative library size factors for each replicate in their experiment, since the total number of reads (sequencing depth) is usually unequal across replicates in RNA-seq experiments (Mortazavi et al., 2008). All counts for a replicate will be multiplied by the library size factor.
GC (guanine–cytosine) content is known to affect expression measurements for genomic features, and the effect varies from sample to sample (Benjamini and Speed, 2012; Hansen et al., 2012; Risso et al., 2011). Polyester includes an option to model this GC bias in the simulated reads: for each biological replicate in the simulated dataset, the user can choose one of seven built-in GC content bias models, where one model was estimated from each of the seven GEUVADIS replicates described in Section 2.2. We calculated these models using all transcripts from the available GEUVADIS transcriptome assembly (also described in Section 2.2).
For each replicate, we first calculated transcript-level read counts based on transcript length, sequencing depth, and the observed Fragments Per Kilobase of exon model per Million mapped reads (FPKM) for the transcript. By definition of FPKM, read counts can be directly calculated using these inputs. We then centered the transcript counts around the overall mean transcript count, and modeled the centered counts as a smooth function of the transcript GC content using a loess smoother with span 0.3, analgous to smoothers previously used for modeling GC content (Benjamini and Speed, 2012).
Transcript GC content was calculated as the percentage of the annotated hg19 nucleotides falling in the boundaries of the assembled transcript that were G or C. The fitted loess curve defines a function that returns the average deviation from the overall mean transcript count for a transcript with a given GC content percentage. If there is no GC bias, the deviation would be 0. GC bias is added to replicates in Polyester after transcript-level counts have been specified by increasing or decreasing the count by the predicted deviation for that transcript’s GC content. The seven loess curves included in Polyester are shown in Supplementary Figure S1. Users can also provide loess models from their own data as GC bias models if desired.
2.3.3 User-defined count models
As an alternative to the built-in negative binomial model, Polyester allows users to individually specify the number of reads to generate from each transcript, for each sample. This gives researchers the flexibility to design their own models for biological and technical variability, simulate complex experimental designs, such as timecourse experiments, and explore the effects of a wide variety of experimental parameters on differential expression results. This transcript-by-sample read count matrix can be created within R and input directly into Polyester’s read simulation function. This level of flexibility is not available with Flux Simulator or BEERS, which only allow specification of the total number of reads per replicate. While it is possible to write custom command-line scripts that induce differential expression using these simulators, differential expression models are built in to Polyester. This approach offers both a built-in model for convenience and an integrated way to define a custom model for flexibility.
2.4 The RNA sequencing process
2.4.1 Fragmentation
After the transcripts have been specified and each transcript’s abundance in the simulated experiment has been determined by an assigned read count for each replicate, Polyester simulates the RNA sequencing process, described in detail in (Oshlack et al., 2010), beginning at the fragmentation step. All transcripts present in the experiment are broken into short fragments. There are two options for how fragment lengths are chosen: lengths can be drawn from a normal distribution with mean μfl and standard deviation σfl. By default, μfl = 250 nucleotides and σfl = 25, but these parameters can be changed. Alternatively, fragment lengths can be drawn from an empirical length distribution included with the Polyester R package. This empirical distribution (Fig. 1) was estimated from the insert sizes of the paired-end read alignments of the seven GEUVADIS replicates described in Section 2.2, using Picard’s CollectInsertSizeMetrics tool (http://broadinstitute.github.io/picard/). The empirical density was estimated using the logspline function in R (Kooperberg and Stone, 1992; Kooperberg, 2013). Users can also supply their own fragment length distribution in logspline format. This distribution may be estimated from a user’s dataset or varied to measure the effect of fragment length distribution on downstream results.
Ideally, the fragments generated from a transcript present in the sequencing sample would be uniformly distributed across the transcript. However, coverage across a transcript has been shown to be nonuniform (Lahens et al., 2014; Li and Jiang, 2012; Mortazavi et al., 2008). In Polyester, users can choose to generate fragments uniformly from transcripts, or they can select one of two possible positional bias models. These models were derived by Li and Jiang (2012), and they were based on two different fragmentation protocols.
The first model is based on a cDNA fragmentation protocol, and reads are more likely to come from the 3′ end of the transcript being sequenced. The second model incorporates bias caused by a protocol relying on RNA fragmentation, where the middle of each transcript is more likely to be sequenced. Both these models were estimated from Illumina data. Since the exact data from Li and Jiang (2012) was not made available with the manuscript, we extracted the data from Supplementary Figure S3 of Li and Jiang (2012) ourselves, using WebPlotDigitizer (Rohatgi, 2014), which can estimate the coordinates of data points on a scatterplot given only an image of that scatterplot. For reference, the figure is reproduced here (Supplementary Fig. S2), created using the probabilities included as datasets (cdnaf.rda and rnaf.rda) in the Polyester R package.
2.4.2 Sequencing
Polyester simulates unstranded RNA-seq reads in a manner compatible with the Illumina paired-end protocol (Sengupta et al., 2011). In this protocol, read sequences are read off of double-stranded cDNA created from mRNA fragments, separated from other types of RNA using poly-A selection. To mimic this process in Polyester, each fragment selected from an original input transcript is reverse-complemented with probability 0.5: this means the read (for single-end experiments) or mate 1 of the read (for paired-end experiments) is equally likely to have originated from the transcript sequence itself and from the cDNA strand matched to the transcript fragment during sequencing.
Reads are then generated based on these fragments. A single-end read consists of the first R nucleotides of the fragment. For paired-end reads, these first R nucleotides become mate 1, and the last R nucleotides are read off and reverse-complemented to become mate 2. The reverse complementing happens because if mate 1 came from the actual transcript, mate 2 will be read from the complementary cDNA, and if mate 1 came from the complementary cDNA, mate 2 will come from the transcript itself (http://res.illumina.com/documents/products/datasheets/datasheet_truseq_sample_prep_kits.pdf). By default, R=100 and can be adjusted by the user.
Users can choose from a variety of sequencing error models. The simplest one is a uniform error model, where each nucleotide in a read has the same probability pe of being sequenced incorrectly, and every possible sequencing error is equally likely (e.g. if there is an error at a nucleotide which was supposed to be a T, the incorrect base is equally likely to be a G, C, A or N). In the uniform error model, pe = 0.005 by default and can be adjusted.
Several empirical error models are also available in Polyester. These models are based on two dataset-specific models that ship with the GemSim software (McElroy et al., 2012). Separate models are available for a single-end read, mate 1 of a pair and mate 2 of a pair, from two different sequencing protocols: Illumina Sequencing Kit v4 and TruSeq SBS Kit v5-GA (both from data sequenced on an Illumina Genome Analyzer IIx). These empirical error models include estimated probabilities of making each of the four possible sequencing errors at each position in the read. In general, empirical error probabilities increase toward the end of the read, and mate 2 has higher error probabilities than mate 1 of a pair, and the TruSeq SBS Kit v5-GA error probabilities were lower than the Illumina Sequencing Kit v4 error probabilities (Fig. 2; Supplementary Figs S3–S7).
Polyester can also handle custom error models: users can estimate an error model from their own sequencing data with the GemErr utility in GemSim. Detailed instructions on how to do this in a way compatible with Polyester are available in the package vignette.
After generating sequencing reads and simulating sequencing error, reads are written to disk in FASTA format. The read identifier in the FASTA files specifies the transcript of origin for each read, facilitating assessment of downstream alignment accuracy. Other pertinent simulation information is also automatically written to disk for use in downstream analysis: for each transcript, the transcript name, differential expression status and fold change is recorded. For each replicate, the file name, group identifier j and library size factor is recorded.
3 Results
3.1 Comparison with real data
To show that reads generated with Polyester exhibit realistic properties, we performed a small simulation experiment based on data from the seven GEUVADIS RNA-seq replicates described in Section 2.2. For the experiment, we randomly selected 10 annotated genes with at least one highly expressed isoform. We relied on the data-driven Cufflinks assembly to determine isoform expression: an annotated gene was considered to have highly expressed isoforms if at least one of its annotated isoforms overlapped an assembled transcript with an average per-base coverage of at least 20 reads.
The 10 genes that were randomly selected had 15 transcripts between them: two had 3 isoforms, one had 2 isoforms, and the rest had 1 isoform. For the 10 genes, we counted the number of reads overlapping them using the summarizeOverlaps function in the Bioconductor package GenomicAlignments (Lawrence et al., 2013). Counts were calculated from the TopHat-aligned reads from the GEUVADIS study for the seven replicates described in Section 2.2. We then separated gene counts into isoform-level counts: we calculated per-isoform FPKM values for each of the 15 annotated transcripts using Cufflinks (Trapnell et al., 2010) in its abundance-estimation only mode, and used the FPKM ratio between isoforms of the same gene to generate isoform-level counts to simulate based on the gene counts we had already obtained.
We then used these isoform-level counts as input to Polyester, simulating a seven-replicate experiment with the specified number of reads being generated from each of the 15 selected annotated transcripts. Two experiments were simulated: one with all default options (no GC or positional bias, normal fragment length distribution with mean 250 and standard deviation 25, and uniform error model with 0.5% error probability) and one with all default options except for the positional bias model, for which we specified the rnaf bias model (Supplementary Fig. S2, red line).
The simulated reads were aligned to the hg19 genome with TopHat, and the coverage track for each experiment, for each simulated replicate was compared to the coverage track from the GEUVADIS replicate that generated the simulated replicate’s read count. For most of the transcripts, coverage tracks for both experiments looked reasonably similar to the observed coverage track in the GEUVADIS dataset (see Fig. 3 for a representative example).
The simulated coverage tracks were smoother than the coverage track from the GEUVADIS dataset, but major trends in the coverage patterns within exons were captured by the simulated reads. There are annotated transcripts for which reads generated by Polyester do not adequately capture the observed coverage in the GEUVADIS dataset (Supplementary Fig. S8), especially when positional bias is added. This seems to mainly occur in cases where only a very small part of a large exon appears to be expressed in the dataset (as is the case in Supplementary Fig. S8). The coverage for most of the other transcripts was similar to the real data for most genes and replicates (http://dx.doi.org/10.6084/m9.figshare.1225636). Reads simulated with rnaf bias sometimes had poor coverage for genes consisting of transcripts with many small exons.
For these 15 simulated transcripts, FPKM estimates were positively correlated between each simulated dataset and the GEUVADIS dataset for each replicate. To get data for this comparison, we used Cufflinks’s abundance-estimation only mode to get expression estimates for the 15 isoforms based on the simulated reads’ alignments, in the same way we calculated expression for the GEUVADIS replicates. We calculated correlation between FPKM estimates of the 15 transcripts for the GEUVADIS dataset and for each of the simulated datasets, using correlation instead of absolute FPKM because normalization for number of mapped reads put the sets of FPKMs on different scales.
For the simulation without positional bias, the correlation was extremely high: the minimum correlation across the seven replicates studied was 0.98. However, the FPKM estimates were less correlated when RNA-fragmentation-related positional bias was induced: all correlations were positive, but weak (Supplementary Fig. S9). These results generally indicate that realistic coverage profiles can be obtained with Polyester but that adding positional bias may cause problems when transcripts have unusual structure. The correlation in FPKM estimates between the simulated datasets and the GEUVADIS samples suggests that Polyester captures transcript level variation in gene expression data.
3.2 Use case: assessing the accuracy of a differential expression method
To demonstrate a use case for polyester, we simulated two small differential expression experiments and attempted to discover the simulated differential expression using limma (Smyth, 2005).
The first experiment used the default size parameter in Polyester, which means the variance of the distribution from which each transcript’s count is drawn is equal to four times the mean of that distribution. In other words, the mean and variance of the transcript counts are linearly related. We refer to this experiment as ‘low variance.’ The second experiment set the size parameter to 1 for all transcripts, regardless of the mean count, which means each transcript’s mean and variance are quadratically related. This experiment was the ‘high variance’ experiment.
In both scenarios, the main wrapper function in Polyester was used to simulate classic two-group experiments. Reads were simulated from transcripts on human chromosome 22 (hg19 build, N=926). μk was set to length(transcriptk)/5, which corresponds to approximately 20x coverage for reads of length 100. We randomly chose 75 transcripts to have λ=3 and 75 to have λ = 1/3; the rest had λ=1. For nj=7 replicates in each group j, we simulated paired-end reads from 250-base fragments (σfl = 25), with a uniform error probability and the default error rate of 0.005. Simulated reads were aligned to hg19 with TopHat 2.0.13 (Trapnell et al., 2009), and Cufflinks 2.2.1 (Trapnell et al., 2010) was used to obtain expression estimates for the 926 transcripts from which transcripts were simulated. Expression was measured using FPKM (fragments per kilobase per million mapped reads). We then ran transcript-level differential expression tests using limma (Smyth, 2005). Specifically, for each transcript k, the following linear model was fit:
where FPKMk is the expression measurement for transcript k, Xj is 0 or 1 depending on which group sample j was assigned to, and Wj is a library-size adjustment, defined as the 75th percentile over all k of the log2(FPKMk+1) values for replicate j (Paulson et al., 2013). We fit these linear models for each transcript, and for each βk, we calculated moderated t-statistics and associated P values using the shrinkage methodology in limma’s eBayes function. We calculated ROC curves based on these P values and our knowledge of the true differential expression status of each transcript. Sensitivity and specificity of the limma differential expression analysis were high for the small-variance scenario, but were diminished in the large-variance scenario, as expected (Fig. 4).
Since expression fold changes can be explicitly specified in Polyester, we can also investigate whether those fold changes are preserved throughout this RNA-seq data analysis pipeline (Fig. 5). In general, the coefficient distributions for transcripts not specified to be differentially expressed were centered around zero, as expected, since models were fit on the log scale. The coefficient distributions should have been centered around log2(3)=1.58 for the overexpressed transcripts (expression level three times higher in the first group), and around log2(1/3)=–1.58 for the underexpressed transcripts (expression level three times higher in the second group). The overexpressed distributions had means 1.39 and 1.44 in the high- and low-variance scenarios, respectively, and the underexpressed distributions had means −1.57 and −1.60 in the high- and low-variance scenarios, respectively. Coefficient estimates were much more variable in the scenario with higher expression variance (Fig. 5). These numbers are similar to the the specified value of 1.58, indicating that the RNA-seq pipeline used to analyze these datasets satisfactorily captured the existence and magnitude of the differential expression set in the experiment simulated with Polyester.
3.3 Time and memory analysis
The computational time required by Polyester varies according to a number of parameters. Assuming that n is the number of reads per replicate and m is the number of replicates in the experiment simulated, Polyester is O(mn) in time (Supplementary Figs S10–11). In other words, the time required to run Polyester scales linearly with both the number of reads per replicate and the total number of replicates in the experiment. Adding positional bias to the experiment, as we did for the analysis in Section 3.1, makes the simulation very slightly slower (Supplementary Fig. S12). We note that using an empirical error model requires substantially more computational time (Supplementary Table S1), since the empirical error models assume different error probabilities along the read and therefore require iterating in R over the reads.
Let nmax denote the number of reads simulated from the replicate with the most reads in an experiment. Then Polyester is O(nmax) in memory use (Supplementary Fig. S13); Polyester is constant in memory use with respect to m, the total number of replicates in the experiment (Supplementary Fig. S14). In other words, the memory required by Polyester scales linearly with the maximum number of reads per replicate in the experiment, but is approximately constant as the number of replicates in the experiment increases. The code is written such that all reads for a single replicate are stored in memory, but replicates are simulated sequentially, which means previous replicates’ reads are released from memory after being written to disk.
If the desired number of reads to simulate per replicate exceeds the available computer’s memory requirements, reads for the same replicate can be simulated sequentially using multiple calls to the main simulation function, and output files can be concatenated after the simulation has finished running.
4 Discussion
In this article, we propose a lightweight, flexible RNA-seq read simulator allowing users to set differential expression levels at the isoform level. A full experiment with biological replicates can be simulated with one command, and time-consuming alignment is not required beforehand.
The sequencing process is complex, and some subtleties and potential biases present in that process are not yet implemented in Polyester but could be in the future. For example, adding random hexamer priming bias (Hansen et al., 2010), implementing PCR amplification bias (Fang and Cui, 2011) or other biases that depend on the specific nucleotides being sequenced, simulating quality scores for base calls, and adding the ability to simulate indels are all possibilities for future improvements. Also, an annotation-based simulator like Polyester cannot model biological events that may occur in real datasets but are not annotated, such as unobserved transcripts or splicing errors. However, our comparisons with real data suggest that the Polyester model sufficiently mimicks real sequencing data to be practically useful.
5 Software
The release version of Polyester is available from Bioconductor: http://bioconductor.org/packages/release/bioc/html/polyester.html. Community contributions and bug reports can be made at https://github.com/alyssafrazee/polyester. Community contributions and bug reports are welcomed in the development version. Code for the analysis shown in this article is available at https://github.com/alyssafrazee/polyester_code.
Funding
J.L. and B.L. were supported by NIH R01 GM105705. A.F. was supported by a Hopkins Sommer Scholarship. A.J. was supported by the Lieber Institute for Brain Development.
Conflict of Interest: none declared.
References
- AC’t Hoen P., et al. (2013). Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol , 31, 1015–1022. [Abstract] [Google Scholar]
- Anders S., Huber W. (2010). Differential expression analysis for sequence count data. Genome Biol. , 11, R106. [Europe PMC free article] [Abstract] [Google Scholar]
- Benjamini Y., Speed T.P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. ,40, e72. [Europe PMC free article] [Abstract] [Google Scholar]
- Bullard J.H., et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics , 11, 94. [Europe PMC free article] [Abstract] [Google Scholar]
- Cai G., et al. (2012). Accuracy of RNA-seq and its dependence on sequencing depth. BMC Bioinformatics , 13(Suppl. 13), S5. [Europe PMC free article] [Abstract] [Google Scholar]
- Fang Z., Cui X. (2011). Design and validation issues in RNA-seq experiments. Briefings in Bioinformatics , 12, 280–287. [Abstract] [Google Scholar]
- Frazee A.C., et al. (2015). Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat. Biotechnol. , 33, 243246. [Europe PMC free article] [Abstract] [Google Scholar]
- Grant G.R., et al. (2011). Comparative analysis of RNA-seq alignment algorithms and the RNA-seq unified mapper (RUM). Bioinformatics , 27, 2518–2528. [Europe PMC free article] [Abstract] [Google Scholar]
- Griebel T., et al. (2012). Modelling and simulating generic RNA-seq experiments with the flux simulator. Nucleic Acids Res. , 40, 10073–10083. [Europe PMC free article] [Abstract] [Google Scholar]
- Hansen K.D., et al. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. , 38, e131. [Europe PMC free article] [Abstract] [Google Scholar]
- Hansen K.D., et al. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics , 13, 204–216. [Europe PMC free article] [Abstract] [Google Scholar]
- Ismail N., Jemain A.A. (2007). Handling overdispersion with negative binomial and generalized Poisson regression models. In: Casualty Actuarial Society Forum. Citeseer; pp.103–158. [Google Scholar]
- Kooperberg C. (2013). logspline: Logspline density estimation routines . R package version 2.1.5. [Google Scholar]
- Kooperberg C., Stone C.J. (1992). Logspline density estimation for censored data. J. Comput. Graph. Stat. , 1, 301–328. [Google Scholar]
- Lahens N.F., et al. . (2014). IVT-seq reveals extreme bias in RNA-sequencing. Genome Biol. , 15, R86. [Europe PMC free article] [Abstract] [Google Scholar]
- Lappalainen T., et al. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature , 501, 506–511. [Europe PMC free article] [Abstract] [Google Scholar]
- Lawless J.F. (1987). Negative binomial and mixed poisson regression. Can. J. Stat. , 15, 209–225. [Google Scholar]
- Lawrence M., et al. (2013). Software for computing and annotating genomic ranges. PLoS Comput. Biol. , 9, e1003118. [Europe PMC free article] [Abstract] [Google Scholar]
- Li B., Dewey C.N. (2011). RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics , 12, 323. [Europe PMC free article] [Abstract] [Google Scholar]
- Li W., Jiang T. (2012). Transcriptome assembly and isoform expression level estimation from biased RNA-seq reads. Bioinformatics , 28, 2914–2921. [Europe PMC free article] [Abstract] [Google Scholar]
- McElroy K.E., et al. (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics , 13, 74. [Europe PMC free article] [Abstract] [Google Scholar]
- Mortazavi A., et al. (2008). Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods , 5, 621–628. [Abstract] [Google Scholar]
- Oshlack A., et al. . (2010). From RNA-seq reads to differential expression results. Genome Biol. , 11, 220. [Europe PMC free article] [Abstract] [Google Scholar]
- Pages H., et al. (2014). Biostrings: String objects representing biological sequences, and matching algorithms. R package version at least 2.32.0. [Google Scholar]
- Paulson J.N., et al. (2013). Differential abundance analysis for microbial marker-gene surveys. Nature Methods , 10, 1200–1202. [Europe PMC free article] [Abstract] [Google Scholar]
- Risso D., et al. (2011). GC-content normalization for RNA-seq data. BMC Bioinformatics , 12, 480. [Europe PMC free article] [Abstract] [Google Scholar]
- Robinson M.D., et al. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics , 26, 139–140. [Europe PMC free article] [Abstract] [Google Scholar]
- Rohatgi A. (2014). WebPlotDigitizer: Version 3.4 of WebPlotDigitizer. [Google Scholar]
- Sengupta S., et al. (2011). Single read and paired end mRNA-seq Illumina libraries from 10 nanograms total RNA. J. Visual. Exp , 56, e3340. [Europe PMC free article] [Abstract] [Google Scholar]
- Smyth G.K. (2005). Limma: linear models for microarray data. In: Gentleman R., et al., editors. (eds) Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York, pp. 397–420. [Google Scholar]
- Trapnell C., et al. (2009). TopHat: discovering splice junctions with RNA-seq. Bioinformatics , 25, 1105–1111. [Europe PMC free article] [Abstract] [Google Scholar]
- Trapnell C., et al. (2010). Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. , 28, 511–515. [Europe PMC free article] [Abstract] [Google Scholar]
- Trapnell C., et al. (2013). Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnol. , 31, 46–53. [Europe PMC free article] [Abstract] [Google Scholar]
Articles from Bioinformatics are provided here courtesy of Oxford University Press
Full text links
Read article at publisher's site: https://doi.org/10.1093/bioinformatics/btv272
Read article for free, from open access legal sources, via Unpaywall: https://academic.oup.com/bioinformatics/article-pdf/31/17/2778/17085043/btv272.pdf
Citations & impact
Impact metrics
Citations of article over time
Alternative metrics
Smart citations by scite.ai
Explore citation contexts and check if this article has been
supported or disputed.
https://scite.ai/reports/10.1093/bioinformatics/btv272
Article citations
DNA-protein quasi-mapping for rapid differential gene expression analysis in non-model organisms.
BMC Bioinformatics, 25(suppl 2):335, 24 Oct 2024
Cited by: 0 articles | PMID: 39448913 | PMCID: PMC11515663
MATES: a deep learning-based model for locus-specific quantification of transposable elements in single cell.
Nat Commun, 15(1):8798, 11 Oct 2024
Cited by: 1 article | PMID: 39394211 | PMCID: PMC11470080
GeneSPIDER2: large scale GRN simulation and benchmarking with perturbed single-cell data.
NAR Genom Bioinform, 6(3):lqae121, 18 Sep 2024
Cited by: 0 articles | PMID: 39296931 | PMCID: PMC11409065
Stem cell transcriptional profiles from mouse subspecies reveal cis-regulatory evolution at translation genes.
Heredity (Edinb), 133(5):308-316, 20 Aug 2024
Cited by: 1 article | PMID: 39164520 | PMCID: PMC11527988
junctionCounts: comprehensive alternative splicing analysis and prediction of isoform-level impacts to the coding sequence.
NAR Genom Bioinform, 6(3):lqae093, 09 Aug 2024
Cited by: 0 articles | PMID: 39131822 | PMCID: PMC11310779
Go to all (153) article citations
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
BioStudies: supplemental material and supporting data
Data Citations
- (1 citation) DOI - 10.6084/m9.figshare.1130849
Functional Genomics Experiments (2)
- (2 citations) ArrayExpress - E-GEUV-6
- (1 citation) ArrayExpress - E-GEUV-1
IGSR: The International Genome Sample Resource
- (1 citation) IGSR/1000 Genomes - NA06985
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
SimSeq: a nonparametric approach to simulation of RNA-sequence datasets.
Bioinformatics, 31(13):2131-2140, 26 Feb 2015
Cited by: 38 articles | PMID: 25725090 | PMCID: PMC4481850
rSeqNP: a non-parametric approach for detecting differential expression and splicing from RNA-Seq data.
Bioinformatics, 31(13):2222-2224, 24 Feb 2015
Cited by: 7 articles | PMID: 25717189 | PMCID: PMC4481847
SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis.
BMC Bioinformatics, 17:66, 04 Feb 2016
Cited by: 23 articles | PMID: 26847232 | PMCID: PMC4743240
Differential Expression Analysis of RNA-seq Reads: Overview, Taxonomy, and Tools.
IEEE/ACM Trans Comput Biol Bioinform, 17(2):566-586, 01 Oct 2018
Cited by: 13 articles | PMID: 30281477
Review
Funding
Funders who supported this work.
NCATS NIH HHS (1)
Grant ID: UL1 TR001079
NIGMS NIH HHS (1)
Grant ID: R01 GM105705