WO2009091798A1

WO2009091798A1 - Quantitative genetic analysis

Info

Publication number: WO2009091798A1
Application number: PCT/US2009/030952
Authority: WO
Inventors: Doron Lipson; Tal Raz
Original assignee: Helicos Biosciences Corporation
Priority date: 2008-01-16
Filing date: 2009-01-14
Publication date: 2009-07-23

Abstract

The invention herein generally relates to quantitative genetic analysis and methods of obtaining a profile of a transcriptome of an organism.

Description

QUANTITATIVE GENETIC ANALYSIS

Related Applications

This application claims priority to U.S. provisional patent applications No. 61/021,465, filed January 16, 2008, No. 61/079,333, filed July 9, 2008, all of which are incorporated herein by reference.

Technical field

Background

Study of gene expression has become an important field for understanding cellular mechanisms and behavior. Large scale sequencing of cDNA clones, and differential comparisons of transcript abundance between samples have yielded extensive knowledge of genetic expression across tissue types and organisms.

Current technologies, such as microarray technology, have provided accurate differential expression profiling at relatively low cost (Lockhart et al., Nature 405:827-836, 2000; and Churchill, Nat Genet 32(Suppl):490-495, 2002). However, these technologies have limited sensitivity, i.e., are effective only in monitoring expression levels of medium and highly expressed known transcripts. Currently, technologies such as microarray technology lack the ability to accurately measure low-expressing genes. In addition, microarray technology does not have the capability of novel transcript discovery, and therefore it is now believed that a significant portion of a spectrum of cellular transcription is yet unknown (Birney et al., Nature 447:799-816, 2007; and Wilhelm et al., Nature, 2008).

There is a need for methods of quantitative genetic analysis that accurately measure low-expressing genes, and that can detect novel transcripts.

Summary

The invention provides methods for accurate quantification of cellular mRNA transcripts based on a single-molecule sequencing platform. Single-molecule sequencing Digital Gene Expression (smsDGE) utilizes a simple sample preparation procedure that generates accurate transcript counts representing the profile of a complete or partial transcriptome. An aspect of the invention provides methods for obtaining a profile of a transcriptome of an organism. For example, the method involves sequencing one or more cDNA molecules to obtain sequencing reads. Sequencing of the cDNA is accomplished by sequencing by synthesis, and preferably single molecule techniques. Methods further involve aligning each of the sequencing reads to a reference sequence, and separating sequencing reads into a pool of aligned sequencing reads and a pool of unaligned sequencing reads. Separating sequencing reads can include assigning a probability of accurate alignment for each read, such that reads above a defined alignment score threshold are placed in a pool of aligned sequencing reads, and reads below the defined alignment score threshold are placed in a pool of unaligned sequencing reads. These methods also involve deriving a transcript distribution in the pool of aligned sequencing reads, thereby obtaining the profile of the transcriptome of the organism. Deriving the distribution of reads can include calculating a distribution of transcripts based upon a distribution of reads assigned to the reference sequence.

Methods can further include, employing a sequence clustering strategy to identify novel transcripts within a pool of unaligned sequencing reads. In addition, methods can further include, prior to sequencing, obtaining one or more mRNA molecules from an organism, generating one or more single-strand cDNA molecules, and hybridizing the cDNA molecules to oligonucleotides attached to a surface. In certain embodiments, methods further include, prior to aligning, filtering sequence reads by length and sequence context.

Another aspect of the invention provides methods for quantitative analysis of genome transcripts, including obtaining nucleic acid from a biological sample; sequencing one or more portion(s) of nucleic acid obtained from the biological sample; aligning sequences with a reference sequence; assigning a probability to each alignment; and determining a quantitative distribution of the sequences.

Another aspect of the invention provides methods for transcriptional profiling, the method including: obtaining sequence reads from one or more gene transcripts; assigning the reads to a reference sequence for alignment based upon a probability of an accurate match between the reads and the reference sequence; and calculating a distribution of transcripts based upon the distribution of reads assigned to the reference sequence.

Another aspect of the invention provides methods for reducing errors in quantitative transcriptional analysis, the method including: sequencing one or more regions of a transcriptome; aligning sequences from the one or more regions to a reference sequence; assigning a probability of accurate alignment for each of the sequences; and discarding aligned sequences that do not meet a predetermined probability threshold.

Brief description of the drawings

Fig. 1 Panel A is a set of drawings showing sample preparation and sequencing. Panel (i) shows a strand of cDNA prepared based on mRNA from an organism, for example S. cereviciae mRNA, and an oligo-dU primer. Panel (ii) shows a 3' tail of dATP that is added to the cDNA followed by Cy3-dideoxy-dTT labeling of the cDNA. The oligo dU sequence used for the cDNA synthesis is removed by the USER reagent, as described herein. Panel (iii) shows that the tailed sample is hybridized to a poly-T oligonucleotide covalently attached to a channel surface of a flow cell. The surface oligonucleotide then acts as a primer for the sequencing reaction. An initial image of the hybridized surface is first taken to register the position of the (Cy3 -labeled) hybridized template strands. Panel (iv) shows that the sequencing reaction is done by adding a Cy5 labeled nucleotide, washing the chemistry away, and imaging the flow cell channel. Panel (v) shows that the Cy5 dye label is cleaved off and washed away. Panel (vi) shows that the next nucleotide is added and imaged.

Fig. 1 Panel B is a flow diagram showing data analysis and workflow of certain embodiments of the methods herein. Numeral (i) shows that raw reads generated by the sequencer are filtered by length and sequence context. Numeral (ii) shows that filtered reads are aligned to the transcriptome reference library of the organism, for example, SGD yeast. Numeral (iii) shows that aligned reads are used for counting transcripts and for generating transcript sequence information. Numeral (iv) shows that unaligned reads are clustered for de-no vo identification of unannotated transcripts.

Fig. 2 Panel A is a graph showing read length distribution. Read length distributions of raw reads, filtered reads, and reference aligned reads are shown. Fig. 2 Panel B is a graph showing a transcript abundance profile comparison of smsDGE to transcript distribution previously published by Holstege et al. (Cell 95:717-728, 1998) by oligonucleotide array measurement. For each data set, transcripts are ranked by abundance on the x-axis. smsDGE counts (left axis) are depicted for both the entire reference set (6,711 transcripts), as well as for the subset of genes that are common to both studies (5,460 transcripts). Estimated # of copies per cell given for array data is shown on the right axis.

Fig. 3 Panel A is a graph showing a transcript count comparison of two flow-cell channels with highest capacity (chl : 1.32 million reads, ch3 : 1.15 million reads) from the same run containing the same sample. Linear correlation > 0.99. Fig. 3 Panel B is a graph showing inter channel coefficient of variance (CV). Observed transcript abundance CV across three flow-cell channels is shown on the x-axis and expected CV based on counting stochasticity alone (expected values determined by random sampling from a Poisson distribution corresponding to 1 million counts per channel) is shown on the y-axis. Plots were smoothed by calculating the median CV for a running window of size 50.

Fig. 3 Panel C is a graph showing comparison of smsDGE counts (x-axis) to measurement by microarray (y-axis). Microarray signal values from a single array are compared to smsDGE counts of the same mRNA sample. Linear correlation = 0.67, Rank correlation = 0.81. Fig. 3 Panel D is a graph showing comparison of smsDGE counts (y-axis) to absolute qPCR (x-axis). qPCR was used to quantify 33 transcripts as described herein, and these transcripts were compared to smsDGE counts. qPCR arbitrary units were normalized using geometric mean. Linear correlation = 0.98.

Fig. 4 Panel A shows an alternative transcription start site (TSS), detected in TRX2 (YGR209C). Fraction of reads started at each position of the reference sequence of the transcript. Position 0 denotes ORF start codon. In total, 6,082 reads were assigned to this transcript. Fig. 4 Panel B shows an example of sequence variation between DBY746 and the reference database strain S288C sequences, in yeast gene DOT6 (YER088C). Figure 4B discloses SEQ ID NOs 2-19, respectively, in order of appearance. Fig. 4 Panel C shows an example of an EST and GeneBank confirmed splice variant seen in RPS26A (YGLl 89C) but not in SGD. Figure 4C discloses SEQ ID NOs 20-33, respectively, in order of appearance. Fig. 4 Panel D shows a cluster consensus sequence mapping to a highly conserved genomic region in seven Saccharomyces species, and confirmed by ESTs. Figure 4D discloses SEQ ID NOs 34-40, respectively, in order of appearance.

Fig. 5 is a graph showing transcript length (x-axis) compared to the fraction of reads reaching the 5' end of a transcript (y-axis). Data show that the fraction of reads reaching the 5 ' end of a transcript is inversely proportional to transcript length.

Fig. 6 is a graph showing accumulation of reads at the TSS of several different transcripts.

Fig. 7 is a set of graphs showing simulation results for transcript abundance profiling of 10 synthetic sequences. Panel A shows simple counting, Panel B shows statistically- corrected counting (SC counting), and Panel C shows convergence sequence of the the SC counting method for 1=15, where i indicates the iteration. 1=15/20 refers to the minimum read length that was used for analysis. Exp refers to planted abundance levels. Fig. 8 is a set of graphs showing simulation results for transcript abundance profiling of 5,750 ORF sequences from S. cerevisiae. Panel A shows simple counting and Panel B shows SC counting. Each dot represents the count (in tpm) of a single transcript.

Fig. 9 is a set of graphs showing abundance of a S. cerevisiae transcriptome by DGE. Panel A shows simple counting and Panel B shows SC counting. Comparison of profiles was generated by using reads of minimum length 20 (1=20) and 25 (1=25). Each dot represents the count (in tpm) of a single transcript.

Fig. 10 is a graph showing transcript profiles generated by simple counting and SC counting methods for 6,719 verified, uncharacterized dubious ORF's. Transcripts are ranked along the x-axis by decreasing abundance. Transcripts with zero counts are not shown.

Fig. 11 is a graph showing cumulative transcript coverage of the transcriptome of S. cerevisiae.

Detailed Description

The invention generally relates to digital gene expression (DGE) technology based on single molecule sequencing (Harris et al, Science, 320:106-109, 2008) coupled with a simple sample preparation procedure, free of restriction digest or ligation steps, that generates accurate transcript counts. Single molecule sequencing DGE techniques of the invention are conducted on single copies of mRNA derived cDNA that is representative of all or a portion of the transcriptome. Accordingly, sequencing is true single molecule sequencing and may be done as a single pass or as multiple passages as disclosed in co-owned U.S. patent application serial number 11/404,695, now U.S. patent number 7,282,337.

In general, DGE includes high-throughput sequencing of short cDNA fragments, in which observed reads are matched against a reference transcriptome. A transcript profile is determined digitally based upon relative tag counts, in contrast to the "analog" (or bulk) nature of microarray signal quantification. DGE is a technology that is capable of providing quantitative analysis of the entire transcriptome of an organism or a selected portion of the transcriptome. The DGE application is an open-ended tool that analyzes the level of expression of virtually all genes in a sample by counting the number of individual mRNA molecules produced from each gene. An advantage is that there is no requirement that genes be identified and characterized prior to conducting an experiment.

DGE can interrogate the entire transcriptome, i.e., all mRNA present, and is not limited to only the known transcripts from the public domain. Due to the highly sensitive and quantitative nature of this technology, each signature generated in the dataset can be enumerated, which allows not only for the detection of highly expressed transcripts but also for the detection of very rare transcripts represented by only a few molecules of mRNA per cell. With the depth of coverage provided by the assay, transcripts that have counts of zero can be confidently determined to be absent from a sample. The digital nature of the data output allows for: comparison of expression levels of different transcripts within the same sample; comparison of transcript levels between samples; and merging with data sets from other expression profiling technologies

In certain embodiments, methods take advantage of DGE in combination with single molecule sequencing by synthesis. Methods herein involve hybridization of single-stranded cDNA molecules to oligonucleotides attached to a surface of a flow cell. The cDNA is then sequenced by imaging the polymerase-mediated addition of fluorescently-labeled nucleotides incorporated into the growing strand surface oligonucleotide, at single molecule resolution. Sample preparation does not require PCR amplification allowing strands to be densely packed onto the flow cell surface resulting in extremely high throughput.

Because methods herein generate sequence reads from the 5 ' ends of cDNA molecules, methods herein do not require the cDNA to be of full length. Consequently, methods herein work equally well with short cDNAs generated as a result of partial mRNA degradation or incomplete reverse transcription. In fact, the variability in read start sites resulting from the presence of short cDNAs enriches the read pool for informative reads reducing bias and improving counting accuracy.

The high efficiency of certain methods herein, using a single read per transcript, allows accurate quantification of low abundance transcripts using a few million reads. In contrast, recent studies demonstrate that other DGE approaches generating a single read per transcript require an order of 3-5 million reads per sample to allow for sensitivity to changes in low-abundance transcripts such as transcription factors (Kim et al., Science, 316:1481- 1484, 2007; and Lin et al., Cancer Res, 65:3081-3091, 2005). Further, accurate expression profiling by deep sequencing technologies (Nagalakshmi, et al., Science, 2008; Mortazavi, Nat Methods, 2008; and Cloonan et al., Nat Methods, 2008) necessitates several fold more reads to reach accurate quantitation of low abundance transcripts due to the large numbers of reads generated from long and abundant transcripts.

Further, methods herein overcome problems with prior art DGE techniques, such as SAGE and MPSS that rely on restriction enzyme digest and ligation steps (Chen et al., BMC Genomics, 1:11, 2006; Siddiqui et al, Nucleic Acids Res 34:e83, 2006; Gilchrist et al, BMC Bioinformatics, 8:403, 2007; Hene et al., BMC Genomics, 8:333, 2007; and So et al., Biotechnol Bioeng, 94:54-65, 2006), because methods herein are based on a simple sample preparation method that is free of PCR reactions, restriction digest or ligation steps. Instead, methods herein utilize poly-A tailing of a cDNA sample by terminal transferase alone. Comparison of methods herein with qPCR quantification shows no detectable bias (see Fig. 3 Panel D) demonstrating the advantages of this sample preparation method.

Another problem with current technology that relies on restriction enzyme digests and ligation steps is that degraded mRNA and incomplete cDNA synthesis by reverse transcriptase can influence abundance measurement both in microarray assays, which rely on adequate cDNA representation at the locations of pre-designed probes, and in SAGE and MPSS technologies, that rely on the representation of the restriction digest site in the sample material. Methods herein overcome these problems by allowing for reads to be generated from all cDNA molecules in the sample, providing the system some resistance to the harmful effects of mRNA degradation and incomplete RT reaction.

In addition, methods herein provides a wealth of sequence information covering a significant part of the expressed transcriptome, in particular, capability to map transcription start- sites. Methods herein can further identify unannotated transcripts by clustering unaligned reads, allowing identification of novel transcripts, contaminants and off-target material.

The capacity of methods herein to provide accurate inter-sample transcript quantification allows for comparison between independently prepared and measured samples. This ability, combined with the efficiency of transcript counting and the high throughput of the SMS platform has the potential of providing cost-efficient expression profiling for large multi-sample studies.

The invention having now been described, it is further illustrated by the following examples and claims, which are illustrative and are not meant to be further limiting. Those skilled in the art will recognize or be able to ascertain using no more than routine experimentation, numerous equivalents to the specific procedures described herein. Such equivalents are within the scope of the present invention and claims.

The contents of all references and citations, including issued patents, published patent applications, and journal articles cited throughout this application, are hereby incorporated by reference in their entireties for all purposes. Examples

Examples herein show profiling of the Saccharomyces cerevisiae transcriptome using the methods of the invention. Examples further show comparison of methods herein with classical quantification methods and demonstrate a quantitative analysis of the entire yeast transcriptome, with accuracy comparable to absolute qPCR. Employing the methods herein, over 3.4 million usable (>20bp long, transcriptome aligned) reads generated in a single run of a prototype sequencing platform were generated and used to accurately quantify the complete range of transcripts expressed in the DBY746 strain of S. cerevisiae.

Example 1 : cDNA preparation

First strand cDNA was prepared (Superscript III kit, invitrogen, CA) from S. cerevisiae mRNA (DBY746; Clontech, CA) according to manufacturers instructions using a 50 nucleotide deoxyuracil primer (5μM) followed by RNase H digestion (37°C, 20 minutes),and removal of the deoxyuracil primer sequence by USER™ reagent digest (NEB, MA; 37°C, 20 minutes). A final incubation with RNase I was then performed to remove any remaining RNA (37°C, 15 minutes). Next the sample was purified using the Ampure kit (Agencourt Bioscience, MA) at a 1 :1.8 sample to bead ratio according to manufacturers instructions.

Example 2: Poly dA extension

A poly dA extension of 50 nucleotides on average, was added to the 3' end of the cDNA by terminal transferase. A 20ng cDNA sample was combined with KAc (5OmM), tris base (2OmM), MgAc (1OmM) (for a final concentration of 10%), CoCl (250μM), dATP (5OX the estimated sample molarity), and a control oligo used to assess the tailing efficiency (0.5 pmole). 20 units terminal transferase were added after denaturation and snap cooling on ice, followed by 1 hour incubation at 42°C, and 10 min heat inactivation at 70⁰C. Tailed samples were then labeled and 3' blocked by Cy3 labeled-dideoxyT (200X molar ratio of Cy3- dideoxy T was added to the tailing reaction). The sample was then denatured and snap cooled on ice, 20 units of terminal transferase were added followed by a 4-hour incubation at 37°C and a final heat inactivation step. This final labeling facilitates the imaging of initial template on the flow cell surface after template capture. The control oligo and excess nucleotides were removed from the sample by Ampure purification at a 1 :1.3 sample to bead ratio (Agencourt Bioscience, MA).

Example 3 : Template capture and sequencing

Design of the flow cell, surface chemistry, and sequencing technology were previously described (Harris et al, Science, 320:106-109, 2008). A Poly dT oligonucleotide is covalently attached to the surface of the flow cell. This surface oligonucleotide has the dual role of facilitating template capture and priming the sequencing reaction. For capture, the poly dA 3' extension of the cDNA template is hybridized to the poly dT surface oligonucleotide. The sequencing reaction was initiated at the 3' end of the surface oligo (see Fig. 1). To avoid sequencing of the poly-dA extension of the template, sequencing was preceded by a dTTP 'filling' step in which the surface oligo was extended against the 3' poly- dA extension of the template (lμM dTTP, DNA polymerase and IX NEB buffer 2; NEB, MA). After the dTTP 'fill', excess nucleotides were washed away and dATP, dCTP, and dGTP were introduced sequentially (0.5μM nucleotide, DNA polymerase I, and IX NEB buffer 2; NEB, MA) with washing steps in between to lock the template into place.

The Cy3 label of the template was used to map its position on the flow cell surface. Sequencing by synthesis was then performed by introducing one of four disulfide linker Cy5- labeled nucleotide analogs (Perkin Elmer, MA) in the presence of a polymerase reaction mix. Incorporated nucleotides were imaged and their position correlated to the Cy3 template position. The Cy5 dye was then chemically cleaved off the incorporated nucleotide and rinsed away. This process was repeated for each of the next 3 nucleotides to complete a sequencing quad cycle. A total of 30 quad cycles were performed. Both the surface chemistry, and the process of sequence base calling were previously described and used here with the exception that no intensity based homopolymer length calling was performed in this study (Harris et al., Science, 320:106-109, 2008).

Example 4: qPCR

Transcripts of 33 S. cerevisiae spanning a large range of expression levels were selected for comparison of smsDGE counts against qPCR quantification (18 Taqman and 15 SYBR green assays). Of these 33 transcripts, 13 were selected from transcripts with smsDGE counts < 10 tpm (transcripts per million) to test accuracy at low abundance levels. qPCR reactions were denatured at 95°C for 10 min followed by 40 cycles of 95°C for 15s and 57°C for 30s. Taqman assays had forward and reverse primers at 0.3μM concentration each, a Taqman probe at 0.25 μM, and IX Taqman reaction mix (Taqman universal PCR mix, Applied Biosystems, CA). SYBR green assays had forward and reverse primer at 0.15 μM each, and IX SYBR green mix (Invitrogen, CA). qPCR normalization was done in two steps: each transcript was first quantified using a yeast genomic DNA standard; and quantification was then standardized against a reference transcript - YDL047W (also first quantified by a genomic DNA standard) included in every qPCR reaction plate.

Example 5 : Data filtering

As a first filtering step, generated reads were filtered according to the following criteria: read length >20, AT content <90%, fraction of CT/TA/AG/GA dinucleotides <80%. This filter removes sequences that are very similar to BAO sequence and are therefore very likely to be artifacts.

The poly-A tail that is used to attach the DNA strands to the channel surface for the sequencing reaction may result in a prefix of Ts in a small fraction of the sequences. To remove excessive leading T's the longest prefix consisting of >75% T and ending with a T was removed from each read. Reads of length >20 were retained.

Finally, reads were aligned to the following sequence,

CAGGGCAGAGGATGGATGCAAGGATAAGTGGA (SEQ ID NO: 1), corresponding to the reverse complement of the control oligo that was used in the polyA extension reaction. Any read with an alignment score >3.9 (see Example 6) was removed.

Example 6: Alignment

All known and "dubious" yeast ORF sequences were downloaded from the SGD database December 2007 version (Fisk et al. Yeast, 23:857-865, 2006), including a lOOOnt upstream and downstream of the ORF. Sequences were trimmed to include only the coding ORF region (known introns were removed), 250nt upstream of the ORF start codon and 50nt downstream of the stop codon. The final reference library contained 6,711 sequences.

Each read was aligned to the complete reference library using a pairwise sequence aligner. The alignment process includes the following steps: template -based identification of candidate matches, which guarantees finding all alignment seeds within a given error-budget of 2 errors in 20 bases (errors may include insertions, deletions and/or substitutions); bit- parallel computation of the edit distance between the complete read and the candidate reference subsequence (Gene, J. ACM, 46:395-415, 1999), and discarding of alignments with <70% match; and exhaustive dynamic-programming calculation of the optimal alignment score between the read and the candidate reference subsequence. The final alignment score is calculated as:

5m - Ae score =

/ where m is the number of matches, e is the number of errors of any type, and / is the read length. This results in an alignment score of 5 for a perfect alignment and lower scores for imperfect alignments. In this study we used all reads that aligned with score > 4 to the reference library for counting transcripts, whereas reads that did not meet this threshold (unaligned reads) were used for de-novo discovery by clustering (see Example ).

Example 7: Algorithms for quantitative genetic analysis

The task of transcript profiling by DGE can be described as follows: given a data set R oϊn independent reads (typically of short length: 20-50bp), and a reference set T oϊm transcript sequences (typically of moderate length: l-20kbp) it would be desirable to be able to map each one of the reads to the transcript it was read from, and then determine the relative abundance of each transcript in the total sample based on the number of reads that were assigned to it. The process by which this task is achieved generally consists of two steps. First, each read is aligned with the complete reference set using a pairwise sequence aligner (e.g. BLAST), and the highest scoring alignments are determined. Then, significant alignments are used to assign reads to transcripts and produce an absolute count for each transcript. As a post-process, transcript counts are typically converted to tpm (transcripts per million) units.

The simplest form of transcript counting, referred to as simple counting, is implemented by assigning each read to the transcript to which it aligned with the highest confidence. Formally, this can be described as:

i p(O - ∑p{n r)p(r) ^ ~∑p{t \ r) (1)

ft if r « indmax,^ A(r,r') [G otherwise

where A(r,t) is the alignment score between read r and transcript t (assuming higher scores are more significant). To avoid situations in which one read gets more weight than others, reads that align with the same best score to more than one transcript are typically discarded (although other strategies, such as splitting the vote, are possible). The outcome of this calculation is the fraction of each transcript in the total mixture -p(t).

A significant disadvantage of the simple counting method is that in the presence of sequencing errors short reads may be misassigned, i.e., the best alignment score is not necessarily indicative of the correct transcript assignment. Although sequencing errors are typically random, the effect of random misassignment is not negligible, mostly due to the large variance in the abundance levels of different transcripts. In a typical scenario, the lower-abundance transcripts are most significantly affected by misassignment and will characteristically be over-counted. The overall effect of misassignment is therefore a drastic reduction of sensitivity.

Even in the absence of sequencing errors, simple counting will fail under certain circumstances. If different transcripts contain highly homologous regions, some of the reads will be inherently ambiguous and will lead to counting errors. If ambiguous reads are discarded, then all transcripts containing such sequence homologies will be under-counted. If read votes are split between equally-scored transcripts then lower-abundance transcripts will, again, be over-counted.

To resolve the counting inaccuracies that result from read misassignment a statistically-corrected counting (SC counting) method were employed with the methods herein. The essence of SC counting is to replace the hard assignments of reads to transcripts that occur in simple counting by soft probability-based assignments. Given a read r and the probability of r having been read off some transcript t, p(t \ r) can be calculated according to the Bayes formula:

The probability depends on the probability of a random read sequenced from t being r, p(t r), and the relative abundance of transcript t in the sample, p(t). The tQxmp(r) is a normalization factor that ensures that each read gets exactly one vote. The intuition is that if a read significantly aligns to more than one transcript, the true assignment of the read heavily depends on the relative abundances of the transcripts in addition to the significance of the alignment. For example, a read that aligns equally well or even with some variation to both a high-abundance and low-abundance transcript is much more likely to have been read off the former than the latter, even if the best alignment score suggests differently.

Practically, using statistically-corrected counting to infer the complete transcript abundance profile of a sample requires overcoming two difficulties. First, the calculation of the transcript abundance distribution is dependent on knowing this profile in advance. To overcome this problem a recursive algorithm that begins with an arbitrary assignment of a uniform transcript abundance distribution (p(t) = \lm) and iteratively calculates a refined distribution, until convergence was utilized.

Second, calculating the exact value oϊp(t r) for every pair of r,t is highly complex, and may also introduce over- fitting of the data. To overcome this difficulty a probability function of the alignment scoring method was precalculated. This is a function that assigns a value to each alignment score a, indicating the probability that a random transcript t and a read sequenced from it r will achieve an alignment score of a

> a). Following is the algorithm for statistically-corrected counting (SC-Count): calculate alignment score A(r,t) for all pairs ofr €R and t € T; for each pair of r,t calculate p(r \ t) =f(A(r,t)) using pre-computed probability function; initialize transcript profile: V € T p_o(t) = 1 /m; and repeat until convergence: for each t € T calculate p,(t) based on Py_-1 (T) using formulas (1) and (3) above.

The term p(r \ t) is assessed based on the alignment score A(V, t) , using simulation. A cumulative probability function f(a) that assigns a value to each alignment score a was calculated, indicating the probability that a random transcript t and a random read sequenced from it r will achieve an alignment score of a or lower: f(a) = Pr(A(V, t) < a) .

To increase the stability of the result, the following modification was apply to the method. If the sum ^ ,/?(r | t')p(f) for all transcripts f with highest scoring alignments is not above a given threshold, the read was discarded. If the read was not discarded, its count was distributed only between the best-scoring alignments, thus removing misassigned reads, rather than reassigning them.

The counting results described herein were obtained using SC-counting using all sub- optimal alignments of score > 3.5 that were detected in the alignment process. Only reads with at least one alignment of score > 4 were used. A threshold of 0.3 probability of correct match was used for discarding misassigned reads. Example 8: Validation transcript counting

To demonstrate application of the statistically-corrected counting method, a simulation using 10 synthetic transcript sequences (each of length 500bp) with concentrations in the range of 0-100,000 units was performed. From these sequences, 363,663 reads were derived, with the number of reads from each sequence in linear proportion to its concentration, with uniformly distributed random read start sites. The reads were of length 15-40bp and at an average error rate of 5%. Both simple counting and SC counting methods were then applied to determine the abundance distribution of 10 reference genes. The analysis was repeated twice, once using all reads, and the second time using only reads of length >20, which are expected to be more specific.

The results of the simulation are depicted in Fig. 7. As can be seen in Fig. 7 Panel A, the sensitivity achieved by simple counting was limited due to cross-assignment of reads, with higher sensitivity being attained when the analysis was restricted to reads with higher specificity. However, some cross-assignment remained even when using only longer reads. Fig. 7 Panel B depicts the performance of the statistically-corrected counting method on the same data, demonstrating that optimal sensitivity was attained and that the method performed equally well when reads with lower specificity were retained. In the first iteration, SC counting calculates an abundance distribution that is similar to the one outputted by the simple counting method, while at later iterations this distribution is used to correct read misassignments (Fig. 7 Panel C). In this simple scenario the distribution converged to its final state within 5 iterations.

To demonstrate the applicability of the algorithm to biological data, the simulation was repeated using a reference set of the full S. cereviciae transcriptome, consisting of 5,750 verified ORF sequences with lengths in the range of 50-15,000. Transcript abundance distribution was set according to absolute yeast transcriptome profile that was previously measured with microarrays. From this reference, 360,000 reads were derived, with parameters as above, and aligned to the complete reference. As before, both simple counting and SC counting methods were applied to the complete dataset. For efficiency of calculation, reads that aligned to greater than 100 references with significant alignment scores being discarded.

Fig. 8 summarizes the results of this simulation. As before, when simple counting was employed, sensitivity was limited by misassignment and low-abundance transcripts were significantly over-counted (Fig. 8 Panel A). Applying the SC counting method adequately resolved these misassignments, and sensitivity was restored down to the level of single counts (Fig. 8 Panel B). As could be expected, the SC counting method also significantly reduced the number of false positives, transcripts that were assigned a positive number of counts regardless of the fact that their abundance was set to zero. Simple counting generated 712 false positives with counts in the range of 1-100, while SC counting reduced this number to 10.

Both methods generate a comparable number of false negatives (transcripts that were assigned no counts although their abundance was positive), 54 by simple counting and 78 by SC counting. An additional source of error that was corrected by SC counting was significant under-counting of specific genes that contained highly homologous regions by the simple counting method. This phenomenon occurred due to the fact that this method discarded reads that were completely ambiguous (aligned with the same highest score to more than one transcript) thereby artificially deflating the total counts of transcripts that contained a high fraction of non-unique sequence (see large number of dots significantly below the diagonal in Fig. 8 Panel A).

Next SC counting was investigated against an experimentally-derived full transcriptome DGE data set. SC counting was applied to a subset of 400,000 reads sequenced from a S. cereviciae mRNA sample on a prototype single-molecule DNA sequencer. Read lengths in this sample were in the range of 20-70. The reference sequence set that was used for this analysis was the complete set of 6,719 verified, uncharacterized and dubious ORFs from the SGD repository.

The results of this analysis are depicted in Fig. 9. Since true counts are unknown in this sample, we compared the transcript profiles and derived the simple counting and SC counting methods on reads of different minimum lengths (1=20/25), assuming that longer reads are generally more specific. Comparison of the profiles generated by simple counting for 1=20/25 (Fig. 9 Panel A) showed the expected pattern, in which using shorter, less specific, reads limited the sensitivity of the method to an order of 100 tpm. The profiles generated by SC counting for 1=20/25 did not differ significantly, demonstrating that this method successfully mitigated the effect of read misassignment on the sensitivity of the assay.

SC counting also significantly reduced the amount of false positives in the generated abundance profile. Fig. 10 depicts the cumulative number of transcript that were assigned positive counts. The fact that simple counting assigned positive counts to 6,275 transcripts was indicative of a significant number of false positives, since the number of active yeast transcripts is estimated at well below 6,000. In contrast, SC counting assigned positive counts to only 5,721 transcripts. Data show that 595 of 998 transcripts that were assigned 0 counts by SC counting were labeled "Dubious ORFs" in the database. Given that there was a total of 836 transcripts with this label, the p-value of this agreement is p<10^"100, indicating that SC counting efficiently recognized true negatives.

Example 9: SNP Discovery

The yeast strain that was sequenced (DBY746) was different from the sequence that appears in the reference database (S288C). To discover sequence variations between these two strains the following SNP discovery method was applied to any part of the reference transcriptome that was covered at a depth of at least 5x (~1.9Mbp, see Fig S2).

The probability of reading the correct base for any random position in the reference based on all aligned reads was calculated. This value was determined separately for each of the 4 bases. For example p_Λ corresponds to the fraction of opportunities to read an A in the reference that were read correctly. Then, a most probable consensus base was calculated for each position in the reference, based on all reads covering that position as follows: for a given reference position let n be the number of reads covering that position; and let n_A ,n_c ,n_G , n_τ be the respective numbers of A, C, G, T's called at that position. The likelihood of the true base at position being x can then be calculated from the binomial tail:

The consensus base is set as base attaining maximal likelihood. If the consensus base is different from the reference base then a confidence score of this call is calculated as the log-likelihood ratio:

, _{T ΎT} , LH (consensus) log LH = In -

LH (reference)

X/X variations were confirmed by Sanger sequencing.

Example 10: Read clustering

To demonstrate ability of methods herein to provide de-novo discovery of unannotated transcripts, a reference-free sequence clustering strategy was applied to a subset of the reads that were not aligned to the reference library within the given threshold (score>4, see Alignment).

The unaligned read set was expected to contain some high-error reads from reference sequences that fell outside of the alignment threshold. To enrich the set for informative non- reference matching reads that aligned to any of the 20 highest-expressing genes we first removed with a more relaxed criterion (score>3.5). 40,000 of the remaining reads, of length 30 or more, were arbitrarily selected for cluster analysis.

All pairwise alignments between these reads were calculated by dynamic programming, assigning an alignment score in the range 0-5 to each pair. A variant of the CAST clustering algorithm (Ben-Dor et al. J Comput Biol, 6:281-297, 1999) was used to identify clusters of reads with an average pairwise alignment score >4, identifying 70 clusters of 4-62 reads each. The consensus sequence for each of these clusters was calculated using ReAligner (Anson et al., J Comput Biol, 4:369-383, 1997), a multiple sequence aligner. Each of these sequences, of length 32-46, was then mapped to the complete genome of S. cerevisiae using BLAT (Kent, Genome Res, 12:656-664, 2002).

The use of a clustering approach provided highly accurate and long consensus sequences that can be used to identify sequences that are non-contiguous in the genome (e.g. novel splice variants) or alien sequences (e.g. contaminations or viral content). Example 11 : Sample preparation and sequencing

Single-molecule sequencing was performed on a prototype instrument described in detail in Shiraki et al. (Proc Natl Acad Sci USA, 100:15776-15781, 2003). The sample preparation procedure and sequencing reaction is overviewed in Fig. 1 Panel A. Briefly, first strand cDNA was prepared using a poly-U primer. Next, a poly-A tail was added to the 3 ' end of the first-strand cDNA terminating with a Cy3 ddTTp (see Examples 1 and 2 above). The sample was then loaded onto 3 of 5 available flow-cell channels and hybridized to poly- T oligonucleotides covalently attached to the surface of the flow-cell. This hybridization allows the surface-oligonucleotides to be used as primers for the subsequent sequencing reaction.

Sequencing was achieved by incubating the flow-cell with a single fluorescently labeled Virtual Terminator™ nucleotide (VT) in the presence of a DNA polymerase reaction mix. These nucleotides were incorporated one at a time with no homopolymer run-through. Next, the free nucleotides were washed away leaving only the incorporated fluorescently labeled nucleotides behind. Sequencing information was attained by imaging the surface, and then recording nucleotide incorporations at each specific growing DNA strand location. After imaging, the fluorescent label of the nucleotide was removed, and incubation with the next nucleotide follows. The serial incubation of all four nucleotides is termed a "quad- cycle", and 30 quad-cycles were used for the S. cerevisiae transcriptome profiling described here.

Example 12: Data and alignment

Data analysis workflow is outlined in Fig. 1 Panel B. Data were collected from three channels of a single flow-cell generating 11.2 million raw reads. Filtering by length and sequence complexity yielded a final count of 5.1 million reads of lengths 20-70nt. Each of these reads was then aligned to a S. cerevisiae transcriptome reference library consisting of 5 ' UTR and ORF sequences of 6,719 verified, uncharacterized and dubious ORFs from the Saccharomyces Genome Database (SGD; Fisk et al, Yeast, 23:857-865, 2006) using a Smith- Waterman based alignment algorithm.

In total, 3.4 million (68%) of the filtered reads were aligned to at least one yeast transcript above a defined alignment score threshold, the median length of an aligned read being 28nt and the maximal length, 62nt (see Fig. 2 Panel A). In addition to alignment against the known transcriptome library, a sequence clustering strategy was also employed to identify novel transcripts within the pool of reads that did not align. Sequencing error rates were calculated based on a sample of aligned reads, resulting in 6.5% average error (4.7% missing bases, 1.4% insertions and 0.4% substitutions).

Example 13: Transcript counting

Deriving a transcript distribution based on short tags (as done in SAGE & MPSS) is typically based on a unique assignment of each read to a single transcript. However, due to the occurrence of natural transcript sequence homologies and read errors, unique assignments based on best alignment scores may lead to miscounting due to ambiguous or incorrect assignment. Read misassignment to abundant transcripts is not expected to generate significant skewing of transcript counts. However, low abundance (or non-existing) transcripts are inclined to be over-counted.

To overcome this difficulty and to achieve maximal assay specificity a probability- based method for assignment of reads to transcripts was employed. This method is described in detail in Examples 7 and 8. Briefly, suboptimal alignments between each read and the entire reference library were considered, and reads that had a significant probability of having been misassigned according to the most optimal alignment were discarded. The vote of ambiguously aligned reads was distributed between all respective transcripts based on their relative abundances in an iterative fashion.

The profile of the S. cerevisiae transcriptome as revealed by methods herein is depicted in Fig. 2 Panel B. Of the 6,711 ORFs in the reference set, 5,343 (80%) transcripts were measured at an abundance of 10 tpm to 12,000 tpm, 3,847 (72%) of which expressed at a level of 10 tmp to 100 tpm. This profile demonstrated high agreement with an absolute transcript level profile previously measured for 5,460 genes using oligonucleotide arrays (Holstege et al., Cell, 95:717-728, 1998). In addition, this comparison demonstrated that smsDGE transcript counts spanned at least three orders of magnitude of expression levels (0.1 to 100 transcripts per cell) with higher resolution of low abundance transcripts than was demonstrated using microarray technology.

The remaining 1,368 (20%) of the ORFs in the reference set were detected at a level of less than 10 tpm, signifying extremely low or no expression. Amongst these were 735 (91%) of 809 ORFs annotated in SGD as "dubious" ORFs, and only 265 (6%) of 4,647 ORFs which were annotated as "verified" ORFs. Infrequent detection of transcripts described as dubious versus verified transcripts served as a validation of the high specificity attainable by methods herein.

Example 14: Reproducibility and accuracy

Results were highly reproducible between different flow-cell channels (corr = 0.994 to 0.998 between the three channel pairs, Fig. 3 Panel A). The median CV for all transcripts with expression level >10 tpm is 11%, with median CV at 10 tpm being 30%, at 100 tpm - 11% and at 1,000 tpm - 4%. These figures are in good agreement with the variation expected due to counting stochasticity alone (Fig. 3 Panel B).

The ability of methods herein to accurately quantify transcript levels was validated by comparison with existing expression profiling technologies. The same mRNA sample was quantified by microarray (Affymetrix Yeast 2.0 Array, performed by Expression Analysis, NC) and assessed the correlation between smsDGE counts and unprocessed microarray signal levels (Fig. 3 Panel C). Absolute transcription profiling by microarrays is known to be inaccurate due to probe heterogeneity, and the measured absolute signal levels are expected to vary within an order of magnitude. The agreement between the absolute array signal and smsDGE measurements indeed followed this pattern, demonstrating an overall correlation of 0.67 (rank correlation of 0.81; linear correlation is negatively affected by the non- linear saturation of the array signal).

To test the ability of smsDGE to generate absolute transcript counts smsDGE counts were compared to qPCR measurements of the same mRNA sample on a panel of 33 transcripts at a wide range of transcription levels (Fig. 3 Panel D). This comparison demonstrated a particularly high correlation (r=0.975, p<10^"20) of smsDGE counts with the gold standard of absolute expression analysis, covering over three orders of magnitude. Almost all smsDGE counts were observed to fall within a two-fold range of the respective qPCR measurements, with only two low-abundance transcripts deviating more significantly, as is expected from the stochastic nature of counting at this level.

Example 15: Transcription Start Site Mapping

Due to the sample preparation procedure employed herein, a significant number of reads were sequenced from the 5' end of the complete transcripts, the transcription start sites (TSS). To allow mapping of reads to 5' UTR regions, the reference library included the transcript sequence up to 250bp upstream of the ORF start codon. In total, 57% of the reads uniquely mapped to 5' UTR regions, 76% of which begin at 50bp upstream of the ORF start codon, the assumed TSS position of most yeast transcripts (Zhang et al, Nucleic Acids Res, 33:2838-2851, 2005; and Miura et al, Proc Nαtl Acαd Sci USA, 103:17846-17851, 2006).

As expected, due to limited reverse transcriptase processivity and/or mRNA degradation the fraction of reads reaching the 5' end of a transcript was inversely proportional to their length (Fig. 5). Previous efforts to accurately map the TSS of the yeast transcriptome by 5' SAGE (Zhang et al., Nucleic Acids Res, 33:2838-2851, 2005), and EST sequencing (Miura et al., Proc Nαtl AcαdSci USA, 103:17846-17851, 2006) provided significant information regarding yeast TSS. The data generated in this study enhanced these results by providing tens to thousands of reads per TSS for many different transcripts, allowing accurate mapping of the physical TSS (at least 1,600 transcripts have >100 reads mapping to their 5' UTR). Fig. 6 shows the accumulation of reads at the TSS of several different transcripts. Fig. 4 Panel A depicts an example of a transcript with two alternative TSS, that was accurately identified in our data, and demonstrated high agreement with the distribution of TSS previously described for this transcript (Miura et al., Proc Nαtl Acαd Sci USA, 103:17846-17851, 2006). Example 16: Additional transcript characterization

The wide variability in read start sites for each transcript provided a wealth of transcriptome sequence information. This variability was the result of the presence of cDNAs that did not reach full length (e.g. due to incomplete reverse transcription or RNA degradation). For that reason the coverage of different transcripts varied significantly due to their relative sizes and abundance levels, and was typically non-uniform even across any single transcript. However, as depicted in Fig. 11, the aligned reads covered 6.4Mbp of the ~8Mbp-sized transcriptome of S. cerevisiae, with 1.9Mbp covered at depth > 5X. Applying a simple SNP discovery tool (See Example 9) to this data identified over 1,000 statistically- significant sequence variations between the yeast strain being sequenced (DBY746) and the strain whose sequence appeared in the reference database (S288C), most of which were single-base substitutions. Fig. 4 Panel B demonstrates two detected sequence variation in DOT6.

A significant advantage of sequencing-based expression profiling methods over conventional array platforms is their ability to detect novel sequences that appear in the sample without any prior information that would be required for probe design. To demonstrate ability of smsDGE to provide de-novo identification of unknown sequence features a read-clustering strategy (See Example 10) was employed to a subset of reads poorly aligned to the reference library. 40,000 reads of length >30nt were arbitrarily selected and all pairwise alignments between reads were calculated. A variant of the CAST clustering algorithm (Ben-Dor et al., J Comput Biol, 6:281-297, 1999) was used to identify clusters of reads that had a high degree of mutual similarity. The consensus sequence for each of these clusters was then calculated and mapped to the complete genome of S. cerevisiae using BLAT (Kent, Genome Res, 12: 656-664, 2002; and Altschul et al., JMo/ Biol 215:403-410, 1990). Several high-abundance transcript sequences not annotated in SGD were discovered. These include 5' UTR splice junctions (Juneau et al., Proc Natl Acad Sci USA, 104:1522- 1527, 2007) confirmed by EST mapping and GeneBank entries (Miura et al., Proc Natl Acad Sci USA, 103:17846-17851, 2006; Fig. 4 Panel C), segments of 3' UTR, rRNA and ncRNA. Fig. 4 Panel D shows one of the identified clusters, a 40nt long consensus sequence mapped to the yeast genome with high confidence, in agreement with published ESTs (Miura et al., Proc Natl Acad Sci USA, 103:17846-17851, 2006) and in a genomic region highly conserved among seven related yeast species Kent, Genome Res, 12 :996-1006, 2002).

Claims

What is claimed is:

1. A method of obtaining a profile of a transcriptome of an organism, the method comprising: sequencing one or more cDNA molecules to obtain sequencing reads; aligning each of the sequencing reads to a reference sequence and separating sequencing reads into a pool of aligned sequencing reads and a pool of unaligned sequencing reads; and deriving a transcript distribution of the pool of aligned sequencing reads, thereby obtaining the profile of the transcriptome of the organism.

2. The method according to claim 1, further comprising employing a sequence clustering strategy to identify novel transcripts within the pool of unaligned sequencing reads.

3. The method according to claim 1, further comprising, prior to sequencing, obtaining one or more mRNA molecules from the organism, generating single-strand cDNA molecules based on the one or more mRNA molecules, and hybridizing the single-stranded cDNA molecules to oligonucleotides.

4. The method according to claim 3, wherein the one or more mRNA comprise a complete transcriptome of the organism.

5. The method according to claim 1, further comprising, prior to aligning, filtering the sequencing reads by length and sequence context.

6. The method according to claim 1, wherein sequencing is single molecule sequencing- by-synthesis.

7. The method according to claim 1, wherein separating comprises assigning a probability of accurate alignment for each sequencing read, wherein sequencing reads above a defined alignment score threshold are placed in the pool of aligned sequencing reads, and sequencing reads below the defined alignment score threshold are placed in the pool of unaligned sequencing reads.

8. The method according to claim 1, wherein deriving comprises calculating a distribution of transcripts based upon a distribution of reads assigned to the reference sequence.

9. A method for quantitative analysis of genome transcripts, the method comprising: obtaining nucleic acid from a biological sample; sequencing one or more portions of the nucleic acid; aligning the one or more portions with a reference sequence; assigning a probability to each alignment; and determining a quantitative distribution of the one or more portions.

10. A method for transcriptional profiling, the method comprising: obtaining sequence reads from one or more gene transcripts; assigning the reads to a reference sequence for alignment based upon a probability of an accurate match between the reads and the reference sequence; and calculating a distribution of transcripts based upon the distribution of reads assigned to the reference sequence.

11. A method for reducing errors in quantitative transcriptional analysis, the method comprising: sequencing one or more regions of a transcriptome; aligning sequences from the one or more regions to a reference sequence; assigning a probability of accurate alignment for each of the sequences; and discarding aligned sequences that do not meet a predetermined probability threshold.

12. The method according to any one of claims 9-11, wherein sequencing is single molecule sequencing-by-synthesis.