Analysis
Published: 19 January 2015

The landscape of long noncoding RNAs in the human transcriptome

Matthew K Iyer^1,2,na1^,
Yashar S Niknafs^1,3,na1^,
Rohit Malik^1,4^,
Udit Singhal^1,5^,
Anirban Sahu^1,4^,
Yasuyuki Hosono^1^,
Terrence R Barrette^1^,
John R Prensner^1^,
Joseph R Evans^1,6^,
Shuang Zhao^1,6^,
Anton Poliakov^1^,
Xuhong Cao^1,5^,
Saravana M Dhanasekaran^1,4^,
Yi-Mi Wu^1^,
Dan R Robinson^1^,
David G Beer^6,7^,
Felix Y Feng^1,6,8^,
Hariharan K Iyer^9^ &
Arul M Chinnaiyan^1,2,4,5,8,10^

Nature Genetics volume 47, pages 199–208 (2015)

Subjects

Computational biology and bioinformatics

Abstract

Long noncoding RNAs (lncRNAs) are emerging as important regulators of tissue physiology and disease processes including cancer. To delineate genome-wide lncRNA expression, we curated 7,256 RNA sequencing (RNA-seq) libraries from tumors, normal tissues and cell lines comprising over 43 Tb of sequence from 25 independent studies. We applied ab initio assembly methodology to this data set, yielding a consensus human transcriptome of 91,013 expressed genes. Over 68% (58,648) of genes were classified as lncRNAs, of which 79% were previously unannotated. About 1% (597) of the lncRNAs harbored ultraconserved elements, and 7% (3,900) overlapped disease-associated SNPs. To prioritize lineage-specific, disease-associated lncRNA expression, we employed non-parametric differential expression testing and nominated 7,942 lineage- or cancer-associated lncRNA genes. The lncRNA landscape characterized here may shed light on normal biology and cancer pathogenesis and may be valuable for future biomarker development.

**Figure 1: *Ab initio* transcriptome assembly shows an expansive landscape of human transcription.**

**Figure 2: Characterization of the MiTranscriptome assembly.**

**Figure 3: Analysis of conservation in lncRNAs.**

**Figure 4: Methodology for discovering cancer-associated lncRNAs.**

**Figure 5: Discovery of lineage-associated and cancer-associated lncRNAs in the MiTranscriptome compendium.**

References

Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136, E359–E386 (2015).
Kandoth, C. et al. Mutational landscape and significance across 12 major cancer types. Nature 502, 333–339 (2013).
Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127–1133 (2013).
Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
Ulitsky, I. & Bartel, D.P. lincRNAs: genomics, evolution, and mechanisms. Cell 154, 26–46 (2013).
Prensner, J.R. & Chinnaiyan, A.M. The emergence of lncRNAs in cancer biology. Cancer Discov. 1, 391–407 (2011).
Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013).
Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177–1184 (2013).
Prensner, J.R. et al. Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nat. Biotechnol. 29, 742–749 (2011).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Cabili, M.N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).
Pruitt, K.D. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 42, D756–D763 (2014).
Karolchik, D. et al. The UCSC Genome Browser database: 2014 update. Nucleic Acids Res. 42, D764–D770 (2014).
Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 41, e74 (2013).
Finn, R.D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
Kim, M.S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789 (2012).
Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Rosenbloom, K.R. et al. ENCODE data in the UCSC genome browser: year 5 update. Nucleic Acids Res. 41, D56–D63 (2013).
Necsulea, A. et al. The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature 505, 635–640 (2014).
Dimitrieva, S. & Bucher, P. UCNEbase—a database of ultraconserved non-coding elements and genomic regulatory blocks. Nucleic Acids Res. 41, D101–D109 (2013).
Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).
Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
Grasso, C.S. et al. The mutational landscape of lethal castration-resistant prostate cancer. Nature 487, 239–243 (2012).
Yu, Y.P. et al. Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. J. Clin. Oncol. 22, 2790–2799 (2004).
Taylor, B.S. et al. Integrative genomic profiling of human prostate cancer. Cancer Cell 18, 11–22 (2010).
Glück, S. et al. TP53 genomics predict higher clinical and pathologic tumor response in operable early-stage breast cancer treated with docetaxel-capecitabine ± trastuzumab. Breast Cancer Res. Treat. 132, 781–791 (2012).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Rhodes, D.R. et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 9, 166–180 (2007).
Gray, K.A., Yates, B., Seal, R.L., Wright, M.W. & Bruford, E.A. Genenames.org: the HGNC resources in 2015. Nucleic Acids Res. doi:10.1093/nar/gku1071 (31 October 2014).
Chen, D. et al. LIFR is a breast cancer metastasis suppressor upstream of the Hippo-YAP pathway and a prognostic marker. Nat. Med. 18, 1511–1517 (2012).
Gupta, R.A. et al. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 464, 1071–1076 (2010).
Prensner, J.R. et al. The long noncoding RNA SChLAP1 promotes aggressive prostate cancer and antagonizes the SWI/SNF complex. Nat. Genet. 45, 1392–1398 (2013).
Guttman, M. et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–227 (2009).
Thomas, G. et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat. Genet. 41, 579–584 (2009).
Stacey, S.N. et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor–positive breast cancer. Nat. Genet. 39, 865–869 (2007).
Michailidou, K. et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat. Genet. 45, 353–361 (2013).
Turnbull, C. et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 42, 504–507 (2010).
Li, J. et al. A combined analysis of genome-wide association studies in breast cancer. Breast Cancer Res. Treat. 126, 717–727 (2011).
Amaral, P.P., Clark, M.B., Gascoigne, D.K., Dinger, M.E. & Mattick, J.S. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res. 39, D146–D151 (2011).
Volders, P.J. et al. LNCipedia: a database for annotated human lncRNA transcript sequences and structures. Nucleic Acids Res. 41, D246–D251 (2013).
Park, C., Yu, N., Choi, I., Kim, W. & Lee, S. lncRNAtor: a comprehensive resource for functional investigation of long noncoding RNAs. Bioinformatics 30, 2480–2485 (2014).
Hangauer, M.J., Vaughn, I.W. & McManus, M.T. Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs. PLoS Genet. 9, e1003569 (2013).
Zhou, Y. et al. Activation of p53 by MEG3 non-coding RNA. J. Biol. Chem. 282, 24731–24742 (2007).
Tomlins, S.A. et al. Urine TMPRSS2:ERG fusion transcript stratifies prostate cancer risk in men with elevated serum PSA. Sci. Transl. Med. 3, 94ra72 (2011).
Prensner, J.R. et al. PCAT-1, a long noncoding RNA, regulates BRCA2 and controls homologous recombination in cancer. Cancer Res. 74, 1651–1660 (2014).
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
Fickett, J.W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10, 5303–5318 (1982).
Kim, M.S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
Chambers, M.C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920 (2012).
Ye, J. et al. Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics 13, 134 (2012).
Eisenberg, E. & Levanon, E.Y. Human housekeeping genes, revisited. Trends Genet. 29, 569–574 (2013).
Bernstein, B.E. et al. Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 120, 169–181 (2005).
Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).

Acknowledgements

We thank B. Palen and J. Hallum for technical assistance with the high-performance computing cluster, S. Roychowdhury for reviewing the manuscript, the University of Michigan DNA Sequencing Core for Sanger sequencing and K. Giles for critically reading the manuscript and for the submission of documents. This work was supported in part by US National Institutes of Health Prostate Specialized Program of Research Excellence grant P50 CA69568, Early Detection Research Network grant U01 CA111275, US National Institutes of Health grants R01 CA132874 and R01 CA154365 (D.G.B. and A.M.C.), and US Department of Defense grant PC100171 (A.M.C.). A.M.C. is supported by the Prostate Cancer Foundation and the Howard Hughes Medical Institute. A.M.C. is an American Cancer Society Research Professor and a Taubman Scholar of the University of Michigan. R.M. was supported by a Prostate Cancer Foundation Young Investigator Award and by US Department of Defense Post-Doctoral Fellowship W81XWH-13-1-0284. Y.S.N. is supported by a University of Michigan Cellular and Molecular Biology National Research Service Award Institutional Predoctoral Training Grant.

Author information

Matthew K Iyer and Yashar S Niknafs: These authors contributed equally to this work.

Authors and Affiliations

Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, Michigan, USA
Matthew K Iyer, Yashar S Niknafs, Rohit Malik, Udit Singhal, Anirban Sahu, Yasuyuki Hosono, Terrence R Barrette, John R Prensner, Joseph R Evans, Shuang Zhao, Anton Poliakov, Xuhong Cao, Saravana M Dhanasekaran, Yi-Mi Wu, Dan R Robinson, Felix Y Feng & Arul M Chinnaiyan
Department of Computational Medicine and Bioinformatics, Ann Arbor, Michigan, USA
Matthew K Iyer & Arul M Chinnaiyan
Department of Cellular and Molecular Biology, University of Michigan, Ann Arbor, Michigan, USA
Yashar S Niknafs
Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
Rohit Malik, Anirban Sahu, Saravana M Dhanasekaran & Arul M Chinnaiyan
Howard Hughes Medical Institute, University of Michigan, Ann Arbor, Michigan, USA
Udit Singhal, Xuhong Cao & Arul M Chinnaiyan
Department of Radiation Oncology, University of Michigan, Ann Arbor, Michigan, USA
Joseph R Evans, Shuang Zhao, David G Beer & Felix Y Feng
Department of Surgery, Section of Thoracic Surgery, University of Michigan, Ann Arbor, Michigan, USA
David G Beer
Comprehensive Cancer Center, University of Michigan, Ann Arbor, Michigan, USA
Felix Y Feng & Arul M Chinnaiyan
Department of Statistics, Colorado State University, Fort Collins, Colorado, USA
Hariharan K Iyer
Department of Urology, University of Michigan, Ann Arbor, Michigan, USA
Arul M Chinnaiyan

Authors

Matthew K Iyer
Yashar S Niknafs
Rohit Malik
Udit Singhal
Anirban Sahu
Yasuyuki Hosono
Terrence R Barrette
John R Prensner
Joseph R Evans
Shuang Zhao
Anton Poliakov
Xuhong Cao
Saravana M Dhanasekaran
Yi-Mi Wu
Dan R Robinson
David G Beer
Felix Y Feng
Hariharan K Iyer
Arul M Chinnaiyan

Contributions

M.K.I., Y.S.N. and A.M.C. conceived the study and analyses. M.K.I. processed RNA-seq data and performed ab initio assembly. M.K.I. and Y.S.N. performed data processing and data analysis with assistance from T.R.B., R.M., A.S., Y.H., J.R.E., S.Z., J.R.P. and F.Y.F. R.M., U.S., A.S. and Y.H. performed quantitative PCR validations. M.K.I. and Y.S.N. developed SSEA with the help of H.K.I. D.G.B. contributed primary samples. D.R.R., Y.-M.W. and S.M.D. generated RNA-seq libraries, and X.C. performed the sequencing. M.K.I., Y.S.N. and A.S. developed the web resource. T.R.B. provided systems administration, data storage, high-performance computing and networking support. A.P. performed the proteomics analysis. M.K.I., Y.S.N. and A.M.C. wrote the manuscript. All authors discussed results and commented on the manuscript.

Corresponding author

Correspondence to Arul M Chinnaiyan.

Ethics declarations

Competing interests

Oncomine is supported by ThermoFisher, Inc. (previously Life Technologies and Compendia Biosciences). A.M.C. was a co-founder of Compendia Biosciences and served on the scientific advisory board of Life Technologies before it was acquired. The University of Michigan has filed a patent application for the use of a subset of the lncRNAs described in this study as biomarkers of cancer.

Integrated supplementary information

Supplementary Figure 1 Curation and processing of samples in the MiTranscriptome compendia.

(a) Pie chart showing the number of studies curated from TCGA, ENCODE, MCTP and other publicly available datasets. (b) Workflow for bioinformatics processing of individual RNA-seq libraries. Data sets downloaded as BAM files were first converted to FASTQ format. Quality assessment of FASTQ files was performed using FASTQC. Reads mapping to mitochondria, ribosomal RNA, poly-A sequence, poly-C sequence or phiX virus (a spiked-in control) were filtered out. Fragment length distribution and orientation were determined by mapping a subset of the input reads to a set of large human exons (>500 bp). Reads were aligned using TopHat (v2.0.6) with Bowtie2 (v2.1.0). Gene fusion calling was performed using TopHat-Fusion (v2.0.6) with Bowtie1 (v0.12.9). Read alignment metrics were computed using Picard Tools, and genome track information was generated using BEDTools and UCSC binary utilities. Finally, ab initio transcriptome assembly was performed using Cufflinks version 2.0.2. (c) Scatter plot showing the total fragments (x axis) and the fraction of aligned fragments (y axis) for each RNA-seq library. Coarse quality control filters were used to remove libraries with fewer than 20 million total fragments or 20 million alignments (red point). (d) Dot plot showing for each library the fraction of aligned bases corresponding to RefSeq mRNAs (black points), intronic regions (green points) or intergenic regions (blue points) on the y axis. Libraries with fewer than 50% of aligned bases corresponding to RefSeq mRNA were filtered out (dotted line). (e) Pie chart showing the numbers of primary tumors (red), metastatic tumors (yellow), benign adjacent tissues or tissues from healthy individuals (blue), or cell lines (green) for 6,503 RNA-seq libraries that passed coarse quality control filters.
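
The coarse quality control filters described for panels c and d can be written as a simple predicate. The Python sketch below is illustrative only and is not the authors' pipeline code: the `Library` record, its field names and the function name are hypothetical, and treating "20 million alignments" as a count of aligned fragments is an assumption. The thresholds are the ones stated in the caption.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Hypothetical per-library summary; field names are assumptions."""
    name: str
    total_fragments: int
    aligned_fragments: int
    mrna_base_fraction: float  # fraction of aligned bases in RefSeq mRNA


# Thresholds taken from the caption of Supplementary Figure 1.
MIN_TOTAL_FRAGMENTS = 20_000_000
MIN_ALIGNMENTS = 20_000_000
MIN_MRNA_FRACTION = 0.50


def passes_coarse_qc(lib: Library) -> bool:
    """Return True if the library survives all three coarse filters."""
    return (
        lib.total_fragments >= MIN_TOTAL_FRAGMENTS
        and lib.aligned_fragments >= MIN_ALIGNMENTS
        and lib.mrna_base_fraction >= MIN_MRNA_FRACTION
    )


libraries = [
    Library("lib_deep", 45_000_000, 38_000_000, 0.72),
    Library("lib_shallow", 12_000_000, 10_000_000, 0.80),
    Library("lib_intronic", 60_000_000, 50_000_000, 0.35),
]
kept = [lib.name for lib in libraries if passes_coarse_qc(lib)]
print(kept)  # ['lib_deep']
```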

Supplementary Figure 2 Transfrag filtering.

(a) The dot plot shows the numbers of short transfrags (red), short clipped exons (blue) and long transfrags (black) for each library. (b) The dot plot shows the numbers of unannotated intergenic or antisense transfrags (blue), sense intronic transfrags (green) and annotated transfrags (black) for each library. (c) Example transcript models illustrating categories of ab initio transcripts and sources of background noise. Annotated transfrags (black) overlap reference transcripts on the same strand. Unannotated antisense intronic or intergenic transfrags (blue) may be confounded by genomic DNA contamination. Unannotated sense intronic transfrags (green) may be confounded by contamination from both genomic DNA and incompletely processed RNA. (d) Decision tree depicting the transfrag filtering steps for a single library. First, transfrags were labeled ‘annotated’ or ‘unannotated’ on the basis of overlap with a reference transcriptome catalog. Annotated transfrags and unannotated multiexonic transfrags were considered expressed. Unannotated monoexonic transfrags within introns in the sense orientation of an overlapping transcript were discarded as incompletely processed RNA artifacts. Unannotated antisense or intergenic monoexonic transfrags were subjected to a bivariate kernel density classification method to discriminate recurrent, reliable transcription from genomic DNA contamination artifacts. Transfrags predicted as ‘expressed’ were incorporated into meta-assemblies. (e) Scatter plot comparing the sensitivity of the monoexonic transfrag classifier for correctly detecting annotated transcripts (y axis) and the fraction of unannotated transfrags predicted to be expressed (x axis). (f) Histogram demonstrating the sensitivity for correctly detecting annotated test transcripts held out of the classifier training process.
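
The decision tree in panel d can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: the `Transfrag` record, the `context` labels and the function names are hypothetical, and the bivariate kernel density classifier is represented only by a caller-supplied function, because the caption does not give its details.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transfrag:
    """Hypothetical transfrag record; field names are assumptions."""
    annotated: bool
    num_exons: int
    context: str  # "sense_intronic", "antisense" or "intergenic"


def filter_transfrag(t: Transfrag,
                     is_reliable: Callable[[Transfrag], bool]) -> str:
    """Follow the filtering steps described for a single library."""
    if t.annotated:
        return "expressed"            # overlaps a reference transcript
    if t.num_exons > 1:
        return "expressed"            # unannotated but multiexonic
    if t.context == "sense_intronic":
        return "discarded"            # likely incompletely processed RNA
    # Unannotated antisense or intergenic monoexonic transfrag:
    # defer to the classifier (a stand-in for the kernel density method).
    return "expressed" if is_reliable(t) else "background"


def always_noise(_: Transfrag) -> bool:
    """Placeholder classifier that rejects everything."""
    return False


print(filter_transfrag(Transfrag(False, 1, "intergenic"), always_noise))
# background
```

Only transfrags labeled "expressed" would be passed on to meta-assembly in this sketch.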

Supplementary Figure 3 Meta-assembly.

(a) Schematic of the transcriptome meta-assembly algorithm using a simplified example with three transfrags transcribed from left to right. The input to the meta-assembly is a list of weighted transfrags (in this case, the weights correspond to FPKM expression values). First, a splice graph is constructed using the transfrag exon boundaries. The splice graph is a directed acyclic graph (DAG) with nodes (rounded rectangular boxes) representing contiguously transcribed genomic bases and edges (arrows) corresponding to possible alternative splicing and promoter usage. The splice graph is then trimmed to remove poorly expressed starting/ending nodes, and adjacent nodes with a degree of one are collapsed. (b) The pruned splice graph from a is subjected to meta-assembly. To encapsulate the splicing pattern information present in the original transfrags, the pruned splice graph is converted into a splicing pattern graph. A splicing pattern graph is a de Bruijn graph where each node represents a group of k consecutive connected nodes from the splice graph (in this example, k = 3), and edges connect adjacent node groups. In real cases, k is automatically chosen to optimize the number of nodes in the splicing pattern graph. Finally, the splicing pattern graph is repeatedly traversed using a greedy dynamic programming algorithm to determine the set of most highly abundant isoforms from the graph. In this example, isoforms ACDE and ABCE recapitulate input transfrags with nearly identical FPKM values, and the invalid isoform combinations ACE and ABCDE are discarded. (c) Genome view showing an example of the meta-assembly procedure for breast cohort transfrags in a chromosome 12q13.3 locus containing the lncRNA HOTAIR and the protein-coding gene HOXC11 on opposite strands (chr. 12: 54,349,995–54,377,376, hg19). In total, 883 transfrags were considered background noise and not used for meta-assembly. A dense cluster of 7,471 expressed transfrags from 1,076 breast RNA-seq libraries was used as input. The aggregated transfrag signal on the positive (+) and negative (–) strands is shown below. Meta-assembly produced 17 transcripts from the transfrags, including transcripts that matched GENCODE HOTAIR and HOXC11 splicing patterns as well as HOTAIR transcripts with unannotated splice sites.
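
The effect of the splicing pattern graph in panel b can be illustrated with a toy example. The Python sketch below is not the authors' implementation: it enumerates every path instead of using the greedy, abundance-weighted traversal described in the caption, it ignores FPKM weights, and the two input node paths are hypothetical, chosen to reproduce the isoforms named in the caption. With k = 1 the graph behaves like a plain splice graph and admits the invalid combinations ACE and ABCDE; with k = 3 only the observed splicing patterns remain.

```python
from collections import defaultdict


def splicing_pattern_graph(paths, k):
    """Build a de Bruijn-style graph over runs of k consecutive nodes."""
    edges = defaultdict(set)
    starts, ends = set(), set()
    for path in paths:
        kmers = [tuple(path[i:i + k]) for i in range(len(path) - k + 1)]
        if not kmers:
            continue  # path shorter than k: skipped in this sketch
        starts.add(kmers[0])
        ends.add(kmers[-1])
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return edges, starts, ends


def enumerate_isoforms(paths, k):
    """List every start-to-end walk through the graph (assumes a DAG)."""
    edges, starts, ends = splicing_pattern_graph(paths, k)
    isoforms = set()

    def walk(node, nodes_so_far):
        if node in ends:
            isoforms.add("".join(nodes_so_far))
        for nxt in edges.get(node, ()):
            # Each step along an edge adds exactly one new splice-graph node.
            walk(nxt, nodes_so_far + [nxt[-1]])

    for start in starts:
        walk(start, list(start))
    return sorted(isoforms)


transfrag_paths = ["ACDE", "ABCE"]  # hypothetical splice-graph node paths
print(enumerate_isoforms(transfrag_paths, k=1))
# ['ABCDE', 'ABCE', 'ACDE', 'ACE']
print(enumerate_isoforms(transfrag_paths, k=3))
# ['ABCE', 'ACDE']
```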

Supplementary Figure 4 Characterization of unannotated transcripts.

(a) Dot plots depicting the comparison of the MiTranscriptome with reference transcripts from RefSeq, UCSC or GENCODE. Precision (blue), precision for the subset of transcripts overlapping annotated transcripts (light blue) and sensitivity (orange) are plotted for each comparison. (b) Dot plots comparing the base-wise, splice-site and splicing pattern precision and sensitivity of MiTranscriptome and GENCODE using lncRNAs from RefSeq (left) or Cabili et al. (right). (c) Bar plots comparing the numbers of unannotated transcripts versus different classes of annotated transcripts for each of the 18 cohorts. Top, stacked bar plot showing annotated ncRNAs (red), pseudogenes (cyan), read-throughs (purple) and protein-coding genes (blue). Bottom, bar plot showing unannotated transcripts (pink).

Supplementary Figure 5 MiTranscriptome characterization.

(a) Density histogram depicting the confidence scores for annotated and unannotated lncRNAs. (b) Comparison of the relationship of the maximum number of exons per gene to the number of isoforms per gene. LncRNAs tend to have fewer exons than protein-coding genes, but they have complex splicing patterns that yield multiple transcript isoforms. (c) Cumulative distribution plot for the base-wise conservation fraction of proteins (blue), read-throughs (purple), pseudogenes (cyan), TUCPs (green) and lncRNAs (red). Random intergenic (black) and intronic (gray) regions are plotted as controls. The inset plot highlights the top 5th percentile of the distribution. (d) Bar plot showing Kolmogorov-Smirnov (KS) test statistics for classes of transcripts versus random intergenic controls. (e) ROC curve for predicting the conservation of protein-coding genes versus random intergenic controls. The cutoff (pink point) chosen for calling highly conserved transcripts is plotted. (f) Cumulative distribution plot for promoter conservation (legend shared with c). The inset plot highlights the top 5th percentile of the distribution. (g) Bar plot showing KS tests for promoter conservation versus random intergenic regions. (h) ROC curve for predicting ultraconserved noncoding elements versus random intergenic regions. The cutoff (pink point) chosen for nominating ultraconserved lncRNAs is plotted.

Supplementary Figure 6 Validation of lncRNA transcripts.

One hundred lncRNA transcripts were validated by qRT-PCR across the A549, LNCaP and MCF-7 cell lines using an approach with or without reverse transcriptase. C_t values were first normalized to housekeeping genes (CHMP2A, EMC7, GPI, PSMB2, PSMB4, RAB7A, REEP5, SNRPD3) and then to the median value of all samples using the ΔΔC_t method. Data are plotted as the logarithm of the fold change over the median, with s.e.m. Validation was performed on (a) 38 monoexonic transcripts and (b) 62 multiexonic transcripts. The boxed transcripts are two representative examples of lncRNAs with lineage/cancer specificity in breast or prostate according to SSEA analysis (Supplementary Table 10) whose cell line expression profile (by qRT-PCR) reflects what is expected from tissue analysis.
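
The two-step normalization described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the function name and the input layout are hypothetical, housekeeping C_t values are combined by their arithmetic mean, amplification efficiency is taken as 100% (so one cycle equals a twofold change), and the logarithm is taken to base 2, none of which is specified in the caption.

```python
from statistics import mean, median


def log2_fold_change_over_median(target_ct, housekeeping_ct):
    """Apply the delta-delta-Ct method for one transcript.

    target_ct:       {sample: Ct of the transcript of interest}
    housekeeping_ct: {sample: [Ct of each housekeeping gene]}
    Returns {sample: log2 fold change relative to the median sample}.
    """
    # Step 1: normalize to housekeeping genes within each sample.
    delta_ct = {s: target_ct[s] - mean(housekeeping_ct[s]) for s in target_ct}
    # Step 2: normalize to the median delta-Ct across all samples.
    reference = median(delta_ct.values())
    # A lower Ct means higher expression, hence the sign flip.
    return {s: -(d - reference) for s, d in delta_ct.items()}


# Hypothetical Ct values for one transcript in the three cell lines.
target = {"A549": 25.0, "LNCaP": 27.0, "MCF-7": 30.0}
housekeeping = {s: [19.0, 21.0] for s in target}

log2_fc = log2_fold_change_over_median(target, housekeeping)
fold_change = {s: 2 ** v for s, v in log2_fc.items()}
print(fold_change["A549"], fold_change["MCF-7"])  # 4.0 0.125
```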

Supplementary Figure 7 Further validation of lncRNA transcripts.

(a) Heat-map representation of the correlation between qPCR (fold change over the median) and RNA-seq (FPKM) for 100 selected transcripts in the A549, LNCaP and MCF-7 cell lines. (b,c) Representative examples of 2 of the 20 previously unannotated lncRNA transcripts that were analyzed by Sanger sequencing to ensure primer specificity, shown with their associated chromatograms. UCSC Genome Browser views show (b) a multiexonic lncRNA (Gene ID: G021137) and (c) a monoexonic lncRNA (Gene ID: G030545).

Supplementary Figure 8 Classification of transcripts of unknown coding potential.

(a) Decision tree showing the categorization of ab initio transcripts. Unannotated transcripts and annotated noncoding RNAs were classified as either lncRNA or TUCP. Transcript categories for protein-coding genes, pseudogenes and read-throughs were imputed from overlapping reference annotations. (b) ROC curve comparing the false positive rate (x axis) with the true positive rate (y axis) for CPAT coding potential predictions of noncoding RNAs versus protein-coding genes. (c) Curve comparing the probability cutoff (x axis) with balanced accuracy (y axis). The dotted line shows the cutoff used to call TUCP transcripts. (d) Scatter plot comparing the frequency of Pfam domain occurrences in non-transcribed intergenic space versus transcribed regions. Points in red were considered valid Pfam domain hits, and points in black were considered artifacts. (e) Three-dimensional scatter plot comparing Fickett score (x axis), ORF size (y axis) and Hexamer score (z axis) for all transcripts. Transcripts represented by red points contain valid Pfam domains, while blue do not. (f–h) Box plots comparing ORF size (f), Hexamer score (g) and Fickett score (h) for lncRNAs (red), TUCPs predicted by Pfam only (yellow), TUCPs predicted by CPAT (green) and TUCPs predicted by both Pfam and CPAT (blue).
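
Panel c relates a probability cutoff to balanced accuracy, which is the mean of the true positive rate and the true negative rate. The Python sketch below shows that calculation for a handful of made-up scores. It assumes that transcripts scoring at or above the cutoff are called coding and that the cutoff is chosen to maximize balanced accuracy; the caption does not state the selection rule, and the function names and scores are hypothetical.

```python
def balanced_accuracy(coding_scores, noncoding_scores, cutoff):
    """Mean of sensitivity and specificity at a given probability cutoff."""
    true_positive_rate = (
        sum(s >= cutoff for s in coding_scores) / len(coding_scores)
    )
    true_negative_rate = (
        sum(s < cutoff for s in noncoding_scores) / len(noncoding_scores)
    )
    return (true_positive_rate + true_negative_rate) / 2


def best_cutoff(coding_scores, noncoding_scores, cutoffs):
    """Return the candidate cutoff with the highest balanced accuracy."""
    return max(
        cutoffs,
        key=lambda c: balanced_accuracy(coding_scores, noncoding_scores, c),
    )


# Made-up coding-probability scores for illustration only.
coding = [0.9, 0.8, 0.7, 0.6]
noncoding = [0.1, 0.2, 0.3, 0.55]

print(best_cutoff(coding, noncoding, [0.25, 0.5, 0.75]))  # 0.5
print(balanced_accuracy(coding, noncoding, 0.5))          # 0.875
```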

Supplementary Figure 9 Enrichment of the MiTranscriptome assembly for disease-associated regions.

(a) Venn diagram comparing the coverage of disease- or trait-associated genomic regions (i.e., GWAS SNPs) for the MiTranscriptome assembly (yellow) in comparison to reference catalogs (blue). (b) Pie charts comparing the distributions of intronic and exonic GWAS SNP coverage of the MiTranscriptome assembly (left) and reference catalogs (right). (c) Dot plot displaying the enrichment of GWAS SNPs versus random SNPs for different transcript categories. Enrichment odds ratios (transcript-SNP overlaps versus shuffled transcript-SNP overlaps) are plotted on the y axis. Points indicate the mean of 100 permutations for tests of enrichment with GWAS SNPs (circle) or random SNPs (diamond), and error bars depict ±2 s.d. of the distribution of odds ratios. Both exonic and whole-transcript enrichment is reported. (d) Dot plot showing the enrichment of GWAS SNPs (circle) versus random SNPs (diamond) for novel intergenic lncRNAs and TUCPs. Enrichment odds ratios (transcript-SNP overlaps versus shuffled transcript-SNP overlaps) are plotted on the y axis. Points indicate the mean of 100 shuffles for comparisons with GWAS SNPs (circle) or random SNPs (diamond), and error bars depict ±2 s.d. of the distribution of odds ratios. Both exonic and whole-transcript enrichment is reported.

Supplementary Figure 10 Discovery of lineage-associated and cancer-associated transcripts.

(a) Heat map of lineage-specific transcripts nominated by SSEA. Each column represents a sample set from 1 of 25 cancer (dark gray) and 13 normal (light gray) lineages, and each row represents an individual transcript. Colored labels above columns reflect the organ system cohorts used in assembly. Row side colors correspond to lncRNAs (red), TUCPs (green), pseudogenes (cyan), read-throughs (purple) and protein-coding transcripts (blue). All transcripts were statistically significant (FDR < 1 × 10⁻⁷) and ranked in the top 1% of the most positively or negatively enriched transcripts within at least one sample set. The heat-map color spectrum corresponds to percentile ranks, with underexpressed transcripts colored blue and overexpressed transcripts colored red. The column dendrogram shows unsupervised hierarchical clustering of the sample sets. (b) Heat map of cancer-specific transcripts (CATs) nominated by SSEA. Columns represent 12 cancer types, and colored column labels reflect the organ system cohorts used in assembly. All transcripts were statistically significant (FDR < 1 × 10⁻³) and ranked in the top 1% of the most positively or negatively enriched transcripts within at least one sample set. The column dendrogram shows unsupervised clustering results. The row side color and heat-map color schemes are identical to those in a.

Supplementary Figure 11 Lineage-specific and cancer-specific transcripts.

(a) Scatter plot grid showing lineage-specific and cancer-specific transcripts nominated by SSEA. A row of scatter plots for each transcript category is plotted across 12 cancer types. Each plot shows the cancer versus normal enrichment score (x axis) and the cancer lineage enrichment score (y axis). Red points indicate cancer- and lineage-associated transcripts within the respective cancer types, and gray points indicate all other cancer- and lineage-associated transcripts. (b,c) Box plots comparing the performance of (b) positively enriched cancer- and lineage-associated transcripts and (c) negatively enriched transcripts for each category across 12 cancer types. The average of the lineage and cancer versus normal enrichment scores (ES) is plotted on the y axis.

Supplementary Figure 12 Examples of cancer- and/or lineage-associated transcripts.

(a) Genomic view of the chromosome 6q26-q27 locus. The protein-coding genes QKI and PDE10A flank an intergenic region with two annotated lncRNAs, AK093114 and AK090788. MiTranscriptome transcripts are shown in a dense view populating this intergenic space. The most zoomed view (bottom) depicts MEAT6, a melanoma-associated lncRNA. AK090788 overlaps a portion of MEAT6, but the full MEAT6 transcript uses an alternate start site (black arrow). (b) Expression data for MEAT6 (demarcated by an asterisk in a). This isoform variant does not use the alternate start site used by MEAT6 and closely resembles AK090788. (c,d) Expression profiles for cancer- and lineage-associated transcripts across all MiTranscriptome tissue cohorts are shown for (c) lung adenocarcinoma and (d) thyroid cancer.

About this article

Cite this article

Iyer, M., Niknafs, Y., Malik, R. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat Genet 47, 199–208 (2015). https://doi.org/10.1038/ng.3192

Received: 20 June 2014
Accepted: 18 December 2014
Published: 19 January 2015
Issue Date: March 2015
DOI: https://doi.org/10.1038/ng.3192

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1-9, 11, 12, 14 and 15

Supplementary Table 10

Supplementary Table 13
