Rfam: annotating non-coding RNAs in complete genomes.

Griffiths-Jones S¹, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A

Affiliations

1. The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

Nucleic Acids Research, 01 Jan 2005, 33(Database issue):D121-4
https://doi.org/10.1093/nar/gki081 PMID: 15608160 PMCID: PMC540035

Abstract

Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essential in all three kingdoms of life, with large numbers of smaller families specific to certain taxa. Recent improvements in the database are discussed, together with challenges for the future. Rfam is available on the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/.

Published online 2004 Dec 17. https://doi.org/10.1093/nar/gki081

Sam Griffiths-Jones,* Simon Moxon, Mhairi Marshall, Ajay Khanna,¹ Sean R. Eddy,¹ and Alex Bateman

INTRODUCTION

Non-coding RNA (ncRNA) genes produce a functional RNA product instead of a translated protein. These products are components of some of the most important cellular machines, such as the ribosome (ribosomal RNAs), the spliceosome (U1, U2, U4, U5 and U6 RNAs) and the telomerase (telomerase RNA). The known repertoire of ncRNA cellular functions is expanding rapidly. Small nucleolar RNAs (snoRNAs) guide essential modifications of ribosomal and spliceosomal RNAs [reviewed in (1)]. Ribozymes catalyse a range of reactions, such as self-cleavage of hepatitis delta virus transcripts, and 5′ maturation of transfer RNAs (tRNAs) by the ubiquitous RNase P. A class of small RNAs almost unknown before 2000, the microRNAs (miRNAs), are found to be involved in regulation of ever more processes in higher eukaryotes—including development, cell death and fat metabolism—by repressing the translation of mRNA targets [reviewed in (2)]. Similar mRNA-binding regulatory roles in bacteria are fulfilled by distinct families of small RNAs [reviewed in (3)].

Like protein-coding genes, ncRNA sequences can be grouped into families and much can be learnt about structure and function from multiple sequence alignments of such families. Unlike proteins, ncRNAs often conserve a base-paired secondary structure with low primary sequence similarity. The combined secondary structure and primary sequence profile of a multiple sequence alignment of ncRNAs can be captured by statistical models, called profile stochastic context-free grammars (SCFGs), analogous to profile hidden Markov models (HMMs) of protein alignments.
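The contrast between conserved base pairing and low primary sequence similarity can be illustrated with a toy calculation. The sketch below is not a profile SCFG and is not how INFERNAL scores sequences; it only measures, for a hypothetical three-sequence alignment with an assumed bracket-notation consensus structure, how often the paired columns hold a canonical or G-U pair and how similar the sequences are column by column. The sequences, the structure string and all function names are invented for illustration.

```python
from itertools import combinations

# Hypothetical toy alignment of one hairpin; none of these sequences come from Rfam.
ALIGNMENT = [
    "GGCAGGAACUGCC",
    "CAUGCGAAGCAUG",
    "AUGCAGAAUGCAU",
]
# Assumed consensus structure: '<' and '>' mark paired columns, '.' unpaired ones.
STRUCTURE = "<<<<<...>>>>>"

# Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(structure):
    """Return sorted (i, j) column pairs from a nested bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "<":
            stack.append(i)
        elif ch == ">":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def fraction_paired(alignment, structure):
    """Fraction of (sequence, column pair) combinations that form a canonical pair."""
    pairs = paired_columns(structure)
    ok = sum((seq[i], seq[j]) in CANONICAL for seq in alignment for i, j in pairs)
    return ok / (len(pairs) * len(alignment))


def mean_identity(alignment):
    """Mean pairwise fraction of identical columns between aligned sequences."""
    scores = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(alignment, 2)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print("paired fraction:", fraction_paired(ALIGNMENT, STRUCTURE))
    print("mean identity:  ", round(mean_identity(ALIGNMENT), 3))
```

In this toy case every stem position covaries, so the structure is fully preserved while the sequences agree only in the loop. A model that scores columns independently cannot use the pairing signal, which is the motivation for a model that scores paired columns jointly.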

Rfam is a database of ncRNA families represented by multiple sequence alignments and profile SCFGs, available via the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/. All the data are also available for download, local installation and sequence searching using the INFERNAL software package (http://infernal.wustl.edu/) (4). The Rfam/INFERNAL model is much like the Pfam/HMMER system (5), extended to deal with RNA secondary structure consensus, and has been discussed previously (6). Here, we concentrate on recent improvements and discuss challenges that we expect to address through future development.

RECENT DEVELOPMENTS

The database has grown dramatically over the past two years: from 25 families annotating around 55000 regions in the nucleotide sequence databases in release 1.0, to 379 families annotating over 280000 regions in release 6.1. This growth is partly due to a significant increase in scope. The evolution of some large gene families, such as miRNAs and snoRNAs, is constrained partially by inter-molecular base-pairing, and thus they do not conserve significant sequence or secondary structure. While we cannot therefore represent all C/D box snoRNAs, or all miRNAs, with a single alignment and model, subfamilies are conserved and are now well represented in the database. Rfam also now includes not only bona fide ncRNA genes, but also structured regions of mRNA transcripts. These fall into two broad classes: self-splicing introns and cis-regulatory elements in the untranslated regions (UTRs). The latter can be used as detectors for a wide range of environmental conditions [e.g. bacterial riboswitches bind a range of metabolites as reviewed previously (7,8), and the 5′-UTR of PrfA acts as a temperature-dependent switch (9)] to regulate message stability or translational efficiency.

This increased scope has led to the introduction of a limited type ontology, with the top-level types representing the three classes of structured RNA discussed above—‘Gene’, ‘Intron’ and ‘Cis-reg’. The database currently contains 308 gene families, 69 cis-regulatory elements and two self-splicing introns. The type field provides one of the primary entry points for family browsing and searching, enabling the user to quickly identify all snoRNA gene families for instance, or to find all riboswitches in the database.
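As a rough illustration of browsing by type, the sketch below stores each family with a type path and filters on any term in that path. The record layout, the semicolon-separated encoding, the subtype terms and the family names are all assumptions made for this example; only the three top-level types and the per-type counts are taken from the text.

```python
from collections import Counter

# Family counts per top-level type, as quoted in the text.
RELEASE_COUNTS = {"Gene": 308, "Cis-reg": 69, "Intron": 2}

# Hypothetical records: (family name, type path). The "; "-separated encoding
# and the subtype terms are assumptions for illustration.
FAMILIES = [
    ("tRNA", "Gene; tRNA"),
    ("snoRNA_example_1", "Gene; snRNA; snoRNA; CD-box"),
    ("snoRNA_example_2", "Gene; snRNA; snoRNA; HACA-box"),
    ("riboswitch_example", "Cis-reg; riboswitch"),
    ("thermosensor_example", "Cis-reg; thermoregulator"),
    ("intron_example", "Intron"),
]


def type_path(type_string):
    """Split a type string into its ordered list of ontology terms."""
    return [part.strip() for part in type_string.split(";") if part.strip()]


def families_of_type(families, term):
    """Names of all families whose type path contains the given term."""
    return [name for name, type_string in families if term in type_path(type_string)]


def top_level_counts(families):
    """Count families by the first (top-level) term of their type path."""
    return Counter(type_path(type_string)[0] for _, type_string in families)


if __name__ == "__main__":
    print(families_of_type(FAMILIES, "snoRNA"))
    print(dict(top_level_counts(FAMILIES)))
    # The quoted per-type counts add up to the 379 families of release 6.1.
    print(sum(RELEASE_COUNTS.values()))
```

Keeping the full path rather than a single label is one way to let a query at any level, such as all snoRNAs or all riboswitches, reuse the same lookup.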

One of the primary uses of the Rfam database is to search for homologues of known RNAs in a query sequence, including a complete genome. Indeed, the profile SCFG library has been used to annotate a number of newly sequenced genomes [e.g. Caenorhabditis briggsae (10), chicken (11) and Erwinia carotovora (12)]. In addition, we calculate hits in over 200 complete genomes and chromosomes. These data are available through the web interface and are discussed briefly in the following section.

NON-CODING RNAS IN COMPLETE GENOMES

Rfam makes available annotation of over 13400 candidate ncRNA genes (plus 172 self-splicing introns and 1285 cis-regulatory RNA elements) belonging to 172 families in 224 completed chromosomes and genomes. The average bacterial genome contains over 80 hits, dominated by the number of tRNAs. A total of 170 regions are annotated in Escherichia coli, in which most experimental validation of computationally predicted ncRNAs has been carried out. Rfam-annotated regions in Bacillus genomes (B.anthracis is shown in Figure 1) include a number of recently described riboswitches (7,8).

Figure 1

Rfam genome page for Bacillus anthracis. The table contains a summary of the number of members of each Rfam family in the genome, with the distribution of hits shown on the map.

These data provide the first comprehensive view of the distribution of ncRNAs in the three kingdoms of life. There are a small number of very large families representing some of the best-understood RNAs. Figure 2 shows that these few large families are the only RNAs that are ubiquitous between all three domains of life—only the essential translation components, tRNA and ribosomal RNA, together with RNase P (tRNA maturation) and SRP RNA (protein export) are found in eukaryotes, bacteria and archaea. It is tempting to believe that very few families will be added to the catalogue of universally conserved RNAs. However, it is clear that members of some families are highly divergent so as to be computationally almost unrecognizable. For example, although most eukaryotes would be expected to have a telomerase RNA, current computational techniques are unable to identify homologues in even well-studied model organisms such as Caenorhabditis elegans.

Figure 2

Taxonomic distribution of Rfam family members in the three kingdoms of life.

Only snoRNAs are found in eukaryotes and archaea and not in bacteria, but RNA families have not yet been identified that are common to bacteria and archaea but not eukaryotes, or eukaryotes and bacteria but not archaea. The vast majority of Rfam families are small, and are often specific to one taxonomic group, and in some cases to one organism, suggesting relatively recent evolution of function or divergence beyond our ability to recognize homologues. Many novel bacterial ncRNAs have been identified by a number of recent computational screens in E.coli [reviewed in (13)], but comparatively few have been experimentally verified. Rfam contains more than 30 ncRNA families based on the verified genes. Few large-scale studies have been conducted in archaea or eukaryotes, and it is clear that such efforts will identify many more small families.
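The taxonomic distribution described above amounts to grouping families by the set of kingdoms in which they have members. The following sketch shows that grouping on a toy table containing only the families named in the text, plus one placeholder bacterial family. The table is illustrative and is not the Rfam data behind Figure 2.

```python
ALL_KINGDOMS = frozenset({"eukaryotes", "bacteria", "archaea"})

# Toy presence table. The first five entries follow statements in the text;
# the last two are simplified or placeholder entries for illustration.
PRESENCE = {
    "tRNA": {"eukaryotes", "bacteria", "archaea"},
    "ribosomal RNA": {"eukaryotes", "bacteria", "archaea"},
    "RNase P": {"eukaryotes", "bacteria", "archaea"},
    "SRP RNA": {"eukaryotes", "bacteria", "archaea"},
    "snoRNA": {"eukaryotes", "archaea"},
    "miRNA": {"eukaryotes"},
    "bacterial_sRNA_example": {"bacteria"},
}


def group_by_distribution(presence):
    """Map each distinct set of kingdoms to the sorted families found in exactly that set."""
    groups = {}
    for family, kingdoms in presence.items():
        groups.setdefault(frozenset(kingdoms), []).append(family)
    return {key: sorted(names) for key, names in groups.items()}


if __name__ == "__main__":
    groups = group_by_distribution(PRESENCE)
    print("universal:", groups[ALL_KINGDOMS])
    print("eukaryotes + archaea only:", groups[frozenset({"eukaryotes", "archaea"})])
    # Combinations with no families are simply absent from the result.
    print("bacteria + archaea only:", groups.get(frozenset({"bacteria", "archaea"}), []))
```

An empty group here can mean either that no such family exists or that its homologues have diverged beyond recognition, a distinction the grouping itself cannot make.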

FUTURE CHALLENGES

Profile SCFG searches are computationally expensive. Rfam at present uses a BLAST-based heuristic (14) as described previously (6), reducing the search space with an inevitable sensitivity cost. This allows us to search a 5 Mb bacterial genome against the entire Rfam library in ~24 h. Annotation of large eukaryotic genomes is just feasible using this approach. Recent advances allow the speed of profile SCFGs to be increased by a factor of ~100 for most families, and provably do not reduce the sensitivity of the full SCFG search (15). Work is ongoing to incorporate such algorithms into the Rfam/INFERNAL approach. We also recognize that the current approach is restricted to RNAs with defined secondary structures, precluding inclusion of important families of essentially unstructured RNAs like XIST (X-Inactive Specific Transcript), RoX (RNA on X) and IPW (Imprinted in Prader–Willi). Furthermore, the consensus structure annotation may conceal additional elements in divergent structures. We plan to evaluate how the use of profile HMMs may allow the detection of homologues of unstructured RNAs, and investigate the propagation of structure annotation at the sequence level.
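The idea of reducing the search space with a fast sequence filter can be sketched as follows: each filter hit is padded by a window, overlapping windows are merged, and only the merged regions are passed to the expensive model. This is a minimal sketch under stated assumptions, not the Rfam pipeline itself; the padding size, the 0-based half-open coordinate convention, the hit coordinates and the function names are all hypothetical.

```python
def candidate_regions(hits, pad, genome_length):
    """Pad each (start, end) hit by `pad` bases and merge overlapping regions.

    Coordinates are assumed to be 0-based and half-open.
    """
    padded = sorted(
        (max(0, start - pad), min(genome_length, end + pad)) for start, end in hits
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def retained_fraction(regions, genome_length):
    """Fraction of the genome that still has to be searched with the full model."""
    return sum(end - start for start, end in regions) / genome_length


if __name__ == "__main__":
    genome_length = 5_000_000  # a 5 Mb genome, as in the text
    hits = [(1000, 1080), (1100, 1170), (250_000, 250_090)]  # hypothetical filter hits
    regions = candidate_regions(hits, pad=100, genome_length=genome_length)
    print(regions)
    print(retained_fraction(regions, genome_length))
```

Any homologue that the filter misses never reaches the full model, which is the sensitivity cost noted above. A larger padding or a looser hit threshold trades speed for sensitivity.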

Perhaps the biggest challenge for annotation of higher eukaryotic genomes is the problem of ncRNA-derived pseudogenes and repeats. For example, the B2 repeat in mouse is evolutionarily related to a tRNA, and Alu repeats in human derive from SRP RNA (16). Over 10% of the draft human genome sequence is made up of 1.1 million Alu sequences (17), and there are over 350000 B2 repeat sequences in mouse (18). The human genome also contains over 1000 sequences that are closely related to U6 spliceosomal RNA, yet sensible estimates of the U6 gene count suggest that <50 are functional. Other problem families include the polIII transcribed Y and 7SK RNAs. Distinguishing the functional copies from the large numbers of pseudogenes is an unsolved problem and presents a significant challenge to RNA computational biologists.
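The scale of the problem can be checked with simple arithmetic on the figures quoted above. The sketch below assumes a typical Alu length of about 300 bases and a genome size of about 3 Gb; neither value is given in the text, and many genomic Alu copies are truncated, so the result is only a rough consistency check.

```python
# Values quoted in the text.
ALU_COUNT = 1_100_000
U6_RELATED_SEQUENCES = 1000   # "over 1000" sequences related to U6
U6_FUNCTIONAL_MAX = 50        # "<50" estimated functional copies

# Assumptions not stated in the text.
ALU_LENGTH_BP = 300           # assumed typical full-length Alu
GENOME_BP = 3_000_000_000     # assumed approximate human genome size

# Share of the genome covered by Alu sequences under these assumptions.
alu_fraction = ALU_COUNT * ALU_LENGTH_BP / GENOME_BP

# Upper bound on the share of U6-related sequences that are functional genes.
u6_functional_fraction = U6_FUNCTIONAL_MAX / U6_RELATED_SEQUENCES

if __name__ == "__main__":
    print(f"Alu share of genome: about {alu_fraction:.0%}")
    print(f"U6-related sequences that are functional: at most {u6_functional_fraction:.0%}")
```

Under these assumptions the Alu share comes out near the quoted figure of over 10%, and at most about one U6-related sequence in twenty would be a functional gene.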

It seems likely that computational and experimental screens will continue to identify numerous novel ncRNAs. Most of these genes are predicted to fall into small families with narrow taxonomic ranges. In contrast, we believe that very few universally conserved RNAs will be found, and the large, well-studied and ubiquitous families will continue to make up the large majority of ncRNAs in a single genome. Rfam will continue to translate novel discoveries of ncRNA genes into alignments and models that are immediately useful for genome annotation and phylogenetic analysis.

ACKNOWLEDGEMENTS

We thank all those who have contributed data and annotation and developed tools and algorithms for ncRNA detection, alignment and structure prediction. Work at the Sanger Institute is funded by the Wellcome Trust. A.K. and S.R.E. are supported by the Howard Hughes Medical Institute, the NIH National Human Genome Research Institute and Alvin Goldfarb.

REFERENCES

1. Bachellerie,J.P., Cavaille,J. and Huttenhofer,A. (2002) The expanding snoRNA world. Biochimie, 84, 775–790.

2. Bartel,D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297.

3. Storz,G., Opdyke,J.A. and Zhang,A. (2004) Controlling mRNA stability and translation with small, noncoding RNAs. Curr. Opin. Microbiol., 7, 140–144.

4. Eddy,S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.

5. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L. et al. (2003) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

6. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and Eddy,S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441.

7. Mandal,M. and Breaker,R.R. (2004) Gene regulation by riboswitches. Nature Rev. Mol. Cell. Biol., 5, 451–463.

8. Vitreschak,A.G., Rodionov,D.A., Mironov,A.A. and Gelfand,M.S. (2004) Riboswitches: the oldest mechanism for the regulation of gene expression? Trends Genet., 20, 44–50.

9. Johansson,J., Mandin,P., Renzoni,A., Chiaruttini,C., Springer,M. and Cossart,P. (2002) An RNA thermosensor controls expression of virulence genes in Listeria monocytogenes. Cell, 110, 551–561.

10. Stein,L.D., Bao,Z., Blasiar,D., Blumenthal,T., Brent,M.R., Chen,N., Chinwalla,A., Clarke,L., Clee,C., Coghlan,A. et al. (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol., 1, E45.

11. International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, in press.

12. Bell,K.S., Sebaihia,M., Pritchard,L., Holden,M.T., Hyman,L.J., Holeva,M.C., Thomson,N.R., Bentley,S.D., Churcher,L.J., Mungall,K. et al. (2004) Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors. Proc. Natl Acad. Sci. USA, 101, 11105–11110.

13. Hershberg,R., Altuvia,S. and Margalit,H. (2003) A survey of small RNA-encoding genes in Escherichia coli. Nucleic Acids Res., 31, 1813–1820.

14. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

15. Weinberg,Z. and Ruzzo,W.L. (2004) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, 20, I334–I341.

16. Weiner,A.M., Deininger,P.L. and Efstratiadis,A. (1986) Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu. Rev. Biochem., 55, 631–661.

17. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

18. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

