


D330–D338 Nucleic Acids Research, 2019, Vol. 47, Database issue
Published online 5 November 2018
doi: 10.1093/nar/gky1055

The Gene Ontology Resource: 20 years and still GOing strong
The Gene Ontology Consortium†

Received September 22, 2018; Revised October 16, 2018; Editorial Decision October 17, 2018; Accepted October 17, 2018

Downloaded from https://academic.oup.com/nar/article-abstract/47/D1/D330/5160994 by guest on 30 July 2020

ABSTRACT

The Gene Ontology resource (GO; http://geneontology.org) provides structured, computable knowledge regarding the functions of genes and gene products. Founded in 1998, GO has become widely adopted in the life sciences, and its contents are under continual improvement, both in quantity and in quality. Here, we report the major developments of the GO resource during the past two years. Each monthly release of the GO resource is now packaged and given a unique identifier (DOI), enabling GO-based analyses on a specific release to be reproduced in the future. The molecular function ontology has been refactored to better represent the overall activities of gene products, with a focus on transcription regulator activities. Quality assurance efforts have been ramped up to address potentially out-of-date or inaccurate annotations. New evidence codes for high-throughput experiments now enable users to filter out annotations obtained from these sources. GO-CAM, a new framework for representing gene function that is more expressive than standard GO annotations, has been released, and users can now explore the growing repository of these models. We also provide the ‘GO ribbon’ widget for visualizing GO annotations to a gene; the widget can be easily embedded in any web page.

INTRODUCTION

The Gene Ontology resource (GO; http://geneontology.org) is the most comprehensive and widely used knowledgebase concerning the functions of genes. In GO, all functional knowledge is structured and represented in a form amenable to computational analysis, which is essential to support modern biological research. The GO knowledgebase is structured using a formal ontology, by defining classes of gene functions (GO terms) that have specified relations to each other (Figure 1A). GO terms are often given logical definitions, or equivalence axioms, that define the term relative to other terms in the GO or other ontologies, so that their relationships can be computationally inferred using logical reasoning (Figure 1B). The GO structure has been meticulously constructed over the course of 20 years by a small team of ontology developers; it is constantly evolving in response to new scientific discoveries and continuously refined to represent the most current state of biological knowledge. The members of the ontology development team are expert biologists and knowledge representation specialists who read the scientific literature and engage biocurators and biological domain experts to collaboratively develop this representation of biological information.

We present here the most important updates since our last contribution to this series (1). There are currently over 45 000 terms in the ontology, linked by almost 134 000 relations. The ontology covers three distinct aspects of gene function: molecular function (the activity of a gene product at the molecular level), cellular component (the location of a gene product’s activity relative to biological structures), and biological process (a larger biological program in which a gene’s molecular function is utilized).

The GO knowledgebase also includes GO annotations, created by linking specific gene products (from organisms across the tree of life) to the terms in the ontology. Each annotation includes the evidence it is based upon, such as a peer-reviewed publication, using evidence codes from the Evidence and Conclusion Ontology (ECO) (2). For example, in its simplest form (what we refer to as a standard annotation), an annotation might state that ‘human MSH2 (a gene, HGNC:7325, also represented by UniProtKB:P43246) is involved in ‘GO:0006298 DNA mismatch repair’ (a GO term), based on a ‘ECO:0000314 direct assay evidence used in manual assertion’ reported in (4)’. Formally, this annotation would be represented in the knowledgebase as a ‘triple’ linking the gene to the GO term using a specific relation: UniProtKB:P43246 involved in GO:0006298. The GO knowledgebase contains over 7 million annotations to genes/gene products from over 3200 species (http://amigo.geneontology.org/amigo/search/annotation), ∼10% of which (750 000) are supported by experimental data from research papers. Nearly half of these 750 000 experimental annotations refer to genes in a relatively small number of ‘model’ organisms, listed in Table 1. These annotations are made by a consortium of expert biocurators located worldwide, who read scientific papers,

To whom correspondence should be addressed. Email: pdthomas@usc.edu



†Full list provided in Appendix.


© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
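
The ‘triple’ form of a standard annotation described in the Introduction can be sketched in a few lines of Python. This is only an illustration, not the GO knowledgebase's actual data model: the field names, the `involved_in` spelling of the relation, the `ref:4` placeholder for the cited paper (4) and the second record are all assumptions made for the example.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One standard annotation: a gene-to-term triple plus its evidence."""
    subject: str    # gene product identifier
    relation: str   # relation linking the gene product to the GO term
    term: str       # GO term identifier
    evidence: str   # ECO class identifier
    reference: str  # citable source


ANNOTATIONS = [
    # The MSH2 example from the text; "ref:4" stands in for the cited paper (4).
    Annotation("UniProtKB:P43246", "involved_in", "GO:0006298", "ECO:0000314", "ref:4"),
    # Hypothetical record, invented for illustration only.
    Annotation("EXAMPLE:0001", "involved_in", "GO:0006298", "ECO:0000501", "ref:example"),
]


def genes_for_term(annotations, term, evidence=None):
    """Return subjects annotated to `term`, optionally restricted to given ECO classes."""
    return sorted({
        a.subject
        for a in annotations
        if a.term == term and (evidence is None or a.evidence in evidence)
    })


# All genes annotated to DNA mismatch repair in this toy list.
all_genes = genes_for_term(ANNOTATIONS, "GO:0006298")
# Only those supported by direct assay evidence.
direct_assay_genes = genes_for_term(ANNOTATIONS, "GO:0006298", evidence={"ECO:0000314"})
```

Because the evidence travels with each triple, restricting a query to a particular kind of evidence is a filter on the same records rather than a separate lookup.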

ensure the correct gene is identified, and select the most accurate and meaningful GO terms to describe the biology supported by the experimental findings. The accuracy of the GO resource is continually refined by internal checks as well as feedback from the broader GO user community to identify and fix potentially incorrect or out-of-date annotations. The wealth of experimental knowledge manually curated by biocurators is further enriched by inferences from various predictive methods, both manual and automatic, described using classes from ECO as described in (3). In most cases, these annotations are inferred from one or more experimental annotations to a homologous gene product. These may be individually reviewed by a biocurator [denoted by ‘ECO:0000250 sequence similarity evidence used in manual assertion’ (ISS) or ‘ECO:0000318 biological aspect of ancestor evidence used in manual assertion’ (IBA) evidence classes] or not reviewed [denoted by ‘ECO:0000501 evidence used in automatic assertion’ (IEA)].

Figure 1. GO structure. (A) Graphical representation of relationships between terms: black lines represent ‘is a’ and blue lines represent ‘part of’ (representation obtained from https://www.ebi.ac.uk/QuickGO/term/GO:0060887). (B) Equivalence axiom for the term ‘GO:0060887: limb epidermis development’, as displayed in Protégé (12).

This structure of the GO knowledgebase, the ontology plus annotations, supports queries of the sort that are typically asked in the course of biological research, such as: ‘What are all the functions for the human ABCA1 gene?’ or ‘What are all the genes involved in the DNA mismatch repair process?’. Because each annotation is associated with evidence (ECO and reference), computer programs can answer even more specific queries, such as ‘What genes have direct experimental evidence of involvement in the DNA mismatch repair process?’, or ‘Which scientific papers provide experimental evidence about the function of the human ABCA1 gene?’. The ability of the GO knowledgebase to support computational queries is a major reason for its standing as an essential tool in biomedical research. The most obvious example is its use in GO enrichment analysis, also often called pathway analysis (5). For example, a researcher might have identified a set of 1000 genes expressed at a higher level in a cancer sample than in a matched healthy tissue sample, and would like to know if there are any functions (terms from the GO molecular function, cellular component, or biological process aspects) that are unusually common among these 1000 overexpressed genes to understand what may be driving the cancer. To reach this understanding, the functions represented in the set of 1000 genes need to be compared to the functions represented in all 20 000 human protein-coding genes. A computer can use the GO knowledgebase’s structure to rapidly retrieve all the functions that are performed by each of the 20 000 human genes, and create all possible groupings by functional class. Each grouping is tested for statistical enrichment, and the small number of enriched functional classes enables the researcher to identify candidate biological processes within the complex experimental measurement of 20 000 genes.

GO resource content

The GO knowledgebase consists of the ontology and the annotations made using the ontology. As of the 5 September 2018 release (doi:10.5281/zenodo.1410625), there were ∼45 000 terms in GO: 29 698 biological processes, 11 147 molecular functions and 4201 cellular components, linked by almost 134 000 relationships. The number of annotations (as well as the percentage change since 2016 (1)) are shown in Table 1. It is important to understand that the changes reflect two distinct processes: addition of annotations based on new evidence, and obsoleting of annotations that have been superseded by newer studies. We expect the number of obsoleted annotations to increase, due to our increasing annotation quality assurance efforts, described in more detail below.

Table 1. Number of experimental annotations in the GO knowledgebase, 5 September 2018 release (doi:10.5281/zenodo.1410625)

Species          Protein binding,   Molecular function EXP,     Cellular component   Biological process
                 EXP                excluding protein binding   EXP                  EXP
Human            83589 (+158%)      29999 (+26%)                41341 (+13%)         47230 (+22%)
Mouse            12990 (+49%)       14380 (+11%)                28262 (+25%)         67094 (+13%)
Rat              4329 (+2%)         10879 (−9%)                 15693 (+4%)          27241 (−1%)
Zebrafish        509 (+30%)         1756 (+15%)                 1087 (+16%)          21635 (+20%)
Drosophila       1516 (+33%)        5694 (+15%)                 10803 (+3%)          30762 (+1%)
C. elegans       2993 (+13%)        2482 (+13%)                 5245 (+8%)           14511 (+24%)
D. discoideum    690 (+32%)         1081 (+15%)                 2738 (+30%)          4149 (+14%)
S. cerevisiae    168 (+58%)         8886 (+8%)                  17456 (+4%)          20194 (+14%)
S. pombe         2076 (+52%)        4636 (+42%)                 12184 (+8%)          5651 (+11%)
A. thaliana      13074 (+113%)      8344 (+14%)                 25486 (+7%)          25223 (+12%)
E. coli          3602 (+57%)        6006 (+20%)                 4171 (+7%)           5756 (+5%)

Note that for the molecular function annotations, we present annotations to ‘GO:0005515 protein binding’ separately from other GO:0003674 molecular functions, as these GO:0005515 annotations are used differently than other annotations (the class itself is not very informative, but each annotation includes additional information about the specific binding partner). The number of annotations for the main species annotated by the GOC are shown, and the percentage change relative to the 2016 update is indicated in parentheses.

New framework and repository for gene function ‘models’. We have developed a more expressive computational framework for representing gene functions, which subsumes our current GO annotation framework, while maintaining compatibility. We refer to the framework as GO-Causal Activity Modeling (GO-CAM, formerly referred to as ‘LEGO’ (1)) and to GO-CAMs as models to distinguish them from standard annotations. A paper detailing GO-CAM is in preparation, but we summarize some properties here. In GO-CAM, each model is represented as a set of triples (subject-relation-object, with brackets {} as a set container), e.g. {ABCA1 enables cholesterol transporter activity; cholesterol transporter activity occurs in plasma membrane, and cholesterol transporter activity part of cholesterol homeostasis}. Each triple is supported by one or more pieces of evidence, consisting of a class from ECO and a citable source, usually a scientific publication. GO-CAM specifies the semantics of GO annotations, and how standard GO annotations can be combined into a larger model. Each GO-CAM model is represented using the Web Ontology Language (OWL), which is converted computationally to standard GO annotations (GAF format), ensuring backward compatibility. Users can browse, view, and download the available models in different formats at: http://geneontology.org/go-cam. The number of models is currently small, and most models contain only one gene product (standard annotations extended with additional contextual information, such as cell type). The GO is currently rapidly increasing the curation of GO-CAM models, and the model repository is growing. In particular, models are now available that each represent an entire regulatory or metabolic pathway.

Changes to data access: new production pipeline. Starting in March 2018, the releases of the GO resource have been generated by a new data production pipeline using a refreshed software stack. This system allows for easily extensible error checking and improved reporting of quality assurance checks that ensure the quality and integrity of the released ontology and annotations. For users, one of the most important aspects of this pipeline is that the GO resource now produces monthly releases (named by release date) that are available at the GO site and can be referenced and obtained as stable Digital Object Identifiers (DOIs) via Zenodo. This feature is critical for ensuring that GO-based analyses can be replicated in a consistent and referenceable manner through the inclusion of these DOIs and/or version of both the ontology and annotation files used (6). Our data production pipeline is currently hosted at the Lawrence Berkeley National Laboratory. We provide GO annotations in multiple formats: as standard GAF (Gene Association Format) and GPAD (Gene Product Association Data) annotation files, in Turtle (OWL serialization format) [http://current.geneontology.org/products/ttl], and as a Blazegraph journal [http://current.geneontology.org/products/blazegraph], replacing the legacy MySQL output.

Interactions with the GO user community

GO is an open project, and we encourage community contributions to the knowledgebase and software.

All users: GO can be contacted using the GO Helpdesk (http://help.geneontology.org) for any questions or feedback about the annotations, the ontology, software, or other GO resources. If users notice an annotation that may not be correct, they should first review the original publication or data source. If the annotation still seems inaccurate, users are encouraged to report this to the GO helpdesk, and GOC members will review the annotation and remove or modify it if justified.

Authors: Authors can now see if their paper has been used for GO annotation directly in PubMed. Under the ‘LinkOut - more resources’ section of the PubMed abstract page, papers with annotations will have a link labeled ‘Gene Ontology annotations from this paper - Gene Ontology’ (see, e.g. https://www.ncbi.nlm.nih.gov/pubmed/3357510 which links to a web page on the GO site that shows the annotations based on evidence in that paper). If no GO LinkOut is present, that may indicate that the publication has not been used for GO annotation. Authors can contact the helpdesk at the GO website to suggest new annotations or changes to existing annotations.

Resources or consortia: The GOC collaborates with established data resources and other groups and consortia representing a particular area of biology. Recent examples include cilium biology (7,8), autophagy (9) and cardiac phenotypes (10). We encourage members of other interest groups to contact us to improve the ontology and annotations in their areas of expertise.

Tracking all contributions: Most aspects of the GO project management are now based in GitHub (https://github.com/geneontology). In addition to tracking ontology change requests (https://github.com/geneontology/go-ontology), we now use GitHub to collect feedback on annotations (https://github.com/geneontology/go-annotation). We encourage users familiar with GitHub to submit any requests directly to GitHub, where they can follow all further discussion and actions. Otherwise, issues and queries can be submitted to our helpdesk (http://help.geneontology.org).
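
As a rough illustration of working with the GAF annotation files listed under ‘Changes to data access’, the sketch below reads tab-separated GAF lines into dictionaries. It assumes the 17-column layout of GAF 2.x as commonly documented; the column names are labels chosen for this example, and the sample line (including the `PMID:0000000` reference and the `EXAMPLE` source) is invented rather than taken from a real release, so check the format specification for the release in use before relying on it.

```python
import io

# Assumed GAF 2.x column order; the names are labels chosen for this sketch.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]

# Invented sample content: one header comment and one annotation line.
SAMPLE_GAF = "\n".join([
    "!gaf-version: 2.1",
    "\t".join([
        "UniProtKB", "P43246", "MSH2", "", "GO:0006298", "PMID:0000000",
        "IDA", "", "P", "", "", "protein", "taxon:9606", "20180905",
        "EXAMPLE", "", "",
    ]),
])


def read_gaf(handle):
    """Yield one dict per annotation line, skipping '!' comments and blank lines."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Pad short lines so every column name gets a value.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


records = list(read_gaf(io.StringIO(SAMPLE_GAF)))
first = records[0]
gene_id = first["db"] + ":" + first["db_object_id"]
```

For a downloaded file, `read_gaf(open(path, encoding="utf-8"))` would be used in place of the in-memory sample; recording the release DOI alongside the results keeps the analysis traceable to a specific release.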

Increased focus on annotation quality control

The GO resource is now 20 years old. The longevity of the resource adds the challenge of maintaining and updating the many existing annotations, as many of the findings published during that time have become much more precise, or were reinterpreted or superseded. We have made it a high priority to identify and correct inaccurate and out-of-date legacy annotations to make sure that GO continues to consistently reflect current knowledge. We have taken a number of different approaches to tackle this challenge. First, to ensure consistency and quality, GO biocurators meet regularly for training, establishment of annotation guidelines, and coordinated review of specific areas of biology. More recently, we have made significant efforts to integrate annotation review with ontology improvements, taking advantage of suggested changes to the ontology to clarify term definitions and intended usage, and to coordinate annotation practices with curators. In addition, quality assurance is performed centrally, both computationally, to ensure annotations are valid, and manually, to ensure they accurately represent the experimental findings.

We have discovered that one of the most powerful approaches to quality control and consistency is the phylogenetic approach. Originally developed as a means of propagating annotations from experimentally studied genes to evolutionarily related genes in other species, the phylogenetic perspective provides a unified view of all experimental annotations within an evolutionarily related protein family, allowing curators to more easily find outlier annotations (see e.g. (11)). In parallel, the development of GO-CAM models has been useful in identifying inconsistent annotation practices, and has provided opportunities to develop consortium-wide annotation guidelines. Another observation that emerged is that older annotations from isolated phenotypic observations, taken outside of other contextual data, often do not provide evidence of direct involvement of a gene in a biological process. If inconsistencies are noticed, they are reported to the contributing group for verification and correction as appropriate.

In a pilot quality assurance effort, we have requested the review (by GO Consortium biocurators) of ∼2500 manual annotations (<0.01% of the total corpus) that were judged questionable by one of the strategies above. Approximately 70–80% of the annotations flagged for review were modified to a more appropriate term or removed. We will continue to work on improving the quality of the annotations and reviewing legacy data when appropriate. As a result, we expect that the increase in annotations and new ontology terms may not be as rapid as in the past, at least for the main species annotated by the consortium members, as a greater proportion of our efforts will be dedicated to reviewing and revising older annotations.

Ontology revision and integration

Since our last update article, we have developed an entirely new process for ontology editing and maintenance that has dramatically increased efficiency and enabled extensive real-time quality checks. Ontology editing is now performed in an OWL-based environment using the ontology editing tool Protégé (12). The ontology is versioned and tracked using a GitHub repository (https://github.com/geneontology/go-ontology). One major advantage of the new ontology management process is that the work can be parallelized among multiple editors, thus increasing efficiency. In addition, real-time quality checks prevent errors that would otherwise require revisiting the same editing task more than once to correct them.

GO continues to integrate and align with external ontology resources in two main ways: import of sub-ontologies used to define GO terms, and inclusion of external cross-references. GO utilizes the structure of external ontologies to aid in reasoning and in the automatic inference of relations between GO terms (13). GO imports subsets of these external ontologies that include information about anatomical structures, cell types, chemicals and taxonomic groupings: Uberon (14), Protein Ontology (15), Plant Ontology (16), ChEBI (17), Relations Ontology (18), NCBI Taxonomy (19), Sequence Ontology (20), Ontology of Biological Attributes (http://www.obofoundry.org/ontology/oba.html), Fungal Anatomy Ontology (http://www.obofoundry.org/ontology/fao.html), Phenotypic Quality Ontology (http://obofoundry.org/ontology/pato.html), and Common Anatomy Reference Ontology (http://www.obofoundry.org/ontology/caro.html). GO also maintains cross-references between terms and multiple widely used external resources, including Reactome (21), The Annotated Reactions Database (Rhea) (22), Enzyme Commission (EC; http://www.sbcs.qmul.ac.uk/iubmb/enzyme/), IntAct, Complex Portal (23) and MetaCyc (24).

Refactoring the molecular function branch of GO. Previously, there was a trend in GO molecular function ontology development to focus on adding terms that describe molecular binding activities of specific gene products. The advantage of such terms is that annotations can often be made unequivocally based on results from a single experiment. However, this approach has led to a complex ontology structure and a proliferation of annotations that individually represent only a partial functional description of a gene product. Annotations to binding terms can obscure annotations to more informative function terms, making annotations more difficult to interpret. For example, one can annotate CDK1 (UniProtKB:P06493) separately with ‘GO:0030332 cyclin binding’, ‘GO:0005524 ATP binding’, ‘GO:0005515 protein binding’, and ‘GO:0004674 protein serine/threonine kinase activity’. However, these are all aspects of a more precise molecular function that is more informative than the sum of these parts: ‘GO:0004693 cyclin-dependent protein serine/threonine kinase activity’. In the GO molecular function refactoring, we recognized that, while specific binding events are an essential mechanism by which gene products function, an individual binding activity is almost never sufficient in itself to describe molecular function in a larger biological context (25). One of the primary goals of our refactoring was to ensure that the ontology contains the terms necessary to describe these higher-level functions, and that they have a path to the root of the ontology that is not simply under the generic ‘GO:0005488 binding’ term. Accordingly, we reinstated some previously obsoleted terms and added new terms, as well as additional

relations. We also addressed the structure of the ontology so that the upper-level terms would be more biologically meaningful and have more uniform specificity. We removed eight terms from the top level and added four new terms (Figure 2). Most of the terms that were formerly direct children of ‘GO:0003674 molecular function’ were moved under more biologically meaningful terms: for example, ‘GO:0042056 chemoattractant activity’ and ‘GO:0045499 chemorepellent activity’ were moved under ‘GO:0048018 receptor ligand activity’, while ‘GO:0036370 D-alanyl carrier activity’ and ‘GO:0016530 metallochaperone activity’ were moved under the new term ‘GO:0140104 molecular carrier activity’ (representing an activity of ‘directly binding to a specific ion or molecule and delivering it either to an acceptor molecule or to a specific location’). We have also created a new term ‘GO:0104005 hijacked molecular function’ as a parent of terms such as ‘GO:0001618 virus receptor activity’, which is, from the standpoint of the protein being annotated, not a normal function, but nevertheless relevant for some of our users.

Finally, we have made significant changes to the structure representing the molecular functions of transcription factors (Figure 3). This refactoring was carried out in collaboration with experts in transcription factors and gene regulation from the Gene Regulation Consortium (GRECO; http://thegreco.org). In keeping with our design principle of having terms that describe higher-level functions, we have created a new parent term to group all functions that directly regulate transcription, ‘GO:0140110 transcription regulator activity’. The formerly top-level term ‘GO:0000988 transcription factor activity, protein binding’ has been obsoleted because this activity was partly covered by other terms in the ontology and its usage was inconsistent. Accordingly, its children have either been obsoleted or subsumed under different terms (merged or moved). The new top-level term ‘GO:0140110 transcription regulator activity’ has three main children: ‘GO:0003700 DNA-binding transcription factor activity’ (formerly labeled ‘transcription factor activity, sequence-specific DNA binding’), ‘GO:0140223 general transcription initiation factor activity’ and ‘GO:0003712 transcription coregulator activity’.

The transcription factor areas of GO had previously been refactored between 2010 and 2012 (26,27) with the aim of more finely capturing all combinations of different types of protein and DNA binding activities (e.g. binding to different types of regulatory regions such as promoters and enhancers) and transcription regulation processes (positive and negative regulation). However, this structure, while very precise, has proved very difficult for biocurators to use, resulting in inconsistent annotations. Additionally, end-users have had difficulty with common queries, such as comprehensively identifying the set of all transcription factors in a given species. We expect that even more improvements to the ontology structure, as well as more consistent annotations to transcription regulator terms, will be available in 2019.

Defining the boundaries of biological processes: MAP kinase signaling and extracellular matrix as examples. We have used our integrated annotation review and ontology development methodology to refine two areas of the ontology, the MAP kinase signaling pathway and the representation of the extracellular matrix. Refinements to the MAPK cascade included defining the molecular functions that are parts of the process: ‘GO:0004707 MAP kinase activity’, ‘GO:0004708 MAP kinase kinase activity’, ‘GO:0004709 MAP kinase kinase kinase activity’ and ‘GO:0008349 MAP kinase kinase kinase kinase activity’. All other upstream and downstream molecular functions/biological processes will be modeled in GO-CAM with causal relationships between them and the MAPK cascade. We also enumerated the types of cascades based on current literature and on discussions among expert model-organism curators, trying to keep the distinctions useful across multiple taxa. There are four direct subtypes of ‘GO:0000165 MAPK cascade’: ‘GO:0070371 ERK1 and ERK2 cascade’, ‘GO:0070375 ERK5 cascade’, ‘GO:0071507 pheromone response MAPK cascade’ and ‘GO:0051403 stress-activated MAPK cascade’. Some other MAPK processes such as ‘GO:1903616 MAPK cascade involved in axon regeneration’ will eventually be obsoleted, as these combine two or more other GO terms and can be composed in GO-CAM. For the refinement of the extracellular matrix area of the ontology, we worked with external experts to add useful grouping terms such as ‘GO:0062023 collagen-containing extracellular matrix’. We also obsoleted or merged terms that were poorly annotated and thought to represent an outdated view.

GO subsets (slims). A GO subset (or slim) is a set of GO terms selected to provide an overview of the functions, locations or roles of a set of genes. The subset can be developed for high coverage of specific species, or to represent only certain areas of the ontology, and in most cases contains only high-level GO terms to provide a broad biological overview. Another use of subsets is to blacklist certain terms for annotation: GO has two such subsets, one to flag terms that should not be used for manual annotation, and one for terms that should not be used at all. GO maintains two additional subsets, the Generic GO slim and the Alliance of Genome Resources (https://www.alliancegenome.org/) slim. GO also hosts subsets useful to groups using GO; we currently have 11 such subsets (Table 2; http://www.geneontology.org/page/go-subset-guide). Each subset provides a global overview of gene functions. Each subset now has a designated contact person to resolve any issue resulting from ontology changes (see Ontology revision and integration).

Other developments

The GO ribbon: a configurable tool for visualizing GO annotations. Many genes have large numbers of annotations, making it difficult to get a quick overview of a gene’s function, or the functions of gene sets. We have developed the GO ribbon specifically to help users visualize and explore the functions of a gene. The GO ribbon visualization metaphor borrows from a viewer originally developed by the Mouse Genome Database team (28), but in contrast, the GO ribbon was developed as a lightweight, reusable widget


Figure 2. The molecular function branch, before and after refactoring. The term marked with an ‘x’ in the left-hand panel has been obsoleted. Terms moved (assigned to a new parent) are indicated by arrows. New terms (right panel) are marked with ‘NEW’. ¹The class label ‘electron carrier activity’ was changed to ‘electron transfer activity’.
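
Applying a GO subset (slim) to a gene, as the GO ribbon does when it maps annotated terms onto a subset ‘using the ontology structure’, amounts to walking from each annotated term up to its ancestors and counting the subset terms reached. The following is a minimal sketch under stated assumptions, not the ribbon's actual implementation: the graph holds only the three MAPK cascade subtype links named in the text, the two-term subset and the gene's term list are invented for the example, and real mappings also have to account for relation types.

```python
from collections import deque

# Toy child -> parents map using only subtype links stated in the text.
PARENTS = {
    "GO:0070371": {"GO:0000165"},  # ERK1 and ERK2 cascade -> MAPK cascade
    "GO:0070375": {"GO:0000165"},  # ERK5 cascade -> MAPK cascade
    "GO:0051403": {"GO:0000165"},  # stress-activated MAPK cascade -> MAPK cascade
}

# Hypothetical subset (slim) and hypothetical annotated terms for one gene.
SUBSET = {"GO:0000165", "GO:0030154"}
GENE_TERMS = ["GO:0070371", "GO:0051403", "GO:0006298"]


def ancestors_and_self(term, parents):
    """Return the term together with every term reachable by following parent links."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def map_to_subset(terms, subset, parents):
    """Count, for each subset term, how many annotated terms fall at or under it."""
    counts = {slim_term: 0 for slim_term in subset}
    for term in terms:
        for hit in ancestors_and_self(term, parents) & subset:
            counts[hit] += 1
    return counts


ribbon_counts = map_to_subset(GENE_TERMS, SUBSET, PARENTS)
```

In this toy run two of the gene's terms roll up to ‘GO:0000165 MAPK cascade’, none reach ‘GO:0030154 cell differentiation’, and ‘GO:0006298’ maps to no subset term; in a ribbon-style display such counts would correspond to darker and white boxes.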

that can be embedded in any website, and retrieves data directly from the GO resource via API.

To generate a GO ribbon, all the functions (GO terms) associated with a gene of interest are mapped onto a specified GO subset using the ontology structure. The end result is a simple graphical representation of a gene’s functions (Figure 4). The ribbon is interactive, allowing users to drill down to more specific functions by selecting a high-level category such as ‘GO:0030154 cell differentiation’, ‘GO:0050877 nervous system process’, or ‘GO:0003700 DNA-binding transcription factor activity’, and to filter the functions based on the evidence codes provided in the GO annotations. This overview of gene functions is particularly useful when comparing the functions of different genes in the same species, or the functions of orthologous genes across different species.

The GO ribbon is a React component available on GitHub (https://github.com/geneontology/ribbon) and as an NPM package (https://www.npmjs.com/package/@geneontology/ribbon). The GO ribbon widget is currently used by the Alliance of Genome Resources.

Figure 3. Current structure of the ‘GO:0140110 transcription regulator activity’ branch of the ontology.

Table 2. Subsets maintained in GO

GO subsets
  Generic GO subset
  GO slim AGR
  GO do not annotate list
  GO do not manually annotate list

Subsets from external groups
  Subset                                  Group
  Aspergillus subset                      Aspergillus Genome Data
  Candida albicans                        Candida Genome Database
  Chembl Drug Target subset               ChEMBL
  FlyBase Ribbon slim                     FlyBase
  Metagenomics subset                     EBI Metagenomics group
  Mouse GO slim                           MGI
  Plant subset                            The Arabidopsis Information Resource
  Protein Information Resource subset     PIR
  Schizosaccharomyces pombe subset        PomBase
  Synapse GO slim                         SynGO
  Yeast subset                            Saccharomyces Genome Database

GO annotations from high-throughput experiments. Data from high-throughput experiments are generally collected in a hypothesis-free manner, and consequently do not generally provide as strong evidence of gene function as the small-scale molecular biology experiments that currently support most of the experimental GO annotations. In addition, high-throughput experiments can be subject to relatively high false positive rates. Users may therefore want to filter out these experimental annotations in some applications of the GO. To make this possible, starting in 2018, in collaboration with the Evidence and Conclusion Ontology (2,29), the GO has added several new evidence codes to describe high-throughput experiments: ‘ECO:0006056 high throughput evidence used in manual assertion’ (HTP), and the subclasses: ‘ECO:0007005 high throughput direct assay
Figure 4. GO ribbon representation. Darker boxes indicate terms with the most annotations; white boxes represent terms that are not annotated for this
protein (Mus musculus Sox7, MGI:98369). Screenshot obtained from https://www.alliancegenome.org/gene/MGI:98369.
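
The mapping behind a ribbon such as the one in Figure 4 can be illustrated with a short sketch. The Python below is not the code of the GO ribbon widget (which is a React component); it is a minimal illustration, under stated assumptions, of rolling a gene's annotated terms up to the terms of a subset through the ontology's parent links, with an optional filter on evidence codes. The `TOY:` identifiers, the parent edges, the annotations, and the function names `ancestors` and `ribbon_counts` are all hypothetical; only the three GO identifiers and the five high-throughput evidence codes come from the text. The sketch also treats every parent link alike, whereas a real implementation would need to decide which relation types to follow.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# High-throughput evidence codes named in the text.
HTP_CODES: FrozenSet[str] = frozenset({"HTP", "HDA", "HMP", "HGI", "HEP"})

# Subset terms used as ribbon categories; these IDs appear in the text.
SUBSET: List[str] = ["GO:0030154", "GO:0050877", "GO:0003700"]

# Hypothetical child -> parents edges; the TOY: terms are invented for this sketch.
PARENTS: Dict[str, List[str]] = {
    "TOY:0001": ["GO:0030154"],
    "TOY:0002": ["TOY:0001", "GO:0050877"],
    "TOY:0003": ["GO:0003700"],
}

# Hypothetical (term, evidence code) annotations for a single gene.
# IDA and IMP are standard GO experimental evidence codes, not mentioned in the text.
ANNOTATIONS: List[Tuple[str, str]] = [
    ("TOY:0001", "IDA"),
    ("TOY:0002", "HDA"),
    ("TOY:0003", "IMP"),
    ("TOY:0003", "HMP"),
]


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus every term reachable via parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def ribbon_counts(
    annotations: List[Tuple[str, str]],
    parents: Dict[str, List[str]],
    subset: List[str],
    excluded_evidence: FrozenSet[str] = frozenset(),
) -> Dict[str, int]:
    """Count, for each subset term, the annotations that roll up to it."""
    counts = {term: 0 for term in subset}
    for term, evidence in annotations:
        if evidence in excluded_evidence:
            continue  # skip annotations with unwanted evidence codes
        for category in ancestors(term, parents):
            if category in counts:
                counts[category] += 1
    return counts


if __name__ == "__main__":
    print(ribbon_counts(ANNOTATIONS, PARENTS, SUBSET))
    print(ribbon_counts(ANNOTATIONS, PARENTS, SUBSET, HTP_CODES))
```

With this toy data, excluding the high-throughput codes leaves the ‘GO:0050877 nervous system process’ category with no annotations, which a ribbon would draw as a white box.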

evidence used in manual assertion’ (HDA), ‘ECO:0007001 high throughput mutant phenotype evidence used in manual assertion’ (HMP), ‘ECO:0007003 high throughput genetic interaction evidence used in manual assertion’ (HGI) and ‘ECO:0007007 high throughput expression pattern evidence used in manual assertion’ (HEP). To accompany the new evidence codes, we have provided annotation guidelines to help identify and curate high-throughput datasets that meet the GO Consortium annotation criteria. Consortium members have reviewed papers with more than 40 annotations using a single evidence code, and updated the evidence codes, or removed the annotations if appropriate. There are currently over 31 000 annotations that have HTP evidence codes from 140 research articles, representing <5% of experimental GO annotations. The identification of annotations derived from high-throughput experiments allows users to choose to exclude these from their analyses, if they are concerned that these annotations may lead to an increased bias in data analysis. This is likely to be particularly important when, as is often the case, GO is used to interpret types of data similar to those on which the annotations are based.

CONCLUSIONS

The GO resource has been under continuous development for 20 years, with no signs of slowing down. Both the ontology and annotations continue to be updated steadily, in response to new experimental findings concerning gene function, and accumulating knowledge of how genes function together in larger systems. The GO Consortium is increasing efforts to review annotations, especially those that are older and may have been superseded by newer findings. GO has always been an open, community project, and we hope that users of GO will contact us with suggestions for how we can improve the resource. GO releases are now monthly, with persistent DOIs, and we recommend that all published GO-based analyses cite the DOI of the release used, to enable reproducibility. GO-CAM, our new framework for defining and representing gene functions with more accuracy, consistency and precision, is being used to create a growing set of curated biological models, and we encourage the analysis tool developer community to explore the new format and potential new applications of these models.

ACKNOWLEDGMENTS

We would like to thank the domain experts Peter Yurchenco, Sylvie Ricard-Blum, Rachel Lennon, Geoff Meyer, David Sherwood and Jeff Miner for discussions leading to the refinement of the extracellular matrix area. We also want to thank all the contributors to the GO resource over the last 20 years (http://geneontology.org/page/acknowledgments-contributors), and all the authors of papers represented in the GO knowledgebase (https://www.ncbi.nlm.nih.gov/pubmed/?term=loprovGeneOntol[SB]).

FUNDING

The GO resource is supported by a grant from the National Human Genome Research Institute [U41 HG02273 to P.D.T., P.W.S., S.E.L., J.M.C., J.A.B. and supplements to grant U41 HG001315 to J.M.C., U24 HG002223 to P.W.S.]. In addition, GO Consortium members are also supported by diverse funding sources: dictyBase is supported by the National Institute of General Medical Sciences [GM064426, GM087371 to R.L.C.]; The EcoliWiki group is supported by the National Institutes of Health [GM089636]; National Science Foundation [1565146]; EMBL-EBI is funded by EMBL core funds; FlyBase is supported by the UK Medical Research Council [MR/N030117/1]; National Human Genome Research Institute [U41HG000739]; InterPro is funded by the Wellcome Trust [108433/Z/15/Z]; Biotechnology and Biological Sciences Research Council [BB/N00521X/1, BB/N019172/1, BB/L024136/1 to RDF]; The Institute for Genome Sciences GO-related work on ECO is supported by the National Science Foundation [1458400]; The Gene Regulation Consortium (GRECO) is supported by Gene Regulation Ensemble Effort for the Knowledge Commons (GREEKC) COST Action [grant CA15205]; A.L. and M.L.A. are also supported by the Research Council of Norway [project 247727]; The Institute of Cardiovascular Science, University College London (R. Lovering’s group) is supported by British Heart Foundation [RG/13/5/30112]; Parkinson’s UK [G-1307]; National Institute for Health Research University College London Hospitals Biomedical Research Centre; IntAct and the Complex Portal are supported by the European Molecular Biology Laboratory core funds; PomBase is supported by the Wellcome Trust
[104967/Z/14/Z to S.G.O.]; MGI is supported by the National Human Genome Research Institute [HG 000330, HG 002273]; RGD is supported by the National Heart, Lung, and Blood Institute [HL 64541]; The UniProt Consortium is supported by the National Eye Institute, National Human Genome Research Institute, National Heart, Lung and Blood Institute, National Institute of Allergy and Infectious Diseases, National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of General Medical Sciences, and National Institute of Mental Health of the National Institutes of Health under Award Number [U24HG007822], National Human Genome Research Institute under Award Numbers [U41HG007822 and U41HG002273]; National Institute of General Medical Sciences under Award Numbers [R01GM080646, P20GM103446 and U01GM120953]; Biotechnology and Biological Sciences Research Council [BB/M011674/1]; the British Heart Foundation [RG/13/5/30112]; Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI); European Molecular Biology Laboratory core funds; The TAIR project is funded by academic institutional, corporate, and individual subscriptions. TAIR is administered by the 501(c)(3) non-profit Phoenix Bioinformatics; WormBase is supported by the US National Human Genome Research Institute [U24-HG002223]; UK Medical Research Council [MR/L001220]; UK Biotechnology and Biological Sciences Research Council [BB/K020080]; ZFIN is also supported by the National Human Genome Research Institute [U41 HG002659 to M.W.]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies. Funding for open access charges: National Human Genome Research Institute [U41 HG02273].

Conflict of interest statement. None declared.

REFERENCES

1. The Gene Ontology Consortium (2017) Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res., 45, D331–D338.
2. Chibucos,M.C., Mungall,C.J., Balakrishnan,R., Christie,K.R., Huntley,R.P., White,O., Blake,J.A., Lewis,S.E. and Giglio,M. (2014) Standardized description of scientific evidence using the Evidence Ontology (ECO). Database J. Biol. Databases Curation, 2014, bau075.
3. Gaudet,P., Škunca,N., Hu,J.C. and Dessimoz,C. (2017) Primer on the gene ontology. Methods Mol. Biol., 1446, 25–37.
4. Fishel,R., Ewel,A. and Lescoe,M.K. (1994) Purified human MSH2 protein binds to DNA containing mismatched nucleotides. Cancer Res., 54, 5539–5542.
5. Khatri,P., Sirota,M. and Butte,A.J. (2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol., 8, e1002375.
6. Griffin,P.C., Khadake,J., LeMay,K.S., Lewis,S.E., Orchard,S., Pask,A., Pope,B., Roessner,U., Russell,K., Seemann,T. et al. (2017) Best practice data life cycle approaches for the life sciences [version 2; referees: 2 approved]. F1000Research, 6, 1618.
7. Christie,K.R. and Blake,J.A. (2018) Sensing the cilium, digital capture of ciliary data for comparative genomics investigations. Cilia, 7, 3.
8. Roncaglia,P., van Dam,T.J.P., Christie,K.R., Nacheva,L., Toedt,G., Huynen,M.A., Huntley,R.P., Gibson,T.J. and Lomax,J. (2017) The Gene Ontology of eukaryotic cilia and flagella. Cilia, 6, 10.
9. Denny,P., Feuermann,M., Hill,D.P., Lovering,R.C., Plun-Favreau,H. and Roncaglia,P. (2018) Exploring autophagy with Gene Ontology. Autophagy, 14, 419–436.
10. Lovering,R.C., Roncaglia,P., Howe,D.G., Laulederkind,S.J.F., Khodiyar,V.K., Berardini,T.Z., Tweedie,S., Foulger,R.E., Osumi-Sutherland,D., Campbell,N.H. et al. (2018) Improving interpretation of cardiac phenotypes and enhancing discovery with expanded knowledge in the Gene Ontology. Circ. Genomic Precis. Med., 11, e001813.
11. Feuermann,M., Gaudet,P., Mi,H., Lewis,S.E. and Thomas,P.D. (2016) Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes. Database J. Biol. Databases Curation, 2016, baw155.
12. Musen,M.A. and Protégé Team. (2015) The Protégé Project: a look back and a look forward. AI Matters, 1, 4–12.
13. Hill,D.P., Blake,J.A., Richardson,J.E. and Ringwald,M. (2002) Extension and integration of the gene ontology (GO): combining GO vocabularies with external vocabularies. Genome Res., 12, 1982–1991.
14. Mungall,C.J., Torniai,C., Gkoutos,G.V., Lewis,S.E. and Haendel,M.A. (2012) Uberon, an integrative multi-species anatomy ontology. Genome Biol., 13, R5.
15. Natale,D.A., Arighi,C.N., Blake,J.A., Bona,J., Chen,C., Chen,S.C., Christie,K.R., Cowart,J., D’Eustachio,P., Diehl,A.D. et al. (2017) Protein Ontology (PRO): enhancing and scaling up the representation of protein entities. Nucleic Acids Res., 45, D339–D346.
16. Cooper,L., Walls,R.L., Elser,J., Gandolfo,M.A., Stevenson,D.W., Smith,B., Preece,J., Athreya,B., Mungall,C., Rensing,S. et al. (2013) The plant ontology as a tool for comparative plant anatomy and genomic analyses. Plant Cell Physiol., 54, e1.
17. Hastings,J., Owen,G., Dekker,A., Ennis,M., Kale,N., Muthukrishnan,V., Turner,S., Swainston,N., Mendes,P. and Steinbeck,C. (2016) ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Res., 44, D1214–D1219.
18. Smith,B., Ceusters,W., Klagges,B., Köhler,J., Kumar,A., Lomax,J., Mungall,C., Neuhaus,F., Rector,A.L. and Rosse,C. (2005) Relations in biomedical ontologies. Genome Biol., 6, R46.
19. Federhen,S. (2012) The NCBI Taxonomy database. Nucleic Acids Res., 40, D136–D143.
20. Mungall,C.J., Batchelor,C. and Eilbeck,K. (2011) Evolution of the sequence ontology terms and relationships. J. Biomed. Inform., 44, 87–93.
21. Fabregat,A., Jupe,S., Matthews,L., Sidiropoulos,K., Gillespie,M., Garapati,P., Haw,R., Jassal,B., Korninger,F., May,B. et al. (2018) The reactome pathway knowledgebase. Nucleic Acids Res., 46, D649–D655.
22. Morgat,A., Lombardot,T., Axelsen,K.B., Aimo,L., Niknejad,A., Hyka-Nouspikel,N., Coudert,E., Pozzato,M., Pagni,M., Moretti,S. et al. (2017) Updates in Rhea - an expert curated resource of biochemical reactions. Nucleic Acids Res., 45, 4279.
23. Meldal,B.H.M., Bye-A-Jee,H., Gajdoš,L., Hammerová,Z., Horáčková,A., Melicher,F., Perfetto,L., Pokorný,D., Rodriguez Lopez,M., Türková,A. et al. (2019) Complex Portal 2018: extended content and enhanced visualization tools for macromolecular complexes. Nucleic Acids Res., doi:10.1093/nar/gky1001.
24. Caspi,R., Billington,R., Fulcher,C.A., Keseler,I.M., Kothari,A., Krummenacker,M., Latendresse,M., Midford,P.E., Ong,Q., Ong,W.K. et al. (2018) The MetaCyc database of metabolic pathways and enzymes. Nucleic Acids Res., 46, D633–D639.
25. Thomas,P.D. (2017) The gene ontology and the meaning of biological function. Methods Mol. Biol., 1446, 15–24.
26. Gene Ontology Consortium. (2012) The Gene Ontology: enhancements for 2011. Nucleic Acids Res., 40, D559–D564.
27. Tripathi,S., Christie,K.R., Balakrishnan,R., Huntley,R., Hill,D.P., Thommesen,L., Blake,J.A., Kuiper,M. and Lægreid,A. (2013) Gene Ontology annotation of sequence-specific DNA binding transcription factors: setting the stage for a large-scale curation effort. Database J. Biol. Databases Curation, 2013, bat062.
28. Smith,C.L., Blake,J.A., Kadin,J.A., Richardson,J.E., Bult,C.J. and Mouse Genome Database Group (2018) Mouse Genome Database (MGD)-2018: knowledgebase for the laboratory mouse. Nucleic Acids Res., 46, D836–D842.
29. Chibucos,M.C., Siegele,D.A., Hu,J.C. and Giglio,M. (2017) The Evidence and Conclusion Ontology (ECO): Supporting GO annotations. Methods Mol. Biol., 1446, 245–259.

APPENDIX OF AUTHORS

Berkeley Bioinformatics Open-Source Projects (BBOP), Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory (Berkeley, CA, USA): S. Carbon*, E. Douglass, N. Dunn, B. Good, N.L. Harris, S.E. Lewis, C.J. Mungall; dictyBase, Northwestern University (Chicago, IL, USA): S. Basu, R.L. Chisholm, R.J. Dodson, E. Hartline, P. Fey; Division of Bioinformatics, Department of Preventive Medicine, University of Southern California (Los Angeles, CA, USA): P.D. Thomas*, L.P. Albou*, D. Ebert, M.J. Kesling, H. Mi, A. Muruganujan, X. Huang, S. Poudel, T. Mushayahama; EcoliWiki, Departments of Biology and Biochemistry and Biophysics, Texas A&M University (College Station, TX, USA): J.C. Hu, S.A. LaBonte, D.A. Siegele; FlyBase, Department of Physiology, Development and Neuroscience, University of Cambridge (Cambridge, UK): G. Antonazzo, H. Attrill, N.H. Brown, S. Fexova, P. Garapati, T.E.M. Jones, S.J. Marygold, G.H. Millburn, A.J. Rey, V. Trovisco; FlyBase, The Biological Laboratories, Harvard University (Cambridge, USA): G. dos Santos, D.B. Emmert, K. Falls, P. Zhou; FlyBase, Department of Biology, Indiana University (Bloomington, USA): J.L. Goodman, V.B. Strelets, J. Thurmond; GO-EMBL-EBI (Hinxton, UK): M. Courtot, D. Osumi-Sutherland, H. Parkinson, P. Roncaglia; Gene Regulation Consortium (GRECO), Norwegian University of Science and Technology (Trondheim, Norway): M.L. Acencio, M. Kuiper, A. Lægreid; Gene Regulation Consortium (GRECO), Radboud University (Nijmegen, The Netherlands): C. Logie; Institute of Cardiovascular Science, University College London (London, UK): R.C. Lovering, R.P. Huntley, P. Denny, N.H. Campbell, B. Kramarz, V. Acquaah, S.H. Ahmad, H. Chen, J.H. Rawson; Institute for Genome Sciences, University of Maryland School of Medicine (Baltimore, MD, USA): M.C. Chibucos, M. Giglio, S. Nadendla, R. Tauber; IntAct/Complex Portal, EMBL-EBI (Hinxton, UK): M.J. Duesbury, N. Del-Toro, B.H.M. Meldal, L. Perfetto, P. Porras, S. Orchard, A. Shrivastava, Z. Xie; InterPro, EMBL-EBI (Hinxton, UK): H.Y. Chang, R.D. Finn, A.L. Mitchell, N.D. Rawlings, L. Richardson, A. Sangrador-Vegas; Mouse Genome Informatics, The Jackson Laboratory (Bar Harbor, ME, USA): J.A. Blake, K.R. Christie, M.E. Dolan, H.J. Drabkin, D.P. Hill*, L. Ni, D. Sitnikov; PomBase, University of Cambridge (Cambridge, UK): M.A. Harris, S.G. Oliver, K. Rutherford, V. Wood; PomBase, The Francis Crick Institute (London, UK): J. Hayles; PomBase, University College London (London, UK): J. Bahler, A. Lock; RGD, Medical College of Wisconsin (Milwaukee, WI, USA): E.R. Bolton, J. De Pons, M. Dwinell, G.T. Hayman, S.J.F. Laulederkind, M. Shimoyama, M. Tutaj, S.-J. Wang; Reactome, Department of Biochemistry & Molecular Pharmacology, NYU School of Medicine (New York, NY, USA): P. D’Eustachio, L. Matthews; Renaissance Computing Institute, University of North Carolina (Chapel Hill, NC, USA): J.P. Balhoff; SGD, Department of Genetics, Stanford University (Stanford, CA, USA): S.A. Aleksander, G. Binkley, B.L. Dunn, J.M. Cherry, S.R. Engel, F. Gondwe, K. Karra, K.A. MacPherson, S.R. Miyasato, R.S. Nash, P.C. Ng, T.K. Sheppard, A. Shrivatsav VP, M. Simison, M.S. Skrzypek, S. Weng, E.D. Wong; SIB Swiss Institute of Bioinformatics (Geneva, Switzerland): M. Feuermann, P. Gaudet*; TAIR, Phoenix Bioinformatics (Fremont, CA, USA): E. Bakker, T.Z. Berardini, L. Reiser, S. Subramaniam, E. Huala; UniProt: EMBL-EBI (Hinxton, UK), SIB Swiss Institute of Bioinformatics (SIB) (Geneva, Switzerland), and Protein Information Resource (PIR) (Washington, DC, USA and Newark, DE, USA): C. Arighi, A. Auchincloss, K. Axelsen, G. Argoud-Puy, A. Bateman, B. Bely, M.-C. Blatter, E. Boutet, L. Breuza, A. Bridge, R. Britto, H. Bye-A-Jee, C. Casals-Casas, E. Coudert, A. Estreicher, L. Famiglietti, P. Garmiri, G. Georghiou, A. Gos, N. Gruaz-Gumowski, E. Hatton-Ellis, U. Hinz, C. Hulo, A. Ignatchenko, F. Jungo, G. Keller, K. Laiho, P. Lemercier, D. Lieberherr, Y. Lussi, A. MacDougall, M. Magrane, M.J. Martin, P. Masson, D.A. Natale, N. Hyka-Nouspikel, I. Pedruzzi, K. Pichler, S. Poux, C. Rivoire, M. Rodríguez-López, T. Sawford, E. Speretta, A. Shypitsyna, A. Stutz, S. Sundaram, M. Tognolli, N. Tyagi, K. Warner, R. Zaru, C. Wu; University at Buffalo, Department of Biomedical Informatics (Buffalo, NY, USA): A.D. Diehl; WormBase, California Institute of Technology (Pasadena, CA, USA), Wellcome Trust Sanger Institute (Hinxton, UK), EBI (Hinxton, UK), and Ontario Institute for Cancer Research (Toronto, Canada): J. Chan, J. Cho, S. Gao, C. Grove, M.C. Harrison, K. Howe, R. Lee, J. Mendel, H.-M. Muller, D. Raciti, K. Van Auken*, M. Berriman, L. Stein, P.W. Sternberg; ZFIN, University of Oregon (Eugene, OR, USA): D. Howe, S. Toro, M. Westerfield.

*Authors who contributed significantly to writing this manuscript.
