The Gene Ontology Resource: 20 Years and Still Going Strong
Received September 22, 2018; Revised October 16, 2018; Editorial Decision October 17, 2018; Accepted October 17, 2018
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research, 2019, Vol. 47, Database issue D331
Figure 1. GO structure. (A) Graphical representation of relationships between terms: black lines represent is a and blue lines represent part of (representation obtained from https://www.ebi.ac.uk/QuickGO/term/GO:0060887). (B) Equivalence axiom for the term ‘GO:0060887: limb epidermis development’, as displayed in Protégé (12).

ensure the correct gene is identified, and select the most accurate and meaningful GO terms to describe the biology supported by the experimental findings. The accuracy of the GO resource is continually refined by internal checks as well as feedback from the broader GO user community to identify and fix potentially incorrect or out-of-date annotations. The wealth of experimental knowledge manually curated by biocurators is further enriched by inferences from various predictive methods, both manual and automatic, described using classes from ECO as described in (3). In most cases, these annotations are inferred from one or more experimental annotations to a homologous gene product. These may be individually reviewed by a biocurator [denoted by ‘ECO:0000250 sequence similarity evidence used in manual assertion’ (ISS) or ‘ECO:0000318 biological aspect of ancestor evidence used in manual assertion’ (IBA) evidence classes] or not reviewed [denoted by ‘ECO:0000501 evidence used in automatic assertion’ (IEA)].

This structure of the GO knowledgebase, the ontology plus annotations, supports queries of the sort that are typically asked in the course of biological research, such as: ‘What are all the functions for the human ABCA1 gene?’ or ‘What are all the genes involved in the DNA mismatch

GO resource content

The GO knowledgebase consists of the ontology and the annotations made using the ontology. As of the 5 September 2018 release (doi:10.5281/zenodo.1410625), there were ∼45 000 terms in GO: 29 698 biological processes, 11 147 molecular functions and 4201 cellular components, linked by almost 134 000 relationships. The number of annotations (as well as the percentage change since 2016 (1)) are shown in Table 1. It is important to understand that the changes reflect two distinct processes: addition of annotations based on new evidence, and obsoleting of annotations that have been superseded by newer studies. We expect the number of obsoleted annotations to increase, due to our increasing annotation quality assurance efforts, described in more detail below.

New framework and repository, for gene function ‘models’. We have developed a more expressive computational framework for representing gene functions, which subsumes our current GO annotation framework, while maintaining compatibility. We refer to the framework as GO-Causal Activity Modeling (GO-CAM, formerly referred to as ‘LEGO’ (1)) and to GO-CAMs as models to distinguish them from standard annotations. A paper detailing GO-CAM is in preparation, but we summarize some properties here. In GO-CAM, each model is represented as a set of triples (subject-relation-object, with brackets {} as a set container), e.g. {ABCA1 enables cholesterol transporter activity; cholesterol transporter activity occurs in plasma membrane, and cholesterol transporter activity part of cholesterol homeostasis}. Each triple is supported by one or more pieces of evidence, con-
Table 1. Number of experimental annotations in the GO knowledgebase, 5 September 2018 release (doi:10.5281/zenodo.1410625)

Species         Protein binding, EXP   Molecular function EXP,      Cellular component EXP   Biological process EXP
                                       excluding protein binding
Human           83589 (+158%)          29999 (+26%)                 41341 (+13%)             47230 (+22%)
Mouse           12990 (+49%)           14380 (+11%)                 28262 (+25%)             67094 (+13%)
Rat             4329 (+2%)             10879 (−9%)                  15693 (+4%)              27241 (−1%)
Zebrafish       509 (+30%)             1756 (+15%)                  1087 (+16%)              21635 (+20%)
Drosophila      1516 (+33%)            5694 (+15%)                  10803 (+3%)              30762 (+1%)
C. elegans      2993 (+13%)            2482 (+13%)                  5245 (+8%)               14511 (+24%)
D. discoideum   690 (+32%)             1081 (+15%)                  2738 (+30%)              4149 (+14%)
S. cerevisiae   168 (+58%)             8886 (+8%)                   17456 (+4%)              20194 (+14%)
S. pombe        2076 (+52%)            4636 (+42%)                  12184 (+8%)              5651 (+11%)
A. thaliana     13074 (+113%)          8344 (+14%)                  25486 (+7%)              25223 (+12%)

Note that for the molecular function annotations, we present annotations to ‘GO:0005515 protein binding’ separately from other GO:0003674 molecular
functions, as these GO:0005515 annotations are used differently than other annotations (the class itself is not very informative, but each annotation includes
additional information about the specific binding partner). The numbers of annotations for the main species annotated by the GOC are shown, and the
percentage change relative to the 2016 update is indicated in parentheses.
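
The separation of ‘GO:0005515 protein binding’ from other molecular function annotations described in the note to Table 1 can be sketched in code. The example below is a minimal illustration, assuming annotations are available as tab-separated GAF 2.x lines in which column 5 holds the GO ID, column 7 the evidence code and column 9 the aspect (F, P or C). The set of evidence codes treated as experimental, the helper make_row and the sample rows are assumptions made for illustration; this is not the procedure used to produce Table 1.

```python
from collections import Counter

# Assumption: evidence codes counted as "experimental" in this sketch.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
PROTEIN_BINDING = "GO:0005515"
ASPECT_NAMES = {
    "F": "molecular function (excluding protein binding)",
    "P": "biological process",
    "C": "cellular component",
}


def count_experimental(gaf_lines):
    """Count experimental annotations per Table 1 category from GAF lines."""
    counts = Counter()
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # malformed row
        go_id, evidence, aspect = cols[4], cols[6], cols[8]
        if evidence not in EXPERIMENTAL_CODES:
            continue
        if go_id == PROTEIN_BINDING:
            counts["protein binding"] += 1
        elif aspect in ASPECT_NAMES:
            counts[ASPECT_NAMES[aspect]] += 1
    return counts


def make_row(symbol, go_id, evidence, aspect):
    """Hypothetical helper: build a minimal 15-column GAF row with placeholder values."""
    cols = ["ExampleDB", "X00000", symbol, "", go_id, "PMID:0000000", evidence,
            "", aspect, "", "", "protein", "taxon:9606", "20180905", "Example"]
    return "\t".join(cols)


# Made-up rows for illustration only; gene symbols and references are placeholders.
SAMPLE = [
    "!gaf-version: 2.1",
    make_row("GENE1", "GO:0005515", "IPI", "F"),  # protein binding, experimental
    make_row("GENE1", "GO:0004707", "IDA", "F"),  # other molecular function
    make_row("GENE2", "GO:0060887", "IMP", "P"),  # biological process
    make_row("GENE2", "GO:0004707", "IEA", "F"),  # automatic assertion, ignored
]

counts = count_experimental(SAMPLE)
for category, n in sorted(counts.items()):
    print(category, n)
```

Counting protein binding in its own category keeps a large number of binding-partner annotations from dominating the molecular function totals, which is the reason the table reports the two columns separately.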
sisting of a class from ECO and a citable source, usually a scientific publication. GO-CAM specifies the semantics of GO annotations, and how standard GO annotations can be combined into a larger model. Each GO-CAM model is represented using the Web Ontology Language (OWL), which is converted computationally to standard GO annotations (GAF format), ensuring backward compatibility. Users can browse, view, and download the available models in different formats at: http://geneontology.org/go-cam. The number of models is currently small, and most models contain only one gene product (standard annotations extended with additional contextual information, such as cell type). The GO is rapidly increasing the curation of GO-CAM models, and the model repository is growing. In particular, models are now available that each represent an entire regulatory or metabolic pathway.

Changes to data access: new production pipeline. Starting in March 2018, the releases of the GO resource have been generated by a new data production pipeline using a refreshed software stack. This system allows for easily extensible error checking and improved reporting of quality assurance checks that ensure the quality and integrity of the released ontology and annotations. For users, one of the most important aspects of this pipeline is that the GO resource now produces monthly releases (named by release date) that are available at the GO site and can be referenced and obtained as stable Digital Object Identifiers (DOIs) via Zenodo. This feature is critical for ensuring that GO-based analyses can be replicated in a consistent and referenceable manner through the inclusion of these DOIs and/or version of both the ontology and annotation files used (6). Our data production pipeline is currently hosted at the Lawrence Berkeley National Laboratory. We provide GO annotations in multiple formats: as standard GAF (Gene Association Format) and GPAD (Gene Product Association Data) annotation files, in Turtle (OWL serialization format) [http://current.geneontology.org/products/ttl], and a Blazegraph [http://current.geneontology.org/products/blazegraph] journal, replacing the legacy MySQL output.

Interactions with the GO user community

GO is an open project, and we encourage community contributions to the knowledgebase and software.

All users: GO can be contacted using the GO Helpdesk (http://help.geneontology.org) for any questions or feedback about the annotations, the ontology, software, or other GO resources. If users notice an annotation that may not be correct, they should first review the original publication or data source. If the annotation still seems inaccurate, users are encouraged to report this to the GO helpdesk, and GOC members will review the annotation and remove or modify it if justified.

Authors: Authors can now see if their paper has been used for GO annotation directly in PubMed. Under the ‘LinkOut - more resources’ section of the PubMed abstract page, papers with annotations will have a link labeled ‘Gene Ontology annotations from this paper - Gene Ontology’ (see, e.g. https://www.ncbi.nlm.nih.gov/pubmed/3357510 which links to a web page on the GO site that shows the annotations based on evidence in that paper). If no GO LinkOut is present, that may indicate that the publication has not been used for GO annotation. Authors can contact the helpdesk at the GO website to suggest new annotations or changes to existing annotations.

Resources or consortia: The GOC collaborates with established data resources and other groups and consortia representing a particular area of biology. Recent examples include cilium biology (7,8), autophagy (9) and cardiac phenotypes (10). We encourage members of other interest groups to contact us to improve the ontology and annotations in their areas of expertise.

Tracking all contributions: Most aspects of the GO project management are now based in GitHub (https://github.com/geneontology). In addition to tracking ontology change requests (https://github.com/geneontology/go-ontology), we now use GitHub to collect feedback on annotations (https://github.com/geneontology/go-annotation). We encourage users familiar with GitHub to submit any requests directly to GitHub, where they can follow all further discussion and actions. Otherwise, issues and queries can be submitted to our helpdesk (http://help.geneontology.org).
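
The GO-CAM description given earlier in this section, a model as a set of subject-relation-object triples with evidence attached to each triple, can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class names Evidence and Triple and the function flatten are hypothetical, the evidence values are placeholders, and actual GO-CAM models are expressed in OWL rather than as Python objects. The flattening step loosely mirrors the idea that a model can be reduced to standard annotations; it is not the conversion procedure used by the GOC.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence: an ECO class plus a citable source (placeholders here)."""
    eco_class: str
    source: str


@dataclass(frozen=True)
class Triple:
    """A subject-relation-object statement supported by one or more pieces of evidence."""
    subject: str
    relation: str
    obj: str
    evidence: Tuple[Evidence, ...] = ()


def flatten(model: FrozenSet[Triple]) -> List[Tuple[str, str, str]]:
    """Reduce a model to simple (gene product, relation, term) statements.

    Assumption for this sketch: every statement about an activity is attributed
    to the gene product that 'enables' that activity.
    """
    statements = set()
    for t in model:
        if t.relation != "enables":
            continue
        gene, activity = t.subject, t.obj
        statements.add((gene, "enables", activity))
        for other in model:
            if other.subject == activity:
                statements.add((gene, other.relation, other.obj))
    return sorted(statements)


# Placeholder evidence; a real model would cite a specific ECO class and publication.
ev = (Evidence("ECO:<class>", "PMID:<id>"),)

# The example model given in the text.
model = frozenset({
    Triple("ABCA1", "enables", "cholesterol transporter activity", ev),
    Triple("cholesterol transporter activity", "occurs in", "plasma membrane", ev),
    Triple("cholesterol transporter activity", "part of", "cholesterol homeostasis", ev),
})

annotations = flatten(model)
for gene, relation, term in annotations:
    print(gene, relation, term)
```

Using a frozenset of immutable triples reflects the ‘set container’ wording in the text: the order of statements carries no meaning, and duplicate statements collapse automatically.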
relations. We also addressed the structure of the ontology so that the upper-level terms would be more biologically meaningful and have more uniform specificity. We removed eight terms from the top level and added four new terms (Figure 2). Most of the terms that were formerly direct children of ‘GO:0003674 molecular function’ were moved under more biologically meaningful terms: for example, ‘GO:0042056 chemoattractant activity’ and ‘GO:0045499 chemorepellent activity’ were moved under ‘GO:0048018 receptor ligand activity’, while ‘GO:0036370 D-alanyl carrier activity’ and ‘GO:0016530 metallochaperone activity’ were moved under the new term ‘GO:0140104 molecular carrier activity’ (rep-

development methodology to refine two areas of the ontology, the MAP kinase signaling pathway and the representation of the extracellular matrix. Refinements to the MAPK cascade included defining the molecular functions that are parts of the process: ‘GO:0004707 MAP kinase activity’, ‘GO:0004708 MAP kinase kinase activity’, ‘GO:0004709 MAP kinase kinase kinase activity’ and ‘GO:0008349 MAP kinase kinase kinase kinase activity’. All other upstream and downstream molecular functions/biological processes will be modeled in GO-CAM with causal relationships between them and the MAPK cascade. We also enumerated the types of cascades based on current literature and on
[104967/Z/14/Z to S.G.O.]; MGI is supported by the National Human Genome Research Institute [HG 000330, HG 002273]; RGD is supported by the National Heart, Lung, and Blood Institute [HL 64541]; The UniProt Consortium is supported by the National Eye Institute, National Human Genome Research Institute, National Heart, Lung and Blood Institute, National Institute of Allergy and Infectious Diseases, National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of General Medical Sciences, and National Institute of Mental Health of the National Institutes of Health under Award Number [U24HG007822], National Human Genome Re-

10. Lovering,R.C., Roncaglia,P., Howe,D.G., Laulederkind,S.J.F., Khodiyar,V.K., Berardini,T.Z., Tweedie,S., Foulger,R.E., Osumi-Sutherland,D., Campbell,N.H. et al. (2018) Improving interpretation of cardiac phenotypes and enhancing discovery with expanded knowledge in the Gene Ontology. Circ. Genomic Precis. Med., 11, e001813.
11. Feuermann,M., Gaudet,P., Mi,H., Lewis,S.E. and Thomas,P.D. (2016) Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes. Database J. Biol. Databases Curation, 2016, baw155.
12. Musen,M.A. and Protégé Team. (2015) The Protégé Project: a look back and a look forward. AI Matters, 1, 4–12.
13. Hill,D.P., Blake,J.A., Richardson,J.E. and Ringwald,M. (2002) Extension and integration of the gene ontology (GO): combining GO
APPENDIX OF AUTHORS

Berkeley Bioinformatics Open-Source Projects (BBOP), Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory (Berkeley, CA, USA): S. Carbon*, E. Douglass, N. Dunn, B. Good, N.L. Harris, S.E. Lewis, C.J. Mungall; dictyBase, Northwestern University (Chicago, IL, USA): S. Basu, R.L. Chisholm, R.J. Dodson, E. Hartline, P. Fey; Division of Bioinformatics, Department of Preventive Medicine, University of Southern California (Los Angeles, CA, USA): P.D. Thomas*, L.P. Albou*, D. Ebert, M.J. Kesling, H. Mi, A. Muruganujan, X. Huang, S. Poudel, T. Mushayahama; EcoliWiki,

ford, V. Wood; PomBase, The Francis Crick Institute (London, UK): J. Hayles; PomBase, University College London (London, UK): J. Bahler, A. Lock; RGD, Medical College of Wisconsin (Milwaukee, WI, USA): E.R. Bolton, J. De Pons, M. Dwinell, G.T. Hayman, S.J.F. Laulederkind, M. Shimoyama, M. Tutaj, S.-J. Wang; Reactome, Department of Biochemistry & Molecular Pharmacology, NYU School of Medicine (New York, NY, USA): P. D’Eustachio, L. Matthews; Renaissance Computing Institute, University of North Carolina (Chapel Hill, NC, USA): J.P. Balhoff; SGD, Department of Genetics, Stanford University (Stanford, CA, USA): S.A. Aleksander, G. Binkley, B.L. Dunn, J.M.