Validating associations in biological databases

Authors:

Francisco M. Couto,

Mário J. Silva,

Pedro M. Coutinho

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

Pages 142 - 151

https://doi.org/10.1145/1183614.1183639

Published: 06 November 2006

Abstract

Erroneous data can often be found in databases, and detecting it is normally a non-trivial task. For example, to cope with the large amount of biological sequences being produced, a significant number of genes and proteins have been annotated by automated tools. A protein annotation is an association between a protein and a term describing its role. These tools have produced a significant number of misannotations that are now present in biological databases. This paper proposes a new method for automatically scoring associations by comparing them to preexisting curated associations. An association is a pair that links two entities. The score can be used to filter incorrect or uncommon associations. We evaluated the method using the automated protein annotations submitted to BioCreAtIvE, an international evaluation of state-of-the-art text-mining systems in Biology. The method scored each of these annotations, and those scored below a certain threshold were discarded. The results have shown a small trade-off in recall for a large improvement in precision. For example, we were able to discard 44.6%, 66.8%, and 81% of the misannotations, maintaining 96.9%, 84.2%, and 47.8% of the correct annotations, respectively. Moreover, we were able to outperform each individual submission to BioCreAtIvE by proper adjustment of the threshold.
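The threshold-based filtering step described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the paper's scoring method, which the abstract does not specify: scores are assumed to be given as input, and the class `ScoredAnnotation`, the functions `filter_by_threshold` and `evaluate_filter`, and the toy data are all hypothetical names and values introduced only for illustration. The sketch shows how, given gold-standard labels, one might measure the fraction of misannotations discarded and the fraction of correct annotations retained at a chosen threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredAnnotation:
    protein: str      # protein identifier (made-up example values below)
    term: str         # term describing the protein's role
    score: float      # score from some scoring method (assumed given)
    is_correct: bool  # gold-standard judgment, used only for evaluation


def filter_by_threshold(annotations, threshold):
    """Discard annotations scored below the threshold; keep the rest."""
    return [a for a in annotations if a.score >= threshold]


def evaluate_filter(annotations, threshold):
    """Return (fraction of misannotations discarded,
    fraction of correct annotations retained)."""
    kept = filter_by_threshold(annotations, threshold)
    total_correct = sum(a.is_correct for a in annotations)
    total_wrong = len(annotations) - total_correct
    kept_correct = sum(a.is_correct for a in kept)
    kept_wrong = len(kept) - kept_correct
    discarded_wrong = (total_wrong - kept_wrong) / total_wrong if total_wrong else 0.0
    retained_correct = kept_correct / total_correct if total_correct else 0.0
    return discarded_wrong, retained_correct


# Toy data, invented for illustration only (not from the paper).
TOY_ANNOTATIONS = [
    ScoredAnnotation("P1", "T1", 0.9, True),
    ScoredAnnotation("P2", "T2", 0.7, True),
    ScoredAnnotation("P3", "T3", 0.4, False),
    ScoredAnnotation("P4", "T4", 0.2, False),
    ScoredAnnotation("P5", "T5", 0.3, True),
    ScoredAnnotation("P6", "T6", 0.6, False),
]

if __name__ == "__main__":
    # Sweep a few thresholds to see the precision/recall trade-off.
    for t in (0.25, 0.5, 0.75):
        discarded, retained = evaluate_filter(TOY_ANNOTATIONS, t)
        print(f"threshold={t}: discarded {discarded:.1%} of misannotations, "
              f"kept {retained:.1%} of correct annotations")
```

Raising the threshold in such a sweep discards more misannotations at the cost of losing more correct annotations, which is the kind of trade-off the reported figures describe.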

Index Terms

Validating associations in biological databases
1. Applied computing
  1. Life and medical sciences
2. Information systems
  1. Information systems applications
    1. Data mining

Published In

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

November 2006

916 pages

ISBN:1595934332

DOI:10.1145/1183614

General Chair: Philip S. Yu, IBM T.J. Watson Research Center (USA)

Program Chairs: Vassilis Tsotras, University of California-Riverside (USA); Edward Fox, Virginia Tech (USA); Bing Liu, University of Illinois at Chicago (USA)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM06: Conference on Information and Knowledge Management

November 6 - 11, 2006

Arlington, Virginia, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
