DOI: 10.1145/1183614.1183639

Validating associations in biological databases

Published: 06 November 2006

Abstract

Erroneous data can often be found in databases, and detecting it is normally a non-trivial task. For example, to cope with the large number of biological sequences being produced, a significant number of genes and proteins have been annotated by automated tools. A protein annotation is an association between a protein and a term describing its role. These tools have produced a significant number of misannotations that are now present in biological databases. This paper proposes a new method for automatically scoring associations by comparing them to preexisting curated associations. An association is a pair that links two entities. The score can be used to filter incorrect or uncommon associations.

We evaluated the method using the automated protein annotations submitted to BioCreAtIvE, an international evaluation of state-of-the-art text-mining systems in Biology. The method scored each of these annotations, and those scored below a certain threshold were discarded. The results show a small trade-off in recall for a large improvement in precision. For example, we were able to discard 44.6%, 66.8%, and 81% of the misannotations while maintaining 96.9%, 84.2%, and 47.8% of the correct annotations, respectively. Moreover, we were able to outperform each individual submission to BioCreAtIvE by proper adjustment of the threshold.
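The threshold-based filtering described in the abstract can be illustrated with a minimal sketch. This is not the paper's scoring method, which the abstract does not specify. The class `ScoredAnnotation`, the functions `filter_by_threshold` and `evaluate_filter`, and the toy scores and labels are all hypothetical, introduced only to show how discarding low-scored annotations trades retained correct annotations against discarded misannotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredAnnotation:
    protein: str    # protein identifier (hypothetical toy values below)
    term: str       # term describing the protein's role
    score: float    # score from some scoring method (assumed, not the paper's)
    correct: bool   # gold-standard judgement, used only for evaluation


def filter_by_threshold(annotations, threshold):
    """Keep annotations scored at or above the threshold; discard the rest."""
    return [a for a in annotations if a.score >= threshold]


def evaluate_filter(annotations, threshold):
    """Report how the filter affects correct and incorrect annotations."""
    kept = filter_by_threshold(annotations, threshold)
    total_correct = sum(a.correct for a in annotations)
    total_wrong = len(annotations) - total_correct
    kept_correct = sum(a.correct for a in kept)
    kept_wrong = len(kept) - kept_correct
    return {
        # fraction of correct annotations that survive the filter
        "correct_retained": kept_correct / total_correct if total_correct else 0.0,
        # fraction of misannotations removed by the filter
        "misannotations_discarded": (
            (total_wrong - kept_wrong) / total_wrong if total_wrong else 0.0
        ),
        # precision of the annotations that remain
        "precision": kept_correct / len(kept) if kept else 0.0,
    }


# Toy data: scores and labels are invented for illustration only.
toy = [
    ScoredAnnotation("P1", "T1", 0.9, True),
    ScoredAnnotation("P2", "T2", 0.8, True),
    ScoredAnnotation("P3", "T3", 0.7, False),
    ScoredAnnotation("P4", "T4", 0.4, True),
    ScoredAnnotation("P5", "T5", 0.3, False),
    ScoredAnnotation("P6", "T6", 0.1, False),
]

if __name__ == "__main__":
    for threshold in (0.2, 0.5, 0.75):
        print(threshold, evaluate_filter(toy, threshold))
```

On this toy data, raising the threshold discards more misannotations but also loses more correct annotations, which is the kind of trade-off the reported percentages describe.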


Published In

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
November 2006
916 pages
ISBN:1595934332
DOI:10.1145/1183614
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. biological databases
  2. filtering associations
  3. knowledge management

Qualifiers

  • Article

Conference

CIKM06
CIKM06: Conference on Information and Knowledge Management
November 6 - 11, 2006
Arlington, Virginia, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
