DOI: 10.3115/1118149.1118153

Accenting unknown words in a specialized language

Published: 11 July 2002

Abstract

We propose two internal methods for accenting unknown words. Both learn, from a reference set of accented words, the contexts in which the various accented forms of a given letter occur. One method is adapted from POS tagging; the other is based on finite-state transducers. We show experimental results for the letter e on the French version of the Medical Subject Headings thesaurus. With the best training set, the tagging method obtains a precision-recall breakeven point of 84.2±4.4% and the transducer method 83.8±4.5% (against a baseline of 64%) for the unknown words that contain this letter. A consensus combination of both increases precision to 92.0±3.7% with a recall of 75%. We perform an error analysis and discuss further steps that might improve on the current performance.
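The effect of a consensus combination on precision and recall can be illustrated with a minimal sketch. It is not the authors' implementation: the two accenters are stand-in lookup tables with made-up outputs, and the names `consensus`, `precision_recall`, `GOLD`, `tagger`, and `transducer` are hypothetical. The sketch assumes that the combined system proposes an accented form only when both methods agree and abstains otherwise.

```python
from typing import Callable, Optional, Sequence, Tuple

# An accenter maps an unaccented word to an accented form, or None to abstain.
Accenter = Callable[[str], Optional[str]]


def consensus(a: Accenter, b: Accenter) -> Accenter:
    """Combine two accenters: answer only when both give the same form."""
    def combined(word: str) -> Optional[str]:
        guess_a, guess_b = a(word), b(word)
        if guess_a is not None and guess_a == guess_b:
            return guess_a
        return None  # disagreement or abstention: no answer
    return combined


def precision_recall(accenter: Accenter,
                     gold: Sequence[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision = correct / answered; recall = correct / all gold words."""
    answered = 0
    correct = 0
    for unaccented, accented in gold:
        guess = accenter(unaccented)
        if guess is None:
            continue
        answered += 1
        if guess == accented:
            correct += 1
    precision = correct / answered if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy gold standard (illustrative only, not data from the paper).
GOLD = [
    ("medecine", "médecine"),
    ("hepatite", "hépatite"),
    ("anemie", "anémie"),
    ("artere", "artère"),
]

# Hypothetical outputs of two accenters; each makes one different error.
_TAGGER_OUT = {"medecine": "médecine", "hepatite": "hépatite",
               "anemie": "anémie", "artere": "artére"}
_TRANSDUCER_OUT = {"medecine": "médecine", "hepatite": "hépatite",
                   "anemie": "anemie", "artere": "artère"}

tagger: Accenter = _TAGGER_OUT.get
transducer: Accenter = _TRANSDUCER_OUT.get

if __name__ == "__main__":
    for name, system in [("tagger", tagger),
                         ("transducer", transducer),
                         ("consensus", consensus(tagger, transducer))]:
        p, r = precision_recall(system, GOLD)
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

On this toy data each accenter alone reaches 0.75 precision and recall, while the consensus answers only the two words on which both agree, giving precision 1.00 and recall 0.50. This mirrors the direction of the trade-off reported in the abstract (higher precision, lower recall), not its magnitudes.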

Published In

BioMed '02: Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
July 2002
94 pages

Publisher

Association for Computational Linguistics

United States
