DOI: 10.3115/974358.974399

Improving Chinese tokenization with linguistic filters on statistical lexical acquisition

Published: 13 October 1994

Abstract

The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993; Chiang et al. 1992; Lin et al. 1993; Wu & Tseng 1993; Sproat et al. 1994).

We present empirical evidence for four points concerning tokenization of Chinese text: (1) More rigorous "blind" evaluation methodology is needed to avoid inflated accuracy measurements; we introduce the nk-blind method. (2) The unknown-word problem is far more serious than generally thought when tokenizing unrestricted texts in realistic domains. (3) Statistical lexical acquisition is a practical means to greatly improve tokenization accuracy with unknown words, reducing error rates by as much as 32.0%. (4) When augmenting the lexicon, linguistic constraints can provide simple, inexpensive filters yielding significantly better precision, reducing error rates by as much as 49.4%.
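To illustrate why words missing from the lexicon are a stumbling block for dictionary-lookup tokenizers, here is a minimal sketch using greedy forward maximum matching. This is a common textbook dictionary-based technique and is an assumption made for illustration; it is not necessarily the tokenizer evaluated in the paper. The function name `max_match`, the `max_len` parameter, and the toy lexicon are all hypothetical.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching over a set of known words.

    Illustrative only: at each position, take the longest lexicon entry
    (up to max_len characters); if none matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Toy lexicon (hypothetical); "香港" is deliberately absent.
lexicon = {"我们", "喜欢", "北京"}

print(max_match("我们喜欢北京", lexicon))  # every word is in the lexicon
print(max_match("我们喜欢香港", lexicon))  # the unknown word breaks into single characters

# Adding the missing word to the lexicon repairs the segmentation.
print(max_match("我们喜欢香港", lexicon | {"香港"}))
```

In this sketch the unknown word is fragmented into one-character tokens, and augmenting the lexicon with that word restores the intended segmentation. That is the general motivation for lexical acquisition described in the abstract, shown here under the stated assumptions only.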

References

[1] BDC. 1992. The BDC Chinese-English electronic dictionary (version 2.0). Behavior Design Corporation.
[2] CHANG, CHAO-HUANG & CHENG-DER CHEN. 1993. HMM-based part-of-speech tagging for Chinese corpora. In Proceedings of the Workshop on Very Large Corpora, 40--47, Columbus, Ohio.
[3] CHIANG, TUNG-HUI, JING-SHIN CHANG, MING-YU LIN, & KEH-YIH SU. 1992. Statistical models for word segmentation and unknown resolution. In Proceedings of ROCLING-92, 121--146.
[4] FUNG, PASCALE & DEKAI WU. 1994. Statistical augmentation of a Chinese machine-readable dictionary. In Proceedings of the Second Annual Workshop on Very Large Corpora, 69--85, Kyoto.
[5] LIN, MING-YU, TUNG-HUI CHIANG, & KEH-YIH SU. 1993. A preliminary study on unknown word problem in Chinese word segmentation. In Proceedings of ROCLING-93, 119--141.
[6] SPROAT, RICHARD, CHILIN SHIH, WILLIAM GALE, & NANCY CHANG. 1994. A stochastic word segmentation algorithm for a Mandarin text-to-speech system. In Proceedings of the 32nd Annual Conference of the Association for Computational Linguistics, 66--72, Las Cruces, New Mexico.
[7] WU, DEKAI. 1994. Aligning a parallel English-Chinese corpus statistically with lexical criteria. In Proceedings of the 32nd Annual Conference of the Association for Computational Linguistics, 80--87, Las Cruces, New Mexico.
[8] WU, ZIMIN & GWYNETH TSENG. 1993. Chinese text segmentation for text retrieval: Achievements and problems. Journal of The American Society for Information Science, 44(9):532--542.

Published In

ANLC '94: Proceedings of the fourth conference on Applied natural language processing
October 1994
226 pages

Sponsors

• ACL: Association for Computational Linguistics
• Gesellschaft für Informatik

Publisher

Association for Computational Linguistics
United States
