A stochastic finite-state word-segmentation algorithm for Chinese

Authors:

Richard Sproat,

Nancy Chang

ACL '94: Proceedings of the 32nd annual meeting on Association for Computational Linguistics

Pages 66 - 73

https://doi.org/10.3115/981732.981742

Published: 27 June 1994

Abstract

We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.
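As a rough illustration of the idea of stochastic segmentation into dictionary entries, the sketch below picks the cheapest path through the lattice of dictionary words that cover an input string, where each word costs the negative log of its estimated probability. This is only a minimal sketch under stated assumptions, not the authors' system: the lexicon, its probabilities, the sample string, and the names `TOY_LEXICON`, `word_cost`, and `segment` are all hypothetical, and the sketch leaves out productively derived words, personal names, and pronunciations.

```python
import math

# Hypothetical toy lexicon: word -> estimated probability (made-up numbers).
TOY_LEXICON = {
    "日": 0.05,
    "文": 0.05,
    "章": 0.02,
    "魚": 0.03,
    "日文": 0.04,
    "文章": 0.04,
    "章魚": 0.03,
}


def word_cost(word, lexicon):
    """Cost of a word: the negative log of its estimated probability."""
    return -math.log(lexicon[word])


def segment(text, lexicon, max_len=4):
    """Return (total_cost, words) for the cheapest dictionary segmentation.

    best[i] holds (cost of the cheapest segmentation of text[:i],
    start index of the last word on that path).
    """
    n = len(text)
    best = [(math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon and best[start][0] < math.inf:
                cost = best[start][0] + word_cost(word, lexicon)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[n][0] == math.inf:
        raise ValueError("no segmentation found with this lexicon")
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    words.reverse()
    return best[n][0], words


cost, words = segment("日文章魚", TOY_LEXICON)
print(words, round(cost, 3))
```

With these made-up probabilities, the two-word reading 日文 + 章魚 is cheaper than 日 + 文章 + 魚, so the ambiguity is resolved by cost alone. A weighted finite-state implementation would express the same search as a best-path computation over a transducer rather than an explicit table.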





Published In

ACL '94: Proceedings of the 32nd annual meeting on Association for Computational Linguistics

June 1994, 353 pages

Program Chair: James Pustejovsky (Brandeis University)

Publisher: Association for Computational Linguistics, United States

Overall Acceptance Rate: 85 of 443 submissions, 19%


