DOI: 10.3115/981732.981742

A stochastic finite-state word-segmentation algorithm for Chinese

Published: 27 June 1994

Abstract

We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.
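The general idea of cost-based segmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy lexicon with made-up counts (`TOY_COUNTS`), assigns each word a cost equal to its negative log relative frequency, and finds the cheapest path by dynamic programming, which corresponds to a best-path search through a weighted word lattice. The function name `segment`, the `max_len` limit, and the fallback cost for unlisted single characters are all assumptions for illustration. Derived words, personal names, and pronunciations are not modeled here.

```python
import math

# Hypothetical toy lexicon: word -> count. The numbers are illustrative only.
TOY_COUNTS = {
    "日文": 20, "章魚": 10, "怎麼": 50, "說": 80,
    "日": 30, "文章": 40, "魚": 25,
}


def segment(text, counts, max_len=4):
    """Return (words, total_cost) for the lowest-cost segmentation of text.

    Each lexicon word costs -log(count / total). A single character that
    is not in the lexicon gets an assumed fallback cost so that every
    input can still be segmented.
    """
    total = sum(counts.values())
    unknown_cost = -math.log(0.5 / total)  # assumed penalty, not from the paper

    # best[i] is the cheapest cost of segmenting text[:i];
    # back[i] is the start index of the last word on that path.
    best = [0.0] + [math.inf] * len(text)
    back = [0] * (len(text) + 1)

    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts:
                cost = -math.log(counts[word] / total)
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue  # multi-character strings must be in the lexicon
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = len(text)
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1], best[len(text)]


words, cost = segment("日文章魚怎麼說", TOY_COUNTS)
print(words)
```

With these toy counts, the path through 日文 and 章魚 is cheaper than the competing path through 日, 文章, and 魚, so the first segmentation is selected. Changing the counts changes which path wins, which is the sense in which the model is stochastic rather than purely dictionary-driven.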




Published In

ACL '94: Proceedings of the 32nd annual meeting on Association for Computational Linguistics
June 1994
353 pages

Publisher

Association for Computational Linguistics

United States


Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%


Article Metrics

  • Downloads (last 12 months): 43
  • Downloads (last 6 weeks): 7

Reflects downloads up to 12 Nov 2024


Cited By

  • (2008) A joint segmenting and labeling approach for Chinese lexical analysis. Proceedings of the 2008th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, pages 538-549. DOI: 10.5555/3121525.3121561. Online publication date: 15-Sep-2008
  • (2008) Applying Text Mining to Assist People Who Inquire HIV/AIDS Information from Internet. Proceedings of the IEEE ISI 2008 PAISI, PACCF, and SOCO international workshops on Intelligence and Security Informatics, pages 440-448. DOI: 10.1007/978-3-540-69304-8_46. Online publication date: 17-Jun-2008
  • (2002) An NLP & IR approach to topic detection. Topic detection and tracking, pages 243-264. DOI: 10.5555/772260.772273. Online publication date: 1-Jan-2002
  • (2002) Backward machine transliteration by learning phonetic similarity. Proceedings of the 6th conference on Natural language learning - Volume 20, pages 1-7. DOI: 10.3115/1118853.1118870. Online publication date: 31-Aug-2002
  • (2002) Probabilistic named entity verification. COLING-02 on COMPUTERM 2002: second international workshop on computational terminology - Volume 14, pages 1-7. DOI: 10.3115/1118771.1118777. Online publication date: 31-Aug-2002
  • (2001) Compound noun segmentation based on lexical data extracted from corpus. Natural Language Engineering, 7:2, pages 167-185. DOI: 10.1017/S1351324901002637. Online publication date: 1-Jun-2001
  • (2000) Identifying temporal expression and its syntactic role using FST and lexical data from corpus. Proceedings of the 18th conference on Computational linguistics - Volume 2, pages 954-960. DOI: 10.3115/992730.992784. Online publication date: 31-Jul-2000
  • (2000) Compound noun segmentation based on lexical data extracted from corpus. Proceedings of the sixth conference on Applied natural language processing, pages 196-203. DOI: 10.3115/974147.974174. Online publication date: 29-Apr-2000
  • (1998) Chinese word segmentation without using lexicon and hand-crafted training data. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 1265-1271. DOI: 10.3115/980691.980775. Online publication date: 10-Aug-1998
  • (1997) CSeg& Tag1.0. Proceedings of the fifth conference on Applied natural language processing, pages 119-126. DOI: 10.3115/974557.974575. Online publication date: 31-Mar-1997
