Toward a unified approach to statistical language modeling for Chinese

Published: 01 March 2002

Abstract

This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
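
As a rough illustration of two steps named in the abstract (segmenting text with a lexicon, then training a trigram model by maximum likelihood), the following is a minimal sketch, not the article's procedure. It assumes greedy forward maximum matching as a stand-in segmenter, a toy four-word lexicon, and a two-sentence corpus, all hypothetical; the estimate is unsmoothed, whereas a practical model would add backoff or smoothing.

```python
from collections import Counter

def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (a stand-in segmenter, assumed here):
    at each position take the longest lexicon word, else a single character."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words

def train_trigram(sentences):
    """Return prob(w1, w2, w3): the unsmoothed maximum likelihood
    estimate count(w1 w2 w3) / count(w1 w2) over the segmented corpus."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[a, b, c] += 1
            bi[a, b] += 1  # count of the two-word history

    def prob(w1, w2, w3):
        return tri[w1, w2, w3] / bi[w1, w2] if bi[w1, w2] else 0.0

    return prob

# Toy data, purely illustrative.
lexicon = {"北京", "大学", "北京大学", "学生"}
corpus = ["北京大学学生", "北京学生"]
segmented = [segment(s, lexicon) for s in corpus]
prob = train_trigram(segmented)
```

Here `segmented` holds the word sequences the model is trained on, so a different lexicon changes both the segmentation and the resulting trigram counts, which is why the abstract treats lexicon construction and segmentation as part of language model training.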




Published In

ACM Transactions on Asian Language Information Processing, Volume 1, Issue 1
March 2002
102 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/595576

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2002
Published in TALIP Volume 1, Issue 1


Author Tags

  1. Chinese language
  2. Chinese pinyin-to-character conversion
  3. backoff
  4. character error rate
  5. domain adaptation
  6. lexicon
  7. n-gram model
  8. perplexity
  9. pruning
  10. smoothing
  11. statistical language modeling
  12. word segmentation

Qualifiers

  • Article


