Toward a unified approach to statistical language modeling for Chinese

Authors:

Joshua Goodman,

Kai-Fu Lee

ACM Transactions on Asian Language Information Processing (TALIP), Volume 1, Issue 1

Pages 3 - 33

https://doi.org/10.1145/595576.595578

Published: 01 March 2002

Abstract

This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
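As a rough illustration of what "segmenting the training data using a lexicon under the maximum likelihood principle" can look like, the sketch below picks the word sequence with the highest total log-probability by dynamic programming. It is a simplification and not the authors' implementation: it scores words with a unigram model rather than a trigram model, and the function name `segment`, the toy counts, the `max_word_len` limit, and the unknown-character penalty are all assumptions made for this example.

```python
import math

def segment(text, word_logprob, max_word_len=4):
    """Return the segmentation of `text` with the highest total log-probability.

    word_logprob maps lexicon words to log-probabilities (toy values here).
    A single character missing from the lexicon gets a fixed penalty
    (an assumption) so that every input can still be segmented.
    """
    unknown_logprob = -20.0          # assumed penalty for out-of-lexicon characters
    n = len(text)
    best = [-math.inf] * (n + 1)     # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_logprob:
                score = word_logprob[word]
            elif end - start == 1:
                score = unknown_logprob
            else:
                continue             # multi-character strings must be in the lexicon
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Hypothetical toy counts; relative frequencies are the maximum likelihood
# estimates of the unigram probabilities.
toy_counts = {"北京": 50, "大学": 40, "北京大学": 30, "大学生": 20,
              "生": 5, "大": 3, "学": 3, "北": 2, "京": 2}
total = sum(toy_counts.values())
toy_lexicon = {w: math.log(c / total) for w, c in toy_counts.items()}

print(segment("北京大学生", toy_lexicon))
```

With these toy counts, the two-word reading 北京 / 大学生 scores higher than 北京大学 / 生, so the ambiguity is resolved by the probabilities alone. Replacing the unigram score with a trigram score would require carrying the two previous words in the dynamic-programming state.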

Index Terms

Toward a unified approach to statistical language modeling for Chinese
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition
    2. Philosophical/theoretical foundations of artificial intelligence
      1. Cognitive science
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Natural language interfaces

Information & Contributors

Information

Published In

ACM Transactions on Asian Language Information Processing Volume 1, Issue 1

March 2002

102 pages

ISSN:1530-0226

EISSN:1558-3430

DOI:10.1145/595576

Copyright © 2002 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics & Citations

Bibliometrics

Total Citations: 75
Total Downloads: 947
Downloads (Last 12 months): 23
Downloads (Last 6 weeks): 3

Reflects downloads up to 14 Nov 2024

Citations

Cited By

Hiwarkhedkar S, Mittal S, Magdum V, Dhekane O, Joshi R, Kale G, Ladkat A (2024). TextGram: Towards a Better Domain-Adaptive Pretraining. Speech and Language Technologies for Low-Resource Languages, 161-173. Online publication date: 24-Apr-2024.
https://doi.org/10.1007/978-3-031-58495-4_12
Wang H, Sun K, Wang Y (2022). Exploring the Chinese Public’s Perception of Omicron Variants on Social Media: LDA-Based Topic Modeling and Sentiment Analysis. International Journal of Environmental Research and Public Health 19:14 (8377). Online publication date: 8-Jul-2022.
https://doi.org/10.3390/ijerph19148377
Vu T, Moschitti A (2021). Machine Translation Customization via Automatic Training Data Selection from the Web. Advances in Information Retrieval, 666-679. Online publication date: 27-Mar-2021.
https://doi.org/10.1007/978-3-030-72113-8_44
Vertanen K, Kristensson P (2019). Mining, analyzing, and modeling text written on mobile devices. Natural Language Engineering 27:1 (1-33). Online publication date: 10-Oct-2019.
https://doi.org/10.1017/S1351324919000548
Chinea-Rios M, Sanchis-Trilles G, Casacuberta F (2019). Vector sentences representation for data selection in statistical machine translation. Computer Speech & Language 56 (1-16). Online publication date: Jul-2019.
https://doi.org/10.1016/j.csl.2018.12.005
Mezzoudj F, Langlois D, Jouvet D, Benyettou A (2018). Textual Data Selection for Language Modelling in the Scope of Automatic Speech Recognition. Procedia Computer Science 128 (55-64). Online publication date: 2018.
https://doi.org/10.1016/j.procs.2018.03.008
Iosif E, Klasinas I, Athanasopoulou G, Palogiannidi E, Georgiladakis S, Louka K, Potamianos A (2018). Speech understanding for spoken dialogue systems. Computer Speech and Language 47:C (272-297). Online publication date: 1-Jan-2018.
https://dl.acm.org/doi/10.1016/j.csl.2017.08.002
Wołk K (2018). Mixing Textual Data Selection Methods for Improved In-Domain Data Adaptation. Trends and Advances in Information Systems and Technologies, 367-377. Online publication date: 17-May-2018.
https://doi.org/10.1007/978-3-319-77712-2_35
Chunwijitra V, Wutiwiwatchai C (2017). Classification-based spoken text selection for LVCSR language modeling. EURASIP Journal on Audio, Speech, and Music Processing 2017:1 (1-12). Online publication date: 1-Dec-2017.
https://dl.acm.org/doi/10.1186/s13636-017-0121-5
Bakar Z, Ismail N, Rawi M (2017). Identification of Noun + Verb Compound Nouns in Malay Standard document based on rule based. 2017 IEEE 3rd International Conference on Engineering Technologies and Social Sciences (ICETSS), 1-6. Online publication date: Aug-2017.
https://doi.org/10.1109/ICETSS.2017.8324176