The Left and Right Context of a Word: Overlapping Chinese Syllable Word Segmentation with Minimal Context

Published: 01 March 2013

Abstract

Because a Chinese syllable can correspond to many characters (homophones), syllable-to-character conversion is a challenging task for Chinese phonetic input methods (CPIMs). A CPIM usually works in two stages: (1) segment the syllable sequence into syllable words, and (2) select the most likely character word for each syllable word. A CPIM usually assumes that the input is a complete sentence and evaluates performance on a well-formed corpus. In practice, however, most Pinyin users prefer progressive text entry in several short chunks, mostly of one or two words each (most Chinese words consist of two or more characters). Short chunks do not provide enough context to perform the best possible syllable-to-character conversion, especially when a chunk consists of overlapping syllable words. In such cases, a conversion system often selects the boundary of the word with the highest frequency. Short-chunk input is even more popular on platforms with limited computing power, such as mobile phones. Based on the observation that the relative strength of a word can be quite different when calculated leftwards or rightwards, we propose a simple division of the word context into the left context and the right context. Furthermore, we design a double ranking strategy for each word to reduce the number of errors in Stage 1. Our strategy is modeled as the minimum feedback arc set problem on bipartite tournaments, with approximate solutions derived from a genetic algorithm. Experiments show that, compared to the frequency-based method (FBM) (low memory and fast) and the conditional random fields (CRF) model (larger memory and slower), our double ranking strategy has the benefits of less memory and low power requirements with competitive performance. We believe a similar strategy could also be adopted to disambiguate conflicting linguistic patterns effectively.
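
The contrast between frequency-based selection and a double ranking can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the placeholder syllables (`s1`, `s2`, ...), the toy frequencies and ranks, and the convention that the left word competes with its rightward rank while the right word competes with its leftward rank are all assumptions made for illustration. The last function counts how many observed conflicts a given pair of rankings gets wrong, which is the quantity a feedback-arc-set formulation would try to minimize; the genetic search over rankings is not shown.

```python
from typing import Dict, List, Set, Tuple

Word = Tuple[str, ...]  # a syllable word, e.g. ("s1", "s2")


def overlapping_pairs(syllables: Tuple[str, ...], lexicon: Set[Word]) -> List[Tuple[Word, Word]]:
    """Find (left_word, right_word) pairs that cover the chunk and share syllables."""
    n = len(syllables)
    pairs = []
    for i in range(1, n):          # index where the right word starts
        for j in range(i + 1, n):  # index where the left word ends (exclusive)
            left, right = tuple(syllables[:j]), tuple(syllables[i:])
            if left in lexicon and right in lexicon:
                pairs.append((left, right))
    return pairs


def pick_by_frequency(left: Word, right: Word, freq: Dict[Word, int]) -> Word:
    """Frequency-based baseline: keep the more frequent word (ties go to the left word)."""
    return left if freq.get(left, 0) >= freq.get(right, 0) else right


def pick_by_double_rank(left: Word, right: Word,
                        right_rank: Dict[Word, int],
                        left_rank: Dict[Word, int]) -> Word:
    """Assumed double ranking: a lower rank number means a stronger word.

    The left word conflicts on its right side, so its rightward rank is used;
    the right word conflicts on its left side, so its leftward rank is used.
    """
    return left if right_rank[left] <= left_rank[right] else right


def count_feedback_arcs(conflicts: List[Tuple[Word, Word, Word]],
                        right_rank: Dict[Word, int],
                        left_rank: Dict[Word, int]) -> int:
    """Count observed conflicts (left, right, winner) that the rankings get wrong."""
    return sum(
        1 for left, right, winner in conflicts
        if pick_by_double_rank(left, right, right_rank, left_rank) != winner
    )


# Toy data (hypothetical): Y is strong as a right word but weak as a left word.
X, Y, Z = ("s1", "s2"), ("s2", "s3"), ("s3", "s4")
lexicon = {X, Y, Z}
freq = {X: 50, Y: 30, Z: 10}
right_rank = {X: 2, Y: 2}  # strength when competing as the left word
left_rank = {Y: 1, Z: 1}   # strength when competing as the right word
conflicts = [(X, Y, Y), (Y, Z, Z)]  # assumed gold winners

print(overlapping_pairs(("s1", "s2", "s3"), lexicon))
print(pick_by_frequency(X, Y, freq), pick_by_double_rank(X, Y, right_rank, left_rank))
print(count_feedback_arcs(conflicts, right_rank, left_rank))
```

In this toy setting a single frequency score prefers `X` over `Y` and `Y` over `Z`, whereas two separate ranks per word allow `Y` to win on one side and lose on the other.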


    Published In

    ACM Transactions on Asian Language Information Processing  Volume 12, Issue 1
    March 2013
    102 pages
    ISSN:1530-0226
    EISSN:1558-3430
    DOI:10.1145/2425327
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 March 2013
    Accepted: 01 February 2012
    Revised: 01 January 2012
    Received: 01 August 2011
    Published in TALIP Volume 12, Issue 1

    Author Tags

    1. Chinese phonetic input methods
    2. syllable-to-word conversion

    Qualifiers

    • Research-article
    • Research
    • Refereed
