Integrating unsupervised and supervised word segmentation: The role of goodness measures

Authors:

Chunyu Kit

Information Sciences—Informatics and Computer Science, Intelligent Systems, Applications: An International Journal, Volume 181, Issue 1

Pages 163 - 183

https://doi.org/10.1016/j.ins.2010.09.008

Published: 01 January 2011

Abstract

This study explores the feasibility of integrating unsupervised and supervised segmentation of Chinese texts to enhance performance beyond the present state of the art, focusing on the critical role of the former in enhancing the latter. Following only a pre-defined goodness measure, unsupervised segmentation has the advantage of discovering many new words in raw texts, but it has the disadvantage of inevitably corrupting many known ones. By contrast, supervised segmentation, conventionally trained only on a pre-segmented corpus, is particularly good at identifying known words but possesses little intrinsic mechanism to deal with unseen ones until it is formulated as character tagging. To combine their strengths, we empirically evaluate a set of goodness measures, among which description length gain excels in word discovery, although simple strategies like word candidate pruning and assemble segmentation can further improve it. Interestingly, however, accessor variety and boundary entropy, two other goodness measures, are found to be more effective in enhancing the supervised learning of character tagging with the conditional random fields model. All goodness scores are discretized into feature values to enrich this model. The success of this approach has been verified by our experiments on the benchmark data sets of the last two Bakeoffs: on average, it achieves an error reduction of 6.39% over the best performance of the closed test in Bakeoff-3 and ranks first in all five closed test tracks in Bakeoff-4, outperforming other participants significantly and consistently by an error reduction of 8.96%.
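
To make the two goodness measures named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one common formulation: accessor variety as the smaller of the numbers of distinct left and right neighbouring characters of a candidate string, and boundary entropy as the smaller of the entropies of its left and right neighbour distributions. The function names, the `^`/`$` boundary markers, the `max_len` limit, and the log-based binning in `discretize` are all illustrative assumptions; the abstract does not specify how scores are discretized, and every sentence boundary is treated here as a single neighbour type for simplicity.

```python
import math
from collections import Counter, defaultdict

def context_counts(sentences, max_len=3):
    """Count left and right neighbour characters of every substring up to max_len."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sent in sentences:
        padded = "^" + sent + "$"  # hypothetical sentence-boundary markers
        n = len(padded)
        for i in range(1, n - 1):
            for j in range(i + 1, min(i + max_len, n - 1) + 1):
                s = padded[i:j]
                left[s][padded[i - 1]] += 1
                right[s][padded[j]] += 1
    return left, right

def accessor_variety(s, left, right):
    """Smaller of the numbers of distinct left and right neighbours (0 if unseen)."""
    return min(len(left.get(s, ())), len(right.get(s, ())))

def entropy(counter):
    """Shannon entropy (in bits) of a neighbour-count distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def boundary_entropy(s, left, right):
    """Smaller of the left and right neighbour entropies (0.0 if unseen)."""
    if s not in left:
        return 0.0
    return min(entropy(left[s]), entropy(right[s]))

def discretize(score, n_bins=5):
    """Illustrative binning (an assumption): floor(log2(score + 1)), capped at n_bins - 1."""
    return min(int(math.log2(score + 1)), n_bins - 1)

left, right = context_counts(["abcab", "xabz"])
av = accessor_variety("ab", left, right)
be = boundary_entropy("ab", left, right)
print(av, round(be, 3), discretize(av))
```

In a character-tagging setup, an integer bin such as `discretize(av)` could be attached to each character as an extra feature column alongside the usual character n-gram features; how the original system wires this into its conditional random fields model is not stated in the abstract.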

Index Terms

Integrating unsupervised and supervised word segmentation: The role of goodness measures
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Hardware
  1. Power and energy
    1. Power estimation and optimization
      1. Platform power issues

Published In

Information Sciences: an International Journal Volume 181, Issue 1

January, 2011

257 pages

ISSN:0020-0255

Copyright © 2010 Elsevier Inc.

Publisher

Elsevier Science Inc.

United States
