Integrating unsupervised and supervised word segmentation: The role of goodness measures

Published: 01 January 2011

Abstract

This study explores the feasibility of integrating unsupervised and supervised segmentation of Chinese texts to enhance performance beyond the present state of the art, focusing on the critical role of the former in enhancing the latter. Guided only by a pre-defined goodness measure, unsupervised segmentation has the advantage of discovering many new words in raw texts, but the disadvantage of inevitably corrupting many known ones. By contrast, supervised segmentation, conventionally trained only on a pre-segmented corpus, is particularly good at identifying known words but has little intrinsic mechanism for dealing with unseen ones until it is formulated as character tagging. To combine their strengths, we empirically evaluate a set of goodness measures, among which description length gain excels in word discovery, although simple strategies such as word candidate pruning and assemble segmentation can improve it further. Interestingly, however, accessor variety and boundary entropy, two other goodness measures, are found to be more effective in enhancing the supervised learning of character tagging with the conditional random fields model. All goodness scores are discretized into feature values to enrich this model. The success of this approach has been verified by our experiments on the benchmark data sets of the last two Bakeoffs: on average, it achieves an error reduction of 6.39% over the best closed-test performance in Bakeoff-3 and ranks first in all five closed-test tracks in Bakeoff-4, outperforming other participants significantly and consistently with an error reduction of 8.96%.
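
To make two of the goodness measures named in the abstract more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. It takes accessor variety as the smaller of the numbers of distinct left and right neighbour characters of a substring, and boundary entropy as the Shannon entropy of the neighbour distribution on one side. The function names, the `#` boundary marker, the single shared marker for text start and end, and the log2 bucketing in `discretize` are all hypothetical choices made for this example; the abstract says only that scores are discretized into feature values.

```python
import math
from collections import Counter, defaultdict

BOUNDARY = "#"  # hypothetical marker for the start and end of the text


def neighbour_counts(text, max_len=3):
    """Count left and right neighbour characters of every substring up to max_len."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    padded = BOUNDARY + text + BOUNDARY
    last = len(padded) - 1  # index of the closing boundary marker
    for i in range(1, last):
        for j in range(i + 1, min(i + max_len, last) + 1):
            s = padded[i:j]
            left[s][padded[i - 1]] += 1   # character just before s
            right[s][padded[j]] += 1      # character just after s
    return left, right


def accessor_variety(s, left, right):
    """Smaller of the distinct left-neighbour and right-neighbour counts."""
    return min(len(left[s]), len(right[s]))


def boundary_entropy(counts):
    """Shannon entropy (in bits) of a neighbour-character distribution."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def discretize(score, max_bucket=5):
    """Assumed bucketing: 0 for scores below 1, else a capped log2 bucket."""
    if score < 1:
        return 0
    return min(int(math.log2(score)) + 1, max_bucket)


# Usage on a toy string: "ab" is preceded by '#', 'c', 'd' and followed by 'c', 'd', 'e'.
left, right = neighbour_counts("abcabdabe")
av = accessor_variety("ab", left, right)   # 3
be = boundary_entropy(right["ab"])         # entropy of a uniform 3-way choice
feature = discretize(av)                   # integer value usable as a tagger feature
```

A substring that occurs in many different contexts gets a high score under both measures, which is why such scores are plausible word-boundary evidence; mapping them to a few integer buckets gives categorical values of the kind a character-tagging model can take as features.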



Published In

Information Sciences: an International Journal, Volume 181, Issue 1
January 2011
257 pages

Publisher

Elsevier Science Inc., United States

Author Tags

1. Accessor variety
2. Boundary entropy
3. Character tagging
4. Chinese word segmentation
5. Conditional random fields
6. Description length gain
7. Unknown word detection
8. Unsupervised segmentation

Cited By

  • (2022) Text Mining Based on the Lexicon-Constrained Network in the Context of Big Data. Wireless Communications & Mobile Computing, 2022. DOI: 10.1155/2022/8703100. Online publication date: 1-Jan-2022
  • (2019) Effective Representation for Easy-First Dependency Parsing. PRICAI 2019: Trends in Artificial Intelligence, pp. 351-363. DOI: 10.1007/978-3-030-29908-8_28. Online publication date: 26-Aug-2019
  • (2016) A bilingual graph-based semantic model for statistical machine translation. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2950-2956. DOI: 10.5555/3060832.3061034. Online publication date: 9-Jul-2016
  • (2014) Feature-frequency. Computational Linguistics 40:3, pp. 563-586. DOI: 10.1162/COLI_a_00193. Online publication date: 1-Sep-2014
  • (2013) Unsupervised resolution independent based natural plant leaf disease segmentation approach for mobile devices. Proceedings of the 5th IBM Collaborative Academia Research Exchange Workshop, pp. 1-4. DOI: 10.1145/2528228.2528240. Online publication date: 17-Oct-2013
  • (2013) The Left and Right Context of a Word. ACM Transactions on Asian Language Information Processing 12:1, pp. 1-23. DOI: 10.1145/2425327.2425329. Online publication date: 1-Mar-2013
  • (2013) Probabilistic Chinese word segmentation with non-local information and stochastic training. Information Processing and Management: an International Journal 49:3, pp. 626-636. DOI: 10.1016/j.ipm.2012.12.003. Online publication date: 1-May-2013
  • (2012) Fast decoding algorithms for variable-lengths codes. Information Sciences: an International Journal 183:1, pp. 66-91. DOI: 10.1016/j.ins.2011.06.019. Online publication date: 1-Jan-2012
