Abstract
We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach enhances the performance of conventional hierarchical Pitman-Yor language models with richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. Moreover, on standard Chinese segmentation datasets, our method outperforms a baseline model by 1.9 to 2.9 F-score points.
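For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how a word token F-score is commonly computed in segmentation work: a predicted word counts as correct only if both of its boundaries match a gold word at the same character position. The function names (`word_spans`, `token_f_score`) and the toy sentences are illustrative assumptions, not taken from the paper, and the paper's exact scoring script may differ in detail.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_f_score(gold_sentences, pred_sentences):
    """Word token F-score over parallel lists of segmented sentences.

    Each sentence is a list of words. A predicted token is correct only
    if its (start, end) span also appears in the gold segmentation.
    """
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences must cover the same characters")
        gold_spans = word_spans(gold)
        pred_spans = word_spans(pred)
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / pred_total
    recall = correct / gold_total
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): the second gold word pair is under-segmented.
gold = [["you", "want", "to"], ["see", "the", "book"]]
pred = [["you", "wantto"], ["see", "the", "book"]]
score = token_f_score(gold, pred)
```

In this toy case four of the five predicted tokens match the six gold tokens, so precision is 4/5, recall is 4/6, and the F-score is 8/11.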
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Teng, Z., Xiong, H., Liu, Q. (2014). Unsupervised Joint Monolingual Character Alignment and Word Segmentation. In: Sun, M., Liu, Y., Zhao, J. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2014, CCL 2014. Lecture Notes in Computer Science, vol 8801. Springer, Cham. https://doi.org/10.1007/978-3-319-12277-9_1
DOI: https://doi.org/10.1007/978-3-319-12277-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12276-2
Online ISBN: 978-3-319-12277-9
eBook Packages: Computer Science, Computer Science (R0)