Abstract
This paper introduces a domain-adapted word segmentation approach for text in which word delimiters are not used consistently. The approach relies on an unknown word extraction technique. It is essential for adapting language models to new domains, because the vocabulary set is determined in the word segmentation step. We achieved an ERR of 21.22% in Korean word segmentation. In addition, we show that incremental domain adaptation of the word segmentation gradually decreases the perplexity of the input text, which indicates that our approach supports out-of-domain language modeling.
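As a rough illustration of the perplexity claim, the following minimal sketch computes perplexity under a toy add-one smoothed unigram model. The model, the function name `unigram_perplexity`, and the token lists are assumptions made for illustration only; they are not the language model or data used in the paper.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one smoothed unigram model.

    All words unseen in training share one reserved "unknown" slot,
    so the smoothed probabilities sum to one.
    """
    counts = Counter(train_tokens)
    vocab_size = len(counts) + 1  # +1 for the shared unknown-word slot
    total = len(train_tokens)
    log_prob = 0.0
    for tok in test_tokens:
        # Add-one smoothing: unseen tokens get count 0, so probability 1 / (total + V)
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))


# Hypothetical toy data: "c" stands for a domain word unknown to the base model.
base_train = ["a", "b", "a"]
adapted_train = base_train + ["c"]  # vocabulary extended with the extracted word
test = ["a", "c"]

base = unigram_perplexity(base_train, test)
adapted = unigram_perplexity(adapted_train, test)
print(base, adapted)
```

In this toy setting, adding the previously unknown word to the training vocabulary lowers the perplexity of the test text, which mirrors the direction of the effect the abstract reports. It says nothing about the magnitude observed in the paper.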
Copyright information
© 2011 Springer Science+Business Media, LLC
About this paper
Cite this paper
Chung, E., Jeon, HB., Park, JG., Lee, YK. (2011). Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling. In: Delgado, RC., Kobayashi, T. (eds) Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1335-6_9
DOI: https://doi.org/10.1007/978-1-4614-1335-6_9
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1334-9
Online ISBN: 978-1-4614-1335-6
eBook Packages: Engineering, Engineering (R0)