Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling

Euisok Chung, Hyung-Bae Jeon, Jeon-Gue Park & Yun-Keun Lee

Abstract

This paper introduces a domain-adapted word segmentation approach for text in which word delimiters are not used consistently. The approach relies on an unknown-word extraction technique. It is essential for adapting a language model to new domains, since the vocabulary set is activated in the word segmentation step. We achieved an ERR of 21.22% in Korean word segmentation. In addition, we show that incremental domain adaptation of the word segmentation gradually decreases the perplexity of the input text, which indicates that the approach supports out-of-domain language modeling.
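To illustrate why the vocabulary available to the segmenter matters, the following is a minimal sketch, not the authors' method (which is not described in this preview). It assumes a simple greedy forward maximum-matching segmenter over undelimited text; the function name `segment`, the `max_len` parameter, the toy vocabularies, and the English example string are all hypothetical. Adding one extracted domain word to the vocabulary turns a run of single-character fallbacks into one word token.

```python
def segment(text, vocab, max_len=8):
    """Greedy forward maximum matching (illustrative assumption, not the paper's algorithm).

    At each position, take the longest vocabulary word of at most max_len
    characters; if none matches, emit a single character as a fallback.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try candidate substrings from longest to shortest (length >= 2).
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical vocabularies: a general one, and one extended with a
# domain word that an unknown-word extraction step might supply.
base_vocab = {"speech", "model", "language"}
domain_vocab = base_vocab | {"lattice"}

text = "speechlatticemodel"  # undelimited input
before = segment(text, base_vocab)
after = segment(text, domain_vocab)

print(before)
print(after)
```

With the base vocabulary, the unknown stretch is broken into single characters; with the extended vocabulary it is recovered as one word, which is the kind of token a downstream language model would then count.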

Author information

Authors and Affiliations

Speech Processing Team, ETRI, Daejeon, Korea
Euisok Chung, Hyung-Bae Jeon, Jeon-Gue Park & Yun-Keun Lee

Corresponding author

Correspondence to Euisok Chung.

Editor information

Editors and Affiliations

Dept. of Languages and Computer Systems, University of Granada, Granada, 18071, Spain
Ramón López-Cózar Delgado
Dept. of Computer Science & Engineering, Waseda University, Okubo 3-4-1, Tokyo, 169-8555, Japan
Tetsunori Kobayashi

About this paper

Cite this paper

Chung, E., Jeon, HB., Park, JG., Lee, YK. (2011). Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling. In: Delgado, RC., Kobayashi, T. (eds) Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1335-6_9

DOI: https://doi.org/10.1007/978-1-4614-1335-6_9
Published: 12 August 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1334-9
Online ISBN: 978-1-4614-1335-6
eBook Packages: Engineering, Engineering (R0)
