short-paper

Conditional Random Fields for Korean Morpheme Segmentation and POS Tagging

Author:

Seung-Hoon Na

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 14, Issue 3

Article No.: 10, Pages 1 - 16

https://doi.org/10.1145/2700051

Published: 12 June 2015

Abstract

There has been recent interest in statistical approaches to Korean morphological analysis. However, previous studies have been based mostly on generative models, such as hidden Markov models (HMMs), and have not used discriminative models such as conditional random fields (CRFs). We present a two-stage discriminative approach based on CRFs for Korean morphological analysis. Similar to methods used for Chinese, we perform two disambiguation procedures based on CRFs: (1) morpheme segmentation and (2) POS tagging. In morpheme segmentation, an input sentence is segmented into a sequence of morphemes, where a morpheme unit is either atomic or compound. In the POS tagging procedure, each morpheme (atomic or compound) is assigned a POS tag. Once POS tagging is complete, we post-process the compound morphemes: each compound morpheme is further decomposed into atomic morphemes using pre-analyzed patterns and generalized HMMs obtained from the given tagged corpus. Experimental results show the promise of our proposed method.
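
The data flow described in the abstract (segmentation, then tagging, then decomposition of compound morphemes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the B/I syllable labels, the function names `labels_to_segments` and `decompose`, the one-entry `PATTERNS` table, and the hand-written labels and tags that stand in for the output of the two CRFs. The generalized-HMM fallback for compounds missing from the pattern table is not modeled.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]  # (surface form, POS tag)

# Hypothetical pre-analyzed pattern table: a compound morpheme and its tag
# map to a sequence of atomic morphemes. Real tables would be extracted
# from a tagged corpus; this single entry is illustrative only.
PATTERNS: Dict[Morpheme, List[Morpheme]] = {
    ("갔", "VV+EP"): [("가", "VV"), ("았", "EP")],
}


def labels_to_segments(syllables: str, labels: List[str]) -> List[str]:
    """Turn per-syllable B/I labels (stage 1 output) into morpheme units."""
    segments: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label == "B" or not segments:
            segments.append(syllable)   # start a new morpheme unit
        else:
            segments[-1] += syllable    # continue the current unit
    return segments


def decompose(tagged: List[Morpheme]) -> List[Morpheme]:
    """Post-process: expand compound morphemes found in the pattern table."""
    atomic: List[Morpheme] = []
    for morpheme in tagged:
        # Units without a pattern entry are assumed to be atomic already.
        atomic.extend(PATTERNS.get(morpheme, [morpheme]))
    return atomic


# Hand-written labels and tags stand in for the two CRF predictions.
segments = labels_to_segments("학교에갔다", ["B", "I", "B", "B", "B"])
tagged = list(zip(segments, ["NNG", "JKB", "VV+EP", "EF"]))
print(decompose(tagged))
```

In this toy run the compound unit 갔 (tagged VV+EP) is the only one that the post-processing step expands; all other units pass through unchanged.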

Index Terms

Conditional Random Fields for Korean Morpheme Segmentation and POS Tagging
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 14, Issue 3

June 2015

90 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/2791399

Editor:
Richard Sproat
Google, Inc., USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2015

Accepted: 01 September 2014

Revised: 01 September 2014

Received: 01 June 2014

Published in TALLIP Volume 14, Issue 3

Qualifiers

Short-paper
Research
Refereed

Funding Sources

Busan University of Foreign Studies
IT R&D program of MSIP/KEIT

Bibliometrics

Total Citations: 13
Total Downloads: 605
Downloads (last 12 months): 28
Downloads (last 6 weeks): 1

Reflects downloads up to 29 Nov 2024

Cited By

Ryu J, Lim S, Kwon O, Na S (2024). Transformer-based reranking for improving Korean morphological analysis systems. ETRI Journal 46:1 (137-153). Online publication date: 28-Feb-2024.
https://doi.org/10.4218/etrij.2023-0364
Park J, Kim M (2024). Word segmentation granularity in Korean. Korean Linguistics 20:1 (82-112). Online publication date: 30-May-2024.
https://doi.org/10.1075/kl.00008.par
Kim D, Ahn S, Lee E, Seo Y (2024). Morpheme-based Korean text cohesion analyzer. SoftwareX 26 (101659). Online publication date: May-2024.
https://doi.org/10.1016/j.softx.2024.101659
Rajani Shree M, Shambhavi B (2022). POS Tagger Model for South Indian Language Using a Deep Learning Approach. ICCCE 2021 (155-167). Online publication date: 16-May-2022.
https://doi.org/10.1007/978-981-16-7985-8_16
Ding C, Aye H, Pa W, Nwet K, Soe K, Utiyama M, Sumita E (2019). Towards Burmese (Myanmar) Morphological Analysis. ACM Transactions on Asian and Low-Resource Language Information Processing 19:1 (1-34). Online publication date: 31-May-2019.
https://dl.acm.org/doi/10.1145/3325885
Nguyen Q, Vo A, Shin J, Tran P, Ock C (2019). Building a Korean-Vietnamese Neural Machine Translation System with Korean Morphological Analysis and Word Sense Disambiguation. IEEE Access (1-1). Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2902270
Yu H, An J, Yoon J, Kim H, Ko Y (2019). Simple Methods to Overcome the Limitations of General Word Representations in Natural Language Processing Tasks. Computer Speech & Language. Online publication date: Jun-2019.
https://doi.org/10.1016/j.csl.2019.04.009
Na S, Kim H, Min J, Kim K (2019). Improving LSTM CRFs using character-based compositions for Korean named entity recognition. Computer Speech & Language 54 (106-121). Online publication date: Mar-2019.
https://doi.org/10.1016/j.csl.2018.09.005
Na S, Kim Y (2018). Phrase-Based Statistical Model for Korean Morpheme Segmentation and POS Tagging. IEICE Transactions on Information and Systems E101.D:2 (512-522). Online publication date: 2018.
https://doi.org/10.1587/transinf.2017EDP7085
Na S, Li J, Shin J, Kim K (2018). Transition-Based Korean Dependency Parsing Using Hybrid Word Representations of Syllables and Morphemes with LSTMs. ACM Transactions on Asian and Low-Resource Language Information Processing 18:2 (1-20). Online publication date: 14-Dec-2018.
https://dl.acm.org/doi/10.1145/3241745
