short-paper

Conditional Random Fields for Korean Morpheme Segmentation and POS Tagging

Published: 12 June 2015

Abstract

There has been recent interest in statistical approaches to Korean morphological analysis. However, previous studies have relied mostly on generative models, such as hidden Markov models (HMMs), and have not utilized discriminative models such as conditional random fields (CRFs). We present a two-stage discriminative approach based on CRFs for Korean morphological analysis. Similar to methods used for Chinese, we perform two CRF-based disambiguation procedures: (1) morpheme segmentation and (2) POS tagging. In morpheme segmentation, an input sentence is segmented into a sequence of morphemes, where each morpheme unit is either atomic or compound. In POS tagging, each morpheme (atomic or compound) is assigned a POS tag. Once POS tagging is complete, we post-process the compound morphemes: each compound morpheme is further decomposed into atomic morphemes using pre-analyzed patterns and generalized HMMs obtained from the given tagged corpus. Experimental results show the promise of our proposed method.
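
The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the CRF outputs are stubbed as hand-written inputs, the B/I syllable labeling is borrowed from character-tagging approaches to Chinese segmentation, and the function names, the compound tag `VV+EP`, and the pattern table are all hypothetical. The generalized-HMM fallback for compounds missing from the pattern table is omitted.

```python
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (surface form, POS tag)


def labels_to_morphemes(syllables: List[str], labels: List[str]) -> List[str]:
    """Stage 1 output handling: group syllables into morpheme units.

    Assumes a B/I labeling (B = begins a morpheme, I = continues it),
    as a segmentation CRF might predict. The label set is an assumption.
    """
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")
    morphemes: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label == "B" or not morphemes:
            morphemes.append(syllable)   # start a new morpheme unit
        else:
            morphemes[-1] += syllable    # extend the current unit
    return morphemes


def decompose_compounds(
    tagged: List[Tagged], patterns: Dict[Tagged, List[Tagged]]
) -> List[Tagged]:
    """Post-processing: expand compound morphemes via pre-analyzed patterns.

    Units without a matching pattern are passed through unchanged.
    """
    result: List[Tagged] = []
    for unit in tagged:
        result.extend(patterns.get(unit, [unit]))
    return result


# Hypothetical stage outputs for "학교에 갔다" ("went to school").
syllables = ["학", "교", "에", "갔", "다"]
seg_labels = ["B", "I", "B", "B", "B"]            # stubbed segmentation output
morphemes = labels_to_morphemes(syllables, seg_labels)

pos_tags = ["NNG", "JKB", "VV+EP", "EF"]          # stubbed POS tagging output
tagged = list(zip(morphemes, pos_tags))

# Hypothetical pre-analyzed pattern: compound "갔" = verb stem + past marker.
patterns = {("갔", "VV+EP"): [("가", "VV"), ("았", "EP")]}
analysis = decompose_compounds(tagged, patterns)
print(analysis)
```

Keeping "갔" as a single compound unit through both stages means the taggers work on surface syllables only, and the stem and past-tense marker are restored in the final lookup step.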

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 3
    June 2015
    90 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/2791399
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 June 2015
    Accepted: 01 September 2014
    Revised: 01 September 2014
    Received: 01 June 2014
    Published in TALLIP Volume 14, Issue 3

    Author Tags

    1. Conditional random fields
    2. Korean morphological analysis
    3. POS tagging
    4. morpheme segmentation

    Qualifiers

    • Short-paper
    • Research
    • Refereed

    Funding Sources

    • Busan University of Foreign Studies
    • IT R&D program of MSIP/KEIT

    Cited By

    • (2024) Transformer-based reranking for improving Korean morphological analysis systems. ETRI Journal 46:1 (137-153). DOI: 10.4218/etrij.2023-0364. Online publication date: 28-Feb-2024
    • (2024) Word segmentation granularity in Korean. Korean Linguistics 20:1 (82-112). DOI: 10.1075/kl.00008.par. Online publication date: 30-May-2024
    • (2024) Morpheme-based Korean text cohesion analyzer. SoftwareX 26 (101659). DOI: 10.1016/j.softx.2024.101659. Online publication date: May-2024
    • (2022) POS Tagger Model for South Indian Language Using a Deep Learning Approach. ICCCE 2021 (155-167). DOI: 10.1007/978-981-16-7985-8_16. Online publication date: 16-May-2022
    • (2019) Towards Burmese (Myanmar) Morphological Analysis. ACM Transactions on Asian and Low-Resource Language Information Processing 19:1 (1-34). DOI: 10.1145/3325885. Online publication date: 31-May-2019
    • (2019) Building a Korean-Vietnamese Neural Machine Translation System with Korean Morphological Analysis and Word Sense Disambiguation. IEEE Access (1-1). DOI: 10.1109/ACCESS.2019.2902270. Online publication date: 2019
    • (2019) Simple Methods to Overcome the Limitations of General Word Representations in Natural Language Processing Tasks. Computer Speech & Language. DOI: 10.1016/j.csl.2019.04.009. Online publication date: Jun-2019
    • (2019) Improving LSTM CRFs using character-based compositions for Korean named entity recognition. Computer Speech & Language 54 (106-121). DOI: 10.1016/j.csl.2018.09.005. Online publication date: Mar-2019
    • (2018) Phrase-Based Statistical Model for Korean Morpheme Segmentation and POS Tagging. IEICE Transactions on Information and Systems E101.D:2 (512-522). DOI: 10.1587/transinf.2017EDP7085. Online publication date: 2018
    • (2018) Transition-Based Korean Dependency Parsing Using Hybrid Word Representations of Syllables and Morphemes with LSTMs. ACM Transactions on Asian and Low-Resource Language Information Processing 18:2 (1-20). DOI: 10.1145/3241745. Online publication date: 14-Dec-2018
