Short Paper

Korean Part-of-speech Tagging Based on Morpheme Generation

Published: 09 January 2020

Abstract

Two major problems in Korean part-of-speech (POS) tagging are that a word-spacing unit does not map one-to-one to a POS tag and that morphemes must be recovered during tagging. This article therefore proposes a novel two-step Korean POS tagger that addresses both problems. The tagger first generates a sequence of lemmatized and recovered morphemes, each of which can be mapped one-to-one to a POS tag, using an encoder-decoder architecture trained on a POS-tagged corpus. The POS tag of each morpheme in the generated sequence is then determined by a standard sequence labeling method. Since the knowledge for segmenting and recovering morphemes is extracted automatically from the POS-tagged corpus by the encoder-decoder architecture, the POS tagger is constructed without a dictionary or handcrafted linguistic rules. Experimental results on a standard dataset show that the proposed method outperforms existing POS taggers, achieving state-of-the-art performance.
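
The abstract describes a two-step pipeline: an encoder-decoder that rewrites each word-spacing unit (eojeol) into a sequence of lemmatized, recovered morphemes, followed by a standard sequence labeler over those morphemes. The sketch below is a minimal illustration of that pipeline in PyTorch, not the authors' implementation: the class names, hyperparameters, the character-level GRU encoder-decoder, the '+'-separated morpheme output convention, and the BiLSTM-softmax labeler used in place of the paper's unspecified sequence labeling method are all illustrative assumptions.

import torch
import torch.nn as nn

class MorphemeGenerator(nn.Module):
    # Step 1 (illustrative): character-level encoder-decoder that maps the raw
    # characters of one eojeol to the characters of its '+'-separated,
    # lemmatized/recovered morpheme sequence (teacher forcing during training).
    def __init__(self, char_vocab, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(char_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(hidden, 2 * hidden, batch_first=True)
        self.project = nn.Linear(2 * hidden, char_vocab)

    def forward(self, src_chars, tgt_chars):
        # src_chars: (batch, src_len) ids of the input eojeol's characters
        # tgt_chars: (batch, tgt_len) ids of the gold morpheme-sequence characters
        _, h = self.encoder(self.embed(src_chars))         # h: (2, batch, hidden)
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # merge both directions
        out, _ = self.decoder(self.embed(tgt_chars), h0)   # (batch, tgt_len, 2*hidden)
        return self.project(out)                           # next-character logits

class MorphemeTagger(nn.Module):
    # Step 2 (illustrative): sequence labeling over the generated morphemes;
    # a plain BiLSTM + softmax stands in for the paper's sequence labeler.
    def __init__(self, morph_vocab, num_tags, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(morph_vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.project = nn.Linear(2 * hidden, num_tags)

    def forward(self, morpheme_ids):
        # morpheme_ids: (batch, num_morphemes) ids of the generated morphemes
        states, _ = self.lstm(self.embed(morpheme_ids))
        return self.project(states)                        # per-morpheme POS logits

# Shape check with random ids (hypothetical vocabulary and tag-set sizes).
gen = MorphemeGenerator(char_vocab=2000)
tag = MorphemeTagger(morph_vocab=50000, num_tags=45)
char_logits = gen(torch.randint(0, 2000, (8, 10)), torch.randint(0, 2000, (8, 14)))
pos_logits = tag(torch.randint(0, 50000, (8, 6)))
print(char_logits.shape, pos_logits.shape)  # (8, 14, 2000) and (8, 6, 45)

At inference time the generator would decode characters greedily or with beam search rather than with teacher forcing, and the decoded morphemes would be fed to the tagger; attention and copying mechanisms, which the cited related work suggests, are omitted here for brevity.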

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 3
May 2020
228 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3378675

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2020
Accepted: 01 November 2019
Revised: 01 July 2019
Received: 01 September 2017
Published in TALLIP Volume 19, Issue 3

Author Tags

  1. Part-of-speech tagging
  2. morpheme generation
  3. morphologically complex languages

Qualifiers

  • Short-paper
  • Research
  • Refereed

Funding Sources

  • Ministry of Education
  • Basic Science Research Program through the National Research Foundation of Korea (NRF)

Cited By

  • Transformer-based reranking for improving Korean morphological analysis systems. ETRI Journal 46(1), 137-153 (2024). https://doi.org/10.4218/etrij.2023-0364
  • Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers. ACM Transactions on Asian and Low-Resource Language Information Processing 23(2), 1-23 (2024). https://doi.org/10.1145/3637877
  • Word segmentation granularity in Korean. Korean Linguistics 20(1), 82-112 (2024). https://doi.org/10.1075/kl.00008.par
  • A part of speech tagger for Yoruba language text using deep neural network. Franklin Open 9, 100185 (2024). https://doi.org/10.1016/j.fraope.2024.100185
  • Identifying Relation Between Miriek and Kenyah Badeng Language by Using Morphological Analyzer. In 2022 International Conference on Asian Language Processing (IALP), 116-121 (2022). https://doi.org/10.1109/IALP57159.2022.9961253
  • Capitalization Feature and Learning Rate for Improving NER Based on RNN BiLSTM-CRF. In 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), 398-403 (2022). https://doi.org/10.1109/CyberneticsCom55287.2022.9865660
  • POS Tagger Model for South Indian Language Using a Deep Learning Approach. In ICCCE 2021, 155-167 (2022). https://doi.org/10.1007/978-981-16-7985-8_16
  • A Hierarchical Sequence-to-Sequence Model for Korean POS Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 20(2), 1-13 (2021). https://doi.org/10.1145/3421762
