Korean Part-of-speech Tagging Based on Morpheme Generation

Published: 09 January 2020

Abstract

Korean part-of-speech (POS) tagging faces two major problems: word-spacing units do not map one-to-one to POS tags, and morphemes must be recovered during tagging. This article therefore proposes a novel two-step Korean POS tagger that addresses both problems. The tagger first uses an encoder-decoder architecture to generate a sequence of lemmatized and recovered morphemes, each of which can be mapped one-to-one to a POS tag. The POS tag of each morpheme in the generated sequence is then determined by a standard sequence labeling method. Because the knowledge for segmenting and recovering morphemes is extracted automatically from a POS-tagged corpus by the encoder-decoder, the tagger is built without a dictionary or handcrafted linguistic rules. Experimental results on a standard dataset show that the proposed method outperforms existing POS taggers and achieves state-of-the-art performance.
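
As a rough illustration of this two-step pipeline, the PyTorch sketch below pairs a character-level encoder-decoder that rewrites each word-spacing unit (eojeol) into lemmatized and recovered morphemes with a sequence labeler that assigns one POS tag per generated morpheme. It is a minimal, untrained sketch: the GRU cells, greedy decoding, BiLSTM-plus-softmax labeler (standing in for a CRF-style labeler), and all vocabulary sizes and hyperparameters are illustrative assumptions, not the architecture reported in the article.

```python
# Minimal, untrained sketch of the two-step tagger described in the abstract.
# Assumptions (not from the paper): GRU cells, greedy decoding, a softmax
# output layer instead of a CRF-style labeler, and the toy sizes below.
import torch
import torch.nn as nn


class MorphemeGenerator(nn.Module):
    """Step 1: character-level encoder-decoder that rewrites one eojeol
    (word-spacing unit) into lemmatized/recovered morpheme characters."""

    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, bos_id, eos_id, max_len=40):
        _, h = self.encoder(self.src_emb(src_ids))       # encode surface characters
        tok = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):                         # greedy decoding
            dec_out, h = self.decoder(self.tgt_emb(tok), h)
            tok = self.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
            outputs.append(tok)
            if (tok == eos_id).all():
                break
        return torch.cat(outputs, dim=1)                 # generated morpheme characters


class MorphemeTagger(nn.Module):
    """Step 2: standard sequence labeling over the generated morphemes
    (here a BiLSTM with a per-morpheme softmax)."""

    def __init__(self, morph_vocab, num_tags, dim=128):
        super().__init__()
        self.emb = nn.Embedding(morph_vocab, dim)
        self.bilstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, num_tags)

    def forward(self, morph_ids):
        h, _ = self.bilstm(self.emb(morph_ids))
        return self.out(h).argmax(dim=-1)                # one POS tag per morpheme


if __name__ == "__main__":
    # Toy ids only; a real system would map Korean characters/morphemes to ids
    # and predict a Korean POS tag set (e.g., Sejong tags).
    gen = MorphemeGenerator(src_vocab=100, tgt_vocab=100)
    tagger = MorphemeTagger(morph_vocab=500, num_tags=45)
    eojeol_chars = torch.randint(2, 100, (1, 6))         # one eojeol, 6 surface characters
    morphemes = gen(eojeol_chars, bos_id=0, eos_id=1)    # step 1: generate morphemes
    # In practice the generated characters are regrouped into morpheme tokens
    # before tagging; random ids stand in for that remapping here.
    tags = tagger(torch.randint(0, 500, (1, morphemes.size(1))))
    print(morphemes.shape, tags.shape)
```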



Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 3
May 2020
228 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3378675
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2020
Accepted: 01 November 2019
Revised: 01 July 2019
Received: 01 September 2017
Published in TALLIP Volume 19, Issue 3

Author Tags

  1. Part-of-speech tagging
  2. morpheme generation
  3. morphologically complex languages

Qualifiers

  • Short-paper
  • Research
  • Refereed

Funding Sources

  • Ministry of Education
  • Basic Science Research Program through the National Research Foundation of Korea (NRF)

Article Metrics

  • Downloads (Last 12 months): 41
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 28 Nov 2024

Cited By

  • (2024) Transformer-based reranking for improving Korean morphological analysis systems. ETRI Journal 46(1), 137-153. DOI: 10.4218/etrij.2023-0364. Online publication date: 28-Feb-2024.
  • (2024) Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers. ACM Transactions on Asian and Low-Resource Language Information Processing 23(2), 1-23. DOI: 10.1145/3637877. Online publication date: 8-Feb-2024.
  • (2024) Word segmentation granularity in Korean. Korean Linguistics 20(1), 82-112. DOI: 10.1075/kl.00008.par. Online publication date: 30-May-2024.
  • (2024) A part of speech tagger for Yoruba language text using deep neural network. Franklin Open 9, 100185. DOI: 10.1016/j.fraope.2024.100185. Online publication date: Dec-2024.
  • (2022) Identifying Relation Between Miriek and Kenyah Badeng Language by Using Morphological Analyzer. 2022 International Conference on Asian Language Processing (IALP), 116-121. DOI: 10.1109/IALP57159.2022.9961253. Online publication date: 27-Oct-2022.
  • (2022) Capitalization Feature and Learning Rate for Improving NER Based on RNN BiLSTM-CRF. 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), 398-403. DOI: 10.1109/CyberneticsCom55287.2022.9865660. Online publication date: 16-Jun-2022.
  • (2022) POS Tagger Model for South Indian Language Using a Deep Learning Approach. ICCCE 2021, 155-167. DOI: 10.1007/978-981-16-7985-8_16. Online publication date: 16-May-2022.
  • (2021) A Hierarchical Sequence-to-Sequence Model for Korean POS Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 20(2), 1-13. DOI: 10.1145/3421762. Online publication date: 23-Apr-2021.
