Using Sub-character Level Information for Neural Machine Translation of Logographic Languages

Published: 15 April 2021

Abstract

Logographic languages (e.g., Chinese) and alphabetic languages (e.g., English) use fundamentally different writing systems. Languages that share a writing system usually share more surface information, which can be exploited to facilitate natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters shared by Chinese and Japanese, decomposing them into smaller units and thereby making better use of the information the two languages share during both the encoding and decoding stages of NMT training. Experiments show that the proposed method robustly improves NMT performance for both a “logographic” language pair (JA–ZH) and “logographic + alphabetic” language pairs (JA–EN and ZH–EN), in both supervised and unsupervised NMT scenarios. Moreover, because the decomposed sequences are usually very long, extra position features for the Transformer encoder help with modeling these long sequences. The results also suggest that linguistic features can be manipulated to obtain higher shared-token rates and thereby further improve the performance of natural language processing systems.
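As a toy illustration of the core idea (this is not the authors' code, and the tiny hand-written decomposition table below stands in for a full ideograph decomposition database), the sketch decomposes logographic characters into sub-character units and shows how decomposition can raise the shared-token rate between a Japanese and a Chinese string whose surface characters do not overlap at all:

```python
# Illustrative only: a minimal decomposition table; a real system would
# derive these mappings from an ideograph database covering all CJK characters.
DECOMP = {
    "姉": ["女", "市"],  # JA 'elder sister'
    "妈": ["女", "马"],  # ZH 'mother'
    "話": ["言", "舌"],  # 'speech'
    "語": ["言", "吾"],  # 'language'
}

def decompose(text):
    """Replace each character with its sub-character units when known."""
    units = []
    for ch in text:
        units.extend(DECOMP.get(ch, [ch]))  # fall back to the character itself
    return units

def share_rate(a_tokens, b_tokens):
    """Fraction of the joint vocabulary that occurs on both sides."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b)

ja, zh = "姉話", "妈語"
print(share_rate(list(ja), list(zh)))            # 0.0 at character level
print(share_rate(decompose(ja), decompose(zh)))  # 1/3 after decomposition
```

Here the two strings share no characters, so the character-level share rate is 0; after decomposition they share the components 女 and 言, raising the rate to 1/3 and giving the NMT embedding layers common units to learn from.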

Cited By

  • (2024) Neural Machine Translation for Low-Resource Languages from a Chinese-centric Perspective: A Survey. ACM Transactions on Asian and Low-Resource Language Information Processing 23, 6, 1–60. DOI: 10.1145/3665244. Online publication date: 16-May-2024.
  • (2023) GA-SCS: Graph-Augmented Source Code Summarization. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 2, 1–19. DOI: 10.1145/3554820. Online publication date: 21-Feb-2023.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 2
March 2021, 313 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3454116

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 15 April 2021
      Accepted: 01 October 2020
      Revised: 01 September 2020
      Received: 01 November 2019
      Published in TALLIP Volume 20, Issue 2

      Author Tags

      1. Neural machine translation
      2. logographic languages
      3. shared information
      4. unsupervised NMT

      Qualifiers

      • Research-article
      • Refereed
