Using Sub-character Level Information for Neural Machine Translation of Logographic Languages

Published: 15 April 2021

Abstract

Logographic languages (e.g., Chinese) and alphabetic languages (e.g., English) use fundamentally different writing systems. Languages that share a writing system usually share more surface information, which can be exploited to facilitate natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters shared by Chinese and Japanese, decomposing them into smaller units and thereby making better use of the information the two languages share during both the encoding and decoding stages of NMT training. Experiments show that the proposed method robustly improves NMT performance for both a “logographic” language pair (JA–ZH) and “logographic + alphabetic” language pairs (JA–EN and ZH–EN), in both supervised and unsupervised NMT scenarios. Moreover, because the decomposed sequences are usually very long, extra position features for the Transformer encoder help with modeling these long sequences. The results also suggest that linguistic features can be manipulated to obtain higher shared-token rates and thereby further improve the performance of natural language processing systems.
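As a toy illustration of the core idea (this is not the authors' code, and the tiny hand-written decomposition table below stands in for a full ideograph decomposition database), the sketch decomposes logographic characters into sub-character units and shows how decomposition can raise the shared-token rate between a Japanese and a Chinese string whose surface characters do not overlap at all:

```python
# Illustrative only: a minimal decomposition table; a real system would
# derive these mappings from an ideograph database covering all CJK characters.
DECOMP = {
    "姉": ["女", "市"],  # JA 'elder sister'
    "妈": ["女", "马"],  # ZH 'mother'
    "話": ["言", "舌"],  # 'speech'
    "語": ["言", "吾"],  # 'language'
}

def decompose(text):
    """Replace each character with its sub-character units when known."""
    units = []
    for ch in text:
        units.extend(DECOMP.get(ch, [ch]))  # fall back to the character itself
    return units

def share_rate(a_tokens, b_tokens):
    """Fraction of the joint vocabulary that occurs on both sides."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b)

ja, zh = "姉話", "妈語"
print(share_rate(list(ja), list(zh)))            # 0.0 at character level
print(share_rate(decompose(ja), decompose(zh)))  # 1/3 after decomposition
```

Here the two strings share no characters, so the character-level share rate is 0; after decomposition they share the components 女 and 言, raising the rate to 1/3 and giving the NMT embedding layers common units to learn from.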

Cited By

  • (2024) Neural Machine Translation for Low-Resource Languages from a Chinese-centric Perspective: A Survey. ACM Transactions on Asian and Low-Resource Language Information Processing 23, 6, 1–60. DOI: 10.1145/3665244. Online publication date: 16-May-2024.
  • (2023) GA-SCS: Graph-Augmented Source Code Summarization. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 2, 1–19. DOI: 10.1145/3554820. Online publication date: 21-Feb-2023.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 2
March 2021, 313 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3454116

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 15 April 2021
      Accepted: 01 October 2020
      Revised: 01 September 2020
      Received: 01 November 2019
      Published in TALLIP Volume 20, Issue 2

      Author Tags

      1. Neural machine translation
      2. logographic languages
      3. shared information
      4. unsupervised NMT

      Qualifiers

      • Research-article
      • Refereed
