Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT

Yining Wang, Long Zhou, Jiajun Zhang & Chengqing Zong

Part of the book series: Communications in Computer and Information Science (CCIS, volume 787)

Included in the following conference series:

China Workshop on Machine Translation

Abstract

Neural machine translation (NMT) has become a new approach to machine translation and has been shown to outperform conventional statistical machine translation (SMT) across a variety of language pairs. Most existing NMT systems operate with a fixed vocabulary, but translation is an open-vocabulary problem. Hence, previous work mainly handles rare and unknown words by using different translation granularities, such as character, subword, and hybrid word-character. While translation involving Chinese has proved to be one of the most difficult tasks, no study has demonstrated which translation granularity is most suitable for Chinese in NMT. In this paper, we conduct an extensive comparison using Chinese-English NMT as a case study. Furthermore, we discuss the advantages and disadvantages of various translation granularities in detail. Our experiments show that the subword model performs best for Chinese-to-English translation, while the hybrid word-character model is most suitable for English-to-Chinese translation.
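
To make the granularities named in the abstract concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes the input is already word-segmented, uses a toy byte-pair-encoding (BPE) learner in the spirit of Sennrich et al. for the subword case, and approximates the hybrid word-character setting by keeping in-vocabulary words whole and spelling out the rest as characters. All function names here (`char_segment`, `hybrid_segment`, `learn_bpe`, `bpe_segment`, `merge_pair`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_segment(words):
    """Character granularity: every word is split into its characters."""
    return [ch for word in words for ch in word]


def hybrid_segment(words, vocab):
    """Simplified hybrid granularity (an assumption, not the paper's scheme):
    words in `vocab` stay whole, all other words fall back to characters."""
    units = []
    for word in words:
        if word in vocab:
            units.append(word)
        else:
            units.extend(word)
    return units


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def bpe_segment(word, merges):
    """Subword granularity: apply the learned merges to a word, in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    words = ["机器", "翻译", "系统"]
    print(char_segment(words))
    print(hybrid_segment(words, vocab={"翻译"}))
    merges = learn_bpe({"lower": 2, "lowest": 3, "newest": 6}, num_merges=4)
    print(bpe_segment("lowest", merges))
```

The three functions differ only in where they place unit boundaries: the character view never produces unknown units but lengthens sequences, the subword view learns frequent fragments from data, and the hybrid view trades between the two according to the word vocabulary.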

Notes

1. The corpora include LDC2000T50, LDC2002T01, LDC2002E18, LDC2003E07, LDC2003E14, LDC2003T17 and LDC2004T07.
2. https://github.com/isi-nlp/Zoph_RNN.
3. https://github.com/rsennrich/subword-nmt.
4. https://github.com/google/sentencepiece.

Acknowledgments

This research was funded by the Natural Science Foundation of China under Grant Nos. 61333018 and 61402478, and is also supported by the Strategic Priority Research Program of the CAS under Grant No. XDB02070007.

Author information

Authors and Affiliations

National Laboratory of Pattern Recognition, CASIA, University of Chinese Academy of Sciences, Beijing, China
Yining Wang, Long Zhou, Jiajun Zhang & Chengqing Zong
CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing, China
Chengqing Zong

Corresponding author

Correspondence to Jiajun Zhang.

Editor information

Editors and Affiliations

University of Macau, Macau SAR, China
Derek F. Wong
Soochow University, Suzhou, China
Deyi Xiong

About this paper

Cite this paper

Wang, Y., Zhou, L., Zhang, J., Zong, C. (2017). Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT. In: Wong, D., Xiong, D. (eds) Machine Translation. CWMT 2017. Communications in Computer and Information Science, vol 787. Springer, Singapore. https://doi.org/10.1007/978-981-10-7134-8_4

DOI: https://doi.org/10.1007/978-981-10-7134-8_4
Published: 14 November 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7133-1
Online ISBN: 978-981-10-7134-8
eBook Packages: Computer Science, Computer Science (R0)
