Abstract
Zero pronouns (ZPs) have attracted increasing interest in the machine translation community because of their importance and difficulty. However, previous studies generally evaluate the quality of ZP translation with BLEU scores on general MT test sets, which are neither expressive nor sensitive enough for accurate assessment. To bridge the data and evaluation gaps, we propose a benchmark test set and an evaluation metric for targeted evaluation of Chinese ZP translation. The human-annotated test set covers five challenging genres, which exhibit different characteristics of ZPs and enable comprehensive evaluation. We systematically revisit advanced models on ZP translation and identify current challenges for future exploration. We release our data, code, and trained models, which we hope will significantly promote research in this field.
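For concreteness, the sketch below illustrates one simple form a targeted ZP-translation metric can take: score only the annotated zero-pronoun positions rather than the whole sentence, as BLEU does. This is a minimal illustration under assumed inputs, not the metric released with the benchmark; the function name `zp_accuracy`, the bag-of-tokens matching rule, and the annotation layout are all hypothetical.

```python
import re

def zp_accuracy(hypotheses, zp_annotations):
    """Fraction of annotated zero pronouns whose expected target-side
    pronoun appears in the corresponding system translation.

    hypotheses:     list of system translations, one string per sentence.
    zp_annotations: zp_annotations[i] lists the English pronouns annotated
                    for the zero pronouns dropped in source sentence i.
    """
    matched = total = 0
    for hyp, pronouns in zip(hypotheses, zp_annotations):
        # Lowercased bag of word tokens in the hypothesis.
        tokens = set(re.findall(r"[a-z']+", hyp.lower()))
        for pron in pronouns:
            total += 1
            matched += pron.lower() in tokens
    return matched / total if total else 0.0

# Toy example: two dropped pronouns annotated, one recovered by the system.
hyps = ["it is on the table and we lost the keys"]
anns = [["it", "they"]]
print(f"ZP accuracy: {zp_accuracy(hyps, anns):.2f}")  # -> ZP accuracy: 0.50
```

Unlike corpus-level BLEU, such a targeted score changes only when the translation of an annotated ZP changes, which is what makes it sensitive enough for this phenomenon.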
Notes
Our released resources: https://github.com/longyuewangdcu/mZPRT.
An anaphoric pronoun is one whose reference must be specified by a noun phrase occurring earlier in the text. A non-anaphoric pronoun refers to an entity that is salient from larger units of discourse (such as full sentences or passages) or from the extralinguistic environment (outside of the text altogether).
A pronominal determiner phrase without phonological content.
We combined the movie-subtitle and Q&A-forum data as the training data for building ZPR models in other domains.
Acknowledgements
This work was supported in part by the Science and Technology Development Fund, Macau SAR (Grant Nos. FDCT/060/2022/AFJ, FDCT/0070/2022/AMJ) and the Multi-year Research Grant from the University of Macau (Grant No. MYRG2020-00054-FST).
Appendix: Benchmark in other evaluation metrics
See Table 11.