DOI: 10.1109/ICSE43902.2021.00047

Testing Machine Translation via Referential Transparency

Published: 05 November 2021

Abstract

Machine translation software has progressed rapidly in recent years thanks to advances in deep neural networks. People routinely use it in their daily lives for tasks such as ordering food in a foreign restaurant, receiving medical diagnosis and treatment from foreign doctors, and reading international political news online. However, due to the complexity and intractability of the underlying neural networks, modern machine translation software is still far from robust and can produce poor or incorrect translations; this can lead to misunderstanding, financial loss, threats to personal safety and health, and political conflicts. To address this problem, we introduce referentially transparent inputs (RTIs), a simple, widely applicable methodology for validating machine translation software. A referentially transparent input is a piece of text that should have similar translations when used in different contexts. Our practical implementation, Purity, detects when a translation breaks this property. To evaluate RTI, we use Purity to test Google Translate and Bing Microsoft Translator with 200 unlabeled sentences; it detected 123 and 142 erroneous translations, respectively, with high precision (79.3% and 78.3%). The translation errors are diverse, including examples of under-translation, over-translation, word/phrase mistranslation, incorrect modification, and unclear logic.
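
The RTI property described in the abstract can be illustrated with a minimal sketch. This is not Purity's implementation: the bag-of-words distance, the threshold, the function names, and the hard-coded `toy_translate` stand-in (with made-up English-to-Italian outputs) are all assumptions chosen only to show the idea that a phrase's translation should reappear when the phrase is embedded in a larger sentence.

```python
from collections import Counter
from typing import Callable, List, Tuple


def bow_distance(phrase_translation: str, context_translation: str) -> int:
    """Count words of the phrase translation that are absent from the context translation."""
    phrase_words = Counter(phrase_translation.lower().split())
    context_words = Counter(context_translation.lower().split())
    # Counter subtraction keeps only positive counts (multiset difference).
    missing = phrase_words - context_words
    return sum(missing.values())


def find_suspicious_pairs(
    rti: str,
    contexts: List[str],
    translate: Callable[[str], str],
    threshold: int = 0,
) -> List[Tuple[str, str, int]]:
    """Report contexts whose translation drops too much of the RTI's own translation."""
    suspicious = []
    rti_translation = translate(rti)
    for context in contexts:
        if rti not in context:
            continue  # the phrase must actually occur in the containing text
        distance = bow_distance(rti_translation, translate(context))
        if distance > threshold:
            suspicious.append((rti, context, distance))
    return suspicious


# Hypothetical stand-in for a real translation service; the outputs are
# invented, and the last one deliberately drops part of the phrase.
TOY_OUTPUTS = {
    "a movie based on a novel": "un film basato su un romanzo",
    "I watched a movie based on a novel": "ho visto un film basato su un romanzo",
    "she reviewed a movie based on a novel yesterday": "ieri ha recensito un film",
}


def toy_translate(text: str) -> str:
    return TOY_OUTPUTS[text]


RTI = "a movie based on a novel"
CONTEXTS = [
    "I watched a movie based on a novel",
    "she reviewed a movie based on a novel yesterday",
]

for phrase, context, distance in find_suspicious_pairs(RTI, CONTEXTS, toy_translate):
    print(f"suspicious: {context!r} (distance {distance})")
```

With the toy data, only the second context is reported, because its translation omits most of the phrase. A real setup would replace `toy_translate` with calls to the system under test and would likely need a nonzero threshold to tolerate legitimate variation such as inflection or word-order changes.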

Published In

ICSE '21: Proceedings of the 43rd International Conference on Software Engineering
May 2021
1768 pages
ISBN: 9781450390859

Publisher

IEEE Press

Author Tags

  1. Machine translation
  2. Metamorphic testing
  3. Referential transparency
  4. Testing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICSE '21

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (Last 12 months): 11
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 14 Dec 2024.

Cited By

  • (2025) Towards effectively testing machine translation systems from white-box perspectives. Empirical Software Engineering 30:1. DOI: 10.1007/s10664-024-10549-2. Online publication date: 1-Feb-2025
  • (2024) Word Closure-Based Metamorphic Testing for Machine Translation. ACM Transactions on Software Engineering and Methodology 33:8 (1-46). DOI: 10.1145/3675396. Online publication date: 22-Nov-2024
  • (2024) Automated Testing Linguistic Capabilities of NLP Models. ACM Transactions on Software Engineering and Methodology 33:7 (1-33). DOI: 10.1145/3672455. Online publication date: 14-Jun-2024
  • (2024) LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models. ACM Transactions on Software Engineering and Methodology 33:7 (1-38). DOI: 10.1145/3664812. Online publication date: 26-Aug-2024
  • (2024) Fairness Testing of Machine Translation Systems. ACM Transactions on Software Engineering and Methodology 33:6 (1-27). DOI: 10.1145/3664608. Online publication date: 27-Jun-2024
  • (2024) MTAS: A Reference-Free Approach for Evaluating Abstractive Summarization Systems. Proceedings of the ACM on Software Engineering 1:FSE (2561-2583). DOI: 10.1145/3660820. Online publication date: 12-Jul-2024
  • (2024) COSTELLO: Contrastive Testing for Embedding-Based Large Language Model as a Service Embeddings. Proceedings of the ACM on Software Engineering 1:FSE (906-928). DOI: 10.1145/3643767. Online publication date: 12-Jul-2024
  • (2024) Machine Translation Testing via Syntactic Tree Pruning. ACM Transactions on Software Engineering and Methodology 33:5 (1-39). DOI: 10.1145/3640329. Online publication date: 4-Jun-2024
  • (2024) Knowledge Graph Driven Inference Testing for Question Answering Software. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639109. Online publication date: 20-May-2024
  • (2024) Adopting machine translation in the healthcare sector. Computer Speech and Language 84:C. DOI: 10.1016/j.csl.2023.101582. Online publication date: 4-Mar-2024
