Word Closure-Based Metamorphic Testing for Machine Translation

Online AM: 01 July 2024

Abstract

With the wide application of machine translation, the testing of Machine Translation Systems (MTSs) has attracted much attention. Recent works apply Metamorphic Testing (MT) to address the oracle problem in MTS testing. Existing MT methods for MTSs generally follow a workflow of input transformation and output relation comparison: they generate a follow-up input sentence by mutating the source input, and then compare the source and follow-up output translations to detect translation errors. These methods use various input transformations to generate test case pairs and have successfully triggered numerous translation errors. However, they are limited in their ability to perform fine-grained and rigorous output relation comparison, and thus may report many false alarms and miss many true errors. In this paper, we propose a word closure-based output comparison method to address these limitations. We first propose the word closure as a new comparison unit, where each closure comprises a group of correlated input and output words in the test case pair. Word closures link each relevant fragment of the source output translation to its counterpart in the follow-up output for comparison. Next, we compare semantics at the level of word closures to identify translation errors. In this way, we perform a fine-grained and rigorous semantic comparison of the outputs and thus achieve more effective violation identification. We evaluate our method with test cases generated by five existing input transformations and with translation outputs from three popular MTSs. Results show that our method significantly outperforms existing works in violation identification, improving both precision and recall and achieving an average increase of 29.9% in F1 score. It also increases the F1 score of translation error localization by 35.9%.
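
To make the idea of a comparison unit concrete, the following is a minimal sketch, not the authors' implementation. It assumes that correlations between words are already available as pairwise links (for example from a word aligner) and that a closure can be approximated as a connected component of those links. The function names, the node labels (`src_in`, `src_out`, `fol_in`, `fol_out`), and the example sentence pair are all hypothetical.

```python
from collections import defaultdict


def word_closures(links):
    """Group linked words into closures.

    Assumption (not from the paper): a closure is approximated as a
    connected component of the word-link graph, found with union-find.
    Each node is a (side, word) tuple; each link is a pair of nodes.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


def output_fragments(closure):
    """Return the source-output and follow-up-output words of one closure."""
    src = [word for side, word in closure if side == "src_out"]
    fol = [word for side, word in closure if side == "fol_out"]
    return src, fol


# Hypothetical test case pair: "I like cats" mutated into "I like dogs",
# with made-up German output words and made-up alignment links.
example_links = [
    (("src_in", "cats"), ("src_out", "Katzen")),   # input-output alignment
    (("fol_in", "dogs"), ("fol_out", "Hunde")),    # input-output alignment
    (("src_in", "cats"), ("fol_in", "dogs")),      # mutated word pair
    (("src_in", "like"), ("src_out", "mag")),
    (("fol_in", "like"), ("fol_out", "mag")),
    (("src_in", "like"), ("fol_in", "like")),      # unchanged word pair
]

closures = word_closures(example_links)
fragments = [output_fragments(c) for c in closures]
```

Each entry of `fragments` pairs the output words that belong to the same closure, so a later semantic check could compare "Katzen" with "Hunde" (expected to differ) separately from "mag" with "mag" (expected to match), instead of comparing the two translations as whole sentences.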

Published In

ACM Transactions on Software Engineering and Methodology, Just Accepted
EISSN: 1557-7392
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 01 July 2024
Accepted: 11 June 2024
Revised: 04 May 2024
Received: 21 January 2024

Author Tags

  1. Machine translation
  2. Metamorphic testing
  3. Word closure
  4. Deep learning testing

Qualifiers

  • Research-article
