
DOI: 10.1145/3551349.3556953

Natural Test Generation for Precise Testing of Question Answering Software

Published: 05 January 2023

Abstract

Question answering (QA) software uses information retrieval and natural language processing techniques to automatically answer questions posed by humans in natural language. Like other AI-based software, QA software may contain bugs. To automatically test QA software without human labeling, previous work extracts facts from question–answer pairs and generates new questions to detect QA software bugs. Nevertheless, the generated questions can be ambiguous, confusing, or syntactically chaotic, making them unanswerable for QA software. As a result, a relatively large proportion of the reported bugs are false positives. In this work, we propose QAQA, a sentence-level mutation-based metamorphic testing technique for QA software. To eliminate false positives and achieve precise automatic testing, QAQA leverages five Metamorphic Relations (MRs) together with semantics-guided search and enhanced test oracles. Our evaluation on three QA datasets demonstrates that QAQA outperforms the state of the art in both the quantity (8,133 vs. 6,601 bugs) and the quality (97.67% vs. 49% true positive rate) of the reported bugs. Moreover, when adopted to fine-tune the QA software under test, the test inputs generated by QAQA reduce the MR violation rate from 44.29% to 20.51%.
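The core idea of metamorphic testing described above — mutate an input in a way that should not change the output, then flag disagreements as potential bugs — can be illustrated with a minimal, hypothetical sketch. The MR below (appending a semantically irrelevant sentence to the context should leave the answer unchanged) is one plausible sentence-level mutation for exposition only; it is not QAQA's exact implementation, and `toy_qa` is a stand-in for a real QA model.

```python
def toy_qa(context, question):
    # Stand-in QA "model": naive pattern matching, used only to make the
    # metamorphic-testing loop runnable without a real QA system.
    for sentence in context.split(". "):
        if "wrote" in sentence and "wrote" in question:
            return sentence.split(" wrote ")[0]
    return ""

def mutate_context(context, irrelevant_sentence):
    # Sentence-level mutation: append a semantically irrelevant sentence,
    # which should not affect the answer to the original question.
    return context.rstrip(".") + ". " + irrelevant_sentence

def check_mr(qa_fn, context, question, irrelevant_sentence):
    # Metamorphic relation: the answer must be invariant under the mutation.
    # A False result marks a suspicious (potentially buggy) behavior.
    original = qa_fn(context, question)
    mutated = qa_fn(mutate_context(context, irrelevant_sentence), question)
    return original == mutated

ctx = "Jane Austen wrote Pride and Prejudice."
q = "Who wrote Pride and Prejudice?"
print(check_mr(toy_qa, ctx, q, "The weather was pleasant that day."))  # → True
```

Because the oracle is the relation between outputs rather than a ground-truth answer, this style of test needs no human labeling — which is precisely what makes false-positive-prone question generation avoidable.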




Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages
ISBN:9781450394758
DOI:10.1145/3551349

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Metamorphic Testing
  2. Mutation
  3. Natural Language Processing
  4. Question Answering Software
  5. Test Generation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASE '22

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

