
DOI: 10.1145/3551349.3556953

Natural Test Generation for Precise Testing of Question Answering Software

Published: 05 January 2023

Abstract

Question answering (QA) software uses information retrieval and natural language processing techniques to automatically answer questions posed by humans in natural language. Like other AI-based software, QA software may contain bugs. To automatically test QA software without human labeling, previous work extracts facts from question–answer pairs and generates new questions to detect QA software bugs. Nevertheless, the generated questions can be ambiguous, confusing, or syntactically chaotic, making them unanswerable for QA software. As a result, a relatively large proportion of the reported bugs are false positives. In this work, we propose QAQA, a sentence-level mutation-based metamorphic testing technique for QA software. To eliminate false positives and achieve precise automatic testing, QAQA leverages five Metamorphic Relations (MRs) together with semantics-guided search and enhanced test oracles. Our evaluation on three QA datasets demonstrates that QAQA outperforms the state of the art in both the quantity (8,133 vs. 6,601 bugs) and the quality (97.67% vs. 49% true positive rate) of the reported bugs. Moreover, when adopted to fine-tune the QA software under test, the test inputs generated by QAQA reduce the MR violation rate from 44.29% to 20.51%.
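The core idea of metamorphic testing described above — mutate an input in a way that should not change the output, then flag disagreements as potential bugs — can be illustrated with a minimal, hypothetical sketch. The MR below (appending a semantically irrelevant sentence to the context should leave the answer unchanged) is one plausible sentence-level mutation for exposition only; it is not QAQA's exact implementation, and `toy_qa` is a stand-in for a real QA model.

```python
def toy_qa(context, question):
    # Stand-in QA "model": naive pattern matching, used only to make the
    # metamorphic-testing loop runnable without a real QA system.
    for sentence in context.split(". "):
        if "wrote" in sentence and "wrote" in question:
            return sentence.split(" wrote ")[0]
    return ""

def mutate_context(context, irrelevant_sentence):
    # Sentence-level mutation: append a semantically irrelevant sentence,
    # which should not affect the answer to the original question.
    return context.rstrip(".") + ". " + irrelevant_sentence

def check_mr(qa_fn, context, question, irrelevant_sentence):
    # Metamorphic relation: the answer must be invariant under the mutation.
    # A False result marks a suspicious (potentially buggy) behavior.
    original = qa_fn(context, question)
    mutated = qa_fn(mutate_context(context, irrelevant_sentence), question)
    return original == mutated

ctx = "Jane Austen wrote Pride and Prejudice."
q = "Who wrote Pride and Prejudice?"
print(check_mr(toy_qa, ctx, q, "The weather was pleasant that day."))  # → True
```

Because the oracle is the relation between outputs rather than a ground-truth answer, this style of test needs no human labeling — which is precisely what makes false-positive-prone question generation avoidable.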




Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages
ISBN:9781450394758
DOI:10.1145/3551349

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Metamorphic Testing
  2. Mutation
  3. Natural Language Processing
  4. Question Answering Software
  5. Test Generation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASE '22

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

