
DOI: 10.1145/3387940.3391465

Evaluating Surprise Adequacy for Question Answering

Published: 25 September 2020

Abstract

With the wide and rapid adoption of Deep Neural Networks (DNNs) in various domains, an urgent need to validate their behaviour has arisen, resulting in various test adequacy metrics for DNNs. One of these metrics, Surprise Adequacy (SA), aims to measure how surprising a new input is based on its similarity to the data used for training. While SA has been shown to be effective for image classifiers based on Convolutional Neural Networks (CNNs), it has not been studied for the Natural Language Processing (NLP) domain. This paper applies SA to NLP, in particular to the question answering task: the aim is to investigate whether SA correlates well with the correctness of answers. An empirical evaluation using the widely used Stanford Question Answering Dataset (SQuAD) shows that SA can work well as a test adequacy metric for the question answering task.
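
For context, the sketch below illustrates the core idea behind Likelihood-based Surprise Adequacy (LSA) as proposed in the original SA work by Kim et al.: fit a kernel density estimate over the activation traces that training inputs produce at a chosen layer, then score a new input by the negative log density of its own trace, so that inputs unlike anything seen during training receive high surprise values. This is a minimal illustration rather than the implementation evaluated in this paper; the function name and the use of SciPy's gaussian_kde are assumptions, and in practice the traces are typically pruned (e.g., by discarding low-variance neurons) to keep the density estimation tractable.

```python
import numpy as np
from scipy.stats import gaussian_kde

def likelihood_surprise(train_traces: np.ndarray, new_trace: np.ndarray) -> float:
    """Illustrative Likelihood-based Surprise Adequacy (LSA) score.

    train_traces: (n_samples, n_neurons) activations of a chosen layer,
                  collected over the training set.
    new_trace:    (n_neurons,) activations of the same layer for a new input.
    Returns the negative log density of the new trace under a Gaussian KDE
    fitted to the training traces; higher values mean "more surprising".
    """
    # gaussian_kde expects the data as (n_dims, n_samples).
    kde = gaussian_kde(train_traces.T)
    density = kde(new_trace.reshape(-1, 1))[0]
    # Clamp to avoid -inf when the new trace lies far outside the training data.
    return float(-np.log(max(density, 1e-300)))
```

Given such per-question scores for a SQuAD model, the correlation with answer correctness that the paper investigates could then be assessed by, for example, labelling each prediction as correct or incorrect (exact match, or F1 above some threshold) and measuring how well the SA score separates the two groups, e.g., with ROC-AUC.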

        Published In

ICSEW'20: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops
June 2020, 831 pages
ISBN: 9781450379632
DOI: 10.1145/3387940
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

        Author Tags

        1. Deep Learning
        2. Natural Language Processing
        3. Software Testing

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

Conference

ICSE '20: 42nd International Conference on Software Engineering
June 27 - July 19, 2020, Seoul, Republic of Korea

Cited By

• (2024) Neuron importance-aware coverage analysis for deep neural network testing. Empirical Software Engineering 29:5. DOI: 10.1007/s10664-024-10524-x. Online publication date: 25-Jul-2024.
• (2023) Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks. ACM Transactions on Software Engineering and Methodology 33:1, 1-29. DOI: 10.1145/3617593. Online publication date: 23-Nov-2023.
• (2023) Input Distribution Coverage: Measuring Feature Interaction Adequacy in Neural Network Testing. ACM Transactions on Software Engineering and Methodology 32:3, 1-48. DOI: 10.1145/3576040. Online publication date: 26-Apr-2023.
• (2023) Uncertainty quantification for deep neural networks: An empirical comparison and usage guidelines. Software Testing, Verification and Reliability 33:6. DOI: 10.1002/stvr.1840. Online publication date: 20-Jan-2023.
• (2022) CheapET-3: cost-efficient use of remote DNN models. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1811-1813. DOI: 10.1145/3540250.3559082. Online publication date: 7-Nov-2022.
• (2022) Quality assurance study with mismatched data in sentiment analysis. 2022 29th Asia-Pacific Software Engineering Conference (APSEC), 442-446. DOI: 10.1109/APSEC57359.2022.00059. Online publication date: Dec-2022.
• (2021) Improved Surprise Adequacy Tools for Corner Case Data Description and Detection. Applied Sciences 11:15 (6826). DOI: 10.3390/app11156826. Online publication date: 25-Jul-2021.
• (2021) Corner Case Data Description and Detection. 2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN), 19-26. DOI: 10.1109/WAIN52551.2021.00009. Online publication date: May-2021.
• (2021) A Review and Refinement of Surprise Adequacy. 2021 IEEE/ACM Third International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest), 17-24. DOI: 10.1109/DeepTest52559.2021.00009. Online publication date: Jun-2021.
• (2021) Multimodal Surprise Adequacy Analysis of Inputs for Natural Language Processing DNN Models. 2021 IEEE/ACM International Conference on Automation of Software Test (AST), 80-89. DOI: 10.1109/AST52587.2021.00017. Online publication date: May-2021.
