Abstract
Deep Neural Network (DNN) models are widely used for image classification. While they offer high accuracy, researchers are concerned about whether these models inappropriately make inferences using features irrelevant to the target object in a given image. To address this concern, we propose a metamorphic testing approach that assesses whether a given inference is made based on irrelevant features. Specifically, we propose two metamorphic relations (MRs) to detect such unreliable inferences. These relations expect (a) classification results with different labels, or the same labels but lower certainty, after corrupting the relevant features of images, and (b) classification results with the same labels after corrupting irrelevant features. Inferences that violate these metamorphic relations are regarded as unreliable. Our evaluation demonstrated that our approach can effectively identify unreliable inferences for single-label classification models, with an average precision of 64.1% and 96.4% for the two MRs, respectively. For multi-label classification models, the corresponding precision for MR-1 and MR-2 is 78.2% and 86.5%, respectively. Further, we conducted an empirical study to understand the problem of unreliable inferences in practice. Specifically, we applied our approach to 18 pre-trained single-label image classification models and 3 multi-label classification models, and then examined their inferences on the ImageNet and COCO datasets. We found that unreliable inferences are pervasive: for each model, thousands of correct classifications are actually made using irrelevant features. Next, we investigated the effect of such pervasive unreliable inferences and found that they can significantly degrade a model's overall accuracy; after excluding these unreliable inferences from the test set, a model's accuracy can change significantly.
Therefore, we recommend that developers pay more attention to these unreliable inferences during model evaluation. We also explored the correlation between model accuracy and the number of unreliable inferences, and found that inferences on inputs containing smaller objects are more likely to be unreliable. Lastly, we found that current model training methodologies can guide models to learn object-relevant features to a certain extent, but may not necessarily prevent them from making unreliable inferences. We encourage the community to propose more effective training methodologies to address this issue.
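The checking logic behind the two metamorphic relations summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `classify` is a hypothetical stand-in for a DNN classifier returning a (label, confidence) pair, and the `corrupt_*` callables stand in for image corruptions applied to the object-relevant region or to the irrelevant background.

```python
def violates_mr1(classify, image, corrupt_relevant):
    """MR-1: corrupting object-relevant features should change the
    predicted label or lower the confidence. If the prediction stays
    the same and is at least as confident, the inference likely relied
    on irrelevant features, so MR-1 is violated."""
    label, confidence = classify(image)
    label_c, confidence_c = classify(corrupt_relevant(image))
    return label_c == label and confidence_c >= confidence


def violates_mr2(classify, image, corrupt_irrelevant):
    """MR-2: corrupting object-irrelevant features (e.g. background)
    should leave the predicted label unchanged. A changed label means
    the inference depended on irrelevant features, violating MR-2."""
    label, _ = classify(image)
    label_c, _ = classify(corrupt_irrelevant(image))
    return label_c != label
```

An inference flagged by either predicate would be reported as unreliable; in the paper, the corruptions are concrete image transformations (the sketch leaves them abstract).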
Notes
The latter study refers to this concept as “prediction confidence”.
In the Chi-square test, it is usually referred to as Cramér’s V (Cramer 1946).
TResNet-L: https://github.com/Alibaba-MIIL/ASL, ResNet-50: https://github.com/ARiSE-Lab/DeepInspect
MR-3/4/5/6 are only initial proposals; their detailed definitions need to be refined and their effectiveness thoroughly evaluated.
References
Aggarwal A, Lohia P, Nagar S, Dey K, Saha D (2019) Black box fairness testing of machine learning models. In: Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, association for computing machinery, ESEC/FSE 2019, New York, NY, USA, pp 625–635. https://doi.org/10.1145/3338906.3338937
Barr ET, Harman M, McMinn P, Shahbaz M, Yoo S (2015) The oracle problem in software testing: A survey. IEEE Trans Softw Eng 41 (5):507–525
Ben-Baruch E, Ridnik T, Zamir N, Noy A, Friedman I, Protter M, Zelnik-Manor L (2020) Asymmetric loss for multi-label classification. arXiv:2009.14119
Benesty J, Chen J, Huang Y, Cohen I (2009) Pearson correlation coefficient. In: Noise reduction in speech processing. Springer, pp 1–4
Carlini N, Wagner DA (2017) Towards evaluating the robustness of neural networks. In: 2017 IEEE symposium on security and privacy, SP 2017, May 22-26, 2017. IEEE Computer Society, San Jose, CA, USA, pp 39–57. https://doi.org/10.1109/SP.2017.49
Chen TY, Cheung SC, Yiu SM (1998) Metamorphic testing: a new approach for generating next test cases. Tech. Rep. HKUST-CS98-01 Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong
Chen TY, Kuo FC, Liu H, Poon PL, Towey D, Tse TH, Zhou ZQ (2018) Metamorphic testing: A review of challenges and opportunities. ACM Comput Surv 51(1):4:1–4:27. https://doi.org/10.1145/3143561
Chollet F, et al. (2015a) Keras. https://keras.io
Chollet F, et al. (2015b) Keras applications. https://keras.io/api/applications/
Cochran W (1963) Sampling techniques, 2nd edn. [Wiley Publications in Statistics.], John Wiley & Sons, New York
Cramer H (1946) Mathematical methods of statistics. Princeton University Press, Princeton
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: A large-scale hierarchical image database. In: CVPR09
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T (eds) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, pp 4171–4186. https://doi.org/10.18653/v1/n19-1423
Ding J, Kang X, Hu X (2017) Validating a deep learning framework by metamorphic testing. In: 2017 IEEE/ACM 2nd international workshop on metamorphic testing (MET), pp 28–34. https://doi.org/10.1109/MET.2017.2
Dwarakanath A, Ahuja M, Sikand S, Rao RM, Bose RPJC, Dubash N, Podder S (2018) Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In: Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis, ISSTA 2018. ACM, New York, NY, USA, pp 118–128. https://doi.org/10.1145/3213846.3213858
Fahmy H, Pastore F, Bagherzadeh M, Briand L (2020) Supporting dnn safety analysis and retraining through heatmap-based unsupervised learning. arXiv:2002.00863
Fellbaum C (2006) Wordnet(s). In: Brown K (ed) Encyclopedia of language & linguistics. 2nd edn. Elsevier, Oxford, pp 665–670. https://doi.org/10.1016/B0-08-044854-2/00946-9
Freund Y, Schapire RE (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In: Vitányi P (ed) Computational learning theory. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 23–37
Pearson K (1900) X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond Edinb Dublin Philos Mag J Sci 50(302):157–175. https://doi.org/10.1080/14786440009463897
Geirhos R, Rubisch P, Michaelis C, Bethge M, Wichmann FA, Brendel W (2019) Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In: 7th International conference on learning representations, ICLR 2019, May 6-9, 2019, OpenReview.net, New Orleans, LA, USA. https://openreview.net/forum?id=Bygh9j09KX
Gu T, Liu K, Dolan-Gavitt B, Garg S (2019) Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access 7:47230–47244. https://doi.org/10.1109/ACCESS.2019.2909068
Guo J, Jiang Y, Zhao Y, Chen Q, Sun J (2018) Dlfuzz: Differential fuzzing testing of deep learning systems. In: Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, association for computing machinery, ESEC/FSE 2018, New York, NY, USA, pp 739–743. https://doi.org/10.1145/3236024.3264835
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861
Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, July 21-26, 2017. IEEE Computer Society, Honolulu, HI, USA, pp 2261–2269. https://doi.org/10.1109/CVPR.2017.243
Krasin I, Duerig T, Alldrin N, Ferrari V, Abu-El-Haija S, Kuznetsova A, Rom H, Uijlings J, Popov S, Kamali S, Malloci M, Pont-Tuset J, Veit A, Belongie S, Gomes V, Gupta A, Sun C, Chechik G, Cai D, Feng Z, Narayanan D, Murphy K (2017) Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html
Krizhevsky A, Nair V, Hinton G (2009) The cifar-10 dataset. http://www.cs.toronto.edu/~kriz/cifar.html
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems 25. Curran Associates, Inc., pp 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
LeCun Y, Cortes C (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
Lin T, Maire M, Belongie SJ, Bourdev LD, Girshick RB, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. arXiv:1405.0312
Lin Y, Lv F, Zhu S, Yang M, Cour T, Yu K, Cao L, Huang T (2011) Large-scale image classification: Fast feature extraction and svm training. In: CVPR 2011, pp 1689–1696
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. In: Leibe B, Matas J, Sebe N, Welling M (eds) Computer vision – ECCV 2016. Springer International Publishing, Cham, pp 21–37
Ma L, Juefei-Xu F, Zhang F, Sun J, Xue M, Li B, Chen C, Su T, Li L, Liu Y, Zhao J, Wang Y (2018a) Deepgauge: Multi-granularity testing criteria for deep learning systems. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, ASE 2018. ACM, New York, NY, USA, pp 120–131. https://doi.org/10.1145/3238147.3238202
Ma L, Zhang F, Sun J, Xue M, Li B, Juefei-Xu F, Xie C, Li L, Liu Y, Zhao J, Wang Y (2018b) Deepmutation: Mutation testing of deep learning systems. In: Ghosh S, Natella R, Cukic B, Poston R, Laranjeiro N (eds) 29th IEEE international symposium on software reliability engineering, ISSRE 2018, October 15-18, 2018. IEEE Computer Society, Memphis, TN, USA, pp 100–111. https://doi.org/10.1109/ISSRE.2018.00021
Ma S, Liu Y, Lee WC, Zhang X, Grama A (2018c) Mode: Automated neural network model debugging via state differential analysis and input selection. In: Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, association for computing machinery, ESEC/FSE 2018, New York, NY, USA, pp 175–186. https://doi.org/10.1145/3236024.3236082
Montavon G, Binder A, Lapuschkin S, Samek W, Müller KR (2019) Layer-wise relevance propagation: an overview. In: Explainable AI: interpreting, explaining and visualizing deep learning. Springer, pp 193–209
Moosavi-Dezfooli S, Fawzi A, Frossard P (2016) Deepfool: A simple and accurate method to fool deep neural networks. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 2574–2582. https://doi.org/10.1109/CVPR.2016.282
Nejadgholi M, Yang J (2019) A study of oracle approximations in testing deep learning libraries. In: 2019 34th IEEE/ACM international conference on automated software engineering (ASE), pp 785–796. https://doi.org/10.1109/ASE.2019.00078
Odena A, Olsson C, Andersen D, Goodfellow IJ (2019) Tensorfuzz: Debugging neural networks with coverage-guided fuzzing. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, PMLR, Proceedings of machine learning research, vol 97, pp 4901–4911. http://proceedings.mlr.press/v97/odena19a.html
Pei K, Cao Y, Yang J, Jana S (2017) Deepxplore: Automated whitebox testing of deep learning systems. In: Proceedings of the 26th symposium on operating systems principles, SOSP ’17. ACM, New York, NY, USA, pp 1–18. https://doi.org/10.1145/3132747.3132785
Pham HV, Lutellier T, Qi W, Tan L (2019) CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries. In: Proceedings of the 41st international conference on software engineering, ICSE ’19. IEEE Press, pp 1027–1038. https://doi.org/10.1109/ICSE.2019.00107
Qin G, Vrusias B, Gillam L (2010) Background filtering for improving of object detection in images. In: 2010 20th international conference on pattern recognition, pp 922–925. https://doi.org/10.1109/ICPR.2010.231
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 779–788. https://doi.org/10.1109/CVPR.2016.91
Ren S, He K, Girshick R, Sun J (2017) Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Ribeiro MT, Singh S, Guestrin C (2016) “why should I trust you?”: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, August 13-17, 2016, San Francisco, CA, USA, pp 1135–1144
Roobaert D, Zillich M, Eklundh J (2001) A pure learning approach to background-invariant object recognition using pedagogical support vector learning. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, vol 2, pp II–II. https://doi.org/10.1109/CVPR.2001.990982
Rosenfeld A, Zemel RS, Tsotsos JK (2018) The elephant in the room. arXiv:1808.03305
Sanchez J, Perronnin F (2011) High-dimensional signature compression for large-scale image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR ’11. IEEE Computer Society, USA, pp 1665–1672. https://doi.org/10.1109/CVPR.2011.5995504
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In: IEEE international conference on computer vision, ICCV 2017, October 22-29, 2017. IEEE Computer Society, Venice, Italy, pp 618–626. https://doi.org/10.1109/ICCV.2017.74
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, May 7-9, 2015, conference track proceedings, San Diego, CA, USA
Stock P, Cissé M (2018) Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision - ECCV 2018 - 15th european conference, September 8-14, 2018, Proceedings, Part VI, Lecture Notes in Computer Science, vol 11210. Springer, Munich, Germany, pp 504–519. https://doi.org/10.1007/978-3-030-01231-1_31
Tian Y, Pei K, Jana S, Ray B (2018) Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In: Proceedings of the 40th international conference on software engineering, ICSE ’18. ACM, New York, NY, USA, pp 303–314. https://doi.org/10.1145/3180155.3180220
Tian Y, Zeng Z, Wen M, Liu Y, Kuo Ty, Cheung SC (2020a) Evaldnn: A toolbox for evaluating deep neural network models. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering: companion proceedings, association for computing machinery, ICSE ’20, New York, NY, USA, pp 45–48. https://doi.org/10.1145/3377812.3382133
Tian Y, Zhong Z, Ordonez V, Kaiser G, Ray B (2020b) Testing dnn image classifiers for confusion & bias errors. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, association for computing machinery, ICSE ’20, New York, NY, USA, pp 1122–1134. https://doi.org/10.1145/3377811.3380400
Tramèr F, Atlidakis V, Geambasu R, Hsu D, Hubaux J, Humbert M, Juels A, Lin H (2017) Fairtest: Discovering unwarranted associations in data-driven applications. In: 2017 IEEE european symposium on security and privacy (EuroS P), pp 401–416. https://doi.org/10.1109/EuroSP.2017.29
Wang S, Su Z (2020) Metamorphic object insertion for testing object detection systems. In: Proceedings of the 35th ACM/IEEE international conference on automated software engineering, ASE 2020. ACM, New York, NY, USA, pp 1053–1065. https://doi.org/10.1145/3324884.3416584
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
Wu G, Zhu J (2020) Multi-label classification: do hamming loss and subset accuracy really conflict with each other? In: Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H (eds) Advances in neural information processing systems 33: Annual conference on neural information processing systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/20479c788fb27378c2c99eadcf207e7f-Abstract.html
Xie X, Ho JW, Murphy C, Kaiser G, Xu B, Chen TY (2011) Testing and validating machine learning classifiers by metamorphic testing. J Syst Softw 84(4):544–558, the Ninth International Conference on Quality Software. https://doi.org/10.1016/j.jss.2010.11.920
Xie X, Ma L, Juefei-Xu F, Xue M, Chen H, Liu Y, Zhao J, Li B, Yin J, See S (2019a) Deephunter: a coverage-guided fuzz testing framework for deep neural networks. In: Møller A, Zhang D (eds) Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis, ISSTA 2019, July 15-19, 2019. ACM, Beijing, China, pp 146–157. https://doi.org/10.1145/3293882.3330579
Xie X, Ma L, Wang H, Li Y, Liu Y, Li X (2019b) Diffchaser: Detecting disagreements for deep neural networks. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI-19, International joint conferences on artificial intelligence organization, pp 5772–5778. https://doi.org/10.24963/ijcai.2019/800
Yu J, Lin Z, Yang J, Shen X, Lu X, Huang TS (2018) Generative image inpainting with contextual attention. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, June 18-22, 2018. IEEE Computer Society, Salt Lake City, UT, USA, pp 5505–5514. https://doi.org/10.1109/CVPR.2018.00577
Zhang JM, Harman M, Ma L, Liu Y (2020) Machine learning testing: Survey, landscapes and horizons. IEEE Trans Softw Eng, pp 1–1. https://doi.org/10.1109/TSE.2019.2962027
Zhang M, Zhang Y, Zhang L, Liu C, Khurshid S (2018) Deeproad: Gan-based metamorphic testing and input validation framework for autonomous driving systems. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, ASE 2018. ACM, New York, NY, USA, pp 132–142. https://doi.org/10.1145/3238147.3238187
Zhang P, Wang J, Sun J, Dong G, Wang X, Wang X, Dong JS, Ting D (2020a) White-box fairness testing through adversarial sampling. In: Proceedings of the 42nd international conference on software engineering, association for computing machinery, ICSE ’20, New York, NY, USA
Zhang X, Xie X, Ma L, Du X, Hu Q, Liu Y, Zhao J, Sun M (2020b) Towards characterizing adversarial defects of deep learning software from the lens of uncertainty. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, association for computing machinery, ICSE ’20, New York, NY, USA, pp 739–751. https://doi.org/10.1145/3377811.3380368
Zhao J, Wang T, Yatskar M, Ordonez V, Chang KW (2017) Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 2941–2951. https://www.aclweb.org/anthology/D17-1319
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 2921–2929. https://doi.org/10.1109/CVPR.2016.319
Zhou ZQ, Sun L (2019) Metamorphic testing of driverless cars. Commun ACM 62(3):61–67. https://doi.org/10.1145/3241979
Zoph B, Vasudevan V, Shlens J, Le QV (2018) Learning transferable architectures for scalable image recognition. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 8697–8710. https://doi.org/10.1109/CVPR.2018.00907
Acknowledgement
We thank all reviewers for their constructive comments and suggestions on the manuscript, and the editors for their coordination. We would like to express our deep gratitude to Miss Yao Feng for her significant contribution to the manual check. We also appreciate the proofreading by our labmates, Mr. Wuqi Zhang, Mr. Meiziniu Li, Mr. Hao Guan, and Miss Lei Liu.
Funding
This work was supported by the National Key Research and Development Program of China (Grant No. 2019YFE0198100), National Natural Science Foundation of China (Grant No. 61932021, 62002125 and 61802164), Guangdong Provincial Key Laboratory (Grant No. 2020B121201001), Hong Kong RGC/RIF (Grant No. R5034-18), Hong Kong ITF (Grant No: MHP/055/19), Hong Kong PhD Fellowship Scheme, MSRA Collaborative Research Grant, Microsoft Cloud Research Software Fellow Award 2019, NSF 1901242, NSF 1910300, and IARPA TrojAI W911NF19S0012. Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interests
The authors declare that they have no conflict of interest.
Additional information
Communicated by: Shin Yoo
Availability of data and material
Data is available at: https://github.com/yqtianust/PaperUnreliableInference.
Code availability
Code is available at: https://github.com/yqtianust/PaperUnreliableInference.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Tian, Y., Ma, S., Wen, M. et al. To what extent do DNN-based image classification models make unreliable inferences?. Empir Software Eng 26, 84 (2021). https://doi.org/10.1007/s10664-021-09985-1