Machine Learning Testing: Survey, Landscapes and Horizons

Published: 01 January 2022

Abstract

This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research, i.e., techniques for testing machine learning systems. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving and machine translation). The paper also analyses the datasets, research trends, and research focus of the surveyed work, concluding with research challenges and promising research directions in ML testing.
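
To make the survey's scope concrete, the following sketch (not taken from the paper, purely illustrative) shows one simple form of ML testing: a metamorphic-style robustness check that compares a classifier's predictions before and after a negligible input perturbation. The dataset, model, and perturbation scale are arbitrary choices for illustration, assuming scikit-learn and NumPy are available.

    # Illustrative sketch only -- not the survey's method. Assumes scikit-learn/NumPy.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    # Train a small classifier on an arbitrary benchmark dataset.
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    # Metamorphic relation under test: predictions should not change when the
    # inputs are perturbed far below any meaningful measurement resolution.
    rng = np.random.default_rng(0)
    X_followup = X + rng.normal(scale=1e-6, size=X.shape)

    violations = int(np.sum(model.predict(X) != model.predict(X_followup)))
    print(f"metamorphic relation violations: {violations} / {len(X)}")

Many of the test generation and test evaluation techniques surveyed in the paper automate the construction of such follow-up inputs and relations at much larger scale.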

Published In

IEEE Transactions on Software Engineering, Volume 48, Issue 1, January 2022, 363 pages

Publisher: IEEE Press
