Empirically evaluating flaky test detection techniques combining test case rerunning and machine learning models

Published: 28 April 2023

Abstract

A flaky test is a test case whose outcome changes without any modification to the code of the test case or the program under test. Flaky tests disrupt continuous integration, reduce developer productivity, and limit the efficiency of testing. Many flaky test detection techniques are rerunning-based, requiring repeated test case executions at considerable time cost; others are machine learning-based and therefore fast, but offer only an approximate solution with variable detection performance. These two extremes leave developers with a stark choice. This paper introduces CANNIER, an approach for reducing the time cost of rerunning-based detection techniques by combining them with machine learning models. An empirical evaluation involving 89,668 test cases from 30 Python projects demonstrates that CANNIER can reduce the time cost of existing rerunning-based techniques by an order of magnitude while maintaining detection performance that is significantly better than that of machine learning models alone. Furthermore, the comprehensive study extends existing work on machine learning-based detection and reveals a number of additional findings, including: (1) machine learning models can also detect polluter test cases; (2) using the mean values of dynamic test case features from repeated measurements slightly improves the detection performance of machine learning models; and (3) various test case features correlate with the probability of a test case being flaky.
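The combination the abstract describes can be illustrated with a short sketch. The idea, under our own simplifying assumptions (the helper names, the probability thresholds, and the fixed rerun budget are all illustrative, not the paper's actual API): a model's predicted flakiness probability gates the expensive rerunning step, so only tests the model is uncertain about are rerun.

```python
def is_flaky_by_rerunning(run_test, reruns=10):
    """Rerun a test; flag it flaky if its outcome ever changes.

    `run_test` is any zero-argument callable returning True (pass)
    or False (fail). This mirrors the definition above: a flaky test
    changes outcome with no change to code under test."""
    outcomes = {run_test() for _ in range(reruns)}
    return len(outcomes) > 1

def hybrid_detect(tests, model_prob, run_test, lo=0.1, hi=0.9, reruns=10):
    """Hybrid detection in the spirit of CANNIER: trust the model when
    it is confident, and pay the rerunning cost only for ambiguous
    tests. `lo`/`hi` are hypothetical confidence thresholds."""
    labels = {}
    for t in tests:
        p = model_prob(t)  # model's predicted probability that t is flaky
        if p <= lo:
            labels[t] = False   # confidently non-flaky: zero reruns
        elif p >= hi:
            labels[t] = True    # confidently flaky: zero reruns
        else:
            # Ambiguous region: fall back to rerunning-based detection.
            labels[t] = is_flaky_by_rerunning(lambda: run_test(t), reruns)
    return labels
```

Because most test cases in practice fall into the confident regions, the rerun budget is spent on a small ambiguous subset, which is how a hybrid scheme of this shape can cut rerunning cost by a large factor while keeping detection performance above that of the model alone.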



Published In

Empirical Software Engineering  Volume 28, Issue 3
May 2023
845 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 28 April 2023
Accepted: 09 February 2023

Author Tags

  1. Software testing
  2. Flaky tests
  3. Machine learning

Qualifiers

  • Research-article

