Preempting flaky tests via non-idempotent-outcome tests

Research article (Public Access)
DOI: 10.1145/3510003.3510170
Published: 05 July 2022

Abstract

Regression testing can greatly help in software development, but it can be seriously undermined by flaky tests, which can both pass and fail, seemingly nondeterministically, on the same code commit. Flaky tests are an emerging topic in both research and industry. Prior work has identified multiple categories of flaky tests, developed techniques for detecting them, and analyzed some of the detected tests.
To proactively detect, i.e., preempt, flaky tests, we propose detecting non-idempotent-outcome (NIO) tests, a novel category of tests related to flaky tests. In particular, we run each test twice in the same test execution environment, e.g., each Java test twice in the same Java Virtual Machine (JVM). A test is NIO if it passes in the first run but fails in the second. Each NIO test has side effects and "self-pollutes" the state shared among its runs. We perform experiments on open-source Java and Python projects, detecting 223 NIO Java tests and 138 NIO Python tests. We inspected all 361 detected tests and opened pull requests that fix 268 of them; 192 have already been accepted, only 6 rejected, and the remaining 70 are pending.
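To make the detection procedure concrete, below is a minimal sketch of the two-runs idea for Java, assuming JUnit 4 on the classpath; NioDetectorSketch and CounterTest are hypothetical names invented for illustration, not part of the paper's tooling or subject projects. The sketch runs one test class twice in the same JVM via JUnitCore and flags the test as NIO when the first run passes and the second fails:

    // Minimal NIO-detection sketch: run the same test class twice in one
    // JVM and report it as NIO if run 1 passes but run 2 fails.
    import org.junit.Test;
    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;
    import static org.junit.Assert.assertEquals;

    public class NioDetectorSketch {

        // Hypothetical self-polluting test: it mutates static (JVM-wide)
        // state that is never reset, so its outcome is not idempotent.
        public static class CounterTest {
            static int counter = 0;  // state shared between the two runs

            @Test
            public void testIncrement() {
                counter++;
                assertEquals(1, counter);  // holds on run 1, fails on run 2
            }
        }

        public static void main(String[] args) {
            Result first = JUnitCore.runClasses(CounterTest.class);
            Result second = JUnitCore.runClasses(CounterTest.class);
            if (first.wasSuccessful() && !second.wasSuccessful()) {
                System.out.println("CounterTest is NIO: passed run 1, failed run 2");
            }
        }
    }

A natural fix for such a self-polluting test is to reset the shared state before each run, e.g., in an @Before method, so that the test's outcome becomes idempotent.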


Published In

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022
2508 pages
ISBN:9781450392211
DOI:10.1145/3510003
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (last 12 months): 202
  • Downloads (last 6 weeks): 38

Reflects downloads up to 25 Nov 2024.

Cited By

  • (2024) Efficient Detection of Test Interference in C Projects. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 166-178. DOI: 10.1145/3691620.3694995
  • (2024) Quantitative Robustness for Vulnerability Assessment. Proceedings of the ACM on Programming Languages 8 (PLDI), 741-765. DOI: 10.1145/3656407
  • (2024) Neurosymbolic Repair of Test Flakiness. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1402-1414. DOI: 10.1145/3650212.3680369
  • (2024) Can ChatGPT Repair Non-Order-Dependent Flaky Tests? Proceedings of the 1st International Workshop on Flaky Tests, 22-29. DOI: 10.1145/3643656.3643900
  • (2024) Flakiness Repair in the Era of Large Language Models. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 441-443. DOI: 10.1145/3639478.3641227
  • (2024) Do Automatic Test Generation Tools Generate Flaky Tests? Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-12. DOI: 10.1145/3597503.3608138
  • (2024) Quantizing Large-Language Models for Predicting Flaky Tests. 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), 93-104. DOI: 10.1109/ICST60714.2024.00018
  • (2024) Test Code Flakiness in Mobile Apps. Information and Software Technology 168 (C). DOI: 10.1016/j.infsof.2023.107394
  • (2023) Transforming Test Suites into Croissants. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1080-1092. DOI: 10.1145/3597926.3598119
  • (2023) Improved Flaky Test Detection with Black-Box Approach and Test Smells. 2023 IEEE Symposium on Computers and Communications (ISCC), 245-251. DOI: 10.1109/ISCC58397.2023.10217934
