
Research article · DOI: 10.1145/3338906.3338945

Understanding flaky tests: the developer’s perspective

Published: 12 August 2019

Abstract

Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) despite exercising unchanged code. In this work, we examine software developers' perceptions of the nature, relevance, and challenges of flaky tests.

We asked 21 professional developers to classify 200 flaky tests they had previously fixed, in terms of the nature and origin of the flakiness as well as the fixing effort. We also examined the developers' fixing strategies. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: (i) flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; (ii) flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can affect resource allocation, scheduling, and the perceived reliability of the test suite; and (iii) the challenges developers report mostly concern reproducing the flaky behavior and identifying the cause of the flakiness. Public preprint [http://arxiv.org/abs/1907.01466], data and materials [https://doi.org/10.5281/zenodo.3265785].
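To make the phenomenon concrete, below is a minimal illustrative sketch (our addition, not an example taken from the paper): a hypothetical JUnit 4 test that is flaky because its assertion races against an asynchronous task, followed by a deterministic rewrite. The class and method names are invented for illustration.

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.junit.Test;

    public class FlakyCounterTest {

        // Flaky: the assertion races against the background task, so the
        // outcome depends on thread scheduling, not on the code under test.
        @Test
        public void incrementsAsynchronously_flaky() throws Exception {
            AtomicInteger counter = new AtomicInteger(0);
            ExecutorService pool = Executors.newSingleThreadExecutor();
            pool.submit(counter::incrementAndGet);

            Thread.sleep(1); // may or may not be long enough for the task to finish
            assertEquals(1, counter.get()); // passes or fails nondeterministically
        }

        // Deterministic rewrite: wait for the task to complete before asserting.
        @Test
        public void incrementsAsynchronously_fixed() throws Exception {
            AtomicInteger counter = new AtomicInteger(0);
            ExecutorService pool = Executors.newSingleThreadExecutor();
            pool.submit(counter::incrementAndGet);

            pool.shutdown();
            assertTrue(pool.awaitTermination(5, TimeUnit.SECONDS));
            assertEquals(1, counter.get()); // the increment has completed by now
        }
    }

Replacing a fixed sleep with explicit synchronization on the awaited event, as in the second test, is a widely recommended remedy for this class of flakiness.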




Published In

ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2019
1264 pages
ISBN: 9781450355728
DOI: 10.1145/3338906
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Empirical Studies
  2. Flaky Tests
  3. Mixed-Method Research

Qualifiers

  • Research-article

Conference

ESEC/FSE '19

Acceptance Rates

Overall acceptance rate: 112 of 543 submissions (21%)


Article Metrics

  • Downloads (last 12 months): 201
  • Downloads (last 6 weeks): 27

Reflects downloads up to 24 Nov 2024.

Cited By

  • (2025) Combinatorial transition testing in dynamically adaptive systems: Implementation and test oracle. Journal of Systems and Software, 221, 112260. DOI: 10.1016/j.jss.2024.112260. Online publication date: Mar-2025.
  • (2025) How and why developers implement OS-specific tests. Empirical Software Engineering, 30(1). DOI: 10.1007/s10664-024-10571-4. Online publication date: 1-Feb-2025.
  • (2024) Efficient Incremental Code Coverage Analysis for Regression Test Suites. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1882–1894. DOI: 10.1145/3691620.3695551. Online publication date: 27-Oct-2024.
  • (2024) Reducing Test Runtime by Transforming Test Fixtures. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1757–1769. DOI: 10.1145/3691620.3695541. Online publication date: 27-Oct-2024.
  • (2024) Towards a Robust Waiting Strategy for Web GUI Testing for an Industrial Software System. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2065–2076. DOI: 10.1145/3691620.3695269. Online publication date: 27-Oct-2024.
  • (2024) Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 572–581. DOI: 10.1145/3674805.3695407. Online publication date: 24-Oct-2024.
  • (2024) Practitioners' Expectations on Automated Test Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1618–1630. DOI: 10.1145/3650212.3680386. Online publication date: 11-Sep-2024.
  • (2024) Reproducing Timing-Dependent GUI Flaky Tests in Android Apps via a Single Event Delay. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1504–1515. DOI: 10.1145/3650212.3680377. Online publication date: 11-Sep-2024.
  • (2024) Neurosymbolic Repair of Test Flakiness. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1402–1414. DOI: 10.1145/3650212.3680369. Online publication date: 11-Sep-2024.
  • (2024) Can ChatGPT Repair Non-Order-Dependent Flaky Tests? Proceedings of the 1st International Workshop on Flaky Tests, 22–29. DOI: 10.1145/3643656.3643900. Online publication date: 14-Apr-2024.
