Guidelines for Coverage-Based Comparisons of Non-Adequate Test Suites

Published: 02 September 2015

Abstract

A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the feasible requirements is called C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given two criteria C and C′, are C-adequate suites on average more effective than C′-adequate suites? However, in many realistic cases, producing adequate suites is impractical or even impossible.
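
To make these definitions concrete, here is a minimal sketch (in Java, since the study's subjects include Java programs) of measuring coverage as the fraction of feasible requirements a suite satisfies and of checking C-adequacy. The class name, requirement identifiers, and data are illustrative assumptions, not artifacts of the paper:

    import java.util.HashSet;
    import java.util.Set;

    /** Sketch of the abstract's definitions: a criterion C yields a set of test
     *  requirements; a suite's coverage is the fraction of feasible requirements
     *  it satisfies, and the suite is C-adequate iff that fraction is 100%. */
    public class CoverageMeasure {

        /** Coverage = |satisfied ∩ feasible| / |feasible|. */
        static double coverage(Set<String> satisfied, Set<String> feasible) {
            Set<String> hit = new HashSet<>(satisfied);
            hit.retainAll(feasible);              // infeasible requirements are excluded
            return (double) hit.size() / feasible.size();
        }

        static boolean isAdequate(Set<String> satisfied, Set<String> feasible) {
            return coverage(satisfied, feasible) == 1.0;
        }

        public static void main(String[] args) {
            Set<String> feasible  = Set.of("b1", "b2", "b3", "b4"); // e.g., feasible branches
            Set<String> satisfied = Set.of("b1", "b2", "b3");       // branches the suite covers
            System.out.printf("coverage = %.2f, adequate = %b%n",
                    coverage(satisfied, feasible), isAdequate(satisfied, feasible));
            // prints: coverage = 0.75, adequate = false
        }
    }
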
This article presents the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given two criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2,…,Tn have coverage values c1, c2,…,cn for C and c1′, c2′,…,cn′ for C′, is it better to compare suites based on c1, c2,…,cn or based on c1′, c2′,…,cn′? We evaluate a large set of plausible criteria, including basic criteria such as statement and branch coverage, as well as stronger criteria used in recent studies, including criteria based on program paths, equivalence classes of covered statements, and predicate states. The criteria are evaluated on a set of Java and C programs with both manually written and automatically generated test suites. The evaluation uses three correlation measures. Based on these experiments, two criteria perform best: branch coverage and an intraprocedural acyclic path coverage. We provide guidelines both for testing researchers aiming to evaluate test suites using coverage criteria and for researchers aiming to evaluate coverage criteria themselves.
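
The comparison question can be phrased operationally: the better criterion is the one whose coverage values correlate more strongly with an independent effectiveness measure (e.g., mutation score) across suites T1,…,Tn. The sketch below uses Kendall's τ as the rank correlation; the abstract does not name its three correlation measures, so treat this choice, and all data values, as illustrative assumptions:

    /** Sketch of correlating per-suite coverage values with an effectiveness
     *  measure; the criterion with the stronger correlation is the better
     *  basis for comparing (non-adequate) suites. All numbers are invented. */
    public class CriterionComparison {

        /** Kendall's tau-a: (concordant - discordant) / (n choose 2). */
        static double kendallTau(double[] x, double[] y) {
            int n = x.length, concordant = 0, discordant = 0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double s = (x[i] - x[j]) * (y[i] - y[j]);
                    if (s > 0) concordant++;
                    else if (s < 0) discordant++; // tied pairs contribute nothing
                }
            }
            return (concordant - discordant) / (n * (n - 1) / 2.0);
        }

        public static void main(String[] args) {
            double[] effectiveness = {0.40, 0.55, 0.61, 0.70, 0.82}; // e.g., mutation scores
            double[] covC          = {0.50, 0.58, 0.66, 0.71, 0.90}; // coverage under C
            double[] covCPrime     = {0.62, 0.60, 0.75, 0.68, 0.88}; // coverage under C'
            System.out.printf("tau(C)  = %.2f%n", kendallTau(covC, effectiveness));      // 1.00
            System.out.printf("tau(C') = %.2f%n", kendallTau(covCPrime, effectiveness)); // 0.60
        }
    }

Under these invented numbers, coverage under C ranks the suites exactly as their effectiveness does, so C would be the better basis for comparison.
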




Published In

ACM Transactions on Software Engineering and Methodology, Volume 24, Issue 4
Special Issue on ISSTA 2013
August 2015
177 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/2820114

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 September 2015
Accepted: 01 August 2014
Revised: 01 May 2014
Received: 01 January 2014
Published in TOSEM Volume 24, Issue 4


Author Tags

  1. Coverage criteria
  2. Non-adequate test suites

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Science Foundation


Cited By

  • (2025) Subsumption, correctness and relative correctness: Implications for software testing. Science of Computer Programming, 239, 103177. DOI: 10.1016/j.scico.2024.103177. Online publication date: Jan-2025.
  • (2024) Efficient Incremental Code Coverage Analysis for Regression Test Suites. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1882-1894. DOI: 10.1145/3691620.3695551. Online publication date: 27-Oct-2024.
  • (2024) Coverage Goal Selector for Combining Multiple Criteria in Search-Based Unit Test Generation. IEEE Transactions on Software Engineering, 50(4), 854-883. DOI: 10.1109/TSE.2024.3366613. Online publication date: 16-Feb-2024.
  • (2024) Detecting Faults vs. Revealing Failures: Exploring the Missing Link. 2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS), 115-126. DOI: 10.1109/QRS62785.2024.00021. Online publication date: 1-Jul-2024.
  • (2023) Assessing Effectiveness of Test Suites: What Do We Know and What Should We Do? ACM Transactions on Software Engineering and Methodology, 33(4), 1-32. DOI: 10.1145/3635713. Online publication date: 5-Dec-2023.
  • (2023) Heterogeneous Testing for Coverage Profilers Empowered with Debugging Support. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 670-681. DOI: 10.1145/3611643.3616340. Online publication date: 30-Nov-2023.
  • (2023) Input and Output Coverage Needed in File System Testing. Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems, 93-101. DOI: 10.1145/3599691.3603405. Online publication date: 9-Jul-2023.
  • (2023) Mutation-Based Minimal Test Suite Generation for Boolean Expressions. International Journal of Software Engineering and Knowledge Engineering, 33(6), 865-884. DOI: 10.1142/S0218194023500183. Online publication date: 17-May-2023.
  • (2023) Achieving High MAP-Coverage Through Pattern Constraint Reduction. IEEE Transactions on Software Engineering, 49(1), 99-112. DOI: 10.1109/TSE.2022.3144480. Online publication date: 1-Jan-2023.
  • (2023) Mind the Gap: The Difference Between Coverage and Mutation Score Can Guide Testing Efforts. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 102-113. DOI: 10.1109/ISSRE59848.2023.00036. Online publication date: 9-Oct-2023.
