
Robust Statistical Methods for Empirical Software Engineering

Published: 01 April 2017

Abstract

There have been many changes in statistical theory in the past 30 years, including increased evidence that non-robust methods may fail to detect important results. The statistical advice available to software engineering researchers needs to be updated to address these issues. This paper aims both to explain the new results in the area of robust analysis methods and to provide a large-scale worked example of the new methods. We summarise the results of analyses of the Type 1 error efficiency and power of standard parametric and non-parametric statistical tests when applied to non-normal data sets. We identify parametric and non-parametric methods that are robust to non-normality. We present an analysis of a large-scale software engineering experiment to illustrate their use. We illustrate the use of kernel density plots, and parametric and non-parametric methods using four different software engineering data sets. We explain why the methods are necessary and the rationale for selecting a specific analysis. We suggest using kernel density plots rather than box plots to visualise data distributions. For parametric analysis, we recommend trimmed means, which can support reliable tests of the differences between the central location of two or more samples. When the distribution of the data differs among groups, or we have ordinal scale data, we recommend non-parametric methods such as Cliff's δ or a robust rank-based ANOVA-like method.
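
To make these recommendations concrete, the Python sketch below illustrates the three building blocks named above for two independent samples: a kernel density plot, a test on 20% trimmed means (Yuen's test, available through SciPy's ttest_ind when trim is set), and Cliff's δ as a non-parametric effect size. This is a minimal sketch rather than the paper's own analysis scripts: the lognormal samples and the names group_a and group_b are hypothetical, and SciPy 1.7 or later is assumed for the trim argument.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical skewed samples standing in for two experimental groups.
rng = np.random.default_rng(1)
group_a = rng.lognormal(mean=0.0, sigma=0.6, size=40)
group_b = rng.lognormal(mean=0.3, sigma=0.9, size=40)

# 1. Kernel density plots instead of box plots: they reveal skew and multimodality.
xs = np.linspace(0, max(group_a.max(), group_b.max()), 200)
for label, sample in (("Group A", group_a), ("Group B", group_b)):
    plt.plot(xs, stats.gaussian_kde(sample)(xs), label=label)
plt.legend()
plt.title("Kernel density estimates")
plt.show()

# 2. Robust comparison of central location: Yuen's test on 20% trimmed means
#    (ttest_ind performs Yuen's test when trim > 0; requires SciPy >= 1.7).
print("20% trimmed means:",
      stats.trim_mean(group_a, 0.2), stats.trim_mean(group_b, 0.2))
t, p = stats.ttest_ind(group_a, group_b, equal_var=False, trim=0.2)
print(f"Yuen's test: t = {t:.3f}, p = {p:.3f}")

# 3. Non-parametric effect size: Cliff's delta, the probability that a value
#    from A exceeds one from B minus the probability of the reverse.
greater = (group_a[:, None] > group_b[None, :]).sum()
less = (group_a[:, None] < group_b[None, :]).sum()
cliffs_delta = (greater - less) / (len(group_a) * len(group_b))
print(f"Cliff's delta = {cliffs_delta:.3f}")   # ranges from -1 to 1; 0 means stochastic equality

For more than two groups, or for ordinal data whose distributions differ across groups, the robust rank-based ANOVA-like methods mentioned in the abstract are the analogous choice; they are not sketched here.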

Published In

Empirical Software Engineering, Volume 22, Issue 2 (April 2017), 383 pages

Publisher

Kluwer Academic Publishers, United States

Author Tags

1. Empirical software engineering
2. Robust methods
3. Robust statistical methods
4. Statistical methods

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 02 Oct 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Application of Quantum Extreme Learning Machines for QoS Prediction of Elevators’ Software in an Industrial ContextCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering10.1145/3663529.3663859(399-410)Online publication date: 10-Jul-2024
    • (2024)Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction TasksACM Transactions on Software Engineering and Methodology10.1145/364959633:6(1-45)Online publication date: 27-Jun-2024
    • (2024)Supporting Safety Analysis of Image-processing DNNs through Clustering-based ApproachesACM Transactions on Software Engineering and Methodology10.1145/364367133:5(1-48)Online publication date: 3-Jun-2024
    • (2024)Method-level Bug Prediction: Problems and PromisesACM Transactions on Software Engineering and Methodology10.1145/364033133:4(1-31)Online publication date: 13-Jan-2024
    • (2024)Search-Based Repair of DNN Controllers of AI-Enabled Cyber-Physical Systems Guided by System-Level SpecificationsProceedings of the Genetic and Evolutionary Computation Conference10.1145/3638529.3654078(1435-1444)Online publication date: 14-Jul-2024
    • (2024)Fairness Improvement with Multiple Protected Attributes: How Far Are We?Proceedings of the IEEE/ACM 46th International Conference on Software Engineering10.1145/3597503.3639083(1-13)Online publication date: 20-May-2024
    • (2024)On the Understandability of MLOps System ArchitecturesIEEE Transactions on Software Engineering10.1109/TSE.2024.336748850:5(1015-1039)Online publication date: 20-Feb-2024
    • (2024)The impact of hard and easy negative training data on vulnerability prediction performanceJournal of Systems and Software10.1016/j.jss.2024.112003211:COnline publication date: 2-Jul-2024
    • (2024)Recommendations for analysing and meta-analysing small sample size software engineering experimentsEmpirical Software Engineering10.1007/s10664-024-10504-129:6Online publication date: 1-Nov-2024
    • (2023)Measuring User Experience of Adaptive User Interfaces using EEG: A Replication StudyProceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering10.1145/3593434.3593452(52-61)Online publication date: 14-Jun-2023
    • Show More Cited By

    View Options

    View options

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media