
Robust Statistical Methods for Empirical Software Engineering

Published: 01 April 2017

Abstract

There have been many changes in statistical theory in the past 30 years, including increased evidence that non-robust methods may fail to detect important results. The statistical advice available to software engineering researchers needs to be updated to address these issues. This paper aims both to explain the new results in the area of robust analysis methods and to provide a large-scale worked example of the new methods. We summarise the results of analyses of the Type 1 error efficiency and power of standard parametric and non-parametric statistical tests when applied to non-normal data sets. We identify parametric and non-parametric methods that are robust to non-normality. We present an analysis of a large-scale software engineering experiment to illustrate their use. We illustrate the use of kernel density plots, and parametric and non-parametric methods using four different software engineering data sets. We explain why the methods are necessary and the rationale for selecting a specific analysis. We suggest using kernel density plots rather than box plots to visualise data distributions. For parametric analysis, we recommend trimmed means, which can support reliable tests of the differences between the central location of two or more samples. When the distribution of the data differs among groups, or we have ordinal scale data, we recommend non-parametric methods such as Cliff's δ or a robust rank-based ANOVA-like method.
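
To make these recommendations concrete, the Python sketch below illustrates the three building blocks named above for two independent samples: a kernel density plot, a test on 20% trimmed means (Yuen's test, available through SciPy's ttest_ind when trim is set), and Cliff's δ as a non-parametric effect size. This is a minimal sketch rather than the paper's own analysis scripts: the lognormal samples and the names group_a and group_b are hypothetical, and SciPy 1.7 or later is assumed for the trim argument.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical skewed samples standing in for two experimental groups.
rng = np.random.default_rng(1)
group_a = rng.lognormal(mean=0.0, sigma=0.6, size=40)
group_b = rng.lognormal(mean=0.3, sigma=0.9, size=40)

# 1. Kernel density plots instead of box plots: they reveal skew and multimodality.
xs = np.linspace(0, max(group_a.max(), group_b.max()), 200)
for label, sample in (("Group A", group_a), ("Group B", group_b)):
    plt.plot(xs, stats.gaussian_kde(sample)(xs), label=label)
plt.legend()
plt.title("Kernel density estimates")
plt.show()

# 2. Robust comparison of central location: Yuen's test on 20% trimmed means
#    (ttest_ind performs Yuen's test when trim > 0; requires SciPy >= 1.7).
print("20% trimmed means:",
      stats.trim_mean(group_a, 0.2), stats.trim_mean(group_b, 0.2))
t, p = stats.ttest_ind(group_a, group_b, equal_var=False, trim=0.2)
print(f"Yuen's test: t = {t:.3f}, p = {p:.3f}")

# 3. Non-parametric effect size: Cliff's delta, the probability that a value
#    from A exceeds one from B minus the probability of the reverse.
greater = (group_a[:, None] > group_b[None, :]).sum()
less = (group_a[:, None] < group_b[None, :]).sum()
cliffs_delta = (greater - less) / (len(group_a) * len(group_b))
print(f"Cliff's delta = {cliffs_delta:.3f}")   # ranges from -1 to 1; 0 means stochastic equality

For more than two groups, or for ordinal data whose distributions differ across groups, the robust rank-based ANOVA-like methods mentioned in the abstract are the analogous choice; they are not sketched here.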

Published In

Empirical Software Engineering, Volume 22, Issue 2 (April 2017), 383 pages

Publisher

Kluwer Academic Publishers, United States

Author Tags

1. Empirical software engineering
2. Robust methods
3. Robust statistical methods
4. Statistical methods

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 02 Oct 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Application of Quantum Extreme Learning Machines for QoS Prediction of Elevators’ Software in an Industrial ContextCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering10.1145/3663529.3663859(399-410)Online publication date: 10-Jul-2024
    • (2024)Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction TasksACM Transactions on Software Engineering and Methodology10.1145/364959633:6(1-45)Online publication date: 27-Jun-2024
    • (2024)Supporting Safety Analysis of Image-processing DNNs through Clustering-based ApproachesACM Transactions on Software Engineering and Methodology10.1145/364367133:5(1-48)Online publication date: 3-Jun-2024
    • (2024)Method-level Bug Prediction: Problems and PromisesACM Transactions on Software Engineering and Methodology10.1145/364033133:4(1-31)Online publication date: 13-Jan-2024
    • (2024)Search-Based Repair of DNN Controllers of AI-Enabled Cyber-Physical Systems Guided by System-Level SpecificationsProceedings of the Genetic and Evolutionary Computation Conference10.1145/3638529.3654078(1435-1444)Online publication date: 14-Jul-2024
    • (2024)Fairness Improvement with Multiple Protected Attributes: How Far Are We?Proceedings of the IEEE/ACM 46th International Conference on Software Engineering10.1145/3597503.3639083(1-13)Online publication date: 20-May-2024
    • (2024)On the Understandability of MLOps System ArchitecturesIEEE Transactions on Software Engineering10.1109/TSE.2024.336748850:5(1015-1039)Online publication date: 20-Feb-2024
    • (2024)The impact of hard and easy negative training data on vulnerability prediction performanceJournal of Systems and Software10.1016/j.jss.2024.112003211:COnline publication date: 2-Jul-2024
    • (2024)Recommendations for analysing and meta-analysing small sample size software engineering experimentsEmpirical Software Engineering10.1007/s10664-024-10504-129:6Online publication date: 1-Nov-2024
    • (2023)Measuring User Experience of Adaptive User Interfaces using EEG: A Replication StudyProceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering10.1145/3593434.3593452(52-61)Online publication date: 14-Jun-2023
    • Show More Cited By

    View Options

    View options

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media