The Prevalence of Errors in Machine Learning Experiments

Published: 14 November 2019 · DOI: 10.1007/978-3-030-33607-3_12

Abstract

Context: Conducting experiments is central to machine learning research for benchmarking, evaluating and comparing learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments.
Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is on simple arithmetical and statistical errors.
Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken systematic review comparing supervised and unsupervised defect prediction classifiers. We extract the confusion matrices and test for relevant constraints, e.g., the marginal probabilities must sum to one (a minimal sketch of such checks follows the abstract). We also check for multiple statistical significance testing errors.
Results: We find that a total of 22 out of 49 papers contain demonstrable errors. Of these, 7 were statistical and 16 related to confusion matrix inconsistency (one paper contained both classes of error).
Conclusions: Whilst some errors may be of a relatively trivial nature, e.g., transcription errors, their presence does not engender confidence. We strongly urge researchers to follow open science principles so that errors can be more easily detected and corrected, thus, as a community, reducing this worryingly high error rate in our computational experiments.
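
As a concrete illustration of the confusion matrix checks described in the Method, the following is a minimal Python sketch. It is not the authors' actual tooling; the function names, tolerances and example values are illustrative assumptions. It tests three of the relevant constraints: cell counts must be non-negative and sum to the stated sample size, proportions must sum to one, and a reported F1 score must agree with the F1 implied by the reported precision and recall (up to rounding).

    # Minimal sketch (not the authors' tooling) of arithmetic consistency checks
    # on confusion matrices reported in a paper. Names, tolerances and the
    # example values below are illustrative assumptions.

    def check_counts(tp: int, fp: int, fn: int, tn: int, n: int) -> list[str]:
        """Flag basic constraint violations in a confusion matrix given as counts."""
        problems = []
        if min(tp, fp, fn, tn) < 0:
            problems.append("negative cell count")
        if tp + fp + fn + tn != n:
            problems.append(f"cells sum to {tp + fp + fn + tn}, not the stated n = {n}")
        return problems

    def check_proportions(p_tp: float, p_fp: float, p_fn: float, p_tn: float,
                          tol: float = 1e-3) -> list[str]:
        """When the matrix is reported as proportions, they must sum to one."""
        total = p_tp + p_fp + p_fn + p_tn
        return [] if abs(total - 1.0) <= tol else [f"proportions sum to {total:.3f}"]

    def f1_consistent(precision: float, recall: float, reported_f1: float,
                      tol: float = 5e-3) -> bool:
        """Check a reported F1 against the F1 implied by precision and recall,
        allowing for rounding in the paper."""
        if precision + recall == 0:
            return reported_f1 == 0.0
        implied = 2 * precision * recall / (precision + recall)
        return abs(implied - reported_f1) <= tol

    if __name__ == "__main__":
        # Hypothetical reported result: cells sum to 190 although the paper states n = 200.
        print(check_counts(tp=80, fp=20, fn=30, tn=60, n=200))
        print(check_proportions(0.40, 0.10, 0.15, 0.30))   # sums to 0.95 -> flagged
        print(f1_consistent(precision=0.800, recall=0.727, reported_f1=0.762))

Run as a script, the first call flags cells that sum to 190 rather than the stated n = 200, the second flags proportions that sum to 0.95, and the third accepts F1 = 0.762 as consistent with precision 0.800 and recall 0.727.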

Cited By

  • (2023) Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 528–540. DOI: 10.1145/3611643.3616320. Online publication date: 30 Nov 2023.


Published In

Intelligent Data Engineering and Automated Learning – IDEAL 2019: 20th International Conference, Manchester, UK, November 14–16, 2019, Proceedings, Part I
Nov 2019, 574 pages
ISBN: 978-3-030-33606-6
DOI: 10.1007/978-3-030-33607-3

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Classifier
  2. Computational experiment
  3. Reliability
  4. Error
