Using p-values for the comparison of classifiers: pitfalls and alternatives

Abstract

The statistical comparison of machine learning classifiers is frequently underpinned by null hypothesis significance testing. Here, we provide a survey and analysis of underrated problems that significance testing entails for classification benchmark studies. The p-value has become deeply entrenched in machine learning, but it is substantially less objective and less informative than commonly assumed. Even very small p-values can drastically overstate the evidence against the null hypothesis. Moreover, the p-value depends on the experimenter’s intentions, irrespective of whether these were actually realized. We show how such intentions can lead to experimental designs with more than one stage, and how to calculate a valid p-value for such designs. We discuss two widely used statistical tests for the comparison of classifiers, the Friedman test and the Wilcoxon signed rank test. Some improvements to the use of p-values, such as the calibration with the Bayes factor bound, and alternative methods for the evaluation of benchmark studies are discussed as well.
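
Both tests discussed in the abstract are available in standard statistical software. As a minimal illustration, the following sketch applies SciPy's friedmanchisquare and wilcoxon to synthetic accuracy scores (the numbers are invented for this sketch and do not come from the article):

  import numpy as np
  from scipy import stats

  # Synthetic accuracies of three classifiers (A, B, C) on ten benchmark data sets.
  rng = np.random.default_rng(seed=1)
  acc_a = rng.uniform(0.80, 0.95, size=10)
  acc_b = acc_a + rng.normal(0.01, 0.01, size=10)
  acc_c = acc_a + rng.normal(0.00, 0.02, size=10)

  # Friedman test: omnibus comparison of the three classifiers across data sets.
  chi2, p_friedman = stats.friedmanchisquare(acc_a, acc_b, acc_c)

  # Wilcoxon signed rank test: paired comparison of two classifiers.
  w, p_wilcoxon = stats.wilcoxon(acc_a, acc_b)

  print(f"Friedman: chi2 = {chi2:.3f}, p = {p_friedman:.4f}")
  print(f"Wilcoxon: W = {w:.1f}, p = {p_wilcoxon:.4f}")

As the article argues, the p-values returned by such tests are easily over-interpreted; the Bayes factor bound derived in the appendix offers one calibration.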

Notes

  1. This assumption is clearly violated: if A is better than B and B is better than C, then A must also be better than C.

  2. Notice that for \(p = 0.0034\), the BFB is approximately 19; with a false positive risk of 5%, the prior probability of the null hypothesis is \({\mathbb {P}}(H_0) = 0.5\). The code sketch after these notes reproduces this calculation.

  3. The R function hdiBeta(0.95, a, b) from the package nclbayes calculates the 95%-HDI of the beta distribution with parameters a and b; a Python analogue appears in the sketch after these notes.

  4. It is of course assumed that nothing is known yet about this new data set.

  5. https://cos.io/rr/.
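
The following minimal Python sketch makes notes 2 and 3 concrete. It assumes NumPy and SciPy; hdi_beta is a hypothetical Python analogue of hdiBeta from the R package nclbayes, written here for illustration only.

  import numpy as np
  from scipy import optimize, stats

  def bayes_factor_bound(p):
      # BFB = 1 / (-e * p * ln(p)): upper bound on the odds of H1 to H0, valid for p <= 1/e.
      return 1.0 / (-np.e * p * np.log(p))

  def hdi_beta(mass, a, b):
      # Narrowest interval of Beta(a, b) holding `mass` probability
      # (a hypothetical analogue of hdiBeta(mass, a, b) from nclbayes).
      dist = stats.beta(a, b)
      def width(lower_tail):
          return dist.ppf(lower_tail + mass) - dist.ppf(lower_tail)
      res = optimize.minimize_scalar(width, bounds=(0.0, 1.0 - mass), method="bounded")
      return dist.ppf(res.x), dist.ppf(res.x + mass)

  print(bayes_factor_bound(0.0034))  # approximately 19, as stated in note 2
  print(hdi_beta(0.95, 20, 10))      # 95%-HDI of Beta(20, 10), cf. note 3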

References

  • Abelson R (2016) A retrospective on the significance test ban of 1999 (if there were no significance tests, they would need to be invented). In: Harlow L, Mulaik S, Steiger J (eds) What if there were no significance tests? Routledge Classic Editions, pp 107–128

  • Althouse A (2016) Adjust for multiple comparisons? It’s not that simple. Ann Thorac Surg 101:1644–1645

  • Amrhein V, Greenland S (2018) Remove, rather than redefine, statistical significance. Nat Hum Behav 2(4):4

  • Amrhein V, Korner-Nievergelt F, Roth T (2017) The earth is flat (\(p > 0.05\)): significance thresholds and the crisis of unreplicable research. PeerJ 5:e3544

  • Bayarri M, Berger J (2000) P values for composite null models. J Am Stat Assoc 95(452):1127–1142

  • Bayarri M, Benjamin D, Berger J, Sellke T (2016) Rejection odds and rejection ratios: a proposal for statistical practice in testing hypotheses. J Math Psychol 72:90–103

  • Benavoli A, Corani G, Mangili F (2016) Should we really use post-hoc tests based on mean-ranks? J Mach Learn Res 17(5):1–10

  • Benavoli A, Corani G, Demšar J, Zaffalon M (2017) Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. J Mach Learn Res 18(77):1–36

  • Benjamin D, Berger J (2016) Comment: a simple alternative to \(p\)-values. Am Stat (Online Discussion: ASA Statement on Statistical Significance and \(P\)-values) 70:1–2

  • Benjamin D, Berger J (2019) Three recommendations for improving the use of \(p\)-values. Am Stat 73(sup1):186–191

  • Benjamin D, Berger J, Johannesson M, Nosek B, Wagenmakers E, Berk R, Bollen K, Brembs B, Brown L, Camerer C, Cesarini D, Chambers C, Clyde M, Cook T, De Boeck P, Dienes Z, Dreber A, Easwaran K, Efferson C, Fehr E, Fidler F, Field A, Forster M, George E, Gonzalez R, Goodman S, Green E, Green D, Greenwald A, Hadfield J, Hedges L, Held L, Hua Ho T, Hoijtink H, Hruschka D, Imai K, Imbens G, Ioannidis J, Jeon M, Jones J, Kirchler M, Laibson D, List J, Little R, Lupia A, Machery E, Maxwell S, McCarthy M, Moore D, Morgan S, Munafó M, Nakagawa S, Nyhan B, Parker T, Pericchi L, Perugini M, Rouder J, Rousseau J, Savalei V, Schönbrodt F, Sellke T, Sinclair B, Tingley D, Van Zandt T, Vazire S, Watts D, Winship C, Wolpert R, Xie Y, Young C, Zinman J, Johnson V (2018) Redefine statistical significance. Nat Hum Behav 2(1):6–10

  • Berger J, Berry D (1988) Statistical analysis and the illusion of objectivity. Am Sci 76:159–165

  • Berger J, Delampady M (1987) Testing precise hypotheses. Stat Sci 2(3):317–352

  • Berger J, Sellke T (1987) Testing a point null hypothesis: the irreconcilability of \(p\) values and evidence. J Am Stat Assoc 82:112–122

  • Berger J, Wolpert R (1988) The Likelihood Principle, 2nd edn. Institute of Mathematical Statistics, Hayward, California

  • Berrar D (2017) Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. Mach Learn 106(6):911–949

  • Berrar D, Dubitzky W (2019) Should significance testing be abandoned in machine learning? Int J Data Sci Anal 7(4):247–257

  • Berrar D, Lozano J (2013) Significance tests or confidence intervals: which are preferable for the comparison of classifiers? J Exp Theor Artif Intell 25(2):189–206

  • Berrar D, Lopes P, Dubitzky W (2017) Caveats and pitfalls in crowdsourcing research: the case of soccer referee bias. Int J Data Sci Anal 4(2):143–151

  • Berry D (2017) A \(p\)-value to die for. J Am Stat Assoc 112:895–897

  • Birnbaum A (1961) A unified theory of estimation, I. Ann Math Stat 32:112–135

  • Carrasco J, García S, Rueda M, Das S, Herrera F (2020) Recent trends in the use of statistical tests for comparing swarm and evolutionary computing algorithms: practical guidelines and a critical review. Swarm Evol Comput 54:100665

  • Carver R (1978) The case against statistical significance testing. Harv Educ Rev 48(3):378–399

  • Christensen R (2005) Testing Fisher, Neyman, Pearson, and Bayes. Am Stat 59(2):121–126

  • Cockburn A, Dragicevic P, Besançon L, Gutwin C (2020) Threats of a replication crisis in empirical computer science. Commun ACM 63(8):70–79

  • Cohen J (1990) Things I have learned (so far). Am Psychol 45(12):1304–1312

  • Cohen J (1994) The earth is round (\(p <\) .05). Am Psychol 49(12):997–1003

  • Cole P (1979) The evolving case-control study. J Chronic Dis 32:15–27

  • Colquhoun D (2017) The reproducibility of research and the misinterpretation of \(p\)-values. R Soc Open Sci 4:171085

  • Cumming G (2012) Understanding the new statistics: effect sizes, confidence intervals, and meta-analysis. Routledge, Taylor & Francis Group, New York/London

  • Dau HA, Bagnall AJ, Kamgar K, Yeh CM, Zhu Y, Gharghabi S, Ratanamahatana CA, Keogh EJ (2019) The UCR time series archive. CoRR. arXiv:1810.07758

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Dietterich T (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923

  • Drummond C (2006) Machine learning as an experimental science, revisited. In: Proceedings of the 21st national conference on artificial intelligence: workshop on evaluation methods for machine learning. AAAI Press, pp 1–5

  • Drummond C, Japkowicz N (2010) Warning: statistical benchmarking is addictive. Kicking the habit in machine learning. J Exp Theor Artif Intell 22(1):67–80

  • Dua D, Graff C (2019) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Dudoit S, van der Laan M (2008) Multiple testing procedures with applications to genomics, 1st edn. Springer, New York

  • Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32:675–701

  • García S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9(89):2677–2694

  • Gelman A (2016) The problems with \(p\)-values are not just with \(p\)-values. Am Stat, Online Discussion, pp 1–2

  • Gibson E (2020) The role of \(p\)-values in judging the strength of evidence and realistic replication expectations. Stat Biopharm Res 0(0):1–13

  • Gigerenzer G (1998) We need statistical thinking, not statistical rituals. Behav Brain Sci 21:199–200

  • Gigerenzer G (2004) Mindless statistics. J Socio-Econ 33:587–606

  • Gigerenzer G, Krauss S, Vitouch O (2004) The null ritual: what you always wanted to know about significance testing but were afraid to ask. In: Kaplan D (ed) The Sage handbook of quantitative methodology for the social sciences. Sage, Thousand Oaks, pp 391–408

  • Goodman S (1992) A comment on replication, \(p\)-values and evidence. Stat Med 11:875–879

  • Goodman S (1993) P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 137(5):485–496

  • Goodman S (1999) Toward evidence-based medical statistics 1: the P value fallacy. Ann Intern Med 130(12):995–1004

  • Goodman S (2008) A dirty dozen: twelve P-value misconceptions. Semin Hematol 45(3):135–140

  • Goodman S, Royall R (1988) Evidence and scientific research. Am J Public Health 78(12):1568–1574

  • Greenland S, Senn S, Rothman K, Carlin J, Poole C, Goodman S, Altman D (2016) Statistical tests, \(p\) values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31(4):337–350

  • Gundersen OE, Kjensmo S (2018) State of the art: reproducibility in artificial intelligence. In: McIlraith SA, Weinberger KQ (eds) Proceedings of the 32nd AAAI conference on artificial intelligence. AAAI Press, pp 1644–1651

  • Hagen R (1997) In praise of the null hypothesis significance test. Am Psychol 52(1):15–23

  • Hays W (1963) Statistics. Holt, Rinehart and Winston, New York

  • Hoekstra R, Morey R, Rouder J, Wagenmakers E-J (2014) Robust misinterpretation of confidence intervals. Psychon Bull Rev 21(5):1157–1164

  • Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70

  • Hubbard R (2004) Alphabet soup—blurring the distinctions between \(p\)’s and \(\alpha \)’s in psychological research. Theory Psychol 14(3):295–327

  • Hubbard R (2019) Will the ASA’s efforts to improve statistical practice be successful? Some evidence to the contrary. Am Stat 73(sup1: Statistical Inference in the 21st Century: A World Beyond \(p < 0.05\)):31–35

  • Hubbard R, Bayarri M (2003) P values are not error probabilities. Technical Report University of Valencia. http://www.uv.es/sestio/TechRep/tr14-03.pdf. Accessed 8 February 2021

  • Iman R, Davenport J (1980) Approximations of the critical region of the Friedman statistic. Commun Stat 9(6):571–595

  • Infanger D, Schmidt-Trucksäss A (2019) P value functions: an underused method to present research results and to promote quantitative reasoning. Stat Med 38(21):4189–4197

  • Isaksson A, Wallmana M, Göransson H, Gustafsson M (2008) Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recogn Lett 29(14):1960–1965

  • Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press, New York

  • Kass R, Raftery A (1995) Bayes factors. J Am Stat Assoc 90(430):773–795

  • Kruschke J (2010) Bayesian data analysis. WIREs Cogn Sci 1(5):658–676

  • Kruschke J (2013) Bayesian estimation supersedes the \(t\) test. J Exp Psychol Gen 142(2):573–603

  • Kruschke J (2015) Doing Bayesian data analysis, 2nd edn. Elsevier Academic Press, Amsterdam. http://doingbayesiandataanalysis.blogspot.com/

  • Kruschke J (2018) Rejecting or accepting parameter values in Bayesian estimation. Adv Methods Pract Psychol Sci 1(2):270–280

  • Kruschke J, Liddell T (2018) Bayesian data analysis for newcomers. Psychon Bull Rev 25:155–177

  • Lakens D (2021) The practical alternative to the \(p\) value is the correctly used \(p\) value. Perspect Psychol Sci 16(3):639–648

  • Lindley D (1957) A statistical paradox. Biometrika 44:187–192

  • McNemar Q (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12:153–157

  • McShane BB, Gal D, Gelman A, Robert C, Tackett JL (2019) Abandon statistical significance. Am Stat 73(sup1: Statistical Inference in the 21st Century: A World Beyond \(p < 0.05\)):235–245

  • Meehl P (1967) Theory-testing in psychology and physics: a methodological paradox. Philos Sci 34(2):103–115

  • Miller J, Ulrich R (2016) Interpreting confidence intervals: a comment on Hoekstra, Morey, Rouder, and Wagenmakers (2014). Psychon Bull Rev 23(1):124–130

  • Mulaik S, Raju N, Harshman R (2016) There is a time and a place for significance testing. In: Harlow L, Mulaik S, Steiger J (eds) What if there were no significance tests? Routledge Classic Editions

  • Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52:239–281

  • Nosek B, Ebersole C, DeHaven A, Mellor D (2018) The preregistration revolution. Proc Natl Acad Sci USA 115(11):2600–2606

  • Nuzzo R (2014) Statistical errors. Nature 506:150–152

  • Perneger T (1998) What’s wrong with Bonferroni adjustments. BMJ 316:1236–1238

  • Poole C (1987) Beyond the confidence interval. Am J Public Health 77(2):195–199

  • Raschka S (2018) Model evaluation, model selection, and algorithm selection in machine learning. CoRR. arXiv:1811.12808

  • Rothman K (1990) No adjustments are needed for multiple comparisons. Epidemiology 1(1):43–46

  • Rothman K (1998) Writing for epidemiology. Epidemiology 9(3):333–337

  • Rothman K, Greenland S, Lash T (2008) Modern epidemiology, 3rd edn. Wolters Kluwer

  • Rozeboom W (1960) The fallacy of the null hypothesis significance test. Psychol Bull 57:416–428

  • Salzberg S (1997) On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min Knowl Disc 1:317–327

  • Schmidt F (1996) Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychol Methods 1(2):115–129

  • Schmidt F, Hunter J (2016) Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow L, Mulaik S, Steiger J (eds) What if there were no significance tests? Routledge, pp 35–60

  • Schneider J (2015) Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations. Scientometrics 102(1):411–432

  • Sellke T, Bayarri M, Berger J (2001) Calibration of \(p\) values for testing precise null hypotheses. Am Stat 55(1):62–71

  • Serlin R, Lapsley D (1985) Rationality in psychological research: the good-enough principle. Am Psychol 40(1):73–83

  • Sheskin D (2007) Handbook of parametric and nonparametric statistical procedures, 4th edn. Chapman and Hall/CRC

  • Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10

  • Stang A, Poole C, Kuss O (2010) The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol 25:225–230

  • Tukey J (1991) The philosophy of multiple comparisons. Stat Sci 6(1):100–116

  • Vovk V (1993) A logic of probability, with application to the foundations of statistics. J Roy Stat Soc B 55:317–351

  • Wagenmakers E-J (2007) A practical solution to the pervasive problems of \(p\) values. Psychon Bull Rev 14(5):779–804

  • Wagenmakers E-J, Ly A (2021) History and nature of the Jeffreys–Lindley Paradox. https://arxiv.org/abs/2111.10191

  • Wagenmakers E-J, Gronau Q, Vandekerckhove J (2019) Five Bayesian intuitions for the stopping rule principle. PsyArXiv 1–13. https://doi.org/10.31234/osf.io/5ntkd

  • Wasserstein R, Lazar N (2016) The ASA’s statement on \(p\)-values: context, process, and purpose (editorial). Am Stat 70(2):129–133

  • Wasserstein R, Schirm A, Lazar N (2019) Moving to a world beyond “\(p < 0.05\)". Am Stat 73(sup1: Statistical Inference in the 21st Century: A World Beyond \(p < 0.05\)):1–19

  • Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83

  • Wolpert D (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390

Acknowledgements

I am very grateful to James O. Berger for our discussion of the p-value under optional stopping. I also thank the three reviewers and the editor very much for their constructive comments.

Author information

Correspondence to Daniel Berrar.

Additional information

Responsible editor: Johannes Fürnkranz.

Appendix: Bayes factor bound

Under the null hypothesis, the p-value is known to be a random variable that is uniformly distributed over [0, 1]. Under the alternative hypothesis \(H_1\), the p-value has density \(f(p\,|\,\xi )\), with \(\xi \) being an unknown parameter (Sellke et al. 2001; Benjamin and Berger 2019),

$$\begin{aligned} H_0: p \sim {\mathcal {U}}(0,1) \end{aligned}$$
(17)
$$\begin{aligned} H_1: p \sim f(p\,|\,\xi ) \end{aligned}$$
(18)

In significance testing, larger absolute values of the test statistic (cf. Eq. 1) are taken as casting more doubt on the null hypothesis and thereby providing more evidence in favor of the alternative hypothesis. Under \(H_1\), the density should therefore be decreasing for increasing values of p. Sellke et al. (2001) propose the class of Beta(\(\xi ,1\)) densities, with \(0 < \xi \le 1\), for the distribution of the p-value,

$$\begin{aligned} f(p\,|\,\xi ) = \xi p^{\xi -1} \end{aligned}$$
(19)

For \(\xi = 1\), we obtain the uniform distribution under the null hypothesis, \(f(p\,|\,\xi = 1) = 1\). For a given prior density \(\pi (\xi )\) under the alternative hypothesis, and assuming equal prior probabilities \({\mathbb {P}}(H_0) = {\mathbb {P}}(H_1) = 0.5\), the posterior odds of \(H_1\) to \(H_0\) (i.e., the Bayes factor) are

$$\begin{aligned} \mathrm {BF}_{\pi }(p)&= \frac{{\mathbb {P}}(H_1\,|\,p)}{{\mathbb {P}}(H_0\,|\,p)} \nonumber \\&= \frac{\int _{0}^{1} f(p\,|\,\xi )~\pi (\xi )~d\xi }{f(p\,|\, 1)} \end{aligned}$$
(20)

The upper bound of \(\mathrm {BF}_{\pi }(p)\) is

$$\begin{aligned} \overline{\mathrm {BF}}_{\pi }(p) = \sup _{\mathrm {all}\,\pi } \mathrm {BF}_{\pi }(p) = \frac{\sup _{\xi } \xi \,p^{\xi -1}}{f(p\,|\, 1)} \end{aligned}$$
(21)

To find the supremum, we set the derivative of \(\xi \,p^{\xi -1}\) with respect to \(\xi \) to zero:

$$\begin{aligned} \frac{\partial \,\xi \,p^{\xi -1}}{\partial \xi } = p^{\xi -1} + \xi \,p^{\xi -1}\,\ln (p)&\overset{!}{=} 0 \\ 1 + \xi \,\ln (p)&= 0 \\ \xi&= - \frac{1}{\ln (p)}\quad \mathrm {for}\,p \le e^{-1}\,\mathrm {since}\, \xi \le 1 \end{aligned}$$

Hence,

$$\begin{aligned} f\left( p\,|-\frac{1}{\ln (p)}\right)&= -\frac{1}{\ln (p)}\,p^{-\frac{1}{\ln (p)}-1} \\&= -\frac{1}{\ln (p)}\,p^{-\left( \frac{1}{\ln (p)}+1\right) } \\&= -\frac{1}{\ln (p)\,p^{\frac{1}{\ln (p)}} \, p} \\&= -\frac{1}{\ln (p)\,e\,p} \quad \mathrm {because}\,p^{\frac{1}{\ln (p)}} = p^{\log _pe} = e \end{aligned}$$

The second derivative w.r.t. \(\xi \) is

$$\begin{aligned} \frac{\partial ^2 f(p\,|\,\xi )}{\partial \xi ^2} = 2\,p^{\xi -1}\,\ln (p) + [\ln (p)]^2 \,\xi \,p^{\xi -1} \end{aligned}$$

and for \(\xi = - \frac{1}{\ln (p)}\),

$$\begin{aligned} \frac{\partial ^2 f\left( p\,|- \frac{1}{\ln (p)}\right) }{\partial \xi ^2} = \frac{\ln (p)}{e\,p} < 0 \end{aligned}$$

Thus, \([-\ln (p)\,e\,p]^{-1}\) is an upper bound on the odds of the alternative to the null hypothesis for \(p \le e^{-1}\), and this bound holds for any reasonable prior distribution on \(\xi \) (Sellke et al. 2001).
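
The bound is easy to check numerically. The sketch below, a verification aid rather than part of the original derivation, maximizes the Beta(\(\xi \), 1) density \(f(p\,|\,\xi ) = \xi \,p^{\xi -1}\) over \(\xi \in (0, 1]\) and compares the result with the closed form \([-\ln (p)\,e\,p]^{-1}\):

  import numpy as np
  from scipy import optimize

  def bfb_closed_form(p):
      # Closed-form Bayes factor bound [-ln(p) * e * p]^(-1), valid for p <= 1/e.
      return 1.0 / (-np.log(p) * np.e * p)

  def bfb_numeric(p):
      # Maximize f(p | xi) = xi * p**(xi - 1) over xi in (0, 1]; since f(p | 1) = 1,
      # the supremum itself is the bound on the odds of H1 to H0.
      res = optimize.minimize_scalar(lambda xi: -xi * p ** (xi - 1.0),
                                     bounds=(1e-9, 1.0), method="bounded")
      return -res.fun

  for p in (0.05, 0.01, 0.0034):
      print(f"p = {p}: closed form = {bfb_closed_form(p):.3f}, numeric = {bfb_numeric(p):.3f}")

For \(p = 0.05\), the bound is about 2.46: even a result that is just significant at the 5% level corresponds to odds of at most roughly 2.5 to 1 in favor of \(H_1\).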

Cite this article

Berrar, D. Using p-values for the comparison of classifiers: pitfalls and alternatives. Data Min Knowl Disc 36, 1102–1139 (2022). https://doi.org/10.1007/s10618-022-00828-1
