Abstract
Finding a good classification algorithm is an important component of many data mining projects, and it requires careful attention to experimental design. Without that care, comparative studies of classification and other algorithms can easily produce statistically invalid conclusions. This is especially true when data mining techniques are used to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that, if ignored, can invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis matters more for evaluating some types of algorithms than others, and offers suggestions for avoiding the pitfalls that have marred many experimental studies.
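To make one of these phenomena concrete, the sketch below (an illustration added here, not code from the paper) simulates the multiplicity effect: when many equivalent algorithms are compared against a baseline on a single test set, some will appear significantly better purely by chance unless the significance threshold is adjusted, for example with a Bonferroni correction. The test-set size, the number of challengers, and the use of a sign test are all assumptions of the example.

```python
# A minimal sketch (illustrative only): twenty "challenger" classifiers,
# all guessing at random exactly like the baseline, are each compared to
# the baseline with a two-sided sign test. With 20 comparisons at
# alpha = 0.05, a spurious "significant" win appears in roughly
# 1 - 0.95**20 ~ 64% of runs unless alpha is divided by the number of
# comparisons (Bonferroni).
import math
import random

random.seed(0)

N_TEST = 1000        # hypothetical test-set size
N_CHALLENGERS = 20   # hypothetical number of competing algorithms
ALPHA = 0.05

def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided sign test over the discordant examples (those where
    exactly one of the two classifiers is correct); under the null
    hypothesis each discordant example favors either side with p = 0.5."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

truth = [random.randint(0, 1) for _ in range(N_TEST)]
baseline = [random.randint(0, 1) for _ in range(N_TEST)]   # pure guessing

p_values = []
for _ in range(N_CHALLENGERS):
    challenger = [random.randint(0, 1) for _ in range(N_TEST)]  # also guessing
    wins = sum(c == t != b for c, b, t in zip(challenger, baseline, truth))
    losses = sum(b == t != c for c, b, t in zip(challenger, baseline, truth))
    p_values.append(sign_test_p(wins, losses))

naive = sum(p < ALPHA for p in p_values)
corrected = sum(p < ALPHA / N_CHALLENGERS for p in p_values)  # Bonferroni
print(f"'significant' wins: {naive} uncorrected, {corrected} Bonferroni-corrected")
```

Across repeated runs, at least one challenger will usually pass the uncorrected threshold even though no real differences exist, while almost none survive the correction, which is exactly the kind of invalid conclusion the paper warns against.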
Cite this article
Salzberg, S.L. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery 1, 317–328 (1997). https://doi.org/10.1023/A:1009752403260