Summary
A dataset is imbalanced if its classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating classifier performance, may not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss sampling techniques used for balancing datasets, and performance measures more appropriate for mining imbalanced datasets.
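One family of sampling techniques the chapter covers generates synthetic minority examples by interpolating between a minority point and one of its nearest minority-class neighbours (the idea behind SMOTE). The sketch below is a minimal, illustrative pure-Python version, not a reference implementation; the function name `smote_like` and its parameters are hypothetical, and a practical implementation would use an optimized nearest-neighbour search.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolation.

    minority: list of feature tuples belonging to the minority class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class
        # (squared Euclidean distance, excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        # place the synthetic point at a random position on the
        # line segment between x and the chosen neighbour
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=4)
print(len(new_points))  # 4
```

Because each synthetic point lies between two existing minority points, this broadens the minority region seen by the learner rather than merely replicating examples, which is what distinguishes synthetic oversampling from random oversampling with replacement.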
Acknowledgements
I would like to thank Larry Hall, Kevin Bowyer and Philip Kegelmeyer for their valuable input during my Ph.D. research in this field. I am also extremely grateful to all my collaborators and co-authors in the area of learning from imbalanced datasets. I have enjoyed working with them and contributing to this field.
© 2009 Springer Science+Business Media, LLC
Cite this chapter
Chawla, N.V. (2009). Data Mining for Imbalanced Datasets: An Overview. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_45
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4