A novel ensemble method for classifying imbalanced data

Published: 01 May 2015

Abstract

Class imbalance problems have been reported to severely hinder the classification performance of many standard learning algorithms and have attracted a great deal of attention from researchers in different fields. A number of methods, such as sampling methods, cost-sensitive learning methods, and bagging- and boosting-based ensemble methods, have therefore been proposed to solve these problems. However, these conventional class imbalance handling methods may suffer from the loss of potentially useful information, unexpected mistakes, or an increased likelihood of overfitting, because they alter the original data distribution. We thus propose a novel ensemble method, which first converts an imbalanced data set into multiple balanced ones and then builds a number of classifiers on these data sets with a specific classification algorithm. Finally, the classification results of these classifiers for new data are combined by a specific ensemble rule. In the empirical study, our method was compared with a range of class imbalance handling methods, including three conventional sampling methods, one cost-sensitive learning method, six bagging- and boosting-based ensemble methods, our previous method EM1vs1, and two fuzzy-rule-based classification methods. The experimental results on 46 imbalanced data sets show that the proposed method is usually superior to the conventional imbalance handling methods, particularly on highly imbalanced problems.

Highlights

  • We propose a novel ensemble method to handle imbalanced binary data.
  • The method turns imbalanced data learning into multiple balanced data learning.
  • Our method usually performs better than conventional methods on imbalanced data.
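The pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it partitions the majority class into minority-sized chunks to form balanced subsets, trains one base learner per subset, and combines predictions by majority vote (the paper's specific ensemble rule is not given here, so voting stands in). The toy `CentroidClassifier` is a hypothetical stand-in for a standard base algorithm such as C4.5 or naive Bayes.

```python
import random
from collections import Counter

def make_balanced_subsets(majority, minority, rng):
    """Partition the majority class into minority-sized chunks and pair
    each chunk with the full minority class, yielding balanced subsets."""
    maj = list(majority)
    rng.shuffle(maj)
    k = len(minority)
    subsets = []
    for i in range(0, len(maj), k):
        chunk = maj[i:i + k]
        if len(chunk) < k:                      # pad the last chunk by resampling
            chunk += rng.choices(maj, k=k - len(chunk))
        subsets.append(chunk + list(minority))
    return subsets

class CentroidClassifier:
    """Toy base learner: predicts the class with the nearest centroid."""
    def fit(self, samples):                     # samples: [(features, label), ...]
        by_label = {}
        for x, label in samples:
            by_label.setdefault(label, []).append(x)
        self.centroids = {
            label: tuple(sum(vals) / len(pts) for vals in zip(*pts))
            for label, pts in by_label.items()
        }
        return self
    def predict(self, x):
        return min(self.centroids,
                   key=lambda l: sum((a - b) ** 2
                                     for a, b in zip(x, self.centroids[l])))

def ensemble_predict(classifiers, x):
    """Combine member outputs by majority vote (one possible ensemble rule)."""
    votes = Counter(clf.predict(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Imbalanced toy data: 6 majority (label 0) vs. 2 minority (label 1) samples.
rng = random.Random(0)
majority = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0),
            ((1.0, 1.0), 0), ((0.5, 0.5), 0), ((1.0, 0.5), 0)]
minority = [((5.0, 5.0), 1), ((6.0, 5.0), 1)]

subsets = make_balanced_subsets(majority, minority, rng)   # 3 balanced subsets
members = [CentroidClassifier().fit(s) for s in subsets]
print(ensemble_predict(members, (5.5, 5.0)))               # 1 (minority region)
print(ensemble_predict(members, (0.2, 0.3)))               # 0 (majority region)
```

Unlike plain undersampling, every majority sample is used by some ensemble member, so no information is discarded, which is the motivation the abstract gives for converting one imbalanced problem into several balanced ones.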




Published In

Pattern Recognition, Volume 48, Issue 5
May 2015
360 pages

Publisher

Elsevier Science Inc.

United States


Author Tags

  1. Classification
  2. Ensemble learning
  3. Imbalanced data

Qualifiers

  • Research-article
