A novel ensemble method for classifying imbalanced data

Published: 01 May 2015

Abstract

Class imbalance problems have been reported to severely hinder the classification performance of many standard learning algorithms and have attracted a great deal of attention from researchers in different fields. A number of methods, such as sampling methods, cost-sensitive learning methods, and bagging- and boosting-based ensemble methods, have therefore been proposed to solve these problems. However, these conventional class imbalance handling methods may suffer from the loss of potentially useful information, unexpected mistakes, or an increased likelihood of overfitting, because they alter the original data distribution. We thus propose a novel ensemble method, which first converts an imbalanced data set into multiple balanced ones and then builds a number of classifiers on these data sets with a specific classification algorithm. Finally, the classification results of these classifiers for new data are combined by a specific ensemble rule. In the empirical study, our method was compared with a range of class imbalance handling methods, including three conventional sampling methods, one cost-sensitive learning method, six bagging- and boosting-based ensemble methods, our previous method EM1vs1, and two fuzzy-rule-based classification methods. The experimental results on 46 imbalanced data sets show that the proposed method is usually superior to the conventional imbalance handling methods, particularly on highly imbalanced problems.

Highlights

  • We propose a novel ensemble method to handle imbalanced binary data.
  • The method turns imbalanced data learning into multiple balanced data learning.
  • Our method usually performs better than conventional methods on imbalanced data.
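The pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it partitions the majority class into minority-sized chunks to form balanced subsets, trains one base learner per subset, and combines predictions by majority vote (the paper's specific ensemble rule is not given here, so voting stands in). The toy `CentroidClassifier` is a hypothetical stand-in for a standard base algorithm such as C4.5 or naive Bayes.

```python
import random
from collections import Counter

def make_balanced_subsets(majority, minority, rng):
    """Partition the majority class into minority-sized chunks and pair
    each chunk with the full minority class, yielding balanced subsets."""
    maj = list(majority)
    rng.shuffle(maj)
    k = len(minority)
    subsets = []
    for i in range(0, len(maj), k):
        chunk = maj[i:i + k]
        if len(chunk) < k:                      # pad the last chunk by resampling
            chunk += rng.choices(maj, k=k - len(chunk))
        subsets.append(chunk + list(minority))
    return subsets

class CentroidClassifier:
    """Toy base learner: predicts the class with the nearest centroid."""
    def fit(self, samples):                     # samples: [(features, label), ...]
        by_label = {}
        for x, label in samples:
            by_label.setdefault(label, []).append(x)
        self.centroids = {
            label: tuple(sum(vals) / len(pts) for vals in zip(*pts))
            for label, pts in by_label.items()
        }
        return self
    def predict(self, x):
        return min(self.centroids,
                   key=lambda l: sum((a - b) ** 2
                                     for a, b in zip(x, self.centroids[l])))

def ensemble_predict(classifiers, x):
    """Combine member outputs by majority vote (one possible ensemble rule)."""
    votes = Counter(clf.predict(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Imbalanced toy data: 6 majority (label 0) vs. 2 minority (label 1) samples.
rng = random.Random(0)
majority = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0),
            ((1.0, 1.0), 0), ((0.5, 0.5), 0), ((1.0, 0.5), 0)]
minority = [((5.0, 5.0), 1), ((6.0, 5.0), 1)]

subsets = make_balanced_subsets(majority, minority, rng)   # 3 balanced subsets
members = [CentroidClassifier().fit(s) for s in subsets]
print(ensemble_predict(members, (5.5, 5.0)))               # 1 (minority region)
print(ensemble_predict(members, (0.2, 0.3)))               # 0 (majority region)
```

Unlike plain undersampling, every majority sample is used by some ensemble member, so no information is discarded, which is the motivation the abstract gives for converting one imbalanced problem into several balanced ones.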




Published In

Pattern Recognition, Volume 48, Issue 5
May 2015
360 pages

Publisher

Elsevier Science Inc.

United States


Author Tags

  1. Classification
  2. Ensemble learning
  3. Imbalanced data

Qualifiers

  • Research-article
