Abstract
Resampling methods are among the most popular strategies for addressing the class imbalance problem. They aim to compensate for the imbalanced class distribution by over-sampling the minority class, under-sampling the majority class, or both. This paper introduces a new under-sampling method based on the DBSCAN clustering algorithm. The main idea is to remove the majority class instances that DBSCAN identifies as noise. The proposed method is empirically compared with well-known state-of-the-art under-sampling algorithms over 25 benchmark databases, and the experimental results demonstrate its effectiveness in terms of sensitivity, specificity, and the geometric mean of individual accuracies.
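The idea stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether DBSCAN is run on the majority class alone or on the whole data set, nor how its parameters are chosen, so the code below assumes DBSCAN is applied to the majority class only, with hypothetical parameters `eps` and `min_pts` supplied by the caller. It uses the standard DBSCAN definition of noise (a point that is neither a core point nor within `eps` of a core point) and computes the three evaluation measures named above, assuming the minority class is treated as the positive class. All function names are illustrative.

```python
import math


def dbscan_noise_mask(points, eps, min_pts):
    """Return a list of booleans, True where DBSCAN would label a point as noise.

    A point is a core point if at least min_pts points (itself included) lie
    within distance eps of it. A point is noise if it is neither a core point
    nor within eps of any core point.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    return [
        not is_core[i] and not any(is_core[j] for j in neighbors[i])
        for i in range(n)
    ]


def undersample_majority(X, y, majority_label, eps, min_pts):
    """Drop majority-class instances flagged as noise; keep everything else.

    Assumption: DBSCAN is run on the majority class only.
    """
    maj_idx = [i for i, label in enumerate(y) if label == majority_label]
    noise = dbscan_noise_mask([X[i] for i in maj_idx], eps, min_pts)
    drop = {i for i, flagged in zip(maj_idx, noise) if flagged}
    keep = [i for i in range(len(y)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity_specificity_gmean(y_true, y_pred, positive_label):
    """Return (sensitivity, specificity, geometric mean of the two)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive_label:
            if pred == positive_label:
                tp += 1
            else:
                fn += 1
        elif pred == positive_label:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Toy data: label 0 is the majority class, label 1 the minority class.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (4, 4), (4, 5)]
y = [0, 0, 0, 0, 0, 1, 1]

# The isolated majority point (8, 8) has no neighbours within eps, so it is
# flagged as noise and removed; the minority instances are left untouched.
X_res, y_res = undersample_majority(X, y, majority_label=0, eps=1.5, min_pts=3)
```

This brute-force neighbour search is quadratic in the number of majority instances and is meant only to make the definition explicit. In practice a library implementation could be used instead, for example scikit-learn's `DBSCAN`, which marks noise points with the label `-1`.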
Acknowledgment
This work was partially supported by the Universitat Jaume I under grant [UJI-B2018-49], the 5046/2020CIC UAEM project and the Mexican Science and Technology Council (CONACYT) under scholarship [702275].
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Guzmán-Ponce, A., Valdovinos, R.M., Sánchez, J.S. (2020). A Cluster-Based Under-Sampling Algorithm for Class-Imbalanced Data. In: de la Cal, E.A., Villar Flecha, J.R., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2020. Lecture Notes in Computer Science, vol 12344. Springer, Cham. https://doi.org/10.1007/978-3-030-61705-9_25
DOI: https://doi.org/10.1007/978-3-030-61705-9_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61704-2
Online ISBN: 978-3-030-61705-9
eBook Packages: Computer Science, Computer Science (R0)