A Cluster-Based Under-Sampling Algorithm for Class-Imbalanced Data

A. Guzmán-Ponce, R. M. Valdovinos & J. S. Sánchez (ORCID: orcid.org/0000-0003-1053-4658)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12344)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems

Abstract

Resampling methods are among the most popular strategies for tackling the class imbalance problem. They aim to compensate for the imbalanced class distribution by over-sampling the minority class, under-sampling the majority class, or both. This paper introduces a new under-sampling method based on the DBSCAN clustering algorithm. The main idea is to remove the majority class instances that DBSCAN identifies as noise. The proposed method is compared empirically with well-known state-of-the-art under-sampling algorithms on 25 benchmark databases, and the experimental results demonstrate its effectiveness in terms of sensitivity, specificity, and the geometric mean of individual accuracies.
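
As an illustration of the idea described in the abstract, the sketch below flags the majority class instances that DBSCAN would label as noise and drops them. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the default eps and min_pts values, and the choice to run DBSCAN on the majority class alone are assumptions, since the abstract does not specify them. The geometric mean is computed in its usual two-class form, the square root of sensitivity times specificity.

```python
import math


def dbscan_noise_mask(points, eps, min_pts):
    """Return a list of booleans, True where DBSCAN would label a point as noise."""
    n = len(points)
    # eps-neighbourhood of every point (a point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts neighbours.
    core = [len(nb) >= min_pts for nb in neighbours]
    # Noise is neither a core point nor within eps of one (not a border point).
    return [
        not core[i] and not any(core[j] for j in neighbours[i])
        for i in range(n)
    ]


def undersample_majority(X, y, majority_label, eps=0.5, min_pts=3):
    """Drop majority class instances flagged as noise and keep everything else.

    Assumption: DBSCAN is applied to the majority class only; eps and min_pts
    are illustrative defaults, not values taken from the paper.
    """
    maj_idx = [i for i, label in enumerate(y) if label == majority_label]
    noise = dbscan_noise_mask([X[i] for i in maj_idx], eps, min_pts)
    drop = {i for i, is_noise in zip(maj_idx, noise) if is_noise}
    keep = [i for i in range(len(y)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity_specificity_gmean(y_true, y_pred, positive_label):
    """Return (sensitivity, specificity, geometric mean) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    tn = sum(t != positive_label and p != positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Toy data: four tightly grouped majority points, one isolated majority
# point, and two minority points (label 1), which are never removed.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (1.0, 1.0), (1.1, 1.0)]
y = [0, 0, 0, 0, 0, 1, 1]
X_res, y_res = undersample_majority(X, y, majority_label=0)
```

On this toy data only the isolated majority point at (5.0, 5.0) is removed. In practice eps and min_pts would need tuning per data set, and the minority class is taken as the positive class when computing sensitivity.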

Acknowledgment

This work was partially supported by the Universitat Jaume I under grant [UJI-B2018-49], the 5046/2020CIC UAEM project and the Mexican Science and Technology Council (CONACYT) under scholarship [702275].

Author information

Authors and Affiliations

Facultad de Ingeniería, Universidad Autónoma del Estado de México, Toluca, Mexico
A. Guzmán-Ponce & R. M. Valdovinos
Institute of New Imaging Technologies, Department of Computer Languages and Systems, Universitat Jaume I, Castelló de la Plana, Spain
A. Guzmán-Ponce & J. S. Sánchez

Corresponding author

Correspondence to A. Guzmán-Ponce.

Editor information

Editors and Affiliations

University of Oviedo, Oviedo, Spain
Enrique Antonio de la Cal
University of Oviedo, Oviedo, Spain
José Ramón Villar Flecha
University of A Coruña, Ferrol, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado

About this paper

Cite this paper

Guzmán-Ponce, A., Valdovinos, R.M., Sánchez, J.S. (2020). A Cluster-Based Under-Sampling Algorithm for Class-Imbalanced Data. In: de la Cal, E.A., Villar Flecha, J.R., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2020. Lecture Notes in Computer Science, vol 12344. Springer, Cham. https://doi.org/10.1007/978-3-030-61705-9_25

DOI: https://doi.org/10.1007/978-3-030-61705-9_25
Published: 04 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61704-2
Online ISBN: 978-3-030-61705-9
eBook Packages: Computer Science, Computer Science (R0)
