Abstract
In this paper, a new clustering algorithm, IM-c-means, is proposed for clusters with skewed distributions. The c-means algorithm is a well-known and widely used strategy for data clustering, but it tends to perform poorly when the data set is not uniformly distributed, a phenomenon called the “uniform effect” in the literature. We first analyze the cause of this effect and find that it occurs only when cluster sizes vary, whereas differing object densities between clusters have no effect on the c-means algorithm. Based on this finding, we form a new objective function that accounts for the volumes and object densities of all clusters. This yields a clustering algorithm that is effective for clusters of varied sizes or densities while retaining the good performance of the traditional c-means algorithm on balanced data sets. Experiments on both synthetic and real data sets give promising results for the proposed algorithm. In addition, a nonparametric test shows that the proposed algorithm can offer a significant improvement over other clustering methods on imbalanced data sets.
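The uniform effect described above can be illustrated with a minimal sketch. It does not implement IM-c-means, whose objective function is not given in the abstract; it only runs plain hard c-means (Lloyd's iterations) on a hypothetical 1-D data set. The data, the helper name `lloyd_c_means`, and the reading of "size" as spatial extent are assumptions made for illustration. Both clusters have the same object spacing (density) but different extents, and the centers are initialized at the true cluster means.

```python
import numpy as np


def lloyd_c_means(x, centers, n_iter=100):
    """Plain hard c-means (Lloyd's iterations) on 1-D data.

    Illustrative helper only; this is not the IM-c-means algorithm.
    """
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        # Assign each object to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Recompute each center as the mean of its assigned objects;
        # keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            x[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical data: a broad cluster on [0, 10] and a compact one on [11, 12],
# both with spacing 0.1, so density is equal and only the extent differs.
broad = np.linspace(0.0, 10.0, 101)
compact = np.linspace(11.0, 12.0, 11)
x = np.concatenate([broad, compact])
truth = np.concatenate([np.zeros(101, dtype=int), np.ones(11, dtype=int)])

# Start from the true cluster means, the most favorable initialization.
labels, centers = lloyd_c_means(x, centers=[broad.mean(), compact.mean()])

# Count broad-cluster objects that ended up with the compact cluster.
absorbed = int(np.sum((truth == 0) & (labels == 1)))
print("final centers:", centers)
print("broad-cluster objects assigned to the compact cluster:", absorbed)
```

Even from this favorable start, the boundary between the two centers drifts into the broad cluster, so part of it is assigned to the compact cluster and the resulting partition is more balanced than the true one.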
Acknowledgements
This work has been supported by the National Natural Science Foundation of China (61503151), the Natural Science Foundation of Jilin Province (20160520100JH) and the Project funded by China Postdoctoral Science Foundation (2019M651204).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Liu, Y., Hou, T., Miao, Y. et al. IM-c-means: a new clustering algorithm for clusters with skewed distributions. Pattern Anal Applic 24, 611–623 (2021). https://doi.org/10.1007/s10044-020-00932-2
DOI: https://doi.org/10.1007/s10044-020-00932-2