Abstract
Grouping similar objects into common classes, a task known as clustering, is an important problem in unsupervised machine learning. Various clustering algorithms have been proposed in the literature. In recent years, the need to analyze large amounts of data has led to a reconsideration of some fundamental clustering procedures. One of them is the celebrated K-means algorithm, popular among practitioners for its speed and appealingly intuitive construction. Unfortunately, the algorithm often performs poorly unless the data groups have spherical shapes and approximately equal sizes. In many applications this restriction is so severe that the use of K-means becomes questionable, misleading, or simply incorrect. We propose an extension of K-means that preserves the speed and intuitive interpretation of the original algorithm while providing greater flexibility in modeling clusters. The proposed generalization relies on the exponential transformation of Manly, originally designed to produce near-normally distributed data. The suggested modification is derived and illustrated on several datasets with good results.
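For readers unfamiliar with it, the following is a sketch of Manly's exponential transformation as given in Manly (1976), applied coordinate-wise with a skewness parameter; the notation below is illustrative rather than taken from the paper, and the precise way the transformation enters the modified K-means objective is developed in the article itself:

\[
\mathcal{M}(x;\lambda) =
\begin{cases}
\dfrac{e^{\lambda x}-1}{\lambda}, & \lambda \neq 0,\\[4pt]
x, & \lambda = 0.
\end{cases}
\]

With \(\lambda\) chosen suitably, the transformed data are approximately normal, so that the Euclidean-distance computations underlying K-means become better justified for skewed clusters.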
References
Altman EI (1968) Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4):589–609
Anderson E (1935) The irises of the Gaspé Peninsula. Bull Am Iris Soc 59:2–5
Andrews DF, Gnanadesikan R, Warner JL (1971) Transformations of multivariate data. Biometrics 27(4):825–840
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc Ser B 26:211–252
Brock G, Pihur V, Datta S, Datta S (2008) clValid: an R package for cluster validation. J Stat Softw 25(4):1–22
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332
Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of IEEE symposium on foundations of computer science, New York, pp 633–644
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38
Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 01 Jan 2017
Efron B, Tibshirani R, Storey J, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Kahraman HT, Sagiroglu S, Colak I (2013) Developing intuitive knowledge classifier and modeling of users’ domain dependent data in web. Knowl Based Syst 37:283–295
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Lee SX, McLachlan GJ (2013) Model-based clustering and classification with non-normal mixture distributions. Stat Methods Appl 22(4):427–454
Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19(2):354–376
Manly BFJ (1976) Exponential data transformations. Statistician 25:37–42
McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York
Melnykov I, Melnykov V (2014) On k-means algorithm with the use of Mahalanobis distances. Stat Probab Lett 84:88–95
Melnykov V (2013a) Challenges in model-based clustering. WIREs Comput Stat 5:135–148
Melnykov V (2013b) Finite mixture modelling in mass spectrometry analysis. J R Stat Soc Ser C 62:573–592
Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56:1381–1395
Melnykov V, Shen G (2013) Clustering through empirical likelihood ratio. Comput Stat Data Anal 62:1–10
Melnykov V, Chen WC, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25
Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge
Schlattmann P (2009) Medical applications of finite mixture models. Springer, Berlin
Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38:1409–1438
Tortora C, Franczak BC, Browne RP, McNicholas PD (2014) A mixture of coalesced generalized hyperbolic distributions. arXiv:1403.2332
Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244
Zhu X, Melnykov V (2017) ManlyMix: an R package for Manly mixture modeling. R J 9(2):176–197
Acknowledgements
This research was partially funded by an internal research grant from the University of Louisville Office of the Executive Vice President for Research and Innovation (EVPRI).
Cite this article
Melnykov, V., Zhu, X. An extension of the K-means algorithm to clustering skewed data. Comput Stat 34, 373–394 (2019). https://doi.org/10.1007/s00180-018-0821-z