Abstract
Grouping similar objects into common classes, a task known as clustering, is an important problem in unsupervised machine learning. Various clustering algorithms have been proposed in the literature. In recent years, the need to analyze large amounts of data has led to a reconsideration of some fundamental clustering procedures. One of them is the celebrated K-means algorithm, popular among practitioners for its speed and appealingly intuitive construction. Unfortunately, the algorithm often performs poorly unless the data groups have spherical shapes and approximately equal sizes. In many applications this restriction is so severe that the use of K-means becomes questionable, misleading, or simply incorrect. We propose an extension of K-means that preserves the speed and intuitive interpretation of the original algorithm while providing greater flexibility in modeling clusters. The proposed generalization relies on the exponential transformation of Manly, originally designed to produce near-normally distributed data. The suggested modification is derived and illustrated on several datasets with good results.
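For readers unfamiliar with it, the following is a sketch of Manly's exponential transformation as given in Manly (1976), applied coordinate-wise with a skewness parameter; the notation below is illustrative rather than taken from the paper, and the precise way the transformation enters the modified K-means objective is developed in the article itself:

\[
\mathcal{M}(x;\lambda) =
\begin{cases}
\dfrac{e^{\lambda x}-1}{\lambda}, & \lambda \neq 0,\\[4pt]
x, & \lambda = 0.
\end{cases}
\]

With \(\lambda\) chosen suitably, the transformed data are approximately normal, so that the Euclidean-distance computations underlying K-means become better justified for skewed clusters.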
References
Altman EI (1968) Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4):589–609
Anderson E (1935) The irises of the Gaspé Peninsula. Bull Am Iris Soc 59:2–5
Andrews DF, Gnanadesikan R, Warner JL (1971) Transformations of multivariate data. Biometrics 27(4):825–840
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc Ser B 26:211–252
Brock G, Pihur V, Datta S, Datta S (2008) clValid: an R package for cluster validation. J Stat Softw 25(4):1–22
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332
Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of IEEE symposium on foundations of computer science, New York, pp 633–644
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38
Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 01 Jan 2017
Efron B, Tibshirani R, Storey J, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Kahraman HT, Sagiroglu S, Colak I (2013) Developing intuitive knowledge classifier and modeling of users’ domain dependent data in web. Knowl Based Syst 37:283–295
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Lee SX, McLachlan GJ (2013) Model-based clustering and classification with non-normal mixture distributions. Stat Methods Appl 22(4):427–454
Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19(2):354–376
Manly BFJ (1976) Exponential data transformations. Statistician 25:37–42
McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York
Melnykov I, Melnykov V (2014) On k-means algorithm with the use of Mahalanobis distances. Stat Probab Lett 84:88–95
Melnykov V (2013a) Challenges in model-based clustering. WIREs Comput Stat 5:135–148
Melnykov V (2013b) Finite mixture modelling in mass spectrometry analysis. J R Stat Soc Ser C 62:573–592
Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56:1381–1395
Melnykov V, Shen G (2013) Clustering through empirical likelihood ratio. Comput Stat Data Anal 62:1–10
Melnykov V, Chen WC, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25
Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge
Schlattmann P (2009) Medical applications of finite mixture models. Springer, Berlin
Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38:1409–1438
Tortora C, Franczak BC, Browne RP, McNicholas PD (2014) A mixture of coalesced generalized hyperbolic distributions. arXiv:1403.2332
Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244
Zhu X, Melnykov V (2017) ManlyMix: an R package for Manly mixture modeling. R J 9(2):176–197
Acknowledgements
This research was partially funded by an internal research grant from the University of Louisville Office of the Executive Vice President for Research and Innovation (EVPRI).
Cite this article
Melnykov, V., Zhu, X. An extension of the K-means algorithm to clustering skewed data. Comput Stat 34, 373–394 (2019). https://doi.org/10.1007/s00180-018-0821-z