Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

An extension of the K-means algorithm to clustering skewed data

  • Original Paper
  • Published:
Computational Statistics Aims and scope Submit manuscript

Abstract

Grouping similar objects into common groups, also known as clustering, is an important problem of unsupervised machine learning. Various clustering algorithms have been proposed in literature. In recent years, the need to analyze large amounts of data has led to reconsidering some fundamental clustering procedures. One of them is the celebrated K-means algorithm popular among practitioners due to its speedy performance and appealingly intuitive construction. Unfortunately, the algorithm often shows poor performance unless data groups have spherical shapes and approximately same sizes. In many applications, this restriction is so severe that the use of the K-means algorithm becomes questionable, misleading, or simply incorrect. We propose an extension of K-means that preserves the speed and intuitive interpretation of the original algorithm while providing greater flexibility in modeling clusters. The idea of the proposed generalization relies on the exponential transformation of Manly originally designed to obtain near-normally distributed data. The suggested modification is derived and illustrated on several datasets with good results.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  • Altman EI (1968) Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4):589–609

    Article  Google Scholar 

  • Anderson E (1935) The Irises of the Gaspe peninsula. Bull Am Iris Soc 59:2–5

    Google Scholar 

  • Andrews DF, Gnanadesikan R, Warner JL (1971) Transformations of multivariate data. Biometrics 27(4):825–840

    Article  Google Scholar 

  • Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

    Article  MathSciNet  MATH  Google Scholar 

  • Box G, Cox DR (1964) An analysis of transformations. J R Stat Soc B 26:211–252

    MATH  Google Scholar 

  • Brock G, Pihur V, Datta S, Datta S (2008) clValid: an R package for cluster validation. J Stat Softw 25(4):1–22

    Article  Google Scholar 

  • Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332

    Article  MathSciNet  MATH  Google Scholar 

  • Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of IEEE symposium on foundations of computer science, New York, pp 633–644

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood for incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38

    MATH  Google Scholar 

  • Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 01 Jan 2017

  • Efron B, Tibshirani R, Storey J, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160

    Article  MathSciNet  MATH  Google Scholar 

  • Fisher RA (1936) The use of multiple measurements in taxonomic poblems. Ann Eugen 7:179–188

    Article  Google Scholar 

  • Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780

    Google Scholar 

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218

    Article  MATH  Google Scholar 

  • Kahraman HT, Sagiroglu S, Colak I (2013) Developing intuitive knowledge classifier and modeling of users’ domain dependent data in web. Knowl Based Syst 37:283–295

    Article  Google Scholar 

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York

    Book  MATH  Google Scholar 

  • Lee SX, McLachlan GJ (2013) Model-based clustering and classification with non-normal mixture distributions. Stat Methods Appl 22(4):427–454

    Article  MathSciNet  MATH  Google Scholar 

  • Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19(2):354–376

    Article  MathSciNet  Google Scholar 

  • Manly BFJ (1976) Exponential data transformations. Biom Unit 25:37–42

    Google Scholar 

  • McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York

    Book  MATH  Google Scholar 

  • Melnykov I, Melnykov V (2014) On k-means algorithm with the use of Mahalanobis distances. Stat Probab Lett 84:88–95

    Article  MathSciNet  MATH  Google Scholar 

  • Melnykov V (2013a) Challenges in model-based clustering. WIREs Comput Stat 5:135–148

    Article  Google Scholar 

  • Melnykov V (2013b) Finite mixture modelling in mass spectrometry analysis. J R Stat Soc Ser C 62:573–592

    Article  MathSciNet  Google Scholar 

  • Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56:1381–1395

    Article  MathSciNet  MATH  Google Scholar 

  • Melnykov V, Shen G (2013) Clustering through empirical likelihood ratio. Comput Stat Data Anal 62:1–10

    Article  MathSciNet  MATH  Google Scholar 

  • Melnykov V, Chen WC, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25

    Article  Google Scholar 

  • Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge

    MATH  Google Scholar 

  • Schlattmann P (2009) Medical applications of finite mixture models. Springer, Berlin

    MATH  Google Scholar 

  • Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38:1409–1438

    Google Scholar 

  • Tortora C, Franczak BC, Browne RP, McNicholas PD (2014) A mixture of coalesced generalized hyperbolic distributions. arXiv:1403.2332

  • Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244

    Article  MathSciNet  Google Scholar 

  • Zhu X, Melnykov V (2017) ManlyMix: an R package for manly mixture modeling. R J 9(2):176–197

Download references

Acknowledgements

The research is partially funded by the University of Louisville EVPRI internal research grant from the Office of the Executive Vice President for Research and Innovation.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Volodymyr Melnykov.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 221 KB)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Melnykov, V., Zhu, X. An extension of the K-means algorithm to clustering skewed data. Comput Stat 34, 373–394 (2019). https://doi.org/10.1007/s00180-018-0821-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00180-018-0821-z

Keywords

Navigation