The k-means clustering technique: General considerations and implementation in Mathematica

L Morissette, S Chartier. Tutorials in Quantitative Methods for Psychology, 2013.
Data clustering techniques are valuable tools for researchers working with large databases of multivariate data. In this tutorial, we present a simple yet powerful one, the k-means clustering technique, through three different algorithms: the Forgy/Lloyd algorithm, the MacQueen algorithm and the Hartigan & Wong algorithm. We then present an implementation in Mathematica, along with various examples of the available options, to illustrate the application of the technique.
Data clustering techniques are descriptive data analysis techniques that can be applied to multivariate data sets to uncover the structure present in the data. They are particularly useful when classical second-order statistics (the sample mean and covariance) cannot be used. In exploratory data analysis, one of the assumptions is that no prior knowledge about the dataset, and therefore about the dataset’s distribution, is available. In such a situation, data clustering can be a valuable tool. Data clustering is a form of unsupervised classification: the clusters are formed by evaluating similarities and dissimilarities of intrinsic characteristics between cases, and the grouping of cases is based on those emergent similarities rather than on an external criterion. These techniques are also useful for datasets with more than three dimensions, since it is very difficult for humans to compare items of such complexity reliably without an aid to support the comparison.
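To make the idea of grouping cases by similarity concrete, the following is a minimal sketch of the batch (Forgy/Lloyd-style) variant of k-means as it is commonly described: assign every case to its nearest centroid, then recompute each centroid as the mean of its cases, and repeat until the assignments stop changing. It is written in Python rather than Mathematica and is not the implementation presented in the tutorial. The function names (`kmeans_lloyd`, `squared_distance`, `mean_point`), the parameters and the toy data are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two cases."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of cases."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_lloyd(data, k, init=None, max_iter=100, seed=0):
    """Hypothetical helper: batch k-means on a list of numeric tuples."""
    if init is None:
        # Forgy-style initialisation: use k randomly chosen cases as centroids.
        init = random.Random(seed).sample(data, k)
    centroids = [tuple(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:
            break  # no case changed cluster, so the partition is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data (assumed): two visibly separated groups in two dimensions.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans_lloyd(data, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are drawn at random, the cluster numbering (and, on less clearly separated data, the final partition) can depend on `seed`. Passing explicit starting points through `init` makes a run reproducible.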