
Clustering Stability: An Overview

Published: 01 March 2010

Abstract

A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this monograph we give a high-level overview of the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.



Reviews

Paul J Kennedy

Choosing the correct number of clusters when using unsupervised machine learning approaches has always been problematic. Von Luxburg presents an accessible introduction to important research in one approach: choosing the number of clusters by the stability of the resulting clusterings. This approach assumes that several datasets can be sampled from the same underlying distribution of the data to be clustered: a stable clustering produces similar clusters for each of these datasets. Von Luxburg reviews recent work on selecting the number of clusters using cluster stability, focusing mainly on the popular k-means clustering algorithm, and introduces this quite technical area by sketching key proofs in nontechnical language. Consequently, it is a good primer for data mining practitioners and researchers interested in joining the area. As it highlights several areas lacking in current research, it is particularly useful for beginning doctoral students.

After a short philosophical introduction to the approach, von Luxburg defines clustering stability scores and shows how to calculate them. The bulk of the paper addresses stability analysis of k-means clustering and gives the main result: if the selected number of clusters is correct, then the clustering is stable; if the number is too large, then the clustering is unstable; and if the number of clusters is too small, then the clustering may be either stable or unstable, depending on the underlying data distribution. The final sections provide an outlook for the research area and show that, while some of the theoretical results carry over to other clustering algorithms, much work remains to be done in extending stability analysis.

Online Computing Reviews Service
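The protocol the review describes can be sketched in a few lines: draw overlapping random subsamples of the data, cluster each, and average how much the resulting partitions agree on the points the subsamples share. The sketch below is illustrative only, assuming a basic Lloyd-style k-means and the plain (unadjusted) Rand index as the agreement measure; the function names and the 80% subsample size are hypothetical choices for this example, not taken from the monograph.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means; returns a label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree
    (both together or both apart); invariant to label permutation."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return (same_a == same_b)[iu].mean()

def stability_score(X, k, n_rounds=20, seed=0):
    """Average pairwise agreement of k-means clusterings over random
    80% subsamples, measured on the points the two subsamples share."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for r in range(n_rounds):
        s1 = rng.choice(n, size=int(0.8 * n), replace=False)
        s2 = rng.choice(n, size=int(0.8 * n), replace=False)
        shared = np.intersect1d(s1, s2)
        l1 = kmeans(X[s1], k, seed=2 * r)
        l2 = kmeans(X[s2], k, seed=2 * r + 1)
        # map shared point indices back to positions within each subsample
        pos1 = {v: i for i, v in enumerate(s1)}
        pos2 = {v: i for i, v in enumerate(s2)}
        a = np.array([l1[pos1[v]] for v in shared])
        b = np.array([l2[pos2[v]] for v in shared])
        scores.append(rand_index(a, b))
    return float(np.mean(scores))
```

To select the number of clusters as described in the abstract, one would evaluate `stability_score(X, k)` for each candidate k and prefer the values whose clusterings are most stable. Note the main result quoted in the review, though: a too-small k can also appear stable, so a high raw score alone is not conclusive.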



Published In

Foundations and Trends® in Machine Learning, Volume 2, Issue 3
March 2010
41 pages
ISSN: 1935-8237
EISSN: 1935-8245

Publisher

Now Publishers Inc.

Hanover, MA, United States



