
Clustering Stability: An Overview

Published: 01 March 2010

Abstract

A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this monograph we give a high-level overview of the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.



Reviews

Paul J Kennedy

Choosing the correct number of clusters when using unsupervised machine learning approaches has always been problematic. Von Luxburg presents an accessible introduction to important research in one approach: choosing the number of clusters by the stability of the resulting clusterings. This approach assumes that several datasets can be sampled from the same underlying distribution of the data to be clustered: a stable clustering produces similar clusters for each of these datasets. Von Luxburg reviews recent work on selecting the number of clusters using cluster stability, focusing mainly on the popular k-means clustering algorithm, and introduces this quite technical area by sketching key proofs in nontechnical language. Consequently, it is a good primer for data mining practitioners and researchers interested in joining the area. As it highlights several areas lacking in current research, it is particularly useful for beginning doctoral students.

After a short philosophical introduction to the approach, von Luxburg defines clustering stability scores and shows how to calculate them. The bulk of the paper addresses stability analysis of k-means clustering and gives the main result: if the selected number of clusters is correct, then the clustering is stable; if the number is too large, then the clustering is unstable; and if the number of clusters is too small, then the clustering may be either stable or unstable, depending on the underlying data distribution. The final sections provide an outlook for the research area and show that, while some of the theoretical results carry over to other clustering algorithms, much work remains to be done in extending stability analysis.

Online Computing Reviews Service
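The protocol the review describes can be sketched in a few lines: draw overlapping random subsamples of the data, cluster each, and average how much the resulting partitions agree on the points the subsamples share. The sketch below is illustrative only, assuming a basic Lloyd-style k-means and the plain (unadjusted) Rand index as the agreement measure; the function names and the 80% subsample size are hypothetical choices for this example, not taken from the monograph.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means; returns a label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree
    (both together or both apart); invariant to label permutation."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return (same_a == same_b)[iu].mean()

def stability_score(X, k, n_rounds=20, seed=0):
    """Average pairwise agreement of k-means clusterings over random
    80% subsamples, measured on the points the two subsamples share."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for r in range(n_rounds):
        s1 = rng.choice(n, size=int(0.8 * n), replace=False)
        s2 = rng.choice(n, size=int(0.8 * n), replace=False)
        shared = np.intersect1d(s1, s2)
        l1 = kmeans(X[s1], k, seed=2 * r)
        l2 = kmeans(X[s2], k, seed=2 * r + 1)
        # map shared point indices back to positions within each subsample
        pos1 = {v: i for i, v in enumerate(s1)}
        pos2 = {v: i for i, v in enumerate(s2)}
        a = np.array([l1[pos1[v]] for v in shared])
        b = np.array([l2[pos2[v]] for v in shared])
        scores.append(rand_index(a, b))
    return float(np.mean(scores))
```

To select the number of clusters as described in the abstract, one would evaluate `stability_score(X, k)` for each candidate k and prefer the values whose clusterings are most stable. Note the main result quoted in the review, though: a too-small k can also appear stable, so a high raw score alone is not conclusive.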



Published In

Foundations and Trends® in Machine Learning, Volume 2, Issue 3
March 2010
41 pages
ISSN: 1935-8237
EISSN: 1935-8245

Publisher

Now Publishers Inc.

Hanover, MA, United States



