Abstract
Most commonly, stability analyses are performed using an external validation measure. For example, the Jaccard index is one of the indexes of choice for stability measurement: the index is wrapped around a resampling method to assess the model's stability. Other methods instead use classifiers to look for stable partitions; in these cases, a resampling method is combined with an external index, a classifier-driven error measure, and a clustering algorithm to find stable clustering configurations. In contrast to previous stability-based methods, we propose a novel validation procedure consisting of an internal validation index within a resampling strategy: an index based on the distance between cluster centroids, coupled with a twofold cross-validation resampling approach. Moreover, we use a threshold derived from a null hypothesis to detect meaningful clustering partitions. For our experimental study, we selected the K-means algorithm because of its simplicity but primarily for its instability compared to other algorithms, such as hierarchical methods. Finally, we compare our approach with several well-known validation indexes and discuss the results. Our findings show that our method can not only find meaningful clustering partitions but is also helpful as an unsupervised data analysis tool.
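The centroid-distance stability idea described above can be sketched as follows. This is only an illustration of the general procedure (twofold split, cluster each fold, compare matched centroids), not the authors' exact DStab implementation; the function name and scoring details are hypothetical.

```python
# Sketch: centroid-distance stability under twofold cross-validation.
# Assumptions: K-means as the base clusterer and the Hungarian algorithm
# to match centroids across folds (the paper cites Kuhn's method).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def centroid_stability(X, k, n_splits=2, seed=0):
    """Mean distance between matched centroids of the folds (lower = more stable)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for idx_a, idx_b in kf.split(X):
        # Cluster each fold independently with the same k.
        c1 = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx_a]).cluster_centers_
        c2 = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx_b]).cluster_centers_
        # Pairwise centroid distances, then optimal one-to-one matching.
        cost = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        scores.append(cost[rows, cols].mean())
    return float(np.mean(scores))
```

On well-separated data, centroids recovered from the two folds nearly coincide, so the score approaches zero; an unstable k yields centroids that drift between folds and a larger score.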
Data and code availability
The code repository is available at https://github.com/ar-baya/DStab. Data are available in public repositories.
Funding
This study was funded by CONICET Grant “PIP 11220170100066CO.”
Author information
Contributions
AB: Conceptualization, Methodology, Software, Writing—Original draft preparation. ML: Methodology, Visualization, Investigation, Writing—Original draft preparation—Reviewing and Editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Here we include Tables 4 and 5, which summarize Figs. 7, 8, 9, 10 and 11. Table 4 presents the cluster number estimates for the artificial datasets, and Table 5 the estimates for the real datasets. Both tables have three columns: the dataset name, the estimated number of clusters, and a p-value. To determine candidates, we analyze the relation between DStab and the reference for each k. In our analyses, we view DStab as an observed distribution and the reference as a null hypothesis. We assume that the observed data and the null come from the same distribution but differ in location. Hence, we can use a non-parametric test to estimate whether the median of the DStab distribution is smaller than that of the reference; here, a smaller median means better stability. The results presented in Tables 4 and 5 show the number of clusters k with the lowest p-values after performing a Mann–Whitney U test. Fewer reported solutions mean that the remaining candidates have a p-value above the 0.05 threshold, and a hyphen indicates that no candidate is below the 0.05 threshold. Solutions inside parentheses are equivalent candidates; for example, the Meter D dataset has three identical solutions (2-3-4), so we report a single p-value for them.
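The one-sided test described above can be reproduced in a few lines. The sketch below uses synthetic arrays in place of the actual DStab and reference samples; the variable names and distribution parameters are illustrative assumptions, not values from the paper.

```python
# Illustrative one-sided Mann-Whitney U test, mirroring the appendix
# procedure: a candidate k is kept when the observed DStab median is
# significantly smaller than the reference's (p < 0.05).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
dstab_values = rng.normal(loc=0.8, scale=0.1, size=50)  # observed index (synthetic)
reference = rng.normal(loc=1.0, scale=0.1, size=50)     # null reference (synthetic)

# alternative="less" tests whether DStab values tend to be smaller
# than the reference, i.e. whether stability is better than the null.
stat, p_value = mannwhitneyu(dstab_values, reference, alternative="less")
is_candidate = p_value < 0.05
```

Because the test is non-parametric, it only assumes the two samples share a distributional shape and differ in location, matching the assumption stated above.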
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bayá, A.E., Larese, M.G. DStab: estimating clustering quality by distance stability. Pattern Anal Applic 26, 1463–1479 (2023). https://doi.org/10.1007/s10044-023-01175-7