Abstract
Most commonly, stability analyses are performed using an external validation measure. For example, the Jaccard index is one of the indexes of choice for stability measurement: the index is wrapped around a resampling method to assess the model's stability. Other methods instead use classifiers to look for stable partitions; in these cases, a resampling method is combined with an external index, a classifier-driven error measure, and a clustering algorithm to find stable clustering configurations. In contrast to previous stability-based methods, we propose a novel validation procedure consisting of an internal validation index within a resampling strategy: an index based on the distance between cluster centroids, coupled with a twofold cross-validation resampling approach. Moreover, we use a threshold derived from a null hypothesis to detect meaningful clustering partitions. For our experimental study, we selected the K-means algorithm because of its simplicity but primarily for its instability compared to other algorithms, such as hierarchical methods. Finally, we compare our approach with several well-known validation indexes and discuss the results. Our findings show that our method can not only find meaningful clustering partitions but is also helpful as an unsupervised data analysis tool.
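The centroid-distance stability idea described above can be sketched as follows. This is only an illustration of the general procedure (twofold split, cluster each fold, compare matched centroids), not the authors' exact DStab implementation; the function name and scoring details are hypothetical.

```python
# Sketch: centroid-distance stability under twofold cross-validation.
# Assumptions: K-means as the base clusterer and the Hungarian algorithm
# to match centroids across folds (the paper cites Kuhn's method).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def centroid_stability(X, k, n_splits=2, seed=0):
    """Mean distance between matched centroids of the folds (lower = more stable)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for idx_a, idx_b in kf.split(X):
        # Cluster each fold independently with the same k.
        c1 = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx_a]).cluster_centers_
        c2 = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx_b]).cluster_centers_
        # Pairwise centroid distances, then optimal one-to-one matching.
        cost = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        scores.append(cost[rows, cols].mean())
    return float(np.mean(scores))
```

On well-separated data, centroids recovered from the two folds nearly coincide, so the score approaches zero; an unstable k yields centroids that drift between folds and a larger score.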
Data and code availability
The code repository is available at https://github.com/ar-baya/DStab. Data are available in public repositories.
Funding
This study was funded by CONICET Grant “PIP 11220170100066CO.”
Author information
Contributions
AB: Conceptualization, Methodology, Software, Writing—Original draft preparation. ML: Methodology, Visualization, Investigation, Writing—Original draft preparation—Reviewing and Editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Here we include Tables 4 and 5, which summarize Figs. 7, 8, 9, 10 and 11. Table 4 presents the cluster number estimates for the artificial datasets, and Table 5 the estimates for the real datasets. Both tables have three columns: the dataset name, the estimated number of clusters, and a p-value. To determine candidates, we analyze the relation between DStab and the reference for each k. In our analyses, we view DStab as an observed distribution and the reference as a null hypothesis. We assume that the observed data and the null come from the same distribution but differ in location. Hence, we can use a non-parametric test to estimate whether the median of the DStab distribution is smaller than that of the reference; here, a smaller median means better stability. The results presented in Tables 4 and 5 show the number of clusters k with the lowest p-values after performing a Mann–Whitney U test. Fewer reported solutions mean that the remaining candidates have a p-value above the 0.05 threshold, and a hyphen indicates that no candidate is below the 0.05 threshold. Solutions inside parentheses are equivalent candidates; for example, the Meter D dataset has three identical solutions (2-3-4), so we report a single p-value for them.
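The one-sided test described above can be reproduced in a few lines. The sketch below uses synthetic arrays in place of the actual DStab and reference samples; the variable names and distribution parameters are illustrative assumptions, not values from the paper.

```python
# Illustrative one-sided Mann-Whitney U test, mirroring the appendix
# procedure: a candidate k is kept when the observed DStab median is
# significantly smaller than the reference's (p < 0.05).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
dstab_values = rng.normal(loc=0.8, scale=0.1, size=50)  # observed index (synthetic)
reference = rng.normal(loc=1.0, scale=0.1, size=50)     # null reference (synthetic)

# alternative="less" tests whether DStab values tend to be smaller
# than the reference, i.e. whether stability is better than the null.
stat, p_value = mannwhitneyu(dstab_values, reference, alternative="less")
is_candidate = p_value < 0.05
```

Because the test is non-parametric, it only assumes the two samples share a distributional shape and differ in location, matching the assumption stated above.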
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bayá, A.E., Larese, M.G. DStab: estimating clustering quality by distance stability. Pattern Anal Applic 26, 1463–1479 (2023). https://doi.org/10.1007/s10044-023-01175-7