Abstract
This paper focuses on using feature salience to evaluate the quality of a partition in the context of hard clustering. It is based on the hypothesis that a good partition is an easy-to-label partition, i.e. a partition in which each cluster is made of salient features. This approach is compared mainly to the usual approaches relying on distances between data points, but also to more recent approaches based on entropy or stability. We show that our feature-based indexes outperform the compared indexes for optimal model selection: they are more efficient from the low-dimensional to the high-dimensional range, and they are more robust to noise. To demonstrate the efficiency of our indexes on a real-life application, we consider the task of diachronic analysis on a textual dataset. We show that our approach yields interesting and relevant results in that context, whereas the other approaches mostly lead to unusable results.
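To make the underlying hypothesis concrete, a partition can be scored by how strongly each cluster's most characteristic feature stands out from the dataset-wide average. The Python sketch below is only an illustration of this idea under a deliberately simple definition of salience; the indexes defined in the paper are more elaborate.

import numpy as np

def cluster_salience(X, labels, c):
    # Salience of cluster c: the gap between the within-cluster mean and
    # the overall mean for the cluster's most characteristic feature
    # (an illustrative choice, not the paper's definition).
    gap = X[labels == c].mean(axis=0) - X.mean(axis=0)
    return float(gap.max())

def labelability(X, labels):
    # Average cluster salience: a rough proxy for how "easy to label"
    # a partition is.
    return float(np.mean([cluster_salience(X, labels, c)
                          for c in np.unique(labels)]))

# Toy check: a partition aligned with a real feature should look easier
# to label than a random assignment.
rng = np.random.default_rng(0)
X = rng.random((200, 20))
aligned = (X[:, 0] > 0.5).astype(int)
random_ = rng.integers(0, 2, size=200)
print(labelability(X, aligned) > labelability(X, random_))  # True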
Notes
As an example, for the R52 dataset, the EC index computation time was 125 s, as compared to 43,000 s for the Silhouette index, using a standard laptop with a 2.2 GHz quad-core processor and 8 GB of memory.
This new version was recently developed in the context of the ISTEX project. A demonstrator is available online (see https://github.com/nicolasdugue/istex-demonstrateur) and is described in Dugué et al. [36].
Diachronic analysis and feature selection code can be found at https://github.com/nicolasdugue/istex.
References
Liu Y, Li Z, Xiong H, Gao X, Wu J (2010) Understanding of internal clustering validation measures. In: International conference on data mining, pp 911–916
Angel Latha Mary S, Sivagami AN, Usha Rani M (2015) Cluster validity measures for dynamic clustering algorithms. ARPN J Eng Appl Sci 10(9):4009–4012
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
Rendón E, Abundez I, Arizmendi A, Quiroz EM (2011) Internal versus external cluster validation indexes. Int J Comput Commun 5(1):27–34
Kassab R, Lamirel J-C (2008) Feature-based cluster validation for high-dimensional data. In: International conference on artificial intelligence and applications, pp 232–239
Lamirel J-C, Mall R, Cuxac P, Safi G (2011) Variations to incremental growing neural gas algorithm based on label maximization. In: International joint conference on neural networks, pp 956–965
Guerra L, Robles V, Bielza C, Larrañaga P (2012) A comparison of clustering quality indices using outliers and noise. Intell Data Anal 16(4):703–715
Hamerly G, Elkan C (2004) Learning the k in k-means. In: Advances in neural information processing systems, pp 281–288
Bock HH (1996) Probability models and hypothesis testing in partitioning cluster analysis. In: Arabie P, Hubert LJ, De Soete G (eds) Clustering and classification. World Scientific, pp 377–453
Gordon AD (1997) External validation in cluster analysis. Bull Int Stat Inst 51(2):353–356
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2):107–145
Bellman RE (1961) Adaptive control processes. Princeton University Press
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1):90–105
Adolfsson A, Ackerman M, Brownstein NC (2019) To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit 88:13–26
Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104
Davies D, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat Theory Methods 3(1):1–27
Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847
Dudek A (2019) Silhouette index as clustering evaluation tool. In: Conference of the Section on Classification and Data Analysis of the Polish Statistical Association. Springer, pp 19–33
Ünlü R, Xanthopoulos P (2019) Estimating the number of clusters in a dataset via consensus clustering. Expert Syst Appl 125:33–39
Krasnov F, Sen A (2019) The number of topics optimization: clustering approach. Mach Learn Knowl Extr 1(1):416–426
Akhanli SE, Hennig C (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. arXiv preprint arXiv:2002.01822
Kargar M, Isazadeh A, Izadkhah H (2020) New internal metric for software clustering algorithms validity. IET Softw 14:402–410
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1, pp 281–297
Dimitriadou E, Dolnicar S, Weingessel A (2002) An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1):137–159
Pal NR, Biswas J (1997) Cluster validation using graph theoretic concepts. Pattern Recognit 30(6):847–857
Lago-Fernández LF, Corbacho F (2009) Using the negentropy increment to determine the number of clusters. In: International work-conference on artificial neural networks, pp 448–455
Lago-Fernández LF, Aragón J, Martínez-Muñoz G, González AM, Sánchez-Montañés M (2014) Cluster validation in problems with increasing dimensionality and unbalanced clusters. Neurocomputing 123:33–39
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
Ben-Hur A, Elisseeff A, Guyon I (2001) A stability based method for discovering structure in clustered data. Pac Symp Biocomput 7:6–17
Lamirel J-C, Cuxac P, Chivukula AS, Hajlaoui K (2015) Optimizing text classification through efficient feature selection based on quality metric. J Intell Inf Syst 45(3):379–396
Falk I, Gardent C, Lamirel J-C (2012) Classifying French verbs using French and English lexical resources. In: Proceedings of the Association for Computational Linguistics, pp 854–863
Dugué N, Lamirel J-C, Cuxac P (2016) Diachronic'Explorer: keep track of your clusters. In: Research challenges in information science, pp 1–2
Bache K, Lichman M (2013) UCI machine learning repository
Sun L, Korhonen A, Poibeau T, Messiant C (2010) Investigating the cross-linguistic potential of VerbNet-style classification. In: International conference on computational linguistics, pp 1056–1064
Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems, pp 625–632
Schubert E, Sander J, Ester M, Kriegel H-P, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst (TODS) 42(3):1–21
Schubert E, Gertz M (2018) Improving the cluster structure extracted from OPTICS plots. In: LWDA, pp 318–329
van der Merwe DW, Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: The 2003 congress on evolutionary computation (CEC'03), vol 1, pp 215–220
Janani R, Vijayarani S (2019) Text document clustering using spectral clustering algorithm with particle swarm optimization. Expert Syst Appl 134:192–200
Lamirel J-C (2012) A new approach for automatizing the analysis of research topics dynamics: application to optoelectronics research. Scientometrics 93(1):151–166
Pons P, Latapy M (2006) Computing communities in large networks using random walks. J Graph Algorithms Appl 10(2):191–218
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Appendix: Clustering stability and subsampling
In Ben-Hur et al. [33], the method for finding the correct number of clusters is based on subsampling and clustering stability. To evaluate the agreement between partitions computed on distinct samples with the same value of k, the authors propose to calculate the correlation between the pairs of points classified in the same cluster in the different models. Then, by looking at the correlation distribution for each k, one may be able to identify the correct k. Indeed, in the original paper, the correlation distribution is very close to 1 for low values of k, and a dramatic decrease occurs at a particular value \(k_{\mathrm{dec}}\). The optimal value chosen is \(k_{\mathrm{dec}} - 1\), the last value of k whose distribution remains close to 1. Our experiments show that this method does not work on our reference data when a standard clustering algorithm such as K-means is used. Indeed, Figs. 7 and 8 show that the correlation decreases steadily between successive values of k, so that it is impossible to identify a dramatic decrease indicating the optimal k. As stated in the introduction, this may be due to the fact that K-means partitions are too suboptimal, which leads to poor correlations. Moreover, the method requires a large number of runs for each k, which makes it impractical to use more complex algorithms on big datasets.
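The agreement score itself is straightforward to reproduce. The following Python sketch re-implements the experiment with scikit-learn's KMeans and the dot-product correlation between co-membership indicators from Ben-Hur et al. [33]; the subsample fraction and the number of subsample pairs are illustrative choices, not the original protocol.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def co_membership(labels):
    # Boolean matrix: True where two points share a cluster.
    return labels[:, None] == labels[None, :]

def stability_scores(X, k, n_pairs=10, frac=0.8, seed=0):
    # Cluster pairs of random subsamples and measure how similarly
    # their shared points are grouped.
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_pairs):
        i1 = rng.choice(n, int(frac * n), replace=False)
        i2 = rng.choice(n, int(frac * n), replace=False)
        common = np.intersect1d(i1, i2)
        m1 = dict(zip(i1, KMeans(n_clusters=k, n_init=10).fit_predict(X[i1])))
        m2 = dict(zip(i2, KMeans(n_clusters=k, n_init=10).fit_predict(X[i2])))
        a = np.array([m1[i] for i in common])
        b = np.array([m2[i] for i in common])
        upper = np.triu_indices(len(common), 1)
        c1, c2 = co_membership(a)[upper], co_membership(b)[upper]
        # Dot-product correlation of the two co-membership vectors.
        scores.append((c1 & c2).sum() / np.sqrt(c1.sum() * c2.sum()))
    return scores

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 8):
    print(k, round(float(np.mean(stability_scores(X, k))), 3))

On data with well-separated clusters, the mean score typically stays close to 1 up to the true number of clusters and drops beyond it; on harder data, as observed above, the decrease is gradual and no clear \(k_{\mathrm{dec}}\) emerges.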
About this article
Cite this article
Dugué, N., Lamirel, JC. & Chen, Y. Evaluating clustering quality using features salience: a promising approach. Neural Comput & Applic 33, 12939–12956 (2021). https://doi.org/10.1007/s00521-021-05942-7