Abstract
This paper focuses on using feature salience to evaluate the quality of a partition in the context of hard clustering. It is based on the hypothesis that a good partition is an easy-to-label partition, i.e. a partition in which each cluster is made of salient features. This approach is compared mainly to the usual approaches relying on distances between data points, but also to more recent approaches based on entropy or stability. We show that our feature-based indexes outperform the compared indexes for optimal model selection: they are more efficient from the low-dimensional to the high-dimensional range, and they are more robust to noise. To demonstrate the efficiency of our indexes on a real-life application, we consider the task of diachronic analysis on a textual dataset. We show that our approach yields interesting and relevant results in that context, whereas the other approaches mostly lead to unusable results.
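To make the underlying hypothesis concrete, a partition can be scored by how strongly each cluster's most characteristic feature stands out from the dataset-wide average. The Python sketch below is only an illustration of this idea under a deliberately simple definition of salience; the indexes defined in the paper are more elaborate.

import numpy as np

def cluster_salience(X, labels, c):
    # Salience of cluster c: the gap between the within-cluster mean and
    # the overall mean for the cluster's most characteristic feature
    # (an illustrative choice, not the paper's definition).
    gap = X[labels == c].mean(axis=0) - X.mean(axis=0)
    return float(gap.max())

def labelability(X, labels):
    # Average cluster salience: a rough proxy for how "easy to label"
    # a partition is.
    return float(np.mean([cluster_salience(X, labels, c)
                          for c in np.unique(labels)]))

# Toy check: a partition aligned with a real feature should look easier
# to label than a random assignment.
rng = np.random.default_rng(0)
X = rng.random((200, 20))
aligned = (X[:, 0] > 0.5).astype(int)
random_ = rng.integers(0, 2, size=200)
print(labelability(X, aligned) > labelability(X, random_))  # True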
Notes
As an example, for the R52 dataset, the EC index computation time was 125 s, as compared to 43,000 s for the Silhouette index, using a standard laptop with a 2.2 GHz quad-core processor and 8 GB of memory.
This new version was recently developed in the context of the ISTEX project. A demonstrator is available online (see https://github.com/nicolasdugue/istex-demonstrateur) and is described in Dugué et al. [36].
Diachronic analysis and feature selection code can be found at https://github.com/nicolasdugue/istex.
References
Liu Y, Li Z, Xiong H, Gao X, Wu J (2010) Understanding of internal clustering validation measures. In: International conference on data mining, pp 911–916
Angel Latha Mary S, Sivagami AN, Usha Rani M (2015) Cluster validity measures for dynamic clustering algorithms. ARPN J Eng Appl Sci 10(9):4009–4012
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
Rendón E, Abundez I, Arizmendi A, Quiroz EM (2011) Internal versus external cluster validation indexes. Int J Comput Commun 5(1):27–34
Kassab R, Lamirel J-C (2008) Feature-based cluster validation for high-dimensional data. In: International conference on artificial intelligence and applications, pp 232–239
Lamirel J-C, Mall R, Cuxac P, Safi G (2011) Variations to incremental growing neural gas algorithm based on label maximization. In: International joint conference on neural networks, pp 956–965
Guerra L, Robles V, Bielza C, Larrañaga P (2012) A comparison of clustering quality indices using outliers and noise. Intell Data Anal 16(4):703–715
Hamerly G, Elkan C (2004) Learning the k in k-means. In: Advances in neural information processing systems, pp 281–288
Bock HH (1996) Probability models and hypothesis testing in partitioning cluster analysis. In: Arabie P, Hubert LJ, De Soete G (eds) Clustering and classification. World Scientific, pp 377–453
Gordon AD (1997) External validation in cluster analysis. Bull Int Stat Inst 51(2):353–356
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2):107–145
Bellman RE (1961) Adaptive control processes. Princeton University Press
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1):90–105
Adolfsson A, Ackerman M, Brownstein NC (2019) To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit 88:13–26
Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104
Davies D, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat Theory Methods 3(1):1–27
Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847
Dudek A (2019) Silhouette index as clustering evaluation tool. In: Conference of the Section on Classification and Data Analysis of the Polish Statistical Association. Springer, pp 19–33
Ünlü R, Xanthopoulos P (2019) Estimating the number of clusters in a dataset via consensus clustering. Expert Syst Appl 125:33–39
Krasnov F, Sen A (2019) The number of topics optimization: clustering approach. Mach Learn Knowl Extr 1(1):416–426
Akhanli SE, Hennig C (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. arXiv preprint arXiv:2002.01822
Kargar M, Isazadeh A, Izadkhah H (2020) New internal metric for software clustering algorithms validity. IET Softw 14:402–410
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1, pp 281–297
Dimitriadou E, Dolnicar S, Weingessel A (2002) An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1):137–159
Pal NR, Biswas J (1997) Cluster validation using graph theoretic concepts. Pattern Recognit 30(6):847–857
Lago-Fernández LF, Corbacho F (2009) Using the negentropy increment to determine the number of clusters. In: International work-conference on artificial neural networks, pp 448–455
Lago-Fernández LF, Aragón J, Martínez-Muñoz G, González AM, Sánchez-Montañés M (2014) Cluster validation in problems with increasing dimensionality and unbalanced clusters. Neurocomputing 123:33–39
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
Ben-Hur A, Elisseeff A, Guyon I (2001) A stability based method for discovering structure in clustered data. Pac Symp Biocomput 7:6–17
Lamirel J-C, Cuxac P, Chivukula AS, Hajlaoui K (2015) Optimizing text classification through efficient feature selection based on quality metric. J Intell Inf Syst 45(3):379–396
Falk I, Gardent C, Lamirel J-C (2012) Classifying French verbs using French and English lexical resources. In: Proceedings of the Association for Computational Linguistics, pp 854–863
Dugué N, Lamirel J-C, Cuxac P (2016) Diachronic'Explorer: keep track of your clusters. In: Research challenges in information science, pp 1–2
Bache K, Lichman M (2013) UCI machine learning repository
Sun L, Korhonen A, Poibeau T, Messiant C (2010) Investigating the cross-linguistic potential of VerbNet-style classification. In: International conference on computational linguistics, pp 1056–1064
Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems, pp 625–632
Schubert E, Sander J, Ester M, Kriegel H-P, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst (TODS) 42(3):1–21
Schubert E, Gertz M (2018) Improving the cluster structure extracted from OPTICS plots. In: LWDA, pp 318–329
van der Merwe DW, Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: The 2003 congress on evolutionary computation (CEC'03), vol 1, pp 215–220
Janani R, Vijayarani S (2019) Text document clustering using spectral clustering algorithm with particle swarm optimization. Expert Syst Appl 134:192–200
Lamirel J-C (2012) A new approach for automatizing the analysis of research topics dynamics: application to optoelectronics research. Scientometrics 93(1):151–166
Pons P, Latapy M (2006) Computing communities in large networks using random walks. J Graph Algorithms Appl 10(2):191–218
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Appendix: Clustering stability and subsampling
In Ben-Hur et al. [33], the method for finding the correct number of clusters is based on subsampling and clustering stability. To evaluate the agreement between partitions computed on distinct samples with the same value of k, the authors propose to calculate the correlation between the pairs of points classified in the same cluster in the different models. Then, by looking at the correlation distribution for each k, one may be able to identify the correct k. Indeed, in the original paper, the correlation distribution is very close to 1 for low values of k, and a dramatic decrease occurs at a particular value \(k_{\mathrm{dec}}\). The optimal value chosen is \(k_{\mathrm{dec}} - 1\), the last value of k whose distribution remains close to 1. Our experiments show that this method does not work on our reference data when a standard clustering algorithm such as K-means is used. Indeed, Figs. 7 and 8 show that the correlation decreases steadily between successive values of k, so that it is impossible to identify a dramatic decrease indicating the optimal k. As stated in the introduction, this may be due to the fact that K-means partitions are too suboptimal, which leads to poor correlations. Moreover, the method requires a large number of runs for each k, which makes it impractical to use more complex algorithms on big datasets.
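The agreement score itself is straightforward to reproduce. The following Python sketch re-implements the experiment with scikit-learn's KMeans and the dot-product correlation between co-membership indicators from Ben-Hur et al. [33]; the subsample fraction and the number of subsample pairs are illustrative choices, not the original protocol.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def co_membership(labels):
    # Boolean matrix: True where two points share a cluster.
    return labels[:, None] == labels[None, :]

def stability_scores(X, k, n_pairs=10, frac=0.8, seed=0):
    # Cluster pairs of random subsamples and measure how similarly
    # their shared points are grouped.
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_pairs):
        i1 = rng.choice(n, int(frac * n), replace=False)
        i2 = rng.choice(n, int(frac * n), replace=False)
        common = np.intersect1d(i1, i2)
        m1 = dict(zip(i1, KMeans(n_clusters=k, n_init=10).fit_predict(X[i1])))
        m2 = dict(zip(i2, KMeans(n_clusters=k, n_init=10).fit_predict(X[i2])))
        a = np.array([m1[i] for i in common])
        b = np.array([m2[i] for i in common])
        upper = np.triu_indices(len(common), 1)
        c1, c2 = co_membership(a)[upper], co_membership(b)[upper]
        # Dot-product correlation of the two co-membership vectors.
        scores.append((c1 & c2).sum() / np.sqrt(c1.sum() * c2.sum()))
    return scores

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 8):
    print(k, round(float(np.mean(stability_scores(X, k))), 3))

On data with well-separated clusters, the mean score typically stays close to 1 up to the true number of clusters and drops beyond it; on harder data, as observed above, the decrease is gradual and no clear \(k_{\mathrm{dec}}\) emerges.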
About this article
Cite this article
Dugué, N., Lamirel, JC. & Chen, Y. Evaluating clustering quality using features salience: a promising approach. Neural Comput & Applic 33, 12939–12956 (2021). https://doi.org/10.1007/s00521-021-05942-7