Abstract
Random restarts of a given clustering algorithm produce many partitions that can be aggregated to yield a consensus clustering. Such ensemble methods have been recognized as more robust approaches to data clustering than single clustering algorithms. We propose the use of determinantal point processes (DPPs) for the random restarts of clustering algorithms that rely on initial sets of center points, such as k-medoids or k-means. The relation between DPPs and kernel-based methods makes DPPs well suited to describing and quantifying similarity between objects. DPPs favor diversity of the center points within initial sets, so that sets containing similar points are less likely to be generated than sets of very distinct points. In contrast, most current initial sets are generated with center points sampled uniformly at random. We show through extensive simulations that, unlike DPP sampling, uniform sampling fails both to ensure diversity and to obtain a good coverage of all data facets, the two key properties behind the good performance of DPPs. Simulations with artificial datasets and applications to real datasets show that determinantal consensus clustering outperforms consensus clusterings based on uniform random sampling of center points.
References
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. ACM SIGMOD Rec 28(2):49–60
Ao SI, Yip K, Ng M, Cheung D, Fong P, Melhado I, Sham P (2005) CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics 21:1735–1736. https://doi.org/10.1093/bioinformatics/bti201
Arthur D, Vassilvitskii S (2007) K-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, USA, SODA ’07. Society for Industrial and Applied Mathematics, pp 1027–1035
Aurenhammer F (1991) Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Comput Surv 23(3):345–405. https://doi.org/10.1145/116873.116880
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803–821
Ben Hough J, Krishnapur M, Peres Y, Virág B (2006) Determinantal processes and independence. Probab Surv [electronic only] 3:206–229
Bicego M, Baldo S (2016) Properties of the Box–Cox transformation for pattern classification. Neurocomputing 218:390–400
Bien J, Tibshirani R (2011) Hierarchical clustering with prototypes via minimax linkage. J Am Stat Assoc 106:1075–1084. https://doi.org/10.1198/jasa.2011.tm10183
Bilodeau M, Nangue AG (2017) Tests of mutual or serial independence of random vectors with applications. J Mach Learn Res 18(1):2518–2557
Blatt M, Wiseman S, Domany E (1996) Superparamagnetic clustering of data. Phys Rev Lett 76:3251–3254
Blatt M, Wiseman S, Domany E (1997) Data clustering using a model granular magnet. Neural Comput 9(8):1805–1842
Borodin A, Olshanski G (2000) Distributions on partitions, point processes, and the hypergeometric kernel. Commun Math Phys 211:335–358. https://doi.org/10.1007/s002200050815. arXiv:math/9904010
Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc B 26:211–252
Budiaji W, Leisch F (2019) Simple k-medoids partitioning algorithm for mixed variable data. Algorithms 12(9):177
Capó M, Pérez A, Lozano JA (2017) An efficient approximation to the k-means clustering for massive data. Knowl-Based Syst 117:56–69
Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210. https://doi.org/10.1016/j.eswa.2012.07.021
Chaudhuri A, Kakde D, Sadek C, Gonzalez L, Kong S (2017) The mean and median criteria for kernel bandwidth selection for support vector data description. In: 2017 IEEE international conference on data mining workshops (ICDMW). IEEE, pp 842–849
Chen D, Xing K, Henson D, Sheng L, Schwartz AM, Cheng X (2009) Developing prognostic systems of cancer patients by ensemble clustering. J Biomed Biotechnol 2009:632786
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press, Boca Raton
Ester M, Kriegel HP, Sander J, Xu X et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96:226–231
Fan Z, Jiang X, Xu B, Jiang Z (2010) An automatic index validity for clustering. In: Tan Y, Shi Y, Tan KC (eds) Advances in swarm intelligence. Springer, Berlin, pp 359–366
Filippone M, Camastra F, Masulli F, Rovetta S (2008) A survey of kernel and spectral methods for clustering. Pattern Recogn 41(1):176–190
Florek K, Łukaszewicz J, Perkal J, Steinhaus H, Zubrzycki S (1951) Sur la liaison et la division des points d’un ensemble fini. Colloq Math 2:282–285
Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768–769
Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588
Fränti P, Sieranoja S (2019) How much can k-means be improved by using better initialization and repeats? Pattern Recogn 93:95–112
Girolami M (2002) Mercer kernel-based clustering in feature space. IEEE Trans Neural Netw 13(3):780–784
Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38:293–306
Hafiz Affandi R, Fox EB, Taskar B (2013) Approximate inference in continuous determinantal point processes. ArXiv e-prints arXiv:1311.2971
Hafiz Affandi R, Fox EB, Adams RP, Taskar B (2014) Learning the parameters of determinantal point process kernels. arXiv e-prints arXiv:1402.4862
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco
Hennig C (2019) Cluster validation by measurement of clustering characteristics relevant to the user, Chap 1. Wiley, Hoboken, pp 1–24
Herbrich R (2001) Learning kernel classifiers: theory and algorithms. MIT Press, Cambridge
Hinneburg A, Keim DA (1999) Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: 25th International conference on very large databases, pp 506–517
Howley T, Madden MG (2006) An evolutionary approach to automatic kernel construction. In: Kollias S, Stafylopatis A, Duch W, Oja E (eds) Artificial Neural Networks—ICANN 2006. Springer, Berlin, pp 417–426
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218. https://doi.org/10.1007/BF01908075
Ibrahim LF, Harbi MHA (2013) Using modified partitioning around medoids clustering technique in mobile network planning. arXiv preprint arXiv:1302.6602
Jain A, Dubes R (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River
Kang B (2013) Fast determinantal point process sampling with application to clustering. In: Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K (eds) Advances in neural information processing systems, vol 26. Curran Associates, Inc., New York, pp 2319–2327
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab—an S4 package for kernel methods in R. J Stat Softw 11(9):1–20
Katsavounidis I, Kuo CCJ, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Process Lett 1(10):144–146
Kaufmann L, Rousseeuw P (1987) Clustering by means of medoids. In: Proceedings of the statistical data analysis based on the L1 norm conference, Neuchatel, Switzerland, vol 31, pp 405–416
Kulesza A, Taskar B (2012) Determinantal point processes for machine learning. Found Trends Mach Learn 5(2–3):123–286. https://doi.org/10.1561/2200000044
Lago-Fernández LF, Corbacho F (2010) Normality-based validation for crisp clustering. Pattern Recogn 43(3):782–795
Lewis D (1997) Reuters-21578 text categorization collection, distribution 1.0. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 296–304
Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28:129–136
Maitra R (2009) Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinf 6(1):144–157
Melnykov V, Chen WC, Maitra R (2012) MixSim: an R package for simulating data to study performance of clustering algorithms. J Stat Softw 51(12):1–25. https://doi.org/10.18637/jss.v051.i12
Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1):91–118
Muñoz J, Murua A (2018) Building cancer prognosis systems with survival function clusters. Stat Anal Data Min ASA Data Sci J 11(3):98–110. https://doi.org/10.1002/sam.11373
Murua A, Wicker N (2014) The conditional-Potts clustering model. J Comput Graph Stat 23(3):717–739
Murua A, Stanberry L, Stuetzle W (2008) On Potts model clustering, kernel k-means, and density estimation. J Comput Graph Stat 17(3):629–658
Ng A, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 14:849–856
Okabe A, Boots B, Sugihara K, Chiu SN (2000) Spatial tessellations: concepts and applications of Voronoi diagrams, 2nd edn. Wiley series in probability and statistics. Wiley, Hoboken
Park HS, Jun CH (2009) A simple and fast algorithm for k-medoids clustering. Expert Syst Appl 36(2):3336–3341
Pena JM, Lozano JA, Larranaga P (1999) An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recogn Lett 20(10):1027–1040
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Saad D (1998) Online algorithms and stochastic approximations. Online Learn 5:3–6
Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
Schölkopf B, Tsuda K, Vert JP (2004) Kernel methods in computational biology. MIT Press, Cambridge
Schubert E, Rousseeuw PJ (2019) Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In: International conference on similarity search and applications. Springer, pp 171–187
Sejdinovic D, Sriperumbudur B, Gretton A, Fukumizu K (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat 41:2263–2291
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
Smyth P (1997) Clustering sequences with hidden Markov models. In: Proceedings of the 9th international conference on neural information processing systems (NIPS 1996). MIT Press, Cambridge, pp 648–654
Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617. https://doi.org/10.1162/153244303321897735
Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20(1):25–47
Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19(2):397–418
Thygesen HH, Zwinderman AH (2004) Comparing transformation methods for DNA microarray data. BMC Bioinform 5(1):77
Vanschoren J, van Rijn JN, Bischl B, Torgo L (2013) OpenML: networked science in machine learning. SIGKDD Explor 15(2):49–60. https://doi.org/10.1145/2641190.2641198
Vapnik VN (1995) The nature of statistical learning theory. Springer, Berlin
Vega-Pons S, Ruiz-Shulcloper J (2011) A survey of clustering ensemble algorithms. Int J Pattern Recogn Artif Intell 25(03):337–372. https://doi.org/10.1142/S0218001411008683
Verleysen M, François D (2005) The curse of dimensionality in data mining and time series prediction. In: International work-conference on artificial neural networks. Springer, pp 758–770
Vert JP, Tsuda K, Schölkopf B (2004) A primer on kernel methods. Kernel Methods Comput Biol 47:35–70
Wang F, Landau DP (2001) Efficient, multiple-range random walk algorithm to calculate the density of states. Phys Rev Lett 86(10):2050
Wang W, Yang J, Muntz R et al (1997) Sting: a statistical information grid approach to spatial data mining. VLDB 97:186–195
Ward JH Jr (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
Watkins C (1999) Dynamic alignment kernels. In: Smola AJ, Bartlett PL, Schölkopf B, Schuurmans D (eds) Advances in large margin classifiers. MIT Press, pp 39–50
Xuan L, Zhigang C, Fan Y (2013) Exploring of clustering algorithm on class-imbalanced data. In: 8th International conference on computer science and education, ICCSE 2013, pp 89–93. https://doi.org/10.1109/ICCSE.2013.6553890
Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17:977–987
Acknowledgements
This work was partially funded by a Discovery grant 2019-05444 from the Natural Sciences and Engineering Research Council of Canada, and by a Fundamental Research Projects grant PRF-2017-20 from the Institute for Data Valorization (IVADO) of Quebec, Canada.
Appendices
Appendix: Proof that an increasing sequence of thresholds gives rise to hierarchical cluster partitions
Consider a given sequence of increasing thresholds \(\theta _1< \theta _2< \cdots < \theta _m\). As mentioned in Sect. 4, the associated consolidated matrices \(C^{(1)}, \ldots , C^{(m)}\) are nested in the sense that, whenever \(s < t\), \(C^{(s)}_{ij} \ge C^{(t)}_{ij}\) for all pairs (i, j), \(s, t=1, \ldots ,m\). Let \(\mathcal {P}_s\) and \(\mathcal {P}_t\) be the connected component partitions associated with the consolidated matrices \(C^{(s)}\) and \(C^{(t)}\), respectively. Let points \(x_i\) and \(x_j\) be in the same connected component of \(\mathcal {P}_t\). That is, there is a path \(x_i \mapsto x_{l_1} \mapsto \cdots \mapsto x_{l_{g-1}} \mapsto x_j\) of points satisfying \(C^{(t)}_{l_h, l_{h+1}} = 1\), \(h=0, \ldots , g-1\), with \(l_0 = i\) and \(l_{g} = j\). Since \(C^{(s)}_{ij} \ge C^{(t)}_{ij}\) for all pairs (i, j), points \(x_i\) and \(x_j\) are necessarily also in the same connected component of \(\mathcal {P}_s\). The converse is not true. To see this, suppose that points \(x_i\) and \(x_j\) are in the same connected component, say \(\mathcal {C}_\ell \), of \(\mathcal {P}_s\). That is, there is a path \(x_i \mapsto x_{l_1} \mapsto \cdots \mapsto x_{l_{g-1}} \mapsto x_j\) of points satisfying \(C^{(s)}_{l_h, l_{h+1}} = 1\), \(h=0, \ldots , g-1\), with \(l_0 = i\) and \(l_{g} = j\). Suppose as well that for some \(h\in \{0, \ldots , g-1\}\), \(C^{(t)}_{l_h, l_{h+1}} = 0\). Then the proportion of times the pair \((x_{l_h}, x_{l_{h+1}})\) appears together is at least \(\theta _s\) but less than \(\theta _t\), so this edge of the path is absent in \(C^{(t)}\). If every path connecting \(x_i\) and \(x_j\) in \(C^{(s)}\) contains such an edge, then \(x_i\) and \(x_j\) lie in different connected components of \(\mathcal {P}_t\). This means that the component \(\mathcal {C}_\ell \in \mathcal {P}_s\) may split into several connected components of \(\mathcal {P}_t\), while components of \(\mathcal {P}_s\) can never merge. In this sense the sequence of thresholds \(\theta _1< \theta _2< \cdots < \theta _m\) leads to a sequence of partitions satisfying \(\mathcal {P}_1 \supseteq \mathcal {P}_2 \supseteq \cdots \supseteq \mathcal {P}_m\), where \(\mathcal {P}_s \supseteq \mathcal {P}_t\) means that each connected component in \(\mathcal {P}_t\) arises from a connected component of \(\mathcal {P}_s\), and the number of connected components of \(\mathcal {P}_t\) is greater than or equal to the number of connected components of \(\mathcal {P}_s\).
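The nesting can be checked directly in code. Below is a minimal sketch (our own illustration, not the authors' implementation) which assumes that the consolidated matrix at threshold \(\theta \) is obtained by thresholding a symmetric matrix P of co-occurrence proportions, \(C_{ij}=1\) if \(P_{ij}\ge \theta \) and 0 otherwise, as in Sect. 4. It extracts the connected component partition by depth-first search; increasing \(\theta \) can then only split components, never merge them.

import numpy as np

def connected_component_partition(C):
    # Label each point with the index of its connected component in the
    # graph whose adjacency matrix is the 0/1 consolidated matrix C.
    n = C.shape[0]
    labels = -np.ones(n, dtype=int)
    comp = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = comp
        stack = [start]
        while stack:                          # depth-first search
            u = stack.pop()
            for v in np.nonzero(C[u])[0]:
                if labels[v] < 0:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    return labels

# Toy illustration with a random symmetric "co-occurrence" matrix P:
rng = np.random.default_rng(0)
A = rng.random((10, 10))
P = (A + A.T) / 2
np.fill_diagonal(P, 1.0)
for theta in (0.3, 0.5, 0.7):                  # increasing thresholds
    print(theta, connected_component_partition(P >= theta))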
Appendix: Scattering formulas in the transformed space: using the kernel trick
Consider the RBF kernel \(\kappa (x_i,x_j)\) defined by (3). The mean scattering induced by the kernel on the data (Vert et al. 2004; Fan et al. 2010) is defined as
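In standard kernel-trick notation (stated here as an assumption about the exact display), with implicit feature map \(\Phi \) satisfying \(\kappa (x_i,x_j)=\langle \Phi (x_i),\Phi (x_j)\rangle \), mean \(\bar{\Phi }=\frac{1}{n}\sum _{j=1}^{n}\Phi (x_j)\), and scattering measured by squared feature-space distances, this quantity reads
\[
\frac{1}{n}\sum _{i=1}^{n}\bigl \Vert \Phi (x_i)-\bar{\Phi }\bigr \Vert ^{2}
\;=\; \frac{1}{n}\sum _{i=1}^{n}\kappa (x_i,x_i)\;-\;\frac{1}{n^{2}}\sum _{i=1}^{n}\sum _{j=1}^{n}\kappa (x_i,x_j)
\;=\; 1-\frac{1}{n^{2}}\sum _{i=1}^{n}\sum _{j=1}^{n}\kappa (x_i,x_j),
\]
the last equality holding because \(\kappa (x_i,x_i)=1\) for the RBF kernel.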
This formula is derived using the kernel trick, which links kernel evaluations with inner products between transformed points in a high-dimensional space. It can be shown that this quantity corresponds to the average distance between the transformed points and the mean of the transformed points. Let \(\mathcal {V}= \{\mathcal {V}_1, \ldots , \mathcal {V}_K\}\) be a clustering configuration with K clusters. Also, let \(n_k\) be the size of cluster \(\mathcal {V}_k\), and let \(J_k = \{ i : x_i \in \mathcal {V}_k\}\), \(k=1,\ldots ,K.\) As with the mean scattering, Fan et al. (2010) define the kernel-induced scattering within each cluster \(\mathcal {V}_k\),
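Under the same convention (again our notation, with squared feature-space distances assumed), with cluster mean \(\bar{\Phi }_k=\frac{1}{n_k}\sum _{i\in J_k}\Phi (x_i)\), this within-cluster scattering can be written as
\[
\frac{1}{n_k}\sum _{i\in J_k}\bigl \Vert \Phi (x_i)-\bar{\Phi }_k\bigr \Vert ^{2}
\;=\; 1-\frac{1}{n_k^{2}}\sum _{i\in J_k}\sum _{j\in J_k}\kappa (x_i,x_j),
\]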
which can be seen as the average distance between the transformed points of cluster \(\mathcal {V}_k\) and the mean of cluster \(\mathcal {V}_k\) in the transformed space. Finally, to measure the total scattering between clusters, one considers the distance between clusters in the transformed space,
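One natural choice, written here by the same kernel trick and stated as an assumption about the exact measure used, is the squared feature-space distance between cluster means,
\[
\bigl \Vert \bar{\Phi }_k-\bar{\Phi }_l\bigr \Vert ^{2}
\;=\; \frac{1}{n_k^{2}}\sum _{i\in J_k}\sum _{j\in J_k}\kappa (x_i,x_j)
\;+\; \frac{1}{n_l^{2}}\sum _{i\in J_l}\sum _{j\in J_l}\kappa (x_i,x_j)
\;-\; \frac{2}{n_k n_l}\sum _{i\in J_k}\sum _{j\in J_l}\kappa (x_i,x_j),
\qquad k\ne l,
\]
which again involves only kernel evaluations.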
Appendix: Scheme to sample \(\varvec{Y}\sim DPP_{\mathcal {S}}(L)\)
Below, \(e_i\) represents the ith standard basis vector, an n-dimensional vector that is all zeros except for a one in the ith position. The span of a given finite collection of vectors V, denoted by \({{\,\mathrm{span}\,}}(V)\), is the set of all linear combinations of the vectors in V, i.e., \({{\,\mathrm{span}\,}}(V)=\left\{ \sum _{v\in V}a_v v \mid a_v\in \mathbb {R}\right\} \).
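The sampler referred to here is the standard spectral scheme for finite DPPs (Hough et al. 2006; Kulesza and Taskar 2012): eigenvectors of the L-ensemble kernel are retained independently with probability \(\lambda _i/(\lambda _i+1)\), and items are then drawn one at a time while the retained subspace is projected onto the complement of the chosen coordinate. A minimal NumPy sketch of that scheme follows; the function name and implementation details (such as the QR re-orthonormalization) are ours and not taken from the article.

import numpy as np

def sample_dpp(L, rng=None):
    # Sample Y ~ DPP(L) for a symmetric positive semi-definite L-ensemble
    # kernel L, using the spectral algorithm of Hough et al. (2006).
    rng = np.random.default_rng(rng)
    n = L.shape[0]
    eigvals, eigvecs = np.linalg.eigh(L)
    # Keep eigenvector i independently with probability lambda_i / (lambda_i + 1).
    keep = rng.random(n) < eigvals / (eigvals + 1.0)
    V = eigvecs[:, keep]          # orthonormal basis of the sampled subspace
    Y = []
    while V.shape[1] > 0:
        # Choose item i with probability proportional to the squared norm
        # of the ith row of V (the ith coordinate of the subspace).
        p = np.sum(V ** 2, axis=1)
        i = rng.choice(n, p=p / p.sum())
        Y.append(i)
        # Project the subspace onto the complement of e_i: eliminate the ith
        # coordinate using one column, drop that column, re-orthonormalize.
        j = np.argmax(np.abs(V[i, :]))
        V = V - np.outer(V[:, j], V[i, :] / V[i, j])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(Y)

In the consensus procedure, L would be the kernel (similarity) matrix over the data points, and each call to sample_dpp would provide one initial set of center points.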
Appendix: Sensitivity analysis on kernel choice
As mentioned in Sect. 5.3, we considered three kernels: the linear kernel, \(\kappa (x_i,x_j)=x_i^Tx_j\), and two variants of the q-stable kernel (Bilodeau and Nangue 2017), also known as the q-Gaussian kernel (Verleysen and François 2005), which is given by \(\kappa (x_i, x_j)=\exp \left( -\left\Vert x_i-x_j\right\Vert ^{q}/\sigma ^q\right) \), \(q\in (0,2]\), where \(\sigma >0\) is the kernel bandwidth. The variants considered are the Laplace kernel (\(q=1\)), and the Gaussian kernel (\(q=2\)).
For each of the two q-stable kernels studied, two bandwidth estimates were considered. For the Laplace kernel, the bandwidth estimates were the mean \(\hat{\sigma }_2\) and the median \(\tilde{\sigma }_2\) of all pairwise Euclidean distances. For the Gaussian kernel, the estimates were the mean \(\hat{\sigma }^2\) and the median \(\tilde{\sigma }^2\) of all pairwise squared Euclidean distances. Note that the Gaussian kernel with bandwidth estimate \(\hat{\sigma }^2\) corresponds to our proposed DPP kernel.
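As a concrete illustration of these bandwidth choices, the sketch below (our own code, not the authors'; the function name is hypothetical) computes the Gram matrix of the q-stable kernel with either the mean or the median of the pairwise \(\left\Vert x_i-x_j\right\Vert ^q\) values as the estimate of \(\sigma ^q\); setting \(q=1\) gives the Laplace kernel and \(q=2\) the Gaussian kernel.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def q_stable_gram(X, q=2.0, bandwidth="mean"):
    # Gram matrix of the q-stable kernel exp(-||x_i - x_j||^q / sigma^q),
    # with sigma^q estimated by the mean or median of the pairwise
    # ||x_i - x_j||^q values (q=1: Laplace kernel, q=2: Gaussian kernel).
    d_q = pdist(X) ** q
    sigma_q = np.mean(d_q) if bandwidth == "mean" else np.median(d_q)
    return np.exp(-squareform(d_q) / sigma_q)

# Example: Gaussian kernel with the mean of squared distances (the proposed
# DPP kernel) versus the median-based variant.
X = np.random.default_rng(1).normal(size=(50, 4))
K_mean = q_stable_gram(X, q=2.0, bandwidth="mean")
K_median = q_stable_gram(X, q=2.0, bandwidth="median")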
We further investigated the behavior of the kernels on data whose clusters are non-Gaussian. For this simulation, we followed the setup of Sect. 5.2. Recall that the data were generated from 24 experimental conditions of a \(3^3\)-factorial design. For this study, we generated five replicas from each experimental condition. The non-ellipsoidal (non-Gaussian) clusters were obtained by applying the inverse Box–Cox transformation to the datasets. As before, we selected \(\lambda =1/3\), \(\lambda =1/2\), and \(\lambda = 2\) for the power parameter \(\lambda \) of the Box–Cox transformations. We ran the consensus DPP algorithm described in Sect. 4 on each dataset with the five different kernels. The values of the consensus DPP parameters were set to \(a=0.5\), \(R=200\), and \(\tau =0.6\). To select the optimal clustering configuration, we used the kernel-based validation index KVI\(_\mathcal {V}\) given by (6). To measure the quality of the selected configuration, we used the adjusted Rand index (see Sect. 4.2). The results are displayed in Fig. 5 in the form of barplots. Each bar represents a different kernel or consensus method. The segments on top of the bars represent ± one standard error. We also compared the results from the different kernels with those yielded by CC-PAM and CC-k-means\(++\).
For both q-stable kernels, estimating the bandwidth parameter with the median or the mean of all pairwise distances does not produce any major difference in the ARI results.
Keywords
- Classification
- Kernel-based validation index
- Kernel k-means
- Partitioning about medoids
- Radial basis function
- Repulsion
- Voronoi diagram