Abstract
Clustering via marked point processes and influence space, Is-ClusterMPP, is a new unsupervised clustering algorithm through adaptive MCMC sampling of a marked point processes of interacting balls. The designed Gibbs energy cost function makes use of k-influence space information. It detects clusters of different shapes, sizes and unbalanced local densities. It aims at dealing also with high-dimensional datasets. By using the k-influence space, Is-ClusterMPP solves the problem of local heterogeneity in densities and prevents the impact of the global density in the detection of unbalanced classes. This concept reduces also the input values amount. The curse of dimensionality is handled by using a local subspace clustering principal embedded in a weighted similarity metric. Balls covering data points are constituting a configuration sampled from a marked point process (MPP). Due to the choice of the energy function, they tends to cover neighboring data, which share the same cluster. The statistical model of random balls is sampled through a Monte Carlo Markovian dynamical approach. The energy is balancing different goals. (1) The data driven objective function is provided according to k-influence space. Data in a high-dense region are favored to be covered by a ball. (2) An interaction part in the energy prevents the balls full overlap phenomenon and favors connected groups of balls. The algorithm through Markov dynamics, does converge towards configurations sampled from the MPP model. This algorithm has been applied in real benchmarks through gene expression data set of various sizes. Different experiments have been done to compare Is-ClusterMPP against the most well-known clustering algorithms and its efficiency is claimed.
Similar content being viewed by others
References
Adovanovic M, Nanopoulos A, Ivanovic M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531
Alata O, Burg S, Dupas A (2011) Grouping/degrouping point process, a point process driven by geometrical and topological properties of a partition in regions. Comput Vis Image Underst 115(9):1324–1339
Baddeley A, Rubak E, Turner R (2015) Spatial point patterns: methodology and applications with R. Chapman and Hall/CRC. ISBN 9781482210200
Bar-Hen A, Emily M, Picard N (2015) Spatial cluster detection using nearest neighbor distance. Spat Stat 14(Part C):400–411
Bhagat PM, Halgaonkar PS, Wadhai VM (2013) Review of clustering algorithm for categorical data. Int J Eng Adv Technol 32:2249–8958
Bhole P, Agrawal AJ (2014) Extractive based single document text summarization using clustering approach. IAES Int J Artif Intell 3(2):73–78
Bisheng Y, Wenxue X, Zhen D (2013) Automated extraction of building outlines from airborne laser scanning point clouds. Geosci Remote Sens Lett IEEE 10(6):1399–1403
Bohm C, Railing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Fourth IEEE international conference on data mining (ICDM’04), Icdm 04, pP 27–34. ISBN 0-7695-2142-8
Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519
Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 7819 LNAI, pp 160–172. ISBN 9783642374555
Cassisi C, Ferro A, Giugno R, Pigola G, Pulvirenti A (2013) Enhancing density-based clustering: parameter reduction and outlier detection. Inf Syst 38(3):317–330
Chin YC, Baddeley AJ (1999) On connected component Markov point processes. Adv Appl Probab 31(2):279–282
Chiu SN, Stoyan D, Kendall WS, Mecke J (2013) Stochastic geometry and its applications. Wiley, Lodon. ISBN 978-0-470-66481-0
Datar M, Immorlica N, Indyk P (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: SCG’04, June 9–11, 2004, Brooklyn, New York, USA., pp 253–262. ISBN 1581138857
Descombes X, Zerubia J (2002) Marked point processes in image analysis. IEEE Signal Process Mag 19(5):77–84
Descombes X, Zhizhina E (2004) Applications of Gibbs fields methods to image processing problems. Probl Inf Transm 40(3):279–295
Descombes X, Stoica R, Garcin L, Zerubia J (2001) A RJMCMC algorithm for object processes in image processing. Monte Carlo Methods Appl 7(1–2):149–156
Descombes X, Minlos R, Zhizhina E (2009) Object extraction using a Stochastic birth-and-death dynamics in continuum. J Math Imaging Vis 33(3):347–359
El Sonbaty Y, Ismail MA, Farouk M (2004) An efficient density based clustering algorithm for large databases. In: 16th IEEE international conference on tools with artificial intelligence, Ictai, pp 637–677. ISBN 0-7695-2236-X
Elavarasi SA, Akilandeswari J, Sathiyabhama B (2011) A survey on partition clustering algorithms. Learning 1(1):1–14
Elbatta MTH, Ashour WM (2013) A dynamic method for discovering density varied clusters. Int J Signal Process Image Process Pattern Recognit 6(1):123–134
Elgendy N, Elragal A (2014) Big data analytics: a literature review paper. Adv Data Min Appl Theor Asp 8557:214–227
Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster Analysis, 5th edn. Wiley, London
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279
Fahim A, Salem A (2006) Density clustering based on radius of data (DCBRD). Informatika 16:344–349
Flexer A, Schnitzer D (2015) Choosing \(L^p\) norms in high-dimensional spaces based on hub analysis. Neurocomputing 169:281–287
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631
Gaonkar MN, Sawant K (2013) Auto Eps DBSCAN: DBSCAN with Eps automatic for large dataset. Int J Adv Comput Theory Eng 2(2):2319–2526
Geyer CJ (2003) The Metropolis–Hastings–Green algorithm. http://www.stat.umn.edu/geyer/f05/8931/n1998.pdf
Green PJ (1995) Reversible jump Markov chain monte carlo computation and Bayesian model determination. Biometrika 82(4):711–732
Guibas L, Morozov D, Mérigot Q (2013) Witnessed k-distance. Discrete Comput Geom 49(1):22–45
Gul A, Perperoglou A, Khan Z, Mahmoud O, Miftahuddin M, Adler W, Lausen B (2018) Ensemble of a subset of knn classifiers. Adv Data Anal Classif 12(4):827–840
Henni K, Alata O, Alidrissi A, Vannier B, Zaoui L, Moussa A (2017a) Marked point processes for MicroArray data clustering. In: Palumbo F, Montanari A, Vichi M (eds) Data science—innovative developments in data analysis and clustering: studies in classification, data analysis and knowledge organization, pp 125–137. Springer, Berlin. ISBN 978-3-319-55723-6
Henni K, Alata O, Zaoui L, Vannier B, Alidrissi A, Moussa A (2017b) ClusterMPP: an unsupervised density-based clustering algorithm via marked point process. Intell Data Anal 21(4):827–847
Ilango M, Mohan V (2010) A survey of grid based clustering algorithms. Int J Eng Sci Technol 2(8):3441–3446
Kelly FP, Ripley BD (1976) A note on Strauss’s model for clustering. Biometrika 63(63):357–360
Lv Y, Ma T, Tang M, Cao J, Tian Y (2016) An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 171:9–22
Martin E, Hans-Peter K, Jörg S, Xiaowei X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Second international conference on knowledge discovery and data mining, pp 226–231. AAAI Press
Møller J, Waagepetersen RP (2004) Statistical inference and simulation for spatial point processes. Chapman and Hall/CRC, London. ISBN 9781584882657
Montanari A, Calò DG (2013) Model-based clustering of probability density functions. Adv Data Anal Classif 7(3):301–319
Moussa A, Sbihi A, Postaire JG (2008) A Markov random field model for mode detection in cluster analysis. Pattern Recognit Lett 29(9):1197–1207
Murtagh F, Contreras P (2012) Algorithms for hierarchical clustering: an overview. Wiley Interdiscip Rev Data Min Knowl Discov 2(1):86–97
Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016
Ortner M, Descombes X, Zerubia J (2007) Building outline extraction from Digital Elevation Models using marked point processes. Int J Comput Vis 72:107–132
Peng L, Dong Z, Naijun W (2007) VDBSCAN: varied density based spatial clustering of applications with noise. In: Proceedings—ICSSSM’07: 2007 international conference on service systems and service management. ISBN 1424408857
Ram A, Jalal S, Jalal AS, Kumar M (2010) DVBSCAN: a density based algorithm for discovering density varied clusters in large spatial databases. Int J Comput Appl 3(6):1–4
Rehman SU, Asghar S, Fong S, Sarasvady S (2014) DBSCAN: past, present and future. In: The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), pp 232–238
Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks—PPT. Science 344(6191):1492–1496
Samé A, Ambroise C, Govaert G (2007) An online classification EM algorithm based on the mixture model. Stat Comput 17(3):209–218
Stoica RS (2014) Modélisation probabiliste et inférence statistique pour l’analyse des données spatialisées. Research Habilitation Thesis, Université Lille 1
Stoica RS, Descombes X, Zerubia J (2004) A Gibbs point process for road extraction from remotely sensed images. Int J Comput Vis 57(2):121–136
Stoica RS, Gay E, Kretzschmar A (2007) Cluster pattern detection in spatial data based on Monte Carlo inference. Biom J 49(4):505–519
Stoica RS, Martinez VJ, Saar E (2010) Filaments in observed and mock galaxy catalogues. Astron Astrophys 510:1–12
Sun S, Huang R (2010) An adaptive k-nearest neighbor algorithm. In: Proceedings—2010 7th international conference on fuzzy systems and knowledge discovery, FSKD 2010, vol 1, pp 91–94. ISBN 9781424459346
Uncu O, Gruver WA, Kotak DB, Sabaz D, Alibhai Z, Ng C (2007) GRIDBSCAN: GRId density-based spatial clustering of applications with noise. In: Conference Proceedings—IEEE International conference on systems, man and cybernetics, vol 4, pp 2976–2981. ISBN 1424401003
van Lieshout MNM (2000) Markov point processes and their applications. Imperial College Press, London
van Lieshout MNM, Stoica RS (2006) Perfect simulation for marked point processes. Comput Stat Data Anal 51:679–698
Xiaopeng Y, Deyi Z, Yan Z (2005) A new clustering algorithm based on distance and density. In: International conference on services systems and services management, vol 2
Zhang Q, Yang LT, Chen Z, Xia F (2015) A high-order possibilistic-means algorithm for clustering incomplete multimedia data. IEEE Syst J PP(99):1–10
Zhu X, Melnykov V (2015) Probabilistic assessment of model-based clustering. Adv Data Anal Classif 9(4):395–422
Acknowledgements
K. Henni thanks the 2RCT research team at University of Poitiers where this work was initiated and major part was done. Funding through Al Idrisi mobility grant is gratefully acknowledged. K. Henni thanks also Ma Tinghuai for data sharing. The authors thank anonymous referees for very careful reading of the manuscript and comments leading to a strongly improved version of this paper.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Annexe A: IS-ClusterMPP algorithm’s pseudo-code
Here is the pseudo-code transcription of IS-ClusterMPP algorithm. Algorithm 2 is iterated to implement the dynamical evolution. Algorithm 1 is called to implement the jumps: adding to and cancelling balls from the sampled configuration. When Algorithm 2 stops, then Algorithm 3 is called. To affect a cluster label to the remaining not covered data points, Algorithm 4 is run at the end. In Algorithm 1 \(\text {size}({\underline{\omega }}^{i}) >0\) denotes the number of balls in the configuration \({\underline{\omega }}^{i}\). In Algorithm 2,—Jump simulation—means a call to Algorithm 1.
The functions used on the following algorithms are defined as:
-
Not-converged The convergence of the simulation process is verified by the stabilization of the number of balls, i.e, no changes in the configurations size.
-
Extract cc(x) this function identified the set of connected components cc(x).
-
\(x_{j}\)Is covered by \(cc_{i}\) if there is one or more balls \(\omega _{p}(x_{p}, r_{p})\) such that \(dist(x_{j}, x_{p}) \le r_{p}\), this function returns true.
-
Minimal distance(P, NP) This function have as input two observations sets and it returns two observations, one from each input set, which have the smallest distance.
Annexe B: Proof of the energy function local stability
The local stability means, it exists \(\varLambda \in {\mathbb {R}}\), such that for any \({\underline{w}}\) configuration and any \(\xi \) object added to the configuration \({\underline{w}}\), the ratio of the probabilities, which is the conditional intensity, satisfies
Since \(p({\underline{w}}) \propto e^{-U({\underline{\omega }})}\), it is equivalent to it exists \(\varLambda \in {\mathbb {R}}\), such that for any \({\underline{\omega }}\) configuration and any \(\xi \) object added to the configuration \({\underline{\omega }}\),
is uniformly bounded from below.
In the model considered, \(\xi \) denotes a new ball to be added to the configuration \({\underline{\omega }}\),
Since \(|\text {IS}_{k}(x)| \le k\), it holds \(V(\xi )\ge (\frac{1}{k}-k) \ \text {e}\).
It holds
and using the fact that \( |cc ({\underline{\omega }})|\) is at most the number of balls in the configuration, the second and third parts are uniformly bounded from below.
Remark that the (random) number \(N({\underline{\omega }})\) of balls in a configuration \({\underline{\omega }}\), sampled from the chosen model of point process, is Poisson distributed with a parameter which is finite since we consider point processes inside a compact.
Finally remark: \(|{\tilde{n}}({{\underline{\omega }}})| \ge |{\tilde{n}}({{\underline{\omega }}\cup \xi })|\) and, with \(\delta <1\), \( n_{\text {overlap}} ({\underline{\omega }}\cup \xi ) \ge n_{\text {overlap}} ({\underline{\omega }})\).
Rights and permissions
About this article
Cite this article
Henni, K., Louis, PY., Vannier, B. et al. Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data. Adv Data Anal Classif 14, 543–570 (2020). https://doi.org/10.1007/s11634-019-00379-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11634-019-00379-2
Keywords
- Density-based clustering
- Influence space
- Marked point processes
- Spatial data analysis
- Gibbs cost/objective function
- MCMC/Monte Carlo technique
- High dimensional real data sets