Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data

Khadidja Henni ORCID: orcid.org/0000-0002-3019-3187¹,
Pierre-Yves Louis²,
Brigitte Vannier³ &
…
Ahmed Moussa⁴

308 Accesses
4 Citations
Explore all metrics

Abstract

Clustering via marked point processes and influence space, Is-ClusterMPP, is a new unsupervised clustering algorithm through adaptive MCMC sampling of a marked point processes of interacting balls. The designed Gibbs energy cost function makes use of k-influence space information. It detects clusters of different shapes, sizes and unbalanced local densities. It aims at dealing also with high-dimensional datasets. By using the k-influence space, Is-ClusterMPP solves the problem of local heterogeneity in densities and prevents the impact of the global density in the detection of unbalanced classes. This concept reduces also the input values amount. The curse of dimensionality is handled by using a local subspace clustering principal embedded in a weighted similarity metric. Balls covering data points are constituting a configuration sampled from a marked point process (MPP). Due to the choice of the energy function, they tends to cover neighboring data, which share the same cluster. The statistical model of random balls is sampled through a Monte Carlo Markovian dynamical approach. The energy is balancing different goals. (1) The data driven objective function is provided according to k-influence space. Data in a high-dense region are favored to be covered by a ball. (2) An interaction part in the energy prevents the balls full overlap phenomenon and favors connected groups of balls. The algorithm through Markov dynamics, does converge towards configurations sampled from the MPP model. This algorithm has been applied in real benchmarks through gene expression data set of various sizes. Different experiments have been done to compare Is-ClusterMPP against the most well-known clustering algorithms and its efficiency is claimed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Marked Point Processes for Microarray Data Clustering

A Comparative Analysis of Clustering and Biclustering Algorithms in Gene Analysis

Clustering by fast search and merge of local density peaks for gene expression microarray data

Article Open access 19 April 2017

References

Adovanovic M, Nanopoulos A, Ivanovic M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531
MathSciNet MATH Google Scholar
Alata O, Burg S, Dupas A (2011) Grouping/degrouping point process, a point process driven by geometrical and topological properties of a partition in regions. Comput Vis Image Underst 115(9):1324–1339
Google Scholar
Baddeley A, Rubak E, Turner R (2015) Spatial point patterns: methodology and applications with R. Chapman and Hall/CRC. ISBN 9781482210200
Bar-Hen A, Emily M, Picard N (2015) Spatial cluster detection using nearest neighbor distance. Spat Stat 14(Part C):400–411
MathSciNet Google Scholar
Bhagat PM, Halgaonkar PS, Wadhai VM (2013) Review of clustering algorithm for categorical data. Int J Eng Adv Technol 32:2249–8958
Google Scholar
Bhole P, Agrawal AJ (2014) Extractive based single document text summarization using clustering approach. IAES Int J Artif Intell 3(2):73–78
Google Scholar
Bisheng Y, Wenxue X, Zhen D (2013) Automated extraction of building outlines from airborne laser scanning point clouds. Geosci Remote Sens Lett IEEE 10(6):1399–1403
Google Scholar
Bohm C, Railing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Fourth IEEE international conference on data mining (ICDM’04), Icdm 04, pP 27–34. ISBN 0-7695-2142-8
Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519
MathSciNet MATH Google Scholar
Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 7819 LNAI, pp 160–172. ISBN 9783642374555
Cassisi C, Ferro A, Giugno R, Pigola G, Pulvirenti A (2013) Enhancing density-based clustering: parameter reduction and outlier detection. Inf Syst 38(3):317–330
Google Scholar
Chin YC, Baddeley AJ (1999) On connected component Markov point processes. Adv Appl Probab 31(2):279–282
MathSciNet MATH Google Scholar
Chiu SN, Stoyan D, Kendall WS, Mecke J (2013) Stochastic geometry and its applications. Wiley, Lodon. ISBN 978-0-470-66481-0
Datar M, Immorlica N, Indyk P (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: SCG’04, June 9–11, 2004, Brooklyn, New York, USA., pp 253–262. ISBN 1581138857
Descombes X, Zerubia J (2002) Marked point processes in image analysis. IEEE Signal Process Mag 19(5):77–84
Google Scholar
Descombes X, Zhizhina E (2004) Applications of Gibbs fields methods to image processing problems. Probl Inf Transm 40(3):279–295
MathSciNet MATH Google Scholar
Descombes X, Stoica R, Garcin L, Zerubia J (2001) A RJMCMC algorithm for object processes in image processing. Monte Carlo Methods Appl 7(1–2):149–156
MathSciNet MATH Google Scholar
Descombes X, Minlos R, Zhizhina E (2009) Object extraction using a Stochastic birth-and-death dynamics in continuum. J Math Imaging Vis 33(3):347–359
MathSciNet Google Scholar
El Sonbaty Y, Ismail MA, Farouk M (2004) An efficient density based clustering algorithm for large databases. In: 16th IEEE international conference on tools with artificial intelligence, Ictai, pp 637–677. ISBN 0-7695-2236-X
Elavarasi SA, Akilandeswari J, Sathiyabhama B (2011) A survey on partition clustering algorithms. Learning 1(1):1–14
Google Scholar
Elbatta MTH, Ashour WM (2013) A dynamic method for discovering density varied clusters. Int J Signal Process Image Process Pattern Recognit 6(1):123–134
Google Scholar
Elgendy N, Elragal A (2014) Big data analytics: a literature review paper. Adv Data Min Appl Theor Asp 8557:214–227
Google Scholar
Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster Analysis, 5th edn. Wiley, London
MATH Google Scholar
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279
Google Scholar
Fahim A, Salem A (2006) Density clustering based on radius of data (DCBRD). Informatika 16:344–349
Google Scholar
Flexer A, Schnitzer D (2015) Choosing $L^p$ norms in high-dimensional spaces based on hub analysis. Neurocomputing 169:281–287
Google Scholar
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631
MathSciNet MATH Google Scholar
Gaonkar MN, Sawant K (2013) Auto Eps DBSCAN: DBSCAN with Eps automatic for large dataset. Int J Adv Comput Theory Eng 2(2):2319–2526
Google Scholar
Geyer CJ (2003) The Metropolis–Hastings–Green algorithm. http://www.stat.umn.edu/geyer/f05/8931/n1998.pdf
Green PJ (1995) Reversible jump Markov chain monte carlo computation and Bayesian model determination. Biometrika 82(4):711–732
MathSciNet MATH Google Scholar
Guibas L, Morozov D, Mérigot Q (2013) Witnessed k-distance. Discrete Comput Geom 49(1):22–45
MathSciNet MATH Google Scholar
Gul A, Perperoglou A, Khan Z, Mahmoud O, Miftahuddin M, Adler W, Lausen B (2018) Ensemble of a subset of knn classifiers. Adv Data Anal Classif 12(4):827–840
MathSciNet MATH Google Scholar
Henni K, Alata O, Alidrissi A, Vannier B, Zaoui L, Moussa A (2017a) Marked point processes for MicroArray data clustering. In: Palumbo F, Montanari A, Vichi M (eds) Data science—innovative developments in data analysis and clustering: studies in classification, data analysis and knowledge organization, pp 125–137. Springer, Berlin. ISBN 978-3-319-55723-6
Henni K, Alata O, Zaoui L, Vannier B, Alidrissi A, Moussa A (2017b) ClusterMPP: an unsupervised density-based clustering algorithm via marked point process. Intell Data Anal 21(4):827–847
Google Scholar
Ilango M, Mohan V (2010) A survey of grid based clustering algorithms. Int J Eng Sci Technol 2(8):3441–3446
Google Scholar
Kelly FP, Ripley BD (1976) A note on Strauss’s model for clustering. Biometrika 63(63):357–360
MathSciNet MATH Google Scholar
Lv Y, Ma T, Tang M, Cao J, Tian Y (2016) An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 171:9–22
Google Scholar
Martin E, Hans-Peter K, Jörg S, Xiaowei X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Second international conference on knowledge discovery and data mining, pp 226–231. AAAI Press
Møller J, Waagepetersen RP (2004) Statistical inference and simulation for spatial point processes. Chapman and Hall/CRC, London. ISBN 9781584882657
Montanari A, Calò DG (2013) Model-based clustering of probability density functions. Adv Data Anal Classif 7(3):301–319
MathSciNet MATH Google Scholar
Moussa A, Sbihi A, Postaire JG (2008) A Markov random field model for mode detection in cluster analysis. Pattern Recognit Lett 29(9):1197–1207
Google Scholar
Murtagh F, Contreras P (2012) Algorithms for hierarchical clustering: an overview. Wiley Interdiscip Rev Data Min Knowl Discov 2(1):86–97
Google Scholar
Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016
Google Scholar
Ortner M, Descombes X, Zerubia J (2007) Building outline extraction from Digital Elevation Models using marked point processes. Int J Comput Vis 72:107–132
Google Scholar
Peng L, Dong Z, Naijun W (2007) VDBSCAN: varied density based spatial clustering of applications with noise. In: Proceedings—ICSSSM’07: 2007 international conference on service systems and service management. ISBN 1424408857
Ram A, Jalal S, Jalal AS, Kumar M (2010) DVBSCAN: a density based algorithm for discovering density varied clusters in large spatial databases. Int J Comput Appl 3(6):1–4
Google Scholar
Rehman SU, Asghar S, Fong S, Sarasvady S (2014) DBSCAN: past, present and future. In: The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), pp 232–238
Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks—PPT. Science 344(6191):1492–1496
Google Scholar
Samé A, Ambroise C, Govaert G (2007) An online classification EM algorithm based on the mixture model. Stat Comput 17(3):209–218
MathSciNet Google Scholar
Stoica RS (2014) Modélisation probabiliste et inférence statistique pour l’analyse des données spatialisées. Research Habilitation Thesis, Université Lille 1
Stoica RS, Descombes X, Zerubia J (2004) A Gibbs point process for road extraction from remotely sensed images. Int J Comput Vis 57(2):121–136
Google Scholar
Stoica RS, Gay E, Kretzschmar A (2007) Cluster pattern detection in spatial data based on Monte Carlo inference. Biom J 49(4):505–519
MathSciNet Google Scholar
Stoica RS, Martinez VJ, Saar E (2010) Filaments in observed and mock galaxy catalogues. Astron Astrophys 510:1–12
Google Scholar
Sun S, Huang R (2010) An adaptive k-nearest neighbor algorithm. In: Proceedings—2010 7th international conference on fuzzy systems and knowledge discovery, FSKD 2010, vol 1, pp 91–94. ISBN 9781424459346
Uncu O, Gruver WA, Kotak DB, Sabaz D, Alibhai Z, Ng C (2007) GRIDBSCAN: GRId density-based spatial clustering of applications with noise. In: Conference Proceedings—IEEE International conference on systems, man and cybernetics, vol 4, pp 2976–2981. ISBN 1424401003
van Lieshout MNM (2000) Markov point processes and their applications. Imperial College Press, London
MATH Google Scholar
van Lieshout MNM, Stoica RS (2006) Perfect simulation for marked point processes. Comput Stat Data Anal 51:679–698
MathSciNet MATH Google Scholar
Xiaopeng Y, Deyi Z, Yan Z (2005) A new clustering algorithm based on distance and density. In: International conference on services systems and services management, vol 2
Zhang Q, Yang LT, Chen Z, Xia F (2015) A high-order possibilistic-means algorithm for clustering incomplete multimedia data. IEEE Syst J PP(99):1–10
Zhu X, Melnykov V (2015) Probabilistic assessment of model-based clustering. Adv Data Anal Classif 9(4):395–422
MathSciNet MATH Google Scholar

Download references

Acknowledgements

K. Henni thanks the 2RCT research team at University of Poitiers where this work was initiated and major part was done. Funding through Al Idrisi mobility grant is gratefully acknowledged. K. Henni thanks also Ma Tinghuai for data sharing. The authors thank anonymous referees for very careful reading of the manuscript and comments leading to a strongly improved version of this paper.

Author information

Authors and Affiliations

Centre de recherche LICEF, Télé-université (TÉLUQ), Montréal, Québec, H2S 3L4, Canada
Khadidja Henni
Laboratoire de Mathématiques et Applications, Université de Poitiers, CNRS, UMR 7348, 11 Boulevard Marie et Pierre Curie, Téléport 2 - BP 30179 86962 Futuroscope de Poitiers, Chasseneuil Cedex, France
Pierre-Yves Louis
Receptors, Regulation and Tumor Cells Laboratory, Université de Poitiers, Bâtiment PBS-B36, 1 rue Georges Bonnet TSA 51106, 86073, Poitiers cedex 09, France
Brigitte Vannier
ENSA-Tangier, Abdelmalek Essaadi University, Tangier, Morocco
Ahmed Moussa

Authors

Khadidja Henni
View author publications
You can also search for this author in PubMed Google Scholar
Pierre-Yves Louis
View author publications
You can also search for this author in PubMed Google Scholar
Brigitte Vannier
View author publications
You can also search for this author in PubMed Google Scholar
Ahmed Moussa
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Khadidja Henni.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Annexe A: IS-ClusterMPP algorithm’s pseudo-code

Here is the pseudo-code transcription of IS-ClusterMPP algorithm. Algorithm 2 is iterated to implement the dynamical evolution. Algorithm 1 is called to implement the jumps: adding to and cancelling balls from the sampled configuration. When Algorithm 2 stops, then Algorithm 3 is called. To affect a cluster label to the remaining not covered data points, Algorithm 4 is run at the end. In Algorithm 1 $\text {size}({\underline{\omega }}^{i}) >0$ denotes the number of balls in the configuration ${\underline{\omega }}^{i}$. In Algorithm 2,—Jump simulation—means a call to Algorithm 1.

The functions used on the following algorithms are defined as:

Not-converged The convergence of the simulation process is verified by the stabilization of the number of balls, i.e, no changes in the configurations size.
Extract cc(x) this function identified the set of connected components cc(x).
$x_{j}$Is covered by $cc_{i}$ if there is one or more balls $\omega _{p}(x_{p}, r_{p})$ such that $dist(x_{j}, x_{p}) \le r_{p}$, this function returns true.
Minimal distance(P, NP) This function have as input two observations sets and it returns two observations, one from each input set, which have the smallest distance.

Annexe B: Proof of the energy function local stability

The local stability means, it exists $\varLambda \in {\mathbb {R}}$, such that for any ${\underline{w}}$ configuration and any $\xi $ object added to the configuration ${\underline{w}}$, the ratio of the probabilities, which is the conditional intensity, satisfies

$$\begin{aligned} \frac{p({\underline{w}} \cup \xi )}{p({\underline{w}})} \le \varLambda . \end{aligned}$$

Since $p({\underline{w}}) \propto e^{-U({\underline{\omega }})}$, it is equivalent to it exists $\varLambda \in {\mathbb {R}}$, such that for any ${\underline{\omega }}$ configuration and any $\xi $ object added to the configuration ${\underline{\omega }}$,

$$\begin{aligned} \varDelta U({\underline{\omega }};\xi ):= U ({\underline{\omega }}\cup \xi ) - U ({\underline{\omega }}) \end{aligned}$$

is uniformly bounded from below.

In the model considered, $\xi $ denotes a new ball to be added to the configuration ${\underline{\omega }}$,

$$\begin{aligned} \varDelta U({\underline{\omega }};\xi )= & {} V(\xi ) + \left( |cc ({\underline{\omega }})| - |cc ({\underline{\omega }}\cup \xi )| \right) \sum _{\omega _i(x_i,r_i) \in {\underline{\omega }}} \ | \text {IS}_{k}(x_i)| - |cc ({\underline{\omega }}\cup \xi )| \text {IS}_{k}(\xi ) \nonumber \\&+ \log \frac{|{\tilde{n}}({{\underline{\omega }}})|}{|{\tilde{n}}({{\underline{\omega }}\cup \xi })| } - \log \delta \ \frac{n_{\text {overlap}} ({\underline{\omega }}\cup \xi )}{n_{\text {overlap}} ({\underline{\omega }})}. \end{aligned}$$

(12)

Since $|\text {IS}_{k}(x)| \le k$, it holds $V(\xi )\ge (\frac{1}{k}-k) \ \text {e}$.

It holds

$$\begin{aligned} -1 \le \left( |cc ({\underline{\omega }})| - |cc ({\underline{\omega }}\cup \xi )| \right) \le |cc ({\underline{\omega }})| -1 \end{aligned}$$

and using the fact that $ |cc ({\underline{\omega }})|$ is at most the number of balls in the configuration, the second and third parts are uniformly bounded from below.

Remark that the (random) number $N({\underline{\omega }})$ of balls in a configuration ${\underline{\omega }}$, sampled from the chosen model of point process, is Poisson distributed with a parameter which is finite since we consider point processes inside a compact.

Finally remark: $|{\tilde{n}}({{\underline{\omega }}})| \ge |{\tilde{n}}({{\underline{\omega }}\cup \xi })|$ and, with $\delta <1$, $ n_{\text {overlap}} ({\underline{\omega }}\cup \xi ) \ge n_{\text {overlap}} ({\underline{\omega }})$.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Henni, K., Louis, PY., Vannier, B. et al. Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data. Adv Data Anal Classif 14, 543–570 (2020). https://doi.org/10.1007/s11634-019-00379-2

Download citation

Received: 29 March 2018
Revised: 18 September 2019
Accepted: 19 November 2019
Published: 27 November 2019
Issue Date: September 2020
DOI: https://doi.org/10.1007/s11634-019-00379-2