Nothing Special   »   [go: up one dir, main page]

Skip to main content

Advertisement

Log in

Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data

  • Regular Article
  • Published:
Advances in Data Analysis and Classification Aims and scope Submit manuscript

Abstract

Clustering via marked point processes and influence space, Is-ClusterMPP, is a new unsupervised clustering algorithm through adaptive MCMC sampling of a marked point processes of interacting balls. The designed Gibbs energy cost function makes use of k-influence space information. It detects clusters of different shapes, sizes and unbalanced local densities. It aims at dealing also with high-dimensional datasets. By using the k-influence space, Is-ClusterMPP solves the problem of local heterogeneity in densities and prevents the impact of the global density in the detection of unbalanced classes. This concept reduces also the input values amount. The curse of dimensionality is handled by using a local subspace clustering principal embedded in a weighted similarity metric. Balls covering data points are constituting a configuration sampled from a marked point process (MPP). Due to the choice of the energy function, they tends to cover neighboring data, which share the same cluster. The statistical model of random balls is sampled through a Monte Carlo Markovian dynamical approach. The energy is balancing different goals. (1) The data driven objective function is provided according to k-influence space. Data in a high-dense region are favored to be covered by a ball. (2) An interaction part in the energy prevents the balls full overlap phenomenon and favors connected groups of balls. The algorithm through Markov dynamics, does converge towards configurations sampled from the MPP model. This algorithm has been applied in real benchmarks through gene expression data set of various sizes. Different experiments have been done to compare Is-ClusterMPP against the most well-known clustering algorithms and its efficiency is claimed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

References

  • Adovanovic M, Nanopoulos A, Ivanovic M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531

    MathSciNet  MATH  Google Scholar 

  • Alata O, Burg S, Dupas A (2011) Grouping/degrouping point process, a point process driven by geometrical and topological properties of a partition in regions. Comput Vis Image Underst 115(9):1324–1339

    Google Scholar 

  • Baddeley A, Rubak E, Turner R (2015) Spatial point patterns: methodology and applications with R. Chapman and Hall/CRC. ISBN 9781482210200

  • Bar-Hen A, Emily M, Picard N (2015) Spatial cluster detection using nearest neighbor distance. Spat Stat 14(Part C):400–411

    MathSciNet  Google Scholar 

  • Bhagat PM, Halgaonkar PS, Wadhai VM (2013) Review of clustering algorithm for categorical data. Int J Eng Adv Technol 32:2249–8958

    Google Scholar 

  • Bhole P, Agrawal AJ (2014) Extractive based single document text summarization using clustering approach. IAES Int J Artif Intell 3(2):73–78

    Google Scholar 

  • Bisheng Y, Wenxue X, Zhen D (2013) Automated extraction of building outlines from airborne laser scanning point clouds. Geosci Remote Sens Lett IEEE 10(6):1399–1403

    Google Scholar 

  • Bohm C, Railing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Fourth IEEE international conference on data mining (ICDM’04), Icdm 04, pP 27–34. ISBN 0-7695-2142-8

  • Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519

    MathSciNet  MATH  Google Scholar 

  • Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 7819 LNAI, pp 160–172. ISBN 9783642374555

  • Cassisi C, Ferro A, Giugno R, Pigola G, Pulvirenti A (2013) Enhancing density-based clustering: parameter reduction and outlier detection. Inf Syst 38(3):317–330

    Google Scholar 

  • Chin YC, Baddeley AJ (1999) On connected component Markov point processes. Adv Appl Probab 31(2):279–282

    MathSciNet  MATH  Google Scholar 

  • Chiu SN, Stoyan D, Kendall WS, Mecke J (2013) Stochastic geometry and its applications. Wiley, Lodon. ISBN 978-0-470-66481-0

  • Datar M, Immorlica N, Indyk P (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: SCG’04, June 9–11, 2004, Brooklyn, New York, USA., pp 253–262. ISBN 1581138857

  • Descombes X, Zerubia J (2002) Marked point processes in image analysis. IEEE Signal Process Mag 19(5):77–84

    Google Scholar 

  • Descombes X, Zhizhina E (2004) Applications of Gibbs fields methods to image processing problems. Probl Inf Transm 40(3):279–295

    MathSciNet  MATH  Google Scholar 

  • Descombes X, Stoica R, Garcin L, Zerubia J (2001) A RJMCMC algorithm for object processes in image processing. Monte Carlo Methods Appl 7(1–2):149–156

    MathSciNet  MATH  Google Scholar 

  • Descombes X, Minlos R, Zhizhina E (2009) Object extraction using a Stochastic birth-and-death dynamics in continuum. J Math Imaging Vis 33(3):347–359

    MathSciNet  Google Scholar 

  • El Sonbaty Y, Ismail MA, Farouk M (2004) An efficient density based clustering algorithm for large databases. In: 16th IEEE international conference on tools with artificial intelligence, Ictai, pp 637–677. ISBN 0-7695-2236-X

  • Elavarasi SA, Akilandeswari J, Sathiyabhama B (2011) A survey on partition clustering algorithms. Learning 1(1):1–14

    Google Scholar 

  • Elbatta MTH, Ashour WM (2013) A dynamic method for discovering density varied clusters. Int J Signal Process Image Process Pattern Recognit 6(1):123–134

    Google Scholar 

  • Elgendy N, Elragal A (2014) Big data analytics: a literature review paper. Adv Data Min Appl Theor Asp 8557:214–227

    Google Scholar 

  • Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster Analysis, 5th edn. Wiley, London

    MATH  Google Scholar 

  • Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279

    Google Scholar 

  • Fahim A, Salem A (2006) Density clustering based on radius of data (DCBRD). Informatika 16:344–349

    Google Scholar 

  • Flexer A, Schnitzer D (2015) Choosing \(L^p\) norms in high-dimensional spaces based on hub analysis. Neurocomputing 169:281–287

    Google Scholar 

  • Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631

    MathSciNet  MATH  Google Scholar 

  • Gaonkar MN, Sawant K (2013) Auto Eps DBSCAN: DBSCAN with Eps automatic for large dataset. Int J Adv Comput Theory Eng 2(2):2319–2526

    Google Scholar 

  • Geyer CJ (2003) The Metropolis–Hastings–Green algorithm. http://www.stat.umn.edu/geyer/f05/8931/n1998.pdf

  • Green PJ (1995) Reversible jump Markov chain monte carlo computation and Bayesian model determination. Biometrika 82(4):711–732

    MathSciNet  MATH  Google Scholar 

  • Guibas L, Morozov D, Mérigot Q (2013) Witnessed k-distance. Discrete Comput Geom 49(1):22–45

    MathSciNet  MATH  Google Scholar 

  • Gul A, Perperoglou A, Khan Z, Mahmoud O, Miftahuddin M, Adler W, Lausen B (2018) Ensemble of a subset of knn classifiers. Adv Data Anal Classif 12(4):827–840

    MathSciNet  MATH  Google Scholar 

  • Henni K, Alata O, Alidrissi A, Vannier B, Zaoui L, Moussa A (2017a) Marked point processes for MicroArray data clustering. In: Palumbo F, Montanari A, Vichi M (eds) Data science—innovative developments in data analysis and clustering: studies in classification, data analysis and knowledge organization, pp 125–137. Springer, Berlin. ISBN 978-3-319-55723-6

  • Henni K, Alata O, Zaoui L, Vannier B, Alidrissi A, Moussa A (2017b) ClusterMPP: an unsupervised density-based clustering algorithm via marked point process. Intell Data Anal 21(4):827–847

    Google Scholar 

  • Ilango M, Mohan V (2010) A survey of grid based clustering algorithms. Int J Eng Sci Technol 2(8):3441–3446

    Google Scholar 

  • Kelly FP, Ripley BD (1976) A note on Strauss’s model for clustering. Biometrika 63(63):357–360

    MathSciNet  MATH  Google Scholar 

  • Lv Y, Ma T, Tang M, Cao J, Tian Y (2016) An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 171:9–22

    Google Scholar 

  • Martin E, Hans-Peter K, Jörg S, Xiaowei X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Second international conference on knowledge discovery and data mining, pp 226–231. AAAI Press

  • Møller J, Waagepetersen RP (2004) Statistical inference and simulation for spatial point processes. Chapman and Hall/CRC, London. ISBN 9781584882657

  • Montanari A, Calò DG (2013) Model-based clustering of probability density functions. Adv Data Anal Classif 7(3):301–319

    MathSciNet  MATH  Google Scholar 

  • Moussa A, Sbihi A, Postaire JG (2008) A Markov random field model for mode detection in cluster analysis. Pattern Recognit Lett 29(9):1197–1207

    Google Scholar 

  • Murtagh F, Contreras P (2012) Algorithms for hierarchical clustering: an overview. Wiley Interdiscip Rev Data Min Knowl Discov 2(1):86–97

    Google Scholar 

  • Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016

    Google Scholar 

  • Ortner M, Descombes X, Zerubia J (2007) Building outline extraction from Digital Elevation Models using marked point processes. Int J Comput Vis 72:107–132

    Google Scholar 

  • Peng L, Dong Z, Naijun W (2007) VDBSCAN: varied density based spatial clustering of applications with noise. In: Proceedings—ICSSSM’07: 2007 international conference on service systems and service management. ISBN 1424408857

  • Ram A, Jalal S, Jalal AS, Kumar M (2010) DVBSCAN: a density based algorithm for discovering density varied clusters in large spatial databases. Int J Comput Appl 3(6):1–4

    Google Scholar 

  • Rehman SU, Asghar S, Fong S, Sarasvady S (2014) DBSCAN: past, present and future. In: The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), pp 232–238

  • Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks—PPT. Science 344(6191):1492–1496

    Google Scholar 

  • Samé A, Ambroise C, Govaert G (2007) An online classification EM algorithm based on the mixture model. Stat Comput 17(3):209–218

    MathSciNet  Google Scholar 

  • Stoica RS (2014) Modélisation probabiliste et inférence statistique pour l’analyse des données spatialisées. Research Habilitation Thesis, Université Lille 1

  • Stoica RS, Descombes X, Zerubia J (2004) A Gibbs point process for road extraction from remotely sensed images. Int J Comput Vis 57(2):121–136

    Google Scholar 

  • Stoica RS, Gay E, Kretzschmar A (2007) Cluster pattern detection in spatial data based on Monte Carlo inference. Biom J 49(4):505–519

    MathSciNet  Google Scholar 

  • Stoica RS, Martinez VJ, Saar E (2010) Filaments in observed and mock galaxy catalogues. Astron Astrophys 510:1–12

    Google Scholar 

  • Sun S, Huang R (2010) An adaptive k-nearest neighbor algorithm. In: Proceedings—2010 7th international conference on fuzzy systems and knowledge discovery, FSKD 2010, vol 1, pp 91–94. ISBN 9781424459346

  • Uncu O, Gruver WA, Kotak DB, Sabaz D, Alibhai Z, Ng C (2007) GRIDBSCAN: GRId density-based spatial clustering of applications with noise. In: Conference Proceedings—IEEE International conference on systems, man and cybernetics, vol 4, pp 2976–2981. ISBN 1424401003

  • van Lieshout MNM (2000) Markov point processes and their applications. Imperial College Press, London

    MATH  Google Scholar 

  • van Lieshout MNM, Stoica RS (2006) Perfect simulation for marked point processes. Comput Stat Data Anal 51:679–698

    MathSciNet  MATH  Google Scholar 

  • Xiaopeng Y, Deyi Z, Yan Z (2005) A new clustering algorithm based on distance and density. In: International conference on services systems and services management, vol 2

  • Zhang Q, Yang LT, Chen Z, Xia F (2015) A high-order possibilistic-means algorithm for clustering incomplete multimedia data. IEEE Syst J PP(99):1–10

  • Zhu X, Melnykov V (2015) Probabilistic assessment of model-based clustering. Adv Data Anal Classif 9(4):395–422

    MathSciNet  MATH  Google Scholar 

Download references

Acknowledgements

K. Henni thanks the 2RCT research team at University of Poitiers where this work was initiated and major part was done. Funding through Al Idrisi mobility grant is gratefully acknowledged. K. Henni thanks also Ma Tinghuai for data sharing. The authors thank anonymous referees for very careful reading of the manuscript and comments leading to a strongly improved version of this paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Khadidja Henni.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Annexe A: IS-ClusterMPP algorithm’s pseudo-code

Here is the pseudo-code transcription of IS-ClusterMPP algorithm. Algorithm 2 is iterated to implement the dynamical evolution. Algorithm 1 is called to implement the jumps: adding to and cancelling balls from the sampled configuration. When Algorithm 2 stops, then Algorithm 3 is called. To affect a cluster label to the remaining not covered data points, Algorithm 4 is run at the end. In Algorithm 1 \(\text {size}({\underline{\omega }}^{i}) >0\) denotes the number of balls in the configuration \({\underline{\omega }}^{i}\). In Algorithm 2,—Jump simulation—means a call to Algorithm 1.

The functions used on the following algorithms are defined as:

  • Not-converged The convergence of the simulation process is verified by the stabilization of the number of balls, i.e, no changes in the configurations size.

  • Extract cc(x) this function identified the set of connected components cc(x).

  • \(x_{j}\)Is covered by \(cc_{i}\) if there is one or more balls \(\omega _{p}(x_{p}, r_{p})\) such that \(dist(x_{j}, x_{p}) \le r_{p}\), this function returns true.

  • Minimal distance(PNP) This function have as input two observations sets and it returns two observations, one from each input set, which have the smallest distance.

figure a
figure b
figure c
figure d

Annexe B: Proof of the energy function local stability

The local stability means, it exists \(\varLambda \in {\mathbb {R}}\), such that for any \({\underline{w}}\) configuration and any \(\xi \) object added to the configuration \({\underline{w}}\), the ratio of the probabilities, which is the conditional intensity, satisfies

$$\begin{aligned} \frac{p({\underline{w}} \cup \xi )}{p({\underline{w}})} \le \varLambda . \end{aligned}$$

Since \(p({\underline{w}}) \propto e^{-U({\underline{\omega }})}\), it is equivalent to it exists \(\varLambda \in {\mathbb {R}}\), such that for any \({\underline{\omega }}\) configuration and any \(\xi \) object added to the configuration \({\underline{\omega }}\),

$$\begin{aligned} \varDelta U({\underline{\omega }};\xi ):= U ({\underline{\omega }}\cup \xi ) - U ({\underline{\omega }}) \end{aligned}$$

is uniformly bounded from below.

In the model considered, \(\xi \) denotes a new ball to be added to the configuration \({\underline{\omega }}\),

$$\begin{aligned} \varDelta U({\underline{\omega }};\xi )= & {} V(\xi ) + \left( |cc ({\underline{\omega }})| - |cc ({\underline{\omega }}\cup \xi )| \right) \sum _{\omega _i(x_i,r_i) \in {\underline{\omega }}} \ | \text {IS}_{k}(x_i)| - |cc ({\underline{\omega }}\cup \xi )| \text {IS}_{k}(\xi ) \nonumber \\&+ \log \frac{|{\tilde{n}}({{\underline{\omega }}})|}{|{\tilde{n}}({{\underline{\omega }}\cup \xi })| } - \log \delta \ \frac{n_{\text {overlap}} ({\underline{\omega }}\cup \xi )}{n_{\text {overlap}} ({\underline{\omega }})}. \end{aligned}$$
(12)

Since \(|\text {IS}_{k}(x)| \le k\), it holds \(V(\xi )\ge (\frac{1}{k}-k) \ \text {e}\).

It holds

$$\begin{aligned} -1 \le \left( |cc ({\underline{\omega }})| - |cc ({\underline{\omega }}\cup \xi )| \right) \le |cc ({\underline{\omega }})| -1 \end{aligned}$$

and using the fact that \( |cc ({\underline{\omega }})|\) is at most the number of balls in the configuration, the second and third parts are uniformly bounded from below.

Remark that the (random) number \(N({\underline{\omega }})\) of balls in a configuration \({\underline{\omega }}\), sampled from the chosen model of point process, is Poisson distributed with a parameter which is finite since we consider point processes inside a compact.

Finally remark: \(|{\tilde{n}}({{\underline{\omega }}})| \ge |{\tilde{n}}({{\underline{\omega }}\cup \xi })|\) and, with \(\delta <1\), \( n_{\text {overlap}} ({\underline{\omega }}\cup \xi ) \ge n_{\text {overlap}} ({\underline{\omega }})\).

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Henni, K., Louis, PY., Vannier, B. et al. Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data. Adv Data Anal Classif 14, 543–570 (2020). https://doi.org/10.1007/s11634-019-00379-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11634-019-00379-2

Keywords

Mathematics Subject Classification

Navigation