Abstract
Noise is irrelevant or meaningless data and hinders most types of data analysis. The existing clustering algorithms seldom take the noise points into consideration and cannot detect arbitrary-shaped clusters. This paper presents a Hierarchical Clustering algorithm Based on Noise Removal (HCBNR). It is robust against noise points and good at discovering clusters with arbitrary shapes. In this work, natural neighbor-based density is applied to remove noise points in a data set firstly. Then we construct a saturated neighbor graph on the rest points, and a novel modularity-based graph partitioning algorithm is used to divide the graph into small clusters. Finally, the small clusters are repeatedly merged according to a novel similarity metric between clusters until the desired cluster number is obtained. The experimental results on synthetic data sets and real data sets show that our method can accurately identify noise points and obtain better clustering results than existing clustering algorithms when discovering arbitrary-shaped clusters.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) Lof: identifying density-based local outliers. Acm Sigmod Record 29(2):93–104
Chen WY, Song Y, Bai H, Lin CJ, Chang EY (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell 33(3):568–586
Cheng D, Zhu Q, Huang J, Yang L, Wu Q (2017) Natural neighbor-based clustering algorithm with local representatives. Knowl Based Syst 123C:238–253
Ester M, Kriegel HP, Xu X (1996) A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In: International Conference on Knowledge Discovery and Data Mining, pp 226–231
Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976
Guha S, Rastogi R, Shim K (2000) Rock: a robust clustering algorithm for categorical attributes. Inf Syst 25(5):345–366
Guha S, Rastogi R, Shim K (2001) Cure: an efficient clustering algorithm for large databases. Inf Syst 26(1):35–58
Ha J, Seok S, Lee JS (2014) Robust outlier detection using the instability factor. Knowl Based Syst 63(2):15–23
Huang J, Zhu Q, Yang L, Cheng D, Wu Q (2017) Qcc: a novel clustering algorithm based on quasi-cluster centers. Mach Learn 106(3):337–357
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. Acm Comput Surv 31(3):264–323
Karypis G, Aggarwal R, Kumar V, Shekhar S (2002) Multilevel hypergraph partitioning: applications in vlsi domain. IEEE Trans Very Large Scale Integr Syst 7(1):69–79
Karypis G, Han EH, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer Society Press
Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis. John Wiley, Hoboken
King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101
Lv Y, Ma T, Tang M, Cao J, Tian Y, Al-Rodhaan M (2015) An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 171C:9–22
Macqueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297
Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E Stat Nonlinear Soft Matter Phys . https://doi.org/10.1103/PhysRevE.69.026113
Newman MEJ (2004) Analysis of weighted networks. Phys Rev E Stat Nonlinear Soft Matter Phys 70(5):1–9
Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582
Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks. Science 344(6191):1492
Sneath PH, Sokal RR (1962) Numerical taxonomy. Nature 193:855–860
Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell 24(9):1273–1280
Wang G, Song Q (2016) Automatic clustering via outward statistical testing on density metrics. IEEE Trans Knowl Data Eng 28(8):1971–1985
Wang X, Wang XL, Chen C, Wilkes DM (2013) Enhancing minimum spanning tree-based clustering by removing density-based outliers. Digital Signal Process 23(5):1523–1538
Xie JY, Gao HC, Xie WX, Liu XH, Grant PW (2016) Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Inf Sci 354:19–40
Xiong H, Pandey G, Steinbach M, Kumar V (2006) Enhancing data analysis with noise removal. IEEE Trans Knowl Data Eng 18(3):304–319
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: ACM SIGMOD International Conference on Management of Data, pp 103–114
Zhu Q, Feng J, Huang J (2016) Natural neighbor: a self-adaptive neighborhood method without parameter k. Pattern Recognit Lett 80:30–36
Acknowledgements
This work is supported by National Natural Science Foundation of China, No. 61702060 and No. 61502060, Science and Technology Project of Chongqing Municipal Education Commission, No. KJ15012014 and Project of Chongqing Education Commission, No. KJZH17104.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Cheng, D., Zhu, Q., Huang, J. et al. A hierarchical clustering algorithm based on noise removal. Int. J. Mach. Learn. & Cyber. 10, 1591–1602 (2019). https://doi.org/10.1007/s13042-018-0836-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13042-018-0836-3