Abstract
This article evaluates the efficiency and performance of two clustering algorithms: agglomerative hierarchical clustering (AHC), with various linkage options and distance measures, and the K-Means algorithm. We assess clustering quality using the Davies-Bouldin and Dunn cluster validity indices. Our goal is to compare and analyze outlier detection algorithms depending on the applied clustering algorithm. We also verify whether the quality of clusters without outliers is higher than that of clusters with outliers. We compare the LOF (Local Outlier Factor) and COF (Connectivity-based Outlier Factor) algorithms for detecting outliers, selecting the 1%, 5%, and 10% most outlying instances in a given dataset, and then analyze how clustering quality improves after excluding these outliers. The experiments use three real datasets with different numbers of instances. We investigate whether the choice of clustering algorithm and outlier detection method matters, and whether the clustering parameters affect the obtained clustering results. To the best of our knowledge, no previous study has combined these issues.
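The workflow described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes scikit-learn, NumPy, and SciPy, uses synthetic blob data in place of the three real datasets, fixes the number of clusters at three, and uses Ward linkage and 20 LOF neighbours as arbitrary choices. The helper names `dunn_index`, `drop_top_outliers`, and `evaluate` are hypothetical. COF is not part of scikit-learn and is omitted; the Dunn index is computed by hand in one common form (minimum between-cluster distance divided by maximum cluster diameter).

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler


def dunn_index(X, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diameter = max(cdist(c, c).max() for c in clusters)
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter


def drop_top_outliers(X, fraction, n_neighbors=20):
    """Remove the given fraction of rows with the highest LOF scores."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    scores = -lof.negative_outlier_factor_  # larger means more outlying
    n_drop = int(round(fraction * len(X)))
    keep = np.argsort(scores)[: len(X) - n_drop]
    return X[np.sort(keep)]


def evaluate(X, n_clusters=3):
    """Return {algorithm: (Davies-Bouldin, Dunn)} for K-Means and AHC."""
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "ahc_ward": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        results[name] = (davies_bouldin_score(X, labels), dunn_index(X, labels))
    return results


# Synthetic stand-in data (an assumption): three blobs plus scattered noise points.
rng = np.random.default_rng(0)
blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
noise = rng.uniform(-12, 12, size=(15, 2))
X = StandardScaler().fit_transform(np.vstack([blobs, noise]))

# Score the full data, then the data with 1%, 5%, and 10% of instances removed.
report = {0.0: evaluate(X)}
for fraction in (0.01, 0.05, 0.10):
    report[fraction] = evaluate(drop_top_outliers(X, fraction))

for fraction, scores in report.items():
    for name, (db, dunn) in scores.items():
        print(f"removed={fraction:.0%} {name}: DB={db:.3f} Dunn={dunn:.3f}")
```

Lower Davies-Bouldin values and higher Dunn values indicate more compact, better separated clusters, so comparing the rows for each removal fraction shows whether excluding the flagged instances helped on this particular data.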
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nowak-Brzezińska, A., Gaibei, I. (2022). The Quality of Clustering Data Containing Outliers. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_8
DOI: https://doi.org/10.1007/978-3-031-21967-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21966-5
Online ISBN: 978-3-031-21967-2
eBook Packages: Computer Science, Computer Science (R0)