The Quality of Clustering Data Containing Outliers

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13758))

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

This article evaluates the efficiency and performance of two clustering algorithms: agglomerative hierarchical clustering (AHC), with various linkage options and distance measures, and the K-Means algorithm. We assess clustering quality using the Davies-Bouldin and Dunn cluster validity indices. Our goal is to compare and analyze outlier detection algorithms depending on the clustering algorithm applied. We also verify whether clusters built from data without outliers are of higher quality than clusters built from data with outliers. We compare the LOF (Local Outlier Factor) and COF (Connectivity-based Outlier Factor) algorithms for detecting outliers, selecting the 1%, 5%, and 10% most outlying instances in a given dataset, and then analyze how clustering quality improves after excluding such outliers. The experiments use three real datasets with different numbers of instances. We investigate whether the choice of clustering algorithm and outlier detection method matters, and whether the clustering parameters affect the obtained clustering results. To the best of our knowledge, no previous research combines these issues in a single study.
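The workflow outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes scikit-learn, synthetic data in place of the three real datasets, and only the LOF detector (COF is not part of scikit-learn and is omitted). The helper names `dunn_index`, `drop_top_outliers`, `evaluate`, and `make_demo_data`, as well as the choices of three clusters, 20 neighbours, and Ward linkage, are illustrative assumptions. The Dunn index is computed here in one common form: the smallest distance between points of different clusters divided by the largest cluster diameter.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, pairwise_distances
from sklearn.neighbors import LocalOutlierFactor


def dunn_index(X, labels):
    """Dunn index (higher is better): min separation / max diameter."""
    dist = pairwise_distances(X)
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    # Largest distance between two points of the same cluster.
    max_diameter = max(dist[np.ix_(idx, idx)].max() for idx in clusters)
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist[np.ix_(a, b)].min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter


def drop_top_outliers(X, fraction, n_neighbors=20):
    """Remove the given fraction of rows with the highest LOF scores."""
    n_drop = int(round(fraction * len(X)))
    if n_drop == 0:
        return X
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_  # larger value = more outlying
    keep = np.sort(np.argsort(scores)[: len(X) - n_drop])
    return X[keep]


def evaluate(X, n_clusters=3):
    """Return {algorithm: (Davies-Bouldin, Dunn)} for K-Means and AHC."""
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "ahc_ward": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        results[name] = (davies_bouldin_score(X, labels), dunn_index(X, labels))
    return results


def make_demo_data(seed=0):
    """Synthetic stand-in data: three blobs plus uniformly scattered noise."""
    rng = np.random.default_rng(seed)
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=seed)
    noise = rng.uniform(X.min(axis=0), X.max(axis=0), size=(15, X.shape[1]))
    return np.vstack([X, noise])


if __name__ == "__main__":
    X = make_demo_data()
    for fraction in (0.0, 0.01, 0.05, 0.10):
        X_clean = drop_top_outliers(X, fraction)
        for name, (db, dunn) in evaluate(X_clean).items():
            print(f"removed={fraction:.0%} {name}: DB={db:.3f} Dunn={dunn:.3f}")
```

A lower Davies-Bouldin value and a higher Dunn value both indicate more compact, better separated clusters, so the two columns should be read in opposite directions. Other linkage options can be tried by changing the `linkage` argument (for example `"complete"`, `"average"`, or `"single"`).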

Author information

Authors and Affiliations

Faculty of Science and Technology, Institute of Computer Science, University of Silesia, Bankowa 12, 40-007, Katowice, Poland
Agnieszka Nowak-Brzezińska & Igor Gaibei

Corresponding author

Correspondence to Agnieszka Nowak-Brzezińska.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Vietnam National University, Ho Chi Minh City, Ho Chi Minh City, Vietnam
Tien Khoa Tran
Al-Farabi Kazakh National University, Almaty, Kazakhstan
Ualsher Tukayev
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński
University of Newcastle, Newcastle, NSW, Australia
Edward Szczerbicki

About this paper

Cite this paper

Nowak-Brzezińska, A., Gaibei, I. (2022). The Quality of Clustering Data Containing Outliers. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_8

DOI: https://doi.org/10.1007/978-3-031-21967-2_8
Published: 09 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21966-5
Online ISBN: 978-3-031-21967-2
eBook Packages: Computer Science, Computer Science (R0)
