The experiments aimed to check the impact of the clustering algorithms, clustering methods, and selected distance measures on the effectiveness of outlier detection, measured by how the cluster quality assessment indexes respond to the removal of outliers from the dataset. We wanted to see whether the clustering algorithms and the outlier detection algorithms contributed similarly to improving the quality of clusters after detecting and removing outliers. We performed experiments on three different real datasets. We varied the number of detected outliers three times, using 1%, 5%, and 10% of the entire dataset as the number of outliers, in order to recognize the differences in the results. Our goal was to answer the question "is it true that the more outliers we discover, the better the quality of clusters without the selected outliers will be?". In other words, we assume that if we discover outliers first and then cluster the data excluding them, the quality of such clusters will be better than if we cluster the data including outliers. We analyzed two indexes for cluster quality assessment: the Dunn index and the Davies–Bouldin index. We measured the quality of clusters for the original dataset, in which potential deviations may occur. Then we looked for outliers and omitted them in the clustering process. In this way, we can compare the quality of clusters before and after removing outliers. An improvement in cluster quality occurs if, after removing outliers, the quality measured by the Dunn index increases compared to the quality of clusters in which the outliers were not omitted from the clustering process. The opposite holds for the Davies–Bouldin index: quality improvement occurs when the quality of clusters measured by the Davies–Bouldin index decreases after removing outliers. Based on the experiments' results, we can count in how many cases the quality of the clusters improved after removing the outliers (i.e., the Dunn index increased, and the Davies–Bouldin index decreased).
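The logic of this before/after comparison can be illustrated with a minimal sketch (not the authors' code: k-means stands in for either clustering algorithm, the outlier flags are simulated, and the Dunn index, which scikit-learn does not provide, is implemented directly):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score, pairwise_distances

    def dunn_index(X, labels):
        # Dunn index: minimum inter-cluster distance / maximum cluster diameter
        d = pairwise_distances(X)
        ids = np.unique(labels)
        min_inter, max_diam = np.inf, 0.0
        for i in ids:
            m_i = labels == i
            max_diam = max(max_diam, d[np.ix_(m_i, m_i)].max())
            for j in ids[ids > i]:
                min_inter = min(min_inter, d[np.ix_(m_i, labels == j)].min())
        return min_inter / max_diam

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))   # stand-in for one of the datasets
    X_clean = X[:180]               # pretend the last 20 rows were flagged as outliers

    # Quality of clusters built on the full data (outliers included)
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    dunn_before = dunn_index(X, labels)
    db_before = davies_bouldin_score(X, labels)

    # Quality of clusters rebuilt after removing the flagged outliers
    labels_c = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_clean)
    improved = (dunn_index(X_clean, labels_c) > dunn_before and        # Dunn should rise
                davies_bouldin_score(X_clean, labels_c) < db_before)   # Davies-Bouldin should fall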
5.1. Data Description
The source of the databases is the UCI Machine Learning Repository [16], a collection of databases and data generators used by the machine learning community to analyze machine learning algorithms empirically. A brief description of each dataset is shown in Table 1. The selected databases differ in the number of instances, the number of attributes, and the attribute types.
A (Absenteeism at work Dataset) is a database created from absenteeism records collected from July 2007 to July 2010 at a courier company in Brazil. The set contains 740 instances, each consisting of 21 numeric (Integer, Real) attributes [17].
B (Shill Bidding Dataset) contains information about bidders, auctions, bids, prices, and auction duration. This dataset contains 6321 instances, each consisting of 13 attributes [18].
C (MoCap Hand Postures Dataset) is a dataset containing 78,095 instances, each consisting of 38 numeric (Integer, Real) attributes [19]. A Vicon motion capture camera system was used to record 12 users performing five hand gestures with markers attached to a left-handed glove.
It is worth mentioning that dataset B was originally of mixed type: only one feature was qualitative, but this feature had only one value, so we decided to exclude it from this analysis. Therefore, all datasets were ultimately numeric. Clustering qualitative data will be the subject of our future research.
5.2. Methodology
The purpose of the experiments was to compare various clustering methods, clustering algorithms, and distance measures, making it possible to determine how changes in these parameters affect the final clustering results and how much outlier detection improves the quality of clustering. The steps involved in the experiments are described below. For each of the three datasets, the following experiments were carried out:
Loading the dataset and preparing it correctly before applying clustering algorithms: preprocessing the data using standardization, normalization, etc.;
Data clustering using two different algorithms, agglomerative hierarchical clustering and k-means, with various numbers of clusters, different clustering methods (single, complete, average linkage), and two different ways of measuring distance (Euclidean and Chebyshev). The tests were carried out with different numbers of clusters k: iteratively, starting from the first step and increasing the i-th parameter by one at each step, the number of clusters k is recalculated until a dataset-dependent stop criterion is satisfied (a condensed sketch of this clustering grid is given after this list);
Assessing the clustering quality using the Dunn and the Davies–Bouldin indexes;
Finding 1%, 5%, and 10% of the entire dataset as outliers using the two selected outlier detection algorithms, removing the selected outliers, reclustering, and recalculating the quality of the clusters.
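The clustering grid referred to above can be sketched as follows; this is a condensed illustration under stated assumptions, not the experimental code (the data and the k schedule are placeholders, the Dunn index is omitted for brevity, and affinity was the scikit-learn parameter name in the Python 3.8-era version used here, renamed metric in newer releases):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.metrics import davies_bouldin_score

    raw = np.random.default_rng(1).normal(size=(300, 8))   # stand-in for a loaded dataset
    X = StandardScaler().fit_transform(raw)                # preprocessing (standardization)

    for k in (19, 12, 5):                                  # example k schedule for dataset A
        # 6 agglomerative variants: 3 linkage methods x 2 distance measures
        for linkage in ("single", "complete", "average"):
            for affinity in ("euclidean", "chebyshev"):
                labels = AgglomerativeClustering(n_clusters=k, linkage=linkage,
                                                 affinity=affinity).fit_predict(X)
                print(k, linkage, affinity, davies_bouldin_score(X, labels))
        # plus one k-means run per k value
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, "k-means", davies_bouldin_score(X, labels))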
In total, we performed 686 experiments, which are presented in this paper. The number of 686 experiments comes from the following calculation:
For example, in the case of the A dataset, the calculation of k proceeds as follows:
- For i = 1, we obtain k = 19;
- For i = 2, we obtain k = 12;
- For i = 3, we obtain k = 5;
- For i = 4, we cannot continue the process of calculating k values, because the stop criterion is met.
This solution allows us to check different values of the k parameter, adapted to the size of the input dataset.
The number of experiments is 686, as there are 8 versions of the k parameter for the A dataset, 4 versions for the B dataset, and 2 versions for the C dataset. This gives 14 versions, and we repeat each of them for the 6 different variants of the agglomerative algorithm (three linkage methods combined with two distance measures) and 1 version of the k-means algorithm. Adding all these combinations together, we reach 14 × 7 = 98 clustering experiments.
Running each of these 98 configurations with the two outlier detection algorithms for each of the three different variants of the number of outliers (1%, 5%, and 10%), together with the baseline runs in which no outliers are removed, we obtain the final number of experiments equal to 98 × (2 × 3 + 1) = 686.
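The count can be verified directly; note that the factor of 7 (98 × 7 = 686) is consistent with two detectors and three percentages only if the baseline runs without outlier removal are included, which is our reading of the calculation:

    k_versions = 8 + 4 + 2            # versions of k for datasets A, B, and C
    clusterers = 6 + 1                # 6 agglomerative variants + 1 k-means version
    base = k_versions * clusterers    # 98 clustering configurations
    total = base * (1 + 2 * 3)        # baseline + 2 detectors x 3 outlier percentages
    assert base == 98 and total == 686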
Every experiment records the values of the Dunn and Davies–Bouldin clustering quality indexes, which are essential for comparing the clusters before and after excluding potential outliers from a given dataset.
5.3. Experimental Environment
To analyze the clustering algorithms before and after removing the outliers, the Spyder programming environment (Python 3.8) was used, as well as the following libraries: Pandas for data processing and analysis [20]; NumPy to perform basic operations on n-dimensional arrays and matrices: addition, subtraction, division, multiplication, transposition, calculating the determinant, etc. [21]; PyCaret to prepare the data for modeling, create an unsupervised anomaly detector, and prepare the model for predictions on unseen data [22]; and Scikit-learn, one of the most widely used Python packages for data science and machine learning, which supports many operations and provides a great variety of algorithms for data processing, dimensionality reduction, model selection, regression, classification, and cluster analysis [23].
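As an illustration of the PyCaret workflow mentioned above, the following sketch uses PyCaret's functional anomaly-detection API from the 2.x releases; the file name and the "knn" detector are illustrative placeholders, not necessarily the models used in the experiments:

    import pandas as pd
    from pycaret.anomaly import setup, create_model, assign_model

    df = pd.read_csv("dataset_A.csv")               # hypothetical file name
    setup(data=df, session_id=1, silent=True)       # prepare the data for modeling (PyCaret 2.x)
    detector = create_model("knn", fraction=0.05)   # flag 5% of the rows as outliers
    flagged = assign_model(detector)                # adds Anomaly and Anomaly_Score columns
    df_clean = flagged[flagged["Anomaly"] == 0]     # keep only the non-outlier rows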
The algorithms described in Section 3 and Section 4 have been implemented in Python and tested on the datasets described herein. We used Python 3.8 and the Anaconda distribution in this work, which includes many of the libraries required to run machine learning models, perform data mining, and output data in various formats. Existing Scikit-learn library models were used to implement the agglomerative and k-means clustering algorithms as well as the Dunn and Davies–Bouldin indexes, and the PyCaret library was used to implement the outlier detection algorithms. The program operates in the following way:
Importing the Python analytical libraries Scikit-learn, NumPy, Pandas, and PyCaret, as well as libraries to perform time-related operations.
Implementation of the algorithms:
- (a) agglomerative hierarchical clustering, with parameters: k denoting the selected number of clusters, linkage denoting the type of linkage used in clustering, and affinity denoting the distance measure;
- (b) k-means (kmeans), with a k parameter denoting the selected number of clusters;
- (c) the Dunn and Davies–Bouldin index algorithms;
- (d) the two outlier detection algorithms, with a percent parameter denoting the percentage of removed outliers.
Data preparation functions (a rough sketch of these helpers is given after this list):
- (a) a function to replace the missing values with other values dynamically;
- (b) a function to replace Null values in a Pandas data frame;
- (c) a function to normalize and standardize values in the data frame.
Uploading and reading all three datasets.
Execution of the two clustering algorithms, the two outlier detection algorithms, and the Dunn and Davies–Bouldin index algorithms on the datasets.
Transferring the results to an Excel file.
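The data preparation functions listed above could look roughly as follows; the function names and default strategies are placeholders, since the original identifiers are not reproduced in the text:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
        # Replace missing values dynamically, e.g. with each numeric column's mean
        return df.fillna(df.mean(numeric_only=True))

    def replace_nulls(df: pd.DataFrame, value=0) -> pd.DataFrame:
        # Replace any remaining Null values in the Pandas data frame with a constant
        return df.fillna(value)

    def scale_values(df: pd.DataFrame, standardize: bool = True) -> pd.DataFrame:
        # Standardize (zero mean, unit variance) or normalize to [0, 1]
        scaler = StandardScaler() if standardize else MinMaxScaler()
        return pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

    # The collected results can then be written out, e.g.:
    # pd.DataFrame(results).to_excel("results.xlsx")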
5.4. Results
First, the impact of the percentage of detected outliers was examined for both outlier detection algorithms with regard to the frequency of improvement in the quality of clusters after removing the detected outliers. The results are presented in Table 3. We can see that an improvement in the quality of clusters was registered much more often by the Davies–Bouldin index than by the Dunn index, regardless of how many outliers were detected. It is essential to explain that all results presented in this section are the average values of the analyzed parameters across the 686 experiments performed in this research.
All experiments record the number of cases in which there was an improvement, a deterioration, or no change in the values of cluster quality. The percentage values in the tables do not represent an average value but the exact number of cases reflecting the event, expressed as a percentage of all experiments from a given group. For example, in Table 4, we can see that when 1% of outliers are discovered and removed (there are 196 such cases), in 129 of these 196 cases, i.e., 65.82%, the quality of clusters measured by the Davies–Bouldin index improved. In 102 cases in this group, the quality of clusters measured by the Dunn index improved. The tables are extended by a piece of additional information: the number of cases confirming a given event.
Then we decided to check whether either of the clustering algorithms used contributed more than the other to improving the quality of clusters after removing outliers. The results are presented in Table 4.
It turns out that, taking into account all the experiments performed, the quality of the clusters was higher after removing the outliers more often for one of the two clustering algorithms than for the other, regardless of whether the Dunn or the Davies–Bouldin index was used. We see that not all the differences studied are statistically significant. At the assumed level of statistical significance, we can say that, in the case of the Davies–Bouldin index, the use of one of the clustering algorithms much more often led to a recorded improvement in the quality of clusters after removing the deviations; in other words, that algorithm is not resistant to the presence of outliers. In contrast, no statistically significant differences in the quality of clusters were noticed for the Dunn index when the outliers were eliminated.
An important task was to examine the impact of the chosen outlier detection method on the frequency of improvement of the quality of clusters after removing outliers. Table 5 contains the results, which show an interesting tendency. Using one of the two outlier detection algorithms, an increase in the quality of the created clusters is achieved much more often than with the other, which means that the quality of the clusters depends more significantly on the occurrence of the outliers found by that algorithm. We notice that using this algorithm leads, statistically significantly (Pearson test), more often to an improvement in the quality of clusters after eliminating the outliers. Therefore, this algorithm tends to discover more significant outliers; after removing them, the quality of the clusters improves.
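A frequency comparison of this kind is typically assessed with a Pearson chi-square test on the improved/not-improved counts; a minimal sketch with illustrative counts (not the paper's exact numbers, and assuming the conventional 0.05 threshold, since the significance level used is not reproduced here):

    from scipy.stats import chi2_contingency

    # Rows: the two outlier detection algorithms; columns: improved / not improved.
    # The counts below are illustrative only.
    table = [[180, 114],
             [140, 154]]
    chi2, p, dof, expected = chi2_contingency(table)
    significant = p < 0.05   # assumed conventional significance level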
We also wanted to check if and how the distance measures contribute to improving the quality of the clusters. It turns out that when using the Euclidean distance measure, the improvement in cluster quality is more often achieved for the Davies–Bouldin index, while for the Chebyshev measure, the quality of the clusters is more often improved according to the Dunn index. However, as Table 6 indicates, there are no statistically significant differences in the effectiveness of improving the quality of clustering after removing outliers depending on which distance measure (Euclidean or Chebyshev) we use.
Knowing that the analyzed datasets are real datasets that differ with respect to the size and type of the analyzed data, we also decided to investigate the differences in the frequency of increase or decrease in clustering quality depending on the input data source. The results are presented in Table 7.
The type of dataset we analyze significantly impacts how effective the outlier detection process is and, consequently, the quality of the created clusters. There are statistically significant differences between the analyzed datasets in the frequency of improvement in the quality of clusters after removing previously found outliers.
Table 8 also presents interesting results. We can see that, depending on which set was analyzed, the quality of clusters did not consistently improve as the number of detected deviations increased. It is also impossible to determine unequivocally whether any of the measured cluster quality indexes always yields an improvement in the quality of clusters. This confirms that the size and type of the analyzed data have a significant impact on the effectiveness of deviation detection and on the quality of clustering.
We see a trend in which the more deviations we detect and exclude from clustering, the more often the quality of the clusters improves. We should point out that, in the end, it is the analyzed dataset, with its specific type of data, that determines the effectiveness of outlier detection and the improvement in the quality of clusters.
The last analyzed clustering parameter which can affect the improvement of the quality of clusters after removing the outliers is the cluster combination method.
Table 9 indicates that there are statistically significant differences between the clustering methods (single, complete, average) in the frequency of improvement of cluster quality after removing outliers. We can see that outlier removal improves the quality of clusters in less than 30% of cases (using the Dunn index to assess the quality) when the single linkage method is used, whereas in the case of the complete or average method, this effect is obtained much more often (about 80% of cases).