Abstract
Clustering algorithms are essential in data analysis, but evaluating their performance is challenging when the true labels are not available, especially for non-convex clusters. Traditional performance evaluation metrics struggle to identify clustering quality, often assigning higher scores to linearly separated clusters than to the true clusters. We propose an original approach to distance computation that accounts for the data structure, thus improving the clustering quality evaluation for non-convex clusters without affecting other shapes of clusters. We also showcase the applicability of this method through a modified version of K-Means that uses the proposed distance and is capable of correctly separating non-convex clusters. The validation included an analysis of the performance and time complexity of three traditional clustering quality evaluation metrics and the K-Means clustering algorithm against their versions augmented with the proposed approach. This analysis was conducted on 7 benchmark synthetic datasets and 6 real datasets with various numbers of examples and features and with diverse, sometimes joint, complexities: simple convex clusters, overlapped and imbalanced clusters, and non-convex clusters. Through these analyses, we show the ineffectiveness of traditional methods and demonstrate that the proposed approach overcomes their weaknesses.
1 Introduction
Clustering is an essential tool in data analysis that allows us to find the underlying structure in data by grouping similar data points. All clustering algorithms rely on distance computations to assess the similarity between data points, which is further used to organize the data into clusters. The goal is to ensure that points within the same cluster are closer to each other than those in different clusters.
When the true labels are not available, internal clustering performance evaluation metrics are employed to assess the quality of the clusters. These metrics also rely on distances to determine whether the found clusters are separated, based on the computation of intra-cluster and inter-cluster distances. Thus, dense convex clusters that are well separated receive the highest scores, and as complexities are introduced into the data, the scores decrease even when the true labels are considered. The complex geometry of non-convex clusters is difficult to capture using traditional distance measures, and since traditional clustering evaluation metrics are built on these measures, they lead to erroneous evaluations.
This paper introduces a novel approach to distance computation, termed the edging distance (ED), designed specifically for non-convex clusters. The proposed approach addresses the limitations of the Euclidean distance by iterating through the data points to discover the structure of the data. By considering the structure of the data in the computation of the distance, the proposed method offers a better estimation of the distances within and between clusters, which enhances the performance of clustering and of clustering evaluation metrics. The structure of the data is integrated into the distance computation using principles from graph theory: the distance is calculated along a path between the points, which can account for the complex shapes of non-convex clusters.
To demonstrate the effectiveness of our method, we conducted extensive experiments on several synthetic clustering benchmark datasets of varied characteristics: simple convex clusters, overlapped and imbalanced convex clusters, and non-convex clusters. The obtained results indicate that the proposed approach improves the evaluation of clustering performance, as the cohesion and separation of clusters are better estimated when employing our distance computation compared with the Euclidean distance. Moreover, our method does not affect clustering performance evaluation on convex clusters; thus, it is a more robust approach to the evaluation of clustering performance. This statement is confirmed by the results obtained on various erroneous labellings that traditional clustering evaluation metrics score higher than the true labels. To further showcase the method's applicability, we have demonstrated improved clustering performance by replacing the conventional Euclidean distance in K-Means with the proposed approach.
2 Related work
2.1 Distances
Choosing the appropriate distance measure poses a significant challenge for anyone endeavouring to apply distance-based clustering algorithms to datasets. The array of available similarity measures often leads to confusion and difficulty in selection. A study [23] aimed to elucidate the suitability of various similarity measures for low-dimensional and high-dimensional datasets through experimentation by comparing similarity measures tailored for clustering numerical data using various datasets. Assessing the performance of these measures based on the Rand Index [13], it was identified that the average distance [5] ranked consistently among the most appropriate measures.
It has also been stated [14] that relying solely on distance metrics in clustering may not sufficiently capture correlations among data objects. Even when data points are well separated according to these metrics, they may still share similar patterns. Therefore, distance-based metrics alone are inadequate for capturing the intricate characteristics of data, and understanding the associations and dissociations between patterns of data points is crucial for revealing their closeness in similarity. Thus, the literature [5, 14, 23] indicates that distance computation needs improvement, specifically in clustering and in clustering performance evaluation metrics.
Path-based approaches have been shown to be capable of improving various methods, even clustering methods. One approach is to use a path-based cost function, as shown by [10], to create an agglomerative approach to clustering. The authors propose a criterion for the connectedness of two samples: the objects belong to the same cluster if a path of intermediate samples exists between them. In this case, the dissimilarity of two objects is computed as the minimum largest gap across all paths between two samples. Another approach can be found in [6], where a path-based similarity measure is created to improve the performance of spectral clustering. The authors propose a Gaussian kernel to determine the similarity between samples in a dataset. The final similarity between two samples in a dataset is the maximum across the minimum edges for all paths between these two samples. These approaches are inefficient as they require all paths between pairs of samples to be computed. Nevertheless, these prove the applicability of path-based methods in improving various algorithms.
2.2 Internal clustering performance metrics
Internal metrics require the dataset and a set of labels, which can be the true labels or a clustering assignment. Although these metrics do not require a ground truth, they are most commonly used to evaluate clustering with respect to certain characteristics, such as intra-cluster and inter-cluster distributions, the shape of clusters, and the separation of clusters. Clusters that respect the standard concept, i.e. well-separated, high-density convex clusters, receive higher scores. When evaluating the data with the true labels, these metrics can be used to evaluate the structure of the data and its separability into clusters. In this section, we review the fundamental concepts of popular internal clustering performance evaluation metrics, highlighting their functionality, strengths, and limitations. We succinctly summarize the information on internal clustering performance metrics in Table 1, while the following subsections expand on this subject.
2.2.1 Silhouette score
The Silhouette score (SS) [19, 20] is an internal metric utilized for evaluating clustering performance at the individual sample level within a dataset. SS is determined by measuring the distances from a point to all other samples within its own cluster and the distances from the point to all samples of the closest cluster. The formula for computing the score of a single sample is given by Eq. (1):

$$s = \frac{b - a}{\max(a, b)} \tag{1}$$

Here, b represents the mean distance between a sample and all samples of the closest different cluster, while a denotes the mean distance between the sample and the other samples within its own cluster.
The SS for the entire dataset is obtained by averaging the scores of each individual sample. This metric is constrained within the [− 1, 1] interval. A score of − 1 indicates incorrect clustering, while a score of 1 indicates dense clustering and values close to 0 indicate overlapping clusters.
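For illustration, the per-sample computation described above can be written as the following minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette score of sample i: (b - a) / max(a, b)."""
    d = np.linalg.norm(X - X[i], axis=1)              # Euclidean distances from i to all points
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()      # mean intra-cluster distance
    b = min(d[labels == c].mean()                     # mean distance to the closest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# The dataset-level SS is the mean over all samples:
# np.mean([silhouette_sample(X, labels, i) for i in range(len(X))])
```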
2.2.2 Davies-Bouldin score
The Davies-Bouldin score (DBS) [8, 12, 19] is an internal metric that quantifies the similarity between each cluster and its most similar counterpart. Here, similarity is computed as the ratio between distances within clusters and the distance separating the clusters:

$$R_{ij} = \frac{s_i + s_j}{d_{ij}} \tag{2}$$

In Eq. (2), R_ij denotes the similarity between cluster i and cluster j. The terms s_i and s_j represent the average distance from any point within a cluster to its centroid, while d_ij represents the distance between the centroids of clusters i and j.

The score is computed as the arithmetic mean of the maximum similarity R_ij over pairs of clusters i and j, as delineated in Eq. (3):

$$DBS = \frac{1}{k}\sum_{i=1}^{k} \max_{j \neq i} R_{ij} \tag{3}$$

Here, k represents the number of clusters. DBS has a lower bound of 0, with lower values corresponding to superior cluster separation. Moreover, DBS tends to assign better scores to clusters exhibiting convex shapes compared to other cluster geometries.
2.2.3 Calinski-Harabasz score
The Calinski-Harabasz score (CHS) [2, 19], also referred to as the Variance Ratio Criterion, evaluates clustering algorithms by considering the ratio of inter-cluster dispersion (the spread between clusters) to intra-cluster dispersion (the compactness within clusters). Both are measured as sums of squared distances:

$$CHS = \frac{tr(B)}{tr(W)} \times \frac{n - k}{k - 1} \tag{4}$$

Here, tr(B) represents the trace of the inter-cluster dispersion matrix, capturing the variability between clusters, while tr(W) denotes the trace of the intra-cluster dispersion matrix, measuring the variability within clusters. The score is computed based on the number of samples n clustered into k clusters within the dataset.
The CHS metric has a lower bound of 0, with higher scores indicating better-separated clusters.
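All three scores discussed so far have reference implementations in scikit-learn; a short usage sketch on a synthetic dataset (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, labels = make_blobs(n_samples=500, centers=3, random_state=0)
print(silhouette_score(X, labels))         # in [-1, 1], higher is better
print(davies_bouldin_score(X, labels))     # >= 0, lower is better
print(calinski_harabasz_score(X, labels))  # >= 0, higher is better
```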
2.2.4 Other internal metrics
2.2.4.1 Density-based clustering validation
The density-based clustering validation score (DBCVS) [17] was designed to evaluate the quality of density-based clustering algorithms, such as DBSCAN. It assesses the performance of the algorithm through cluster density rather than centroids or variance. This score is the weighted average of a per-cluster score, which is computed using mathematical formulations of the density sparseness of a cluster and the density separation of cluster pairs. The DBCVS metric is bounded in the [− 1, 1] range, where higher values indicate better density-based clustering solutions.
2.2.4.2 Dunn index
The Dunn Score (DS) evaluates clustering quality through the compactness and separation of clusters. DS is computed as the ratio between the minimum inter-cluster distance and the maximum intra-cluster distance. The DS metric is lower bounded at 0, with higher scores indicating better clustering.
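As described, the Dunn Score reduces to a ratio of two pairwise quantities; a naive O(n²) sketch (function and variable names are ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_score(X, labels):
    """Minimum inter-cluster distance divided by the maximum intra-cluster distance."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_intra = max(cdist(c, c).max() for c in clusters)       # largest cluster diameter
    min_inter = min(cdist(clusters[i], clusters[j]).min()      # closest pair of clusters
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_intra
```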
2.3 External clustering performance metrics
In contrast to internal performance metrics, external metrics necessitate both the true labels and the clustering assignments, thus having limited use in unsupervised tasks. They employ various statistical techniques to assess the accuracy of clustering by evaluating the alignment between the labels produced by the clustering algorithm and those of the true labels.
2.3.1 Adjusted rand index
The Adjusted Rand Index (ARI) [21, 24, 27] is a metric derived from the Rand Index (RI) [13] that compares pairs of labels from the true and predicted labels. It distinguishes between cases of agreement, where pairs belong to the same cluster, and disagreement, where pairs are in different clusters, thereby penalizing over-clustering. As expressed by Eq. (5), the RI is computed by dividing the total number of agreements by the sum of agreements and disagreements:

$$RI = \frac{\text{agreements}}{\text{agreements} + \text{disagreements}} \tag{5}$$

ARI adjusts for chance agreements using an expected index, as expressed by Eq. (6), to ensure consistent scores for random labellings:

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} \tag{6}$$
RI represents the probability of agreement between two sets of labels on a randomly chosen pair, yielding a score of 1 for complete agreement and 0 for complete disagreement. Expressed through the ARI formula, scores fall within the [− 1, 1] interval, with negative values indicating assignments that agree less than expected by chance, values close to 1 indicating correct clusterings, and values close to 0 indicating random labellings.
2.3.2 Adjusted mutual information
Adjusted Mutual Information (AMI) [25, 27] is an external metric derived from Mutual Information (MI) that includes the normalization used by Normalized Mutual Information. MI itself relies on entropy (H), as defined in Eq. (7), which, in turn, relies on probability:

$$H(U) = -\sum_{i=1}^{|U|} P(i)\log P(i) \tag{7}$$

Here, P(i) represents the probability of a sample belonging to class Ui, calculated by dividing the number of samples with this property (|Ui|) by the total number of samples (N). By defining U and V as two label assignments and by modifying the entropy formula, we arrive at the MI formula:

$$MI(U, V) = \sum_{i=1}^{|U|}\sum_{j=1}^{|V|} P(i, j)\log\frac{P(i, j)}{P(i)P(j)} \tag{8}$$

where P(i, j) denotes the probability of a sample belonging to both Ui and Vj, i.e. the ratio of the number of samples in the intersection of Ui and Vj to the total number of samples, and P(j) is the analogous class probability for V. By utilizing the expected index, MI can be adjusted for chance to ensure a consistent score for random labellings, thus yielding AMI:

$$AMI(U, V) = \frac{MI(U, V) - E[MI(U, V)]}{\operatorname{mean}\big(H(U), H(V)\big) - E[MI(U, V)]} \tag{9}$$
Similar to ARI, this metric is confined to the [− 1, 1] range, with random labellings given a score of 0, while negative values indicate erroneous assignments and values close to 1 indicate correct clustering.
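Both external metrics are available in scikit-learn; a short usage sketch comparing a clustering assignment with the ground truth (the label vectors below are illustrative):

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
predicted   = [1, 1, 0, 0, 2, 2]   # the same partition, just with permuted cluster names

print(adjusted_rand_score(true_labels, predicted))         # 1.0: identical partitions
print(adjusted_mutual_info_score(true_labels, predicted))  # 1.0: identical partitions
```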
2.4 Clustering algorithms
2.4.1 K-Means
K-Means (KM) clustering, as outlined in [16], operates by partitioning the sample space into K partitions, assigning each object to the cluster whose mean is closest. It utilizes Euclidean distances [11, 16] within clusters and aims to minimize intra-cluster variance by minimizing the squared error. Initially, K-Means initializes K centroids, typically through random assignment, and then updates their values iteratively until convergence. The iterative process consists of two main steps: firstly, assigning each sample to the nearest centroid based on distance, and secondly, updating the centroid's location to the mean of all samples assigned to its cluster.
One primary drawback of K-Means is its reliance on the user to specify the number of centroids [11], which can be challenging in unsupervised tasks. Additionally, it is non-deterministic and may converge to a local minimum, requiring multiple runs of the algorithm. Another limitation is its struggle with identifying clusters of arbitrary shapes. Despite these limitations, K-Means remains one of the fastest clustering algorithms, with time complexity of O(ndki), where n is the number of samples (or examples/instances in the dataset), d is the number of dimensions (also known as the number of features of the dataset), k is the number of clusters specified, and i is the number of iterations of the algorithm.
When the data lack linear separability, K-Means struggles to perform effectively. This algorithm forms distinct clusters based on the similarity to the cluster centre, typically using the Euclidean distance as the measure. Consequently, it can only establish linear decision boundaries, leading to poor performance when the data are not linearly separable. Nonetheless, by employing alternative similarity criteria or by preprocessing the data to achieve linear separability, K-Means can operate effectively [11].
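For reference, the assign/update loop described above can be sketched in a few lines of NumPy (random initialization and a simplified convergence test; names are ours):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # random initial centroids
    for _ in range(n_iter):
        # assignment step: each sample goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its assigned samples
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # converged
            break
        centroids = new_centroids
    return labels, centroids
```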
2.4.2 Spectral clustering
Spectral clustering (SC), as outlined in the literature [28], operates by transforming the data into a graph representation, where nodes represent data samples and edges represent the relationships between them. The graph is partitioned into K clusters based on the eigenvalues and eigenvectors of a similarity matrix derived from the data. To overcome the limitations of K-Means, which relies on distance measures in the original feature space, SC captures nonlinear relationships more effectively via space transformation. SC’s ability to capture complex nonlinear relationships by operating in a transformed space allows it to identify clusters of arbitrary shapes.
The process of spectral clustering typically involves several steps. First, a similarity matrix is constructed based on pairwise similarities between data points, commonly using measures like Gaussian kernel similarity or k-nearest neighbours. Next, the Laplacian matrix of the graph is computed from the similarity matrix. Then, the eigenvectors corresponding to the smallest eigenvalues of the Laplacian matrix are extracted. These eigenvectors encode the underlying structure of the data and are used to embed the data into a lower-dimensional space. Finally, standard clustering algorithms, such as K-Means [16] or normalized cuts [22], are applied to the embedded data to partition it into K clusters. Due to the computation of the similarity matrix over all points and the eigenvalue decomposition of the Laplacian matrix, the time complexity of spectral clustering is approximately O(n³), where n is the number of samples. However, its complexity depends on the similarity measure, the decomposition method, and the clustering algorithm employed. Thus, it may be computationally expensive, particularly when dealing with large datasets, due to the eigenvalue decomposition step.
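A condensed sketch of these steps, using a Gaussian (RBF) affinity and the symmetric normalized Laplacian (the kernel width and other choices are illustrative, not the exact pipeline evaluated later):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, gamma=1.0):
    W = np.exp(-gamma * cdist(X, X, 'sqeuclidean'))        # Gaussian-kernel similarity matrix
    d = W.sum(axis=1)
    L_sym = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))   # I - D^(-1/2) W D^(-1/2)
    _, eigvecs = np.linalg.eigh(L_sym)                     # eigenvalues in ascending order
    U = eigvecs[:, :k]                                     # eigenvectors of the k smallest eigenvalues
    U /= np.linalg.norm(U, axis=1, keepdims=True)          # row-normalized spectral embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)  # cluster in the embedded space
```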
2.4.3 MeanShift
MeanShift (MS) [4, 7] is considered a mode-seeking algorithm. It offers a mathematical analysis for the localization of the maxima of a density function. From a clustering perspective, it identifies clusters by updating centroid candidates to the mean of the points in their region. In this way, it is similar to K-Means, as its geometry is also based on distances between points. MeanShift is an iterative process that shifts a kernel window towards high-density regions until convergence is reached. Considering a dataset with n samples, the time complexity of MeanShift is O(n²). In contrast to K-Means, it does not require the number of clusters as an input, but it has difficulties with overlapping clusters and can result in under-clustering. Nevertheless, a feature extraction step can circumvent this, and, additionally, MeanShift can handle a high number of clusters and imbalances.
2.4.4 Agglomerative clustering
Agglomerative clustering (Agg) [1] applies a “bottom-up” approach to hierarchical clustering. Initially, each sample is assigned to an individual cluster. As the algorithm progresses through its iterations, these clusters merge. Pairs of clusters are merged based on a proximity matrix, which contains distances between points. Agglomerative clustering uses a linkage function to merge clusters based on distance information. One type of linkage is the Ward linkage, which analyses cluster variance rather than directly analysing distances. Its main disadvantage is that it requires the number of clusters as input. In addition, its complexity is demanding: considering a dataset with n samples, its time complexity is O(n³) and its space complexity is O(n²).
2.4.5 DBSCAN
DBSCAN [9], or density-based spatial clustering of applications with noise, is a density-based clustering algorithm. In the first step, the method identifies the cores of clusters as regions with high densities and expands them. As a result, it defines clusters as high-density regions, while low-density regions are labelled as noise. It can be inferred that, despite its name, DBSCAN performs well only if clusters of similar densities are provided. It has difficulties in separating overlapped clusters and, especially, embedded ones. Unlike K-Means, DBSCAN does not require the number of clusters as input. In addition, DBSCAN can identify clusters with arbitrary shapes and is a mostly deterministic algorithm. DBSCAN's main disadvantage is its inability to distinguish between different densities of clusters. DBSCAN has a time complexity of O(n²), where n is the number of samples, while its memory complexity may vary between O(n) and O(n²), depending on the implementation. This complexity arises from the need to compute distances within neighbourhoods. DBSCAN has also been found to be useful within the domain of spike sorting [26].
2.4.6 HDBSCAN
HDBSCAN [3] is a hierarchical clustering algorithm and a modified version of DBSCAN. Just as DBSCAN does, it identifies a subset of sparse points as noise, while dense regions are considered cluster centres. This algorithm groups points based on a weighted graph by creating a minimum spanning tree, using, for example, Prim's greedy algorithm. The single most important parameter is the leaf size, which establishes the smallest number of points required for a group of points to be considered a cluster. The time complexity of the algorithm is similar to that of DBSCAN, O(n²), while it also has a space complexity of O(nd), where n is the number of points in the dataset and d the number of dimensions of each point.
3 Materials and methods
3.1 Problem characterisation
Unlike convex clusters, which have well-defined boundaries, non-convex clusters often exhibit irregular shapes, overlapping regions, or disconnected components. The Euclidean distance metric assumes that the shortest distance between two points is a straight line, which can lead to inaccuracies when dealing with non-convex clusters. As a result, clustering algorithms relying solely on Euclidean distance may struggle to properly capture the underlying structure of such clusters, potentially misclassifying points or failing to detect meaningful patterns. Therefore, alternative distance metrics or clustering algorithms designed to handle non-convex structures are often preferred in these scenarios to achieve more accurate and robust clustering results.
The significant limitation of the Euclidean distance lies in its assumption of straight-line distances between points, which may not accurately reflect the actual distances within clusters exhibiting irregular shapes. In non-convex clusters, where boundaries can be intricate and curved, the Euclidean distance fails to capture the complexity of the shapes effectively. Take, for example, a cluster with the shape of a circle; many distances evaluate to values close to the diameter of the circle, even if the points are densely packed. Furthermore, the assumptions underlying many clustering algorithms, such as compactness and well-separated clusters, often do not hold for non-convex clusters, further undermining the efficacy of Euclidean distance-based methods. Most commonly, to address these limitations, alternative distance metrics (such as the Mahalanobis distance or cosine similarity) and clustering algorithms (such as DBSCAN or spectral clustering) that are better suited for capturing non-convex structures are employed.
3.2 Proposed approach
In this work, we propose a new way to compute distances, termed the edging distance (ED), to better identify complex structures, such as those of non-convex clusters, without penalizing the identification of simpler, more regular structures. Instead of the classical Euclidean distance, which overestimates the distance between points, especially in non-convex shapes, as it does not take the overall structure into consideration, the proposed approach defines the distance between two points as the maximum edge in a guided path between these two points. Each edge and its value are computed as the Euclidean distance between two k-nearest neighbours. In this way, the structure of the data points determines the value of the distance between any two points. For example, two points in the same cluster have a low distance value, as the path between them is determined by many closely packed points. In comparison, the distance between two points in different clusters is determined by the edge connecting a border point of the cluster containing the start point to the other cluster, as that is the largest edge in such a path.
A simple illustration of the proposed approach is shown in Fig. 1 in comparison with the Euclidean distance. The proposed approach starts from the white node and, at each step, chooses the next node from among the two nearest neighbours of the current node. The resulting distance is computed as the weight of the maximum edge along the obtained path. The weights of the edges are the Euclidean distances between the connected nodes.
Let there be a dataset of points that can be represented as nodes N = {n1, n2, …, nn} in a graph G = (N, E), where E represents the edges that connect k-nearest neighbours. An edge eij between nodes ni and nj has a weight equal to the Euclidean distance dEuclidean(ni, nj) between these nodes. The guided path P represents the sequence of edges that connects any nstart to nend in the graph G. ED is defined as the maximum edge weight between any two consecutive nodes ni and nj along the guided path between these two points. Thus, ED can be mathematically formulated plainly as:

$$ED(n_{start}, n_{end}) = \max_{e_{ij} \in P} d_{Euclidean}(n_i, n_j) \tag{10}$$
The expanded form of Eq. (10), considering the k-nearest neighbourhood, is:

$$ED(n_{start}, n_{end}) = \max_{(n_i, n_j) \in P,\; n_j \in kNN(n_i)} d_{Euclidean}(n_i, n_j) \tag{11}$$

where P represents the guided path from nstart to nend and kNN represents the k-nearest neighbourhood of a node. It is important to note that this mathematical formulation does not include the lookahead mechanism that is explained later.
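As a small worked example (a toy case of our own, not taken from the experiments), consider the one-dimensional points {0, 1, …, 9} and {30, 31, …, 39}. For two points in the first group, such as 0 and 9, the guided path passes through every intermediate point, so ED(0, 9) = 1, whereas dEuclidean(0, 9) = 9. For two points in different groups, such as 0 and 30, the largest edge along the path is the jump from 9 to 30, so ED(0, 30) = 21, which is essentially the gap separating the two groups. Intra-group distances are thus compressed, while the inter-group distance is preserved.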
An exact representation of these concepts can be viewed in Fig. 2, where a comparison between the proposed distance and the Euclidean distance is made. In this figure, we show that for convex clusters the intra-cluster distance remains much smaller than the inter-cluster distance, thus retaining the characteristics of the Euclidean distance.
However, in Fig. 3 where non-convex clusters are present, we show that the proposed approach takes into consideration the structure of the data and the intra-cluster distances are smaller than inter-cluster distances. For the Euclidean distance, it can be seen that intra-cluster distances are estimated to be larger than inter-cluster distances for non-convex clusters.
3.3 Detailed approach
The cluster centre has been changed from the mean of the cluster points to the point with the smallest total distance to all other points, as described in Algorithm 1. The conventional cluster centres can also be used if they are added as part of the dataset in the computation of the distances in the proposed approach. Nevertheless, this is not the best choice, as the cluster mean can lie outside of the cluster (especially for non-convex clusters), resulting in high intra-cluster distances, which is why we suggest this approach.
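A minimal sketch of this centre selection, assuming a precomputed pairwise distance matrix and the indices of the cluster's members are available (names are ours):

```python
import numpy as np

def cluster_centre(dist_matrix, members):
    """Index of the member with the smallest summed distance to all other members (a medoid)."""
    members = np.asarray(members)
    sub = dist_matrix[np.ix_(members, members)]     # pairwise distances restricted to the cluster
    return int(members[sub.sum(axis=1).argmin()])
```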
For the distance computations, the distance between two points is computed along a guided path from one point, considered the start node of the path, to the other, considered the end node of said path. Regarding the guided path, each node of the path is chosen from among a subset of nearest neighbours of the current node as the one closest to the end node, and all other points in that subset are removed from consideration to improve execution time. As the path may not always be straight (as would a path of collinear points), our approach incorporates a lookahead mechanism that allows the path to diverge for a given number of nodes, searching in another direction in case a path can be found; otherwise, the path is short-circuited to the end node by forcing the next node to be the end node.
In the proposed approach, the conventional distance is substituted by the maximum edge in this guided path, where the value of each edge is the Euclidean distance between the two nodes that form the edge. We made this choice because it minimizes intra-cluster distances while maximizing inter-cluster distances, even for non-convex clusters. Intra-cluster distances are minimized because, however long a path within a cluster is, its maximum edge is smaller than the conventional distance. Take, for example, the distance between the cluster centre and a border point: an edge equal to the Euclidean distance between them is obtained only if no points lie between them; any intermediate point segments the path into two edges, which lowers the value of the maximum edge and results in a value smaller than the Euclidean distance between the two points. Moreover, when clusters are well separated, a large edge must be created to step from one cluster to the other, and that is the edge that defines the distance between points in different clusters, thus maximizing the inter-cluster distances. Furthermore, the proposed approach is deterministic, as it does not incorporate any randomness: the distance between the same two points in the same dataset is always equal.
The pseudocode, presented in Algorithm 2, defines a function called max_edge_in_guided_path. This function determines the maximum edge within a guided path from a starting node (start) to an end node (end), based on a given set of points X, with additional parameters including NN for nearest neighbour selection and lookahead for path exploration and short-circuiting.
Initially, the function checks if the starting node is identical to the ending node. If they match, it returns 0, as no traversal would be required. Subsequently, an array that defines the visited nodes is defined, and the algorithm initializes the next node in the path as the starting node.
The algorithm then proceeds with iterative exploration until the next node reaches the end node. During each iteration, the algorithm identifies the NN nearest neighbours (that have not been visited) of the current node and computes their distances to the end node. The neighbour closest to the end node is selected as the subsequent node to explore, unless all neighbours are further away from the end node than the current node.
For this case, we added a lookahead mechanism described by the pseudocode in Algorithm 3. During each iteration, the algorithm assesses whether all nearest neighbours are farther from the end node than the current node. If so, and if lookahead has not commenced, the algorithm initiates lookahead, preserving the current state.
If lookahead is underway and a closer node to the end node is discovered, the lookahead is reset to enable further iterations. When lookahead is active, the counter decrements, and if it reaches 0, indicating the lookahead limit has been reached, the algorithm evaluates whether a superior node (one closer to the end node) has been identified. If the counter has reached 0 and no superior node was identified, it resets the path to the pre-lookahead state, in essence short-circuiting the traversal. Otherwise, it resets the counter and continues lookahead.
All other nodes in the NN-neighbourhood are marked as visited after selecting the next node based on the above criteria. If the path length is 1, indicating that the ending node is directly adjacent to the starting node, the function calculates and returns the distance between them. In cases where the path consists of multiple nodes, the algorithm traverses the path to compute the maximum edge between consecutive nodes. Finally, the function returns the maximum edge length found along the guided path.
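To make the procedure concrete, the following is a simplified Python sketch consistent with the description of Algorithms 2 and 3; the bookkeeping details (state saving, tie handling, keeping the end node selectable) are our own simplifications rather than the exact published pseudocode:

```python
import numpy as np

def max_edge_in_guided_path(X, start, end, NN=5, lookahead=10):
    """Maximum Euclidean edge along a guided path from X[start] to X[end]."""
    if start == end:
        return 0.0
    visited = np.zeros(len(X), dtype=bool)
    visited[start] = True
    path, current = [start], start
    la_counter, la_len = None, None                      # lookahead countdown, saved path length
    while current != end:
        d_cur = np.linalg.norm(X[current] - X[end])
        candidates = np.flatnonzero(~visited)            # unvisited nodes
        d_to_cur = np.linalg.norm(X[candidates] - X[current], axis=1)
        neigh = candidates[np.argsort(d_to_cur)[:NN]]    # NN nearest unvisited neighbours
        d_to_end = np.linalg.norm(X[neigh] - X[end], axis=1)
        nxt = neigh[int(d_to_end.argmin())]              # neighbour closest to the end node
        if d_to_end.min() >= d_cur:                      # every neighbour moves away from the end
            if la_counter is None:                       # start lookahead, remember current path
                la_counter, la_len = lookahead, len(path)
            la_counter -= 1
            if la_counter == 0:                          # lookahead exhausted: revert, short-circuit
                path = path[:la_len] + [end]
                break
        else:
            la_counter, la_len = None, None              # progress towards the end: reset lookahead
        visited[neigh] = True                            # discard the rest of the neighbourhood
        visited[end] = False                             # but keep the end node selectable
        path.append(nxt)
        current = nxt
    return max(np.linalg.norm(X[a] - X[b])               # maximum edge along the obtained path
               for a, b in zip(path[:-1], path[1:]))
```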
The two parameters offered have an impact on the results obtained:
i. The NN parameter determines how large the neighbourhood should be. As unused neighbours are deleted and cannot appear in subsequent iterations, higher values of NN reduce execution time but may generate larger edges along the path, resulting in lower scores.

ii. The lookahead (La) parameter allows the path to veer towards directions that might not seem optimal at first; thus, higher lookahead values can obtain higher scores but may increase execution time. Lower values of lookahead, while reducing execution time, might short-circuit the path exploration, resulting in lower scores.
In essence, this pseudocode outlines an algorithmic approach for efficiently identifying the maximum edge length within a guided path in a given set of points, employing strategies such as neighbour selection and path exploration.
As the proposed approach is a modification of the distance computation, it is widely applicable. Here, we showcase each of the internal metrics presented in the Related Work section modified to use the proposed approach for distance computation. Therefore, all three metrics have been updated to incorporate the two modifications that encompass the proposed approach. In the next section, we analyse the impact of these modifications on performance compared to the basic versions, as well as the impact of various parametrizations. Several applications can be found for the proposed approach; here, we showcase our augmented version of K-Means and evaluate its performance against other clustering algorithms on several datasets of various sizes and dimensionalities. In the creation of the paths provided by the proposed approach, the path always moves from one node towards the neighbour, from a nearest-neighbour vicinity, that is closest to the end node. The lookahead mechanism works on the same principle but allows the path to go to nodes further from the end node than the current node. Thus, the proposed approach is deterministic, and any inconsistencies in results that may appear depend on the algorithm it is used within. An example of this is the K-Means algorithm, where the centroids (end nodes) are randomly instantiated at the beginning. As the internal metrics are based on distance computations between pairs of samples, the proposed approach substitutes the Euclidean distance between two samples, where one is considered the start node and the other the end node.
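As an illustration of this substitution, the distances can be precomputed into a matrix and fed to an off-the-shelf metric implementation; a sketch for ED-SS, reusing the max_edge_in_guided_path sketch above and treating the guided-path distance as symmetric (an assumption on our part):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def ed_distance_matrix(X, NN=5, lookahead=10):
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = max_edge_in_guided_path(X, i, j, NN, lookahead)
            D[i, j] = D[j, i] = d              # assume symmetry of the guided-path distance
    return D

# ED-SS: the standard Silhouette computation applied over ED distances
# ed_ss = silhouette_score(ed_distance_matrix(X), labels, metric="precomputed")
```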
We illustrate a simple example of the lookahead mechanism in Fig. 4. Consider a path through nodes with the start node (0,0) and the end node (10, 0). The correct maximum edge in such a path is the edge between (8,4) and (10, 0). In the guided path of the proposed approach with NN = 1, when the (4,4) node is reached, the nearest neighbour would be (4,5), which is further away from the end node than the current node, inducing a short-circuit to the end node. This will result in a larger maximum edge, as shown on the left side of Fig. 4. Through the addition of the lookahead mechanism, a nearest neighbour that is further away from the end node than the current node can be reached, in this case (4,5), which allows us to explore the path further in order to find the correct maximum edge, as shown on the right side of Fig. 4.
4 Results
Several analyses have been conducted to determine the performance of the proposed approach in different applications. We have analysed the impact of the proposed approach when applied to clustering performance metrics. For each internal metric presented in the Related Work section, an augmented form has been created. We have also analysed the impact in clustering by augmenting the K-Means algorithm to use the proposed distance instead of the conventional Euclidean distance.
To prove the efficacy of the proposed method, we have compared the basic clustering performance evaluation metrics with their versions augmented through the proposed approach on several datasets with various characteristics, such as convexity, overlap and density. A similar comparison has been made between the traditional and augmented versions of K-Means using several datasets. Furthermore, we have also analysed the time complexity obtained by augmenting these algorithms with the proposed method for both of these applications.
4.1 Analysis of effects on clustering performance metrics
We compared the proposed approach with traditional clustering performance metrics. Three evaluation metrics have been chosen, specifically SS, DBS and CHS. Each has been modified to use the proposed approach and annotated with a prefix; for example, ED-SS denotes our augmented version that was compared with the basic version of the Silhouette Score, denoted SS. We have chosen 7 datasets [18], each with different characteristics, from simple convex clusters to overlapping convex clusters with different densities to non-convex clusters. We have also chosen to analyse 5 different labelling options, including the true labels and random labels. The true labels are required to ascertain that they will receive high values, while random labels are required to ensure that such labelling is penalized.
Three additional wrong labellings are introduced, where the points are linearly separated to obtain the labels by a diagonal line (DL), a vertical line (VL) and a horizontal line (HL). An ideal metric would produce the best scores for the true labels, while any other labelling would preferably result in lower scores depending on how far it is from the correct answer. Thus, we introduced linearly separated labellings in order to further test the clustering quality evaluation that these metrics provide by checking if the true labels will have higher scores than any other labels. Each of these additional labellings offers different characteristics depending on the dataset it is paired with. The visual representation of the datasets is illustrated in Fig. 5.
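For reproducibility, such linear labellings can be generated directly from the 2-D coordinates; a short sketch (the exact placement of each separating line is an illustrative choice of ours, e.g. coordinate medians and the main diagonal):

```python
import numpy as np

def linear_labellings(X):
    """Binary labellings obtained by splitting 2-D points with simple lines."""
    x, y = X[:, 0], X[:, 1]
    vl = (x > np.median(x)).astype(int)              # vertical line (VL) split
    hl = (y > np.median(y)).astype(int)              # horizontal line (HL) split
    dl = (y - y.mean() > x - x.mean()).astype(int)   # diagonal line (DL) split
    return dl, vl, hl
```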
4.1.1 Silhouette score
This section presents the comparison between the basic Silhouette Score (SS) and our augmented version of the Silhouette Score (ED-SS).
As expected, SS results in higher values for conventional clusters, such as those presented in D1, D2, D3 and D5, when paired with the true labels. Moreover, for all datasets, when paired with random labels, it results in values close to 0. However, several inconsistencies occur when dealing with non-convex clusters. As can be seen from Fig. 6, the unconventional labellings (DL, VL and HL) obtain higher values than the true labels for non-convex datasets (D6, D7). This should not be the case, as it is clear to the naked eye that they are not correct clusterings. Moreover, a similar trend appears for elongated clusters (D4) where unconventional labellings receive a higher score than the true labels.
This is not the case for ED-SS; as we can see from Fig. 6, it is able to correctly assign higher values to non-convex clusters when paired with the true labels rather than with unconventional or random labels. However, the modifications still retain the weakness of the metric when considering elongated clusters. There is also an additional case where the evaluation is mistaken: when evaluating D3 with VL, due to the overlap between the light-blue clusters and the way the proposed distance computation works, the labelling results in low intra-cluster distances and, as such, receives a high score. To explain this phenomenon further, D3/D4 paired with VL yields low intra-cluster distances, since the clusters incorrectly merged by these labels overlap, while the inter-cluster distances are high; with the true labels, however, the inter-cluster distances are low because the clusters overlap, and as such the true labels result in lower scores. Nevertheless, for all other cases, ED-SS retains the characteristics of the basic SS metric and results in a correct evaluation.
Higher values of NN generate larger neighbourhoods; thus, more points are removed at each step as they are considered visited, and this results in larger edges along the path, which produces lower scores. This can be observed when comparing the scores of Fig. 6B, C. By comparing Fig. 6B, D, the impact of the lookahead parameter can be observed. The effect of modifying this parameter varies from dataset to dataset; nevertheless, a general trend is that the true labels receive a lower score, while the wrong labellings (DL, VL, HL) receive a slightly higher score.
4.1.2 Davies-Bouldin score
DBS follows a similar pattern to that of SS, where non-convex clusters (D6, D7) and elongated clusters (D4) receive worse scores (meaning higher values) for the true labels. Nevertheless, DBS can identify random labels as incorrect and assign them bad scores (meaning higher values). Our augmented version of DBS, denoted ED-DBS, generates better scores (meaning lower values) for non-convex clusters paired with the true labels than for either unconventional or random labels, indicating that the proposed approach does indeed offer a better evaluation. When comparing the basic and augmented versions, ED-DBS gives a wrong evaluation only when considering D5 paired with DL/VL, as these labellings generate lower distances, resulting in better scores. However, the remaining cases are correct, as the true labels are identified with better scores (meaning lower values) than other labellings. All of these observations can be inferred from the analysis of the values presented in Fig. 7.
DBS is a slightly simpler metric compared with SS and, as such, certain wrong labellings can result in better scores (meaning lower values) than the true labels. This becomes apparent when analysing the VL labelling, where D3, D4 and D5 paired with VL obtain better scores than the true labels. This phenomenon happens because DBS uses only the distance between the cluster centres and their scatter to compute the performance. In the case of VL, for those specific datasets, the separation offered by this labelling results in better distance values.
The comparison between Fig. 7B, C indicates that higher values of the NN parameter result in worse scores; however, the execution is faster, and, considering the relative values, the true labels still receive better scores than any other labelling. By comparing Fig. 7B, D, the impact of the lookahead parameter emerges. For lower lookahead values, the scores obtained for the true labels remain similar, while the wrong labellings (DL, VL, HL) tend to obtain better scores (meaning lower values) than the true labels. This indicates that, for DBS, higher lookahead values are necessary for a correct evaluation.
4.1.3 Calinski-Harabasz score
The true labels are expected to obtain higher scores for the CHS clustering metric, while random labels would result in the lowest scores (close to 0). However, similar to SS and DBS, this is not the case for non-convex clusters (D6 and D7) and for close elongated clusters (D4), where CHS yields lower scores. Our augmented version of CHS, denoted as ED-CHS produces higher values for the non-convex clusters (D6 and D7) when paired with true labels than with any other labelling.
As in the cases of ED-SS and ED-DBS, it can be seen from Fig. 8 that two erroneous evaluations are yielded by ED-CHS when considering D5 paired with DL/VL, while all remaining cases retain the characteristics of the basic version of the metric. Notably, the scores of D6 and D7 when paired with the true labels are lower for ED-CHS than for CHS. This should not be an issue, as the true labels still receive higher scores than any other labelling.
For the same reasons related to intra-cluster distances, D5 paired with DL/VL has higher values for ED-CHS as well. However, for ED-CHS, D3 and D4 paired with VL receive a higher score than when paired with the true labels; this happens due to the high density of points, which skews the cluster centres and induces a high inter-cluster distance, while the closeness or overlap of the clusters keeps the intra-cluster distances small.
Similarly to the previous metrics, when the NN parameter is increased, the scores of the true labels become worse, while those of the wrong labels tend to increase. This becomes evident when comparing Fig. 8B, C. Decreasing the lookahead parameter seems to affect only a subset of cases, while most show only slight variation when comparing Fig. 8B, D.
4.1.4 Theoretical considerations
The augmentation of each metric using the proposed approach requires any distance computation to be changed into ED for that metric. Depending on the metric and how it calculates intra-cluster and inter-cluster distances, these might be affected more or less. Nevertheless, as a general rule, inter-cluster distances having higher values are less affected in absolute terms by the modification in distance computation. As intuition, this is because for intra-cluster distances the identified path does not leave the cluster, thus obtaining lower values. In contrast, for inter-cluster distances, the maximum edge in the path is identified as the edge that connects the two clusters, therefore having higher values.
For SS, as it considers inter-cluster distances as the distance from a point to the closest distinct cluster, it is affected to a certain degree. Consider a point in the middle of a cluster: in the basic version of SS, a sample in the middle of another cluster would be at a greater distance than one at the border. However, in our augmented version, ED-SS, all samples from one cluster would have, as their inter-cluster distance, the distance from the border to the closest distinct cluster. It is important to note that this is the general case, and an exception can be found for highly overlapping and sparse clusters; however, this would affect the basic version as well. For DBS, the distance between clusters is computed only between their means. Similarly to SS, in the basic version the distance between the centres is greater than in our augmented version, where the distance is effectively computed between the cluster borders, under the assumption that the clusters are separated. For CHS, the distance between each centre is computed with respect to the general mean of the data.
In principle, the inter-cluster distance is approximately the distance between the borders for our augmented version, rather than the distance from one sample to the border or the distance from one centre to another (depending on how the inter-cluster distance is calculated in the considered score). As such, in our augmented version, the inter-cluster distance can be considered more homogeneous across different performance metrics.
4.1.5 Comparison against other metrics
The following analysis compares the augmented version of the Silhouette Score against other metrics, specifically DBCVS and DS. The results in Table 2 indicate that DBCVS has issues for D4, D5, D6 and D7, where the performance values obtained for incorrect linear labellings are similar to or higher than those of the true labels. However, this is not the case for ED-SS. Although DBCVS was designed for density-based clustering algorithms, it is outperformed by the proposed method.
The second comparison, made against DS in Table 3, shows that DS correctly identifies the true labels with higher scores, as in the cases of D1, D2, D6 and D7; however, the scores obtained are very low and close to those of incorrect labellings, especially in the case of D3, D4 and D5. D1 has the highest score, indicating a correct clustering when paired with the true labels; however, a similar score is obtained with the labelling defined by a horizontal line as well. Overall, ED-SS has higher scores for the true labels and greater differences in scores between the true labels and the other types of labellings.
4.1.6 Analysis on 3D datasets
To further prove the generality of our method, we repeated the previous experiments and computed the scores on 3-dimensional versions of the D3, D4 and D5 datasets. Figure 9 illustrates the visual representations of these datasets. We again conduct a thorough analysis of 5 different labelling options, including the true and random labels. Three additional, analogous wrong labellings are introduced, where the points are separated by a diagonal plane (DP), a vertical plane (VP) and a horizontal plane (HP).
For all three of these datasets paired with the introduced labellings, we have compared the basic version of the clustering performance metrics and our augmented versions using the parametrization that has been shown to give the best results (NN = 5, La = 10).
Table 4 compares the scores of the basic and augmented versions of the Silhouette Score. Higher scores are achieved overall for all labellings besides the random labelling. Noticeably, the only wrong estimations of ED-SS are those on D3 paired with DP and VP, which SS also mistakes, indicating that the proposed approach does achieve its intent by improving the metric on non-convex clusters while still retaining the characteristics of the basic metric. Moreover, for the case of D3 paired with VP, where the basic version SS indicates a better clustering than the true labels, the proposed ED-SS correctly shows that the true labels are better, although only by a small margin.
For the basic and augmented versions of DBS and CHS, a similar trend can be seen in Tables 5 and 6. The obtained values indicate better scores overall for our augmented versions (for DBS, this is indicated by lower values). The only mistakes made by our augmented versions appear on D3 when paired with DP and VP, for which the basic versions also assign wrong values. Moreover, for D3 paired with HP, we can see that the basic version of DBS assigns a higher value than for the true labels, which is not the case for ED-DBS. This further solidifies our claim that the proposed approach is adequate.
4.1.7 Analysis on high-dimensional datasets
The following analysis assesses the proposed approach's ability to deal with high-dimensional data. This analysis was made on 5 real-world datasets that are available from the UCI (University of California, Irvine) machine learning repository [15]. The statistics of these datasets are presented in Table 7.
The original and modified SS, DBS and CHS versions were compared; however, in this analysis, we only consider the true and random labels, as the high dimensionality of the datasets hinders the creation of additional labels. For the comparison between SS and ED-SS shown in Table 8, we can see that the modified version offers lower values for both the true and random labellings, with the first three datasets having very low scores regardless of the labelling, indicating that the SS approach is not able to capture the structure of the data with the provided labellings. Overall, the difference between the true and random labels is distinguishable in both the original and modified versions, with the Statlog dataset having a bigger difference for ED-SS and a higher value for the true labels.
In Table 9, the same analysis is shown for DBS and ED-DBS; we can see that the random labellings have lower scores for ED-DBS, which can happen for spread-out, overlapping clusters, as the maximum edges found by the approach will be lower than the direct distances between points. Except for the Ecoli dataset, the ED-DBS scores are better (meaning lower values) for the true labellings than those of DBS; this indicates that ED-DBS is better able to estimate the performance of a correct clustering for more complex datasets, such as the last three datasets.
This last statement is supported by the results provided in Table 10, where the modified version of CHS obtains higher scores for all datasets except Statlog. In Tables 9 and 10, ED-DBS and ED-CHS have better scores than the original versions for almost all datasets paired with the true labels, while the pairings with random labels also obtain better scores than under the original versions. Nevertheless, the values remain distinguishable from each other.
4.1.8 Evaluation of time complexity
We analyse the impact of our method from a time complexity perspective by comparing the time in seconds needed for the computation of the Silhouette Score (SS) and our augmented version (ED-SS). In the first set of experiments, shown in Table 11, the number of examples in the D5 dataset (with 2 features) paired with TL has been varied from 100 to 10,000. From this table, we can see that, just as is the case with SS, the scores of ED-SS are hardly affected by the number of examples, as the variation that appears can be attributed to the random generation of the dataset based on cluster means and standard deviations. We can see that the proposed approach has a higher complexity than the basic version. This is to be expected, as paths need to be calculated through the points of the dataset. However, in the last column of the table, we showcase that increasing the NN parameter can greatly reduce the amount of time required for computation without affecting the score. This indicates that higher values of the NN parameter can be used for datasets with a high number of examples. Noticeably, the last column only contains results for the last two rows because the other experiments have too few examples for such a high value of the NN parameter to be applied.
We continue the analysis from the perspective of time complexity by computing the scores on the same D5 dataset paired with TL, with the number of examples fixed at 1000 while varying the number of features from 2 to 6, in Table 12. As the number of points remains fixed, the paths traversed by the proposed approach remain essentially the same; as such, the number of features should have a smaller effect on the time complexity, and this is confirmed by the values obtained in Table 12. Furthermore, if the scores obtained are analysed, we can indeed see that the proposed approach obtains slightly higher scores overall. However, if the results are compared across rows, it is noticeable that the same trend is maintained: when values are higher or lower for SS, so are those of ED-SS. The random generation of the dataset based on cluster means and standard deviations is the most likely cause of this.
In Table 13, the time complexity analysis of high-dimensional datasets is shown. As expected, the basic versions require a significantly lower amount of time than the augmented versions as they have a simpler approach to their computation. As was previously the case, the execution time of the augmented versions can be significantly reduced by increasing the NN parameter.
4.2 Analysis of effects on clustering
4.2.1 K-Means
The applications of the proposed approach are not limited to performance metrics. In this section, we have modified the K-Means [16] algorithm to use the proposed approach instead of the classical computation of Euclidean distance. We show the results of this approach in comparison to the classic K-Means in Fig. 10.
As already shown, internal clustering performance metrics cannot be used to evaluate the performance of the augmented K-Means, especially for the non-convex clusters present in D6 and D7 as they would receive low scores. Therefore, we have chosen to evaluate the performance of the augmented K-Means against the traditional K-Means and spectral clustering through the perspective of external performance metrics.
4.2.2 Evaluation of performance
External clustering performance metrics are used to numerically evaluate the impact of the proposed approach on the clustering of K-Means, denoted as ED-KM. There are several reasons for this: we have already shown that conventional internal metrics cannot assess clustering quality for non-convex clusters, and we have chosen to use the already established external metrics instead of our modified ones so as not to bias the analysis.
In Fig. 10, we showcase a visual comparison of the results obtained by our modified K-Means with five other clustering algorithms commonly used in the literature: the original K-Means, spectral clustering, DBSCAN, MeanShift, and agglomerative clustering. We use 7 datasets for the comparative analysis. Only 3 of the 6 algorithms perform well on non-convex structures: ED-K-Means, spectral clustering, and DBSCAN. Looking at the other datasets, spectral clustering does not adequately handle elongated clusters, as illustrated by D4. At the same time, DBSCAN does not correctly separate overlapping convex clusters with different densities, as in D5.
Tables 14 and 15 present this performance analysis on the 7 datasets shown in Fig. 10 from the perspective of the ARI and AMI metrics, respectively. The best performance was given by the parametrization presented in Table 16. It can be seen that our augmented version of K-Means has higher performance on the datasets containing non-convex clusters and even on some of the datasets containing convex clusters. Only on D3 does our augmented K-Means score slightly lower than the basic version, and on D5 it is slightly outperformed by the agglomerative clustering algorithm. These tables indicate that our method is able to outperform commonly used clustering algorithms on both convex and non-convex clusters, an ability that other clustering algorithms lack, as they tend to be more performant on only one of the two types of clusters. Various lookahead values result in similar clusterings; however, for D7, our augmented version requires higher lookahead values to obtain the intended clusters.
The performance was also evaluated on the set of high-dimensional real-world datasets from UCI presented in Table 7. As dimensionality increases, the performance of the clustering algorithms diminishes, as shown in Table 17. For the Ecoli, Glass and Statlog datasets, the modified version of K-Means outperforms both the classical K-Means and spectral clustering in terms of ARI and AMI. For the Yeast dataset, none of the algorithms correctly identify the clusters, the ARI and AMI values being close to those of random labelling, with classical spectral clustering having the highest values. In the case of the Wdbc dataset, both the original K-Means and spectral clustering outperform the proposed approach.
4.2.3 Evaluation of time complexity
Tables 18 and 19 present a time complexity analysis of the basic K-Means against our proposed method along two dimensions. In Table 18, the number of features is fixed at 2 and the number of examples is varied from 100 to 10,000, while in Table 19 the number of examples is fixed at 1000 and the number of features is varied from 2 to 6.
Table 18 showcases the impact of dataset size on time complexity. As expected from the execution times of the augmented metrics, the execution time of the augmented K-Means increases drastically with the number of examples; however, the quality of the clustering (as indicated by ARI and AMI) remains consistent, if not better than that of the basic K-Means, regardless of the number of examples. Furthermore, the last column of this table shows that increasing the NN parameter greatly reduces the computation time while leaving the clustering quality unaffected (as indicated by ARI and AMI). As was the case for the metrics, the last column shows values only for the higher numbers of examples, as the smaller datasets should not be combined with such a high value of the NN parameter.
Similar to the previous time analysis on the metrics, Table 19 analyses the impact of the number of features on the complexity. The number of features does not affect the computation time, which remains within the same range. Due to the random generation of the data and the random initialization of K-Means, some variation in convergence time is expected. The table clearly shows that, although the computation time is higher, it is unaffected by the number of features, and the scores are consistent with, if not better than, those of the basic K-Means.
The time complexity analysis on the real-world datasets is shown in Table 20. The basic versions of K-Means and SC require significantly less time to cluster the datasets; however, as shown in Table 17, they do not offer the best performance.
5 Discussions and conclusions
The results presented in this paper indicate that the introduction of our augmented distance substantially improves the evaluation of clustering quality as measured by the augmented versions of SS, DBS, and CHS compared to the basic versions. This comparison reveals valuable insights into their strengths and weaknesses across diverse clustering scenarios.
We applied our algorithm to 13 datasets (7 benchmark synthetic sets and 6 real sets) with multiple (or even joint) complexities. Our augmented versions of the clustering performance evaluation metrics show increased performance, particularly in the evaluation of non-convex clusters when paired with true labels. The augmented metrics maintain the core characteristics of the original metrics, avoiding the introduction of new weaknesses. A subset of the weaknesses of the basic clustering metrics remains: ED-SS still struggles to accurately assess elongated clusters, as evidenced by inconsistencies on certain datasets; ED-DBS can produce misleading results in specific scenarios; and for ED-CHS, challenges persist in scenarios involving dense or overlapping clusters, where the interpretation of inter-cluster distances becomes ambiguous.
The analysis on 3D datasets further highlights the robustness of our proposed method as consistent improvements were observed in clustering performance metrics. Our approach rectifies errors in clustering metrics, such as misestimations on certain types of labellings while retaining the essence of basic metrics. This statement is supported by the consistent enhancement shown in the comparisons between basic and augmented versions of Davies-Bouldin and Calinski-Harabasz scores.
The examination of time complexity reveals a higher computational cost due to the need to compute paths through the dataset points. However, this overhead can be mitigated by increasing the NN parameter, particularly for datasets with a large number of examples. Furthermore, the proposed method proves less sensitive to increases in dimensionality, as shown by our analysis of datasets with various numbers of features. The scores obtained favour the proposed method, while the strong correlation between the scores of the basic and augmented versions across diverse parameter settings supports the evaluation of the approach and its retention of the characteristics of the basic versions.
The augmentation of the K-Means clustering algorithm demonstrates the applicability of the proposed approach. The results show consistent improvements in clustering performance, especially for datasets with non-convex clusters; even on convex clusters, our augmented K-Means outperforms the 5 commonly used clustering algorithms it is compared against. Our decision to rely on established external metrics, rather than on the basic or augmented internal versions, ensures an unbiased evaluation of performance. As for the evaluation metrics, the time complexity analysis reveals that the size of the dataset significantly impacts computational cost, with larger datasets requiring more computation time, an overhead that can be mitigated by increasing the NN parameter, while the number of features has minimal impact on computation time.
The La parameter permits paths to traverse non-optimal nodes, with higher values improving performance on non-convex shapes; convex shapes provide 'straighter' paths from the start node to the end node and therefore do not require it. The NN parameter defines the number of nodes considered at each step and the number of nodes discarded at each step; as such, when the clusters are dense, the NN parameter can be increased.
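To make the roles of the two parameters concrete, the sketch below builds a greedy path between two points: at every step only the NN nearest unvisited neighbours are candidates, and up to La consecutive steps that do not approach the target are tolerated. This is an illustrative simplification of how such parameters behave in a path-based distance, not the exact procedure of the proposed edging distance.

```python
import numpy as np

def greedy_path_length(X, start, end, nn=5, la=2):
    """Illustrative greedy path length from X[start] to X[end] through data points.

    Simplified sketch only: nn bounds the candidate neighbours at each step,
    la bounds how many consecutive steps may move away from the target.
    """
    current, length, visited, stalls = start, 0.0, {start}, 0
    while current != end:
        remaining = [i for i in range(len(X)) if i not in visited]
        if not remaining:  # nothing left to route through
            return length + np.linalg.norm(X[end] - X[current])
        # NN: only the nn nearest unvisited neighbours are candidate steps
        remaining.sort(key=lambda i: np.linalg.norm(X[i] - X[current]))
        candidates = remaining[:nn]
        # move to the candidate closest to the target
        nxt = min(candidates, key=lambda i: np.linalg.norm(X[end] - X[i]))
        # La: tolerate a bounded number of non-improving (lookahead) steps
        if np.linalg.norm(X[end] - X[nxt]) >= np.linalg.norm(X[end] - X[current]):
            stalls += 1
            if stalls > la:
                return length + np.linalg.norm(X[end] - X[current])
        else:
            stalls = 0
        length += np.linalg.norm(X[nxt] - X[current])
        visited.add(nxt)
        current = nxt
    return length
```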
The findings presented in this paper emphasize the need for further investigation to address the limitations of clustering evaluation metrics, particularly for more complex cluster shapes. We highlight the importance of careful consideration in selecting the evaluation metric and interpreting clustering evaluation. The advancements observed in our augmented metrics represent a significant step forward in the quest for more effective and robust methodologies for the evaluation of clustering performance. These developments not only enhance our current understanding of clustering performance but also pave the way for the development of more robust metrics capable of addressing the inherent complexities of real-world datasets.
Data availability
No datasets were generated or analysed during the current study.
Code availability
The code that supports the findings of this study is openly available with a Python implementation at: https://github.com/ArdeleanRichard/edging-distance.
References
Ackermann MR, Blömer J, Kuntze D, Sohler C (2014) Analysis of agglomerative clustering. Algorithmica 69(1):184–215. https://doi.org/10.1007/s00453-012-9717-4
Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat Theory Methods 3:1–27. https://doi.org/10.1080/03610927408827101
Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G (eds) Advances in knowledge discovery and data mining. Springer, Berlin, pp 160–172
Carreira-Perpiñán MÁ (2015) A review of mean-shift algorithms for clustering
Celeux G (2008) Data clustering: theory algorithms and applications by Gan, G., Chaoqun, M. A., and Wu, J. Biometrics 64(2):656–657. https://doi.org/10.1111/j.1541-0420.2008.01026_7.x
Chang H, Yeung D-Y (2005) Robust path-based spectral clustering with application to image segmentation. In: Tenth IEEE international conference on computer vision (ICCV’05). 1:278–285
Cheng Y (1995) Mean shift, mode seeking, and clustering. IEEE Trans Pattern Anal Mach Intell 17(8):790–799. https://doi.org/10.1109/34.400568
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell PAMI-1(2):224–227. https://doi.org/10.1109/TPAMI.1979.4766909
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining. AAAI Press, Portland, Oregon, pp 226–231
Fischer B, Buhmann JM (2003) Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans Pattern Anal Mach Intell 25(4):513–518. https://doi.org/10.1109/TPAMI.2003.1190577
Garcia-Dias R, Vieira S, Lopez Pinaya WH, Mechelli A (2020) Chapter 13: Clustering analysis. In: Mechelli A, Vieira S (eds) Machine learning. Academic Press, pp 227–247
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2):107–145. https://doi.org/10.1023/A:1012801612483
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218. https://doi.org/10.1007/BF01908075
Irani J, Pise N, Phatak M (2016) Clustering techniques and the similarity measures used in clustering: a survey. Int J Comput Appl 134:9–14. https://doi.org/10.5120/ijca2016907841
Kelly M, Longjohn R, Nottingham K. UCI Machine Learning Repository. https://archive.ics.uci.edu/. Accessed 23 Mar 2024
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp Math Stat Probab. 1:281–298
Moulavi D, Andretta Jaskowiak P, Campello R, Zimek A, Sander J (2014) Density-based clustering validation. In: Proceedings of the 2014 SIAM international conference on data mining
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12(85):2825–2830
Rendón E, Abundez I, Arizmendi A, Quiroz EM (2011) Internal versus External cluster validation indexes. Int J Comput Commun 5(1):8
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7
Santos JM, Embrechts M (2009) On the use of the adjusted rand index as a metric for evaluating supervised classification. In: Alippi C, Polycarpou M, Panayiotou C, Ellinas G (eds) Artificial neural networks: ICANN 2009. Springer, Berlin, pp 175–184
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905. https://doi.org/10.1109/34.868688
Shirkhorshidi AS, Aghabozorgi S, Wah TY (2015) A comparison study on similarity and dissimilarity measures in clustering continuous data. PLoS ONE 10(12):e0144059. https://doi.org/10.1371/journal.pone.0144059
Steinley D (2004) Properties of the Hubert-Arabie adjusted Rand index. Psychol Methods 9(3):386–396. https://doi.org/10.1037/1082-989X.9.3.386
Strehl A, Ghosh J (2002) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
Veerabhadrappa R, Ul Hassan M, Zhang J, Bhatti A (2020) Compatibility evaluation of clustering algorithms for contemporary extracellular neural spike sorting. Front Syst Neurosci. https://doi.org/10.3389/fnsys.2020.00034
Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning. Association for Computing Machinery, New York, NY, USA, pp 1073–1080
von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416. https://doi.org/10.1007/s11222-007-9033-z
Author information
Contributions
Eugen-Richard Ardelean and Raluca Laura Portase contributed to funding acquisition, writing—review and editing, writing—original draft preparation, visualization, validation, methodology, investigation, conceptualization; Eugen-Richard Ardelean, Raluca Laura Portase, Rodica Potolea and Mihaela Dînșoreanu contributed to methodology, formal analysis and investigation; Rodica Potolea and Mihaela Dînșoreanu contributed to project administration, resources and supervision.
Ethics declarations
Conflict of interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.