1 Introduction

Clustering is an essential tool in data analysis that allows us to find the underlying structure in data by grouping similar data points. All clustering algorithms rely on distance computations to assess the similarity between data points, which is further used to organize the data into clusters. The goal is to ensure that points within the same cluster are closer to each other than those in different clusters.

When the true labels are not available, internal clustering performance evaluation metrics are employed to assess the quality of the clusters. These metrics also rely on distances: they determine whether the found clusters are separated by computing intra-cluster and inter-cluster distances. Thus, dense, well-separated convex clusters receive the highest scores, and as complexities are introduced into the data, the scores decrease even when the true labels are considered. The complex geometry of non-convex clusters is difficult to capture using traditional distance measures, and as traditional clustering evaluation metrics are based on these measures, they lead to erroneous evaluations.

This paper introduces a novel approach for distance computation, denominated the edging distance (ED), designed specifically for non-convex clusters. The proposed approach addresses the limitations of the Euclidean distance by iterating through the data points to discover the structure of the data. By considering the structure of the data in the computation of the distance, the proposed method offers a better estimation of the distances within and between clusters, which enhances the performance of clustering algorithms and clustering evaluation metrics. The structure of the data is integrated into the distance computation based on principles from graph theory: the distance is calculated along a path between the points, which can account for the complex shapes of non-convex clusters.

To demonstrate the effectiveness of our method, we conducted extensive experiments on several synthetic clustering benchmark datasets of varied characteristics: simple convex clusters, overlapped and imbalanced convex clusters, and non-convex clusters. The obtained results indicate that the proposed approach offers improvements in evaluating clustering performance, as the cohesion and separation of clusters are better estimated when employing our distance computation compared with the Euclidean distance. Moreover, our method does not affect clustering performance evaluation on convex clusters; thus, it is a more robust approach to evaluating clustering performance. This claim is supported by the results obtained on various erroneous labellings that traditional clustering evaluation metrics score higher than the true labels. To further showcase the method's applicability, we have demonstrated improved clustering performance by replacing the conventional Euclidean distance in K-Means with the proposed approach.

2 Related work

2.1 Distances

Choosing the appropriate distance measure poses a significant challenge for anyone endeavouring to apply distance-based clustering algorithms to datasets. The array of available similarity measures often leads to confusion and difficulty in selection. A study [23] aimed to elucidate the suitability of various similarity measures for low-dimensional and high-dimensional datasets by experimentally comparing similarity measures tailored for clustering numerical data on various datasets. Assessing the performance of these measures based on the Rand Index [13], the study identified that the average distance [5] consistently ranked among the most appropriate measures.

It has also been stated [14] that relying solely on distance metrics in clustering may not sufficiently capture correlations among data objects. Even when data points are well separated according to these metrics, they may still share similar patterns. Therefore, distance-based metrics alone are inadequate for capturing the intricate characteristics of data, and understanding the associations and dissociations between patterns of data points is crucial for revealing their closeness in similarity. Thus, the literature [5, 14, 23] indicates that distance computation needs improvement, specifically in clustering and in clustering performance evaluation metrics.

Path-based approaches have been shown to be capable of improving various methods, including clustering methods. One approach is to use a path-based cost function, as shown by [10], to create an agglomerative approach to clustering. The authors propose a criterion for the connectedness of two samples: the objects belong to the same cluster if a path of intermediate samples exists between them. In this case, the dissimilarity of two objects is computed as the minimum largest gap across all paths between the two samples. Another approach can be found in [6], where a path-based similarity measure is created to improve the performance of spectral clustering. The authors propose a Gaussian kernel to determine the similarity between samples in a dataset. The final similarity between two samples is the maximum, across all paths between these two samples, of the minimum edge along each path. These approaches are inefficient as they require all paths between pairs of samples to be computed. Nevertheless, they prove the applicability of path-based methods in improving various algorithms.

2.2 Internal clustering performance metrics

Internal metrics require the dataset and a set of labels, which can be the true labels or a clustering assignment. Although these metrics do not require a ground truth, they are most commonly used to evaluate clustering with respect to certain characteristics, such as intra-cluster and inter-cluster distributions, the shape of clusters and the separation of clusters. Clusters that respect the standard concept, i.e. well-separated, high-density convex clusters, receive higher scores. When evaluating the data with the true labels, these metrics can be used to evaluate the structure of the data and its separability into clusters. In this section, we review the fundamental concepts of popular internal clustering performance evaluation metrics, highlighting their functionality, strengths, and limitations. We succinctly present a summary of the information in this section on internal clustering performance metrics in Table 1, while the following subsections expand on this subject.

Table 1 A summary of internal clustering performance metrics including a description, their range and limitations

2.2.1 Silhouette score

The Silhouette score (SS) [19, 20] is an internal metric utilized for evaluating clustering performance at the individual sample level within a dataset. SS is determined by measuring the distances from a point to all other samples within its own cluster and the distances from the point to all samples of the closest cluster. The formula for computing the score of a single sample is given by Eq. (1).

$$s = \frac{b - a}{\max(a, b)}$$
(1)

Here, b represents the mean distance between a sample and all samples of the closest different cluster, while a denotes the mean distance between the sample and the other samples within its own cluster.

The SS for the entire dataset is obtained by averaging the scores of each individual sample. This metric is constrained within the [− 1, 1] interval. A score of − 1 indicates incorrect clustering, while a score of 1 indicates dense clustering and values close to 0 indicate overlapping clusters.
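To make Eq. (1) concrete, the per-sample computation can be sketched as follows; this is a minimal illustration rather than library code, and the distance function is deliberately left pluggable, since later sections replace the Euclidean distance with an alternative. The function name and arguments are ours, chosen for illustration only.

```python
import numpy as np
from scipy.spatial.distance import euclidean

def silhouette_sample(x, own_cluster, other_clusters, dist=euclidean):
    """Per-sample silhouette from Eq. (1) with a pluggable distance.

    `own_cluster` holds the other members of x's cluster, and `other_clusters`
    is a list of arrays, one per remaining cluster (illustrative sketch only).
    """
    a = np.mean([dist(x, p) for p in own_cluster])       # mean intra-cluster distance
    b = min(np.mean([dist(x, p) for p in c])              # mean distance to the closest
            for c in other_clusters)                       # other cluster
    return (b - a) / max(a, b)
```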

2.2.2 Davies-Bouldin score

The Davies-Bouldin score (DBS) [8, 12, 19] is an internal metric and it quantifies the similarity between each cluster and its most similar counterpart. Here, similarity is computed as the ratio between distances within a cluster and distances separating clusters:

$$R_{ij} = \frac{s_{i} + s_{j}}{d_{ij}}$$
(2)

In Eq. (2), Rij denotes the similarity between cluster i and its most similar cluster j. The terms si and sj represent the average distance from any point within cluster i or j, respectively, to its centroid, while dij represents the distance between the centroids of clusters i and j.

The score is computed as the arithmetic mean of the maximum similarity R (Eq. 2) over pairs of clusters i and j, as delineated in Eq. (3).

$$DBS = \frac{1}{k}\sum_{i = 1}^{k} \max_{j \in [1, k],\, j \ne i} R_{ij}$$
(3)

Here, k represents the number of clusters. DBS has a lower bound of 0, with superior cluster separation corresponding to lower values. Moreover, DBS tends to assign higher scores to clusters exhibiting convex shapes compared to other cluster geometries.

2.2.3 Calinski-Harabasz score

The Calinski-Harabasz score (CHS) [2, 19], also referred to as the Variance Ratio Criterion, evaluates clustering algorithms by considering the ratio of inter-cluster dispersion (the spread between clusters) to intra-cluster dispersion (the compactness within clusters). Both are measured as sums of squared distances.

$$CHS = \frac{tr(B_{k})}{tr(W_{k})} \cdot \frac{n - k}{k - 1}$$
(4)

Here, tr(Bk) represents the trace of the inter-cluster dispersion matrix, capturing the variability between clusters, while tr(Wk) denotes the trace of the intra-cluster dispersion matrix, measuring the variability within clusters. The score is computed based on the number of samples n clustered into k clusters within the dataset.

The CHS metric has a lower bound of 0, with higher scores indicating better-separated clusters.
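All three internal metrics reviewed above are available in scikit-learn under the function names shown below; a short usage sketch on a toy convex dataset (the dataset itself is illustrative) reads:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, labels = make_blobs(n_samples=300, centers=3, random_state=0)  # toy convex clusters

print(silhouette_score(X, labels))          # in [-1, 1], higher is better
print(davies_bouldin_score(X, labels))      # >= 0, lower is better
print(calinski_harabasz_score(X, labels))   # >= 0, higher is better
```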

2.2.4 Other internal metrics

2.2.4.1 Density-based clustering validation

The density-based clustering validation score (DBCVS) [17] was designed to evaluate the quality of density-based clustering algorithms, such as DBSCAN. It assesses the performance of the algorithm through cluster density rather than centroids or variance. The score is the weighted average of a per-cluster score, which is computed using mathematical formulations of the density sparseness of a cluster and the density separation of cluster pairs. The DBCVS metric is bounded in the [− 1, 1] range, where higher values indicate better density-based clustering solutions.

2.2.4.2 Dunn index

The Dunn Score (DS) evaluates clustering quality through the compactness and separation of clusters. DS is computed as the ratio between the minimum inter-cluster distance and the maximum intra-cluster distance. The DS metric is lower bounded at 0, with higher scores indicating better clustering.

2.3 External clustering performance metrics

In contrast to internal performance metrics, external metrics necessitate both the true labels and the clustering assignments, thus having limited use in unsupervised tasks. They employ various statistical techniques to assess the accuracy of clustering by evaluating the alignment between the labels produced by the clustering algorithm and those of the true labels.

2.3.1 Adjusted rand index

The Adjusted Rand Index (ARI) [21, 24, 27] is a metric derived from the Rand Index (RI) [13], that compares pairs of labels from true and predicted labels. It distinguishes between cases of agreement, where pairs belong to the same cluster, and disagreement, where pairs are in different clusters, thereby penalizing over-clustering. As expressed by Eq. (5), the RI is computed by dividing the total number of agreements by the sum of agreements and disagreements.

$$RI = \frac{\text{agreements}}{\text{agreements} + \text{disagreements}}$$
(5)

ARI adjusts for chance agreements using an expected index, as expressed by Eq. (6) to ensure consistent scores for random labelling.

$$ARI = \frac{RI - \text{Expected } RI}{\text{Maximum } RI - \text{Expected } RI}$$
(6)

RI represents the probability of agreement between two sets of labels on a randomly chosen pair, yielding a score of 1 for complete agreement and 0 for complete disagreement. Expressed through the ARI formula, scores fall within the [− 1, 1] interval, with negative values indicating labellings that agree less than expected by chance, values close to 1 indicating correct clusterings and values close to 0 indicating random labellings.

2.3.2 Adjusted mutual information

Adjusted Mutual Information (AMI) [25, 27] is an external metric derived from Mutual Information (MI) that incorporates the normalization used by Normalized Mutual Information. MI itself relies on entropy (H), as defined in Eq. (7), which, in turn, relies on probability.

$$H(U) = -\sum_{i = 1}^{|U|} P(i)\log(P(i))$$
(7)

Here P(i) represents the probability of a sample belonging to class Ui, calculated by dividing the number of samples with this property (|Ui|) by the total number of samples (N). By defining U and V as two label assignments and by modifying the entropy formula, we arrive at the MI formula:

$$MI(U, V) = \sum_{i = 1}^{|U|} \sum_{j = 1}^{|V|} P(i, j)\log\left(\frac{P(i, j)}{P(i)\,P(j)}\right)$$
(8)

where P(i, j) denotes the probability of a sample belonging to both Ui and Vj. This is the ratio of the number of samples in the intersection of Ui and Vj to the total number of samples. By utilizing the expected index, MI can be adjusted for chance to ensure a consistent score for random labellings, thus yielding AMI:

$$AMI(U, V) = \frac{MI(U, V) - E\{MI(U, V)\}}{\text{mean}\{H(U), H(V)\} - E\{MI(U, V)\}}$$
(9)

Similar to ARI, this metric is confined to the [− 1, 1] range, with random labellings given a score of 0, while negative values indicate erroneous assignments and values close to 1 indicate correct clustering.
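Both external metrics are available in scikit-learn; the following short sketch compares an illustrative predicted labelling against the ground truth. Note that both scores are invariant to permutations of the cluster identifiers.

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2]
pred_labels = [1, 1, 1, 0, 0, 2, 2, 2]   # permuted cluster ids, one sample misassigned

print(adjusted_rand_score(true_labels, pred_labels))
print(adjusted_mutual_info_score(true_labels, pred_labels))
```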

2.4 Clustering algorithms

2.4.1 K-Means

K-Means (KM) clustering, as outlined in [16], operates by partitioning the sample space into K partitions, assigning each object to the cluster whose mean is closest. It utilizes Euclidean distances [11, 16] within clusters and aims to minimize intra-cluster variance by minimizing the squared error. K-Means first initializes K centroids, typically through random assignment, and then updates their values iteratively until convergence. The iterative process consists of two main steps: firstly, assigning each sample to the nearest centroid based on distance, and secondly, updating the centroid's location to the mean of all samples assigned to its cluster.

One primary drawback of K-Means is its reliance on the user to specify the number of centroids [11], which can be challenging in unsupervised tasks. Additionally, it is non-deterministic and may converge to a local minimum, requiring multiple runs of the algorithm. Another limitation is its struggle with identifying clusters of arbitrary shapes. Despite these limitations, K-Means remains one of the fastest clustering algorithms, with time complexity of O(ndki), where n is the number of samples (or examples/instances in the dataset), d is the number of dimensions (also known as the number of features of the dataset), k is the number of clusters specified, and i is the number of iterations of the algorithm.

When the data lack linear separability, K-Means struggles to perform effectively. This algorithm forms distinct clusters based on the similarity to the cluster centre, typically using the Euclidean distance as the measure. Consequently, it can only establish linear decision boundaries, leading to poor performance when the data are not linearly separable. Nonetheless, by employing alternative similarity criteria or preprocessing the data to achieve linear separability, K-Means can operate effectively [11].
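The assign/update loop described above can be sketched as follows, with the distance function left pluggable; this is the step where, later in this paper, the Euclidean distance is substituted. The sketch is illustrative only (no convergence check, empty-cluster handling or restarts) and is not the exact implementation used in the experiments.

```python
import numpy as np

def kmeans(X, k, dist, n_iter=100, seed=0):
    """Minimal K-Means sketch with a pluggable distance function `dist(a, b)`."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iter):
        # assignment step: each sample goes to the nearest centroid
        labels = np.array([np.argmin([dist(x, c) for c in centroids]) for x in X])
        # update step: each centroid moves to the mean of its assigned samples
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# e.g. labels, centroids = kmeans(X, k=3, dist=lambda a, b: np.linalg.norm(a - b))
```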

2.4.2 Spectral clustering

Spectral clustering (SC), as outlined in the literature [28], operates by transforming the data into a graph representation, where nodes represent data samples and edges represent the relationships between them. The graph is partitioned into K clusters based on the eigenvalues and eigenvectors of a similarity matrix derived from the data. To overcome the limitations of K-Means, which relies on distance measures in the original feature space, SC captures nonlinear relationships more effectively via space transformation. SC’s ability to capture complex nonlinear relationships by operating in a transformed space allows it to identify clusters of arbitrary shapes.

The process of spectral clustering typically involves several steps. First, a similarity matrix is constructed based on pairwise similarities between data points, commonly using measures like Gaussian kernel similarity or k-nearest neighbours. Next, the Laplacian matrix of the graph is computed from the similarity matrix. Then, the eigenvectors corresponding to the smallest eigenvalues of the Laplacian matrix are extracted. These eigenvectors encode the underlying structure of the data and are used to embed the data into a lower-dimensional space. Finally, standard clustering algorithms, such as K-Means [16] or normalized cuts [22], are applied to the embedded data to partition it into K clusters. Due to the computation of the similarity matrix over all points and the eigenvalue decomposition of the Laplacian matrix, the time complexity of spectral clustering is approximately O(n3), where n is the number of samples. However, its complexity depends on the similarity measure, decomposition method and clustering algorithm employed. Thus, it may be computationally expensive, particularly when dealing with large datasets, due to the eigenvalue decomposition step.
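These steps are bundled in scikit-learn's SpectralClustering class, which can serve as a reference point; the parametrization below (nearest-neighbour affinity, K-Means on the spectral embedding) is one reasonable choice rather than a prescription, and the two-moons data is only illustrative.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two non-convex clusters

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)   # clusters of arbitrary shape can be recovered
```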

2.4.3 MeanShift

MeanShift (MS) [4, 7] is considered a mode-seeking algorithm. It offers a mathematical analysis for the localization of the maxima of a density function. From a clustering perspective, it identifies clusters by updating centroid candidates to the mean of the points in their region. In this way, it is similar to K-Means, as its geometry is also based on distances between points. MeanShift is an iterative process that shifts a kernel window towards high-density regions until it reaches convergence. Considering a dataset with n samples, the time complexity of MeanShift is O(n2). In contrast to K-Means, it does not require the number of clusters as an input, but it has difficulties with overlapping clusters and can result in underclustering. Nevertheless, a feature extraction step can circumvent this, and additionally, MeanShift can handle a high number of clusters and imbalances.

2.4.4 Agglomerative clustering

Agglomerative clustering (Agg) [1] applies a “bottom-up” approach to hierarchical clustering. Initially, each sample is assigned to an individual cluster. As the algorithm progresses through its iterations, these clusters merge. Pairs of clusters are merged based on a proximity matrix, which contains distances between points. Agglomerative clustering uses a linkage function to merge clusters based on distance information. One type of linkage is the Ward linkage, which analyses cluster variance rather than directly analysing distances. Its main disadvantage is that it requires the number of clusters as input. In addition, its complexity is demanding: considering a dataset with n samples, its time complexity is O(n3) and its space complexity O(n2).

2.4.5 DBSCAN

DBSCAN [9], or density-based spatial clustering of applications with noise, is a density-based clustering algorithm. In the first step, the method identifies the cores of clusters as regions with high densities and expands them. As a result, it defines clusters as high-density regions, while low-density regions are labelled as noise. It can be inferred that, despite its name, DBSCAN performs well only when clusters of similar densities are provided. It has difficulties in separating overlapped clusters and, especially, embedded ones. Unlike K-Means, DBSCAN does not require the number of clusters as input. In addition, DBSCAN can identify clusters with arbitrary shapes and is a mostly deterministic algorithm. DBSCAN's main disadvantage is its inability to distinguish between different densities of clusters. DBSCAN has a time complexity of O(n2), where n is the number of samples, while its memory complexity may vary between O(n) and O(n2), depending on the implementation. This complexity arises from the need to compute distances within neighbourhoods. DBSCAN has also been found to be useful within the domain of spike sorting [26].

2.4.6 HDBSCAN

HDBSCAN [3] is a hierarchical clustering algorithm and a modified version of DBSCAN. Just as DBSCAN does, it identifies a subset of sparse points as noise, while dense regions are considered cluster centres. This algorithm groups points based on a weighted graph by creating a minimum spanning tree, using, for example, Prim’s greedy algorithm. The single most important parameter is the minimum cluster size, which establishes the smallest number of points required for a group of points to be considered a cluster. The time complexity of the algorithm is similar to that of DBSCAN, O(n2), while it also has a space complexity of O(nd), where n is the number of points in the dataset and d the number of dimensions of any point.

3 Materials and methods

3.1 Problem characterisation

Unlike convex clusters, which have well-defined boundaries, non-convex clusters often exhibit irregular shapes, overlapping regions, or disconnected components. The Euclidean distance metric assumes that the shortest distance between two points is a straight line, which can lead to inaccuracies when dealing with non-convex clusters. As a result, clustering algorithms relying solely on Euclidean distance may struggle to properly capture the underlying structure of such clusters, potentially misclassifying points or failing to detect meaningful patterns. Therefore, alternative distance metrics or clustering algorithms designed to handle non-convex structures are often preferred in these scenarios to achieve more accurate and robust clustering results.

The most significant limitation of the Euclidean distance lies in its assumption of straight-line distances between points, which may not accurately reflect the actual distances within clusters exhibiting irregular shapes. In non-convex clusters, where boundaries can be intricate and curved, the Euclidean distance fails to capture the complexity of shapes effectively. Take, for example, a cluster with the shape of a circle; many distances are evaluated as values close to the diameter of the circle, even if the points are densely packed along it. Furthermore, the assumptions underlying many clustering algorithms, such as compactness and well-separated clusters, often do not hold for non-convex clusters, further undermining the efficacy of Euclidean distance-based methods. Most commonly, to address these limitations, alternative distance metrics (such as the Mahalanobis distance or cosine similarity) and clustering algorithms (such as DBSCAN or spectral clustering) that are better suited for capturing non-convex structures are employed.
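The circle example can be made concrete with a few lines of code: for points sampled densely on a unit circle, neighbouring points are very close, yet many pairwise Euclidean distances approach the diameter. The snippet below is only a numeric illustration of this observation.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)]        # 200 points on a unit circle

step = np.linalg.norm(circle[0] - circle[1])        # distance to the next point (~0.03)
across = np.linalg.norm(circle[0] - circle[100])    # distance to the opposite point (~2.0)
print(step, across)
```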

3.2 Proposed approach

In this work, we propose a new way to compute distances, denominated the edging distance (ED), to better identify complex structures, such as those of non-convex clusters, without penalizing the identification of simpler, more regular structures. Instead of the classical Euclidean distance, which overestimates the distance between points, especially in non-convex shapes, as it does not take the overall structure into consideration, the proposed approach identifies the distance between two points as the maximum edge in a guided path between these two points. An edge and its value are given by the Euclidean distance between two k-nearest neighbours. In this way, the structure of the data points determines the value of the distance between any two points. For example, two points in the same cluster have a low distance value, as the path between them is determined by many closely packed points. In comparison, the distance between two points in different clusters is determined by the distance between a border point of the cluster containing the start point and the end point, as that is the largest edge in such a path.

A simple illustration of the proposed approach is shown in Fig. 1, in comparison with the Euclidean distance. The proposed approach starts from the white node and, at each step, chooses the next node from between the two closest nearest neighbours of the current node. The resulting distance is computed as the weight of the maximum edge along the obtained path. The weights of the edges are the Euclidean distances between the connected nodes.

Fig. 1
figure 1

A comparison of the proposed edging distance and the Euclidean distance

Let there be a dataset of points that can be represented as nodes N = {n1, n2, …, nn} in a graph G = (N, E), where E represents the edges that connect k-nearest neighbours. An edge eij between nodes ni and nj has a weight equal to the Euclidean distance dEuclidean(ni, nj) between these nodes. The guided path P represents the sequence of edges that connects any nstart to nend in the graph G. ED is defined as the maximum edge weight between any ni and nj along the guided path between these two points. Thus, ED can be mathematically formulated plainly as:

$$ED(n_{\text{start}}, n_{\text{end}}) = \max_{(n_i, n_j) \in P} d_{\text{Euclidean}}(n_i, n_j)$$
(10)

The expanded form of Eq. (10), considering the k-nearest neighbourhood, is:

$$ED(n_{\text{start}}, n_{\text{end}}) = \max_{n_i \in P} d_{\text{Euclidean}}\left(n_i,\ \underset{n_j \in kNN(n_i)}{\arg\min}\, d_{\text{Euclidean}}(n_j, n_{\text{end}})\right)$$
(11)

where P represents the guided path from nstart to nend and kNN represents the k-nearest neighbourhood of a node. It is important to note that this mathematical formulation does not include the lookahead mechanism that is explained later.

An exact representation of these concepts can be viewed in Fig. 2, where a comparison between the proposed distance and the Euclidean distance is made. In this figure, we show that for convex clusters the intra-cluster distance remains much smaller than the inter-cluster distance, thus retaining the characteristics of the Euclidean distance.

Fig. 2
figure 2

Distances in well-separated convex clusters. A The intra-cluster distance of the proposed approach. B The inter-cluster distance of the proposed approach. C The intra-cluster Euclidean distance. D The inter-cluster Euclidean distance

However, in Fig. 3 where non-convex clusters are present, we show that the proposed approach takes into consideration the structure of the data and the intra-cluster distances are smaller than inter-cluster distances. For the Euclidean distance, it can be seen that intra-cluster distances are estimated to be larger than inter-cluster distances for non-convex clusters.

Fig. 3
figure 3

Distances in non-convex clusters. A The intra-cluster distance of the proposed approach. B The inter-cluster distance of the proposed approach. C The intra-cluster Euclidean distance. D The inter-cluster Euclidean distance

3.3 Detailed approach

The cluster centre has been changed from the mean of the cluster points to the point with the smallest total distance to all other points, as described in Algorithm 1. The conventional cluster centres can also be used if they are added as part of the dataset in the computation of the distances in the proposed approach. Nevertheless, this is not the best choice, as the cluster mean can lie outside of the cluster (especially for non-convex clusters), resulting in high intra-cluster distances, which is why we suggest this alternative.

Algorithm 1
figure a

The centre_from_data function that computes the cluster centre as the point with the smallest total distance to all other points.
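A minimal sketch of this centre selection (a simplified illustration of the idea in Algorithm 1, not the exact pseudocode) is given below, interpreting "smallest total distance" as the smallest summed distance, i.e. a medoid. The distance function is left pluggable so that ED can be used in place of the Euclidean distance.

```python
import numpy as np

def centre_from_data(cluster_points, dist):
    """Return the point with the smallest total distance to all other points."""
    totals = [sum(dist(p, q) for q in cluster_points) for p in cluster_points]
    return cluster_points[int(np.argmin(totals))]
```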

For the distance computations, the distance between two points is computed along a guided path from one point, considered the start node, to another, considered the end node. Each node of the path is chosen from among a subset of nearest neighbours of the current node as the one closest to the end node, and all other points in that neighbourhood are removed from consideration to improve execution time. As the path may not always be straight (as a path of collinear points would be), our approach incorporates a lookahead mechanism that allows the path to diverge for a given number of nodes, searching in another direction in case a path may be found; otherwise, the path is short-circuited to the end node by forcing the next node to be the end node.

The conventional distance is substituted in the proposed approach by the maximum edge in this guided path, where the value of each edge is the Euclidean distance between the two nodes that form the edge. We made this choice because it can minimize intra-cluster distances while maximizing inter-cluster distances even for non-convex clusters. Intra-cluster distances are minimized because, however long a path within a cluster is, its maximum edge is smaller than the conventional distance. Take, for example, the distance between the cluster centre and a border point: an edge of maximum length is obtained only if no points lie between them; any intermediate point would segment the path into two edges, lowering the value of the maximum edge and resulting in a smaller value than the Euclidean distance between the two points. Moreover, when clusters are well separated, a large edge must be created to step from one cluster to the other, and that edge defines the distance between points in different clusters, thus maximizing the inter-cluster distances. Furthermore, the proposed approach is deterministic as it does not incorporate any randomness: the distance between the same two points in the same dataset is always the same.

The pseudocode, presented in Algorithm 2, defines a function called max_edge_in_guided_path. This function aims to determine the maximum edge within a directed path from a starting node (start) to an end node (end), based on a given set of points X, with additional parameters including NN for nearest neighbour selection and lookahead for path exploration and short-circuiting.

Initially, the function checks if the starting node is identical to the ending node. If they match, it returns 0, as no traversal would be required. Subsequently, an array that tracks the visited nodes is initialized, and the next node in the path is set to the starting node.

The algorithm then proceeds with iterative exploration until the next node reaches the end node. During each iteration, the algorithm identifies the NN nearest neighbours (that have not been visited) of the current node and computes their distances to the end node. The neighbour closest to the end node is selected as the subsequent node to explore, unless all neighbours are further away from the end node than the current node.

Algorithm 2
figure b

The max_edge_in_guided_path function that describes the proposed way of computing the distance between two given points

For this case, we added a lookahead mechanism described by the pseudocode in Algorithm 3. During each iteration, the algorithm assesses whether all nearest neighbours are farther from the end node than the current node. If so, and if lookahead has not commenced, the algorithm initiates lookahead, preserving the current state.

If lookahead is underway and a closer node to the end node is discovered, the lookahead is reset to enable further iterations. When lookahead is active, the counter decrements, and if it reaches 0, indicating the lookahead limit has been reached, the algorithm evaluates whether a superior node (one closer to the end node) has been identified. If the counter has reached 0 and no superior node was identified, it resets the path to the pre-lookahead state, in essence short-circuiting the traversal. Otherwise, it resets the counter and continues lookahead.

All other nodes in the NN-neighbourhood are marked as visited after selecting the next node based on the above criteria. If the path length is 1, indicating that the ending node is directly adjacent to the starting node, the function calculates and returns the distance between them. In cases where the path consists of multiple nodes, the algorithm traverses the path to compute the maximum edge between consecutive nodes. Finally, the function returns the maximum edge length found along the directed path.
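The traversal just described can be sketched as follows. This is a simplified illustration of Algorithms 2 and 3 rather than the exact pseudocode: in particular, the lookahead handling below merely tolerates a fixed budget of consecutive non-improving steps before short-circuiting to the end node, instead of restoring the pre-lookahead state.

```python
import numpy as np

def max_edge_in_guided_path(X, start, end, NN=3, lookahead=5):
    """Simplified sketch of the guided-path distance (see Algorithms 2 and 3).

    `X` is the dataset (one row per point); `start` and `end` are row indices.
    """
    if start == end:
        return 0.0
    d = lambda i, j: float(np.linalg.norm(X[i] - X[j]))
    visited = {start}
    path, current, budget = [start], start, lookahead
    while current != end:
        unvisited = [i for i in range(len(X)) if i not in visited]
        if not unvisited:                     # safety guard: nothing left to explore
            path.append(end)
            break
        # NN nearest unvisited neighbours of the current node
        neighbours = sorted(unvisited, key=lambda i: d(current, i))[:NN]
        nxt = min(neighbours, key=lambda i: d(i, end))   # neighbour closest to the end node
        if d(nxt, end) >= d(current, end):    # moving away from the end node
            budget -= 1
            if budget < 0:                    # lookahead exhausted: short-circuit
                path.append(end)
                break
        else:
            budget = lookahead                # progress made: reset the lookahead budget
        visited.update(neighbours)            # the whole neighbourhood is marked as visited
        path.append(nxt)
        current = nxt
    # the distance is the largest edge along the obtained path
    return max(d(a, b) for a, b in zip(path, path[1:]))
```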

The two parameters offered have an impact on the results obtained:

  i. The NN parameter determines how large the neighbourhood should be. As unused neighbours are deleted and cannot appear in subsequent iterations, higher values of NN reduce execution time but may generate larger edges along the path, resulting in lower scores.

  ii. The lookahead (La) parameter allows the path to veer towards directions that might not seem optimal at first; thus, higher lookahead values can obtain higher scores but may increase execution time. Lower values of lookahead, while reducing execution time, might short-circuit the path exploration, resulting in lower scores.

In essence, this pseudocode outlines an algorithmic approach for efficiently identifying the maximum edge length within a guided path in a given set of points, employing strategies such as neighbour selection and path exploration.

Algorithm 3
figure c

The lookahead_mechanism function that allows for the traversal of seemingly non-optimal paths

As the proposed approach is a modification of the distance computation, it is widely applicable. Here, we showcase each of the internal metrics presented in the Related Work section modified to use the proposed approach for distance computation. Therefore, all three metrics have been updated to incorporate the two modifications that encompass the proposed approach. In the next section, we analyse the impact of the modifications on performance compared to the basic versions, as well as the impact of various parametrizations. Several applications can be found for the proposed approach; here, we showcase our augmented version of K-Means and evaluate its performance against other clustering algorithms on several datasets of various sizes and dimensionalities.

In the creation of the paths provided by the proposed approach, the traversal always moves from one node towards the neighbour from a nearest-neighbour vicinity that is closest to the end node. The lookahead mechanism works on the same principle but allows the path to go to nodes further from the end node than the current node. Thus, the proposed approach is deterministic, and any inconsistencies in results that may appear depend on the algorithm it is used within. An example of this is the K-Means algorithm, where the centroids (end nodes) are randomly instantiated at the beginning. As the internal metrics are based on the distance computation between pairs of samples, the proposed approach substitutes the Euclidean distance between two samples, where one is considered the start node and the other the end node.

We illustrate a simple example of the lookahead mechanism in Fig. 4. Consider a path through nodes with the start node (0,0) and the end node (10, 0). The correct maximum edge in such a path is the edge between (8,4) and (10, 0). In the guided path of the proposed approach with NN = 1, when the (4,4) node is reached, the nearest neighbour would be (4,5), which is further away from the end node than the current node, inducing a short-circuit to the end node. This will result in a larger maximum edge, as shown on the left side of Fig. 4. Through the addition of the lookahead mechanism, a nearest neighbour that is further away from the end node than the current node can be reached, in this case (4,5), which allows us to explore the path further in order to find the correct maximum edge, as shown on the right side of Fig. 4.

Fig. 4
figure 4

An example of the inner workings of the lookahead mechanism

4 Results

Several analyses have been conducted to determine the performance of the proposed approach in different applications. We have analysed the impact of the proposed approach when applied to clustering performance metrics. For each internal metric presented in the Related Work section, an augmented form has been created. We have also analysed the impact in clustering by augmenting the K-Means algorithm to use the proposed distance instead of the conventional Euclidean distance.

To prove the efficacy of the proposed method, we have compared the basic clustering performance evaluation metrics with their augmented versions obtained through the proposed approach, using several datasets with various characteristics, such as convexity, overlap and density. A similar comparison has been made between the traditional and augmented versions of K-Means using several datasets. Furthermore, we have also analysed the time complexity obtained by augmenting these algorithms with the proposed method, for both of these applications.

4.1 Analysis of effects on clustering performance metrics

We compared the proposed approach with traditional clustering performance metrics. Three evaluation metrics have been chosen, specifically SS, DBS and CHS. Each has been modified to use the proposed approach and annotated with a prefix; for example, ED-SS denotes our augmented version that was compared with the basic version of the Silhouette Score, denoted SS. We have chosen 7 datasets [18], each with different characteristics, from simple convex clusters to overlapping convex clusters with different densities to non-convex clusters. We have also chosen to analyse 5 different labelling options, including the true labels and random labels. The true labels are required to ascertain that they will receive high values, while random labels are required to ensure that such labelling is penalized.

Three additional wrong labellings are introduced, where the points are linearly separated to obtain the labels by a diagonal line (DL), a vertical line (VL) and a horizontal line (HL). An ideal metric would produce the best scores for the true labels, while any other labelling would preferably result in lower scores depending on how far it is from the correct answer. Thus, we introduced linearly separated labellings in order to further test the clustering quality evaluation that these metrics provide by checking if the true labels will have higher scores than any other labels. Each of these additional labellings offers different characteristics depending on the dataset it is paired with. The visual representation of the datasets is illustrated in Fig. 5.
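For a 2-dimensional dataset, such labellings can be generated with a few comparisons; the construction below is one straightforward way to obtain them (the exact line placement used for each dataset may differ) and is shown only to make the labelling scheme explicit.

```python
import numpy as np

def linear_labellings(X, seed=0):
    """Illustrative construction of the DL, VL, HL and RL labellings for 2D data."""
    rng = np.random.default_rng(seed)
    x, y = X[:, 0], X[:, 1]
    return {
        "DL": (y > x).astype(int),          # diagonal line from bottom-left to top-right
        "VL": (x > 0).astype(int),          # vertical line at x = 0
        "HL": (y > 0).astype(int),          # horizontal line at y = 0
        "RL": rng.integers(0, 2, len(X)),   # random labels (binary here for simplicity)
    }
```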

Fig. 5
figure 5

Plots of 7 different datasets paired with 5 different labellings. On the rows of this figure, the datasets are presented. (D1) A dataset with 3 easily separable convex clusters. (D2) A dataset with 5 easily separable convex clusters. (D3) A dataset [18] with 3 convex clusters, out of which 2 have a slight overlap. (D4) A dataset [18] with 3 elongated convex clusters. (D5) A dataset [18] with 3 slightly overlapping convex clusters of various densities. (D6) A dataset [18] with 2 non-convex clusters as semi-moons. (D7) A dataset [18] with 2 non-convex clusters as concentric circles. On the columns of this figure, the labellings are presented. (True Labels—TL) The correct labels that perfectly separate the clusters. (Diagonal Line—DL) Linear separation into labels through a diagonal line from the bottom-left to the top-right separates data points into clusters. (Vertical Line—VL) Linear separation into labels through a vertical line with x = 0 separates data points into clusters. (Horizontal Line—HL) Linear separation into labels through a horizontal line with y = 0 separates data points into clusters. (Random Labels—RL) Data points have been assigned random labels

4.1.1 Silhouette score

This section presents the comparison between the basic Silhouette Score (SS) and our augmented version of the Silhouette Score (ED-SS).

As expected, SS results in higher values for conventional clusters, such as those presented in D1, D2, D3 and D5, when paired with the true labels. Moreover, for all datasets, when paired with random labels, it results in values close to 0. However, several inconsistencies occur when dealing with non-convex clusters. As can be seen from Fig. 6, the unconventional labellings (DL, VL and HL) obtain higher values than the true labels for non-convex datasets (D6, D7). This should not be the case, as it is clear to the naked eye that they are not correct clusterings. Moreover, a similar trend appears for elongated clusters (D4) where unconventional labellings receive a higher score than the true labels.

Fig. 6
figure 6

A comparison of the basic version of the Silhouette Score (SS, panel A) and our augmented version (ED-SS, panels B–D) with various parametrizations. We have chosen to highlight erroneous values (in comparison to the true labels as reference) with red stars (colour figure online)

This is not the case for ED-SS; as we can see from Fig. 6, it is able to correctly assign higher values to non-convex clusters when paired with true labels rather than with unconventional or random labels. However, the modifications brought still retain the weakness of the metric when considering elongated clusters. There is also an additional case where the evaluation is mistaken, when evaluating D3 with VL: due to the overlap between the light-blue clusters and the functionality of the proposed distance computation, this labelling results in low values for intra-cluster distances and, as such, receives a high score. To explain this phenomenon further, D3/D4 paired with VL results in low intra-cluster distances because the clusters incorrectly identified by the labels overlap, while the inter-cluster distances are high; with the true labels, on the other hand, the inter-cluster distances are low because the clusters overlap, and as such the true labels result in lower scores. Nevertheless, for all other cases ED-SS retains the characteristics of the basic SS metric and results in a correct evaluation.

Higher values of NN generate larger neighbourhoods; thus, more points are removed at each step as they are considered visited, which results in larger edges along the path and produces lower scores. This can be observed when comparing the scores in Fig. 6B and C. By comparing Fig. 6B and D, the impact of the lookahead parameter can be observed. The effect of modifying this parameter varies from dataset to dataset; nevertheless, a general trend is that true labels receive a lower score, while wrong labellings (DL, VL, HL) receive a slightly higher score.

4.1.2 Davies-Bouldin score

DBS follows a similar pattern to that of SS, where non-convex clusters (D6, D7) and elongated clusters (D4) receive worse scores (meaning higher values) for true labels. Nevertheless, DBS can identify random labels as incorrect and generate bad scores (meaning higher values). Our augmented version of DBS, denoted ED-DBS, generates better scores (meaning lower values) for non-convex clusters paired with true labels than either unconventional or random labels, indicating that the proposed approach does indeed offer a better evaluation. When comparing the basic and the augmented versions, ED-DBS has a wrong evaluation only when considering D5 paired with DL/VL as these labellings generate lower distances resulting in better scores. However, the remaining cases are correct as true labels are identified with better scores (meaning lower scores) than other labellings. All of these observations can be inferred from the analysis of the values obtained in Fig. 7.

Fig. 7
figure 7

A comparison of the basic version of the Davies-Bouldin Score (DBS, panel A) and our augmented version (ED-DBS, panels B–D) with various parametrizations. We have chosen to highlight erroneous values (in comparison to the true labels as reference) with red stars (colour figure online)

DBS is a slightly simpler metric when compared with SS and as such, certain wrong labellings can result in better scores (meaning lower scores) than the true labels. This becomes apparent when analyzing the VL labelling, where D3, D4 and D5 paired with VL obtain better scores than the true labels. This phenomenon happens due to the fact that DBS only uses the distance between the cluster centres and their scatter to compute the performance. In the case of VL, for those specific datasets, the separation offered by this labelling results in better distance values.

The comparison between Fig. 7B and C indicates that higher values of the NN parameter result in worse scores; however, the execution is faster, and in relative terms the true labels still receive better scores than any other labelling. By comparing Fig. 7B and D, the impact of the lookahead parameter emerges. For lower lookahead values, the scores obtained for the true labels remain similar, while the wrong labellings (DL, VL, HL) tend to have better scores (meaning lower values) than the true labels. This indicates that for DBS, higher values of lookahead are necessary for a correct evaluation.

4.1.3 Calinski-Harabasz score

The true labels are expected to obtain higher scores for the CHS clustering metric, while random labels would result in the lowest scores (close to 0). However, similar to SS and DBS, this is not the case for non-convex clusters (D6 and D7) and for close elongated clusters (D4), where CHS yields lower scores. Our augmented version of CHS, denoted as ED-CHS produces higher values for the non-convex clusters (D6 and D7) when paired with true labels than with any other labelling.

As in the cases of ED-SS and ED-DBS, it can be seen from Fig. 8 that two erroneous evaluations are yielded by ED-CHS when considering D5 paired with DL/VL, while all remaining cases retain the characteristics of the basic version of the metric. Notably, the scores of D6 and D7 when paired with true labels are lower for ED-CHS than for CHS. This should not be an issue, as the true labels still receive higher scores than any other labelling.

Fig. 8
figure 8

A comparison of the basic version of the Calinski-Harabasz Score (CHS, panel A) and our augmented version (ED-CHS, panels B–D) with various parametrizations. We have chosen to highlight erroneous values (in comparison to the true labels as reference) with red stars (colour figure online)

For the same reasons related to intra-cluster distances, D5 paired with DL/VL has higher values for ED-CHS as well. However, for ED-CHS, D3 and D4, when paired with VL, receive a higher score than when paired with the true labels; this happens because the high density of points skews the cluster centres and induces a high inter-cluster distance, while the closeness or overlap of the clusters keeps the intra-cluster distances small.

Similarly to the previous metrics, when increasing the NN parameter, the scores of the true labels become worse, while those of wrong labels tend to increase. This becomes evident when comparing Fig. 8B and C. Decreasing the lookahead parameter seems to affect only a subset of cases, while most show only slight variation when comparing Fig. 8B and D.

4.1.4 Theoretical considerations

The augmentation of each metric using the proposed approach requires any distance computation within that metric to be changed into ED. Depending on the metric and how it calculates intra-cluster and inter-cluster distances, these might be affected to varying degrees. Nevertheless, as a general rule, inter-cluster distances, having higher values, are less affected in absolute terms by the modification in distance computation. Intuitively, this is because for intra-cluster distances the identified path does not leave the cluster, thus obtaining lower values, whereas for inter-cluster distances the maximum edge in the path is the edge that connects the two clusters, therefore having higher values.

For SS, as it considers inter-cluster distances as the distance from a point to the closest distinct cluster, it is affected to a certain degree. Consider a point in the middle of a cluster: in the basic version, SS, its distance to a sample in the middle of another cluster would be greater than its distance to a sample on that cluster's border. However, in our augmented version, ED-SS, all samples from one cluster would have approximately the distance from the border to the closest distinct cluster. It is important to note that this is the general case; an exception can be found for highly overlapping and sparse clusters, although this would affect the basic version as well. For DBS, the distance between clusters is computed only between their means. Similarly to SS, in the basic version the distance between the centres is greater than in our augmented version, because in the latter the distance is effectively computed between the cluster borders, under the assumption that the clusters are separated. For CHS, the distance between each centre is computed to the general mean of the data.

In principle, the inter-cluster distance is approximately the distance between the borders for our augmented version, rather than the distance from one sample to the border or the distance from one centre to another (depending on how the inter-cluster distance is calculated in the considered score). As such, in our augmented version, the inter-cluster distance can be considered more homogeneous across different performance metrics.

4.1.5 Comparison against other metrics

The following analysis compares the augmented version of the Silhouette Score against other metrics, specifically DBCVS and DS. The results obtained in Table 2 indicate that DBCVS has issues for D4, D5, D6 and D7, where the performance values obtained for incorrect linear labellings are similar to or higher than those of the true labels. However, this is not the case for ED-SS. Although DBCVS was designed for density-based clustering algorithms, it is outperformed by the proposed method.

Table 2 A comparison of the basic version of the DBCVS and ED-SS parametrized with NN = 5 and La = 3

The second comparison, made against DS in Table 3, shows that DS correctly identifies the true labels with higher scores, as in the cases of D1, D2, D6 and D7; however, the scores obtained are very low and close to those of incorrect labellings, especially in the case of D3, D4 and D5. D1 has the highest score, indicating a correct clustering when paired with the true labels; however, a similar score is obtained with the labelling defined by a horizontal line as well. Overall, ED-SS has higher scores for the true labels and greater differences in scores between the true labels and the other types of labellings.

Table 3 A comparison of the basic version of the DS and ED-SS parametrized with NN = 5 and La = 3

4.1.6 Analysis on 3D datasets

To further prove the generality of our method, we repeated the previous experiments and computed the scores on 3-dimensional versions of the D3, D4 and D5 datasets. Figure 9 illustrates the visual representations of these datasets. We continue to conduct a thorough analysis of 5 different labelling options, including the true and random labels. Three additional analogous wrong labellings are introduced, where the points are separated by a diagonal plane (DP), a vertical plane (VP) and a horizontal plane (HP).

Fig. 9
figure 9

Plots of 3 different datasets with points in three dimensions, paired with 5 different labellings. On the rows of this figure, the datasets are presented. (D3) A dataset with 3 slightly overlapping convex clusters of various densities. (D4) A dataset with 3 elongated convex clusters. (D5) A dataset with 3 convex clusters, out of which 2 have a slight overlap. On the columns of this figure, the labellings are presented. (True Labels—TL) The correct labels that separate the clusters perfectly. (Diagonal Plane—DP) A diagonal plane determined by the bottom-front-right, top-front-left and top-back-right corners separates data points into two clusters. (Vertical Plane—VP) A vertical plane with x = 0 separates data points into two clusters. (Horizontal Plane—HP) A horizontal plane with z = 0 separates data points into two clusters. (Random Labels—RL) Data points have been assigned random labels

For all three of these datasets paired with the introduced labellings, we have compared the basic version of the clustering performance metrics and our augmented versions using the parametrization that has been shown to give the best results (NN = 5, La = 10).

Table 4 compares the scores of the basic and augmented versions of the Silhouette Score. Higher scores are achieved overall for all labellings besides the random labelling. Noticeably, the only wrong estimations of ED-SS are those on D3 paired with DP and VP, which the basic SS also mistakes, indicating that the proposed approach does achieve its intent of improving the metric on non-convex clusters while still retaining the characteristics of the basic metric. Moreover, for the case of D3 paired with VP, where the basic version SS indicates that it is a better clustering than the true labels, the proposed method ED-SS correctly shows that the true labels are better, although only by a small margin.

Table 4 A comparison of the basic version of the Silhouette Score (SS) and our augmented version (ED-SS) parametrized with NN = 3 and La = 10 on 3D versions of D3, D4 and D5

For the basic and augmented versions of DBS and CHS, a similar trend can be observed in Tables 5 and 6. The obtained values indicate higher scores overall for our augmented versions (for DBS, this corresponds to lower values). However, the only mistakes made by our augmented versions appear for D3 when paired with DP and VP, for which the basic versions also assign wrong values. Moreover, for D3 paired with HP, we can see that the basic version of DBS assigns a higher value than for the true labels, which is not the case for ED-DBS. This further solidifies our claim that the proposed approach is adequate.

Table 5 A comparison of the basic version of the Davies-Bouldin Score (DBS) and our augmented version (ED-DBS) parametrized with NN = 3 and La = 10 on 3D versions of D3, D4 and D5
Table 6 A comparison of the basic version of the Calinski-Harabasz Score (CHS) and our augmented version (ED-CHS) parametrized with NN = 3 and La = 10 on 3D versions of D3, D4 and D5

4.1.7 Analysis on high-dimensional datasets

The following analysis assesses the proposed approach's ability to deal with high-dimensional data. This analysis was made on 5 real-world datasets that are available from the UCI (University of California, Irvine) machine learning repository [15]. The statistics of these datasets are presented in Table 7.

Table 7 Information about the chosen high-dimensional datasets

The original and modified SS, DBS and CHS versions were compared; however, in this analysis, we only consider the true and random labels, as the high dimensionality of the datasets hinders the creation of additional labels. For the comparison between SS and ED-SS shown in Table 8, we can see that the modified version offers lower values for both the true and random labellings, with the first three datasets having very low scores regardless of the labelling, indicating that the approach of SS is not able to capture the structure of the data with the provided labellings. Overall, the difference between the true and random labels is distinguishable in both the original and modified versions, with the Statlog dataset having a larger difference for ED-SS and a higher value for the true labels.

Table 8 A comparison of the original version of the Silhouette Score (SS) and the modified version (ED-SS) parametrized with NN = 3 and La = 10 on the high-dimensional datasets presented in Table 7

In Table 9, the same analysis is shown for DBS and ED-DBS; we can see that the random labellings have lower scores for ED-DBS, which can happen in the case of spread-out clusters with overlap, as the maximum edges found by the approach will be lower than the distances between points. Except for the Ecoli dataset, the ED-DBS scores for the true labellings are better (meaning lower values) than those of DBS; this indicates that ED-DBS is better able to estimate the performance of a correct clustering for more complex datasets, such as the last three datasets.

Table 9 A comparison of the original version of the Davies-Bouldin Score (DBS) and the modified version (ED-DBS) parametrized with NN = 3 and La = 10 on the high-dimensional datasets presented in Table 7

This last statement is supported by the results in Table 10, where the modified version of CHS obtains higher scores for all datasets except Statlog. In Tables 9 and 10, ED-DBS and ED-CHS achieve better scores than the original versions for almost all datasets paired with true labels, although the pairings with random labels also receive better scores than under the original versions. Nevertheless, the values for the two labellings remain clearly distinguishable.

Table 10 A comparison of the original version of the Calinski-Harabasz Score (CHS) and the modified version (ED-CHS) parametrized with NN = 3 and La = 10 on the high-dimensional datasets presented in Table 7

4.1.8 Evaluation of time complexity

We analyse the impact of our method from a time complexity perspective by comparing the time, in seconds, needed for the computation of the Silhouette Score (SS) and our augmented version (ED-SS). In the first set of experiments, shown in Table 11, the number of examples in the D5 dataset (with 2 features) paired with TL is varied from 100 to 10,000. Just as with SS, the scores of ED-SS are hardly affected by the number of examples; the variation that appears can be attributed to the random generation of the dataset from cluster means and standard deviations. The proposed approach has a higher complexity than the basic version, which is expected, as paths need to be computed through the points of the dataset. However, the last column of the table shows that increasing the NN parameter can greatly reduce the computation time without affecting the score, indicating that higher values of the NN parameter can be used for datasets with a large number of examples. Notably, the last column only contains results for the last two rows because the remaining experiments have too few examples for such a high value of the NN parameter.

Table 11 Time comparison for Silhouette Score (SS) and our augmented version (ED-SS)
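As an illustration of how such timings can be gathered, the sketch below times the standard Silhouette Score on synthetic Gaussian data of increasing size. It assumes scikit-learn; the generated dataset is only loosely analogous to D5, and the commented-out ed_silhouette_score call is a hypothetical placeholder for the augmented metric.

import time
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def timed(fn, *args, **kwargs):
    # Return (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

for n_samples in (100, 1000, 10000):
    X, y = make_blobs(n_samples=n_samples, n_features=2, centers=3, random_state=0)
    score, elapsed = timed(silhouette_score, X, y)
    print(f"n={n_samples:>6}  SS={score:.3f}  time={elapsed:.3f}s")
    # The augmented version would be timed in the same way, e.g.
    # ed_score, ed_elapsed = timed(ed_silhouette_score, X, y, nn=3, la=10)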

We continue the time complexity analysis by computing the scores on the same D5 dataset paired with TL, with the number of examples fixed at 1,000 while varying the number of features from 2 to 6 (Table 12). As the number of points remains fixed, the paths traversed by the proposed approach remain the same; consequently, the number of features should have a smaller effect on the time complexity, and this is confirmed by the values obtained in Table 12. Furthermore, the proposed approach obtains slightly higher scores overall. If the results are compared across rows, the same trend holds: when the values of SS are higher or lower, so are those of ED-SS. This is most likely caused by the random generation of the dataset from cluster means and standard deviations.

Table 12 Time complexity comparison between Silhouette Score (SS) and our augmented version (ED-SS) where we change the number of features in each iteration 

In Table 13, the time complexity analysis of the high-dimensional datasets is shown. As expected, the basic versions require significantly less time than the augmented versions, as their computation is simpler. As before, the execution time of the augmented versions can be significantly reduced by increasing the NN parameter.

Table 13 Time complexity comparison between the basic versions and our augmented versions of the scores (with a parametrization of NN = 3, La = 10) for the high-dimensional datasets presented in Table 7

4.2 Analysis of effects on clustering

4.2.1 K-Means

The applications of the proposed approach are not limited to performance metrics. In this section, we modify the K-Means [16] algorithm to use the proposed distance instead of the classical Euclidean distance. The results of this approach are compared with the classic K-Means in Fig. 10.
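To make the modification concrete, the sketch below shows a Lloyd-style K-Means loop written against a pluggable pairwise-distance function. Here pairwise_dist is a placeholder into which the proposed edging distance could be substituted (the distance itself is not reproduced here), and keeping the coordinate mean as the centre update is one design choice among several, not necessarily the one used in our implementation.

import numpy as np

def kmeans_custom(X, k, pairwise_dist, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point joins the centre that is closest
        # according to the supplied distance function.
        d = pairwise_dist(X, centers)          # shape (n_samples, k)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def euclidean_pairwise(A, B):
    # With this distance the classic K-Means behaviour is recovered.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)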

Fig. 10 A comparison of the results of K-Means, our augmented K-Means using the proposed distance, and various clustering algorithms on the datasets presented in Fig. 5

As already shown, internal clustering performance metrics cannot be used to evaluate the performance of the augmented K-Means, especially for the non-convex clusters present in D6 and D7, as these would receive low scores. Therefore, we evaluate the performance of the augmented K-Means against the traditional K-Means and spectral clustering from the perspective of external performance metrics.

4.2.2 Evaluation of performance

External clustering performance metrics are used to numerically evaluate the impact of the proposed approach on the clustering of K-Means, denoted as ED-KM. There are two reasons for this choice: we have already shown that conventional internal metrics cannot assess clustering quality for non-convex clusters, and we use already established external metrics rather than our modified ones to avoid biasing the analysis.

In Fig. 10, we present a visual comparison of the results obtained by our modified K-Means with five other clustering algorithms commonly used in the literature: the original K-Means, spectral clustering, DBSCAN, mean shift, and agglomerative clustering, on 7 datasets. Only 3 of the 6 algorithms perform well on non-convex structures: ED-K-Means, spectral clustering, and DBSCAN. On the other datasets, spectral clustering does not adequately handle elongated clusters, as illustrated by D4, while DBSCAN does not correctly separate overlapping convex clusters with different densities, as in D5.
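The comparative protocol itself is straightforward to reproduce with scikit-learn. The sketch below runs the five reference algorithms on a toy non-convex dataset and scores them with ARI and AMI; the dataset and parameters are illustrative only, not the exact parametrization reported in Table 16.

from sklearn.cluster import (KMeans, SpectralClustering, DBSCAN,
                             MeanShift, AgglomerativeClustering)
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# A toy non-convex dataset; the paper's benchmark datasets are not bundled here.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

algorithms = {
    "K-Means":       KMeans(n_clusters=2, n_init=10, random_state=0),
    "Spectral":      SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                                        random_state=0),
    "DBSCAN":        DBSCAN(eps=0.2, min_samples=5),
    "Mean shift":    MeanShift(),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
}

for name, algo in algorithms.items():
    y_pred = algo.fit_predict(X)
    print(f"{name:>13}  ARI={adjusted_rand_score(y_true, y_pred):.3f}"
          f"  AMI={adjusted_mutual_info_score(y_true, y_pred):.3f}")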

Tables 14 and 15 present this performance analysis on the 7 datasets of Fig. 10 from the perspective of the ARI and AMI metrics, respectively. The best performance was obtained with the parametrization presented in Table 16. Our augmented version of K-Means achieves higher performance on the datasets containing non-convex clusters and even on some of the datasets containing convex clusters. Only on D3 does our augmented K-Means score slightly lower than the basic version, and on D5 it is slightly outperformed by the agglomerative clustering algorithm. These tables indicate that our method can outperform commonly used clustering algorithms on both convex and non-convex clusters, whereas the other algorithms tend to perform well on only one of the two types. Various lookahead values result in similar clusterings; however, for D7, our augmented version requires higher lookahead values to obtain the intended clusters.

Table 14 Comparison between the basic K-Means (KM), our augmented version of K-Means (ED-KM) and various clustering algorithms from the perspective of ARI on the 7 datasets presented in Fig. 10
Table 15 Comparison between the basic K-Means (KM), our augmented version of K-Means (ED-KM) and various clustering algorithms from the perspective of AMI on the 7 datasets presented in Fig. 10
Table 16 Parametrization of each algorithm used in the analysis of performance

The performance was also evaluated on the set of high-dimensional real-world UCI datasets presented in Table 7. As dimensionality increases, the performance of the clustering algorithms diminishes, as shown in Table 17. For the Ecoli, Glass and Statlog datasets, the modified version of K-Means outperforms both the classical K-Means and spectral clustering in terms of ARI and AMI. For the Yeast dataset, none of the algorithms correctly identifies the clusters, with ARI and AMI values close to those of a random labelling; the classical spectral clustering achieves the highest values. For the Wdbc dataset, both the original K-Means and spectral clustering outperform the proposed approach.

Table 17 Comparison between the original K-Means, modified K-Means and spectral clustering from the perspective of ARI and AMI on the 5 high-dimensional datasets presented in Table 7

4.2.3 Evaluation of time complexity

Tables 18 and 19 present a time complexity analysis of the basic K-Means and our proposed method along two dimensions. In Table 18, the number of features is fixed to 2 and the number of examples is varied from 100 to 10,000, while in Table 19 the number of examples is fixed to 1,000 and the number of features varies from 2 to 6.

Table 18 Time comparison between the basic K-Means (KM), our augmented version (ED-KM) and spectral clustering (SC) on the same dataset with various numbers of examples
Table 19 Time comparison between the basic K-Means (KM), our augmented version (ED-KM) and spectral clustering (SC) on the same dataset with various numbers of features

Table 18 showcases the impact of dataset size on time complexity. As expected from the execution time of the augmented metrics, the execution time for the augmented K-Means increases drastically with the number of examples in the dataset; however, the quality of the clustering (as indicated by ARI and AMI) remains consistent with, if not better than, that of the basic K-Means regardless of the number of examples. Furthermore, the last column of this table shows that increasing the NN parameter greatly reduces the computation time while leaving the clustering quality unaffected (as indicated by ARI and AMI). As was the case for the metrics, the last column shows values only for the higher numbers of examples, as the smaller datasets should not be combined with such a high value of the NN parameter.

Similar to the previous time analysis on the metrics, Table 19 analyses the impact of the number of features on the complexity. The number of features does not affect the computation time, which remains within the same range; some variation in convergence time is expected due to the random generation of the data and the random initialization of K-Means. Although the computation time of the proposed approach is higher, it is unaffected by the number of features of the dataset, and the scores are consistent with, if not better than, those of the basic K-Means.

The time complexity analysis on the real-world datasets is shown in Table 20. The basic K-Means and SC require significantly less time to cluster the datasets; however, as shown in Table 17, they do not offer the best performance.

Table 20 Time comparison between the basic K-Means (KM), our augmented version (ED-KM) and spectral clustering (SC) on the real-world datasets presented in Table 7

5 Discussions and conclusions

The results presented in this paper indicate that the introduction of our augmented distance substantially improves the evaluation of clustering quality as measured by the augmented versions of SS, DBS, and CHS compared to the basic versions. This comparison reveals valuable insights into their strengths and weaknesses across diverse clustering scenarios.

We applied our algorithm to 13 datasets (7 benchmark synthetic sets and 6 real sets) with multiple (or even joint) complexities. Our augmented versions of the clustering performance evaluation metrics show increased performance, particularly in the evaluation of non-convex clusters paired with the true labels. The augmented metrics maintain the core characteristics of the original metrics and avoid introducing new weaknesses, although a subset of the weaknesses of the basic metrics remains: ED-SS still struggles to accurately assess elongated clusters, as evidenced by inconsistencies on certain datasets; ED-DBS can produce misleading results in specific scenarios; and for ED-CHS, challenges persist in scenarios involving dense or overlapping clusters, where the interpretation of inter-cluster distances becomes ambiguous.

The analysis on 3D datasets further highlights the robustness of our proposed method, as consistent improvements were observed in the clustering performance metrics. Our approach rectifies errors in the clustering metrics, such as misestimations on certain types of labellings, while retaining the essence of the basic metrics. This statement is supported by the consistent enhancement shown in the comparisons between the basic and augmented versions of the Davies-Bouldin and Calinski-Harabasz scores.

The examination of time complexity reveals a higher computational cost due to the necessity of computing paths through the dataset points. However, this overhead can be mitigated by increasing the NN parameter, particularly for datasets with a large number of examples. Furthermore, the proposed method proves less sensitive to increases in dimensionality, as shown by our analysis of datasets with various numbers of features. The scores obtained favour the proposed method, while the strong correlation between the scores of the basic and augmented versions across diverse parameter settings supports the performance evaluation of the approach and its retention of the characteristics of the basic versions.

The augmentation of the K-Means clustering algorithm demonstrates the applicability of the proposed approach. The results show consistent improvements in clustering performance, especially for datasets with non-convex clusters; even on convex clusters, our augmented K-Means outperforms the 5 commonly used clustering algorithms it was compared against on several datasets. Our decision to rely on established external metrics, rather than on the basic or augmented internal versions, ensures an unbiased evaluation of performance. As with the evaluation metrics, the time complexity analysis reveals that the size of the dataset significantly impacts the computational cost, with larger datasets requiring increased computation time that can be mitigated by increasing the NN parameter, while the number of features has minimal impact on computation time.

The La parameter permits paths to traverse non-optimal nodes, with higher values improving performance on non-convex shapes; convex shapes provide 'straighter' paths from the start node to the end node. The NN parameter defines the number of nodes considered at each step and the number of nodes discarded at each step; as such, the NN parameter can be increased when the clusters are dense.
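For intuition only, the sketch below computes an Isomap-style geodesic distance over a k-nearest-neighbour graph. This is not the proposed edging distance and has no lookahead mechanism, but it illustrates how a neighbour-count parameter analogous to NN shapes a path-based distance: more neighbours add more direct edges, which shortens paths and reduces the number of hops that must be explored.

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def knn_path_distances(X, n_neighbors=3):
    # All-pairs path lengths along a symmetrized k-nearest-neighbour graph;
    # a rough analogue of a path-based distance, not the edging distance itself.
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    graph = graph.maximum(graph.T)   # make edges traversable in both directions
    return shortest_path(graph, method="D", directed=False)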

The findings presented in this paper emphasize the need for further investigation to address the limitations of clustering evaluation metrics, particularly for more complex cluster shapes. We highlight the importance of careful consideration in selecting the evaluation metric and interpreting clustering evaluation. The advancements observed in our augmented metrics represent a significant step forward in the quest for more effective and robust methodologies for the evaluation of clustering performance. These developments not only enhance our current understanding of clustering performance but also pave the way for the development of more robust metrics capable of addressing the inherent complexities of real-world datasets.