Computer Science > Machine Learning

arXiv:2209.05757 (cs)

[Submitted on 13 Sep 2022]

Title:Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm

Authors:Marek Gagolewski, Maciej Bartoszuk, Anna Cena

Abstract: The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. For larger data sets, this constraint puts all the classical linkage criteria except single linkage at a disadvantage. However, the single linkage clustering algorithm is known to be very sensitive to outliers and to produce highly skewed dendrograms, and therefore it usually does not reflect the true underlying data structure -- unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini or Bonferroni index) of the cluster sizes does not drastically increase above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage's speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution even further. Its memory overhead is small: there is no need to precompute the complete distance matrix in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source 'genie' package for R. A newer implementation, genieclust, is available for both R and Python.
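
To make the linkage criterion described in the abstract more concrete, the following is a minimal, unoptimized Python sketch. It assumes the normalised Gini index of the cluster sizes as the inequity measure and the following merge rule: while the index stays at or below the threshold, the two closest clusters (in the single linkage sense) are merged; once it exceeds the threshold, only merges involving a cluster of the smallest current size are considered. The function names `gini_index` and `genie_sketch`, the Euclidean distance, and the default threshold of 0.3 are illustrative assumptions, not the API of the 'genie' or genieclust packages, and this brute-force loop has none of the speed or memory properties claimed for the reference implementations.

```python
from itertools import combinations
from math import dist


def gini_index(sizes):
    """Normalised Gini index of cluster sizes (0 means all sizes are equal)."""
    n = len(sizes)
    if n < 2:
        return 0.0
    total_diff = sum(abs(a - b) for a, b in combinations(sizes, 2))
    return total_diff / ((n - 1) * sum(sizes))


def genie_sketch(points, n_clusters, gini_threshold=0.3):
    """Naive Genie-style agglomeration; returns one cluster label per point."""
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        min_size = min(sizes)
        # If the sizes are too unequal, force a smallest cluster to take part.
        restrict = gini_index(sizes) > gini_threshold
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            if restrict and min_size not in (sizes[a], sizes[b]):
                continue
            # Single linkage distance: closest pair of points across clusters.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe because a < b
    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = genie_sketch(points, n_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this toy run the Gini index of the cluster sizes rises above 0.3 once one cluster has grown to three points while the rest are singletons, so the next merges are restricted to pairs that include a smallest cluster. Because the sketch recomputes point-to-point distances on every pass, it is only suitable for illustrating the rule on very small inputs.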

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2209.05757 [cs.LG]
	(or arXiv:2209.05757v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2209.05757
Journal reference:	Information Sciences 363, 8-23, 2016
Related DOI:	https://doi.org/10.1016/j.ins.2016.05.003

Submission history

From: Marek Gagolewski
[v1] Tue, 13 Sep 2022 06:42:53 UTC (75 KB)
