A hash-based co-clustering algorithm for categorical data

Published: 01 December 2016

Abstract

Highlights:

  • A new co-clustering approach for categorical data is proposed.
  • The proposed algorithm scales linearly with the data size.
  • The results show the quality of the found co-clusters and a diverse set of applications for the approach.

Cluster analysis, or clustering, refers to the analysis of the structural organization of a data set. This analysis is performed by grouping together objects of the data that are more similar among themselves than to objects of different groups. The sampled data may be described by numerical features or by a symbolic representation, known as categorical features. These features often require a transformation into numerical data in order to be properly handled by clustering algorithms. The transformation usually assigns a weight to each feature, calculated by a measure of importance (e.g., frequency or mutual information). A problem with this weight assignment is that the values are calculated with respect to the whole set of objects and features. This may pose a problem when a subset of the features has a high degree of importance for one subset of objects but a low degree for another. One way to deal with this problem is to measure the importance of each subset of features only with respect to a subset of objects. This is known as co-clustering: similarly to clustering, it is the task of finding subsets of objects and features that present a higher similarity among themselves than to other subsets of objects and features. As one might notice, this task has a higher complexity than traditional clustering and, if not properly dealt with, may present a scalability issue. In this paper we propose a novel co-clustering technique, called HBLCoClust, with the objective of extracting a set of co-clusters from a categorical data set, without the guarantees of an enumerative algorithm, but with the benefit of scalability.
This is done by using a probabilistic hashing technique, Locality Sensitive Hashing, together with the enumerative algorithm InClose. The experimental results are competitive when applied to labeled categorical data sets and text corpora. Additionally, it is shown that the extracted co-clusters can be of practical use in expert systems such as recommender systems and topic extraction.
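The global-weighting pitfall the abstract describes can be illustrated with a toy example (the data and values below are hypothetical, not from the paper): a value that is perfectly discriminative within a subset of objects can receive the same global frequency weight as an uninformative one.

```python
from collections import Counter

# Hypothetical categorical data set: each row lists the values of
# three categorical features of an object.
rows = [
    ["red", "round", "small"],
    ["red", "round", "large"],
    ["blue", "square", "small"],
    ["blue", "square", "large"],
]

# Global frequency weighting: each value is weighted by how often it
# occurs across the whole data set.
global_freq = Counter(v for row in rows for v in row)

# Every value occurs exactly twice globally, so a global weight cannot
# tell "red" (perfectly informative for rows 0-1) apart from "small"
# (uninformative for that same subset).  Counting only within the
# subset {row 0, row 1} recovers the distinction.
subset_freq = Counter(v for row in rows[:2] for v in row)

print(global_freq["red"], global_freq["small"])  # 2 2
print(subset_freq["red"], subset_freq["small"])  # 2 1
```

This is why co-clustering measures feature importance per subset of objects rather than once over the whole matrix.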

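The candidate-generation step of the pipeline the abstract outlines can be sketched roughly as follows. This is an illustrative MinHash-style LSH bucketing (all function names and parameters here are my own, not the paper's HBLCoClust implementation): rows sharing many categorical values tend to collide in at least one band bucket, yielding candidate object groups without an all-pairs comparison; an enumerative step such as InClose could then refine each bucket into co-clusters.

```python
import random

def minhash_signature(items, seeds):
    """One MinHash component per seed: the minimum hash over the row's items.

    Python's built-in hash() is stable within a single run, which is all
    this sketch needs; a production version would use explicit hash
    functions to make signatures reproducible across runs.
    """
    return tuple(min(hash((s, it)) for it in items) for s in seeds)

def lsh_buckets(rows, n_hashes=4, bands=2, seed=42):
    """Group row indices whose signatures collide on at least one band."""
    rng = random.Random(seed)
    seeds = [rng.random() for _ in range(n_hashes)]
    per_band = n_hashes // bands
    buckets = {}
    for idx, row in enumerate(rows):
        sig = minhash_signature(row, seeds)
        for b in range(bands):
            key = (b, sig[b * per_band:(b + 1) * per_band])
            buckets.setdefault(key, []).append(idx)
    return buckets

# Each object is the set of its "feature=value" pairs.
rows = [
    {"color=red", "shape=round", "size=small"},
    {"color=red", "shape=round", "size=large"},
    {"color=blue", "shape=square", "size=small"},
]
buckets = lsh_buckets(rows)
# Each bucket holds candidate co-cluster members; similar rows (0 and 1)
# are likely, though not guaranteed, to share a bucket.
```

The cost is linear in the number of rows and hashes, which is consistent with the scalability claim; the trade-off is that bucketing is probabilistic, so some similar pairs may be missed.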

Cited By

  • (2020) Density-based Algorithms for Big Data Clustering Using MapReduce Framework. ACM Computing Surveys, 53(5), 1-38. DOI: 10.1145/3403951. Online publication date: 28-Sep-2020.

    Published In

    Expert Systems with Applications: An International Journal, Volume 64, Issue C
    December 2016
    645 pages

    Publisher

    Pergamon Press, Inc.

    United States


    Author Tags

    1. Biclustering
    2. Categorical data
    3. Co-clustering
    4. Data mining
    5. Text mining

    Qualifiers

    • Research-article
