A hash-based co-clustering algorithm for categorical data

Published: 01 December 2016

Abstract

Highlights:

  • A new co-clustering approach for categorical data is proposed.
  • The proposed algorithm scales linearly with the data size.
  • The results show the quality of the found co-clusters and a diverse set of applications for the approach.

Cluster analysis, or clustering, refers to the analysis of the structural organization of a data set. This analysis is performed by grouping together objects of the data that are more similar among themselves than to objects of different groups. The sampled data may be described by numerical features or by a symbolic representation, known as categorical features. These features often require a transformation into numerical data in order to be properly handled by clustering algorithms. The transformation usually assigns a weight to each feature, calculated by a measure of importance (e.g., frequency or mutual information). A problem with this weight assignment is that the values are calculated with respect to the whole set of objects and features. This may pose a problem when a subset of the features has a high degree of importance for one subset of objects but a low degree for another. One way to deal with this problem is to measure the importance of each subset of features only with respect to a subset of objects. This is known as co-clustering: similarly to clustering, it is the task of finding subsets of objects and features that present a higher similarity among themselves than to other subsets of objects and features. As one might notice, this task has a higher complexity than traditional clustering and, if not properly dealt with, may present a scalability issue. In this paper we propose a novel co-clustering technique, called HBLCoClust, with the objective of extracting a set of co-clusters from a categorical data set, without the guarantees of an enumerative algorithm, but with the benefit of scalability.
This is done by using a probabilistic hashing technique, Locality Sensitive Hashing, together with the enumerative algorithm InClose. The experimental results are competitive when applied to labeled categorical data sets and text corpora. Additionally, it is shown that the extracted co-clusters can be of practical use in expert systems such as recommender systems and topic extraction.
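The global-weighting pitfall the abstract describes can be illustrated with a toy example (the data and values below are hypothetical, not from the paper): a value that is perfectly discriminative within a subset of objects can receive the same global frequency weight as an uninformative one.

```python
from collections import Counter

# Hypothetical categorical data set: each row lists the values of
# three categorical features of an object.
rows = [
    ["red", "round", "small"],
    ["red", "round", "large"],
    ["blue", "square", "small"],
    ["blue", "square", "large"],
]

# Global frequency weighting: each value is weighted by how often it
# occurs across the whole data set.
global_freq = Counter(v for row in rows for v in row)

# Every value occurs exactly twice globally, so a global weight cannot
# tell "red" (perfectly informative for rows 0-1) apart from "small"
# (uninformative for that same subset).  Counting only within the
# subset {row 0, row 1} recovers the distinction.
subset_freq = Counter(v for row in rows[:2] for v in row)

print(global_freq["red"], global_freq["small"])  # 2 2
print(subset_freq["red"], subset_freq["small"])  # 2 1
```

This is why co-clustering measures feature importance per subset of objects rather than once over the whole matrix.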

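The candidate-generation step of the pipeline the abstract outlines can be sketched roughly as follows. This is an illustrative MinHash-style LSH bucketing (all function names and parameters here are my own, not the paper's HBLCoClust implementation): rows sharing many categorical values tend to collide in at least one band bucket, yielding candidate object groups without an all-pairs comparison; an enumerative step such as InClose could then refine each bucket into co-clusters.

```python
import random

def minhash_signature(items, seeds):
    """One MinHash component per seed: the minimum hash over the row's items.

    Python's built-in hash() is stable within a single run, which is all
    this sketch needs; a production version would use explicit hash
    functions to make signatures reproducible across runs.
    """
    return tuple(min(hash((s, it)) for it in items) for s in seeds)

def lsh_buckets(rows, n_hashes=4, bands=2, seed=42):
    """Group row indices whose signatures collide on at least one band."""
    rng = random.Random(seed)
    seeds = [rng.random() for _ in range(n_hashes)]
    per_band = n_hashes // bands
    buckets = {}
    for idx, row in enumerate(rows):
        sig = minhash_signature(row, seeds)
        for b in range(bands):
            key = (b, sig[b * per_band:(b + 1) * per_band])
            buckets.setdefault(key, []).append(idx)
    return buckets

# Each object is the set of its "feature=value" pairs.
rows = [
    {"color=red", "shape=round", "size=small"},
    {"color=red", "shape=round", "size=large"},
    {"color=blue", "shape=square", "size=small"},
]
buckets = lsh_buckets(rows)
# Each bucket holds candidate co-cluster members; similar rows (0 and 1)
# are likely, though not guaranteed, to share a bucket.
```

The cost is linear in the number of rows and hashes, which is consistent with the scalability claim; the trade-off is that bucketing is probabilistic, so some similar pairs may be missed.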

Cited By

  • (2020) Density-based Algorithms for Big Data Clustering Using MapReduce Framework. ACM Computing Surveys, 53(5), 1-38. DOI: 10.1145/3403951. Online publication date: 28-Sep-2020.

    Published In

    Expert Systems with Applications: An International Journal, Volume 64, Issue C
    December 2016
    645 pages

    Publisher

    Pergamon Press, Inc.

    United States


    Author Tags

    1. Biclustering
    2. Categorical data
    3. Co-clustering
    4. Data mining
    5. Text mining

    Qualifiers

    • Research-article
