research-article

Relevant overlapping subspace clusters on categorical data

Authors:

Claudia PlantAuthors Info & Claims

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 213 - 222

https://doi.org/10.1145/2623330.2623652

Published: 24 August 2014 Publication History

Abstract

Clustering categorical data poses some unique challenges: Due to missing order and spacing among the categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detect clusters in the full-dimensional data space. Only few methods exist for subspace clustering and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compress the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e. objects can be assigned to different clusters in different subspaces and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.

Supplementary Material

MP4 File (p213-sidebyside.mp4)

Download
260.46 MB

References

[1]

E. Achtert, H.-P. Kriegel, and A. Zimek. Elki: A software system for evaluation of subspace clustering algorithms. In SSDBM, pages 580--585, 2008.

Digital Library

[2]

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD Conference, pages 94--105, 1998.

Digital Library

[3]

P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. Limbo: Scalable clustering of categorical data. In EDBT, pages 123--146, 2004.

[4]

A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. J. Mooney. Model-based overlapping clustering. In KDD, pages 532--537, 2005.

Digital Library

[5]

D. Barbara, Y. Li, and J. Couto. Coolcat: an entropy-based algorithm for categorical clustering. In CIKM, pages 582--589, 2002.

Digital Library

[6]

S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243--254, 2008.

[7]

E. Cesario, G. Manco, and R. Ortale. Top-down parameter-free clustering of high-dimensional categorical data. IEEE Trans. Knowl. Data Eng., 19(12):1607--1624, 2007.

Digital Library

[8]

Q. Fu and A. Banerjee. Bayesian overlapping subspace clustering. In ICDM, pages 776--781, 2009.

Digital Library

[9]

G. Gan and J. Wu. Subspace clustering for high dimensional categorical data. SIGKDD Explorations, 6(2):87--94, 2004.

Digital Library

[10]

V. Ganti, J. Gehrke, and R. Ramakrishnan. Cactus - clustering categorical data using summaries. In KDD, pages 73--83, 1999.

Digital Library

[11]

M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

Digital Library

[12]

F. Geerts, B. Goethals, and T. Mielikainen. Tiling databases. In Discovery Science, pages 278--289, 2004.

[13]

S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. In ICDE, pages 512--521, 1999.

Digital Library

[14]

Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discoery, 1997.

[15]

K.-N. Kontonasios and T. D. Bie. An information-theoretic approach to finding informative noisy tiles in binary databases. In SDM, pages 153--164, 2010.

[16]

H.-P. Kriegel, P. Kroger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD, 3(1), 2009.

Digital Library

[17]

C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. J. ACM, 41(5):960--981, 1994.

Digital Library

[18]

M. Mampaey, N. Tatti, and J. Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In KDD, pages 573--581, 2011.

Digital Library

[19]

G. Moise and J. Sander. Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In KDD, pages 533--541, 2008.

Digital Library

[20]

E. Muller, I. Assent, S. Gunnemann, R. Krieger, and T. Seidl. Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data. In ICDM, pages 377--386, 2009.

Digital Library

[21]

A. K. Poernomo and V. Gopalkrishnan. Towards efficient mining of proportional fault-tolerant frequent itemsets. In KDD, pages 697--706, 2009.

Digital Library

[22]

J. Rissanen. Information and Complexity in Statistical Modeling. Springer, 2007.

Digital Library

[23]

J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Min. Knowl. Discov., 23(1):169--214, 2011.

Digital Library

[24]

Y. Xiang, R. Jin, D. Fuhry, and F. F. Dragan. Summarizing transactional databases with overlapped hyperrectangles. Data Min. Knowl. Discov., 23(2):215--251, 2011.

Digital Library

[25]

T. Xiong, S. Wang, A. Mayers, and E. Monga. A new mca-based divisive hierarchical algorithm for clustering categorical data. In ICDM, pages 1058--1063, 2009.

Digital Library

[26]

M. J. Zaki, M. Peters, I. Assent, and T. Seidl. Clicks: an effective algorithm for mining subspace clusters in categorical datasets. In KDD, pages 736--742, 2005.

Digital Library

Cited By

Brnawy RShiri N(2023)Subspace Clustering Technique Using Multi-objective Functions for Multi-class Categorical DataDatabase and Expert Systems Applications10.1007/978-3-031-39821-6_28(337-343)Online publication date: 16-Aug-2023
https://doi.org/10.1007/978-3-031-39821-6_28
Galbrun E(2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
https://doi.org/10.1007/s10618-022-00846-z
Brnawy RShiri N(2021)Improving Quality of Ensemble Technique for Categorical Data Clustering Using Granule ComputingDatabase and Expert Systems Applications10.1007/978-3-030-86472-9_24(261-272)Online publication date: 31-Aug-2021
https://doi.org/10.1007/978-3-030-86472-9_24
Show More Cited By

Index Terms

Relevant overlapping subspace clusters on categorical data
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Subspace clustering for high dimensional categorical data

Data clustering has been discussed extensively, but almost all known conventional clustering algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the data points. Existing subspace clustering algorithms for ...
A weighting k-modes algorithm for subspace clustering of categorical data

Traditional clustering algorithms consider all of the dimensions of an input data set equally. However, in the high dimensional data, a common property is that data points are highly clustered in subspaces, which means classes of objects are categorized ...
Rough subspace-based clustering ensemble for categorical data

Clustering categorical data arising as an important problem of data mining has recently attracted much attention. In this paper, the problem of unsupervised dimensionality reduction for categorical data is first studied. Based on the theory of rough ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2014

2028 pages

ISBN:9781450329569

DOI:10.1145/2623330

General Chairs:
Sofus Macskassy
Facebook
,
Claudia Perlich
Dstillery
,
Program Chairs:
Jure Leskovec
Stanford University
,
Wei Wang
UCLA
,
Rayid Ghani
University of Chicago

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

KDD '14

Sponsor:

KDD '14: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2014

New York, New York, USA

Acceptance Rates

KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Sponsor:
sigkdd
sigkdd

The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 3 - 7, 2025

Toronto , ON , Canada

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
1,103
Total Downloads

Downloads (Last 12 months)15
Downloads (Last 6 weeks)3

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Brnawy RShiri N(2023)Subspace Clustering Technique Using Multi-objective Functions for Multi-class Categorical DataDatabase and Expert Systems Applications10.1007/978-3-031-39821-6_28(337-343)Online publication date: 16-Aug-2023
https://doi.org/10.1007/978-3-031-39821-6_28
Galbrun E(2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
https://doi.org/10.1007/s10618-022-00846-z
Brnawy RShiri N(2021)Improving Quality of Ensemble Technique for Categorical Data Clustering Using Granule ComputingDatabase and Expert Systems Applications10.1007/978-3-030-86472-9_24(261-272)Online publication date: 31-Aug-2021
https://doi.org/10.1007/978-3-030-86472-9_24
Khader MAl-Naymat G(2020)Density-based Algorithms for Big Data Clustering Using MapReduce FrameworkACM Computing Surveys10.1145/340395153:5(1-38)Online publication date: 28-Sep-2020
https://dl.acm.org/doi/10.1145/3403951
Pang NZhang JZhang CQin X(2019)Parallel Hierarchical Subspace Clustering of Categorical DataIEEE Transactions on Computers10.1109/TC.2018.287933268:4(542-555)Online publication date: 18-Jul-2019
https://dl.acm.org/doi/10.1109/TC.2018.2879332
Vo BPham SLe TDeng Z(2017)A novel approach for mining maximal frequent patternsExpert Systems with Applications: An International Journal10.1016/j.eswa.2016.12.02373:C(178-186)Online publication date: 1-May-2017
https://dl.acm.org/doi/10.1016/j.eswa.2016.12.023
Kalaiselvi KD. K(undefined)Review of Traditional and Ensemble Clustering Algorithms for High Dimensional DataSSRN Electronic Journal10.2139/ssrn.3170321
https://doi.org/10.2139/ssrn.3170321

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten