Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2623330.2623652acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Relevant overlapping subspace clusters on categorical data

Published: 24 August 2014 Publication History

Abstract

Clustering categorical data poses some unique challenges: Due to missing order and spacing among the categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detect clusters in the full-dimensional data space. Only few methods exist for subspace clustering and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compress the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e. objects can be assigned to different clusters in different subspaces and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.

Supplementary Material

MP4 File (p213-sidebyside.mp4)

References

[1]
E. Achtert, H.-P. Kriegel, and A. Zimek. Elki: A software system for evaluation of subspace clustering algorithms. In SSDBM, pages 580--585, 2008.
[2]
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD Conference, pages 94--105, 1998.
[3]
P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. Limbo: Scalable clustering of categorical data. In EDBT, pages 123--146, 2004.
[4]
A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. J. Mooney. Model-based overlapping clustering. In KDD, pages 532--537, 2005.
[5]
D. Barbara, Y. Li, and J. Couto. Coolcat: an entropy-based algorithm for categorical clustering. In CIKM, pages 582--589, 2002.
[6]
S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243--254, 2008.
[7]
E. Cesario, G. Manco, and R. Ortale. Top-down parameter-free clustering of high-dimensional categorical data. IEEE Trans. Knowl. Data Eng., 19(12):1607--1624, 2007.
[8]
Q. Fu and A. Banerjee. Bayesian overlapping subspace clustering. In ICDM, pages 776--781, 2009.
[9]
G. Gan and J. Wu. Subspace clustering for high dimensional categorical data. SIGKDD Explorations, 6(2):87--94, 2004.
[10]
V. Ganti, J. Gehrke, and R. Ramakrishnan. Cactus - clustering categorical data using summaries. In KDD, pages 73--83, 1999.
[11]
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.
[12]
F. Geerts, B. Goethals, and T. Mielikainen. Tiling databases. In Discovery Science, pages 278--289, 2004.
[13]
S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. In ICDE, pages 512--521, 1999.
[14]
Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discoery, 1997.
[15]
K.-N. Kontonasios and T. D. Bie. An information-theoretic approach to finding informative noisy tiles in binary databases. In SDM, pages 153--164, 2010.
[16]
H.-P. Kriegel, P. Kroger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD, 3(1), 2009.
[17]
C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. J. ACM, 41(5):960--981, 1994.
[18]
M. Mampaey, N. Tatti, and J. Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In KDD, pages 573--581, 2011.
[19]
G. Moise and J. Sander. Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In KDD, pages 533--541, 2008.
[20]
E. Muller, I. Assent, S. Gunnemann, R. Krieger, and T. Seidl. Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data. In ICDM, pages 377--386, 2009.
[21]
A. K. Poernomo and V. Gopalkrishnan. Towards efficient mining of proportional fault-tolerant frequent itemsets. In KDD, pages 697--706, 2009.
[22]
J. Rissanen. Information and Complexity in Statistical Modeling. Springer, 2007.
[23]
J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Min. Knowl. Discov., 23(1):169--214, 2011.
[24]
Y. Xiang, R. Jin, D. Fuhry, and F. F. Dragan. Summarizing transactional databases with overlapped hyperrectangles. Data Min. Knowl. Discov., 23(2):215--251, 2011.
[25]
T. Xiong, S. Wang, A. Mayers, and E. Monga. A new mca-based divisive hierarchical algorithm for clustering categorical data. In ICDM, pages 1058--1063, 2009.
[26]
M. J. Zaki, M. Peters, I. Assent, and T. Seidl. Clicks: an effective algorithm for mining subspace clusters in categorical datasets. In KDD, pages 736--742, 2005.

Cited By

View all
  • (2023)Subspace Clustering Technique Using Multi-objective Functions for Multi-class Categorical DataDatabase and Expert Systems Applications10.1007/978-3-031-39821-6_28(337-343)Online publication date: 16-Aug-2023
  • (2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
  • (2021)Improving Quality of Ensemble Technique for Categorical Data Clustering Using Granule ComputingDatabase and Expert Systems Applications10.1007/978-3-030-86472-9_24(261-272)Online publication date: 31-Aug-2021
  • Show More Cited By

Index Terms

  1. Relevant overlapping subspace clusters on categorical data

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2014
    2028 pages
    ISBN:9781450329569
    DOI:10.1145/2623330
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2014

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. categorical data
    2. minimum description length
    3. relevant subspace clustering

    Qualifiers

    • Research-article

    Conference

    KDD '14
    Sponsor:

    Acceptance Rates

    KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)11
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 19 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Subspace Clustering Technique Using Multi-objective Functions for Multi-class Categorical DataDatabase and Expert Systems Applications10.1007/978-3-031-39821-6_28(337-343)Online publication date: 16-Aug-2023
    • (2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
    • (2021)Improving Quality of Ensemble Technique for Categorical Data Clustering Using Granule ComputingDatabase and Expert Systems Applications10.1007/978-3-030-86472-9_24(261-272)Online publication date: 31-Aug-2021
    • (2020)Density-based Algorithms for Big Data Clustering Using MapReduce FrameworkACM Computing Surveys10.1145/340395153:5(1-38)Online publication date: 28-Sep-2020
    • (2019)Parallel Hierarchical Subspace Clustering of Categorical DataIEEE Transactions on Computers10.1109/TC.2018.287933268:4(542-555)Online publication date: 18-Jul-2019
    • (2017)A novel approach for mining maximal frequent patternsExpert Systems with Applications: An International Journal10.1016/j.eswa.2016.12.02373:C(178-186)Online publication date: 1-May-2017
    • (undefined)Review of Traditional and Ensemble Clustering Algorithms for High Dimensional DataSSRN Electronic Journal10.2139/ssrn.3170321

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media