
DOI: 10.1145/2020408.2020584

INCONCO: interpretable clustering of numerical and categorical objects

Published: 21 August 2011

Abstract

The integrative mining of heterogeneous data and the interpretability of the data mining result are two of the most important challenges of today's data mining. It is commonly agreed in the community that, particularly in the research area of clustering, both challenges have not yet received due attention. Only a few approaches for clustering objects with mixed-type attributes exist, and those approaches do not consider cluster-specific dependencies between numerical and categorical attributes. Likewise, only a few clustering papers address the problem of interpretability: explaining why a certain set of objects has been grouped into a cluster and what distinguishes a particular cluster from another. In this paper, we approach both challenges by establishing a relationship to the concept of data compression using the Minimum Description Length principle: a detected cluster structure is considered better the more efficiently it can be exploited for data compression. Following this idea, we can learn, during the run of a clustering algorithm, the optimal trade-off for attribute weights and distinguish relevant attribute dependencies from coincidental ones. We extend the efficient Cholesky decomposition to model dependencies in heterogeneous data and to ensure interpretability. Our proposed algorithm, INCONCO, successfully finds clusters in mixed-type data sets, identifies the relevant attribute dependencies, and explains them using linear models and case-by-case analysis. Thereby, it outperforms existing approaches in effectiveness, as our extensive experimental evaluation demonstrates.
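
As a rough, hedged illustration of the compression idea sketched in the abstract (not the actual INCONCO objective), the following Python sketch computes a simplified MDL-style coding cost for one cluster of mixed-type objects: numerical attributes are coded under a Gaussian model whose covariance is factorized with a Cholesky decomposition, and categorical attributes are coded with their empirical frequencies. All names (e.g., cluster_coding_cost) are illustrative assumptions, not part of the paper.

# Illustrative sketch only: a simplified MDL-style coding cost for one cluster
# of mixed-type data, loosely inspired by the idea described in the abstract.
# It is NOT the INCONCO objective; function and variable names are hypothetical.
import numpy as np

def numerical_coding_cost(X):
    """Bits to code numerical rows X (n x d) under a Gaussian model whose
    covariance is factorized with a Cholesky decomposition."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)  # regularize for stability
    L = np.linalg.cholesky(cov)                        # cov = L @ L.T
    log_det = 2.0 * np.sum(np.log(np.diag(L)))         # log|cov| from the Cholesky factor
    # Differential entropy of a d-dimensional Gaussian, converted from nats to bits per row.
    entropy_nats = 0.5 * (d * np.log(2 * np.pi * np.e) + log_det)
    return n * entropy_nats / np.log(2)

def categorical_coding_cost(C):
    """Bits to code categorical rows C (n x m) with per-attribute empirical frequencies."""
    n, m = C.shape
    bits = 0.0
    for j in range(m):
        _, counts = np.unique(C[:, j], return_counts=True)
        probs = counts / n
        bits += -np.sum(counts * np.log2(probs))       # n * empirical entropy of attribute j
    return bits

def cluster_coding_cost(X, C):
    """Total (simplified) description length of one cluster: the smaller the cost,
    the better the cluster structure compresses the data."""
    return numerical_coding_cost(X) + categorical_coding_cost(C)

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
C = rng.choice(["a", "b", "c"], size=(50, 2))
print(f"coding cost: {cluster_coding_cost(X, C):.1f} bits")

Under this reading, a candidate clustering would be preferred when the summed per-cluster coding costs (plus the model costs, which this sketch omits) are smaller.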





    Published In

    KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2011
    1446 pages
    ISBN:9781450308137
    DOI:10.1145/2020408
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 August 2011


    Author Tags

    1. clustering
    2. minimum description length principle
    3. mixed-type data

    Qualifiers

    • Poster

    Conference

    KDD '11

    Acceptance Rates

    Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%



    Article Metrics

    • Downloads (Last 12 months): 21
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 14 Feb 2025

    Cited By

    • (2024) Explainable AI for Mixed Data Clustering. Explainable Artificial Intelligence, 10.1007/978-3-031-63797-1_3, pp. 42-62. Online publication date: 10-Jul-2024
    • (2023) Clustering Mixed Data Comprising Time Series. Proceedings of the 12th International Symposium on Information and Communication Technology, 10.1145/3628797.3628968, pp. 110-117. Online publication date: 7-Dec-2023
    • (2023) Nearly tight bounds on the price of explainability for the k-center and the maximum-spacing clustering problems. Theoretical Computer Science, 10.1016/j.tcs.2023.113744, 949:C. Online publication date: 9-Mar-2023
    • (2023) CCET: towards customized explanation of clustering. The Visual Computer, 10.1007/s00371-023-02958-z, 39:8, pp. 3169-3181. Online publication date: 22-Jul-2023
    • (2023) Interpretable Data Partitioning Through Tree-Based Clustering Methods. Discovery Science, 10.1007/978-3-031-45275-8_33, pp. 492-507. Online publication date: 8-Oct-2023
    • (2023) Algorithm-Agnostic Feature Attributions for Clustering. Explainable Artificial Intelligence, 10.1007/978-3-031-44064-9_13, pp. 217-240. Online publication date: 30-Oct-2023
    • (2023) k-SubMix: Common Subspace Clustering on Mixed-Type Data. Machine Learning and Knowledge Discovery in Databases: Research Track, 10.1007/978-3-031-43412-9_39, pp. 662-677. Online publication date: 17-Sep-2023
    • (2022) Time2Feat. Proceedings of the VLDB Endowment, 10.14778/3565816.3565822, 16:2, pp. 193-201. Online publication date: 23-Nov-2022
    • (2021) A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability. Pattern Analysis and Applications, 10.1007/s10044-021-00977-x, 24:3, pp. 1387-1402. Online publication date: 12-May-2021
    • (2020) Clustering of mixed-type data considering concept hierarchies: problem specification and algorithm. International Journal of Data Science and Analytics, 10.1007/s41060-020-00216-2. Online publication date: 25-Apr-2020
