
DOI: 10.1145/2020408.2020584

INCONCO: interpretable clustering of numerical and categorical objects

Published: 21 August 2011

Abstract

The integrative mining of heterogeneous data and the interpretability of the data mining result are two of the most important challenges of today's data mining. It is commonly agreed in the community that, particularly in the research area of clustering, both challenges have not yet received due attention. Only a few approaches for clustering objects with mixed-type attributes exist, and those approaches do not consider cluster-specific dependencies between numerical and categorical attributes. Likewise, only a few clustering papers address the problem of interpretability: explaining why a certain set of objects has been grouped into a cluster and what distinguishes a particular cluster from another. In this paper, we approach both challenges by establishing a relationship to the concept of data compression using the Minimum Description Length principle: a detected cluster structure is considered better the more efficiently it can be exploited for data compression. Following this idea, we can learn, during the run of a clustering algorithm, the optimal trade-off for attribute weights and distinguish relevant attribute dependencies from coincidental ones. We extend the efficient Cholesky decomposition to model dependencies in heterogeneous data and to ensure interpretability. Our proposed algorithm, INCONCO, successfully finds clusters in mixed-type data sets, identifies the relevant attribute dependencies, and explains them using linear models and case-by-case analysis. Thereby, it outperforms existing approaches in effectiveness, as our extensive experimental evaluation demonstrates.
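
As a rough, hedged illustration of the compression idea sketched in the abstract (not the actual INCONCO objective), the following Python sketch computes a simplified MDL-style coding cost for one cluster of mixed-type objects: numerical attributes are coded under a Gaussian model whose covariance is factorized with a Cholesky decomposition, and categorical attributes are coded with their empirical frequencies. All names (e.g., cluster_coding_cost) are illustrative assumptions, not part of the paper.

# Illustrative sketch only: a simplified MDL-style coding cost for one cluster
# of mixed-type data, loosely inspired by the idea described in the abstract.
# It is NOT the INCONCO objective; function and variable names are hypothetical.
import numpy as np

def numerical_coding_cost(X):
    """Bits to code numerical rows X (n x d) under a Gaussian model whose
    covariance is factorized with a Cholesky decomposition."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)  # regularize for stability
    L = np.linalg.cholesky(cov)                        # cov = L @ L.T
    log_det = 2.0 * np.sum(np.log(np.diag(L)))         # log|cov| from the Cholesky factor
    # Differential entropy of a d-dimensional Gaussian, converted from nats to bits per row.
    entropy_nats = 0.5 * (d * np.log(2 * np.pi * np.e) + log_det)
    return n * entropy_nats / np.log(2)

def categorical_coding_cost(C):
    """Bits to code categorical rows C (n x m) with per-attribute empirical frequencies."""
    n, m = C.shape
    bits = 0.0
    for j in range(m):
        _, counts = np.unique(C[:, j], return_counts=True)
        probs = counts / n
        bits += -np.sum(counts * np.log2(probs))       # n * empirical entropy of attribute j
    return bits

def cluster_coding_cost(X, C):
    """Total (simplified) description length of one cluster: the smaller the cost,
    the better the cluster structure compresses the data."""
    return numerical_coding_cost(X) + categorical_coding_cost(C)

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
C = rng.choice(["a", "b", "c"], size=(50, 2))
print(f"coding cost: {cluster_coding_cost(X, C):.1f} bits")

Under this reading, a candidate clustering would be preferred when the summed per-cluster coding costs (plus the model costs, which this sketch omits) are smaller.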





    Published In

    KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2011
    1446 pages
    ISBN:9781450308137
    DOI:10.1145/2020408
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 August 2011


    Author Tags

    1. clustering
    2. minimum description length principle
    3. mixed-type data

    Qualifiers

    • Poster

    Conference

    KDD '11

    Acceptance Rates

    Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%



    Article Metrics

    • Downloads (Last 12 months): 21
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 14 Feb 2025

    Cited By

    • (2024) Explainable AI for Mixed Data Clustering. Explainable Artificial Intelligence, 10.1007/978-3-031-63797-1_3, pp. 42-62. Online publication date: 10-Jul-2024
    • (2023) Clustering Mixed Data Comprising Time Series. Proceedings of the 12th International Symposium on Information and Communication Technology, 10.1145/3628797.3628968, pp. 110-117. Online publication date: 7-Dec-2023
    • (2023) Nearly tight bounds on the price of explainability for the k-center and the maximum-spacing clustering problems. Theoretical Computer Science, 10.1016/j.tcs.2023.113744, 949:C. Online publication date: 9-Mar-2023
    • (2023) CCET: towards customized explanation of clustering. The Visual Computer, 10.1007/s00371-023-02958-z, 39:8, pp. 3169-3181. Online publication date: 22-Jul-2023
    • (2023) Interpretable Data Partitioning Through Tree-Based Clustering Methods. Discovery Science, 10.1007/978-3-031-45275-8_33, pp. 492-507. Online publication date: 8-Oct-2023
    • (2023) Algorithm-Agnostic Feature Attributions for Clustering. Explainable Artificial Intelligence, 10.1007/978-3-031-44064-9_13, pp. 217-240. Online publication date: 30-Oct-2023
    • (2023) k-SubMix: Common Subspace Clustering on Mixed-Type Data. Machine Learning and Knowledge Discovery in Databases: Research Track, 10.1007/978-3-031-43412-9_39, pp. 662-677. Online publication date: 17-Sep-2023
    • (2022) Time2Feat. Proceedings of the VLDB Endowment, 10.14778/3565816.3565822, 16:2, pp. 193-201. Online publication date: 23-Nov-2022
    • (2021) A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability. Pattern Analysis and Applications, 10.1007/s10044-021-00977-x, 24:3, pp. 1387-1402. Online publication date: 12-May-2021
    • (2020) Clustering of mixed-type data considering concept hierarchies: problem specification and algorithm. International Journal of Data Science and Analytics, 10.1007/s41060-020-00216-2. Online publication date: 25-Apr-2020
