Abstract
The k-means algorithm is well known for its efficiency in clustering large data sets. However, because it works only on numeric values, it cannot be used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets, each with half a million objects, show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
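The abstract describes the two dissimilarity measures only in words. The following Python sketch, which is not from the paper, illustrates the idea under stated assumptions: the simple matching dissimilarity and frequency-based mode update used by k-modes, and a combined k-prototypes measure in which a weight (called gamma here) balances the numeric and categorical parts. The function names and the gamma parameter are illustrative assumptions, not the paper's notation.

# A minimal sketch of the dissimilarity measures described in the abstract.
# Names and the `gamma` weight are assumptions for illustration only.
from collections import Counter

def matching_dissimilarity(x, y):
    """k-modes: number of attributes on which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))

def update_mode(cluster):
    """Frequency-based mode update: most frequent category per attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def prototype_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """k-prototypes: squared Euclidean distance on numeric attributes plus
    gamma times the simple matching dissimilarity on categorical attributes."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = matching_dissimilarity(x_cat, proto_cat)
    return numeric + gamma * categorical

# Example: two categorical objects differing on one of three attributes.
print(matching_dissimilarity(("red", "round", "small"), ("red", "oval", "small")))  # 1
print(update_mode([("a", "x"), ("a", "y"), ("b", "x")]))                            # ('a', 'x')

In this sketch a larger gamma gives categorical mismatches more influence on cluster assignment, which is the role the combined measure plays in letting one algorithm handle mixed numeric and categorical attributes.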
Cite this article
Huang, Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery 2, 283–304 (1998). https://doi.org/10.1023/A:1009769707641