Abstract
In supervised machine learning, partitioning the values of a categorical attribute (also called grouping) aims to construct a new synthetic attribute that preserves the information of the initial attribute while reducing its number of values. When the number of values is very large, the risk of overfitting the data increases sharply and building good groupings becomes difficult. In this paper, we propose two new grouping methods based on a Bayesian approach, leading to Bayes optimal groupings. The first method exploits a standard schema for grouping models; the second extends this schema by managing a “garbage” group dedicated to the least frequent values. Extensive comparative experiments demonstrate that the new grouping methods build high-quality groupings in terms of predictive quality, robustness, and small number of groups.
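To make the idea of a “garbage” group concrete, the following minimal sketch folds infrequent values of a categorical attribute into one shared group. It is only an illustration and not the Bayesian criterion proposed in the paper: the fixed frequency threshold `min_count`, the label `__garbage__`, and the function name `group_rare_values` are all assumptions introduced here.

```python
from collections import Counter

def group_rare_values(values, min_count=5, garbage_label="__garbage__"):
    """Map each distinct value to itself, or to a shared garbage label when rare.

    Assumption: rarity is decided by a fixed frequency threshold (min_count),
    which stands in for the model-based choice made in the paper.
    """
    counts = Counter(values)
    return {v: (v if c >= min_count else garbage_label) for v, c in counts.items()}

# Toy attribute with two frequent values and two rare ones.
values = ["a"] * 10 + ["b"] * 7 + ["c"] * 2 + ["d"]
mapping = group_rare_values(values, min_count=5)
grouped = [mapping[v] for v in values]
```

In this toy case the four initial values collapse into three groups, with `c` and `d` sharing the garbage group. A fixed threshold ignores the class labels entirely, which is exactly the kind of limitation a supervised grouping criterion is meant to address.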
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boullé, M. (2005). A Grouping Method for Categorical Attributes Having Very Large Number of Values. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_23
DOI: https://doi.org/10.1007/11510888_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26923-6
Online ISBN: 978-3-540-31891-0
eBook Packages: Computer Science, Computer Science (R0)