Abstract
Global models of a dataset reflect not only the large-scale structure of the data distribution but also its small(er)-scale structure. Hence, if one wants to see the large-scale structure, one should somehow subtract this smaller-scale structure from the model.
While for some kinds of models – such as boosted classifiers – it is easy to see the “important” components, for many kinds of models this is far harder, if possible at all. In such cases one might try an implicit approach: simplify the data distribution without changing its large-scale structure. That is, one might first smooth the local structure out of the dataset and then induce a new model from this smoothed dataset. This new model should then reflect the large-scale structure of the original dataset. In this paper we propose such a smoothing for categorical data and for one particular type of model, viz., code tables.
Our experiments show that this approach preserves the large-scale structure of a dataset well: the smoothed dataset is simpler, while the original and smoothed datasets share the same large-scale structure.
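The two-step pipeline the abstract describes – first smooth small-scale structure out of the data, then induce a model from the smoothed result – can be illustrated with a deliberately simplified sketch. The smoother below is a generic frequency-based stand-in (values occurring fewer than `min_support` times are folded into their column's modal value), not the code-table-based method of the paper; the function name and the `min_support` parameter are invented for this illustration.

```python
from collections import Counter

def smooth_columns(rows, min_support=2):
    """Toy categorical smoother: fold rare values into each column's mode.

    Any value occurring fewer than `min_support` times in its column is
    treated as small-scale noise and replaced by that column's most
    frequent value, so the large-scale (marginal) column distributions
    dominate the smoothed data. Illustrative only; the paper smooths
    with code tables, not per-column modes.
    """
    if not rows:
        return []
    smoothed = [list(r) for r in rows]
    for j in range(len(rows[0])):
        counts = Counter(r[j] for r in rows)
        mode = counts.most_common(1)[0][0]
        for r in smoothed:
            if counts[r[j]] < min_support:
                r[j] = mode
    return [tuple(r) for r in smoothed]

data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
print(smooth_columns(data, min_support=2))
# the rare values "b" and "y" are folded into the modes "a" and "x"
```

A model induced from the smoothed rows would then, following the abstract's argument, reflect only the large-scale structure of the original data.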
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Siebes, A., Kersten, R. (2012). Smoothing Categorical Data. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science(), vol 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3