Abstract
The so-called noise component was introduced by Banfield and Raftery (1993) to improve the robustness of cluster analysis based on the normal mixture model. The idea is to add a uniform distribution over the convex hull of the data as an additional mixture component. While this yields good results in many practical applications, the original proposal has two problems: 1) as shown by Hennig (2004), the method is not breakdown-robust; 2) the original approach does not define a proper ML estimator and does not have satisfactory asymptotic properties.
We discuss two alternatives. The first replaces the uniform distribution with a fixed constant, modelling an improper uniform distribution that does not depend on the data. This can be proven to be more robust, though choosing the tuning constant involved is tricky. The second approximates the ML estimator of a mixture of normals with a uniform distribution more precisely than the “convex hull” approach does. The approaches are compared in simulations and on a real data example.
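To make the first alternative concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It assumes EM-type iterations (a standard way to fit mixtures, not stated in the abstract) in which the noise component contributes a fixed constant `c` in place of a density. The function name `em_improper_noise`, the value `c = 0.01`, the lower bound `sigma_min` on the standard deviations, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def em_improper_noise(x, mu, sigma, c, n_iter=200, sigma_min=1e-2):
    """EM-type iterations for k normals plus a constant noise 'density' c (1-D).

    c is a fixed tuning constant (assumption: chosen by the user, not estimated).
    sigma_min is a crude lower bound on the standard deviations, used here only
    to keep the iterations away from degenerate solutions.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float).copy()
    sigma = np.asarray(sigma, dtype=float).copy()
    k = mu.size
    pi = np.full(k + 1, 1.0 / (k + 1))  # pi[0] is the noise proportion

    for _ in range(n_iter):
        # E-step: posterior membership weights; column 0 is the noise component.
        dens = np.empty((x.size, k + 1))
        dens[:, 0] = pi[0] * c
        for j in range(k):
            dens[:, j + 1] = pi[j + 1] * normal_pdf(x, mu[j], sigma[j])
        tau = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update proportions, means and standard deviations.
        nj = np.maximum(tau.sum(axis=0), 1e-300)  # guard against division by zero
        pi = tau.sum(axis=0) / x.size
        for j in range(k):
            w = tau[:, j + 1]
            mu[j] = np.sum(w * x) / nj[j + 1]
            var = np.sum(w * (x - mu[j]) ** 2) / nj[j + 1]
            sigma[j] = max(np.sqrt(var), sigma_min)

    return pi, mu, sigma, tau


# Simulated example: two normal clusters plus two gross outliers (hypothetical data).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(5.0, 1.0, 100), [50.0, -40.0]])
pi, mu, sigma, tau = em_improper_noise(x, mu=[-1.0, 6.0], sigma=[1.0, 1.0], c=0.01)
print("proportions (noise first):", pi)
print("means:", mu)
print("noise weights of the two outliers:", tau[-2:, 0])
```

Because `c` does not depend on the data, moving the outliers further away does not change the noise term in the E-step, whereas a uniform density over the convex hull would shrink as the hull grows. The result still depends on the chosen `c`, which is the tuning problem the abstract mentions.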
References
BANFIELD, J. D. and RAFTERY, A. E. (1993): Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803-821.
CAMPBELL, N. A. (1984): Mixture models and atypical values. Mathematical Geology, 16, 465-477.
CORETTO, P. and HENNIG, C. (2006): Identifiability for mixtures of distributions from a location-scale family with uniforms. DISES Working Papers No. 3.186, University of Salerno.
CORETTO, P. and HENNIG, C. (2007): Choice of the improper density in robust improper ML for finite normal mixtures. Submitted.
CUESTA-ALBERTOS, J. A., GORDALIZA, A. and MATRAN, C. (1997): Trimmed k-means: An Attempt to Robustify Quantizers. Annals of Statistics, 25, 553-576.
DONOHO, D. L. and HUBER, P. J. (1983): The notion of breakdown point. In P. J. Bickel, K. Doksum, and J. L. Hodges jr. (Eds.): A Festschrift for Erich L. Lehmann, Wadsworth, Belmont, CA, 157-184.
FRALEY, C. and RAFTERY, A. E. (1998): How Many Clusters? Which Clustering Method? Answers Via Model Based Cluster Analysis. Computer Journal, 41, 578-588.
HATHAWAY, R. J. (1985): A constrained formulation of maximum-likelihood estimates for normal mixture distributions. Annals of Statistics, 13, 795-800.
HENNIG, C. (2004): Breakdown points for maximum likelihood-estimators of location-scale mixtures. Annals of Statistics, 32, 1313-1340.
MCLACHLAN, G. J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.
REDNER, R. A. and WALKER, H. F. (1984): Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.
SCHWARZ, G. (1978): Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hennig, C., Coretto, P. (2008). The Noise Component in Model-based Cluster Analysis. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_16
DOI: https://doi.org/10.1007/978-3-540-78246-9_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics (R0)