Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

Model-based clustering and outlier detection with missing data

  • Regular Article
  • Published:
Advances in Data Analysis and Classification Aims and scope Submit manuscript

Abstract

The use of the multivariate contaminated normal (MCN) distribution in model-based clustering is recommended to cluster data characterized by mild outliers, the model can at the same time detect outliers automatically and produce robust parameter estimates in each cluster. However, one of the limitations of this approach is that it requires complete data, i.e. the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e. data sets with some values missing at random. Parameter estimation is obtained using the expectation-conditional maximization algorithm—a variant of the expectation-maximization algorithm in which the traditional maximization steps are instead replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model to a mixture of multivariate normal and Student’s t distributions for incomplete data. The simulation also includes a study on the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI machine learning repository). The results show that, while the Student’s t distribution gives similar classification performance, the MCN works better in detecting outliers with a lower false positive rate of outlier detection. The performance of all the techniques decreases linearly as the percentage of missing values increases.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

Notes

  1. https://cran.r-project.org/package=MixtureMissing.

  2. https://cran.r-project.org/package=MixtureMissing.

  3. https://archive.ics.uci.edu/ml/datasets/automobile.

References

  • Aitken A (1926) A series formula for the roots of algebraic and transcendental equations. Proc R Soc Edinb 45(1):14–22

    Article  MATH  Google Scholar 

  • Biernacki C, Celeux G, Govaert G (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal 41(3–4):561–575

    Article  MathSciNet  MATH  Google Scholar 

  • Böhning D, Dietz E, Schaub R, Schlattmann P, Lindsay BG (1994) The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Ann Inst Stat Math 46(2):373–388

    Article  MATH  Google Scholar 

  • Buck S (1960) A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J R Stat Soc B 22:302–306

    MathSciNet  MATH  Google Scholar 

  • Coretto P, Hennig C (2016) Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust gaussian clustering. J Am Stat Assoc 111(516):1648–1659

    Article  MathSciNet  Google Scholar 

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–22

    MathSciNet  MATH  Google Scholar 

  • García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A et al (2008) A general trimming approach to robust cluster analysis. Ann Stat 36(3):1324–1345

    Article  MathSciNet  MATH  Google Scholar 

  • Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, Hothorn T (2019) mvtnorm: multivariate normal and t distributions. R package version 1.0-10

  • Ghahramani Z, Jordan MI (1994) Learning from incomplete data. Technical report, USA

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

    Article  MATH  Google Scholar 

  • Karlis D, Xekalaki E (2003) Choosing initial values for the EM algorithm for finite mixtures. Comput Stat Data Anal 41(3–4):577–590

    Article  MathSciNet  MATH  Google Scholar 

  • Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. In: Dodge Y (ed) Statistical data analysis based on the L1-norm and related methods, pp 405–416

  • Lin TI (2014) Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Comput Stat Data Anal 71:183–195

    Article  MathSciNet  MATH  Google Scholar 

  • Liu C, Rubin DB (1994) The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81(4):633–648

    Article  MathSciNet  MATH  Google Scholar 

  • Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K, Studer M, Roudier P (2016) cluster: cluster analysis extended Rousseeuw et al. R package version 2.0.4

  • McNicholas PD (2016) Mixture model-based classification. CRC Press, Boca Raton

    Book  MATH  Google Scholar 

  • McNicholas PD, Murphy TB, McDaid AF, Frost D (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal 54(3):711–723

    Article  MathSciNet  MATH  Google Scholar 

  • Meng XL, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80(2):267–278

    Article  MathSciNet  MATH  Google Scholar 

  • Novi Inverardi PL, Taufer E (2020) Outlier detection through mixtures with an improper component. Electron J Appl Stat Anal 13(1):146–163

    Google Scholar 

  • Peel D, McLachlan GJ (2000) Robust mixture modelling using the t distribution. Stat Comput 10(4):339–348

    Article  Google Scholar 

  • Punzo A, McNicholas PD (2016) Parsimonious mixtures of multivariate contaminated normal distributions. Biom J 58(6):1506–1537

    Article  MathSciNet  MATH  Google Scholar 

  • Punzo A, Tortora C (2021) Multiple scaled contaminated normal distribution and its application in clustering. Stat Model 21(4):332–358

  • Punzo A, Mazza A, McNicholas PD (2016) Contaminatedmixt: an R package for fitting parsimonious mixtures of multivariate contaminated normal distributions. arXiv preprint arXiv:1606.03766

  • Qiu W, Joe H (2020) clusterGeneration: random cluster generation (with specified degree of separation). https://CRAN.R-project.org/package=clusterGeneration. R package version 1.3.7

  • R Core Team (2016) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  • Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850

    Article  Google Scholar 

  • Ritter G (2014) Robust cluster analysis and variable selection. CRC Press, Boca Raton

    Book  MATH  Google Scholar 

  • Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592

    Article  MathSciNet  MATH  Google Scholar 

  • Rubin DB (2004) Multiple imputation for nonresponse in surveys, vol 81. Wiley, Hoboken

    MATH  Google Scholar 

  • Salgado CM, Azevedo C, Proença H, Vieira SM (2016) Noise versus outliers, pp 163–183. https://doi.org/10.1007/978-3-319-43742-2_14

  • Serafini A, Murphy TB, Scrucca L (2020) Handling missing data in model-based clustering. arXiv preprint arXiv:2006.02954

  • Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, Chichester

    MATH  Google Scholar 

  • Tortora C, ElSherbiny A, Browne RP, Franczak BC, McNicholas PD, Amos DD (2020) MixGHD: model based clustering, classification and discriminant analysis using the mixture of generalized hyperbolic distributions. https://CRAN.R-project.org/package=MixGHD. R package version 2.3.4

  • van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67

    Article  Google Scholar 

  • Wang WL, Lin TI (2015) Robust model-based clustering via mixtures of skew-t distributions with missing information. Adv Data Anal Classif 9(4):423–445

    Article  MathSciNet  MATH  Google Scholar 

  • Wang H, Zhang Q, Luo B, Wei S (2004) Robust mixture modelling using multivariate \(t\)-distribution with missing information. Pattern Recognit Lett 25(6):701–710

    Google Scholar 

  • Wei Y, Tang Y, McNicholas PD (2019) Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Comput Stat Data Anal 130:18–41

    Article  MathSciNet  MATH  Google Scholar 

  • Wilks SS (1932) Moments and distributions of estimates of population parameters from fragmentary samples. Ann Math Stat 3(3):163–195

    Article  MATH  Google Scholar 

  • Yu C, Chen K, Yao W (2015) Outlier detection and robust mixture modeling using nonconvex penalized likelihood. J Stat Plan Inference 164:27–38

    Article  MathSciNet  MATH  Google Scholar 

  • Yu C, Yao W, Chen K (2017) A new method for robust mixture regression. Can J Stat 45(1):77–94

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Cristina Tortora.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Tong, H., Tortora, C. Model-based clustering and outlier detection with missing data. Adv Data Anal Classif 16, 5–30 (2022). https://doi.org/10.1007/s11634-021-00476-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11634-021-00476-1

Keywords

Mathematics Subject Classification

Navigation