DOI: 10.1145/775047.775050

Scalable robust covariance and correlation estimates for data mining

Published: 23 July 2002

Abstract

Covariance and correlation estimates have important applications in data mining. In the presence of outliers, classical estimates of covariance and correlation matrices are not reliable. A small fraction of outliers, in some cases even a single outlier, can distort the classical covariance and correlation estimates, making them virtually useless. That is, correlations for the vast majority of the data can be reported very inaccurately; principal components transformations can be misleading; and multidimensional outlier detection via Mahalanobis distances can fail to detect outliers. There is a large statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine-equivariant estimators that possess high breakdown points and small worst-case biases. All such estimators have unacceptable exponential complexity in the number of variables and quadratic complexity in the number of observations. In this paper we focus on several variants of robust covariance and correlation matrix estimates with quadratic complexity in the number of variables and linear complexity in the number of observations. These estimators are based on several forms of pairwise robust covariance and correlation estimates. The estimators studied include two fast estimators based on coordinate-wise robust transformations embedded in an overall procedure recently proposed by Maronna and Zamar [14]. We show that the estimators have attractive robustness properties, and give an example that uses one of the estimators in the new Insightful Miner data mining product.
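To make the idea of a pairwise robust estimate concrete, the following is a minimal sketch of one well-known pairwise construction, the Gnanadesikan-Kettenring identity (cf. [5]), with a normalized median absolute deviation as the robust scale. The abstract does not say which pairwise estimators the paper uses, so the choice of identity, the scale estimate, and all function names here (`mad_scale`, `pairwise_robust_correlation`, `robust_correlation_matrix`, `demo`) are assumptions made purely for illustration.

```python
import random
import statistics


def mad_scale(values):
    """Normalized median absolute deviation: a robust stand-in for the
    standard deviation (the factor 1.4826 makes it consistent at the normal)."""
    center = statistics.median(values)
    return 1.4826 * statistics.median(abs(v - center) for v in values)


def pairwise_robust_correlation(x, y, scale=mad_scale):
    """Robust correlation of two variables from robust scales of their
    standardized sum and difference (Gnanadesikan-Kettenring form).
    Assumes all the scales involved are nonzero."""
    sx, sy = scale(x), scale(y)
    u = [a / sx for a in x]
    v = [b / sy for b in y]
    s_plus = scale([a + b for a, b in zip(u, v)]) ** 2
    s_minus = scale([a - b for a, b in zip(u, v)]) ** 2
    # This ratio always lies in [-1, 1].
    return (s_plus - s_minus) / (s_plus + s_minus)


def robust_correlation_matrix(columns):
    """Assemble a p x p matrix from the p(p-1)/2 pairwise estimates,
    so the work grows quadratically with the number of variables."""
    p = len(columns)
    matrix = [[1.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1, p):
            r = pairwise_robust_correlation(columns[j], columns[k])
            matrix[j][k] = matrix[k][j] = r
    return matrix


def pearson_correlation(x, y):
    """Classical (non-robust) correlation, for comparison."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def demo(n=2000, rho=0.8, n_outliers=20, seed=0):
    """Simulate correlated normal data, overwrite 1% of the rows with gross
    outliers, and return (classical, robust) correlation estimates."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x.append(z1)
        y.append(rho * z1 + (1 - rho ** 2) ** 0.5 * z2)
    for i in range(n_outliers):
        x[i], y[i] = 20.0, -20.0  # hypothetical contamination
    return pearson_correlation(x, y), robust_correlation_matrix([x, y])[0][1]


if __name__ == "__main__":
    classical, robust = demo()
    print(f"classical: {classical:.3f}  robust: {robust:.3f}")
```

In this simulated setting the classical estimate is pulled far from the underlying correlation of 0.8 by 1% contamination, while the pairwise estimate stays close to it. Two caveats apply to the sketch: the medians are computed by sorting, so each pair costs O(n log n) rather than strictly linear time, and a matrix assembled from separate pairwise estimates is not guaranteed to be positive semidefinite.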

References

[1] M. B. Abdullah. On a Robust Correlation Coefficient. In The Statistician, 39, pp. 455--460, 1990.
[2] P. Davies. Asymptotic Behavior of S-Estimates of Multivariate Location Parameters and Dispersion Matrices. In The Annals of Statistics, 15, pp. 1269--1292, 1987.
[3] S. J. Devlin, R. Gnanadesikan and J. R. Kettenring. Robust Estimation of Dispersion Matrices and Principal Components. In Journal of the American Statistical Association, 76, pp. 354--362, 1981.
[4] D. L. Donoho. Breakdown Properties of Multivariate Location Estimators. Ph.D. Qualifying Paper, Dept. of Statistics, Harvard University, 1982.
[5] R. Gnanadesikan and J. R. Kettenring. Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. In Biometrics, 28, pp. 81--124, 1972.
[6] F. Hampel, P. Ronchetti, P. Rousseeuw and W. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986.
[7] P. J. Huber. Robust Statistics. John Wiley & Sons, 1981.
[8] G. S. Manku, S. Rajagopalan and B. Lindsay. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Data Sets. In ACM SIGMOD Record, 28, 1999.
[9] A. Marazzi and C. Ruffieux. Implementing M-Estimators of the Gamma Distribution. In Robust Statistics, Data Analysis and Computer Intensive Methods, in Honor of Peter J. Huber's 60th Birthday, Springer Verlag, 1996.
[10] R. Maronna. Personal Communication. In International Conference on Robust Statistics, 2002.
[11] R. Maronna. Robust M-Estimators of Multivariate Location and Scatter. In The Annals of Statistics, 4, pp. 51--67, 1976.
[12] R. A. Maronna, W. A. Stahel and V. Yohai. Bias-Robust Estimation of Multivariate Scatter Based on Projections. In Journal of Multivariate Analysis, 42, pp. 141--161, 1992.
[13] R. Maronna and V. Yohai. The Behaviour of the Stahel-Donoho Robust Multivariate Estimator. In Journal of the American Statistical Association, 90 (429), pp. 330--341, 1995.
[14] R. Maronna and R. Zamar. Robust Estimates of Location and Dispersion for High Dimensional Data Sets. In Technometrics, to appear, 2002.
[15] D. M. Rocke and D. L. Woodruff. Identification of Outliers in Multivariate Data. In Journal of the American Statistical Association, 91 (435), pp. 1047--1061, 1996.
[16] P. Rousseeuw. Least Median of Squares Regression. In Journal of the American Statistical Association, 79, pp. 871--880, 1984.
[17] P. Rousseeuw. Multivariate Estimation with High Breakdown Point. In Mathematical Statistics and Applications, pp. 283--297, Reidel Publishing, 1985.
[18] P. Rousseeuw and V. Driessen. A Fast Algorithm for the Minimum Covariance Determinant Estimator. In Technometrics, 41, pp. 212--223, 1999.
[19] P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, 1987.
[20] W. A. Stahel. Breakdown of Covariance Estimators. Research report, 31, Fachgruppe für Statistik, ETH, Zurich, 1981.

Published In

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN: 158113567X
DOI: 10.1145/775047
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data mining
  2. outliers
  3. robust estimators
  4. robust statistics
  5. scalable algorithm

Qualifiers

  • Article

Conference

KDD02
Acceptance Rates

KDD '02 Paper Acceptance Rate: 44 of 307 submissions, 14%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 21
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 01 Nov 2024.

Citations

Cited By

  • (2023) DyAnNet: A Scene Dynamicity Guided Self-Trained Video Anomaly Detection Network. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5530-5539. DOI: 10.1109/WACV56688.2023.00550. Online publication date: Jan-2023.
  • (2022) Robust and Sparse Estimation of Graphical Models Based on Multivariate Winsorization. Robust and Multivariate Statistical Methods, pp. 249-275. DOI: 10.1007/978-3-031-22687-8_12. Online publication date: 26-Nov-2022.
  • (2020) CoPASample: A Heuristics Based Covariance Preserving Data Augmentation. Machine Learning, Optimization, and Data Science, pp. 308-320. DOI: 10.1007/978-3-030-37599-7_26. Online publication date: 3-Jan-2020.
  • (2019) Distributed, Numerically Stable Distance and Covariance Computation with MPI for Extremely Large Datasets. 2019 IEEE International Congress on Big Data (BigDataCongress), pp. 77-84. DOI: 10.1109/BigDataCongress.2019.00023. Online publication date: Jul-2019.
  • (2019) Fast Robust Correlation for High-Dimensional Data. Technometrics, pp. 1-15. DOI: 10.1080/00401706.2019.1677270. Online publication date: 1-Nov-2019.
  • (2019) References. Robust Statistics, pp. 407-422. DOI: 10.1002/9781119214656.refs. Online publication date: 25-Jan-2019.
  • (2018) High-dimensional robust precision matrix estimation: Cellwise corruption under $\epsilon$-contamination. Electronic Journal of Statistics, 12:1. DOI: 10.1214/18-EJS1427. Online publication date: 1-Jan-2018.
  • (2018) Robust and sparse Gaussian graphical modelling under cell-wise contamination. Stat, 7:1. DOI: 10.1002/sta4.181. Online publication date: 23-Mar-2018.
  • (2017) Multivariate location and scatter matrix estimation under cellwise and casewise contamination. Computational Statistics & Data Analysis, 111, pp. 59-76. DOI: 10.1016/j.csda.2017.02.007. Online publication date: Jul-2017.
  • (2017) Cellwise robust regularized discriminant analysis. Statistical Analysis and Data Mining: The ASA Data Science Journal, 10:6, pp. 436-447. DOI: 10.1002/sam.11365. Online publication date: 10-Nov-2017.
