Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Local search methods for k-means with outliers

Published: 01 March 2017 Publication History

Abstract

We study the problem of k-means clustering in the presence of outliers. The goal is to cluster a set of data points to minimize the variance of the points assigned to the same cluster, with the freedom of ignoring a small set of data points that can be labeled as outliers. Clustering with outliers has received a lot of attention in the data processing community, but practical, efficient, and provably good algorithms remain unknown for the most popular k-means objective.
Our work proposes a simple local search-based algorithm for k-means clustering with outliers. We prove that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets. Using empirical evaluation on both synthetic and large-scale real-world data, we demonstrate that the algorithm dominates recently proposed heuristic approaches for the problem.

References

[1]
A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In RANDOM, pages 15--28, 2009.
[2]
C. Aggarwal and P. Yu. Outlier detection for high dimensional data. In SIGMOD, pages 37--46, 2001.
[3]
C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. Chapman and Hall/CRC, 2013.
[4]
A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection in large databases. In KDD, pages 164--169, 1996.
[5]
D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027--1035, 2007.
[6]
B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. PVLDB, 5(7):622--633, 2012.
[7]
S. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD, pages 29--38, 2003.
[8]
P. Berkhin. Survey of clustering data mining techniques. In J. Kogan, C. K. Nicholas, and M. Teboulle, editors, Grouping Multidimensional Data: Recent Advances in Clustering. Springer, 2006.
[9]
C. Bohm, C. Faloutsos, and C. Plant. Outlier-robust clustering using independent components. In SIGMOD, pages 185--198, 2008.
[10]
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD Record, 29(2):93--104, 2000.
[11]
M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642--651, 2001.
[12]
S. Chawla and A. Gionis. k-means-: A unified approach to clustering and outlier detection. In ICDM, pages 189--197, 2013.
[13]
K. Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826--835, 2008.
[14]
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In AAAI, pages 226--231, 1996.
[15]
F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl., 2:51--57, 2000.
[16]
D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In SOCG, pages 11--18, 2007.
[17]
S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. TKDE, 15(3):515--528, 2003.
[18]
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD, pages 73--84, 1998.
[19]
S. Guinepain and L. Gruenwald. Research issues in automatic database clustering. SIGMOD Record, 34(1):33--38, 2005.
[20]
A. Gupta and K. Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.
[21]
S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3--19, 2007.
[22]
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In KDD, pages 58--65, 1998.
[23]
A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31:264--323, 1999.
[24]
T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2--3):89--112, 2004.
[25]
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392--403, 1998.
[26]
E. M. Knorr and R. T. Ng. A unified notion of outliers: Properties and computation. In KDD, pages 19--22, 1997.
[27]
A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1+ϵ)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454--462, 2004.
[28]
M. Lichman. UCI machine learning repository, 2013.
[29]
S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--136, 1982.
[30]
K. E. M. and N. R. T. Finding intensional knowledge of distance-based outliers. In VLDB, pages 211--222, 1999.
[31]
G. Malkomes, M. Kusner, W. Chen, K. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. In NIPS, pages 1063--1071, 2015.
[32]
R. M. McCutchen and S. Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX, pages 165--178, 2008.
[33]
C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. TKDE, 18:188--201, 2006.
[34]
C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD, pages 559--570, 2000.
[35]
C. Ordonez and E. Omiecinski. Efficient disk-based k-means clustering for relational databases. TKDE, 16:909--921, 2004.
[36]
L. Ott, L. Pang, F. T. Ramos, and S. Chawla. On integrated clustering and outlier detection. In NIPS, pages 1359--1367, 2014.
[37]
S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In ICDE, pages 315--326, 2003.
[38]
S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD, pages 427--438, 2000.
[39]
X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14:1--37, 2007.

Cited By

View all
  • (2024)Can Evolutionary Clustering Have Theoretical Guarantees?IEEE Transactions on Evolutionary Computation10.1109/TEVC.2023.329664528:5(1220-1234)Online publication date: 1-Oct-2024
  • (2024)Clustering approximation via a fusion of multiple random samplesInformation Fusion10.1016/j.inffus.2023.101986101:COnline publication date: 1-Jan-2024
  • (2023)Fast algorithms for distributed k-clustering with outliersProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618970(13845-13868)Online publication date: 23-Jul-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 10, Issue 7
March 2017
132 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 March 2017
Published in PVLDB Volume 10, Issue 7

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)68
  • Downloads (Last 6 weeks)9
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Can Evolutionary Clustering Have Theoretical Guarantees?IEEE Transactions on Evolutionary Computation10.1109/TEVC.2023.329664528:5(1220-1234)Online publication date: 1-Oct-2024
  • (2024)Clustering approximation via a fusion of multiple random samplesInformation Fusion10.1016/j.inffus.2023.101986101:COnline publication date: 1-Jan-2024
  • (2023)Fast algorithms for distributed k-clustering with outliersProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618970(13845-13868)Online publication date: 23-Jul-2023
  • (2023)Scalable and globally optimal generalized L1 K-center clustering via constraint generation in mixed integer linear programmingProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v37i6.25857(7015-7023)Online publication date: 7-Feb-2023
  • (2023)Clustering what mattersProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v37i6.25818(6666-6674)Online publication date: 7-Feb-2023
  • (2023)Diversity maximization in the presence of outliersProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v37i10.26454(12338-12345)Online publication date: 7-Feb-2023
  • (2023)GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph ProcessingACM Transactions on Architecture and Code Optimization10.1145/360009120:3(1-24)Online publication date: 19-Jul-2023
  • (2023)k-Median/Means with Outliers Revisited: A Simple Fpt ApproximationComputing and Combinatorics10.1007/978-3-031-49193-1_22(295-302)Online publication date: 15-Dec-2023
  • (2023)Distributed k-Means with Outliers in General MetricsEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_32(474-488)Online publication date: 28-Aug-2023
  • (2022)k-Clustering with Fair OutliersProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498485(5-15)Online publication date: 11-Feb-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media