The effectiveness of Lloyd-type methods for the k-means problem

Published: 09 January 2013

Abstract

We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
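
As a rough illustration of the recipe described above, the sketch below pairs a probabilistic, distance-squared ("D^2") seeding step with standard Lloyd iterations. The well-clusterable instances referred to in the abstract are, roughly, those whose optimal k-means cost is only a small fraction of their optimal (k-1)-means cost. This is an illustrative sketch, not the paper's exact seeding procedure: the function names d2_seed and lloyd, the uniform choice of the first center, and the fixed iteration count are assumptions made for the example.

import numpy as np

def d2_seed(points, k, rng=None):
    # Pick k starting centers; after the first, each new center is drawn with
    # probability proportional to its squared distance to the nearest center
    # chosen so far, biasing the seeds toward well-separated points.
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centers = [points[rng.integers(n)]]  # first center uniform: an assumption of this sketch
    for _ in range(1, k):
        d2 = np.min(((points[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centers, dtype=float)

def lloyd(points, centers, iters=20):
    # Standard Lloyd iterations: assign every point to its nearest center,
    # then move each center to the centroid of the points assigned to it.
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(centers.shape[0]):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return centers, labels

For example, given an n-by-d NumPy array X and a target number of clusters k, centers, labels = lloyd(X, d2_seed(X, k)) returns the final centers and the induced clustering.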

Published In

Journal of the ACM, Volume 59, Issue 6
December 2012
213 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/2395116

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2013
Accepted: 01 September 2012
Revised: 01 September 2012
Received: 01 March 2010
Published in JACM Volume 59, Issue 6

Author Tags

  1. Randomized algorithms
  2. Approximation algorithms

Qualifiers

  • Research-article
  • Research
  • Refereed
