
Scalable K-Means by ranked retrieval

Published: 24 February 2014

Abstract

The k-means clustering algorithm has a long history and proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high-dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real-life data sets the wall-clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low-weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the intermediate results of nearest-neighbor computations to improve performance.
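The inverted assignment step described above can be sketched in Python. This is an illustrative toy, not the authors' implementation: sparse vectors are plain dicts, the names `build_index`, `sparsify`, and `assign` are hypothetical, and each centroid scores points by exhaustively walking its posting lists rather than via the WAND-style pruned top-k evaluation the paper actually uses.

```python
from collections import defaultdict

def build_index(points):
    """Inverted index over the data points: feature -> [(point_id, weight)]."""
    index = defaultdict(list)
    for pid, vec in enumerate(points):
        for feat, w in vec.items():
            index[feat].append((pid, w))
    return index

def sparsify(centroid, top_m):
    """Prune a centroid to its top_m highest-weight features."""
    kept = sorted(centroid.items(), key=lambda kv: -kv[1])[:top_m]
    return dict(kept)

def assign(points, centroids, top_m):
    """One 'centroids picking points' pass: each (sparsified) centroid is run
    as a query against the index, and every scored point keeps the centroid
    that gave it the highest dot-product score. Points never touched by any
    query remain unassigned."""
    index = build_index(points)
    best = {}  # point_id -> (score, centroid_id)
    for cid, centroid in enumerate(centroids):
        scores = defaultdict(float)
        for feat, cw in sparsify(centroid, top_m).items():
            for pid, pw in index.get(feat, ()):
                scores[pid] += cw * pw  # partial dot-product contribution
        for pid, s in scores.items():
            if pid not in best or s > best[pid][0]:
                best[pid] = (s, cid)
    assigned = {pid: cid for pid, (_, cid) in best.items()}
    unassigned = [pid for pid in range(len(points)) if pid not in assigned]
    return assigned, unassigned
```

A point whose features all fall outside every sparsified centroid is never reached by any posting-list traversal, which is exactly the source of the unassigned points that the paper handles with a separate final assignment pass.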





Published In

cover image ACM Conferences
WSDM '14: Proceedings of the 7th ACM International Conference on Web Search and Data Mining
February 2014
712 pages
ISBN:9781450323512
DOI:10.1145/2556195


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. k-means
  2. wand

Qualifiers

  • Research-article

Conference

WSDM 2014

Acceptance Rates

WSDM '14 Paper Acceptance Rate: 64 of 355 submissions, 18%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%




Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 2
Reflects downloads up to 18 Nov 2024


Cited By

  • (2024) Staleness-Reduction Mini-Batch K-Means. IEEE Transactions on Neural Networks and Learning Systems, 35(10):14424-14436. DOI: 10.1109/TNNLS.2023.3279122
  • (2024) A Spark-based Task Allocation Solution for Machine Learning in the Edge-Cloud Continuum. 2024 20th International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), 576-582. DOI: 10.1109/DCOSS-IoT61029.2024.00090
  • (2024) Beyond k-Means++: Towards better cluster exploration with geometrical information. Pattern Recognition, 146:110036. DOI: 10.1016/j.patcog.2023.110036
  • (2023) Marigold: Efficient k-Means Clustering in High Dimensions. Proceedings of the VLDB Endowment, 16(7):1740-1748. DOI: 10.14778/3587136.3587147
  • (2023) Fuzzy Model for Risk Characterization in Avocado Crops for Index Insurance Configuration. Advances in Computing, 271-284. DOI: 10.1007/978-3-031-47372-2_22
  • (2022) Consensus clustering for Bayesian mixture models. BMC Bioinformatics, 23(1). DOI: 10.1186/s12859-022-04830-8
  • (2022) Measurement of clustering effectiveness for document collections. Information Retrieval Journal, 25(3):239-268. DOI: 10.1007/s10791-021-09401-8
  • (2021) Fast and memory-efficient scRNA-seq k-means clustering with various distances. Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 1-8. DOI: 10.1145/3459930.3469523
  • (2021) k-sums Clustering. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2679-2687. DOI: 10.1145/3459637.3482359
  • (2021) Fast Knowledge Discovery in Social Media Data using Clustering via Ranking. 2021 9th International Conference on Cyber and IT Service Management (CITSM), 1-8. DOI: 10.1109/CITSM52892.2021.9588866
