
Scalable K-Means by ranked retrieval

Published: 24 February 2014

Abstract

The k-means clustering algorithm has a long history and proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high-dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real-life data sets the wall-clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low-weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the intermediate results of nearest-neighbor computations to improve performance.
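The inverted assignment step described above can be sketched in Python. This is an illustrative toy, not the authors' implementation: sparse vectors are plain dicts, the names `build_index`, `sparsify`, and `assign` are hypothetical, and each centroid scores points by exhaustively walking its posting lists rather than via the WAND-style pruned top-k evaluation the paper actually uses.

```python
from collections import defaultdict

def build_index(points):
    """Inverted index over the data points: feature -> [(point_id, weight)]."""
    index = defaultdict(list)
    for pid, vec in enumerate(points):
        for feat, w in vec.items():
            index[feat].append((pid, w))
    return index

def sparsify(centroid, top_m):
    """Prune a centroid to its top_m highest-weight features."""
    kept = sorted(centroid.items(), key=lambda kv: -kv[1])[:top_m]
    return dict(kept)

def assign(points, centroids, top_m):
    """One 'centroids picking points' pass: each (sparsified) centroid is run
    as a query against the index, and every scored point keeps the centroid
    that gave it the highest dot-product score. Points never touched by any
    query remain unassigned."""
    index = build_index(points)
    best = {}  # point_id -> (score, centroid_id)
    for cid, centroid in enumerate(centroids):
        scores = defaultdict(float)
        for feat, cw in sparsify(centroid, top_m).items():
            for pid, pw in index.get(feat, ()):
                scores[pid] += cw * pw  # partial dot-product contribution
        for pid, s in scores.items():
            if pid not in best or s > best[pid][0]:
                best[pid] = (s, cid)
    assigned = {pid: cid for pid, (_, cid) in best.items()}
    unassigned = [pid for pid in range(len(points)) if pid not in assigned]
    return assigned, unassigned
```

A point whose features all fall outside every sparsified centroid is never reached by any posting-list traversal, which is exactly the source of the unassigned points that the paper handles with a separate final assignment pass.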





Published In

cover image ACM Conferences
WSDM '14: Proceedings of the 7th ACM International Conference on Web Search and Data Mining
February 2014
712 pages
ISBN:9781450323512
DOI:10.1145/2556195


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. k-means
  2. wand

Qualifiers

  • Research-article

Conference

WSDM 2014

Acceptance Rates

WSDM '14 Paper Acceptance Rate: 64 of 355 submissions, 18%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%




Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 2
Reflects downloads up to 18 Nov 2024


Cited By

  • (2024) Staleness-Reduction Mini-Batch K-Means. IEEE Transactions on Neural Networks and Learning Systems, 35(10):14424-14436. DOI: 10.1109/TNNLS.2023.3279122
  • (2024) A Spark-based Task Allocation Solution for Machine Learning in the Edge-Cloud Continuum. 2024 20th International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), 576-582. DOI: 10.1109/DCOSS-IoT61029.2024.00090
  • (2024) Beyond k-Means++: Towards better cluster exploration with geometrical information. Pattern Recognition, 146:110036. DOI: 10.1016/j.patcog.2023.110036
  • (2023) Marigold: Efficient k-Means Clustering in High Dimensions. Proceedings of the VLDB Endowment, 16(7):1740-1748. DOI: 10.14778/3587136.3587147
  • (2023) Fuzzy Model for Risk Characterization in Avocado Crops for Index Insurance Configuration. Advances in Computing, 271-284. DOI: 10.1007/978-3-031-47372-2_22
  • (2022) Consensus clustering for Bayesian mixture models. BMC Bioinformatics, 23(1). DOI: 10.1186/s12859-022-04830-8
  • (2022) Measurement of clustering effectiveness for document collections. Information Retrieval Journal, 25(3):239-268. DOI: 10.1007/s10791-021-09401-8
  • (2021) Fast and memory-efficient scRNA-seq k-means clustering with various distances. Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 1-8. DOI: 10.1145/3459930.3469523
  • (2021) k-sums Clustering. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2679-2687. DOI: 10.1145/3459637.3482359
  • (2021) Fast Knowledge Discovery in Social Media Data using Clustering via Ranking. 2021 9th International Conference on Cyber and IT Service Management (CITSM), 1-8. DOI: 10.1109/CITSM52892.2021.9588866
