Abstract
Clustering is one of the most long-standing fundamental problems in the fields of computational geometry and algorithm design. In this paper, we focus on the variance-based clustering problems, included in which is the widely known k-means clustering. As the main contribution, a so-called semi brute-force search approach is proposed and analyzed from both theoretical and experimental aspects. The proposed approach samples a small percentage from the input dataset and search in a brute-force way for a k sized seed whose resulting Voronoi Diagram gives a good clustering of the original dataset. With high probability, the clustering is provably good to estimate the optimum under certain assumptions. Extensive experiments on both synthetic datasets and real-world datasets show that to obtain competitive results compared with k-means method (Llyod in IEEE Trans Inf Theory 28(2):129–137, 1982) and k-means++ method (Arthur and Vassilvitskii, (in: 18th ACM-SIAM symposium on discrete algorithms (SODA), 2007)), we only need a subset of 7% size comparing with the input dataset. If we are allowed to sample 15% from the dataset, our algorithm outperforms both the k-means method and k-means++ method in at least 80% of the clustering tasks. Also, an extended algorithm based on the same idea guarantees a balanced k-clustering result.
Similar content being viewed by others
Notes
Note that \(V_1\), \(V_2\) are not necessary to be independent for Fact 1 to hold, but it does not hurt to say so.
References
Lloyd, S.: Least squares quantization in pcm. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
Arthur, D., Vassilvitskii, S.: \(k\)-means++: The advantages of careful seeding. In: Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1027–1035 (2007)
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable \(k\)-means++. In: Proceedings of the 38th Very Large Data Bases (VLDB), pp. 622–633 (2012)
Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Approximate \(k\)-means++ in sublinear time. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), pp. 1459–1467 (2016)
Lattanzi, S., Sohler, C.: A better \(k\)-means++ algorithm via local search. In: Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3662–3671 (2019)
Choo, D., Grunau, C., Portmann, J., Rozhoň, V.: \(k\)-means++: few more steps yield constant approximation. In: Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 1909–1917 (2020)
Awasthi, P., Charikar, M., Krishnaswamy, R., Sinop, A.K.: The hardness of approximation of euclidean \(k\)-means. In: Proceedings of the 31st ACM Symposium on International Symposium on Computational Geometry (SoCG), pp. 754–767 (2015)
Liu, H., Han, J., Nie, F., Li, X.: Balanced clustering with least square regression. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pp. 2231–2237 (2017)
Li, Z., Nie, F., Chang, X., Ma, Z., Yang, Y.: Balanced clustering via exclusive lasso: A pragmatic approach. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pp. 3596–3603 (2018)
Lin, W., He, Z., Xiao, M.: Balanced clustering: A uniform model and fast algorithm. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2987–2993 (2019)
Xu, Y., Möhring, R.H., Xu, D., Zhang, Y., Zou, Y.: A constant fpt approximation algorithm for hard-capacitated \(k\)-means. Optim. Eng. 21(3), 709–722 (2020)
Cohen-Addad, V., Li, J.: On the fixed-parameter tractability of capacitated clustering. In: Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP), pp. 1–14 (2019)
Bandyapadhyay, S., Lochet, W., Saurabh, S.: FPT constant-approximations for capacitated clustering to minimize the sum of cluster radii. To appear in the proceedings of the 39th International Symposium on Computational Geometry (SoCG), (2023)
Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the \(k\)-means problem. J. ACM 59(6), 1–22 (2013)
Braverman, V., Cohen-Addad, V., Jiang, H.C.S., Krauthgamer, R., Schwiegelshohn, C., Toftrup, M.B., Wu, X.: The power of uniform sampling for coresets. In: Proceedings of the 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 462–473 (2022)
Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48(2), 274–296 (2001)
Inaba, M., Katoh, N., Imai, H.: Applications of weighted voronoi diagrams and randomization to variance-based \(k\)-clustering. In: Proceedings of the 10th ACM Symposium International Symposium on Computational Geometry (SoCG), pp. 332–339 (1994)
Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows - Theory, Algorithms and Applications. Prentice Hall, USA (1993)
Funding
Yicheng Xu is supported by Fundamental Research Project of Shenzhen City (No. JCYJ20210324102012033) and Shenzhen Science and Technology Program (No. ZDSYS20220422103800001 and CJGJZD20210408092806017). Vincent Chau is supported by NSFC (No. 62202100), and by the Fundamental Research Funds for the Central Universities (No. 2242022R10024). Chenchen Wu is supported by National Natural Science Foundation of China (No.11971349). Yong Zhang is supported by National Key R &D Program of China (No. 2022YFE0196100) and National Natural Science Foundation of China (No. 12071460). Vassilis Zissimopoulos is supported by the CAS President’s International Fellowship Initiative (No. 2019VTA0005). Yifei Zou is supported by Shandong Science Fund for Excellent Young Scholars (No. 2023HWYQ-007).
Author information
Authors and Affiliations
Contributions
YX and VC wrote the main manuscript text. All other authors participated on the problem discussion and helped to review the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, Y., Chau, V., Wu, C. et al. A Semi Brute-Force Search Approach for (Balanced) Clustering. Algorithmica 86, 130–146 (2024). https://doi.org/10.1007/s00453-023-01158-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-023-01158-4