Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Public Access

Distributed Partial Clustering

Published: 15 October 2019 Publication History

Abstract

Recent years have witnessed an increasing popularity of algorithm design for distributed data, largely due to the fact that massive datasets are often collected and stored in different locations. In the distributed setting, communication typically dominates the query processing time. Thus, it becomes crucial to design communication-efficient algorithms for queries on distributed data. Simultaneously, it has been widely recognized that partial optimizations, where we are allowed to disregard a small part of the data, provide us significantly better solutions. The motivation for disregarded points often arises from noise and other phenomena that are pervasive in large data scenarios.
In this article, we focus on partial clustering problems, k-center, k-median, and k-means objectives in the distributed model, and provide algorithms with communication sublinear of the input size. As a consequence, we develop the first algorithms for the partial k-median and means objectives that run in subquadratic running time. We also initiate the study of distributed algorithms for clustering uncertain data, where each data point can possibly fall into multiple locations under certain probability distribution.

References

[1]
Charu C. Aggarwal. 2013. A survey of uncertain data clustering algorithms. In Data Clustering: Algorithms and Applications. Chapman and Hall/CRC, 457--482.
[2]
Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. 2013. Distributed k-means and k-median clustering on general communication topologies. In Proceedings of the NIPS. 1995--2003.
[3]
Moses Charikar and Sudipto Guha. 1999. Improved combinatorial algorithms for the facility location and -median problems. In Proceedings of the FOCS. 378--388.
[4]
Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. 2001. Algorithms for facility location problems with outliers. In Proceedings of the SODA. 642--651.
[5]
Jiecao Chen, He Sun, D. Woodruff, and Qin Zhang. 2016. Communication-optimal distributed clustering. In Proceedings of the NIPS.
[6]
Ke Chen. 2008. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the SODA. 826--835.
[7]
Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. 2015. Dimensionality reduction for -means clustering and low rank approximation. In Proceedings of the STOC. 163--172.
[8]
Graham Cormode and Andrew McGregor. 2008. Approximation algorithms for clustering uncertain data. In Proceedings of the PODS. 191--200.
[9]
Graham Cormode, S. Muthukrishnan, and Wei Zhuang. 2007. Conquering the divide: Continuous clustering of distributed data streams. In Proceedings of the ICDE. IEEE, 1036--1045.
[10]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the (OSDI’04). USENIX Association, 10--10.
[11]
M. E. Dyer. 1986. On a multidimensional search technique and its application to the Euclidean one centre problem. SIAM J. Comput. 15, 3 (1986), 725--738.
[12]
Alina Ene, Sungjin Im, and Benjamin Moseley. 2011. Fast clustering using MapReduce. In Proceedings of the SIGKDD. 681--689.
[13]
Dan Feldman and Leonard J. Schulman. 2012. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the SODA. 1343--1354.
[14]
Teofilo F. Gonzalez. 1985. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 38 (1985), 293--306.
[15]
Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 515--528.
[16]
Sudipto Guha and Kamesh Munagala. 2009. Exceeding expectations and clustering uncertain data. In Proceedings of the PODS. 269--278.
[17]
Sungjin Im and Benjamin Moseley. 2015. Brief announcement: Fast and better distributed MapReduce algorithms for k-center clustering. In Proceedings of the SPAA. 65--67.
[18]
Kamal Jain and Vijay V. Vazirani. 2001. Approximation algorithms for metric facility location and -median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48, 2 (2001), 274--296.
[19]
Samir Khuller, Manish Purohit, and Kanthi K. Sarpatwar. 2014. Analyzing the optimal neighborhood: Algorithms for budgeted and partial connected dominating set problems. In Proceedings of the SODA. 1702--1713.
[20]
Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, and David P. Woodruff. 2014. Improved distributed principal component analysis. In Proceedings of the NIPS. 3113--3121.
[21]
Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, and Benjamin Moseley. 2015. Fast distributed k-center clustering with outliers on massive data. In Proceedings of the NIPS. 1063--1071.
[22]
Dan Suciu, Dan Olteanu, R. Christopher, and Christoph Koch. 2011. Probabilistic Databases (1st ed.). Morgan 8 Claypool Publishers.
[23]
Pavol D̆urĭs and José D.P. Rolim. 1998. Lower bounds on the multiparty communication complexity. J. Comput. Syst. Sci. 56, 1 (1998), 90--95.
[24]
Haitao Wang and Jingru Zhang. 2015. One-dimensional k-center on uncertain data. Theor. Comput. Sci. 602 (2015), 114--124.

Cited By

View all
  • (2024)MapReduce algorithms for robust center-based clustering in doubling metricsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104966194(104966)Online publication date: Dec-2024
  • (2024)C4y: a metric for distributed IoT clusteringCCF Transactions on Pervasive Computing and Interaction10.1007/s42486-024-00148-x6:2(133-149)Online publication date: 23-Feb-2024
  • (2023)Fair $k$-Center Problem with Outliers on Massive DataTsinghua Science and Technology10.26599/TST.2023.901001328:6(1072-1084)Online publication date: Dec-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Parallel Computing
ACM Transactions on Parallel Computing  Volume 6, Issue 3
Special Issue on SPAA 2017
September 2019
185 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3366783
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019
Accepted: 01 December 2018
Revised: 01 November 2018
Received: 01 October 2017
Published in TOPC Volume 6, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. k-centers
  2. k-means
  3. k-medians
  4. Clustering
  5. distributed computing

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)103
  • Downloads (Last 6 weeks)11
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)MapReduce algorithms for robust center-based clustering in doubling metricsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104966194(104966)Online publication date: Dec-2024
  • (2024)C4y: a metric for distributed IoT clusteringCCF Transactions on Pervasive Computing and Interaction10.1007/s42486-024-00148-x6:2(133-149)Online publication date: 23-Feb-2024
  • (2023)Fair $k$-Center Problem with Outliers on Massive DataTsinghua Science and Technology10.26599/TST.2023.901001328:6(1072-1084)Online publication date: Dec-2023
  • (2023)k-Center Clustering with Outliers in the MPC and Streaming Model2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS54959.2023.00090(853-863)Online publication date: May-2023
  • (2023)Distributed k-Means with Outliers in General MetricsEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_32(474-488)Online publication date: 24-Aug-2023
  • (2021)Parallel and efficient hierarchical k-median clusteringProceedings of the 35th International Conference on Neural Information Processing Systems10.5555/3540261.3541816(20333-20345)Online publication date: 6-Dec-2021
  • (2021)Expected size of random Tukey layers and convex layersComputational Geometry10.1016/j.comgeo.2021.101856(101856)Online publication date: Dec-2021
  • (2020)Soft computing model using cluster-PCA in port model for throughput forecastingSoft Computing10.1007/s00500-020-04786-yOnline publication date: 27-Feb-2020

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media