Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3534678.3539409acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article
Open access

Scalable Differentially Private Clustering via Hierarchically Separated Trees

Published: 14 August 2022 Publication History

Abstract

We study the private k-median and k-means clustering problem in d dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most O(d3/2 log n)⁆ OPT + O(kd2 log2 n/ε2), where ε is the privacy guarantee. (The dimension term, d, can be replaced with O(log k) using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, Õ (nkd), time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.

Supplemental Material

MP4 File
We present a fast and scalable algorithm to compute differentially private solution to k-median and k-means. Solutions computed in practice have cost close to the optimum.

References

[1]
Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. 2017. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on. Ieee, 61--72.
[2]
Alexandr Andoni, Zhao Song, Clifford Stein, Zhengyu Wang, and Peilin Zhong. 2018. Parallel graph connectivity in log diameter rounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 674--685.
[3]
David Arthur and Sergei Vassilvitskii. 2007. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1027--1035.
[4]
Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. 2017. Code of the algorithm described in Differentially Private Clustering in High-Dimensional Euclidean Spaces. https://github.com/mouwenlong/dpclustering-icml17
[5]
Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. 2017. Differentially Private Clustering in High-Dimensional Euclidean Spaces. In Proceedings of the 34th International Conference on Machine Learning, ICML (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 322--331.
[6]
Pierre Baldi, Peter Sadowski, and Daniel Whiteson. 2014. Searching for exotic particles in high-energy physics with deep learning. Nature communications 5, 1 (2014), 1--9.
[7]
Paul Beame, Paraschos Koutris, and Dan Suciu. 2013. Communication steps for parallel query processing. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI symposium on Principles of database systems. ACM, 273--284.
[8]
Paul Beame, Paraschos Koutris, and Dan Suciu. 2017. Communication steps for parallel query processing. Journal of the ACM (JACM) 64, 6 (2017), 1--58.
[9]
Rajen Bhatt and Abhinav Dhall. 2009. https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
[10]
Jock A Blackard and Denis J Dean. 1999. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture 24, 3 (1999), 131--151.
[11]
JarosŁaw Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. 2014. An improved approximation for k-median, and positive correlation in budgeted optimization. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms. SIAM, 737--756.
[12]
Alisa Chang, Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. 2021. Locally Private k-Means in One Round. CoRR abs/2104.09734 (2021). https://arxiv.org/abs/2104.09734
[13]
Moses S Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. 380--388.
[14]
Anamay Chaturvedi, Huy L. Nguyen, and Eric Xu. 2020. Differentially private kmeans clustering via exponential mechanism and max cover. CoRR abs/2009.01220 (2020). https://arxiv.org/abs/2009.01220
[15]
Vincent Cohen-Addad, Andreas Emil Feldmann, and David Saulpic. 2021. Nearlinear Time Approximation Schemes for Clustering in Doubling Metrics. In J. ACM, Vol. 68. 44:1--44:34. https://doi.org/10.1145/3477541
[16]
Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. 2021. Improved Coresets and Sublinear Algorithms for Power Means in Euclidean Spaces. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and JenniferWortman Vaughan (Eds.). 21085--21098. https://proceedings.neurips.cc/paper/2021/hash/b035d6563a2adac9f822940c145263ce-Abstract.html
[17]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.
[18]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[19]
Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3-4 (2014), 211--407. https://doi.org/10.1561/0400000042
[20]
Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. 2003. A tight bound on approximating arbitrary metrics by tree metrics. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, June 9-11, 2003, San Diego, CA, USA. 448--455.
[21]
Dan Feldman, Amos Fiat, Haim Kaplan, and Kobbi Nissim. 2009. Private coresets. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, Michael Mitzenmacher (Ed.). ACM, 361--370. https://doi.org/10.1145/1536414.1536465
[22]
Pasi Fränti and Sami Sieranoja. 2018. K-means properties on six clustering benchmark datasets., 4743--4759 pages. http://cs.uef.fi/sipu/datasets/
[23]
Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. 2020. Differentially Private Clustering: Tight Approximation Ratios. In Advances in Neural Information Processing Systems, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
[24]
Michael T Goodrich, Nodari Sitchinava, and Qin Zhang. 2011. Sorting, searching, and simulation in the mapreduce framework. In International Symposium on Algorithms and Computation. Springer, 374--383.
[25]
Sariel Har-Peled. 2011. Geometric approximation algorithms. Number 173. American Mathematical Soc.
[26]
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS operating systems review, Vol. 41. ACM, 59--72.
[27]
Roger Iyengar, Joseph P. Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. 2019. Towards Practical Differentially Private Convex Optimization. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019. IEEE, 299--316.
[28]
Kamal Jain, Mohammad Mahdian, Evangelos Markakis, Amin Saberi, and Vijay V Vazirani. 2003. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM (JACM) 50, 6 (2003), 795--824.
[29]
Matthew Jones, Huy L. Nguyen, and Thy D. Nguyen. 2021. Differentially Private Clustering via Maximum Coverage. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021. AAAI Press, 11555--11563. https://ojs.aaai.org/index.php/AAAI/article/view/17375
[30]
Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. 2004. A local search approximation algorithm for k-means clustering. Computational Geometry 28, 2-3 (2004), 89--112.
[31]
Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. 2010. A model of computation for MapReduce. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms. SIAM, 938--948.
[32]
Shi Li. 2011. A 1.488 approximation algorithm for the uncapacitated facility location problem. In International Colloquium on Automata, Languages, and Programming. Springer, 77--88.
[33]
Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129--137.
[34]
Zhigang Lu and Hong Shen. 2020. Differentially Private k-Means Clustering with Guaranteed Convergence. CoRR abs/2002.01043 (2020). https://arxiv.org/abs/2002.01043
[35]
Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. 2019. Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. 1027--1038.
[36]
Prashanth Mohan, Abhradeep Thakurta, Elaine Shi, Dawn Song, and David Culler. 2012. GUPT: privacy preserving data analysis made easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 349--360.
[37]
Richard Nock, Raphaël Canyasse, Roksana Boreli, and Frank Nielsen. 2016. kvariates++: more pluses in the k-means++. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (JMLR Workshop and Conference Proceedings, Vol. 48), Maria-Florina Balcan and Kilian Q. Weinberger (Eds.). JMLR.org, 145--154.
[38]
Uri Stemmer and Haim Kaplan. 2018. Differentially Private k-Means with Constant Multiplicative Error. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (Eds.). 5436--5446. https://proceedings.neurips.cc/paper/2018/hash/32b991e5d77ad140559ffb95522992d0-Abstract.html
[39]
Tom White. 2012. Hadoop: The definitive guide. "O'Reilly Media, Inc.".
[40]
Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10--10 (2010), 95.

Cited By

View all
  • (2024)PPPCTComputers in Biology and Medicine10.1016/j.compbiomed.2024.108351173:COnline publication date: 9-Jul-2024
  • (2023)k-means clustering with distance-based privacyProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3666981(19570-19593)Online publication date: 10-Dec-2023
  • (2022)Near-optimal private and scalable k-clusteringProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3601030(10462-10475)Online publication date: 28-Nov-2022

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN:9781450393850
DOI:10.1145/3534678
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 August 2022

Check for updates

Author Tags

  1. clustering
  2. differential privacy
  3. mpc

Qualifiers

  • Research-article

Funding Sources

  • Independent Research Fund Denmark (DFF)
  • French National Research Agency (ANR)

Conference

KDD '22
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)273
  • Downloads (Last 6 weeks)20
Reflects downloads up to 02 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)PPPCTComputers in Biology and Medicine10.1016/j.compbiomed.2024.108351173:COnline publication date: 9-Jul-2024
  • (2023)k-means clustering with distance-based privacyProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3666981(19570-19593)Online publication date: 10-Dec-2023
  • (2022)Near-optimal private and scalable k-clusteringProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3601030(10462-10475)Online publication date: 28-Nov-2022

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media