DOI: 10.5555/2986459.2986698

Scalable training of mixture models via coresets

Published: 12 December 2011

Abstract

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk³/ε²) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.
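
To make the coreset idea concrete, the sketch below illustrates the importance-sampling pattern such constructions typically follow; it is an illustrative assumption, not the authors' exact algorithm. Each point's importance is estimated from a rough k-means solution, points are sampled proportionally to that estimate, and each sample is reweighted by the inverse of its sampling probability so that weighted costs on the coreset are unbiased estimates of costs on the full data. The function names and mixing constants below are assumptions made for illustration.

```python
import numpy as np

def rough_kmeans(X, k, iters=10, seed=0):
    # Cheap Lloyd iterations; used only to estimate per-point importance.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def gmm_coreset(X, k, m, seed=0):
    # Sample m weighted points whose weighted cost approximates that of the
    # full data. The sampling distribution mixes a uniform term with each
    # point's share of the k-means cost -- a crude stand-in for the
    # sensitivity bounds proved in the paper.
    rng = np.random.default_rng(seed)
    centers = rough_kmeans(X, k, seed=seed)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    s = 0.5 / len(X) + 0.5 * d2 / d2.sum()   # sampling probabilities, sum to 1
    idx = rng.choice(len(X), size=m, replace=True, p=s)
    weights = 1.0 / (m * s[idx])             # inverse-probability reweighting
    return X[idx], weights

# Usage: compress 100,000 points in R^3 into 500 weighted points, then hand
# (C, w) to any weighted-EM routine in place of the full data set.
X = np.random.default_rng(1).normal(size=(100_000, 3))
C, w = gmm_coreset(X, k=5, m=500)
print(C.shape, float(w.sum()))   # (500, 3); weights sum to roughly n
```

Because such coresets compose (a coreset of the union of two data sets can be obtained from the union of their coresets, at a small loss in accuracy), the same routine can be run independently on map-reduce partitions or on chunks of a stream and the results merged, which matches the distributed and streaming settings mentioned in the abstract.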

Published In

NIPS'11: Proceedings of the 24th International Conference on Neural Information Processing Systems
December 2011
2752 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States

Cited By

  • (2022) Pruning neural networks via coresets and convex geometry. Proceedings of the 36th International Conference on Neural Information Processing Systems, 38003-38019. DOI: 10.5555/3600270.3603024. Online publication date: 28-Nov-2022.
  • (2020) Small-GAN. Proceedings of the 37th International Conference on Machine Learning, 9005-9015. DOI: 10.5555/3524938.3525773. Online publication date: 13-Jul-2020.
  • (2020) A Complexity Theoretical Study of Fuzzy K-Means. ACM Transactions on Algorithms, 16(4):1-25. DOI: 10.1145/3409385. Online publication date: 16-Sep-2020.
  • (2019) Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551-588. DOI: 10.5555/3322706.3322721. Online publication date: 1-Jan-2019.
  • (2019) Efficient construction of approximate ad-hoc ML models through materialization and reuse. Proceedings of the VLDB Endowment, 11(11):1468-1481. DOI: 10.14778/3236187.3269462. Online publication date: 17-Jan-2019.
  • (2018) On coresets for logistic regression. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6562-6571. DOI: 10.5555/3327757.3327763. Online publication date: 3-Dec-2018.
  • (2018) Efficient construction of approximate ad-hoc ML models through materialization and reuse. Proceedings of the VLDB Endowment, 11(11):1468-1481. DOI: 10.5555/3236187.3269462. Online publication date: 1-Jul-2018.
  • (2018) A fast approximation scheme for low-dimensional k-means. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 430-440. DOI: 10.5555/3174304.3175298. Online publication date: 7-Jan-2018.
  • (2018) Efficient construction of approximate ad-hoc ML models through materialization and reuse. Proceedings of the VLDB Endowment, 11(11):1468-1481. DOI: 10.14778/3236187.3236199. Online publication date: 5-Oct-2018.
  • (2018) Scalable k-Means Clustering via Lightweight Coresets. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1119-1127. DOI: 10.1145/3219819.3219973. Online publication date: 19-Jul-2018.
