DOI:10.1145/2783258.2783323

Petuum: A New Platform for Distributed Machine Learning on Big Data

Published: 10 August 2015

Abstract

How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graphical representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by leveraging several fundamental properties underlying ML programs that make them different from conventional operation-centric programs: error tolerance, dynamic structure, and nonuniform convergence; all stem from the optimization-centric nature shared in ML programs' mathematical definitions, and the iterative-convergent behavior of their algorithmic solutions. These properties present unique opportunities for an integrative system design, built on bounded-latency network synchronization and dynamic load-balancing scheduling, which is efficient, programmable, and enjoys provable correctness guarantees. We demonstrate how such a design in light of ML-first principles leads to significant performance improvements versus well-known implementations of several ML programs, allowing them to run in much less time and at considerably larger model sizes, on modestly-sized computer clusters.
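
A minimal sketch of the bounded-staleness ("stale synchronous parallel") synchronization idea described above may help make the design concrete. This is not Petuum's API: every name below (BoundedStalenessTable, run_worker, the toy gradient step) is hypothetical, and Python threads stand in for cluster workers so the example stays self-contained. It only illustrates how a bounded clock gap trades consistency for throughput while iterative-convergent updates absorb the resulting error.

import threading
from collections import defaultdict

class BoundedStalenessTable:
    """Shared parameter store with a bounded clock gap between workers (hypothetical sketch)."""
    def __init__(self, num_workers, staleness):
        self.staleness = staleness          # max iterations the fastest worker may run ahead
        self.clocks = [0] * num_workers     # per-worker iteration counters
        self.params = defaultdict(float)    # shared model parameters
        self.cond = threading.Condition()

    def inc(self, key, delta):
        # Error tolerance: updates are applied without a global barrier;
        # slightly stale reads are acceptable for iterative-convergent ML.
        with self.cond:
            self.params[key] += delta

    def get(self, worker_id, key):
        # A worker at clock c may only read once the slowest worker has reached
        # clock c - staleness; otherwise it blocks until stragglers catch up.
        with self.cond:
            while min(self.clocks) < self.clocks[worker_id] - self.staleness:
                self.cond.wait()
            return self.params[key]

    def clock(self, worker_id):
        # End of one local iteration; wake any workers waiting on stragglers.
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()

def run_worker(table, worker_id, num_iters):
    for _ in range(num_iters):
        w = table.get(worker_id, "w")       # bounded-stale read
        table.inc("w", -0.01 * (w - 1.0))   # toy gradient step toward w = 1.0
        table.clock(worker_id)

if __name__ == "__main__":
    table = BoundedStalenessTable(num_workers=4, staleness=2)
    threads = [threading.Thread(target=run_worker, args=(table, i, 200))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("w converged to approximately", round(table.params["w"], 3))

Setting staleness=0 recovers bulk-synchronous behavior, while a large staleness approaches fully asynchronous updates. The paper's model-parallel side additionally prioritizes which parameters to update based on their distance from convergence, which this data-parallel sketch does not show.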

Supplementary Material

MP4 File (p1335.mp4)

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN:9781450336642
DOI:10.1145/2783258
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 August 2015

Author Tags

  1. big data
  2. big model
  3. data-parallelism
  4. distributed systems
  5. machine learning
  6. model-parallelism
  7. theory

Qualifiers

  • Research-article

Conference

KDD '15

Acceptance Rates

KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 173
  • Downloads (Last 6 weeks): 23
Reflects downloads up to 13 Nov 2024

Citations

Cited By

  • (2024) Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment. Proceedings of the 53rd International Conference on Parallel Processing, 10.1145/3673038.3673095, 514-523. Online publication date: 12-Aug-2024
  • (2024) KLNK: Expanding Page Boundaries in a Distributed Shared Memory System. IEEE Transactions on Parallel and Distributed Systems, 10.1109/TPDS.2024.3409882, 35(9), 1524-1535. Online publication date: Sep-2024
  • (2024) Asynchronous Decentralized Federated Learning for Heterogeneous Devices. IEEE/ACM Transactions on Networking, 10.1109/TNET.2024.3424444, 32(5), 4535-4550. Online publication date: Oct-2024
  • (2024) Privacy-Preserving and Secure Industrial Big Data Analytics: A Survey and the Research Framework. IEEE Internet of Things Journal, 10.1109/JIOT.2024.3353727, 11(11), 18976-18999. Online publication date: 1-Jun-2024
  • (2024) AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 10.1109/ICDE60146.2024.00394, 5238-5251. Online publication date: 13-May-2024
  • (2024) A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning. Collaborative Computing: Networking, Applications and Worksharing, 10.1007/978-3-031-54531-3_21, 385-403. Online publication date: 23-Feb-2024
  • (2023) Scaling Machine Learning with a Ring-based Distributed Framework. Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence, 10.1145/3638584.3638667, 23-32. Online publication date: 8-Dec-2023
  • (2023) Good Intentions: Adaptive Parameter Management via Intent Signaling. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 10.1145/3583780.3614895, 2156-2166. Online publication date: 21-Oct-2023
  • (2023) HPCClusterScape: Increasing Transparency and Efficiency of Shared High-Performance Computing Clusters for Large-scale AI Models. 2023 IEEE Visualization in Data Science (VDS), 10.1109/VDS60365.2023.00008, 21-29. Online publication date: 15-Oct-2023
  • (2023) Two-layer accumulated quantized compression for communication-efficient federated learning: TLAQC. Scientific Reports, 10.1038/s41598-023-38916-x, 13(1). Online publication date: 19-Jul-2023