
DOI: 10.1145/3326285.3329065 · IWQoS Conference Proceedings · Research article

Chic: experience-driven scheduling in machine learning clusters

Published: 24 June 2019

Abstract

Large-scale machine learning (ML) models are routinely trained in a distributed fashion, due to their increasing complexity and growing data sizes. In a shared cluster running multiple distributed training workloads under the parameter server framework, it is important to determine the appropriate number of concurrent workers and parameter servers for each ML workload over time, in order to minimize the average completion time and improve resource utilization. Existing schedulers for machine learning workloads rely on meticulously designed heuristics. However, because the execution environment is highly complex and dynamic, it is challenging to construct an accurate mathematical model for making online decisions. In this paper, we design an experience-driven approach that learns to manage the cluster directly from experience rather than from such a model. We propose Chic, a scheduler tailored to machine learning workloads in a cluster that leverages deep reinforcement learning techniques. With our design of the state space, action space, and reward function, Chic trains a deep neural network with a modified version of the cross-entropy method to approximate a policy that assigns workers and parameter servers to future workloads based on the agent's accumulated experience. Furthermore, a simplified variant named Chic-Pair, with a shorter policy training time, is proposed; it assigns workers and parameter servers in pairs. We compare Chic and Chic-Pair with state-of-the-art heuristics, and our results show that both reduce the average training time of machine learning workloads significantly under a wide variety of conditions.
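The cross-entropy method (CEM) at the core of Chic's policy training can be illustrated with a minimal, generic sketch: sample candidate parameters from a search distribution, score them, refit the distribution to the elite fraction, and repeat. Everything below is an illustrative stand-in, assuming a Gaussian search distribution and a toy quadratic reward; the paper's modified CEM, neural-network policy, state encoding, and cluster environment are not reproduced here.

```python
import numpy as np

def toy_reward(theta):
    # Hypothetical stand-in for "negative average completion time" of a
    # schedule produced by a policy parameterized by theta; the real
    # reward would come from the cluster (or a simulator of it).
    target = np.array([4.0, 2.0])  # illustrative optimal parameter setting
    return -np.sum((theta - target) ** 2)

def cem(dim=2, pop_size=50, elite_frac=0.2, iters=30, seed=0):
    """Standard cross-entropy method over a diagonal Gaussian."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.full(dim, 5.0)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        # Sample a population of candidate parameters.
        samples = rng.normal(mu, sigma, size=(pop_size, dim))
        rewards = np.array([toy_reward(s) for s in samples])
        # Keep the top elite_frac of candidates and refit the Gaussian
        # to them; the small constant keeps sigma from collapsing to 0.
        elite = samples[np.argsort(rewards)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

best = cem()
```

In this sketch the distribution mean converges toward the toy optimum [4.0, 2.0]. In Chic, the same select-the-elite idea is applied to the weights of a deep neural network whose outputs decide worker and parameter server assignments.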


Cited By

  • (2023) "A combined priority scheduling method for distributed machine learning," EURASIP Journal on Wireless Communications and Networking, 2023:1. DOI: 10.1186/s13638-023-02253-4. Online publication date: 29 May 2023.
  • (2023) "Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads," IEEE/ACM Transactions on Networking, 31:2, pp. 634-647. DOI: 10.1109/TNET.2022.3202529. Online publication date: April 2023.
  • (2023) "On a Meta Learning-Based Scheduler for Deep Learning Clusters," IEEE Transactions on Cloud Computing, 11:4, pp. 3631-3642. DOI: 10.1109/TCC.2023.3308161. Online publication date: October 2023.
  • (2021) "Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-15. DOI: 10.1145/3458817.3480859. Online publication date: 14 November 2021.




Published In

IWQoS '19: Proceedings of the International Symposium on Quality of Service
June 2019
420 pages
ISBN:9781450367783
DOI:10.1145/3326285
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. deep reinforcement learning
  2. distributed machine learning
  3. workload scheduling

Qualifiers

  • Research-article

Funding Sources

  • Natural Sciences and Engineering Research Council (NSERC) of Canada

Conference

IWQoS '19



