
DOI: 10.1145/3326285.3329065 · IWQoS Conference Proceedings · Research article

Chic: experience-driven scheduling in machine learning clusters

Published: 24 June 2019

Abstract

Large-scale machine learning (ML) models are routinely trained in a distributed fashion, due to their increasing complexity and growing data sizes. In a shared cluster running multiple distributed training workloads under the parameter server framework, it is important to determine the appropriate number of concurrent workers and parameter servers for each ML workload over time, in order to minimize the average completion time and improve resource utilization. Existing schedulers for machine learning workloads rely on meticulously designed heuristics. However, because the execution environment is highly complex and dynamic, it is challenging to construct an accurate mathematical model for making online decisions. In this paper, we design an experience-driven approach that learns to manage the cluster directly from experience rather than from such a model. We propose Chic, a scheduler tailored to machine learning workloads in a cluster that leverages deep reinforcement learning techniques. With our design of the state space, action space, and reward function, Chic trains a deep neural network with a modified version of the cross-entropy method to approximate a policy that assigns workers and parameter servers to future workloads based on the agent's accumulated experience. Furthermore, a simplified variant named Chic-Pair, with a shorter policy training time, is proposed; it assigns workers and parameter servers in pairs. We compare Chic and Chic-Pair with state-of-the-art heuristics, and our results show that both reduce the average training time of machine learning workloads significantly under a wide variety of conditions.
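The cross-entropy method (CEM) at the core of Chic's policy training can be illustrated with a minimal, generic sketch: sample candidate parameters from a search distribution, score them, refit the distribution to the elite fraction, and repeat. Everything below is an illustrative stand-in, assuming a Gaussian search distribution and a toy quadratic reward; the paper's modified CEM, neural-network policy, state encoding, and cluster environment are not reproduced here.

```python
import numpy as np

def toy_reward(theta):
    # Hypothetical stand-in for "negative average completion time" of a
    # schedule produced by a policy parameterized by theta; the real
    # reward would come from the cluster (or a simulator of it).
    target = np.array([4.0, 2.0])  # illustrative optimal parameter setting
    return -np.sum((theta - target) ** 2)

def cem(dim=2, pop_size=50, elite_frac=0.2, iters=30, seed=0):
    """Standard cross-entropy method over a diagonal Gaussian."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.full(dim, 5.0)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        # Sample a population of candidate parameters.
        samples = rng.normal(mu, sigma, size=(pop_size, dim))
        rewards = np.array([toy_reward(s) for s in samples])
        # Keep the top elite_frac of candidates and refit the Gaussian
        # to them; the small constant keeps sigma from collapsing to 0.
        elite = samples[np.argsort(rewards)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

best = cem()
```

In this sketch the distribution mean converges toward the toy optimum [4.0, 2.0]. In Chic, the same select-the-elite idea is applied to the weights of a deep neural network whose outputs decide worker and parameter server assignments.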


Cited By

  • (2023) "A combined priority scheduling method for distributed machine learning," EURASIP Journal on Wireless Communications and Networking, 2023:1. DOI: 10.1186/s13638-023-02253-4. Online publication date: 29 May 2023.
  • (2023) "Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads," IEEE/ACM Transactions on Networking, 31:2, pp. 634-647. DOI: 10.1109/TNET.2022.3202529. Online publication date: April 2023.
  • (2023) "On a Meta Learning-Based Scheduler for Deep Learning Clusters," IEEE Transactions on Cloud Computing, 11:4, pp. 3631-3642. DOI: 10.1109/TCC.2023.3308161. Online publication date: October 2023.
  • (2021) "Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-15. DOI: 10.1145/3458817.3480859. Online publication date: 14 November 2021.




Published In

IWQoS '19: Proceedings of the International Symposium on Quality of Service
June 2019
420 pages
ISBN:9781450367783
DOI:10.1145/3326285
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. deep reinforcement learning
  2. distributed machine learning
  3. workload scheduling

Qualifiers

  • Research-article

Funding Sources

  • Natural Sciences and Engineering Research Council (NSERC) of Canada

Conference

IWQoS '19



