
DOI: 10.1145/3357223.3362719
Research Article | Open Access

HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline

Published: 20 November 2019

Abstract

Prior research on resource scheduling for machine learning training workloads has largely focused on minimizing job completion times. In practice, these training workloads often run as a hyperparameter search, collectively evaluating a large number of parameter values that control the learning process. For such workloads, it is preferable to identify and maximally provision the best-performing hyperparameter configuration (trial), so as to reach the highest accuracy as soon as possible.
To optimally trade off evaluating multiple configurations against training the most promising ones by a fixed deadline, we design and build HyperSched, a dynamic application-level resource scheduler that tracks, identifies, and preferentially allocates resources to the best-performing trials in order to maximize accuracy by the deadline. HyperSched leverages three properties of hyperparameter search workloads overlooked in prior work: trial disposability, progressively identifiable rankings among different configurations, and space-time constraints. Exploiting these properties, it outperforms standard hyperparameter search algorithms across a variety of benchmarks.
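To make the scheduling idea concrete, below is a minimal, hypothetical Python sketch of a deadline-aware search loop in the spirit of the abstract. It is not the authors' implementation: the `Trial` class, its toy progress model, and the worker accounting are all invented for illustration, and the real HyperSched makes these decisions far more carefully on top of a distributed execution framework.

```python
import random
import time


class Trial:
    """Hypothetical stand-in for one hyperparameter configuration's
    training job. It trains in short rounds, reports an accuracy score,
    can be stopped at any time (trial disposability), and can be
    rescaled to more workers (the space-time tradeoff)."""

    def __init__(self, config, workers=1):
        self.config = config
        self.workers = workers
        self.accuracy = 0.0
        self.stopped = False

    def train_one_round(self):
        # Toy progress model with diminishing returns from extra workers;
        # a real trial would run training steps and report measured
        # validation accuracy.
        self.accuracy += random.uniform(0.0, 0.05) * self.workers ** 0.5


def deadline_search(configs, total_workers, deadline_s):
    """Start one trial per worker, then progressively stop the worst
    trials and hand their workers to the current leader, so that by the
    deadline most resources are training the most promising config."""
    trials = [Trial(c) for c in configs[:total_workers]]
    start = time.time()
    while time.time() - start < deadline_s:
        live = [t for t in trials if not t.stopped]
        if not live:
            break
        for t in live:
            t.train_one_round()
        # Progressively identifiable rankings: the closer the deadline,
        # the fewer trials are allowed to survive.
        live.sort(key=lambda t: t.accuracy, reverse=True)
        frac_left = 1.0 - (time.time() - start) / deadline_s
        keep = max(1, int(len(live) * frac_left))
        freed = sum(t.workers for t in live[keep:])
        for t in live[keep:]:
            t.stopped = True
        live[0].workers += freed  # concentrate freed workers on the leader
    return max(trials, key=lambda t: t.accuracy)


if __name__ == "__main__":
    configs = [{"lr": 10 ** -i} for i in range(8)]
    best = deadline_search(configs, total_workers=8, deadline_s=2.0)
    print("best config:", best.config, "accuracy:", round(best.accuracy, 3))
```

Under these toy assumptions the search starts wide, with one worker per configuration, and ends narrow, with the surviving leader holding all the workers by the deadline; a production scheduler would additionally model the diminishing returns of scaling a single trial, which the paper's space-time constraints capture.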






Published In

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing
November 2019
503 pages
ISBN:9781450369732
DOI:10.1145/3357223
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 November 2019


Author Tags

  1. Distributed Machine Learning
  2. Hyperparameter Optimization
  3. Machine Learning Scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed limited


Conference

SoCC '19: ACM Symposium on Cloud Computing
November 20 - 23, 2019
Santa Cruz, CA, USA

Acceptance Rates

SoCC '19 Paper Acceptance Rate: 39 of 157 submissions, 25%
Overall Acceptance Rate: 169 of 722 submissions, 23%



Article Metrics

  • Downloads (Last 12 months)189
  • Downloads (Last 6 weeks)28
Reflects downloads up to 13 Nov 2024

Cited By
  • (2024) DistMind: Efficient Resource Disaggregation for Deep Learning Workloads. IEEE/ACM Transactions on Networking 32:3, 2422-2437. https://doi.org/10.1109/TNET.2024.3355010. Online publication date: Jun-2024.
  • (2024) UniSched: A Unified Scheduler for Deep Learning Training Jobs With Different User Demands. IEEE Transactions on Computers 73:6, 1500-1515. https://doi.org/10.1109/TC.2024.3371794. Online publication date: Jun-2024.
  • (2024) Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses. 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), 473-484. https://doi.org/10.1109/ICDCS60910.2024.00051. Online publication date: 23-Jul-2024.
  • (2024) Predicting Cloud Workloads: Using Machine Learning with Amazon Infrastructure. 2024 Second International Conference on Advances in Information Technology (ICAIT), 1-6. https://doi.org/10.1109/ICAIT61638.2024.10690810. Online publication date: 24-Jul-2024.
  • (2024) Buffer Parameter Optimization for Advanced Automated Material Handling Systems in Serial Production Lines. International Journal of Control, Automation and Systems 22:11, 3377-3385. https://doi.org/10.1007/s12555-024-0040-z. Online publication date: 6-Nov-2024.
  • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. https://doi.org/10.1145/3638757. Online publication date: 27-Dec-2023.
  • (2023) ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 266-280. https://doi.org/10.1145/3575693.3575721. Online publication date: 27-Jan-2023.
  • (2023) Hydra: Deadline-Aware and Efficiency-Oriented Scheduling for Deep Learning Jobs on Heterogeneous GPUs. IEEE Transactions on Computers 72:8, 2224-2236. https://doi.org/10.1109/TC.2023.3242200. Online publication date: 1-Aug-2023.
  • (2023) QoS-Aware and Cost-Efficient Dynamic Resource Allocation for Serverless ML Workflows. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 886-896. https://doi.org/10.1109/IPDPS54959.2023.00093. Online publication date: May-2023.
  • (2023) Enabling Switch Memory Management for Distributed Training with In-Network Aggregation. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 1-10. https://doi.org/10.1109/INFOCOM53939.2023.10228956. Online publication date: 17-May-2023.