
DOI: 10.1145/3135974.3135994

HyperDrive: exploring hyperparameters with POP scheduling

Published: 11 December 2017

Abstract

The quality of machine learning (ML) and deep learning (DL) models is highly sensitive to many adjustable parameters that are set before training begins, commonly called hyperparameters. Efficient hyperparameter exploration is of great importance to practitioners who need to find high-quality models within affordable time and cost budgets. It is, however, a challenging process due to a huge search space, expensive training runtimes, sparsity of good configurations, and scarcity of time and resources. We develop a scheduling algorithm, POP, that quickly distinguishes among promising, opportunistic, and poor hyperparameter configurations. It infuses probabilistic model-based classification with dynamic scheduling and early termination to jointly optimize quality and cost. We also build a comprehensive hyperparameter exploration infrastructure, HyperDrive, to support existing and future scheduling algorithms for a wide range of usage scenarios across different ML/DL frameworks and learning domains. We evaluate POP and HyperDrive using complex and deep models. The results show that POP speeds up the training process by up to 6.7x compared with basic approaches such as random/grid search, and by up to 2.1x compared with state-of-the-art approaches, while achieving similar model quality to prior work.
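The abstract describes POP's core control loop: periodically classify each running configuration as promising, opportunistic, or poor using a probabilistic forecast of its final quality; keep promising ones on dedicated resources, run opportunistic ones on spare capacity, and terminate poor ones early. The sketch below illustrates that loop in Python. It is a minimal sketch only: the forecasting rule, the threshold values, and every function and constant name (forecast_final, prob_beats, classify, PROMISING_PROB, POOR_PROB) are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Illustrative sketch of a POP-style scheduler (assumed details, not the
# paper's implementation). Each running configuration is periodically
# classified as promising, opportunistic, or poor based on a probabilistic
# forecast of its final accuracy; poor configurations are terminated early.

PROMISING_PROB = 0.5  # assumed threshold: forecast likely beats the best
POOR_PROB = 0.1       # assumed threshold: forecast very unlikely to win

def forecast_final(curve, total_epochs):
    """Crude forecast: extrapolate the recent accuracy trend and return a
    (mean, spread) belief about the final accuracy."""
    if len(curve) < 2:
        return curve[-1], 0.5
    recent = curve[-min(5, len(curve)):]
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    remaining = total_epochs - len(curve)
    mean = min(1.0, curve[-1] + slope * remaining)
    spread = max(0.01, abs(slope) * remaining)  # uncertainty grows with horizon
    return mean, spread

def prob_beats(mean, spread, best):
    """Coarse stand-in for P(final accuracy > best): a clipped linear CDF."""
    z = (mean - best) / spread
    return min(1.0, max(0.0, 0.5 + 0.5 * z))

def classify(curve, best, total_epochs):
    mean, spread = forecast_final(curve, total_epochs)
    p = prob_beats(mean, spread, best)
    if p >= PROMISING_PROB:
        return "promising"     # keep on dedicated resources
    if p <= POOR_PROB:
        return "poor"          # terminate early, free the resources
    return "opportunistic"     # run only on spare capacity

# Toy driver: simulate four configurations with noisy learning curves.
if __name__ == "__main__":
    random.seed(0)
    total_epochs = 30
    trials = {f"cfg{i}": {"curve": [0.1], "rate": random.uniform(0.005, 0.03)}
              for i in range(4)}
    best = 0.1
    for epoch in range(total_epochs):
        for t in trials.values():  # one training step per surviving trial
            t["curve"].append(min(1.0, t["curve"][-1] + t["rate"]
                                  + random.gauss(0, 0.005)))
            best = max(best, t["curve"][-1])
        for name in list(trials):
            label = classify(trials[name]["curve"], best, total_epochs)
            if label == "poor":
                print(f"epoch {epoch}: terminating {name} early")
                del trials[name]
```

The design point the abstract hints at is that classification is probabilistic rather than a hard cutoff on current accuracy: a configuration that currently trails but whose forecast still has a reasonable chance of beating the best model is demoted to opportunistic rather than killed outright.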



Published In

Middleware '17: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference
December 2017
268 pages
ISBN: 9781450347204
DOI: 10.1145/3135974

In-Cooperation

  • USENIX Association
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cluster scheduling
  2. hyperparameter exploration

Qualifiers

  • Research-article

Conference

Middleware '17: 18th International Middleware Conference
December 11-15, 2017
Las Vegas, Nevada, USA

Acceptance Rates

Middleware '17 paper acceptance rate: 20 of 85 submissions (24%).
Overall acceptance rate: 203 of 948 submissions (21%).




Article Metrics

  • Downloads (last 12 months): 21
  • Downloads (last 6 weeks): 3
Reflects downloads up to 12 Nov 2024.


Cited By

  • (2024) Online Training Flow Scheduling for Geo-Distributed Machine Learning Jobs Over Heterogeneous and Dynamic Networks. IEEE Transactions on Cognitive Communications and Networking 10:1, 277-291. DOI: 10.1109/TCCN.2023.3326331. Online publication date: Feb-2024.
  • (2023) Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads. Proceedings of the VLDB Endowment 17:4, 712-725. DOI: 10.14778/3636218.3636227. Online publication date: 1-Dec-2023.
  • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. DOI: 10.1145/3638757. Online publication date: 27-Dec-2023.
  • (2023) Waterwave: A GPU Memory Flow Engine for Concurrent DNN Training. IEEE Transactions on Computers 72:10, 2938-2950. DOI: 10.1109/TC.2023.3278530. Online publication date: Oct-2023.
  • (2023) Efficient Supernet Training Using Path Parallelism. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1249-1261. DOI: 10.1109/HPCA56546.2023.10071099. Online publication date: Feb-2023.
  • (2022) CoGNN. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.5555/3571885.3571936. Online publication date: 13-Nov-2022.
  • (2022) Hippo. Proceedings of the VLDB Endowment 15:5, 1038-1052. DOI: 10.14778/3510397.3510402. Online publication date: 18-May-2022.
  • (2022) Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud. Proceedings of the 51st International Conference on Parallel Processing, 1-11. DOI: 10.1145/3545008.3545027. Online publication date: 29-Aug-2022.
  • (2022) Elastic Parameter Server: Accelerating ML Training With Scalable Resource Scheduling. IEEE Transactions on Parallel and Distributed Systems 33:5, 1128-1143. DOI: 10.1109/TPDS.2021.3104242. Online publication date: 1-May-2022.
  • (2022) Machine Learning Feature Based Job Scheduling for Distributed Machine Learning Clusters. IEEE/ACM Transactions on Networking, 1-16. DOI: 10.1109/TNET.2022.3190797. Online publication date: 2022.
