Learning scheduling algorithms for data processing clusters

Authors:

Malte Schwarzkopf,

Shaileshh Bojja Venkatakrishnan,

Mohammad Alizadeh

SIGCOMM '19: Proceedings of the ACM Special Interest Group on Data Communication

Pages 270 - 288

https://doi.org/10.1145/3341302.3342080

Published: 19 August 2019

Abstract

Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems use simple, generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly efficient policies automatically.

Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective, such as minimizing average job completion time. However, off-the-shelf RL techniques cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals.

Our prototype integration with Spark on a 25-node cluster shows that Decima improves average job completion time by at least 21% over hand-tuned scheduling heuristics, achieving up to 2x improvement during periods of high cluster load.
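The objective named in the abstract, average job completion time, can be made concrete with a small worked example. The sketch below is not Decima and does not use reinforcement learning. It only shows, under simplifying assumptions (a single executor, all jobs arriving at time 0, no dependencies between jobs), how the order in which jobs run changes the metric. The function name `average_jct` and the job durations are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def average_jct(durations):
    """Average job completion time for jobs run back to back on one executor.

    Assumption: every job arrives at time 0, so a job's completion time is
    its own duration plus the durations of all jobs scheduled before it.
    """
    completion_times = list(accumulate(durations))
    return sum(completion_times) / len(completion_times)


# Hypothetical job durations in seconds (illustrative only).
jobs = [30, 5, 10]

fifo = average_jct(jobs)         # run in arrival order
sjf = average_jct(sorted(jobs))  # run shortest job first

print(f"FIFO average JCT: {fifo:.1f}s")
print(f"SJF average JCT:  {sjf:.1f}s")
print(f"Reduction: {100 * (fifo - sjf) / fifo:.0f}%")
```

Even in this toy setting the ordering policy changes the objective noticeably. Real cluster workloads add dependency graphs, parallelism, and continuous arrivals, which is the setting the abstract says simple heuristics handle poorly.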




Published In

SIGCOMM '19: Proceedings of the ACM Special Interest Group on Data Communication

August 2019

526 pages

ISBN: 9781450359566

DOI: 10.1145/3341302

General Chairs: Jianping Wu (Tsinghua University, China), Wendy Hall (University of Southampton, UK)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

Alfred P. Sloan Research Fellowship
Google Faculty Research Award
AWS Machine Learning Research Award
Cisco Research Center Award
MIT Data Systems and AI Lab
NSF

Conference

SIGCOMM '19

Sponsor: SIGCOMM

SIGCOMM '19: ACM SIGCOMM 2019 Conference

August 19 - 23, 2019

Beijing, China

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%
