DOI: 10.1145/3341302.3342080
research-article
Open access

Learning scheduling algorithms for data processing clusters

Published: 19 August 2019 Publication History

Abstract

Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems use simple, generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly efficient policies automatically.
Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective, such as minimizing average job completion time. However, off-the-shelf RL techniques cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals.
Our prototype integration with Spark on a 25-node cluster shows that Decima improves average job completion time by at least 21% over hand-tuned scheduling heuristics, achieving up to 2x improvement during periods of high cluster load.
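To make the objective named in the abstract concrete, the following minimal sketch computes average job completion time (JCT) for two fixed orderings of the same jobs on a single executor. It is only an illustration of the metric and of how a relative improvement is calculated; it is not Decima's method and involves no learning. The job durations, the single-executor setup, the assumption that all jobs arrive at time 0, and the function name `average_jct` are hypothetical and do not come from the paper.

```python
from itertools import accumulate


def average_jct(durations):
    """Average job completion time for jobs run back to back on one executor.

    Assumes every job arrives at time 0, so a job's completion time is the
    running sum of the durations scheduled up to and including it.
    """
    completion_times = list(accumulate(durations))
    return sum(completion_times) / len(completion_times)


# Hypothetical job durations in seconds (illustrative, not from the paper).
jobs = [8, 1, 3]

fifo_order = jobs          # run jobs in arrival order
sjf_order = sorted(jobs)   # run the shortest job first

fifo_avg = average_jct(fifo_order)
sjf_avg = average_jct(sjf_order)

# Relative improvement of the second ordering over the first.
improvement = (fifo_avg - sjf_avg) / fifo_avg

print(f"FIFO average JCT: {fifo_avg:.2f} s")
print(f"SJF average JCT:  {sjf_avg:.2f} s")
print(f"Improvement:      {improvement:.1%}")
```

Even in this toy setting the ordering of jobs changes the average JCT substantially, which is the quantity a scheduling policy is asked to minimize.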

Supplementary Material

MP4 File (p270-mao.mp4)



Information & Contributors

Information

Published In

SIGCOMM '19: Proceedings of the ACM Special Interest Group on Data Communication
August 2019
526 pages
ISBN: 9781450359566
DOI: 10.1145/3341302
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. job scheduling
  2. reinforcement learning
  3. resource management

Qualifiers

  • Research-article

Funding Sources

  • Alfred P. Sloan Research Fellowship
  • Google Faculty Research Award
  • AWS Machine Learning Research Award
  • Cisco Research Center Award
  • MIT Data Systems and AI Lab
  • NSF

Conference

SIGCOMM '19
Sponsor:
SIGCOMM '19: ACM SIGCOMM 2019 Conference
August 19 - 23, 2019
Beijing, China

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%

Article Metrics

  • Downloads (Last 12 months)1,976
  • Downloads (Last 6 weeks)282
Reflects downloads up to 01 Nov 2024

Cited By

  • (2025) GenesisRM: A state-driven approach to resource management for distributed JVM web applications. Future Generation Computer Systems 163 (107539). DOI: 10.1016/j.future.2024.107539. Online publication date: Feb-2025
  • (2024) Energy and Carbon-aware Distributed Machine Learning Tasks Scheduling Scheme for the Multi-Renewable Energy-based Edge-Cloud Continuum. Science and Technology for Energy Transition. DOI: 10.2516/stet/2024076. Online publication date: 2-Sep-2024
  • (2024) Octopus: An End-to-end Multi-DAG Scheduling Method Based on Deep Reinforcement Learning. 2024 43rd Chinese Control Conference (CCC), 2588-2593. DOI: 10.23919/CCC63176.2024.10662729. Online publication date: 28-Jul-2024
  • (2024) Leveraging Reinforcement Learning for Autonomous Data Pipeline Optimization and Management. SSRN Electronic Journal. DOI: 10.2139/ssrn.4908414. Online publication date: 2024
  • (2024) Learning to solve combinatorial optimization under positive linear constraints via non-autoregressive neural networks. SCIENTIA SINICA Informationis. DOI: 10.1360/SSI-2023-0269. Online publication date: 30-Sep-2024
  • (2024) Towards optimized scheduling and allocation of heterogeneous resource via graph-enhanced EPSO algorithm. Journal of Cloud Computing 13:1. DOI: 10.1186/s13677-024-00670-4. Online publication date: 23-May-2024
  • (2024) Planter: Rapid Prototyping of In-Network Machine Learning Inference. ACM SIGCOMM Computer Communication Review 54:1, 2-21. DOI: 10.1145/3687230.3687232. Online publication date: 6-Aug-2024
  • (2024) Learning-Augmented Energy-Aware List Scheduling for Precedence-Constrained Tasks. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 9:4, 1-24. DOI: 10.1145/3680278. Online publication date: 1-Aug-2024
  • (2024) Multi-Stream Scheduling of Inference Pipelines on Edge Devices - a DRL Approach. ACM Transactions on Design Automation of Electronic Systems 29:6, 1-36. DOI: 10.1145/3677378. Online publication date: 11-Jul-2024
  • (2024) On the Feasibility and Benefits of Extensive Evaluation. Proceedings of the ACM on Management of Data 2:4, 1-24. DOI: 10.1145/3677137. Online publication date: 30-Sep-2024
