

DAG-based workflows scheduling using Actor–Critic Deep Reinforcement Learning

Published: 01 January 2024

Abstract

High-Performance Computing (HPC) is essential to support advances in multiple research and industrial fields. Despite the recent growth in processing and networking power, HPC Data Centers (DCs) are finite and must be carefully managed to host multiple jobs. The scheduling of tasks (which compose a job) is a crucial and complex activity, since the effects of the scheduler's decisions are perceptible both to users (e.g., slowdown) and to infrastructure administrators (e.g., resource usage and queue length). In fact, the process of scheduling workflows atop a DC can be modeled as a graph mapping problem: an undirected graph represents the DC, while a Directed Acyclic Graph (DAG) expresses the task dependencies. Vertices and edges of both graphs can carry weights, denoting the residual capacities of DC resources as well as the computing and networking demands of workflows. Motivated by the combinatorial explosion of this scheduling problem, the integration of Machine Learning (ML) for generating or improving scheduling policies is already a reality; however, most proposals in the specialized literature either adopt simplified models to reduce the search space or are trained for specific scenarios, leading to policies that eventually fall short of real DCs' expectations. Given this challenge, this work applies Actor–Critic (AC) Reinforcement Learning (RL) to schedule DAG-based workflows. Instead of proposing a new policy, the AC RL agent selects the appropriate scheduling policy from a pool of consolidated algorithms, guided by the DAG workload and DC usage. The AC RL-based scheduler analyzes the DAG queue and the DC status to decide which algorithms are better suited to improve the overall performance indicators in each scenario instance. The simulation protocol comprises multiple analyses with distinct workload configurations, numbers of jobs, queue ordering policies, and strategies to select the target DC servers. The results demonstrate that the AC RL selects the scheduling policy that fits the current workload and DC status.
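
The graph-mapping formulation in the abstract lends itself to a direct encoding. The sketch below is our illustration, not an artifact of the paper: a workflow modeled as a weighted DAG and a DC as a weighted undirected graph, using networkx; all attribute names (cpu, data, cores, bw) and values are assumptions chosen for the example.

    # Illustrative sketch (not the paper's code) of the graph-mapping model.
    import networkx as nx

    # Workflow: a DAG whose vertices carry computing demands and whose
    # edges carry networking (data-transfer) demands between dependent tasks.
    workflow = nx.DiGraph()
    workflow.add_node("t1", cpu=4)            # task t1 demands 4 cores (assumed unit)
    workflow.add_node("t2", cpu=2)
    workflow.add_node("t3", cpu=8)
    workflow.add_edge("t1", "t2", data=1.5)   # t1 -> t2 transfers 1.5 GB (assumed)
    workflow.add_edge("t1", "t3", data=0.5)
    assert nx.is_directed_acyclic_graph(workflow)

    # Data center: an undirected graph whose vertices carry residual core
    # capacities and whose edges carry residual link bandwidth.
    dc = nx.Graph()
    dc.add_node("s1", cores=16)
    dc.add_node("s2", cores=8)
    dc.add_edge("s1", "s2", bw=10.0)          # residual bandwidth, assumed Gb/s

    # Any valid schedule must respect the dependencies; a topological sort
    # yields one dependency-respecting order in which tasks may be dispatched.
    print(list(nx.topological_sort(workflow)))  # e.g., ['t1', 't2', 't3']

Under this encoding, scheduling amounts to mapping weighted DAG vertices and edges onto the residual capacities of the DC graph, which is the combinatorial problem the abstract refers to.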

Highlights

We researched DAG-based workflow scheduling considering both the users' and the HPC DC's perspectives.
We proposed, implemented, and analyzed an AC RL scheduler that selects the appropriate combination of queueing policies (a minimal sketch of such a selector follows these highlights).
The AC RL prototype runs alongside consolidated state-of-the-art policies and can be easily extended.
The AC RL prototype is simple, being based on well-known queueing policies and actor–critic reinforcement learning.
The simulation protocol demonstrated how the AC RL prototype can learn and improve the overall indicators.
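
The key design choice in the highlights is that the agent is a selector rather than a new heuristic: the actor outputs a distribution over a pool of consolidated policies given features of the DAG queue and DC status, and the critic's one-step TD error serves as the advantage driving both updates. The following is a minimal one-step advantage actor–critic sketch under our own assumptions (linear actor and critic, an illustrative feature set and policy pool); it is our reading of the approach, not the authors' implementation.

    # Minimal sketch (our illustration, not the authors' code): an advantage
    # actor-critic agent that picks one scheduling policy from a pool, given
    # features of the DAG queue and the DC status. The policy pool and the
    # state features below are assumptions made for this example.
    import numpy as np

    POLICY_POOL = ["FCFS", "SJF", "EASY-backfilling"]  # assumed pool

    class PolicySelectorAC:
        def __init__(self, n_features, n_actions,
                     lr_actor=0.01, lr_critic=0.05, gamma=0.99):
            self.theta = np.zeros((n_actions, n_features))  # softmax actor weights
            self.w = np.zeros(n_features)                   # linear critic weights
            self.lr_a, self.lr_c, self.gamma = lr_actor, lr_critic, gamma

        def _probs(self, s):
            z = self.theta @ s
            e = np.exp(z - z.max())                         # numerically stable softmax
            return e / e.sum()

        def act(self, s):
            return np.random.choice(len(POLICY_POOL), p=self._probs(s))

        def update(self, s, a, reward, s_next, done):
            # Critic: one-step TD error, also used as the advantage estimate.
            td = reward + (0.0 if done else self.gamma * (self.w @ s_next)) - self.w @ s
            self.w += self.lr_c * td * s
            # Actor: policy-gradient step along grad log softmax(a | s).
            grad = -np.outer(self._probs(s), s)
            grad[a] += s
            self.theta += self.lr_a * td * grad

    # Toy interaction. State = (queue length, mean DAG width, DC utilization);
    # reward = negative mean slowdown. All values are fabricated for illustration.
    agent = PolicySelectorAC(n_features=3, n_actions=len(POLICY_POOL))
    s = np.array([12.0, 3.5, 0.7])
    a = agent.act(s)
    agent.update(s, a, reward=-1.8, s_next=np.array([10.0, 3.1, 0.75]), done=False)
    print("selected policy:", POLICY_POOL[a])

Because the action space is the (small) policy pool rather than a per-task placement decision, the search space stays tractable while the consolidated algorithms retain responsibility for the actual task-to-server mapping.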


Cited By

  • (2024) Towards Highly Compatible I/O-Aware Workflow Scheduling on HPC Systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '24), pp. 1–15. https://doi.org/10.1109/SC41406.2024.00031. Online publication date: 17-Nov-2024.
  • (2024) Energy-efficient DAG scheduling with DVFS for cloud data centers. The Journal of Supercomputing, 80(10), 14799–14823. https://doi.org/10.1007/s11227-024-06035-7. Online publication date: 27-Mar-2024.




Published In

Future Generation Computer Systems, Volume 150, Issue C, January 2024, 451 pages

Publisher

Elsevier Science Publishers B.V., Netherlands


Author Tags

  1. Scheduling
  2. Actor–critic
  3. Deep reinforcement learning
  4. DAG
  5. Tasks
  6. Jobs
  7. Workflow

Qualifiers

  • Research-article
