

DAG-based workflows scheduling using Actor–Critic Deep Reinforcement Learning

Published: 01 January 2024

Abstract

High-Performance Computing (HPC) is essential to support advances in multiple research and industrial fields. Despite the recent growth in processing and networking power, HPC Data Centers (DCs) are finite and must be carefully managed to host multiple jobs. The scheduling of tasks (which compose a job) is a crucial and complex activity, since the effects of the scheduler's decisions are perceptible both to users (e.g., slowdown) and to infrastructure administrators (e.g., resource usage and queue length). In fact, the process of scheduling workflows atop a DC can be modeled as a graph mapping problem: an undirected graph represents the DC, while a Directed Acyclic Graph (DAG) expresses the task dependencies. Vertices and edges of both graphs can carry weights, denoting the residual capacities of DC resources as well as the computing and networking demands of workflows. Motivated by the combinatorial explosion of this scheduling problem, the integration of Machine Learning (ML) for generating or improving scheduling policies is already a reality; however, most proposals in the specialized literature either adopt simplified models to reduce the search space or are trained for specific scenarios, leading to policies that eventually fall short of real DCs' expectations. Given this challenge, this work applies Actor–Critic (AC) Reinforcement Learning (RL) to schedule DAG-based workflows. Instead of proposing a new policy, the AC RL agent selects the appropriate scheduling policy from a pool of consolidated algorithms, guided by the DAG workload and DC usage. The AC RL-based scheduler analyzes the DAG queue and the DC status to decide which algorithms are better suited to improve the overall performance indicators in each scenario instance. The simulation protocol comprises multiple analyses with distinct workload configurations, numbers of jobs, queue ordering policies, and strategies to select the target DC servers. The results demonstrate that the AC RL selects the scheduling policy that fits the current workload and DC status.
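
The graph-mapping formulation in the abstract lends itself to a direct encoding. The sketch below is our illustration, not an artifact of the paper: a workflow modeled as a weighted DAG and a DC as a weighted undirected graph, using networkx; all attribute names (cpu, data, cores, bw) and values are assumptions chosen for the example.

    # Illustrative sketch (not the paper's code) of the graph-mapping model.
    import networkx as nx

    # Workflow: a DAG whose vertices carry computing demands and whose
    # edges carry networking (data-transfer) demands between dependent tasks.
    workflow = nx.DiGraph()
    workflow.add_node("t1", cpu=4)            # task t1 demands 4 cores (assumed unit)
    workflow.add_node("t2", cpu=2)
    workflow.add_node("t3", cpu=8)
    workflow.add_edge("t1", "t2", data=1.5)   # t1 -> t2 transfers 1.5 GB (assumed)
    workflow.add_edge("t1", "t3", data=0.5)
    assert nx.is_directed_acyclic_graph(workflow)

    # Data center: an undirected graph whose vertices carry residual core
    # capacities and whose edges carry residual link bandwidth.
    dc = nx.Graph()
    dc.add_node("s1", cores=16)
    dc.add_node("s2", cores=8)
    dc.add_edge("s1", "s2", bw=10.0)          # residual bandwidth, assumed Gb/s

    # Any valid schedule must respect the dependencies; a topological sort
    # yields one dependency-respecting order in which tasks may be dispatched.
    print(list(nx.topological_sort(workflow)))  # e.g., ['t1', 't2', 't3']

Under this encoding, scheduling amounts to mapping weighted DAG vertices and edges onto the residual capacities of the DC graph, which is the combinatorial problem the abstract refers to.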

Highlights

We researched DAG-based workflow scheduling considering both the users' and the HPC DC's perspectives.
We proposed, implemented, and analyzed an AC RL scheduler that selects the appropriate combination of queueing policies (a minimal sketch of such a selector follows these highlights).
The AC RL prototype runs alongside consolidated state-of-the-art policies and can be easily extended.
The AC RL prototype is simple, being based on well-known queueing policies and actor–critic reinforcement learning.
The simulation protocol demonstrated how the AC RL prototype can learn and improve the overall indicators.
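
The key design choice in the highlights is that the agent is a selector rather than a new heuristic: the actor outputs a distribution over a pool of consolidated policies given features of the DAG queue and DC status, and the critic's one-step TD error serves as the advantage driving both updates. The following is a minimal one-step advantage actor–critic sketch under our own assumptions (linear actor and critic, an illustrative feature set and policy pool); it is our reading of the approach, not the authors' implementation.

    # Minimal sketch (our illustration, not the authors' code): an advantage
    # actor-critic agent that picks one scheduling policy from a pool, given
    # features of the DAG queue and the DC status. The policy pool and the
    # state features below are assumptions made for this example.
    import numpy as np

    POLICY_POOL = ["FCFS", "SJF", "EASY-backfilling"]  # assumed pool

    class PolicySelectorAC:
        def __init__(self, n_features, n_actions,
                     lr_actor=0.01, lr_critic=0.05, gamma=0.99):
            self.theta = np.zeros((n_actions, n_features))  # softmax actor weights
            self.w = np.zeros(n_features)                   # linear critic weights
            self.lr_a, self.lr_c, self.gamma = lr_actor, lr_critic, gamma

        def _probs(self, s):
            z = self.theta @ s
            e = np.exp(z - z.max())                         # numerically stable softmax
            return e / e.sum()

        def act(self, s):
            return np.random.choice(len(POLICY_POOL), p=self._probs(s))

        def update(self, s, a, reward, s_next, done):
            # Critic: one-step TD error, also used as the advantage estimate.
            td = reward + (0.0 if done else self.gamma * (self.w @ s_next)) - self.w @ s
            self.w += self.lr_c * td * s
            # Actor: policy-gradient step along grad log softmax(a | s).
            grad = -np.outer(self._probs(s), s)
            grad[a] += s
            self.theta += self.lr_a * td * grad

    # Toy interaction. State = (queue length, mean DAG width, DC utilization);
    # reward = negative mean slowdown. All values are fabricated for illustration.
    agent = PolicySelectorAC(n_features=3, n_actions=len(POLICY_POOL))
    s = np.array([12.0, 3.5, 0.7])
    a = agent.act(s)
    agent.update(s, a, reward=-1.8, s_next=np.array([10.0, 3.1, 0.75]), done=False)
    print("selected policy:", POLICY_POOL[a])

Because the action space is the (small) policy pool rather than a per-task placement decision, the search space stays tractable while the consolidated algorithms retain responsibility for the actual task-to-server mapping.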


Cited By

  • (2024) Towards Highly Compatible I/O-Aware Workflow Scheduling on HPC Systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '24), pp. 1–15. https://doi.org/10.1109/SC41406.2024.00031. Online publication date: 17-Nov-2024.
  • (2024) Energy-efficient DAG scheduling with DVFS for cloud data centers. The Journal of Supercomputing, 80(10), 14799–14823. https://doi.org/10.1007/s11227-024-06035-7. Online publication date: 27-Mar-2024.




Published In

Future Generation Computer Systems, Volume 150, Issue C, January 2024, 451 pages

Publisher

Elsevier Science Publishers B.V., Netherlands


Author Tags

  1. Scheduling
  2. Actor–critic
  3. Deep reinforcement learning
  4. DAG
  5. Tasks
  6. Jobs
  7. Workflow

Qualifiers

  • Research-article
