short-paper

Reference Implementation of Smart Scheduler: A CI-Aware, AI-Driven Scheduling Framework for HPC Workloads

Authors: Manikya Swathi Vallabhajosyula, Sandeep Satish Budhya, Rajiv Ramnath

PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing

Article No.: 75, Pages 1 - 4

https://doi.org/10.1145/3626203.3670555

Published: 17 July 2024

Abstract

Many modern scientific workloads in HPC centers rely heavily on AI-driven tasks, particularly deep neural network (DNN) training. Managing and scheduling these workloads efficiently through SLURM interfaces requires users to understand the available resources, the allocation policies, and the execution configurations that suit their models’ estimated resource requirements and constraints. Typically, users schedule jobs with default configurations, adjust them as needed, or request the maximum available limits to ensure uninterrupted execution. However, this approach can cause job interruptions from underprovisioning, as well as prolonged wait times, inefficient resource utilization, and increased costs from overprovisioning. These issues ultimately degrade cluster performance and motivate a more efficient solution: an AI-enabled scheduler framework that profiles DNN workloads, estimates their resource needs, and provisions resources dynamically. Existing resource estimation models are trained independently to predict individual aspects of batch processing and scheduling, and they do not work cohesively to orchestrate a job execution. In this work, we investigate the feasibility of an iScheduler framework that transforms the traditional SLURM resource provisioning workflow into an AI-enabled scheduler. The framework plugs in different estimators where needed to orchestrate the workflow: it generates a cyberinfrastructure-aware execution plan, then schedules and monitors jobs until completion. We demonstrate the feasibility of our framework by orchestrating a user-specific DNN training workload.
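
The idea of plugging independent estimators into one execution plan can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Workload` fields, the estimator functions, the numeric heuristics, and the `train.py` command are all hypothetical placeholders chosen for illustration. Only the `#SBATCH` directives (`--job-name`, `--gres`, `--mem`, `--time`) are standard SLURM options.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Workload:
    """Hypothetical description of a DNN training job (not from the paper)."""
    name: str
    params_millions: float
    batch_size: int
    epochs: int


# An estimator maps a workload to a partial execution plan.
Estimator = Callable[[Workload], Dict[str, int]]


def estimate_memory(w: Workload) -> Dict[str, int]:
    # Toy heuristic (assumption): fixed overhead plus model-size and batch-size terms.
    gb = 4 + 0.02 * w.params_millions + 0.05 * w.batch_size
    return {"mem_gb": math.ceil(gb)}


def estimate_walltime(w: Workload) -> Dict[str, int]:
    # Toy heuristic (assumption): 3 minutes per epoch plus 20% headroom.
    base = w.epochs * 3
    return {"minutes": base + base // 5}


def build_plan(w: Workload, estimators: List[Estimator]) -> Dict[str, int]:
    """Merge the outputs of all plugged-in estimators into one plan."""
    plan: Dict[str, int] = {"gpus": 1}  # default: a single GPU
    for estimator in estimators:
        plan.update(estimator(w))
    return plan


def render_sbatch(w: Workload, plan: Dict[str, int], command: str) -> str:
    """Render the plan as a SLURM batch script."""
    hours, minutes = divmod(plan["minutes"], 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={w.name}",
        f"#SBATCH --gres=gpu:{plan['gpus']}",
        f"#SBATCH --mem={plan['mem_gb']}G",
        f"#SBATCH --time={hours:02d}:{minutes:02d}:00",
        command,
    ]
    return "\n".join(lines) + "\n"


workload = Workload(name="resnet50-train", params_millions=25.6,
                    batch_size=64, epochs=10)
plan = build_plan(workload, [estimate_memory, estimate_walltime])
script = render_sbatch(workload, plan, "srun python train.py")
print(script)
```

Because each estimator only returns a partial plan, a learned model could replace either toy heuristic without changing `build_plan` or `render_sbatch`. Submitting the script and monitoring the job until completion are left out of this sketch.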

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Planning and scheduling
      1. Planning under uncertainty
  2. Modeling and simulation
    1. Model development and analysis
      1. Model verification and validation
2. Software and its engineering
  1. Software creation and management
    1. Designing software
      1. Software design engineering

Published In

PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing

July 2024

608 pages

ISBN:9798400704192

DOI:10.1145/3626203

Editors:
Shawn T. Brown (Hewlett-Packard Enterprise),
Barr von Oehsen (Pittsburgh Supercomputing Center),
Eric Adams (Purdue University),
Eva Siegmann (Stony Brook University)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Funding Sources

NSF (National Science Foundation)

Conference

PEARC '24: Practice and Experience in Advanced Research Computing

July 21 - 25, 2024

Providence, RI, USA

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%
