DOI: 10.1145/3626203.3670555
short-paper

Reference Implementation of Smart Scheduler: A CI-Aware, AI-Driven Scheduling Framework for HPC Workloads

Published: 17 July 2024

Abstract

Many modern scientific workloads in HPC centers rely heavily on AI-driven tasks, particularly deep neural network (DNN) training. Managing and scheduling these workloads efficiently through SLURM interfaces requires users to understand the available resources, the allocation policies, and the execution configurations that suit their models' estimated resource requirements and constraints. In practice, users typically submit jobs with default configurations, adjust them as needed, or request the maximum available limits to ensure uninterrupted execution. However, this approach can lead to job interruptions from underprovisioning and to prolonged wait times, inefficient resource utilization, and increased costs from overprovisioning. These issues ultimately degrade cluster performance and motivate a more efficient solution: an AI-enabled scheduler framework that can profile DNN workloads and estimate and provision resources dynamically. Existing resource estimation models are trained independently to predict individual aspects of batch processing and scheduling, and they do not work cohesively to orchestrate a job's execution. In this work, we investigate the feasibility of iScheduler, a framework that transforms the traditional SLURM resource provisioning workflow into an AI-enabled scheduler. The framework plugs in different estimators where needed, orchestrates the workflow by generating a cyberinfrastructure-aware execution plan, and schedules and monitors jobs until completion. We demonstrate the feasibility of the framework by orchestrating a user-specific DNN training workload.
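
The idea of plugging separate estimators into one execution plan can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Workload` fields, both estimator functions, and their heuristics are hypothetical placeholders standing in for trained models, and only the `#SBATCH` directives (`--job-name`, `--nodes`, `--gres`, `--mem`, `--time`) are standard Slurm options. Job submission and monitoring are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Workload:
    """Hypothetical description of a DNN training job (illustrative fields)."""
    name: str
    command: str
    params_millions: float
    epochs: int


# An estimator maps a workload to a partial set of resource settings.
Estimator = Callable[[Workload], Dict[str, int]]


def memory_estimator(w: Workload) -> Dict[str, int]:
    # Placeholder heuristic, NOT a trained model: 8 GB base plus 1 GB
    # per 10 million parameters, on a single GPU.
    return {"mem_gb": 8 + int(w.params_millions // 10), "gpus": 1}


def runtime_estimator(w: Workload) -> Dict[str, int]:
    # Placeholder heuristic, NOT a trained model: 3 minutes per epoch
    # plus a fixed 10-minute margin.
    return {"minutes": w.epochs * 3 + 10}


def build_plan(w: Workload, estimators: List[Estimator]) -> Dict[str, int]:
    """Merge the outputs of all plugged-in estimators into one plan."""
    plan: Dict[str, int] = {"nodes": 1}
    for estimate in estimators:
        plan.update(estimate(w))
    return plan


def render_sbatch(w: Workload, plan: Dict[str, int]) -> str:
    """Render the plan as a Slurm batch script using standard directives."""
    hours, mins = divmod(plan["minutes"], 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={w.name}",
        f"#SBATCH --nodes={plan['nodes']}",
        f"#SBATCH --gres=gpu:{plan['gpus']}",
        f"#SBATCH --mem={plan['mem_gb']}G",
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        "",
        w.command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    job = Workload("resnet", "python train.py", params_millions=25.0, epochs=10)
    plan = build_plan(job, [memory_estimator, runtime_estimator])
    print(render_sbatch(job, plan))
```

Keeping each estimator behind the same callable signature means a trained memory or runtime model could replace a placeholder without changing how the plan is assembled or rendered.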

Published In

PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing
July 2024
608 pages
ISBN: 9798400704192
DOI: 10.1145/3626203
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. AI4CI
  2. AI4OPT
  3. ML
  4. estimation scalability
  5. execution time estimation
  6. job scheduling
  7. model
  8. workflow orchestration

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

PEARC '24

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 78
  • Downloads (Last 12 months): 78
  • Downloads (Last 6 weeks): 10

Reflects downloads up to 16 Nov 2024
