DOI: 10.1145/3332186.3333041
research-article
Public Access

Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning

Published: 28 July 2019

Abstract

High-Performance Computing (HPC) systems are resources used for data capture, sharing, and analysis. The majority of our HPC users come from disciplines other than Computer Science. HPC users, including computer scientists, have difficulty deciding how many resources their submitted jobs require on the cluster and do not feel proficient enough to make that decision. Consequently, users are encouraged to over-estimate the resources for their submitted jobs so that the jobs will not be killed due to insufficient resources. This practice wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the time required to run the computation. Our model uses several different machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10,000 tasks selected from our HPC log files to evaluate the performance and accuracy of our integrated model. The purpose of our work is to increase the performance of Slurm by predicting the memory and the time required for each job, thereby improving the utilization of the HPC system.
Our results indicate that, for larger jobs, our model dramatically reduces computational turnaround time (from five days to ten hours), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
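
The idea of predicting a job's memory and runtime from past jobs can be illustrated with a minimal sketch. The abstract does not name the algorithms or features used, so everything below is an assumption made for illustration: the k-nearest-neighbour regressor is a stand-in, the feature set (requested memory, requested hours, cores), the `HISTORY` records, the `predict_resources` function, and the 20% safety margin are all hypothetical and are not taken from the paper or its logs.

```python
import math
from statistics import mean

# Hypothetical historical jobs (illustrative values, not from the paper's logs).
# Features: (requested_mem_gb, requested_hours, cores)
# Targets:  (used_mem_gb, used_hours)
HISTORY = [
    ((64.0, 48.0, 16), (10.0, 5.0)),
    ((64.0, 72.0, 16), (12.0, 6.5)),
    ((32.0, 24.0, 8), (6.0, 2.0)),
    ((128.0, 120.0, 32), (40.0, 20.0)),
    ((16.0, 12.0, 4), (3.0, 1.0)),
    ((128.0, 96.0, 32), (36.0, 18.0)),
]


def _scale(features, maxima):
    """Divide each feature by its maximum in the history so no feature dominates."""
    return [f / m for f, m in zip(features, maxima)]


def predict_resources(request, history=HISTORY, k=3, margin=1.2):
    """Estimate (mem_gb, hours) for a new request via k-nearest-neighbour regression."""
    maxima = [max(feat[i] for feat, _ in history) for i in range(len(request))]
    target = _scale(request, maxima)
    # Rank past jobs by Euclidean distance to the new request in scaled feature space.
    nearest = sorted(
        history, key=lambda rec: math.dist(_scale(rec[0], maxima), target)
    )[:k]
    # Average what the k most similar jobs actually used, then add a safety margin.
    mem = mean(used[0] for _, used in nearest) * margin
    hours = mean(used[1] for _, used in nearest) * margin
    # Never ask for more than the user originally requested.
    return min(mem, request[0]), min(hours, request[1])


if __name__ == "__main__":
    mem, hours = predict_resources((64.0, 60.0, 16))
    print(f"predicted memory: {mem:.1f} GB, predicted runtime: {hours:.1f} h")
```

In this sketch the prediction replaces the user's over-estimate only when it is smaller, which reflects the motivation stated in the abstract: a job that requests less memory and time than the user's padded estimate leaves more of the cluster free for other jobs.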




Published In

PEARC '19: Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (learning)
July 2019
775 pages
ISBN:9781450372275
DOI:10.1145/3332186
  • General Chair: Tom Furlani
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. HPC
  2. Performance
  3. Scheduling
  4. Slurm
  5. Supervised Machine Learning
  6. User Modeling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PEARC '19

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%

Cited By

  • (2025) Application-Oriented Cloud Workload Prediction: A Survey and New Perspectives. Tsinghua Science and Technology 30:1 (34-54). DOI: 10.26599/TST.2024.9010024. Online publication date: Mar-2025
  • (2024) Parallel Backfill: Improving HPC System Performance by Scheduling Jobs in Parallel. Practice and Experience in Advanced Research Computing 2024: Human Powered Computing (1-4). DOI: 10.1145/3626203.3670610. Online publication date: 17-Jul-2024
  • (2024) Toward Sustainable HPC: In-Production Deployment of Incentive-Based Power Efficiency Mechanism on the Fugaku Supercomputer. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-16). DOI: 10.1109/SC41406.2024.00030. Online publication date: 17-Nov-2024
  • (2024) Adaptive Task-Oriented Resource Allocation for Large Dynamic Workflows on Opportunistic Resources. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (300-311). DOI: 10.1109/IPDPS57955.2024.00034. Online publication date: 27-May-2024
  • (2024) Relative Performance Prediction Using Few-Shot Learning. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC) (1764-1769). DOI: 10.1109/COMPSAC61105.2024.00278. Online publication date: 2-Jul-2024
  • (2024) Sizey: Memory-Efficient Execution of Scientific Workflow Tasks. 2024 IEEE International Conference on Cluster Computing (CLUSTER) (370-381). DOI: 10.1109/CLUSTER59578.2024.00039. Online publication date: 24-Sep-2024
  • (2024) Trade-off topology design for hierarchical network based on job characteristics. CCF Transactions on High Performance Computing. DOI: 10.1007/s42514-024-00193-z. Online publication date: 21-May-2024
  • (2024) Augmented access pattern-based I/O performance prediction using directed acyclic graph regression. Cluster Computing 28:1. DOI: 10.1007/s10586-024-04719-6. Online publication date: 14-Oct-2024
  • (2024) The Running Time Prediction of Spacecraft Simulation Job Based on HC-LSTM. Signal and Information Processing, Networking and Computers (482-490). DOI: 10.1007/978-981-97-2116-0_59. Online publication date: 3-May-2024
  • (2023) Prediction of Reservoir Simulation Jobs Times Using a Real-World SLURM Log. Anais do XXIV Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD 2023) (49-60). DOI: 10.5753/wscad.2023.235649. Online publication date: 17-Oct-2023
