DOI: 10.1007/978-3-030-88224-2_6

Learning-Based Approaches to Estimate Job Wait Time in HTC Datacenters

Published: 21 May 2021

Abstract

High Throughput Computing (HTC) datacenters are a cornerstone of scientific discovery in the fields of High Energy Physics and Astroparticle Physics. These datacenters provide thousands of users from dozens of scientific collaborations with tens of thousands of computing cores and petabytes of storage.
The scheduling algorithm used in such datacenters to handle the millions of (mostly single-core) jobs submitted every month ensures fair sharing of the computing resources among user groups, but may also cause unpredictably long job wait times for some users. Job wait time depends on many entangled factors and configuration parameters and is thus very hard to predict. Moreover, batch systems implementing a fair-share scheduling algorithm cannot provide users with any estimate of the job wait time at submission time.
Therefore, we investigate in this paper how learning-based techniques applied to the logs of the batch scheduling system of a large HTC datacenter can be used to estimate job wait time. First, we illustrate users' need for such an estimate. Then, we identify some intuitive causes of this wait time from the information found in the batch system logs. We also formally analyze the correlation between job and system features and job wait time. Finally, we study several Machine Learning algorithms to implement learning-based estimators of both job wait time and job wait time ranges. Our experimental results show that a regression-based estimator can predict job wait time with a median absolute percentage error of about 54%, while a classifier that combines regression and classification assigns nearly 77% of the jobs to the right wait time range or to an immediately adjacent one.
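
The two evaluation measures named in the abstract can be illustrated with a small sketch. It is not the authors' code: the range boundaries in `RANGE_EDGES`, the toy wait times, and the choice to skip jobs with a zero actual wait are all assumptions made for illustration, and the helper names are hypothetical.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical wait-time range boundaries in seconds (not taken from the paper).
RANGE_EDGES = [60, 600, 3600, 6 * 3600]


def median_absolute_percentage_error(actual, predicted):
    """Median of |predicted - actual| / actual, expressed in percent.

    Jobs with a zero actual wait are skipped (an assumption of this sketch),
    since the relative error is undefined for them.
    """
    errors = [abs(p - a) / a * 100.0 for a, p in zip(actual, predicted) if a > 0]
    return median(errors)


def wait_range(seconds, edges=RANGE_EDGES):
    """Index of the range a wait time falls into, from 0 to len(edges)."""
    return bisect_right(edges, seconds)


def within_one_range_rate(actual, predicted, edges=RANGE_EDGES):
    """Share of jobs whose predicted range is the true one or a neighbor."""
    hits = sum(
        abs(wait_range(a, edges) - wait_range(p, edges)) <= 1
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


# Toy wait times in seconds, invented for illustration only.
actual = [30, 120, 900, 4000, 30000]
predicted = [45, 100, 2000, 3500, 500]

print(median_absolute_percentage_error(actual, predicted))  # 50.0
print(within_one_range_rate(actual, predicted))             # 0.8
```

On this toy data the last job is predicted three ranges away from its true range, so it counts as a miss for the range-based measure even though the median percentage error is barely affected by it. This is one reason the two measures are reported side by side.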

Cited By

  • (2024) Quantifying Uncertainty in HPC Job Queue Time Predictions. Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, pp. 1-3. DOI: 10.1145/3626203.3670627. Online publication date: 17-Jul-2024
  • (2024) Tandem Predictions for HPC jobs. Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, pp. 1-9. DOI: 10.1145/3626203.3670547. Online publication date: 17-Jul-2024

Published In

Job Scheduling Strategies for Parallel Processing: 24th International Workshop, JSSPP 2021, Virtual Event, May 21, 2021, Revised Selected Papers
May 2021
237 pages
ISBN: 978-3-030-88223-5
DOI: 10.1007/978-3-030-88224-2

Publisher

Springer-Verlag, Berlin, Heidelberg
