Abstract
The mission of the DOE Argonne Leadership Computing Facility (ALCF) is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. The ALCF operates supercomputers that are generally among the Top 5 fastest machines in the world. Specifically, ALCF looks for science that is either too big to run anywhere else or would take so long elsewhere as to be impractical (i.e., “capability jobs”). At ALCF, batch scheduling plays a critical role in achieving a set of site goals within a set of constraints. While system utilization is an important goal at ALCF, its largest mission constraint is to ensure that extreme-scale parallel jobs take precedence. In this paper, we describe the specific scheduling goals and constraints, analyze the workload traces collected in 2013–2017 from the 48-rack petascale supercomputer Mira, and discuss the upcoming scheduling challenges at ALCF.
Acknowledgement
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Zhiling Lan is supported in part by US National Science Foundation grants CNS-1320125 and CCF-1422009.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Allcock, W., Rich, P., Fan, Y., Lan, Z. (2018). Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2017. Lecture Notes in Computer Science, vol 10773. Springer, Cham. https://doi.org/10.1007/978-3-319-77398-8_1
DOI: https://doi.org/10.1007/978-3-319-77398-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77397-1
Online ISBN: 978-3-319-77398-8
eBook Packages: Computer Science, Computer Science (R0)