DOI: 10.1007/978-3-031-32041-5_16
Article

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter

Published: 21 May 2023

Abstract

Resource demands of HPC applications vary significantly. However, it is common for HPC systems to assign resources primarily on a per-node basis to prevent interference from co-located workloads. This gap between coarse-grained resource allocation and varying resource demands can leave HPC resources underutilized. In this study, we analyze the resource usage and application behavior of NERSC’s Perlmutter, a state-of-the-art open-science HPC system with both CPU-only and GPU-accelerated nodes. Our one-month usage analysis reveals that CPUs are commonly not fully utilized, especially for GPU-enabled jobs. Around 64% of both CPU-only and GPU-enabled jobs used 50% or less of the available host memory capacity. Additionally, about 50% of GPU-enabled jobs used at most 25% of the GPU memory, and, in one way or another, memory capacity was left underutilized across all jobs. While our study comes early in Perlmutter’s lifetime, so policies and application workloads may still change, it provides valuable insights into performance characterization and application behavior, and it motivates systems with more fine-grained resource allocation.
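To make the per-job utilization statistics above concrete (for example, the share of jobs whose peak host-memory usage stays at or below 50% of node capacity), the sketch below aggregates per-job peak-memory figures into such a fraction. This is a minimal illustration under assumed field names and made-up numbers, not the authors' actual monitoring pipeline or Perlmutter's real telemetry schema.

```python
# Minimal sketch: what fraction of jobs used <= 50% of the host memory
# on their allocated node type. All names and values here are illustrative
# assumptions, not the paper's data or pipeline.
from dataclasses import dataclass


@dataclass
class JobSample:
    job_id: str
    peak_host_mem_gib: float       # peak resident host memory observed for the job
    node_mem_capacity_gib: float   # host memory capacity of the allocated node type


def fraction_of_jobs_at_or_below(samples: list[JobSample], threshold: float = 0.5) -> float:
    """Share of jobs whose peak host-memory utilization is <= threshold."""
    if not samples:
        return 0.0
    under = sum(
        1 for s in samples
        if s.peak_host_mem_gib / s.node_mem_capacity_gib <= threshold
    )
    return under / len(samples)


# Toy usage with made-up job records and node capacities.
jobs = [
    JobSample("cpu-job-1", peak_host_mem_gib=120.0, node_mem_capacity_gib=512.0),
    JobSample("cpu-job-2", peak_host_mem_gib=400.0, node_mem_capacity_gib=512.0),
    JobSample("gpu-job-1", peak_host_mem_gib=60.0, node_mem_capacity_gib=256.0),
]
print(f"{fraction_of_jobs_at_or_below(jobs):.0%} of jobs used <= 50% of host memory")
```

Applying the same aggregation separately to host memory on CPU nodes, host memory on GPU nodes, and GPU memory yields the kind of distributions the abstract summarizes.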


Cited By

  • (2023) A Data-driven Analysis of a Cloud Data Center: Statistical Characterization of Workload, Energy and Temperature. In: Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing, pp. 1–10. https://doi.org/10.1145/3603166.3632137. Online publication date: 4-Dec-2023.



Published In

High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings
May 2023
431 pages
ISBN:978-3-031-32040-8
DOI:10.1007/978-3-031-32041-5
Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 21 May 2023

Author Tags

  1. HPC
  2. Large-scale Characterization
  3. Resource Utilization
  4. GPU Utilization
  5. Memory System
  6. Disaggregated Memory

Qualifiers

  • Article

