DOI: 10.1109/CCGrid.2016.32
research-article

Towards understanding job heterogeneity in HPC: a NERSC case study

Published: 16 May 2016

Abstract

The high performance computing (HPC) scheduling landscape is changing. Large scientific computations increasingly include high-throughput, data-intensive, and stream-processing compute models. These jobs increase workload heterogeneity, which presents challenges for classical HPC schedulers oriented toward tightly coupled MPI jobs. Thus, it is important to define new analysis methods to understand the heterogeneity of the workload and its possible effect on the performance of current systems. In this paper, we present a methodology to assess job heterogeneity in workloads and scheduling queues. We apply the method to the 2014 workloads of three current National Energy Research Scientific Computing Center (NERSC) systems. Finally, we present the results of this analysis, with the observation that heterogeneity might reduce the predictability of jobs' wait times.

Published In

CCGRID '16: Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing
May 2016
784 pages
ISBN: 9781509024520

Publisher

IEEE Press

Conference

CCGrid '16

Cited By

  • (2022) A Case For Intra-rack Resource Disaggregation in HPC. ACM Transactions on Architecture and Code Optimization 19:2 (1-26). DOI: 10.1145/3514245. Online publication date: 7-Mar-2022
  • (2021) User-level Workload Analysis for Supercomputers. Proceedings of the 2021 4th International Conference on Software Engineering and Information Management (68-73). DOI: 10.1145/3451471.3451483. Online publication date: 16-Jan-2021
  • (2020) Job characteristics on large-scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-17). DOI: 10.5555/3433701.3433812. Online publication date: 9-Nov-2020
  • (2020) GIFT. Proceedings of the 18th USENIX Conference on File and Storage Technologies (103-120). DOI: 10.5555/3386691.3386702. Online publication date: 24-Feb-2020
  • (2020) Uncovering access, reuse, and sharing characteristics of I/O-intensive files on large-scale production HPC systems. Proceedings of the 18th USENIX Conference on File and Storage Technologies (91-102). DOI: 10.5555/3386691.3386701. Online publication date: 24-Feb-2020
