
DOI: 10.1145/2503210.2503299

Effective sampling-driven performance tools for GPU-accelerated supercomputers

Published: 17 November 2013

Abstract

Performance analysis of GPU-accelerated systems requires a system-wide view that considers both CPU and GPU components. In this paper, we describe how to extend system-wide, sampling-based performance analysis methods to GPU-accelerated systems. Since current GPUs do not support sampling, our implementation required careful coordination of instrumentation-based performance data collection on GPUs with sampling-based methods employed on CPUs. In addition, we introduce a novel technique for analyzing systemic idleness in CPU/GPU systems. We demonstrate the effectiveness of our techniques with application case studies on Titan and Keeneland. Highlights of our case studies include: 1) we improved the performance of LULESH 1.0 by 30%; 2) we identified a hardware performance problem on Keeneland; 3) we identified a scaling problem in LAMMPS that stems from CUDA initialization; and 4) we identified a performance problem caused by GPU synchronization operations that are delayed by blocking system calls.
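
The coordination the abstract describes, asynchronous sampling on the CPU alongside instrumentation-based measurement on the GPU, can be illustrated with a small standalone sketch. The code below is not the authors' HPCToolkit implementation; it is a minimal example, under assumed names (the kernel scale and the counter cpu_samples are hypothetical), that pairs POSIX setitimer/SIGPROF timer sampling on the CPU with CUDA events bracketing a kernel launch on the GPU.

// Minimal sketch (not the paper's tool): SIGPROF-driven CPU sampling combined
// with event-based instrumentation of a GPU kernel launch. A real profiler
// would unwind the call stack at each sample; here we only count samples.
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static volatile sig_atomic_t cpu_samples = 0;

/* CPU side: count SIGPROF samples delivered while the process consumes CPU time. */
static void on_sigprof(int sig) { (void)sig; cpu_samples++; }

static void start_cpu_sampling(long usec)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigprof;
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it;
    it.it_interval.tv_sec = 0;  it.it_interval.tv_usec = usec;
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);   /* periodic SIGPROF */
}

/* GPU side: a trivial (hypothetical) kernel, instrumented with CUDA events. */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 24;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    start_cpu_sampling(1000);            /* ~1 kHz CPU sampling */

    /* Instrumentation-based GPU measurement: bracket the launch with events. */
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    cudaEventRecord(beg, 0);
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);           /* CPU thread idles here, waiting on the GPU */

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, beg, end);

    printf("GPU kernel time: %.3f ms, CPU samples taken: %d\n",
           gpu_ms, (int)cpu_samples);

    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    cudaFree(d_x);
    return 0;
}

In the paper's setting, the interesting moment is the call to cudaEventSynchronize, where the CPU thread sits idle waiting for GPU work to finish; attributing that idleness back to the GPU activity responsible for it is the kind of CPU-GPU blame shifting named in the author tags below.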


        Published In

        SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
        November 2013
        1123 pages
        ISBN:9781450323789
        DOI:10.1145/2503210
        • General Chair: William Gropp
        • Program Chair: Satoshi Matsuoka


        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Author Tags

        1. CPU-GPU blame shifting
        2. call path profiling
        3. heterogeneous architectures
        4. performance analysis

        Qualifiers

        • Research-article

        Conference

        SC13

        Acceptance Rates

        SC '13 Paper Acceptance Rate: 91 of 449 submissions (20%)
        Overall Acceptance Rate: 1,516 of 6,373 submissions (24%)


