
DOI: 10.1145/2503210.2503299

Effective sampling-driven performance tools for GPU-accelerated supercomputers

Published: 17 November 2013

Abstract

Performance analysis of GPU-accelerated systems requires a system-wide view that considers both CPU and GPU components. In this paper, we describe how to extend system-wide, sampling-based performance analysis methods to GPU-accelerated systems. Since current GPUs do not support sampling, our implementation required careful coordination of instrumentation-based performance data collection on GPUs with sampling-based methods employed on CPUs. In addition, we introduce a novel technique for analyzing systemic idleness in CPU/GPU systems. We demonstrate the effectiveness of our techniques with application case studies on Titan and Keeneland. Highlights of our case studies include: 1) we improved the performance of LULESH 1.0 by 30%; 2) we identified a hardware performance problem on Keeneland; 3) we identified a scaling problem in LAMMPS that stems from CUDA initialization; and 4) we identified a performance problem caused by GPU synchronization operations that are delayed by blocking system calls.
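
The coordination the abstract describes, asynchronous sampling on the CPU alongside instrumentation-based measurement on the GPU, can be illustrated with a small standalone sketch. The code below is not the authors' HPCToolkit implementation; it is a minimal example, under assumed names (the kernel scale and the counter cpu_samples are hypothetical), that pairs POSIX setitimer/SIGPROF timer sampling on the CPU with CUDA events bracketing a kernel launch on the GPU.

// Minimal sketch (not the paper's tool): SIGPROF-driven CPU sampling combined
// with event-based instrumentation of a GPU kernel launch. A real profiler
// would unwind the call stack at each sample; here we only count samples.
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static volatile sig_atomic_t cpu_samples = 0;

/* CPU side: count SIGPROF samples delivered while the process consumes CPU time. */
static void on_sigprof(int sig) { (void)sig; cpu_samples++; }

static void start_cpu_sampling(long usec)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigprof;
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it;
    it.it_interval.tv_sec = 0;  it.it_interval.tv_usec = usec;
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);   /* periodic SIGPROF */
}

/* GPU side: a trivial (hypothetical) kernel, instrumented with CUDA events. */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 24;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    start_cpu_sampling(1000);            /* ~1 kHz CPU sampling */

    /* Instrumentation-based GPU measurement: bracket the launch with events. */
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    cudaEventRecord(beg, 0);
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);           /* CPU thread idles here, waiting on the GPU */

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, beg, end);

    printf("GPU kernel time: %.3f ms, CPU samples taken: %d\n",
           gpu_ms, (int)cpu_samples);

    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    cudaFree(d_x);
    return 0;
}

In the paper's setting, the interesting moment is the call to cudaEventSynchronize, where the CPU thread sits idle waiting for GPU work to finish; attributing that idleness back to the GPU activity responsible for it is the kind of CPU-GPU blame shifting named in the author tags below.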


        Published In

        SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
        November 2013
        1123 pages
        ISBN:9781450323789
        DOI:10.1145/2503210
        • General Chair: William Gropp
        • Program Chair: Satoshi Matsuoka


        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Author Tags

        1. CPU-GPU blame shifting
        2. call path profiling
        3. heterogeneous architectures
        4. performance analysis

        Qualifiers

        • Research-article

        Conference

        SC13

        Acceptance Rates

        SC '13 Paper Acceptance Rate: 91 of 449 submissions (20%)
        Overall Acceptance Rate: 1,516 of 6,373 submissions (24%)


