Article

Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

Authors:

Andre Seznec

CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Pages 1 - 10

https://doi.org/10.1109/CGO.2013.6494986

Published: 23 February 2013

Abstract

In this paper, we present an approach to estimating the performance upper bound of GPU applications, based on algorithm analysis and assembly-code-level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization space is left for SGEMM, and why. According to our analysis, the nature of the Fermi (Kepler) instruction set and the limited issue throughput of the schedulers are the main factors that keep SGEMM from approaching the theoretical peak performance. The estimated upper bound for SGEMM is around 82.5% of the theoretical peak performance on the GTX580 Fermi GPU and 57.6% on the GTX680 Kepler GPU. Guided by this analysis and using the native assembly language, our SGEMM implementations achieve, on average, about 5% better performance than CUBLAS in the CUDA 4.1 SDK for large matrices on the GTX580. The achieved performance is around 90% of the estimated upper-bound performance of SGEMM on the GTX580. On the GTX680, the best performance we achieve is around 77.3% of the estimated performance upper bound. We also describe how to use native assembly language directly in the CUDA runtime source code.
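The abstract reports two layers of percentages: the estimated upper bound as a share of theoretical peak, and the achieved performance as a share of that upper bound. Multiplying the two gives the achieved share of theoretical peak, which the abstract does not state directly. The sketch below does only that arithmetic. The function and variable names are illustrative and not taken from the paper, and because the inputs are the rounded "around" figures quoted above, the results are approximate.

```python
# Rough arithmetic on the figures quoted in the abstract.
# All names here are hypothetical; only the four percentages come from the text.

def fraction_of_theoretical_peak(bound_fraction: float,
                                 achieved_fraction_of_bound: float) -> float:
    """Combine 'upper bound / theoretical peak' with 'achieved / upper bound'."""
    return bound_fraction * achieved_fraction_of_bound


# (upper bound as a fraction of theoretical peak,
#  achieved performance as a fraction of the upper bound)
ABSTRACT_FIGURES = {
    "GTX580": (0.825, 0.90),   # Fermi (GF110)
    "GTX680": (0.576, 0.773),  # Kepler (GK104), best reported case
}


def summarize(figures=ABSTRACT_FIGURES):
    """Return the achieved fraction of theoretical peak for each GPU."""
    return {
        gpu: fraction_of_theoretical_peak(bound, achieved)
        for gpu, (bound, achieved) in figures.items()
    }


if __name__ == "__main__":
    for gpu, fraction in summarize().items():
        print(f"{gpu}: roughly {fraction:.0%} of theoretical peak")
```

Under these figures the GTX580 implementation lands at roughly 74% of theoretical peak and the best GTX680 result at roughly 45%, which shows how much of the remaining gap the analysis attributes to the hardware bound and how much to the implementation.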



Information

Proceedings volume: 366 pages, February 2013

ISBN: 9781467355247

Publisher: IEEE Computer Society, United States
