DOI: 10.1109/CGO.2013.6494986
Article

Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

Published: 23 February 2013

Abstract

In this paper, we present an approach to estimating the performance upper bound of GPU applications based on algorithm analysis and assembly-code-level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We ask how much room for optimization remains for SGEMM, and why. According to our analysis, the nature of the Fermi (Kepler) instruction set and the limited issue throughput of the schedulers are the main factors that keep SGEMM from approaching the theoretical peak performance. The estimated upper-bound performance of SGEMM is around 82.5% of the theoretical peak on the GTX580 Fermi GPU and 57.6% on the GTX680 Kepler GPU. Guided by this analysis and written in native assembly language, our SGEMM implementations achieve on average about 5% better performance than CUBLAS in the CUDA 4.1 SDK for large matrices on the GTX580. The achieved performance is around 90% of the estimated upper bound for SGEMM on the GTX580. On the GTX680, the best performance we achieve is around 77.3% of the estimated upper bound. We also describe how to use native assembly language directly in CUDA runtime source code.
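The percentages in the abstract are quoted against two different baselines (theoretical peak versus estimated upper bound), so it can help to put them on one scale. The following is a minimal sketch of that arithmetic, not the authors' method. The fractions are taken from the abstract; the core counts and clock rates are assumptions drawn from commonly published board specifications (they do not appear in the abstract), and the function names are hypothetical.

```python
# Fractions quoted in the abstract.
UPPER_BOUND_OF_PEAK = {"GTX580": 0.825, "GTX680": 0.576}   # estimated bound / theoretical peak
ACHIEVED_OF_BOUND = {"GTX580": 0.90, "GTX680": 0.773}      # achieved / estimated bound (approximate)

# ASSUMPTION: board specifications, not stated in the abstract.
# (CUDA cores, core clock in GHz)
ASSUMED_SPECS = {"GTX580": (512, 1.544), "GTX680": (1536, 1.006)}


def theoretical_peak_gflops(cores, clock_ghz):
    """Single-precision peak, counting one fused multiply-add (2 flops) per core per cycle."""
    return 2.0 * cores * clock_ghz


def achieved_fraction_of_peak(gpu):
    """Achieved performance as a fraction of the theoretical peak."""
    return UPPER_BOUND_OF_PEAK[gpu] * ACHIEVED_OF_BOUND[gpu]


if __name__ == "__main__":
    for gpu, (cores, clock_ghz) in ASSUMED_SPECS.items():
        peak = theoretical_peak_gflops(cores, clock_ghz)
        bound = UPPER_BOUND_OF_PEAK[gpu] * peak
        achieved = achieved_fraction_of_peak(gpu) * peak
        print(f"{gpu}: peak {peak:.0f} GFLOPS, bound {bound:.0f} GFLOPS, "
              f"achieved about {achieved:.0f} GFLOPS "
              f"({achieved_fraction_of_peak(gpu):.1%} of peak)")
```

Under these assumptions the reported results correspond to roughly 74% of the theoretical peak on the GTX580 and roughly 45% on the GTX680. The absolute GFLOPS figures depend on the assumed clock rates, so treat them as indicative only.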



Published In

CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
February 2013
366 pages
ISBN: 9781467355247

Publisher

IEEE Computer Society, United States

Author Tags

  1. CUDA
  2. Fermi GPU
  3. Kepler GPU
  4. Performance Upper Bound Analysis
  5. SGEMM

Qualifiers

  • Article


Cited By

  • (2024) (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM Transactions on Programming Languages and Systems 46:3 (1-74). DOI: 10.1145/3665643. Online publication date: 10-Oct-2024
  • (2023) ReFloat: Low-Cost Floating-Point Processing in ReRAM for Accelerating Iterative Linear Solvers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). DOI: 10.1145/3581784.3607077. Online publication date: 12-Nov-2023
  • (2023) (De/Re)-Compositions Expressed Systematically via MDH-Based Schedules. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (61-72). DOI: 10.1145/3578360.3580269. Online publication date: 17-Feb-2023
  • (2023) Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph. Proceedings of the 37th International Conference on Supercomputing (277-288). DOI: 10.1145/3577193.3593728. Online publication date: 21-Jun-2023
  • (2022) MLIR-based code generation for GPU tensor cores. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (117-128). DOI: 10.1145/3497776.3517770. Online publication date: 19-Mar-2022
  • (2021) Optimizing Winograd-Based Convolution with Tensor Cores. Proceedings of the 50th International Conference on Parallel Processing (1-10). DOI: 10.1145/3472456.3472473. Online publication date: 9-Aug-2021
  • (2021) EGEMM-TC. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (278-291). DOI: 10.1145/3437801.3441599. Online publication date: 17-Feb-2021
  • (2020) RAMMER. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (881-897). DOI: 10.5555/3488766.3488816. Online publication date: 4-Nov-2020
  • (2020) Strassen’s Algorithm Reloaded on GPUs. ACM Transactions on Mathematical Software 46:1 (1-22). DOI: 10.1145/3372419. Online publication date: 20-Mar-2020
  • (2019) Decoding CUDA binary. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (229-241). DOI: 10.5555/3314872.3314900. Online publication date: 16-Feb-2019
