A Note on Auto-tuning GEMM for GPUs

Yinan Li, Jack Dongarra & Stanimire Tomov

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5544)

Abstract

The development of high-performance dense linear algebra (DLA) depends critically on highly optimized BLAS, and in particular on the matrix multiplication routine (GEMM). This holds especially for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g., up to 375 GFlop/s in single precision and up to 75 GFlop/s in double precision arithmetic on NVIDIA’s GTX 280, is difficult to achieve. Reaching it requires extensive GPU knowledge and even reverse engineering to understand undocumented details of the architecture that have proved to be of key importance. In this paper, we describe GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas. Auto-tuning, as we show in this paper, is a very practical solution: in addition to easy portability, it often yields substantial speedups even on current GPUs (e.g., up to 27% in certain cases for both single and double precision GEMMs on the GTX 280).
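The general idea of empirical auto-tuning mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation and does not run on a GPU: it assumes a plain-Python blocked matrix multiply as a stand-in kernel, treats the tile size as the only tunable parameter, and picks the fastest candidate by timing. The names `blocked_gemm` and `autotune`, the candidate tile sizes, and the matrix size are all hypothetical.

```python
import random
import time


def blocked_gemm(A, B, n, tile):
    """Return C = A * B for n x n lists of lists, computed tile by tile.

    `tile` is the tunable blocking factor (a stand-in for the kernel
    parameters an auto-tuner for GPU GEMM would explore).
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply one tile of A by one tile of B.
                for i in range(ii, min(ii + tile, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + tile, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + tile, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune(n, candidate_tiles, repeats=3):
    """Time each candidate tile size and return (best_tile, timings)."""
    rng = random.Random(0)  # fixed seed so every candidate sees the same data
    A = [[rng.random() for _ in range(n)] for _ in range(n)]
    B = [[rng.random() for _ in range(n)] for _ in range(n)]
    timings = {}
    for tile in candidate_tiles:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_gemm(A, B, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best  # keep the fastest of the repeated runs
    best_tile = min(timings, key=timings.get)
    return best_tile, timings


if __name__ == "__main__":
    tile, timings = autotune(n=48, candidate_tiles=[4, 8, 16, 48])
    for t, seconds in sorted(timings.items()):
        print(f"tile={t:3d}  {seconds * 1e3:8.2f} ms")
    print("selected tile:", tile)
```

The selected tile size depends on the machine, which is the point of tuning empirically: the search is rerun on new hardware instead of rewriting the kernel by hand.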

Author information

Authors and Affiliations

University of Tennessee, USA
Yinan Li, Jack Dongarra & Stanimire Tomov
Oak Ridge National Laboratory, USA
Jack Dongarra
University of Manchester, UK
Jack Dongarra

Editor information

Editors and Affiliations

Center for Computation & Technology, Louisiana State University, 216 Johnston Hall, Baton Rouge, LA 70803, USA
Gabrielle Allen
Poznan Supercomputing and Networking Center, Poznan, Poland
Jaroslaw Nabrzyski
Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA
Edward Seidel
Department of Mathematics and Computer Science, University of Amsterdam, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
Geert Dick van Albada
Computer Science Department, University of Tennessee, Knoxville, TN 37996-3450, USA
Jack Dongarra
Faculty of Sciences, Section of Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
Peter M. A. Sloot

About this paper

Cite this paper

Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89

DOI: https://doi.org/10.1007/978-3-642-01970-8_89
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01969-2
Online ISBN: 978-3-642-01970-8
eBook Packages: Computer Science, Computer Science (R0)