Abstract
The development of high-performance dense linear algebra (DLA) depends critically on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is particularly true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g. up to 375 GFlop/s in single precision and up to 75 GFlop/s in double precision arithmetic on NVIDIA’s GTX 280, is difficult to achieve. Reaching it requires extensive GPU knowledge and even reverse engineering to uncover undocumented details of the architecture that have been of key importance in the development. In this paper, we describe some GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas. Auto-tuning, as we show in this paper, is a very practical solution: in addition to easy portability, it can often deliver substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280).
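The general idea of empirical auto-tuning can be illustrated with a minimal sketch. It is not the authors' GPU implementation: it is a CPU stand-in written in plain Python, and the names `blocked_gemm`, `gflops`, and `autotune`, as well as the candidate tile sizes, are hypothetical choices made for illustration. The sketch times a tiled matrix multiplication for several tile sizes, converts each time to GFlop/s using the standard count of 2n³ floating-point operations for an n × n GEMM, and keeps the fastest variant.

```python
import time


def blocked_gemm(A, B, n, bs):
    """Return C = A * B for n x n matrices (lists of rows), using bs x bs tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(ii, min(ii + bs, n)):
                    row_a, row_c = A[i], C[i]
                    for k in range(kk, min(kk + bs, n)):
                        a = row_a[k]
                        row_b = B[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += a * row_b[j]
    return C


def gflops(n, seconds):
    """GFlop/s of an n x n GEMM, counting 2 * n^3 floating-point operations."""
    return 2.0 * n ** 3 / seconds / 1e9


def autotune(n, candidates, repeats=3):
    """Time each candidate tile size and return (best_size, {size: GFlop/s})."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    results = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blocked_gemm(A, B, n, bs)
            best = min(best, time.perf_counter() - t0)
        # Guard against a zero reading from a coarse timer.
        results[bs] = gflops(n, max(best, 1e-12))
    winner = max(results, key=results.get)
    return winner, results


if __name__ == "__main__":
    size, table = autotune(64, [4, 8, 16, 32])
    for bs in sorted(table):
        print(f"tile {bs:3d}: {table[bs]:.4f} GFlop/s")
    print("selected tile size:", size)
```

The selected tile size depends on the machine on which the search runs, which is the point of the approach: the same search can be repeated on new hardware instead of hand-tuning the kernel again.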
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89
DOI: https://doi.org/10.1007/978-3-642-01970-8_89
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01969-2
Online ISBN: 978-3-642-01970-8
eBook Packages: Computer Science (R0)