DOI: 10.1007/978-3-030-50743-5_12
Article

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions

Published: 22 June 2020

Abstract

This paper proposes a method for implementing dense matrix multiplication in FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform matrix multiplications on FP16 inputs with FP32 precision and return the result in FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on an error-free transformation of matrix multiplication. The method has three prominent advantages: first, it can be built upon the cublasGemmEx routine, which uses Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, up to and including the correctly rounded result; third, it ensures bit-level reproducibility even across different numbers of cores and threads. The achievable performance depends on the absolute-value range of the elements of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64-equivalent operations on a Titan RTX GPU (which offers 130 TFlops on its Tensor Cores), whereas cublasDgemm achieves only 539 GFlops on the FP64 floating-point units. Our results reveal the possibility of using hardware with limited FP32/FP64 resources but fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.
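To make the flow concrete, the sketch below illustrates the general Ozaki-style pattern the abstract describes: each FP64 input matrix is split into a sum of low-precision slices, every pair of slices is multiplied with cublasGemmEx using FP16 inputs and FP32 accumulation (the Tensor Core path), and the partial products are summed back into an FP64 result. This is a minimal sketch under stated assumptions, not the paper's implementation: the splitting shown is a naive leading-bits extraction rather than the error-free transformation of the Ozaki scheme, and the matrix size, slice count, and host-side summation are chosen purely for illustration. The cuBLAS calls follow the CUDA 10-era API that the paper's hardware generation used.

// Illustrative sketch only (not the paper's implementation): split FP64
// matrices into FP16 slices, multiply every pair of slices with
// cublasGemmEx (FP16 inputs, FP32 accumulation on Tensor Cores), and
// accumulate the partial products into an FP64 result.
// The split below is a naive leading-bits extraction, NOT the error-free
// transformation used by the Ozaki scheme; sizes and slice count are
// assumptions for illustration.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// Split each element into numSlices FP16 pieces whose sum approximates it.
static void splitToFp16(const std::vector<double>& x,
                        std::vector<std::vector<__half>>& slices,
                        int numSlices) {
  slices.assign(numSlices, std::vector<__half>(x.size()));
  for (size_t i = 0; i < x.size(); ++i) {
    double r = x[i];
    for (int s = 0; s < numSlices; ++s) {
      __half h = __float2half(static_cast<float>(r));  // leading FP16 part
      slices[s][i] = h;
      r -= static_cast<double>(__half2float(h));       // carry the remainder
    }
  }
}

int main() {
  const int n = 256;            // square matrices, column-major
  const int numSlices = 3;      // illustrative slice count
  std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

  std::vector<std::vector<__half>> As, Bs;
  splitToFp16(A, As, numSlices);
  splitToFp16(B, Bs, numSlices);

  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core paths (CUDA 10 API)

  __half *dA, *dB;
  float *dC;
  cudaMalloc(&dA, n * n * sizeof(__half));
  cudaMalloc(&dB, n * n * sizeof(__half));
  cudaMalloc(&dC, n * n * sizeof(float));

  const float one = 1.0f, zero = 0.0f;
  std::vector<float> Cpart(n * n);

  // Multiply every slice pair: FP16 inputs, FP32 output, FP32 compute type.
  for (int p = 0; p < numSlices; ++p) {
    for (int q = 0; q < numSlices; ++q) {
      cudaMemcpy(dA, As[p].data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);
      cudaMemcpy(dB, Bs[q].data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);
      cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                   &one, dA, CUDA_R_16F, n,
                         dB, CUDA_R_16F, n,
                   &zero, dC, CUDA_R_32F, n,
                   CUDA_R_32F,   // CUDA 10-style compute type; use CUBLAS_COMPUTE_32F on CUDA 11+
                   CUBLAS_GEMM_DEFAULT_TENSOR_OP);
      cudaMemcpy(Cpart.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
      // Sum the partial product into the FP64 result on the host.
      for (int i = 0; i < n * n; ++i) C[i] += static_cast<double>(Cpart[i]);
    }
  }

  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  cublasDestroy(handle);
  return 0;
}

With an exact (error-free) splitting and a reproducible summation of the slice products, this pattern is what allows the result to reach and exceed standard DGEMM accuracy and to be bit-reproducible across thread counts; the naive split above only conveys the overall structure.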


Published In

High Performance Computing: 35th International Conference, ISC High Performance 2020, Frankfurt/Main, Germany, June 22–25, 2020, Proceedings
June 2020, 561 pages
ISBN: 978-3-030-50742-8
DOI: 10.1007/978-3-030-50743-5
Editors: Ponnuswamy Sadayappan, Bradford L. Chamberlain, Guido Juckeland, Hatem Ltaief

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. Tensor cores
2. FP16
3. Half-precision
4. Low-precision
5. Matrix multiplication
6. GEMM
7. Linear algebra
8. Accuracy
9. Reproducibility

Cited By

• (2023) Mixed-Precision Random Projection for RandNLA on Tensor Cores. In: Proceedings of the Platform for Advanced Scientific Computing Conference, pp. 1–11. DOI: 10.1145/3592979.3593413. Online publication date: 26-Jun-2023
• (2023) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14. DOI: 10.1145/3581784.3607051. Online publication date: 12-Nov-2023
• (2022) Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance. International Journal of High Performance Computing Applications 36(4), pp. 475–491. DOI: 10.1177/10943420221090256. Online publication date: 1-Jul-2022
• (2022) Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors. In: Parallel Processing and Applied Mathematics, pp. 40–54. DOI: 10.1007/978-3-031-30442-2_4. Online publication date: 11-Sep-2022
• (2021) Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme. In: Proceedings of the 50th International Conference on Parallel Processing, pp. 1–11. DOI: 10.1145/3472456.3472493. Online publication date: 9-Aug-2021
• (2021) Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme. In: The International Conference on High Performance Computing in Asia-Pacific Region, pp. 100–109. DOI: 10.1145/3432261.3432270. Online publication date: 20-Jan-2021
