DOI: 10.1145/3577193.3593715
Research article
Open access

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Published: 21 June 2023

Abstract

General Matrix Multiplication (GEMM) is a crucial algorithm for applications such as machine learning and scientific computing, since an efficient GEMM implementation is essential to the performance of these computations. While researchers often pursue higher performance by scaling up to large computing platforms, the increased scale of such systems raises concerns about hardware and software reliability. In this paper, we present the design of a high-performance GPU-based GEMM that integrates an algorithm-based fault tolerance (ABFT) scheme to detect and correct silent data corruptions in the computing units on the fly. We explore fault-tolerant designs at the thread, warp, and threadblock levels, and we also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, closed-source cuBLAS GEMM. We present a kernel fusion strategy that overlaps the memory latency introduced by fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatic code generation of both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Our experimental results demonstrate that our baseline GEMM achieves comparable or superior performance to closed-source cuBLAS. Compared with the prior state-of-the-art non-fused fault-tolerant GEMM, our optimal fused strategy achieves a 39.04% speedup on average. In addition, our fault-tolerant GEMM incurs only minimal overhead (8.89% on average) relative to cuBLAS, even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by our code generator achieve speedups of 160% to 183.5% for fault-tolerant GEMM and 148.55% to 165.12% for non-fault-tolerant GEMM, outperforming cuBLAS by up to 41.40%.
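The ABFT scheme referenced above rests on a checksum identity: encode the column sums of A and the row sums of B before the multiplication, derive reference checksums of C from them, and after the GEMM compare recomputed checksums of C against the references; the intersection of a mismatching row and column locates a single corrupted element, and the checksum difference corrects it. The C++ sketch below is a minimal host-side illustration under assumed conditions (small matrices, at most one error); the matrix sizes and all function and variable names (gemm, colsumA, rowsumB, and so on) are hypothetical and do not reproduce the paper's fused GPU kernels.

// Minimal sketch of checksum-based ABFT for C = A * B (host-side, assumptions only).
#include <cstdio>
#include <vector>
#include <cmath>

// Dense row-major GEMM: C(m x n) = A(m x k) * B(k x n).
static void gemm(int m, int n, int k,
                 const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

int main() {
    const int m = 4, n = 4, k = 4;
    std::vector<double> A(m * k), B(k * n), C(m * n);
    for (int i = 0; i < m * k; ++i) A[i] = i % 7 + 1;
    for (int i = 0; i < k * n; ++i) B[i] = (i % 5) - 2;

    // Encode checksums before the multiplication:
    //   colsumA[p] = sum_i A[i][p]  (column checksum of A)
    //   rowsumB[p] = sum_j B[p][j]  (row checksum of B)
    std::vector<double> colsumA(k, 0.0), rowsumB(k, 0.0);
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < m; ++i) colsumA[p] += A[i * k + p];
    for (int p = 0; p < k; ++p)
        for (int j = 0; j < n; ++j) rowsumB[p] += B[p * n + j];

    // Reference checksums of C follow from linearity:
    //   colRef[j] = colsumA * B(:,j)   and   rowRef[i] = A(i,:) * rowsumB
    std::vector<double> colRef(n, 0.0), rowRef(m, 0.0);
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p) colRef[j] += colsumA[p] * B[p * n + j];
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p) rowRef[i] += A[i * k + p] * rowsumB[p];

    gemm(m, n, k, A, B, C);
    C[2 * n + 1] += 42.0;  // inject one silent error into C[2][1]

    // Verify: recompute the checksums of the produced C and compare.
    const double tol = 1e-9;
    int badRow = -1, badCol = -1;
    double rowDelta = 0.0;
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += C[i * n + j];
        if (std::fabs(s - rowRef[i]) > tol) { badRow = i; rowDelta = s - rowRef[i]; }
    }
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < m; ++i) s += C[i * n + j];
        if (std::fabs(s - colRef[j]) > tol) badCol = j;
    }

    // A single corrupted element sits at the intersection of the mismatching
    // row and column checksums and is corrected by the checksum difference.
    if (badRow >= 0 && badCol >= 0) {
        printf("detected error at C[%d][%d], correcting by %.1f\n",
               badRow, badCol, -rowDelta);
        C[badRow * n + badCol] -= rowDelta;
    } else {
        printf("no error detected\n");
    }
    return 0;
}

In the paper's setting, the analogous checksum verification is fused into the GPU kernels so that its memory traffic overlaps with the GEMM computation itself rather than running as a separate pass.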


Cited By

  • (2024) Parallelized 0/1 Knapsack Algorithm Optimization in CPU-GPU-Based Heterogeneous System with Algorithm-based Fault Tolerance. 2024 18th International Conference on Ubiquitous Information Management and Communication (IMCOM), 1-8. https://doi.org/10.1109/IMCOM60618.2024.10418349. Online publication date: 3 January 2024.
  • (2023) FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs. IEEE Transactions on Parallel and Distributed Systems 34(12), 3207-3223. https://doi.org/10.1109/TPDS.2023.3316011. Online publication date: 25 September 2023.


Published In

ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
June 2023, 505 pages
ISBN: 9798400700569
DOI: 10.1145/3577193
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 June 2023


    Author Tags

    1. GEMM
    2. GPU
    3. performance optimization
    4. reliability
    5. resilience

    Qualifiers

    • Research-article

    Conference

    ICS '23

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 529
    • Downloads (last 6 weeks): 70
    Reflects downloads up to 21 November 2024
