DOI: 10.1145/3577193.3593715
Research article
Open access

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Published: 21 June 2023

Abstract

General Matrix Multiplication (GEMM) is a crucial algorithm for applications such as machine learning and scientific computing, since an efficient GEMM implementation is essential to the performance of these computations. While researchers often pursue higher performance by scaling up to large computing platforms, the increased scale of such systems raises concerns about hardware and software reliability. In this paper, we present the design of a high-performance GPU-based GEMM that integrates an algorithm-based fault tolerance (ABFT) scheme to detect and correct silent data corruptions in the computing units on the fly. We explore fault-tolerant designs at the thread, warp, and threadblock levels, and we also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, closed-source cuBLAS GEMM. We present a kernel fusion strategy that overlaps the memory latency introduced by fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatic code generation of both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Our experimental results demonstrate that our baseline GEMM achieves comparable or superior performance to closed-source cuBLAS. Compared with the prior state-of-the-art non-fused fault-tolerant GEMM, our optimal fused strategy achieves a 39.04% speedup on average. In addition, our fault-tolerant GEMM incurs only minimal overhead (8.89% on average) relative to cuBLAS, even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by our code generator achieve speedups of 160% to 183.5% for fault-tolerant GEMM and 148.55% to 165.12% for non-fault-tolerant GEMM, outperforming cuBLAS by up to 41.40%.
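The ABFT scheme referenced above rests on a checksum identity: encode the column sums of A and the row sums of B before the multiplication, derive reference checksums of C from them, and after the GEMM compare recomputed checksums of C against the references; the intersection of a mismatching row and column locates a single corrupted element, and the checksum difference corrects it. The C++ sketch below is a minimal host-side illustration under assumed conditions (small matrices, at most one error); the matrix sizes and all function and variable names (gemm, colsumA, rowsumB, and so on) are hypothetical and do not reproduce the paper's fused GPU kernels.

// Minimal sketch of checksum-based ABFT for C = A * B (host-side, assumptions only).
#include <cstdio>
#include <vector>
#include <cmath>

// Dense row-major GEMM: C(m x n) = A(m x k) * B(k x n).
static void gemm(int m, int n, int k,
                 const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

int main() {
    const int m = 4, n = 4, k = 4;
    std::vector<double> A(m * k), B(k * n), C(m * n);
    for (int i = 0; i < m * k; ++i) A[i] = i % 7 + 1;
    for (int i = 0; i < k * n; ++i) B[i] = (i % 5) - 2;

    // Encode checksums before the multiplication:
    //   colsumA[p] = sum_i A[i][p]  (column checksum of A)
    //   rowsumB[p] = sum_j B[p][j]  (row checksum of B)
    std::vector<double> colsumA(k, 0.0), rowsumB(k, 0.0);
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < m; ++i) colsumA[p] += A[i * k + p];
    for (int p = 0; p < k; ++p)
        for (int j = 0; j < n; ++j) rowsumB[p] += B[p * n + j];

    // Reference checksums of C follow from linearity:
    //   colRef[j] = colsumA * B(:,j)   and   rowRef[i] = A(i,:) * rowsumB
    std::vector<double> colRef(n, 0.0), rowRef(m, 0.0);
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p) colRef[j] += colsumA[p] * B[p * n + j];
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p) rowRef[i] += A[i * k + p] * rowsumB[p];

    gemm(m, n, k, A, B, C);
    C[2 * n + 1] += 42.0;  // inject one silent error into C[2][1]

    // Verify: recompute the checksums of the produced C and compare.
    const double tol = 1e-9;
    int badRow = -1, badCol = -1;
    double rowDelta = 0.0;
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += C[i * n + j];
        if (std::fabs(s - rowRef[i]) > tol) { badRow = i; rowDelta = s - rowRef[i]; }
    }
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < m; ++i) s += C[i * n + j];
        if (std::fabs(s - colRef[j]) > tol) badCol = j;
    }

    // A single corrupted element sits at the intersection of the mismatching
    // row and column checksums and is corrected by the checksum difference.
    if (badRow >= 0 && badCol >= 0) {
        printf("detected error at C[%d][%d], correcting by %.1f\n",
               badRow, badCol, -rowDelta);
        C[badRow * n + badCol] -= rowDelta;
    } else {
        printf("no error detected\n");
    }
    return 0;
}

In the paper's setting, the analogous checksum verification is fused into the GPU kernels so that its memory traffic overlaps with the GEMM computation itself rather than running as a separate pass.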


Cited By

  • (2024) Parallelized 0/1 Knapsack Algorithm Optimization in CPU-GPU-Based Heterogeneous System with Algorithm-based Fault Tolerance. 2024 18th International Conference on Ubiquitous Information Management and Communication (IMCOM), 1-8. https://doi.org/10.1109/IMCOM60618.2024.10418349. Online publication date: 3 January 2024.
  • (2023) FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs. IEEE Transactions on Parallel and Distributed Systems 34(12), 3207-3223. https://doi.org/10.1109/TPDS.2023.3316011. Online publication date: 25 September 2023.


Published In

ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
June 2023, 505 pages
ISBN: 9798400700569
DOI: 10.1145/3577193
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 June 2023


    Author Tags

    1. GEMM
    2. GPU
    3. performance optimization
    4. reliability
    5. resilience

    Qualifiers

    • Research-article

    Conference

    ICS '23

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 529
    • Downloads (last 6 weeks): 70
    Reflects downloads up to 21 November 2024
