
DOI: 10.1145/3458817.3476182
Research article | Open access

Efficient tensor core-based GPU kernels for structured sparsity under reduced precision

Published: 13 November 2021

Abstract

The success of DNNs comes at the cost of excessive memory and computation, which can be reduced by jointly exploiting reduced precision and sparsity. Existing sparse GPU kernels, however, fail to achieve practical speedup over cublasHgemm under half precision. Kernels for fine-grained sparsity suffer from low data reuse, while kernels for coarse-grained sparsity are limited by the tension between kernel performance and model quality across grain sizes. We propose column-vector sparse encoding, which achieves a smaller grain size than block sparsity at the same reuse rate and applies to both SpMM and SDDMM, the two major sparse operations in DNNs. We also introduce Tensor-Core-based 1D Octet Tiling, which provides efficient memory access and computation patterns at small grain sizes. Building on these, we design SpMM and SDDMM kernels that achieve 1.71-7.19x speedup over cuSPARSE. Practical speedup over cublasHgemm is achieved under >70% and >90% sparsity with a 4x1 grain size and half precision.
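
As a reading aid, below is a minimal NumPy sketch of the idea behind column-vector sparse encoding and the corresponding SpMM: the sparse matrix is stored as the set of nonzero 4x1 column vectors together with their coordinates, so each retained grain stays contiguous and can be reused across all columns of the dense operand. The function names and the exact layout are illustrative assumptions; the paper's actual kernels are half-precision CUDA Tensor Core implementations, not this reference code.

```python
# Illustrative sketch of a 4x1 column-vector sparse encoding and a reference SpMM.
# Not the paper's CUDA kernels; names and layout are hypothetical.
import numpy as np

V = 4  # grain size: each retained unit is a 4x1 column vector

def encode_column_vector_sparse(A, v=V):
    """Keep only the nonzero v-by-1 column vectors of A; return their values and coordinates."""
    m, k = A.shape
    assert m % v == 0, "row count must be a multiple of the vector length"
    vals, rows, cols = [], [], []
    for r in range(0, m, v):
        for c in range(k):
            vec = A[r:r + v, c]
            if np.any(vec != 0):        # the whole 4x1 grain is kept if any element is nonzero
                vals.append(vec)
                rows.append(r)
                cols.append(c)
    vals = np.stack(vals) if vals else np.zeros((0, v), dtype=A.dtype)
    return vals, np.array(rows), np.array(cols)

def spmm_column_vector_sparse(vals, rows, cols, B, m):
    """Reference SpMM: compute C = A @ B with A given in the encoding above."""
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for vec, r, c in zip(vals, rows, cols):
        # each 4x1 vector multiplies one row of B and updates 4 rows of C
        C[r:r + len(vec), :] += np.outer(vec, B[c, :])
    return C

# quick correctness check against a dense matmul
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))
A[rng.random(A.shape) < 0.7] = 0.0      # zero out roughly 70% of the elements
B = rng.standard_normal((16, 32))
vals, rows, cols = encode_column_vector_sparse(A)
assert np.allclose(spmm_column_vector_sparse(vals, rows, cols, B, A.shape[0]), A @ B)
```

Compared with a square block encoding, the 4x1 grain constrains fewer weights to be pruned together while still letting each grain be reused across the dense operand, which is presumably the kernel-performance versus model-quality trade-off the abstract refers to.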

Supplementary Material

MP4 file: presentation video (Efficient Tensor Core-Based GPU Kernels for Structured Sparsity Under Reduced Precision).

Published In

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021, 1493 pages
ISBN: 9781450384421
DOI: 10.1145/3458817

In-Cooperation

• IEEE CS

Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. GPGPU
    2. neural networks
    3. sparse matrices
    4. tensor core

Conference

SC '21

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions, 24%

    Cited By

• Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core. Proceedings of the 53rd International Conference on Parallel Processing (2024), 1124-1134. DOI: 10.1145/3673038.3673108
• Bitmap-Based Sparse Matrix-Vector Multiplication with Tensor Cores. Proceedings of the 53rd International Conference on Parallel Processing (2024), 1135-1144. DOI: 10.1145/3673038.3673055
• STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically. Proceedings of the ACM on Management of Data 2, 1 (2024), 1-26. DOI: 10.1145/3639323
• DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (2024), 253-267. DOI: 10.1145/3620666.3651378
• High Performance Unstructured SpMM Computation Using Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (2024), 1-14. DOI: 10.1109/SC41406.2024.00060
• STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning. IEEE Access 12 (2024), 70581-70599. DOI: 10.1109/ACCESS.2024.3402326
• Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Proceedings of the VLDB Endowment 17, 2 (2023), 211-224. DOI: 10.14778/3626292.3626303
• VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2023), 1-14. DOI: 10.1145/3581784.3607087
• DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2023), 1-14. DOI: 10.1145/3581784.3607051
• Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 245-255. DOI: 10.1109/IPDPS54959.2023.00033
