
DOI: 10.1145/3489517.3530588
Research article · Open access

Shfl-BW: accelerating deep neural network inference with tensor-core aware weight pruning

Published: 23 August 2022

Abstract

Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to the model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both difficult to yield from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly-structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss.
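
To make the trade-off concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper) of the two extremes the abstract contrasts: unstructured magnitude pruning scatters the surviving weights and defeats tensor-core tiling, while block-wise pruning keeps whole dense tiles that map onto matrix-shaped MMA instructions but constrains which weights can survive. The function names and the 4x4 block shape are assumptions chosen for illustration.

# Illustrative sketch only: the two sparsity extremes on a toy weight matrix.
import numpy as np

def random_sparsity(w, sparsity):
    # Unstructured pruning: drop the globally smallest-magnitude weights.
    # Accuracy-friendly, but the survivors are scattered, so no dense tiles remain.
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return w * (np.abs(w) >= thresh)

def blockwise_sparsity(w, sparsity, block=(4, 4)):
    # Block-wise pruning: score whole blocks by mean magnitude and drop the weakest.
    # Surviving blocks stay dense (tensor-core friendly), but the structural
    # constraint on which weights must be kept together is what costs accuracy.
    bh, bw = block
    rows, cols = w.shape[0] // bh, w.shape[1] // bw
    scores = np.abs(w).reshape(rows, bh, cols, bw).mean(axis=(1, 3))
    k = int(scores.size * sparsity)
    thresh = np.sort(scores, axis=None)[k]
    mask = (scores >= thresh).repeat(bh, axis=0).repeat(bw, axis=1)
    return w * mask

w = np.random.default_rng(0).standard_normal((16, 16))
print(np.count_nonzero(random_sparsity(w, 0.75)))     # 64 scattered weights
print(np.count_nonzero(blockwise_sparsity(w, 0.75)))  # 64 weights in 4 dense 4x4 tiles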
In this work, we propose a novel sparse pattern, Shuffled Blockwise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure, while introducing negligible overheads with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques can achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer [1] by 1.81×, 4.18× and 1.90× on NVIDIA V100, T4 and A100 GPUs respectively at 75% sparsity.
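
As a rough reading of the pattern described above (an illustrative sketch, not the authors' kernels: the real system searches for a good permutation and folds the shuffle into the GPU kernel, whereas a random permutation stands in for that search here), column shuffling can be layered on top of the block-wise pruner from the previous sketch. At inference time the input only needs to be gathered with the same permutation, so the shuffle costs an index remap rather than extra floating-point work. shfl_bw_prune and its defaults are hypothetical names chosen for illustration.

# Illustrative sketch: shuffled block-wise pruning, reusing blockwise_sparsity
# from the sketch above.
import numpy as np

def shfl_bw_prune(w, sparsity=0.75, block=(4, 4), seed=0):
    # Permute columns, then prune whole blocks of the permuted matrix.
    # The permutation relaxes which weights must land in the same block,
    # while every kept block remains dense for tensor-core MMA tiles.
    perm = np.random.default_rng(seed).permutation(w.shape[1])  # stand-in for a searched permutation
    return blockwise_sparsity(w[:, perm], sparsity, block), perm

w = np.random.default_rng(1).standard_normal((16, 16))
x = np.random.default_rng(2).standard_normal(16)
pruned, perm = shfl_bw_prune(w)
y = pruned @ x[perm]  # equals (pruned weights mapped back to original column order) @ x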

References

[1] Ashish Vaswani et al. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998--6008, 2017.
[2] Dario Amodei et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML, pages 173--182. PMLR, 2016.
[3] Christian Szegedy et al. Going deeper with convolutions. In CVPR, pages 1--9, 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770--778, 2016.
[5] Yonghui Wu et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[6] Tom B. Brown et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[7] OpenAI. AI and compute. https://openai.com/blog/ai-and-compute/.
[8] William Fedus et al. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
[9] Mohammad Shoeybi et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[10] Yann LeCun et al. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598--605, 1990.
[11] Song Han et al. Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626, 2015.
[12] NVIDIA. cuSPARSE. https://docs.nvidia.com/cuda/cusparse/index.html.
[13] Trevor Gale et al. Sparse GPU kernels for deep learning. In SC, pages 1--14. IEEE, 2020.
[14] Scott Gray, Alec Radford, and Diederik P. Kingma. GPU kernels for block-sparse weights. Technical report, OpenAI, 2017.
[15] AMD. Introducing the AMD CDNA architecture: the all-new AMD GPU architecture for the modern era of HPC and AI. 2020.
[16] Intel. Intel architecture instruction set extensions programming reference, 2020.
[17] NVIDIA. NVIDIA A100 tensor core GPU. Data sheet, pages 20--21, 2020.
[18] Angshuman Parashar et al. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News, 45(2):27--40, 2017.
[19] Song Han et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In FPGA, pages 75--84, 2017.
[20] Kartik Hegde et al. ExTensor: An accelerator for sparse tensor algebra. In MICRO, pages 319--333, 2019.
[21] Shijie Cao et al. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In FPGA, pages 63--72, 2019.
[22] Zhuliang Yao et al. Balanced sparsity for efficient DNN inference on GPU. In AAAI, volume 33, pages 5676--5683, 2019.
[23] Cong Guo et al. Accelerating sparse DNN models without hardware-support via tile-wise sparsity. In SC, pages 1--15. IEEE, 2020.
[24] Erich Elsen et al. Fast sparse convnets. In CVPR, pages 14629--14638, 2020.
[25] NVIDIA. cuDNN. https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html.
[26] NVIDIA. cuBLAS. https://docs.nvidia.com/cuda/cublas/index.html.
[27] NVIDIA. cuSPARSELt. https://docs.nvidia.com/cuda/cusparselt/index.html.
[28] Zhaodong Chen et al. Efficient tensor core-based GPU kernels for structured sparsity under reduced precision. In SC, pages 1--14, 2021.
[29] Tianyun Zhang et al. A systematic DNN weight pruning framework using alternating direction method of multipliers. In ECCV, pages 184--199, 2018.
[30] Xiaolong Ma et al. Effective model sparsification by scheduled grow-and-prune methods. arXiv preprint arXiv:2106.09857, 2021.



Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022
1462 pages
ISBN:9781450391429
DOI:10.1145/3489517
This work is licensed under a Creative Commons Attribution 4.0 International License.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 August 2022


Qualifiers

  • Research-article

Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10--14, 2022
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%




Cited By

  • (2024) RCW-Pruner: Row-Column Wise Pruning Framework on Systolic Array. 2024 10th IEEE International Conference on High Performance and Smart Computing (HPSC), pages 126--131. https://doi.org/10.1109/HPSC62738.2024.00030
  • (2024) STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning. IEEE Access, 12:70581--70599. https://doi.org/10.1109/ACCESS.2024.3402326
  • (2024) A comprehensive review of model compression techniques in machine learning. Applied Intelligence, 54(22):11804--11844. https://doi.org/10.1007/s10489-024-05747-w
  • (2023) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Proceedings of the VLDB Endowment, 17(2):211--224. https://doi.org/10.14778/3626292.3626303
  • (2023) Register Tiling for Unstructured Sparsity in Neural Network Inference. Proceedings of the ACM on Programming Languages, 7(PLDI):1995--2020. https://doi.org/10.1145/3591302
  • (2023) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1--14. https://doi.org/10.1145/3581784.3607051
  • (2023) Scaling Factor and Shift Factor Based Neural Network Pruning. 2023 IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), pages 1032--1038. https://doi.org/10.1109/PRAI59366.2023.10332025
  • (2023) Re-compact: Structured Pruning and SpMM Kernel Co-design for Accelerating DNNs on GPUs. 2023 IEEE 41st International Conference on Computer Design (ICCD), pages 399--406. https://doi.org/10.1109/ICCD58817.2023.00066
