
DOI: 10.1145/3489517.3530588
Research article · Open access

Shfl-BW: accelerating deep neural network inference with tensor-core aware weight pruning

Published: 23 August 2022

Abstract

Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to the model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both difficult to yield from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly-structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss.
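
To make the trade-off concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper) of the two extremes the abstract contrasts: unstructured magnitude pruning scatters the surviving weights and defeats tensor-core tiling, while block-wise pruning keeps whole dense tiles that map onto matrix-shaped MMA instructions but constrains which weights can survive. The function names and the 4x4 block shape are assumptions chosen for illustration.

# Illustrative sketch only: the two sparsity extremes on a toy weight matrix.
import numpy as np

def random_sparsity(w, sparsity):
    # Unstructured pruning: drop the globally smallest-magnitude weights.
    # Accuracy-friendly, but the survivors are scattered, so no dense tiles remain.
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return w * (np.abs(w) >= thresh)

def blockwise_sparsity(w, sparsity, block=(4, 4)):
    # Block-wise pruning: score whole blocks by mean magnitude and drop the weakest.
    # Surviving blocks stay dense (tensor-core friendly), but the structural
    # constraint on which weights must be kept together is what costs accuracy.
    bh, bw = block
    rows, cols = w.shape[0] // bh, w.shape[1] // bw
    scores = np.abs(w).reshape(rows, bh, cols, bw).mean(axis=(1, 3))
    k = int(scores.size * sparsity)
    thresh = np.sort(scores, axis=None)[k]
    mask = (scores >= thresh).repeat(bh, axis=0).repeat(bw, axis=1)
    return w * mask

w = np.random.default_rng(0).standard_normal((16, 16))
print(np.count_nonzero(random_sparsity(w, 0.75)))     # 64 scattered weights
print(np.count_nonzero(blockwise_sparsity(w, 0.75)))  # 64 weights in 4 dense 4x4 tiles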
In this work, we propose a novel sparse pattern, Shuffled Blockwise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure, while introducing negligible overheads with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques can achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer [1] by 1.81×, 4.18× and 1.90× on NVIDIA V100, T4 and A100 GPUs respectively at 75% sparsity.
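
As a rough reading of the pattern described above (an illustrative sketch, not the authors' kernels: the real system searches for a good permutation and folds the shuffle into the GPU kernel, whereas a random permutation stands in for that search here), column shuffling can be layered on top of the block-wise pruner from the previous sketch. At inference time the input only needs to be gathered with the same permutation, so the shuffle costs an index remap rather than extra floating-point work. shfl_bw_prune and its defaults are hypothetical names chosen for illustration.

# Illustrative sketch: shuffled block-wise pruning, reusing blockwise_sparsity
# from the sketch above.
import numpy as np

def shfl_bw_prune(w, sparsity=0.75, block=(4, 4), seed=0):
    # Permute columns, then prune whole blocks of the permuted matrix.
    # The permutation relaxes which weights must land in the same block,
    # while every kept block remains dense for tensor-core MMA tiles.
    perm = np.random.default_rng(seed).permutation(w.shape[1])  # stand-in for a searched permutation
    return blockwise_sparsity(w[:, perm], sparsity, block), perm

w = np.random.default_rng(1).standard_normal((16, 16))
x = np.random.default_rng(2).standard_normal(16)
pruned, perm = shfl_bw_prune(w)
y = pruned @ x[perm]  # equals (pruned weights mapped back to original column order) @ x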

References

[1] Ashish Vaswani et al. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998--6008, 2017.
[2] Dario Amodei et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML, pages 173--182. PMLR, 2016.
[3] Christian Szegedy et al. Going deeper with convolutions. In CVPR, pages 1--9, 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770--778, 2016.
[5] Yonghui Wu et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[6] Tom B. Brown et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[7] OpenAI. AI and compute. https://openai.com/blog/ai-and-compute/.
[8] William Fedus et al. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
[9] Mohammad Shoeybi et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[10] Yann LeCun et al. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598--605, 1990.
[11] Song Han et al. Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626, 2015.
[12] NVIDIA. cuSPARSE. https://docs.nvidia.com/cuda/cusparse/index.html.
[13] Trevor Gale et al. Sparse GPU kernels for deep learning. In SC, pages 1--14. IEEE, 2020.
[14] Scott Gray, Alec Radford, and Diederik P. Kingma. GPU kernels for block-sparse weights. Technical report, OpenAI, 2017.
[15] AMD. Introducing the AMD CDNA architecture: the all-new AMD GPU architecture for the modern era of HPC and AI. 2020.
[16] Intel. Intel architecture instruction set extensions programming reference, 2020.
[17] NVIDIA. NVIDIA A100 tensor core GPU. Data sheet, pages 20--21, 2020.
[18] Angshuman Parashar et al. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News, 45(2):27--40, 2017.
[19] Song Han et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In FPGA, pages 75--84, 2017.
[20] Kartik Hegde et al. ExTensor: An accelerator for sparse tensor algebra. In MICRO, pages 319--333, 2019.
[21] Shijie Cao et al. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In FPGA, pages 63--72, 2019.
[22] Zhuliang Yao et al. Balanced sparsity for efficient DNN inference on GPU. In AAAI, volume 33, pages 5676--5683, 2019.
[23] Cong Guo et al. Accelerating sparse DNN models without hardware-support via tile-wise sparsity. In SC, pages 1--15. IEEE, 2020.
[24] Erich Elsen et al. Fast sparse convnets. In CVPR, pages 14629--14638, 2020.
[25] NVIDIA. cuDNN. https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html.
[26] NVIDIA. cuBLAS. https://docs.nvidia.com/cuda/cublas/index.html.
[27] NVIDIA. cuSPARSELt. https://docs.nvidia.com/cuda/cusparselt/index.html.
[28] Zhaodong Chen et al. Efficient tensor core-based GPU kernels for structured sparsity under reduced precision. In SC, pages 1--14, 2021.
[29] Tianyun Zhang et al. A systematic DNN weight pruning framework using alternating direction method of multipliers. In ECCV, pages 184--199, 2018.
[30] Xiaolong Ma et al. Effective model sparsification by scheduled grow-and-prune methods. arXiv preprint arXiv:2106.09857, 2021.



Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022
1462 pages
ISBN:9781450391429
DOI:10.1145/3489517
This work is licensed under a Creative Commons Attribution 4.0 International License.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 August 2022


Qualifiers

  • Research-article

Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10--14, 2022
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%




Cited By

  • (2024) RCW-Pruner: Row-Column Wise Pruning Framework on Systolic Array. 2024 10th IEEE International Conference on High Performance and Smart Computing (HPSC), pages 126--131. https://doi.org/10.1109/HPSC62738.2024.00030
  • (2024) STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning. IEEE Access, 12:70581--70599. https://doi.org/10.1109/ACCESS.2024.3402326
  • (2024) A comprehensive review of model compression techniques in machine learning. Applied Intelligence, 54(22):11804--11844. https://doi.org/10.1007/s10489-024-05747-w
  • (2023) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Proceedings of the VLDB Endowment, 17(2):211--224. https://doi.org/10.14778/3626292.3626303
  • (2023) Register Tiling for Unstructured Sparsity in Neural Network Inference. Proceedings of the ACM on Programming Languages, 7(PLDI):1995--2020. https://doi.org/10.1145/3591302
  • (2023) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1--14. https://doi.org/10.1145/3581784.3607051
  • (2023) Scaling Factor and Shift Factor Based Neural Network Pruning. 2023 IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), pages 1032--1038. https://doi.org/10.1109/PRAI59366.2023.10332025
  • (2023) Re-compact: Structured Pruning and SpMM Kernel Co-design for Accelerating DNNs on GPUs. 2023 IEEE 41st International Conference on Computer Design (ICCD), pages 399--406. https://doi.org/10.1109/ICCD58817.2023.00066
