DOI: 10.1145/3547276.3548521

A Software/Hardware Co-design Local Irregular Sparsity Method for Accelerating CNNs on FPGA

Published: 13 January 2023

Abstract

Convolutional neural networks (CNNs) have been widely used in many areas. The success of CNNs comes with a huge number of parameters and computations, and CNNs nowadays keep moving toward larger structures. Although larger structures often bring better inference accuracy, the increasing size also slows down inference. Recently, various parameter sparsity methods have been proposed to accelerate CNNs by reducing the number of parameters and computations. Existing sparsity methods can be classified into two categories: unstructured and structured. Unstructured sparsity methods easily cause irregularity and thus achieve suboptimal speedup. Structured sparsity methods, on the other hand, keep regularity by pruning parameters following a certain pattern, but result in low sparsity. In this paper, we propose a software/hardware co-design approach that brings local irregular sparsity into CNNs. Benefiting from the local irregularity, we design a row-wise computing engine, the RConv Engine, to achieve workload balance and remarkable speedup. Experimental results show that our software/hardware co-design method achieves a 10.9× speedup over state-of-the-art methods with negligible accuracy loss.
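The abstract does not spell out how "local irregular sparsity" is defined, so the Python/NumPy sketch below is only one plausible illustration, not the paper's actual algorithm: the local granularity (here a single row of the flattened weight matrix), the keep ratio, and the helper name row_balanced_prune are all assumptions for the example. The idea it illustrates is that pruning is irregular inside each local region while every region keeps the same number of nonzero weights, which is the kind of balanced per-row workload a row-wise engine such as the RConv Engine could exploit.

```python
import numpy as np

def row_balanced_prune(weights, keep_ratio=0.25):
    """Illustrative only: keep the same number of largest-magnitude weights
    in every row of a 2-D (out_channels, in_channels*k*k) weight matrix.
    Inside a row the surviving positions are irregular; across rows the
    nonzero count (and thus the per-row workload) is identical."""
    out_ch, in_elems = weights.shape
    keep = max(1, int(round(in_elems * keep_ratio)))   # assumed keep ratio
    mask = np.zeros_like(weights, dtype=bool)
    for r in range(out_ch):
        top = np.argsort(np.abs(weights[r]))[-keep:]   # largest |w| in this row
        mask[r, top] = True
    return weights * mask, mask

# Example: weights of a conv layer flattened row-wise (one row per output channel)
w = np.random.randn(64, 32 * 3 * 3).astype(np.float32)
w_sparse, m = row_balanced_prune(w, keep_ratio=0.25)
assert np.all(m.sum(axis=1) == m.sum(axis=1)[0])       # equal nonzeros per row
```

With equal nonzero counts per row, each row-processing unit would finish in the same number of cycles, avoiding the load imbalance that unstructured sparsity normally causes; whether the RConv Engine uses exactly this granularity is an assumption of this sketch.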



Index Terms

  1. A Software/Hardware Co-design Local Irregular Sparsity Method for Accelerating CNNs on FPGA

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    ICPP Workshops '22: Workshop Proceedings of the 51st International Conference on Parallel Processing
    August 2022
    233 pages
    ISBN: 9781450394451
    DOI: 10.1145/3547276
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 January 2023


    Author Tags

    1. convolutional neural network
    2. field-programmable gate arrays (FPGAs)
    3. hardware accelerator
    4. software/hardware co-design
    5. sparsity method

    Qualifiers

    • Research-article
    • Research
    • Refereed limited


    Conference

    ICPP '22: 51st International Conference on Parallel Processing
    August 29 - September 1, 2022
    Bordeaux, France

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 26
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 14 Dec 2024

    Cited By

    • (2024) An algorithm/hardware co-optimized method to accelerate CNNs with compressed convolutional weights on FPGA. Concurrency and Computation: Practice and Experience, 36(11). DOI: 10.1002/cpe.8011. Online publication date: 6 January 2024.
