Research article • Open access

Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing

Published: 04 September 2018

Abstract

Convolutional neural networks (CNNs) are among the most successful machine-learning techniques for image, voice, and video processing, but they require large amounts of processing capacity and memory bandwidth. Proposed hardware accelerators for CNNs typically contain large numbers of multiply-accumulate (MAC) units, whose multipliers dominate integrated circuit (IC) gate count and power consumption. "Weight-sharing" accelerators have been proposed in which the full range of weight values in a trained CNN is compressed into a small set of bins, and the bin index is used to access the shared weight value. We reduce the power and area of the CNN accelerator by implementing a parallel accumulate shared MAC (PASM) in a weight-shared CNN. PASM re-architects the MAC so that its first phase only counts and accumulates the inputs belonging to each weight bin; the final value is then computed in a subsequent multiply phase that needs just one multiplication per bin, significantly reducing gate count and power consumption of the CNN. In this article, we implement PASM in a weight-shared CNN convolution hardware accelerator and analyze its effectiveness. Experiments show that, for a 1GHz clock on a 45nm ASIC process, our approach results in fewer gates, smaller logic, and reduced power, with only a slight increase in latency. We also show that the same weight-shared-with-PASM CNN accelerator can be implemented in resource-constrained FPGAs, where the FPGA has a limited number of digital signal processor (DSP) units to accelerate the MAC operations.
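To make the accumulate-then-multiply idea concrete, the following is a minimal C sketch of a PASM-style dot product. It illustrates the technique the abstract describes, not the authors' RTL; the bin count NUM_BINS, the operand widths, and the name pasm_dot are assumptions chosen for the example.

```c
/* Minimal sketch of a PASM-style weight-shared dot product.
 * Assumed, for illustration only: 16 weight bins, 16-bit inputs,
 * 32-bit accumulators. Not the paper's hardware implementation. */
#include <stdint.h>

#define NUM_BINS 16  /* number of shared-weight bins (assumed) */

int32_t pasm_dot(const uint8_t *bin_idx,   /* per-input weight-bin index */
                 const int16_t *x,         /* input activations */
                 const int16_t *shared_w,  /* one shared weight per bin */
                 int n)
{
    int32_t bin_acc[NUM_BINS] = {0};

    /* Accumulate phase: no multiplies, only binned additions.
     * Each input is added to the accumulator of its weight bin. */
    for (int i = 0; i < n; i++)
        bin_acc[bin_idx[i]] += x[i];

    /* Multiply phase: one multiply per bin instead of one per input. */
    int32_t sum = 0;
    for (int b = 0; b < NUM_BINS; b++)
        sum += bin_acc[b] * (int32_t)shared_w[b];

    return sum;
}
```

A conventional MAC performs n multiplies for an n-element dot product; the sketch above performs only NUM_BINS multiplies, trading the rest for indexed additions, which is the source of the gate-count and power savings described in the abstract.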





Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3
September 2018, 322 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3274266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2018
Accepted: 01 June 2018
Revised: 01 May 2018
Received: 01 January 2018
Published in TACO Volume 15, Issue 3


Author Tags

  1. ASIC
  2. CNN
  3. FPGA
  4. arithmetic hardware circuits
  5. multiply accumulate
  6. power efficiency

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Science Foundation Ireland


Cited By

  • (2024) Flexible Quantization for Efficient Convolutional Neural Networks. Electronics 13:10, article 1923. DOI: 10.3390/electronics13101923. Online publication date: 14-May-2024.
  • (2024) An ASIC Accelerator for QNN With Variable Precision and Tunable Energy Efficiency. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:7, 2057-2070. DOI: 10.1109/TCAD.2024.3357597. Online publication date: 24-Jan-2024.
  • (2024) Modern Trends in Improving the Technical Characteristics of Devices and Systems for Digital Image Processing. IEEE Access 12, 44659-44681. DOI: 10.1109/ACCESS.2024.3381493. Online publication date: 2024.
  • (2024) Estimation of aquatic ecosystem health using deep neural network with nonlinear data mapping. Ecological Informatics 81, article 102588. DOI: 10.1016/j.ecoinf.2024.102588. Online publication date: Jul-2024.
  • (2024) A Precision-Aware Neuron Engine for DNN Accelerators. SN Computer Science 5:5. DOI: 10.1007/s42979-024-02851-z. Online publication date: 25-Apr-2024.
  • (2024) Efficient Processing Element Architecture Using Hybrid Approximate Multipliers and Parallel Prefix Adders for CNN Accelerators. Smart Computing Paradigms: Artificial Intelligence and Network Applications, 489-499. DOI: 10.1007/978-981-97-7880-5_42. Online publication date: 22-Nov-2024.
  • (2023) Identification of Individual Hanwoo Cattle by Muzzle Pattern Images through Deep Learning. Animals 13:18, article 2856. DOI: 10.3390/ani13182856. Online publication date: 8-Sep-2023.
  • (2023) Design of The Ultra-Low-Power Driven VMM Configurations for μW Scale IoT Devices. 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 65-72. DOI: 10.1109/MCSoC60832.2023.00018. Online publication date: 18-Dec-2023.
  • (2023) FPGA-Based Efficient MLP Neural Network for Digit Recognition. 2023 International Conference on Integration of Computational Intelligent System (ICICIS), 1-7. DOI: 10.1109/ICICIS56802.2023.10430242. Online publication date: 1-Nov-2023.
  • (2023) Constant Coefficient Multipliers Using Self-Similarity-Based Hybrid Binary-Unary Computing. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1-7. DOI: 10.1109/ICCAD57390.2023.10323844. Online publication date: 28-Oct-2023.
