

DWMAcc: Accelerating Shift-based CNNs with Domain Wall Memories

Published: 08 October 2019

Abstract

PIM (processing-in-memory) based hardware accelerators have shown great potential in addressing the computation and memory-access intensity of modern CNNs (convolutional neural networks). While adopting NVM (non-volatile memory) helps to further mitigate storage and energy consumption overheads, and adopting quantization, e.g., shift-based quantization, helps to trade off computation overhead against accuracy loss, naively integrating both NVM and quantization in hardware accelerators leads to sub-optimal acceleration.
In this paper, we exploit the natural shift property of DWM (domain wall memory) to devise DWMAcc, a DWM-based accelerator with asymmetrical storage of weight and input data, to speed up the inference phase of shift-based CNNs. DWMAcc supports flexible shift operations to enable fast processing with low performance and area overhead. We then optimize it with zero-sharing, input-reuse, and weight-share schemes. Our experimental results show that, on average, DWMAcc achieves 16.6× performance improvement and 85.6× energy consumption reduction over a state-of-the-art SRAM based design.
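
To make the shift-based quantization that DWMAcc targets concrete, below is a minimal sketch (not from the paper; the function names, the signed power-of-two weight format, and the exponent range are illustrative assumptions) of how a weight rounded to a signed power of two turns each multiply-accumulate into a bit shift plus an add, the kind of operation a DWM nanowire can realize with its native shift capability.

    # Minimal illustration of shift-based quantization (assumptions: weights are
    # rounded to sign * 2**-e and activations are integers; not the paper's design).
    import math

    def quantize_to_power_of_two(w, max_exp=7):
        """Round a real-valued weight to (sign, e) so that w ~= sign * 2**-e."""
        if w == 0.0:
            return 0, 0                       # sign 0 marks a zero weight
        sign = 1 if w > 0 else -1
        e = round(-math.log2(abs(w)))         # nearest power-of-two exponent
        return sign, min(max(e, 0), max_exp)  # clamp to the illustrative range

    def shift_mac(acc, activation, sign, exp):
        """Multiply-free MAC: activation * (sign * 2**-exp) becomes a shift."""
        if sign == 0:
            return acc                        # zero weights contribute nothing
        shifted = activation >> exp
        return acc + shifted if sign > 0 else acc - shifted

    # Tiny 3-tap dot product with shift-based weights:
    acc = 0
    for w, a in zip([0.5, -0.25, 1.0], [120, 64, 8]):
        s, e = quantize_to_power_of_two(w)
        acc = shift_mac(acc, a, s, e)
    print(acc)  # 120>>1 - 64>>2 + 8>>0 = 60 - 16 + 8 = 52

Because every weight reduces to a sign and a shift amount, such networks need no full multiplier array; DWMAcc exploits DWM's natural shift property to carry out these shift operations in memory.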





Information

Published In

ACM Transactions on Embedded Computing Systems, Volume 18, Issue 5s
Special Issue ESWEEK 2019, CASES 2019, CODES+ISSS 2019 and EMSOFT 2019
October 2019
1423 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3365919
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Published: 08 October 2019
Accepted: 01 July 2019
Revised: 01 June 2019
Received: 01 April 2019
Published in TECS Volume 18, Issue 5s


Author Tags

  1. CNN accelerators
  2. domain wall memory

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2022) Autoregressive Linguistic Steganography Based on BERT and Consistency Coding. Security and Communication Networks, vol. 2022. https://doi.org/10.1155/2022/9092785. Online publication date: 1-Jan-2022.
  • (2022) Fast-track cache. Proceedings of the 36th ACM International Conference on Supercomputing, 1-12. https://doi.org/10.1145/3524059.3532383. Online publication date: 28-Jun-2022.
  • (2022) DAG Scheduling and Analysis on Multi-Core Systems by Modelling Parallelism and Dependency. IEEE Transactions on Parallel and Distributed Systems, 33:12, 4019-4038. https://doi.org/10.1109/TPDS.2022.3177046. Online publication date: 23-May-2022.
  • (2022) An Automatic-Addressing Architecture With Fully Serialized Access in Racetrack Memory for Energy-Efficient CNNs. IEEE Transactions on Computers, 71:1, 235-250. https://doi.org/10.1109/TC.2020.3045433. Online publication date: 1-Jan-2022.
  • (2022) Hardware-software co-exploration with racetrack memory based in-memory computing for CNN inference in embedded systems. Journal of Systems Architecture, 128:C. https://doi.org/10.1016/j.sysarc.2022.102507. Online publication date: 15-Jun-2022.
