

DWMAcc: Accelerating Shift-based CNNs with Domain Wall Memories

Published: 08 October 2019

Abstract

PIM (processing-in-memory) based hardware accelerators have shown great potential in addressing the computation and memory-access intensity of modern CNNs (convolutional neural networks). While adopting NVM (non-volatile memory) helps to further mitigate storage and energy consumption overheads, and adopting quantization, e.g., shift-based quantization, helps to trade off computation overhead against accuracy loss, naively integrating both NVM and quantization in hardware accelerators leads to sub-optimal acceleration.
In this paper, we exploit the natural shift property of DWM (domain wall memory) to devise DWMAcc, a DWM-based accelerator with asymmetrical storage of weight and input data, to speed up the inference phase of shift-based CNNs. DWMAcc supports flexible shift operations to enable fast processing with low performance and area overhead. We then optimize it with zero-sharing, input-reuse, and weight-share schemes. Our experimental results show that, on average, DWMAcc achieves 16.6× performance improvement and 85.6× energy consumption reduction over a state-of-the-art SRAM based design.
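
To make the shift-based quantization that DWMAcc targets concrete, below is a minimal sketch (not from the paper; the function names, the signed power-of-two weight format, and the exponent range are illustrative assumptions) of how a weight rounded to a signed power of two turns each multiply-accumulate into a bit shift plus an add, the kind of operation a DWM nanowire can realize with its native shift capability.

    # Minimal illustration of shift-based quantization (assumptions: weights are
    # rounded to sign * 2**-e and activations are integers; not the paper's design).
    import math

    def quantize_to_power_of_two(w, max_exp=7):
        """Round a real-valued weight to (sign, e) so that w ~= sign * 2**-e."""
        if w == 0.0:
            return 0, 0                       # sign 0 marks a zero weight
        sign = 1 if w > 0 else -1
        e = round(-math.log2(abs(w)))         # nearest power-of-two exponent
        return sign, min(max(e, 0), max_exp)  # clamp to the illustrative range

    def shift_mac(acc, activation, sign, exp):
        """Multiply-free MAC: activation * (sign * 2**-exp) becomes a shift."""
        if sign == 0:
            return acc                        # zero weights contribute nothing
        shifted = activation >> exp
        return acc + shifted if sign > 0 else acc - shifted

    # Tiny 3-tap dot product with shift-based weights:
    acc = 0
    for w, a in zip([0.5, -0.25, 1.0], [120, 64, 8]):
        s, e = quantize_to_power_of_two(w)
        acc = shift_mac(acc, a, s, e)
    print(acc)  # 120>>1 - 64>>2 + 8>>0 = 60 - 16 + 8 = 52

Because every weight reduces to a sign and a shift amount, such networks need no full multiplier array; DWMAcc exploits DWM's natural shift property to carry out these shift operations in memory.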





Information

Published In

ACM Transactions on Embedded Computing Systems, Volume 18, Issue 5s
Special Issue ESWEEK 2019, CASES 2019, CODES+ISSS 2019 and EMSOFT 2019
October 2019
1423 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3365919
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Published: 08 October 2019
Accepted: 01 July 2019
Revised: 01 June 2019
Received: 01 April 2019
Published in TECS Volume 18, Issue 5s


Author Tags

  1. CNN accelerators
  2. domain wall memory

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2022) Autoregressive Linguistic Steganography Based on BERT and Consistency Coding. Security and Communication Networks, vol. 2022. https://doi.org/10.1155/2022/9092785. Online publication date: 1-Jan-2022.
  • (2022) Fast-track cache. Proceedings of the 36th ACM International Conference on Supercomputing, 1-12. https://doi.org/10.1145/3524059.3532383. Online publication date: 28-Jun-2022.
  • (2022) DAG Scheduling and Analysis on Multi-Core Systems by Modelling Parallelism and Dependency. IEEE Transactions on Parallel and Distributed Systems, 33:12, 4019-4038. https://doi.org/10.1109/TPDS.2022.3177046. Online publication date: 23-May-2022.
  • (2022) An Automatic-Addressing Architecture With Fully Serialized Access in Racetrack Memory for Energy-Efficient CNNs. IEEE Transactions on Computers, 71:1, 235-250. https://doi.org/10.1109/TC.2020.3045433. Online publication date: 1-Jan-2022.
  • (2022) Hardware-software co-exploration with racetrack memory based in-memory computing for CNN inference in embedded systems. Journal of Systems Architecture, 128:C. https://doi.org/10.1016/j.sysarc.2022.102507. Online publication date: 15-Jun-2022.
