DOI: 10.1145/3352460.3358316

Research Article | Public Access

Wire-Aware Architecture and Dataflow for CNN Accelerators

Published: 12 October 2019

Abstract

In spite of several recent advancements, data movement in modern CNN accelerators remains a significant bottleneck. Architectures like Eyeriss implement large scratchpads within individual processing elements, while architectures like TPU v1 implement large systolic arrays and large monolithic caches. Several data movements in these prior works therefore traverse long wires and account for much of the energy consumption. In this work, we design a new wire-aware CNN accelerator, WAX, that employs a deep and distributed memory hierarchy, thus enabling data movement over short wires in the common case. An array of computational units, each with a small set of registers, is placed adjacent to a subarray of a large cache to form a single tile. Shift operations among these registers allow for high reuse with little wire-traversal overhead. This approach optimizes the common case, where register fetches and accesses to a few-kilobyte buffer can be performed at very low cost. Operations beyond the tile require traversal over the cache's H-tree interconnect, but represent the uncommon case. For high reuse of operands, we introduce a family of new data mappings and dataflows. The best dataflow, WAXFlow-3, achieves a 2× improvement in performance and a 2.6-4.4× reduction in energy, relative to Eyeriss. As more WAX tiles are added, performance scales well up to 128 tiles.
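
To make the shift-based reuse concrete, here is a minimal sketch of a single tile's PE row computing a 1-D convolution. This is an illustrative assumption, not the paper's WAXFlow mapping or hardware: the names (NUM_PES, conv1d_with_shifts), the tap arithmetic, and the access accounting are hypothetical. It demonstrates the abstract's common case: only PE 0 reads the subarray buffer each cycle, and every other PE receives its operand from its neighbor's register over a short wire.

    # Minimal sketch (Python), assuming a hypothetical tile in which a row of
    # PEs shares one cache subarray. Activations enter at PE 0 and shift to
    # the neighboring PE's register each cycle, so each activation is read
    # from the buffer once instead of once per PE. Not the paper's WAXFlow.

    NUM_PES = 4   # PEs sharing one cache subarray in a tile (hypothetical)
    K = 3         # 1-D convolution kernel width

    def conv1d_with_shifts(activations, weights):
        """PE p accumulates output p of a 1-D convolution:
        out[p] = sum(activations[p + k] * weights[k] for k in range(K))."""
        outputs = [0.0] * NUM_PES
        regs = [None] * NUM_PES        # one activation register per PE
        buffer_reads = 0
        for t in range(2 * NUM_PES + K - 2):
            # Shift right: each PE hands its register to its neighbor
            # (short-wire traversal, the common case).
            for pe in range(NUM_PES - 1, 0, -1):
                regs[pe] = regs[pe - 1]
            # Only PE 0 touches the subarray buffer (the costlier access).
            if t < len(activations):
                regs[0] = activations[t]
                buffer_reads += 1
            else:
                regs[0] = None         # pipeline drain
            # Each PE multiplies its register by the kernel tap it needs now.
            for pe in range(NUM_PES):
                if regs[pe] is None:
                    continue
                i = t - pe             # activation index held by PE pe
                tap = i - pe           # kernel tap relative to output index
                if 0 <= tap < K:
                    outputs[pe] += regs[pe] * weights[tap]
        return outputs, buffer_reads

    acts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # NUM_PES + K - 1 inputs
    outs, reads = conv1d_with_shifts(acts, [0.5, 1.0, 0.5])
    print(outs)    # [4.0, 6.0, 8.0, 10.0]
    print(reads)   # 6, versus NUM_PES * K = 12 without shifting

With 4 PEs and a width-3 kernel, the sketch performs 6 buffer reads instead of the 12 a no-reuse mapping would need; the remaining operand deliveries ride on register-to-register shifts, mirroring the abstract's claim that the more expensive accesses are pushed into the uncommon case.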




    Published In

    cover image ACM Conferences
    MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
    October 2019
    1104 pages
    ISBN:9781450369381
    DOI:10.1145/3352460
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. CNN
    2. DNN
    3. accelerator
    4. near memory
    5. neural networks

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MICRO '52

    Acceptance Rates

    Overall Acceptance Rate: 484 of 2,242 submissions, 22%


    Article Metrics

    • Downloads (last 12 months): 216
    • Downloads (last 6 weeks): 21
    Reflects downloads up to 02 Oct 2024

    Cited By

    • ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors. ACM Transactions on Architecture and Code Optimization 21(3), 1-24, March 2024. DOI: 10.1145/3653363
    • PATH: Evaluation of Boolean Logic Using Path-Based In-Memory Computing Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(5), 1387-1400, May 2024. DOI: 10.1109/TCAD.2023.3344523
    • Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 619-630, May 2024. DOI: 10.1109/IPDPS57955.2024.00061
    • SPARK: Scalable and Precision-Aware Acceleration of Neural Networks via Efficient Encoding. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1029-1042, March 2024. DOI: 10.1109/HPCA57654.2024.00082
    • Photonic neural networks and optics-informed deep learning fundamentals. APL Photonics 9(1), January 2024. DOI: 10.1063/5.0169810
    • A Survey of Memory-Centric Energy Efficient Computer Architecture. IEEE Transactions on Parallel and Distributed Systems 34(10), 2657-2670, October 2023. DOI: 10.1109/TPDS.2023.3297595
    • A 24.3 μJ/Image SNN Accelerator for DVS-Gesture With WS-LOS Dataflow and Sparse Methods. IEEE Transactions on Circuits and Systems II: Express Briefs 70(11), 4226-4230, November 2023. DOI: 10.1109/TCSII.2023.3282589
    • SONA: An Accelerator for Transform-Domain Neural Networks with Sparse-Orthogonal Weights. 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 18-26, July 2023. DOI: 10.1109/ASAP57973.2023.00015
    • Near-optimal multi-accelerator architectures for predictive maintenance at the edge. Future Generation Computer Systems 140, 331-343, February 2023. DOI: 10.1016/j.future.2022.10.030
    • Power Awareness in Low Precision Neural Networks. Computer Vision – ECCV 2022 Workshops, 67-83, February 2023. DOI: 10.1007/978-3-031-25082-8_5
