DOI: 10.1145/3352460.3358316

Research Article | Public Access

Wire-Aware Architecture and Dataflow for CNN Accelerators

Published: 12 October 2019

Abstract

In spite of several recent advancements, data movement in modern CNN accelerators remains a significant bottleneck. Architectures like Eyeriss implement large scratchpads within individual processing elements, while architectures like TPU v1 implement large systolic arrays and large monolithic caches. Several data movements in these prior works therefore traverse long wires and account for much of the energy consumption. In this work, we design a new wire-aware CNN accelerator, WAX, that employs a deep and distributed memory hierarchy, thus enabling data movement over short wires in the common case. An array of computational units, each with a small set of registers, is placed adjacent to a subarray of a large cache to form a single tile. Shift operations among these registers allow for high reuse with little wire-traversal overhead. This approach optimizes the common case, where register fetches and accesses to a few-kilobyte buffer can be performed at very low cost. Operations beyond the tile require traversal over the cache's H-tree interconnect, but represent the uncommon case. For high reuse of operands, we introduce a family of new data mappings and dataflows. The best dataflow, WAXFlow-3, achieves a 2× improvement in performance and a 2.6-4.4× reduction in energy, relative to Eyeriss. As more WAX tiles are added, performance scales well up to 128 tiles.
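
To make the shift-based reuse concrete, here is a minimal sketch of a single tile's PE row computing a 1-D convolution. This is an illustrative assumption, not the paper's WAXFlow mapping or hardware: the names (NUM_PES, conv1d_with_shifts), the tap arithmetic, and the access accounting are hypothetical. It demonstrates the abstract's common case: only PE 0 reads the subarray buffer each cycle, and every other PE receives its operand from its neighbor's register over a short wire.

    # Minimal sketch (Python), assuming a hypothetical tile in which a row of
    # PEs shares one cache subarray. Activations enter at PE 0 and shift to
    # the neighboring PE's register each cycle, so each activation is read
    # from the buffer once instead of once per PE. Not the paper's WAXFlow.

    NUM_PES = 4   # PEs sharing one cache subarray in a tile (hypothetical)
    K = 3         # 1-D convolution kernel width

    def conv1d_with_shifts(activations, weights):
        """PE p accumulates output p of a 1-D convolution:
        out[p] = sum(activations[p + k] * weights[k] for k in range(K))."""
        outputs = [0.0] * NUM_PES
        regs = [None] * NUM_PES        # one activation register per PE
        buffer_reads = 0
        for t in range(2 * NUM_PES + K - 2):
            # Shift right: each PE hands its register to its neighbor
            # (short-wire traversal, the common case).
            for pe in range(NUM_PES - 1, 0, -1):
                regs[pe] = regs[pe - 1]
            # Only PE 0 touches the subarray buffer (the costlier access).
            if t < len(activations):
                regs[0] = activations[t]
                buffer_reads += 1
            else:
                regs[0] = None         # pipeline drain
            # Each PE multiplies its register by the kernel tap it needs now.
            for pe in range(NUM_PES):
                if regs[pe] is None:
                    continue
                i = t - pe             # activation index held by PE pe
                tap = i - pe           # kernel tap relative to output index
                if 0 <= tap < K:
                    outputs[pe] += regs[pe] * weights[tap]
        return outputs, buffer_reads

    acts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # NUM_PES + K - 1 inputs
    outs, reads = conv1d_with_shifts(acts, [0.5, 1.0, 0.5])
    print(outs)    # [4.0, 6.0, 8.0, 10.0]
    print(reads)   # 6, versus NUM_PES * K = 12 without shifting

With 4 PEs and a width-3 kernel, the sketch performs 6 buffer reads instead of the 12 a no-reuse mapping would need; the remaining operand deliveries ride on register-to-register shifts, mirroring the abstract's claim that the more expensive accesses are pushed into the uncommon case.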




    Published In

    cover image ACM Conferences
    MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
    October 2019
    1104 pages
    ISBN:9781450369381
    DOI:10.1145/3352460
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. CNN
    2. DNN
    3. accelerator
    4. near memory
    5. neural networks

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MICRO '52

    Acceptance Rates

    Overall Acceptance Rate: 484 of 2,242 submissions, 22%


    Article Metrics

    • Downloads (last 12 months): 216
    • Downloads (last 6 weeks): 21
    Reflects downloads up to 02 Oct 2024

    Cited By

    • ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors. ACM Transactions on Architecture and Code Optimization 21(3), 1-24, March 2024. DOI: 10.1145/3653363
    • PATH: Evaluation of Boolean Logic Using Path-Based In-Memory Computing Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(5), 1387-1400, May 2024. DOI: 10.1109/TCAD.2023.3344523
    • Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 619-630, May 2024. DOI: 10.1109/IPDPS57955.2024.00061
    • SPARK: Scalable and Precision-Aware Acceleration of Neural Networks via Efficient Encoding. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1029-1042, March 2024. DOI: 10.1109/HPCA57654.2024.00082
    • Photonic neural networks and optics-informed deep learning fundamentals. APL Photonics 9(1), January 2024. DOI: 10.1063/5.0169810
    • A Survey of Memory-Centric Energy Efficient Computer Architecture. IEEE Transactions on Parallel and Distributed Systems 34(10), 2657-2670, October 2023. DOI: 10.1109/TPDS.2023.3297595
    • A 24.3 μJ/Image SNN Accelerator for DVS-Gesture With WS-LOS Dataflow and Sparse Methods. IEEE Transactions on Circuits and Systems II: Express Briefs 70(11), 4226-4230, November 2023. DOI: 10.1109/TCSII.2023.3282589
    • SONA: An Accelerator for Transform-Domain Neural Networks with Sparse-Orthogonal Weights. 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 18-26, July 2023. DOI: 10.1109/ASAP57973.2023.00015
    • Near-optimal multi-accelerator architectures for predictive maintenance at the edge. Future Generation Computer Systems 140, 331-343, February 2023. DOI: 10.1016/j.future.2022.10.030
    • Power Awareness in Low Precision Neural Networks. Computer Vision – ECCV 2022 Workshops, 67-83, February 2023. DOI: 10.1007/978-3-031-25082-8_5
