
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

Published: 04 April 2017

Abstract

The high accuracy of deep neural networks (NNs) has led to the development of NN accelerators that improve performance by two orders of magnitude. However, scaling these accelerators for higher performance with increasingly larger NNs exacerbates the cost and energy overheads of their memory systems, including the on-chip SRAM buffers and the off-chip DRAM channels.
This paper presents the hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory. First, we show that the high throughput and low energy characteristics of 3D memory allow us to rebalance the NN accelerator design, using more area for processing elements and less area for SRAM buffers. Second, we move portions of the NN computations close to the DRAM banks to decrease bandwidth pressure and increase performance and energy efficiency. Third, we show that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations. We present an analytical scheduling scheme that matches the efficiency of schedules derived through exhaustive search. Finally, we develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators. Overall, we show that TETRIS improves the performance by 4.1x and reduces the energy by 1.5x over NN accelerators with conventional, low-power DRAM memory systems.
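To make the analytical scheduling claim concrete, the sketch below illustrates the style of reasoning involved: for a convolutional layer with ni input and no output feature maps of size h x w and k x k filters, it estimates off-chip DRAM traffic in closed form for two loop orderings (keeping either output or input feature maps resident in the small SRAM buffer while streaming everything else) and picks the cheapest feasible tiling. This is a hypothetical toy model, not the paper's actual scheduling algorithm; the function names, tile variables (bi, bo), and traffic formulas are illustrative assumptions.

from math import ceil

def ofm_stationary_traffic(ni, no, h, w, k, bo):
    # Keep a tile of bo output fmaps resident; stream ifmaps and filters.
    ifmap   = ni * h * w * ceil(no / bo)     # ifmaps re-read once per ofmap tile
    ofmap   = no * h * w                     # each ofmap is written back once
    filters = ni * no * k * k                # every filter element used once
    return ifmap + ofmap + filters

def ifm_stationary_traffic(ni, no, h, w, k, bi):
    # Keep a tile of bi input fmaps resident; partial sums spill to DRAM.
    passes  = ceil(ni / bi)
    ifmap   = ni * h * w                     # each ifmap read exactly once
    ofmap   = (2 * passes - 1) * no * h * w  # write partials, re-read, finalize
    filters = ni * no * k * k
    return ifmap + ofmap + filters

def best_schedule(ni, no, h, w, k, buf_elems):
    # Analytically rank the orderings and tile sizes by estimated DRAM
    # traffic, subject to the resident tile fitting in the SRAM buffer.
    candidates = []
    for bo in range(1, no + 1):
        if bo * h * w <= buf_elems:
            candidates.append((ofm_stationary_traffic(ni, no, h, w, k, bo),
                               "ofm-stationary", bo))
    for bi in range(1, ni + 1):
        if bi * h * w <= buf_elems:
            candidates.append((ifm_stationary_traffic(ni, no, h, w, k, bi),
                               "ifm-stationary", bi))
    return min(candidates) if candidates else None

# Example: a mid-size convolutional layer with a deliberately small buffer,
# reflecting TETRIS's shift of die area from SRAM buffers to processing elements.
print(best_schedule(ni=128, no=128, h=28, w=28, k=3, buf_elems=32 * 1024))

The point of the closed-form traffic model is that candidate schedules can be ranked directly, without simulating each one, which is what allows a small analytical search like this to match exhaustive exploration.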




Published In
ACM SIGARCH Computer Architecture News, Volume 45, Issue 1 (ASPLOS'17), March 2017, 812 pages. ISSN: 0163-5964. DOI: 10.1145/3093337
• ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, April 2017, 856 pages. ISBN: 9781450344654. DOI: 10.1145/3037697

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 April 2017
    Published in SIGARCH Volume 45, Issue 1


    Author Tags

    1. 3d memory
    2. acceleration
    3. dataflow scheduling
    4. neural networks
    5. partitioning

    Qualifiers

    • Research-article


    Bibliometrics

    Article Metrics

• Downloads (last 12 months): 1,343
• Downloads (last 6 weeks): 305
Reflects downloads up to 18 Nov 2024.


    Cited By

• (2024) LCM: LLM-focused Hybrid SPM-cache Architecture with Cache Management for Multi-Core AI Accelerators. In Proceedings of the 38th ACM International Conference on Supercomputing, pages 62-73, May 2024. DOI: 10.1145/3650200.3656592
• (2024) ILP-based Multi-Branch CNNs Mapping on Processing-in-Memory Architecture. In 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), pages 179-183, April 2024. DOI: 10.1109/AICAS59952.2024.10595921
• (2024) Experimental demonstration of magnetic tunnel junction-based computational random-access memory. npj Unconventional Computing, 1(1), July 2024. DOI: 10.1038/s44335-024-00003-3
• (2023) Resource-Efficient Convolutional Networks: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques. ACM Computing Surveys, 55(13s):1-36, July 2023. DOI: 10.1145/3587095
• (2023) QUIDAM: A Framework for Quantization-aware DNN Accelerator and Model Co-Exploration. ACM Transactions on Embedded Computing Systems, 22(2):1-21, January 2023. DOI: 10.1145/3555807
• (2023) CREAM: Computing in ReRAM-Assisted Energy- and Area-Efficient SRAM for Reliable Neural Network Acceleration. IEEE Transactions on Circuits and Systems I: Regular Papers, 70(8):3198-3211, August 2023. DOI: 10.1109/TCSI.2023.3272874
• (2023) Polyomino: A 3D-SRAM-Centric Accelerator for Randomly Pruned Matrix Multiplication With Simple Reordering Algorithm and Efficient Compression Format in 180-nm CMOS. IEEE Transactions on Circuits and Systems I: Regular Papers, 70(9):3440-3450, September 2023. DOI: 10.1109/TCSI.2023.3268772
• (2023) A 7-nm FinFET 1.2-TB/s/mm² 3D-Stacked SRAM Module With 0.7-pJ/b Inductive Coupling Interface Using Over-SRAM Coil and Manchester-Encoded Synchronous Transceiver. IEEE Journal of Solid-State Circuits, 58(7):2075-2086, July 2023. DOI: 10.1109/JSSC.2022.3224421
• (2023) PATRONoC: Parallel AXI Transport Reducing Overhead for Networks-on-Chip targeting Multi-Accelerator DNN Platforms at the Edge. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1-6, July 2023. DOI: 10.1109/DAC56929.2023.10247800
• (2023) Application of KNCMA Algorithm Integrating Network Features in Data Resource Scheduling. IEEE Access, 11:111622-111635, 2023. DOI: 10.1109/ACCESS.2023.3319471
