
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

Published: 04 April 2017

Abstract

The high accuracy of deep neural networks (NNs) has led to the development of NN accelerators that improve performance by two orders of magnitude. However, scaling these accelerators for higher performance with increasingly larger NNs exacerbates the cost and energy overheads of their memory systems, including the on-chip SRAM buffers and the off-chip DRAM channels.
This paper presents the hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory. First, we show that the high throughput and low energy characteristics of 3D memory allow us to rebalance the NN accelerator design, using more area for processing elements and less area for SRAM buffers. Second, we move portions of the NN computations close to the DRAM banks to decrease bandwidth pressure and increase performance and energy efficiency. Third, we show that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations. We present an analytical scheduling scheme that matches the efficiency of schedules derived through exhaustive search. Finally, we develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators. Overall, we show that TETRIS improves the performance by 4.1x and reduces the energy by 1.5x over NN accelerators with conventional, low-power DRAM memory systems.
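To make the analytical scheduling claim concrete, the sketch below illustrates the style of reasoning involved: for a convolutional layer with ni input and no output feature maps of size h x w and k x k filters, it estimates off-chip DRAM traffic in closed form for two loop orderings (keeping either output or input feature maps resident in the small SRAM buffer while streaming everything else) and picks the cheapest feasible tiling. This is a hypothetical toy model, not the paper's actual scheduling algorithm; the function names, tile variables (bi, bo), and traffic formulas are illustrative assumptions.

from math import ceil

def ofm_stationary_traffic(ni, no, h, w, k, bo):
    # Keep a tile of bo output fmaps resident; stream ifmaps and filters.
    ifmap   = ni * h * w * ceil(no / bo)     # ifmaps re-read once per ofmap tile
    ofmap   = no * h * w                     # each ofmap is written back once
    filters = ni * no * k * k                # every filter element used once
    return ifmap + ofmap + filters

def ifm_stationary_traffic(ni, no, h, w, k, bi):
    # Keep a tile of bi input fmaps resident; partial sums spill to DRAM.
    passes  = ceil(ni / bi)
    ifmap   = ni * h * w                     # each ifmap read exactly once
    ofmap   = (2 * passes - 1) * no * h * w  # write partials, re-read, finalize
    filters = ni * no * k * k
    return ifmap + ofmap + filters

def best_schedule(ni, no, h, w, k, buf_elems):
    # Analytically rank the orderings and tile sizes by estimated DRAM
    # traffic, subject to the resident tile fitting in the SRAM buffer.
    candidates = []
    for bo in range(1, no + 1):
        if bo * h * w <= buf_elems:
            candidates.append((ofm_stationary_traffic(ni, no, h, w, k, bo),
                               "ofm-stationary", bo))
    for bi in range(1, ni + 1):
        if bi * h * w <= buf_elems:
            candidates.append((ifm_stationary_traffic(ni, no, h, w, k, bi),
                               "ifm-stationary", bi))
    return min(candidates) if candidates else None

# Example: a mid-size convolutional layer with a deliberately small buffer,
# reflecting TETRIS's shift of die area from SRAM buffers to processing elements.
print(best_schedule(ni=128, no=128, h=28, w=28, k=3, buf_elems=32 * 1024))

The point of the closed-form traffic model is that candidate schedules can be ranked directly, without simulating each one, which is what allows a small analytical search like this to match exhaustive exploration.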




Published In
ACM SIGARCH Computer Architecture News, Volume 45, Issue 1 (ASPLOS'17), March 2017, 812 pages. ISSN: 0163-5964. DOI: 10.1145/3093337
• ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, April 2017, 856 pages. ISBN: 9781450344654. DOI: 10.1145/3037697

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 April 2017
    Published in SIGARCH Volume 45, Issue 1


    Author Tags

    1. 3d memory
    2. acceleration
    3. dataflow scheduling
    4. neural networks
    5. partitioning

    Qualifiers

    • Research-article


    Bibliometrics

    Article Metrics

• Downloads (last 12 months): 1,343
• Downloads (last 6 weeks): 305
Reflects downloads up to 18 Nov 2024.


    Cited By

• (2024) LCM: LLM-focused Hybrid SPM-cache Architecture with Cache Management for Multi-Core AI Accelerators. In Proceedings of the 38th ACM International Conference on Supercomputing, pages 62-73, May 2024. DOI: 10.1145/3650200.3656592
• (2024) ILP-based Multi-Branch CNNs Mapping on Processing-in-Memory Architecture. In 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), pages 179-183, April 2024. DOI: 10.1109/AICAS59952.2024.10595921
• (2024) Experimental demonstration of magnetic tunnel junction-based computational random-access memory. npj Unconventional Computing, 1(1), July 2024. DOI: 10.1038/s44335-024-00003-3
• (2023) Resource-Efficient Convolutional Networks: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques. ACM Computing Surveys, 55(13s):1-36, July 2023. DOI: 10.1145/3587095
• (2023) QUIDAM: A Framework for Quantization-aware DNN Accelerator and Model Co-Exploration. ACM Transactions on Embedded Computing Systems, 22(2):1-21, January 2023. DOI: 10.1145/3555807
• (2023) CREAM: Computing in ReRAM-Assisted Energy- and Area-Efficient SRAM for Reliable Neural Network Acceleration. IEEE Transactions on Circuits and Systems I: Regular Papers, 70(8):3198-3211, August 2023. DOI: 10.1109/TCSI.2023.3272874
• (2023) Polyomino: A 3D-SRAM-Centric Accelerator for Randomly Pruned Matrix Multiplication With Simple Reordering Algorithm and Efficient Compression Format in 180-nm CMOS. IEEE Transactions on Circuits and Systems I: Regular Papers, 70(9):3440-3450, September 2023. DOI: 10.1109/TCSI.2023.3268772
• (2023) A 7-nm FinFET 1.2-TB/s/mm² 3D-Stacked SRAM Module With 0.7-pJ/b Inductive Coupling Interface Using Over-SRAM Coil and Manchester-Encoded Synchronous Transceiver. IEEE Journal of Solid-State Circuits, 58(7):2075-2086, July 2023. DOI: 10.1109/JSSC.2022.3224421
• (2023) PATRONoC: Parallel AXI Transport Reducing Overhead for Networks-on-Chip targeting Multi-Accelerator DNN Platforms at the Edge. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1-6, July 2023. DOI: 10.1109/DAC56929.2023.10247800
• (2023) Application of KNCMA Algorithm Integrating Network Features in Data Resource Scheduling. IEEE Access, 11:111622-111635, 2023. DOI: 10.1109/ACCESS.2023.3319471
