DOI: 10.1109/DAC18074.2021.9586216

Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration

Published: 05 December 2021

Abstract

DNN accelerators are often developed and evaluated in isolation, without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance and energy efficiency. To address this challenge, we present Gemmini, an open-source, full-stack DNN accelerator generator. Gemmini generates a wide design space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level effects. Gemmini-generated accelerators have also been fabricated, delivering up to three orders-of-magnitude speedups over high-performance CPUs on various DNN benchmarks.
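For readers unfamiliar with the workload class the generator targets, the sketch below shows the core computation that a Gemmini-style systolic-array accelerator offloads: a tiled int8 matrix multiplication with 32-bit accumulation. This is a minimal CPU reference in C, not Gemmini's actual implementation or API; the matrix size N, the tile dimension DIM, and the loop ordering are illustrative assumptions, and on a generated SoC this loop nest would instead be dispatched through the accelerator's generated programming stack.

```c
#include <stdint.h>
#include <stdio.h>

#define DIM 4   /* illustrative systolic-array tile dimension (assumption) */
#define N   8   /* illustrative matrix size, a multiple of DIM (assumption) */

/* Reference tiled GEMM: C += A * B on int8 inputs with int32 accumulation,
 * the kind of kernel a Gemmini-generated accelerator executes tile by tile.
 * On a Gemmini SoC this loop nest would be replaced by calls into the
 * generated software stack rather than run on the CPU. */
static void tiled_matmul(const int8_t A[N][N], const int8_t B[N][N],
                         int32_t C[N][N]) {
    for (int i0 = 0; i0 < N; i0 += DIM)
        for (int j0 = 0; j0 < N; j0 += DIM)
            for (int k0 = 0; k0 < N; k0 += DIM)
                /* One DIM x DIM tile: the unit of work mapped onto the array. */
                for (int i = i0; i < i0 + DIM; i++)
                    for (int j = j0; j < j0 + DIM; j++)
                        for (int k = k0; k < k0 + DIM; k++)
                            C[i][j] += (int32_t)A[i][k] * (int32_t)B[k][j];
}

int main(void) {
    static int8_t A[N][N], B[N][N];
    static int32_t C[N][N];  /* zero-initialized */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (int8_t)(i + j);
            B[i][j] = (int8_t)(i - j);
        }
    tiled_matmul(A, B, C);
    printf("C[0][0] = %d\n", C[0][0]);
    return 0;
}
```

Tiling is what lets a fixed DIM x DIM array and a finite scratchpad cover arbitrarily large layers; the choice of tile sizes, dataflow, and memory hierarchy is the kind of design-space parameter the generator and its programming stack expose.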

Published In

2021 58th ACM/IEEE Design Automation Conference (DAC)
Dec 2021
1380 pages

Publisher

IEEE Press

Qualifiers

Research-article

Cited By

  • (2025) MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 731-745. DOI: 10.1145/3669940.3707268. Online publication date: 3-Feb-2025.
  • (2025) SuperNoVA: Algorithm-Hardware Co-Design for Resource-Aware SLAM. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 1035-1051. DOI: 10.1145/3669940.3707258. Online publication date: 3-Feb-2025.
  • (2025) UniZK: Accelerating Zero-Knowledge Proof with Unified Hardware and Flexible Kernel Mapping. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 1101-1117. DOI: 10.1145/3669940.3707228. Online publication date: 3-Feb-2025.
  • (2025) Exo 2: Growing a Scheduling Language. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 426-444. DOI: 10.1145/3669940.3707218. Online publication date: 3-Feb-2025.
  • (2024) BiE. In Proceedings of the 41st International Conference on Machine Learning, pp. 62978-62992. DOI: 10.5555/3692070.3694679. Online publication date: 21-Jul-2024.
  • (2024) gem5-NVDLA: A Simulation Framework for Compiling, Scheduling, and Architecture Evaluation on AI System-on-Chips. ACM Transactions on Design Automation of Electronic Systems 29(5), pp. 1-20. DOI: 10.1145/3661997. Online publication date: 29-Apr-2024.
  • (2024) Allo: A Programming Model for Composable Accelerator Design. Proceedings of the ACM on Programming Languages 8(PLDI), pp. 593-620. DOI: 10.1145/3656401. Online publication date: 20-Jun-2024.
  • (2024) ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors. ACM Transactions on Architecture and Code Optimization 21(3), pp. 1-24. DOI: 10.1145/3653363. Online publication date: 21-Mar-2024.
  • (2024) Orchestrating Multiple Mixed Precision Models on a Shared Precision-Scalable NPU. In Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 72-82. DOI: 10.1145/3652032.3657571. Online publication date: 20-Jun-2024.
  • (2024) Search-Based Translations for Tensor Operations. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 1917-1919. DOI: 10.1145/3650212.3685558. Online publication date: 11-Sep-2024.
