DOI: 10.1109/DAC18074.2021.9586216

Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration

Published: 05 December 2021

Abstract

DNN accelerators are often developed and evaluated in isolation, without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance and energy efficiency. To address this challenge, we present Gemmini, an open-source, full-stack DNN accelerator generator. Gemmini generates a wide design space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level effects. Gemmini-generated accelerators have also been fabricated, delivering up to three orders-of-magnitude speedups over high-performance CPUs on various DNN benchmarks.
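For readers unfamiliar with the workload class the generator targets, the sketch below shows the core computation that a Gemmini-style systolic-array accelerator offloads: a tiled int8 matrix multiplication with 32-bit accumulation. This is a minimal CPU reference in C, not Gemmini's actual implementation or API; the matrix size N, the tile dimension DIM, and the loop ordering are illustrative assumptions, and on a generated SoC this loop nest would instead be dispatched through the accelerator's generated programming stack.

```c
#include <stdint.h>
#include <stdio.h>

#define DIM 4   /* illustrative systolic-array tile dimension (assumption) */
#define N   8   /* illustrative matrix size, a multiple of DIM (assumption) */

/* Reference tiled GEMM: C += A * B on int8 inputs with int32 accumulation,
 * the kind of kernel a Gemmini-generated accelerator executes tile by tile.
 * On a Gemmini SoC this loop nest would be replaced by calls into the
 * generated software stack rather than run on the CPU. */
static void tiled_matmul(const int8_t A[N][N], const int8_t B[N][N],
                         int32_t C[N][N]) {
    for (int i0 = 0; i0 < N; i0 += DIM)
        for (int j0 = 0; j0 < N; j0 += DIM)
            for (int k0 = 0; k0 < N; k0 += DIM)
                /* One DIM x DIM tile: the unit of work mapped onto the array. */
                for (int i = i0; i < i0 + DIM; i++)
                    for (int j = j0; j < j0 + DIM; j++)
                        for (int k = k0; k < k0 + DIM; k++)
                            C[i][j] += (int32_t)A[i][k] * (int32_t)B[k][j];
}

int main(void) {
    static int8_t A[N][N], B[N][N];
    static int32_t C[N][N];  /* zero-initialized */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (int8_t)(i + j);
            B[i][j] = (int8_t)(i - j);
        }
    tiled_matmul(A, B, C);
    printf("C[0][0] = %d\n", C[0][0]);
    return 0;
}
```

Tiling is what lets a fixed DIM x DIM array and a finite scratchpad cover arbitrarily large layers; the choice of tile sizes, dataflow, and memory hierarchy is the kind of design-space parameter the generator and its programming stack expose.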

Published In

2021 58th ACM/IEEE Design Automation Conference (DAC)
Dec 2021
1380 pages

Publisher

IEEE Press

Qualifiers

Research-article

Cited By

  • (2025) MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 731-745. DOI: 10.1145/3669940.3707268. Online publication date: 3-Feb-2025.
  • (2025) SuperNoVA: Algorithm-Hardware Co-Design for Resource-Aware SLAM. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 1035-1051. DOI: 10.1145/3669940.3707258. Online publication date: 3-Feb-2025.
  • (2025) UniZK: Accelerating Zero-Knowledge Proof with Unified Hardware and Flexible Kernel Mapping. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 1101-1117. DOI: 10.1145/3669940.3707228. Online publication date: 3-Feb-2025.
  • (2025) Exo 2: Growing a Scheduling Language. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 426-444. DOI: 10.1145/3669940.3707218. Online publication date: 3-Feb-2025.
  • (2024) BiE. In Proceedings of the 41st International Conference on Machine Learning, pp. 62978-62992. DOI: 10.5555/3692070.3694679. Online publication date: 21-Jul-2024.
  • (2024) gem5-NVDLA: A Simulation Framework for Compiling, Scheduling, and Architecture Evaluation on AI System-on-Chips. ACM Transactions on Design Automation of Electronic Systems 29(5), pp. 1-20. DOI: 10.1145/3661997. Online publication date: 29-Apr-2024.
  • (2024) Allo: A Programming Model for Composable Accelerator Design. Proceedings of the ACM on Programming Languages 8(PLDI), pp. 593-620. DOI: 10.1145/3656401. Online publication date: 20-Jun-2024.
  • (2024) ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors. ACM Transactions on Architecture and Code Optimization 21(3), pp. 1-24. DOI: 10.1145/3653363. Online publication date: 21-Mar-2024.
  • (2024) Orchestrating Multiple Mixed Precision Models on a Shared Precision-Scalable NPU. In Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 72-82. DOI: 10.1145/3652032.3657571. Online publication date: 20-Jun-2024.
  • (2024) Search-Based Translations for Tensor Operations. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 1917-1919. DOI: 10.1145/3650212.3685558. Online publication date: 11-Sep-2024.
