DOI: 10.1145/3489517.3530505
Research article

NN-LUT: neural approximation of non-linear operations for efficient transformer inference

Published: 23 August 2022

Abstract

Non-linear operations such as GELU, layer normalization, and Softmax are essential yet costly building blocks of Transformer models. Several prior works simplified these operations with look-up tables or integer computations, but such approximations suffer from inferior accuracy or considerable hardware cost and long latency. This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference. The framework employs a simple neural network as a universal approximator whose structure is equivalently transformed into a look-up table (LUT). The proposed framework, called neural-network-generated LUT (NN-LUT), can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.
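The core idea, training a small network and then rewriting it exactly as a look-up table, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the one-hidden-layer ReLU form, the knot placement, the least-squares fit, and all names below are assumptions; the paper's own training procedure and LUT layout may differ.

```python
import numpy as np

def gelu(x):
    """Reference GELU via its common tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# A one-hidden-layer ReLU network y = a*x + c + sum_i v_i * relu(x - k_i)
# is piecewise linear, so it is exactly representable as a breakpoint LUT.
# Knot positions and hidden width below are illustrative choices.
H = 32
knots = np.linspace(-5.0, 5.0, H)          # hidden-unit kink positions

# Fit the output layer (v, a, c) to GELU by least squares on a dense grid.
xs = np.linspace(-6.0, 6.0, 2001)
phi = np.maximum(xs[:, None] - knots[None, :], 0.0)      # ReLU features
A = np.hstack([phi, xs[:, None], np.ones((len(xs), 1))])
coef, *_ = np.linalg.lstsq(A, gelu(xs), rcond=None)
v, a, c = coef[:H], coef[H], coef[H + 1]

def nn_approx(x):
    """Evaluate the fitted network directly."""
    return np.maximum(x - knots, 0.0) @ v + a * x + c

# Equivalent LUT: one (slope, intercept) entry per linear segment.
# In segment j, exactly the units with k_i below x are active, so the
# network collapses to a single affine function.
slopes, inters = [], []
for j in range(H + 1):
    act = np.arange(H) < j                 # units active in segment j
    slopes.append(a + v[act].sum())
    inters.append(c - (v[act] * knots[act]).sum())

def lut_eval(x):
    """Evaluate via table lookup instead of running the network."""
    j = np.searchsorted(knots, x)          # segment index
    return slopes[j] * x + inters[j]

err = np.max(np.abs(A @ coef - gelu(xs)))  # worst-case fit error on the grid
```

Under these assumptions the LUT reproduces the network bit-for-bit up to floating-point rounding, so the approximation quality is set entirely by the fit, while inference needs only a comparison and one multiply-add per input.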




Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022
1462 pages
ISBN: 9781450391429
DOI: 10.1145/3489517
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. look-up table
  2. neural network
  3. non-linear function
  4. transformer

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Trade, Industry and Energy (MOTIE, Korea)

Conference

DAC '22
Sponsor:
DAC '22: 59th ACM/IEEE Design Automation Conference
July 10 - 14, 2022
San Francisco, California, USA

Acceptance Rates

Overall acceptance rate: 1,770 of 5,499 submissions (32%)


Article Metrics

  • Downloads (last 12 months): 446
  • Downloads (last 6 weeks): 57
Reflects downloads up to 23 Nov 2024

Cited By

Cited By

  • Sampleformer: An efficient conformer-based Neural Network for Automatic Speech Recognition. Intelligent Data Analysis 28:6 (2024), 1647-1659. DOI: 10.3233/IDA-230612
  • NOVA: NoC-based Vector Unit for Mapping Attention Layers on a CNN Accelerator. DATE 2024, 1-6. DOI: 10.23919/DATE58400.2024.10546727
  • ONE-SA: Enabling Nonlinear Operations in Systolic Arrays For Efficient and Flexible Neural Network Inference. DATE 2024, 1-6. DOI: 10.23919/DATE58400.2024.10546535
  • POSTER: In-network Model Inference for Distributed Systems via Programmable Switches. ACM SIGCOMM 2024 Posters and Demos, 75-77. DOI: 10.1145/3672202.3673749
  • CSTrans-OPU: An FPGA-based Overlay Processor with Full Compilation for Transformer Networks via Sparsity Exploration. DAC 2024, 1-6. DOI: 10.1145/3649329.3657325
  • IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System. ASPLOS 2024, Volume 3, 545-560. DOI: 10.1145/3620666.3651324
  • SimBU: Self-Similarity-Based Hybrid Binary-Unary Computing for Nonlinear Functions. IEEE Transactions on Computers 73:9 (2024), 2192-2205. DOI: 10.1109/TC.2024.3398512
  • ReDas: A Lightweight Architecture for Supporting Fine-Grained Reshaping and Multiple Dataflows on Systolic Array. IEEE Transactions on Computers 73:8 (2024), 1997-2011. DOI: 10.1109/TC.2024.3398500
  • A Transistor Operations Model for Deep Learning Energy Consumption Scaling Law. IEEE Transactions on Artificial Intelligence 5:1 (2024), 192-204. DOI: 10.1109/TAI.2022.3229280
  • PWL-Explorer: A Reconfigurable Architecture for Nonlinear Activation Function with Automatic DSE. ISEDA 2024, 210-215. DOI: 10.1109/ISEDA62518.2024.10618045
