DOI: 10.1145/3649476.3658810

Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion

Published: 12 June 2024

Abstract

Attention-based Transformers have achieved significant performance breakthroughs in natural language processing (NLP) and computer vision (CV) tasks. Meanwhile, the ever-increasing length of today's input sequences puts considerable pressure on computing devices. FPGAs are widely used to accelerate Transformer inference because of their high energy efficiency and flexibility. However, most existing FPGA-based Transformer accelerators are designed for short input lengths, making it difficult for them to handle long input sequences. To this end, we design an efficient Transformer accelerator targeting FPGAs and long-sequence input scenarios. We use a tiling softmax algorithm to fuse the attention computation, eliminating the memory and bandwidth bottleneck in the attention layer and allowing our accelerator to support arbitrary input sequence lengths. We evaluate BERT-Base on the Alveo U50 board, and our implementation achieves computational efficiency improvements of 1.09× to 2.48× over prior FPGA accelerators. In addition, our accelerator supports input sequence lengths of up to 175K tokens when running BERT-like models, far more than previous designs.
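
The key enabler of this attention fusion is a tiled (online) softmax: attention scores are computed one key/value tile at a time while a running row-wise maximum and running sum keep the softmax normalization exact, so the full sequence-length-by-sequence-length score matrix is never stored. The sketch below illustrates this dataflow in NumPy; it is an illustrative model only, not the paper's FPGA implementation, and the function name and tile size are assumptions chosen for the example.

```python
import numpy as np

def tiled_attention(Q, K, V, tile_size=64):
    """Exact attention, softmax(Q K^T / sqrt(d)) V, computed tile by tile.

    A running row-wise maximum (m) and running sum (l) are carried across
    key/value tiles, so the normalization is exact without ever holding the
    full (seq_len x seq_len) score matrix in memory.
    """
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    out = np.zeros_like(Q)                  # unnormalized weighted sum of V
    m = np.full(seq_len, -np.inf)           # running row-wise max of scores
    l = np.zeros(seq_len)                   # running row-wise sum of exp(scores)

    for start in range(0, seq_len, tile_size):
        Kt = K[start:start + tile_size]     # one key tile
        Vt = V[start:start + tile_size]     # matching value tile

        S = (Q @ Kt.T) * scale              # score block for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])      # tile-local exponentials
        rescale = np.exp(m - m_new)         # correct previously accumulated terms

        l = l * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vt
        m = m_new

    return out / l[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))

    # Reference: naive attention that materializes the full score matrix.
    S = (Q @ K.T) / np.sqrt(64)
    W = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (W / W.sum(axis=1, keepdims=True)) @ V

    assert np.allclose(tiled_attention(Q, K, V), ref)
```

Because only one score tile is live at a time, on-chip buffering grows with the tile size rather than with the sequence length, which is what allows the accelerator to scale to very long inputs.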


      Published In

      GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024
      June 2024, 797 pages
      ISBN: 9798400706059
      DOI: 10.1145/3649476

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. Accelerator
      2. Attention Fusion
      3. FPGA
      4. Long Input Sequence
      5. Transformer

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Conference

      GLSVLSI '24
      Sponsor:
      GLSVLSI '24: Great Lakes Symposium on VLSI 2024
      June 12 - 14, 2024
      Clearwater, FL, USA

      Acceptance Rates

      Overall Acceptance Rate 312 of 1,156 submissions, 27%
