DOI: 10.1145/3649476.3658810

Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion

Published: 12 June 2024

Abstract

Attention-based Transformers have achieved significant performance breakthroughs in natural language processing (NLP) and computer vision (CV) tasks. Meanwhile, the ever-increasing length of today's input sequences puts considerable pressure on computing devices. FPGAs are widely used to accelerate Transformer inference because of their high energy efficiency and flexibility. However, most existing FPGA-based Transformer accelerators are designed for short input lengths, making it difficult for them to handle long input sequences. To this end, we design an efficient Transformer accelerator targeting FPGAs and long-sequence input scenarios. We use a tiling softmax algorithm to fuse the attention computation, eliminating the memory and bandwidth bottleneck in the attention layer and allowing our accelerator to support arbitrary input sequence lengths. We evaluate BERT-Base on the Alveo U50 board, and our implementation achieves computational efficiency improvements of 1.09× to 2.48× over prior FPGA accelerators. In addition, our accelerator supports input sequence lengths of up to 175K tokens when running BERT-like models, far more than previous designs.
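
The key enabler of this attention fusion is a tiled (online) softmax: attention scores are computed one key/value tile at a time while a running row-wise maximum and running sum keep the softmax normalization exact, so the full sequence-length-by-sequence-length score matrix is never stored. The sketch below illustrates this dataflow in NumPy; it is an illustrative model only, not the paper's FPGA implementation, and the function name and tile size are assumptions chosen for the example.

```python
import numpy as np

def tiled_attention(Q, K, V, tile_size=64):
    """Exact attention, softmax(Q K^T / sqrt(d)) V, computed tile by tile.

    A running row-wise maximum (m) and running sum (l) are carried across
    key/value tiles, so the normalization is exact without ever holding the
    full (seq_len x seq_len) score matrix in memory.
    """
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    out = np.zeros_like(Q)                  # unnormalized weighted sum of V
    m = np.full(seq_len, -np.inf)           # running row-wise max of scores
    l = np.zeros(seq_len)                   # running row-wise sum of exp(scores)

    for start in range(0, seq_len, tile_size):
        Kt = K[start:start + tile_size]     # one key tile
        Vt = V[start:start + tile_size]     # matching value tile

        S = (Q @ Kt.T) * scale              # score block for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])      # tile-local exponentials
        rescale = np.exp(m - m_new)         # correct previously accumulated terms

        l = l * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vt
        m = m_new

    return out / l[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))

    # Reference: naive attention that materializes the full score matrix.
    S = (Q @ K.T) / np.sqrt(64)
    W = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (W / W.sum(axis=1, keepdims=True)) @ V

    assert np.allclose(tiled_attention(Q, K, V), ref)
```

Because only one score tile is live at a time, on-chip buffering grows with the tile size rather than with the sequence length, which is what allows the accelerator to scale to very long inputs.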


      Published In

      GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024
      June 2024, 797 pages
      ISBN: 9798400706059
      DOI: 10.1145/3649476

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. Accelerator
      2. Attention Fusion
      3. FPGA
      4. Long Input Sequence
      5. Transformer

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Conference

      GLSVLSI '24
      Sponsor:
      GLSVLSI '24: Great Lakes Symposium on VLSI 2024
      June 12 - 14, 2024
      Clearwater, FL, USA

      Acceptance Rates

      Overall Acceptance Rate 312 of 1,156 submissions, 27%
