Hardware–Software Co-Design Enabling Static and Dynamic Sparse Attention Mechanisms
Pages 2783–2796
Abstract
The attention mechanism of transformers effectively extracts pertinent information from the input sequence, but the quadratic complexity of self-attention incurs heavy computational and memory burdens. Sparse attention techniques, both static and dynamic, reduce this complexity by computing attention over only a subset of queries and keys. Static and dynamic methods trade off efficiency against adaptability, making each suited to different scenarios. However, existing accelerators either target specific domains or suffer performance degradation on long sequences, and none of them supports static and dynamic sparse attention mechanisms simultaneously. To this end, we propose SALO2, a hardware–software co-design framework that enables efficient static and dynamic sparse attention computation and can be applied to various scenarios, tasks, and inputs. Experiments show that SALO2 achieves 104.80×, 13.65×, and 1.38× speedup over an Intel Xeon CPU, an NVIDIA RTX 4090 GPU, and SALO (the state-of-the-art accelerator exploiting static sparsity) on tasks with long input sequences, and 76.17×, 8.98×, and 1.71× speedup over the same CPU, the same GPU, and Sanger (the state-of-the-art accelerator exploiting dynamic sparsity) on tasks with shorter sequences. The source code is available at https://github.com/sjtu-zhao-lab/SALO.git.
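The static/dynamic distinction in the abstract can be illustrated with a minimal NumPy sketch. This is not SALO2's implementation; the sliding-window width, the top-k selection rule, and all function names are illustrative assumptions. A static mask is fixed by token position and known before inference, while a dynamic mask is derived from the input scores themselves:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries contribute zero weight.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(Q, K, V, mask):
    # Compute attention only where mask is True; other positions are
    # set to -inf so softmax assigns them zero probability.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Static sparsity (illustrative): a fixed sliding window, where each
# query attends only to keys within w positions of itself.
w = 2
idx = np.arange(n)
static_mask = np.abs(idx[:, None] - idx[None, :]) <= w

# Dynamic sparsity (illustrative): an input-dependent top-k mask that
# keeps, per query, the k_top keys with the highest raw scores.
k_top = 3
scores = Q @ K.T
thresh = np.sort(scores, axis=-1)[:, -k_top][:, None]
dynamic_mask = scores >= thresh

out_static = sparse_attention(Q, K, V, static_mask)
out_dynamic = sparse_attention(Q, K, V, dynamic_mask)
```

The static mask costs nothing at runtime but cannot adapt to the input, while the dynamic mask adapts per input at the cost of first estimating the score matrix, which is the efficiency/adaptability tradeoff the abstract refers to.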
References
[1]
A. Vaswani et al., “Attention is all you need,” in Proc. 31st Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–15.
[2]
T. Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proc. Conf. Empir. Methods Nat. Lang. Process., Syst. Demonstrat., 2020, pp. 38–45.
[3]
T. Wolf et al., “HuggingFace’s transformers: State-of-the-art natural language processing,” 2019, arXiv:1910.03771.
[4]
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. ECCV, 2020, pp. 213–229.
[5]
A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.
[6]
N. Parmar et al., “Image transformer,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 4055–4064.
[7]
W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” 2019, arXiv:1912.06813.
[8]
S. Kim et al., “SqueezeFormer: An efficient transformer for automatic speech recognition,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 9361–9373.
[9]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805.
[10]
T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1877–1901.
[11]
OpenAI, “GPT-4 technical report,” 2023, arXiv:2303.08774.
[12]
Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “GPTEval: NLG evaluation using GPT-4 with better human alignment,” 2023, arXiv:2303.16634.
[13]
H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, and Y. Zhang, “Evaluating the logical reasoning ability of chatGPT and GPT-4,” 2023, arXiv:2304.03439.
[14]
Y. Hao, L. Dong, F. Wei, and K. Xu, “Self-attention attribution: Interpreting information interactions inside transformer,” in Proc. AAAI, vol. 35, 2021, pp. 12963–12971.
[15]
H. Wang, Z. Zhang, and S. Han, “SpAtten: Efficient sparse attention architecture with cascade token and head pruning,” in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), 2021, pp. 97–110.
[16]
T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” 2023, arXiv:2307.08691.
[17]
R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” 2019, arXiv:1904.10509.
[18]
J. Qiu, H. Ma, O. Levy, S. W.-T. Yih, S. Wang, and J. Tang, “Blockwise self-attention for long document understanding,” 2019, arXiv:1911.02972.
[19]
Q. Guo, X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang, “Star-transformer,” 2019, arXiv:1902.09113.
[20]
M. Zaheer et al., “Big bird: Transformers for longer sequences,” in Proc. 34th Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 17283–17297.
[21]
I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” 2020, arXiv:2004.05150.
[22]
P. Zhang et al., “Multi-scale vision longformer: A new vision transformer for high-resolution image encoding,” in Proc. ICCV, 2021, pp. 2998–3008.
[23]
X. Dong et al., “CSWin transformer: A general vision transformer backbone with cross-shaped windows,” in Proc. CVPR, 2022, pp. 12124–12134.
[24]
G. M. Correia, V. Niculae, and A. F. Martins, “Adaptively sparse transformers,” in Proc. Conf. Empir. Methods Nat. Lang. Process./Int. Joint Conf. Natural Lang. Process., 2019, pp. 1–20.
[25]
B. Cui, Y. Li, M. Chen, and Z. Zhang, “Fine-tune BERT with sparse self-attention mechanism,” in Proc. EMNLP-IJCNLP, 2019, pp. 3548–3553.
[26]
N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient transformer,” 2020, arXiv:2001.04451.
[27]
Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse sinkhorn attention,” in Proc. ICML, 2020, pp. 9438–9447.
[28]
A. Roy, M. Saffar, A. Vaswani, and D. Grangier, “Efficient content-based sparse attention with routing transformers,” Trans. Assoc. Comput. Linguist., vol. 9, pp. 53–68, Feb. 2021.
[29]
L. Liu, Z. Qu, Z. Chen, F. Tu, Y. Ding, and Y. Xie, “Dynamic sparse attention for scalable transformer acceleration,” IEEE Trans. Comput., vol. 71, no. 12, pp. 3165–3178, Dec. 2022.
[30]
T. J. Ham et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), 2020, pp. 328–341.
[31]
L. Lu et al., “Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture,” in Proc. 54th Annu. IEEE/ACM Int. Symp. Microarchit., 2021, pp. 977–991.
[32]
S. Tuli and N. K. Jha, “AccelTran: A sparsity-aware accelerator for dynamic inference with transformers,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 11, pp. 4038–4051, Nov. 2023.
[33]
B. Li et al., “FTRANS: Energy-efficient acceleration of transformers using FPGA,” in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design, 2020, pp. 175–180.
[34]
H. You et al., “ViTCoD: Vision transformer acceleration via dedicated algorithm and accelerator co-design,” in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), 2023, pp. 273–286.
[35]
G. Shen, J. Zhao, Q. Chen, J. Leng, C. Li, and M. Guo, “SALO: An efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences,” in Proc. 59th ACM/IEEE Design Autom. Conf., 2022, pp. 571–576.
[36]
Y. Qin et al., “FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction,” in Proc. 50th Annu. Int. Symp. Comput. Archit., 2023, pp. 1–14. [Online]. Available: https://doi.org/10.1145/3579371.3589057
[37]
Z. Zhou, J. Liu, Z. Gu, and G. Sun, “Energon: Toward efficient acceleration of transformers using dynamic sparse attention,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 1, pp. 136–149, Jan. 2023.
[38]
T. Yang et al., “DTQAtten: Leveraging dynamic token-based quantization for efficient attention architecture,” in Proc. Des., Autom. Test Eur. Conf. Exhibit. (DATE), 2022, pp. 700–705.
[39]
Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, and Y. Xie, “DOTA: Detect and omit weak attentions for scalable transformer acceleration,” in Proc. 27th ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2022, pp. 14–26. [Online]. Available: https://doi.org/10.1145/3503222.3507738
[40]
H. Fan et al., “Adaptable butterfly accelerator for attention-based NNs via hardware and algorithm co-design,” in Proc. 55th IEEE/ACM Int. Symp. Microarchit. (MICRO), 2022, pp. 599–615.
[41]
J. R. Stevens, R. Venkatesan, S. Dai, B. Khailany, and A. Raghunathan, “SofterMax: Hardware/software co-design of an efficient softmax for transformers,” in Proc. 58th ACM/IEEE Design Autom. Conf. (DAC), 2021, pp. 469–474.
[42]
K. E. Batcher, “Sorting networks and their applications,” in Proc. Joint Comput. Conf., 1968, pp. 307–314.
[43]
“NVIDIA/DeepLearningExamples.” Nvidia. 2021. [Online]. Available: https://github.com/NVIDIA/DeepLearningExamples
[44]
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” 2018, arXiv:1804.07461.
[45]
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” 2016, arXiv:1606.05250.
[46]
Q. Xie, G. Lai, Z. Dai, and E. Hovy, “Large-scale cloze test dataset created by teachers,” 2017, arXiv:1711.03225.
[47]
J. Bachrach et al., “Chisel: Constructing hardware in a scala embedded language,” in Proc. 49th Annu. Design Autom. Conf., 2012, pp. 1216–1225.
[48]
A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. 33rd Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 8026–8037.
[49]
T. Zhang, Z. Lin, G. Yang, and C. De Sa, “QPyTorch: A low-precision arithmetic simulation framework,” in Proc. 5th Workshop Energy Effic. Mach. Learn. Cogn. Comput. NeurIPS (EMC2-NIPS), 2019, pp. 10–13.
Information
0278-0070 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Publisher: IEEE Press
Published: 05 March 2024
Article type: Research-article