Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Authors:

Lintao Zhang

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Pages 63 - 72

https://doi.org/10.1145/3289602.3293898

Published: 20 February 2019

Abstract

Neural networks based on Long Short-Term Memory (LSTM) are widely deployed in latency-sensitive language and speech applications. To speed up LSTM inference, previous research has proposed weight pruning techniques that reduce computational cost. Unfortunately, irregular computation and memory accesses in unrestricted sparse LSTM limit the realizable parallelism, especially when implemented on FPGA. To address this issue, some researchers have proposed block-based sparsity patterns that increase the regularity of sparse weight matrices, but these approaches suffer from deteriorated prediction accuracy. This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern that maintains model accuracy at a high sparsity level while still enabling an efficient FPGA implementation. BBS partitions each weight matrix row into banks for parallel computing, and adopts fine-grained pruning inside each bank to maintain model accuracy. We develop a 3-step software-hardware co-optimization approach to apply BBS in real FPGA hardware. First, we propose a bank-balanced pruning method to induce the BBS pattern on weight matrices. Then we introduce a decoding-free sparse matrix format, Compressed Sparse Banks (CSB), that transparently exposes inter-bank parallelism in BBS to hardware. Finally, we design an FPGA accelerator that takes advantage of BBS to eliminate irregular computation and memory accesses. Implemented on an Intel Arria-10 FPGA, the BBS accelerator achieves 750.9 GOPs on sparse LSTM networks with a batch size of 1. Compared to state-of-the-art FPGA accelerators for LSTM with different compression techniques, the BBS accelerator achieves a 2.3x to 3.7x improvement in energy efficiency and a 7.0x to 34.4x reduction in latency with negligible loss of model accuracy.
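
The bank-balanced pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `bank_balanced_prune`, the magnitude-based ranking, and the per-row list representation are assumptions made for illustration. The sketch splits one weight row into equal-sized banks and keeps the same number of largest-magnitude weights in every bank, so each bank ends up with an identical non-zero count.

```python
from typing import List


def bank_balanced_prune(row: List[float], num_banks: int, sparsity: float) -> List[float]:
    """Zero the smallest-magnitude weights inside each bank of one row.

    Hypothetical helper for illustration; the ranking criterion
    (absolute value) is an assumption, not taken from the paper.
    """
    if len(row) % num_banks != 0:
        raise ValueError("row length must be divisible by num_banks")
    bank_size = len(row) // num_banks
    # Number of weights that survive in every bank (identical across banks).
    keep = bank_size - int(round(bank_size * sparsity))
    pruned = [0.0] * len(row)
    for b in range(num_banks):
        start = b * bank_size
        positions = range(start, start + bank_size)
        # Rank this bank's positions by magnitude, largest first.
        ranked = sorted(positions, key=lambda i: abs(row[i]), reverse=True)
        for i in ranked[:keep]:
            pruned[i] = row[i]
    return pruned


row = [0.8, -0.1, 0.2, 1.5, 0.7, -0.4, 0.05, 0.6]
pruned_row = bank_balanced_prune(row, num_banks=2, sparsity=0.5)
# Count the surviving weights in each of the two banks of size 4.
nonzeros_per_bank = [sum(1 for w in pruned_row[s:s + 4] if w != 0.0) for s in (0, 4)]
print(pruned_row)          # [0.8, 0.0, 0.0, 1.5, 0.7, 0.0, 0.0, 0.6]
print(nonzeros_per_bank)   # [2, 2]
```

Because every bank keeps the same number of non-zeros, parallel units assigned one bank each would receive equal work, which is the load-balancing property the abstract attributes to BBS. Pruning within a bank stays fine-grained: any position in the bank may survive.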

Cited By

Tang M, Wen M, Yang J, Xue Z, Shen J. (2025). SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From Scratch. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 44:2 (497-511). Online publication date: Feb-2025.
https://doi.org/10.1109/TCAD.2024.3434359
Zhang C, Wang Y, Xie Z, Guo C, Liu Y, Leng J, Ji Z, Xie Y, Huang R. (2025). DSTC: Dual-Side Sparse Tensor Core for DNNs Acceleration on Modern GPU Architectures. IEEE Transactions on Computers 74:2 (341-355). Online publication date: Feb-2025.
https://doi.org/10.1109/TC.2024.3475814
Guo C, Chen Y, Fu Y. (2025). FPGA-based component-wise LSTM training accelerator for neural granger causality analysis. Neurocomputing 615 (128871). Online publication date: Jan-2025.
https://doi.org/10.1016/j.neucom.2024.128871

Information & Contributors

Information

Published In

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 2019

360 pages

ISBN:9781450361378

DOI:10.1145/3289602

General Chair: Kia Bazargan, Univ. of Minnesota, USA

Program Chair: Stephen Neuendorffer, Xilinx, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China

Conference

FPGA '19

Sponsor:

SIGDA

FPGA '19: The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 24 - 26, 2019

Seaside, CA, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Bibliometrics

Total Citations: 132
Total Downloads: 1,969
Downloads (Last 12 months): 201
Downloads (Last 6 weeks): 18

Reflects downloads up to 13 Feb 2025

Citations

Cited By

Sindhu R, Arunachalam V. (2025). Design of an Approximate Radix-2 FFT Butterfly Unit for LSTM-Speech signal-based Parkinson’s Disease Classifier. Circuits, Systems, and Signal Processing. Online publication date: 9-Feb-2025.
https://doi.org/10.1007/s00034-025-03011-1
Langehaug T, Graham S. (2024). CuMONITOR: Continuous Monitoring of Microarchitecture for Software Task Identification and Classification. Digital Threats: Research and Practice 5:3 (1-22). Online publication date: 28-Mar-2024.
https://dl.acm.org/doi/10.1145/3652861
Gao C, Delbruck T, Liu S. (2024). Spartus: A 9.4 TOp/s FPGA-Based LSTM Accelerator Exploiting Spatio-Temporal Sparsity. IEEE Transactions on Neural Networks and Learning Systems 35:1 (1098-1112). Online publication date: Jan-2024.
https://doi.org/10.1109/TNNLS.2022.3180209
Zhang J, Huang H, Sun J, Luna J, Mutlu O, Wang Z. (2024). SparseACC: A Generalized Linear Model Accelerator for Sparse Datasets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:3 (840-853). Online publication date: Mar-2024.
https://doi.org/10.1109/TCAD.2023.3324276
Geng H, Liu Y, Zheng Y, Zhang L, Sun J, Wang Y, Wang Y, Sun G, Yang M, Cao T, Liu Y. (2024). PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning. IEEE Transactions on Computers 73:11 (2576-2589). Online publication date: Nov-2024.
https://doi.org/10.1109/TC.2024.3441855
Feng Y, Lai Y, Chen Y, Zhang Z, Wei J. (2024). LSTM-Based Model Compression for CAN Security in Intelligent Vehicles. IEEE Transactions on Artificial Intelligence 5:12 (6457-6471). Online publication date: Dec-2024.
https://doi.org/10.1109/TAI.2024.3438110
Liu S, Zhou S, Li Z, Gao C, Kim K, Delbruck T. (2024). Bringing Dynamic Sparsity to the Forefront for Low-Power Audio Edge Computing: Brain-inspired approach for sparsifying network updates. IEEE Solid-State Circuits Magazine 16:4 (62-69). Online publication date: Nov-2025.
https://doi.org/10.1109/MSSC.2024.3455290
