Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3289602.3293898acmconferencesArticle/Chapter ViewAbstractPublication PagesfpgaConference Proceedingsconference-collections
research-article

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Published: 20 February 2019 Publication History

Abstract

Neural networks based on Long Short-Term Memory (LSTM) are widely deployed in latency-sensitive language and speech applications. To speed up LSTM inference, previous research proposes weight pruning techniques to reduce computational cost. Unfortunately, irregular computation and memory accesses in unrestricted sparse LSTM limit the realizable parallelism, especially when implemented on FPGA. To address this issue, some researchers propose block-based sparsity patterns to increase the regularity of sparse weight matrices, but these approaches suffer from deteriorated prediction accuracy. This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern that can maintain model accuracy at a high sparsity level while still enable an efficient FPGA implementation. BBS partitions each weight matrix row into banks for parallel computing, while adopts fine-grained pruning inside each bank to maintain model accuracy. We develop a 3-step software-hardware co-optimization approach to apply BBS in real FPGA hardware. First, we propose a bank-balanced pruning method to induce the BBS pattern on weight matrices. Then we introduce a decoding-free sparse matrix format, Compressed Sparse Banks (CSB), that transparently exposes inter-bank parallelism in BBS to hardware. Finally, we design an FPGA accelerator that takes advantage of BBS to eliminate irregular computation and memory accesses. Implemented on Intel Arria-10 FPGA, the BBS accelerator can achieve 750.9 GOPs on sparse LSTM networks with a batch size of 1. Compared to state-of-the-art FPGA accelerators for LSTM with different compression techniques, the BBS accelerator achieves 2.3 ~ 3.7x improvement on energy efficiency and 7.0 ~ 34.4x reduction on latency with negligible loss of model accuracy.

References

[1]
2018. Sparse Matrix Formats. https://docs.scipy.org/doc/scipy/reference/sparse. html/. (2018).
[2]
Nathan Bell and Michael Garland. 2008. Efficient sparse matrix-vector multiplication on CUDA. Technical Report. Nvidia Technical Report NVR-2008-004, Nvidia Corporation.
[3]
AdrianMCaulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, and others. 2016. A cloud-scale acceleration architecture. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 7.
[4]
Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, and others. 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE.
[5]
Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S Chung, and Greg Stitt. 2014. A high memory bandwidth fpga accelerator for sparse matrix-vector multiplication. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. IEEE, 36--43.
[6]
Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbruck. 2018. DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 21--30.
[7]
John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. 1993. DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1--1.1. NASA STI/Recon technical report n 93 (1993).
[8]
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, and others. 2017. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 75--84.
[9]
Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[10]
Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135--1143.
[11]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780.
[12]
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, and others. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 1--12.
[13]
Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. arXiv preprint arXiv:1802.08435 (2018).
[14]
Charles Eric LaForest, Ming G Liu, Emma Rae Rapati, and J Gregory Steffan. 2012. Multi-ported memories for FPGAs via XOR. In Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays. ACM, 209--218.
[15]
Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. 2016. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning. 2849--2858.
[16]
Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An instruction set architecture for neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 393--405.
[17]
Huizi Mao, Song Han, Jeff Pool,Wenshuo Li, Xingyu Liu, YuWang, and William J Dally. 2017. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922 (2017).
[18]
Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3 LDC99T42. CD-ROM. Philadelphia, Penn.: Linguistic Data Consortium (1999).
[19]
Sharan Narang, Eric Undersander, and Gregory Diamos. 2017. Block-Sparse Recurrent Neural Networks. arXiv preprint arXiv:1711.02782 (2017).
[20]
Haim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.
[21]
Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 11--20.
[22]
Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE. IEEE, 1--6.
[23]
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems. 2074--2082.
[24]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[25]
Hasan Erdem Yantir, Salih Bayar, and Arda Yurdakul. 2013. Efficient implementations of multi-pumped multi-port register files in FPGAs. In Digital System Design (DSD), 2013 Euromicro Conference on. IEEE, 185--192.
[26]
Zhuliang Yao, Shijie Cao, andWencong Xiao. 2018. Balanced Sparsity for Efficient DNN Inference on GPU. arXiv preprint arXiv:1811.00206 (2018).
[27]
Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 548--560.
[28]
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
[29]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 161--170.
[30]
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1--12.
[31]
Shijie Zhou, Rajgopal Kannan, Yu Min, and Viktor K Prasanna. 2018. FASTCF: FPGA-based Accelerator for STochastic-Gradient-Descent-based Collaborative Filtering. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 259--268.

Cited By

View all
  • (2025)SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From ScratchIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.343435944:2(497-511)Online publication date: Feb-2025
  • (2025)DSTC: Dual-Side Sparse Tensor Core for DNNs Acceleration on Modern GPU ArchitecturesIEEE Transactions on Computers10.1109/TC.2024.347581474:2(341-355)Online publication date: Feb-2025
  • (2025)FPGA-based component-wise LSTM training accelerator for neural granger causality analysisNeurocomputing10.1016/j.neucom.2024.128871615(128871)Online publication date: Jan-2025
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2019
360 pages
ISBN:9781450361378
DOI:10.1145/3289602
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2019

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. bank-balanced sparsity
  2. deep neural networks
  3. fpga
  4. inference
  5. lstm
  6. weight pruning

Qualifiers

  • Research-article

Funding Sources

  • National Nature Science Foundation of China

Conference

FPGA '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Upcoming Conference

FPGA '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)201
  • Downloads (Last 6 weeks)18
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From ScratchIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.343435944:2(497-511)Online publication date: Feb-2025
  • (2025)DSTC: Dual-Side Sparse Tensor Core for DNNs Acceleration on Modern GPU ArchitecturesIEEE Transactions on Computers10.1109/TC.2024.347581474:2(341-355)Online publication date: Feb-2025
  • (2025)FPGA-based component-wise LSTM training accelerator for neural granger causality analysisNeurocomputing10.1016/j.neucom.2024.128871615(128871)Online publication date: Jan-2025
  • (2025)Design of an Approximate Radix-2 FFT Butterfly Unit for LSTM-Speech signal-based Parkinson’s Disease ClassifierCircuits, Systems, and Signal Processing10.1007/s00034-025-03011-1Online publication date: 9-Feb-2025
  • (2024)CuMONITOR: Continuous Monitoring of Microarchitecture for Software Task Identification and ClassificationDigital Threats: Research and Practice10.1145/36528615:3(1-22)Online publication date: 28-Mar-2024
  • (2024)Spartus: A 9.4 TOp/s FPGA-Based LSTM Accelerator Exploiting Spatio-Temporal SparsityIEEE Transactions on Neural Networks and Learning Systems10.1109/TNNLS.2022.318020935:1(1098-1112)Online publication date: Jan-2024
  • (2024)SparseACC: A Generalized Linear Model Accelerator for Sparse DatasetsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.332427643:3(840-853)Online publication date: Mar-2024
  • (2024)PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block PruningIEEE Transactions on Computers10.1109/TC.2024.344185573:11(2576-2589)Online publication date: Nov-2024
  • (2024)LSTM-Based Model Compression for CAN Security in Intelligent VehiclesIEEE Transactions on Artificial Intelligence10.1109/TAI.2024.34381105:12(6457-6471)Online publication date: Dec-2024
  • (2024)Bringing Dynamic Sparsity to the Forefront for Low-Power Audio Edge Computing: Brain-inspired approach for sparsifying network updatesIEEE Solid-State Circuits Magazine10.1109/MSSC.2024.345529016:4(62-69)Online publication date: Nov-2025
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media