Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Published: 20 August 2019

Abstract

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lend themselves well to high-performance implementations. Many applications that depend on matrix multiplication can use reduced-precision integer or fixed-point representations to increase performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer matrix multiplication performance that scales with the required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.
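
As a rough illustration of the bit-serial idea the abstract refers to, the Python sketch below splits two unsigned integer matrices into bit planes, multiplies each pair of planes as binary matrices (an AND followed by a population count), and adds the partial products weighted by powers of two. It is a software sketch under stated assumptions, not BISMO's hardware design: the function names are hypothetical, and only unsigned operands are handled.

```python
def bit_plane(matrix, bit):
    """Return the 0/1 matrix holding bit position `bit` of every entry."""
    return [[(x >> bit) & 1 for x in row] for row in matrix]


def binary_matmul(a_plane, b_plane):
    """Multiply two 0/1 matrices; each dot product is an AND plus a popcount."""
    b_columns = list(zip(*b_plane))
    return [[sum(x & y for x, y in zip(row, col)) for col in b_columns]
            for row in a_plane]


def bit_serial_matmul(a, b, a_bits, b_bits):
    """Compute a @ b for unsigned entries below 2**a_bits and 2**b_bits.

    Assumption: operands are unsigned; signed handling is omitted here.
    """
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(a_bits):
        a_plane = bit_plane(a, i)
        for j in range(b_bits):
            partial = binary_matmul(a_plane, bit_plane(b, j))
            # Bit i of a times bit j of b contributes with weight 2**(i + j).
            for r in range(rows):
                for c in range(cols):
                    result[r][c] += partial[r][c] << (i + j)
    return result


a = [[2, 0], [1, 3]]
b = [[1, 2], [3, 0]]
product = bit_serial_matmul(a, b, a_bits=2, b_bits=2)
```

The loop performs `a_bits * b_bits` binary matrix multiplications, so the amount of work in this sketch grows with the precision requested, which is the sense in which performance scales with precision.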

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 12, Issue 3
Special Section on Security in FPGAs and Regular Articles
September 2019
150 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3357092
Editor: Deming Chen
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 August 2019
Accepted: 01 May 2019
Revised: 01 March 2019
Received: 01 December 2018
Published in TRETS Volume 12, Issue 3

Author Tags

  1. Bit serial
  2. FPGA
  3. matrix multiplication
  4. overlay

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Vetenskapsrådet

Cited By

  • (2024) High-efficiency Compressor Trees for Latest AMD FPGAs. ACM Transactions on Reconfigurable Technology and Systems 17:2 (1-32). DOI: 10.1145/3645097. Online publication date: 10-Feb-2024
  • (2024) Unleashing Network/Accelerator Co-Exploration Potential on FPGAs: A Deeper Joint Search. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:10 (3041-3054). DOI: 10.1109/TCAD.2024.3391688. Online publication date: 1-Oct-2024
  • (2023) Booth Encoded Bit-Serial Multiply-Accumulate Units with Improved Area and Energy Efficiencies. Electronics 12:10 (2177). DOI: 10.3390/electronics12102177. Online publication date: 10-May-2023
  • (2023) BISDU: A Bit-Serial Dot-Product Unit for Microcontrollers. ACM Transactions on Embedded Computing Systems 22:5 (1-22). DOI: 10.1145/3608447. Online publication date: 26-Sep-2023
  • (2023) On the RTL Implementation of FINN Matrix Vector Unit. ACM Transactions on Embedded Computing Systems 22:6 (1-27). DOI: 10.1145/3547141. Online publication date: 9-Nov-2023
  • (2023) Co-designing an FPGA-Accelerated Encryption Library With PYNQ: The Pynqrypt Case Study. IEEE EUROCON 2023 - 20th International Conference on Smart Technologies (683-688). DOI: 10.1109/EUROCON56442.2023.10198938. Online publication date: 6-Jul-2023
  • (2023) Dynamic Multi-bit Parallel Computing Method Based on Reconfigurable Structure. Algorithms and Architectures for Parallel Processing (347-359). DOI: 10.1007/978-981-97-0801-7_20. Online publication date: 20-Oct-2023
  • (2022) Analysis and Comparison of Different Approaches to Implementing a Network-Based Parallel Data Processing Algorithm. Journal of Low Power Electronics and Applications 12:3 (38). DOI: 10.3390/jlpea12030038. Online publication date: 9-Jul-2022
  • (2022) Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs. ACM Computing Surveys 55:5 (1-48). DOI: 10.1145/3532989. Online publication date: 3-Dec-2022
  • (2021) Accelerating Population Count with a Hardware Co-Processor for MicroBlaze. Journal of Low Power Electronics and Applications 11:2 (20). DOI: 10.3390/jlpea11020020. Online publication date: 24-Apr-2021
