Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Authors:

Yaman Umuroglu,

Davide Conficconi,

Lahiru Rasnayake,

Thomas B. Preusser,

Magnus Själander

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 12, Issue 3

Article No.: 15, Pages 1 - 24

https://doi.org/10.1145/3337929

Published: 20 August 2019

Abstract

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.
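The precision scaling described in the abstract rests on the bit-serial decomposition of an integer matrix product. The block below is a minimal software sketch of that decomposition, assuming unsigned operands; it is not the BISMO hardware design or its API, and the function names (`bit_plane`, `binary_matmul`, `bit_serial_matmul`) are hypothetical. Each operand is split into one-bit planes, every pair of planes is multiplied with AND and popcount, and the partial result is weighted by 2^(i+j).

```python
def bit_plane(matrix, bit):
    """Extract one bit position from every element, giving a 0/1 matrix."""
    return [[(x >> bit) & 1 for x in row] for row in matrix]


def binary_matmul(lhs, rhs):
    """Multiply two 0/1 matrices: each dot product is an AND plus a popcount."""
    rows, inner, cols = len(lhs), len(rhs), len(rhs[0])
    return [
        [sum(lhs[r][k] & rhs[k][c] for k in range(inner)) for c in range(cols)]
        for r in range(rows)
    ]


def bit_serial_matmul(a, b, a_bits, b_bits):
    """Unsigned a @ b, assuming a fits in a_bits bits and b in b_bits bits.

    Signed operands are deliberately left out of this sketch.
    """
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(a_bits):
        a_plane = bit_plane(a, i)
        for j in range(b_bits):
            partial = binary_matmul(a_plane, bit_plane(b, j))
            # Weight the binary partial product by 2^(i + j) and accumulate.
            for r in range(rows):
                for c in range(cols):
                    result[r][c] += partial[r][c] << (i + j)
    return result


a = [[1, 2], [3, 0]]
b = [[2, 1], [1, 3]]
product = bit_serial_matmul(a, b, a_bits=2, b_bits=2)
```

The number of binary matrix products is `a_bits * b_bits`, so in this sketch the work grows with the precision of both operands and shrinks when fewer bits are needed.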

Index Terms

Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing
1. Computer systems organization
  1. Architectures
    1. Serial architectures
      1. Pipeline computing
2. Hardware
  1. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators

Information & Contributors

Information

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 12, Issue 3

Special Section on Security in FPGAs and Regular Articles

September 2019

150 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/3357092

Editor:
Deming Chen
University of Illinois, Urbana-Champaign Urbana

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 August 2019

Accepted: 01 May 2019

Revised: 01 March 2019

Received: 01 December 2018

Published in TRETS Volume 12, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Vetenskapsrådet
