
Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Published: 24 February 2020
DOI: 10.1145/3373087.3375296

Abstract

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception: lower bounds have been proven and implemented for both shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O-minimizing algorithms, as they offer the programmer full control of memory accesses. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimizing algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction that allows us to support arbitrary data types and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project to encourage the open development of linear algebra and I/O-minimizing algorithms on reconfigurable hardware platforms.
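To ground the abstract's claim about I/O minimization: for multiplying two n×n matrices with a fast (on-chip) memory of M words, the classical red-blue pebble game result gives a lower bound of Ω(n³/√M) words moved to and from slow memory, which tiled algorithms approach by reusing each block loaded on chip as many times as possible. The sketch below illustrates that principle as a block-tiled kernel written in the Vivado HLS style the paper targets. It is a minimal illustration under stated assumptions, not the paper's generated architecture: the kernel name matmul_tiled, the tile size, the interface bundle names, and the pragma choices are all hypothetical, and n is assumed to be a multiple of TILE.

    // Illustrative sketch only (not the paper's architecture): a block-tiled
    // matrix multiplication C = A * B with on-chip buffering, in Vivado HLS
    // style C++. Assumes n is a multiple of TILE; names and pragmas are
    // hypothetical placeholders.
    constexpr int TILE = 64;  // assumed tile edge; tune to the BRAM budget

    extern "C" void matmul_tiled(const float *A, const float *B, float *C,
                                 int n) {
      // Off-chip memory interfaces (hypothetical bundle names).
      #pragma HLS INTERFACE m_axi port=A bundle=gmem0
      #pragma HLS INTERFACE m_axi port=B bundle=gmem1
      #pragma HLS INTERFACE m_axi port=C bundle=gmem2

      for (int bi = 0; bi < n; bi += TILE) {
        for (int bj = 0; bj < n; bj += TILE) {
          // The C tile lives on chip for the entire k loop, so each of its
          // elements crosses the off-chip boundary exactly once (at the end).
          float acc[TILE][TILE];
          for (int i = 0; i < TILE; ++i)
            for (int j = 0; j < TILE; ++j)
              acc[i][j] = 0.0f;

          for (int bk = 0; bk < n; bk += TILE) {
            // Load one tile of A and one tile of B from off-chip memory;
            // every loaded word is then reused TILE times below.
            float a[TILE][TILE], b[TILE][TILE];
            for (int i = 0; i < TILE; ++i)
              for (int k = 0; k < TILE; ++k)
                a[i][k] = A[(long)(bi + i) * n + bk + k];
            for (int k = 0; k < TILE; ++k)
              for (int j = 0; j < TILE; ++j)
                b[k][j] = B[(long)(bk + k) * n + bj + j];

            // On-chip tile product; each iteration of the pipelined inner
            // loop updates a different acc[i][j], so II=1 is feasible.
            for (int i = 0; i < TILE; ++i)
              for (int k = 0; k < TILE; ++k)
                for (int j = 0; j < TILE; ++j) {
                  #pragma HLS PIPELINE II=1
                  acc[i][j] += a[i][k] * b[k][j];
                }
          }

          // Write the finished C tile back once.
          for (int i = 0; i < TILE; ++i)
            for (int j = 0; j < TILE; ++j)
              C[(long)(bi + i) * n + bj + j] = acc[i][j];
        }
      }
    }

With this blocking, each TILE×TILE tile of A and B is fetched n/TILE times per output tile, for roughly 2n³/TILE words of off-chip reads plus n² words of writes, versus on the order of n³ reads per operand for the naive triple loop. Choosing the largest tile the on-chip memory allows, while keeping the compute units busy, is exactly the kind of resource-constrained trade-off the paper's model resolves.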



Information

Published In

FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2020
346 pages
ISBN: 9781450370998
DOI: 10.1145/3373087
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: this article was originally badged under the Version 1.0 guidelines (https://www.acm.org/publications/policies/artifact-review-badging).

Publication History

Published: 24 February 2020

Author Tags

  1. communication avoiding algorithms
  2. fpga
  3. gemm
  4. high-level synthesis
  5. hls
  6. i/o optimization
  7. linear algebra
  8. matrix multiplication
  9. systolic arrays
  10. vivado hls

Qualifiers

  • Research-article


Conference

FPGA '20
Sponsor: ACM SIGDA

Acceptance Rates

Overall acceptance rate: 125 of 627 submissions (20%)


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 127
  • Downloads (last 6 weeks): 9
Reflects downloads up to 21 Nov 2024

Cited By
  • (2024) CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture. ACM Transactions on Reconfigurable Technology and Systems 17(3), 1-31. DOI: 10.1145/3686163. Online publication date: 5-Aug-2024.
  • (2024) EA4RCA: Efficient AIE Accelerator Design Framework for Regular Communication-Avoiding Algorithm. ACM Transactions on Architecture and Code Optimization 21(4), 1-24. DOI: 10.1145/3678010. Online publication date: 15-Jul-2024.
  • (2024) Weave: Abstraction and Integration Flow for Accelerators of Generated Modules. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(3), 854-867. DOI: 10.1109/TCAD.2023.3325972. Online publication date: Mar-2024.
  • (2023) A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Computing Surveys 56(1), 1-37. DOI: 10.1145/3604802. Online publication date: 17-Jun-2023.
  • (2023) Co-design Hardware and Algorithm for Vector Search. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3581784.3607045. Online publication date: 12-Nov-2023.
  • (2023) CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 153-164. DOI: 10.1145/3543622.3573210. Online publication date: 12-Feb-2023.
  • (2023) Evaluating Embedded Hardware for High-Order Wavefront Sensing and Control. Techniques and Instrumentation for Detection of Exoplanets XI, 60. DOI: 10.1117/12.2677816. Online publication date: 5-Oct-2023.
  • (2023) Algorithm/Hardware Co-Optimization for Sparsity-Aware SpMM Acceleration of GNNs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42(12), 4763-4776. DOI: 10.1109/TCAD.2023.3281714. Online publication date: Dec-2023.
  • (2023) Automatic Generation of Spatial Accelerator for Tensor Algebra. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42(6), 1898-1911. DOI: 10.1109/TCAD.2022.3209949. Online publication date: Jun-2023.
  • (2023) FPGA Framework Improvements for HPC Applications. 2023 International Conference on Field Programmable Technology (ICFPT), 286-287. DOI: 10.1109/ICFPT59805.2023.00048. Online publication date: 12-Dec-2023.
  • Show More Cited By
