
Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Published: 24 February 2020
DOI: 10.1145/3373087.3375296

Abstract

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception: lower bounds have been proven and implemented for both shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O-minimizing algorithms, as they offer the programmer full control of memory accesses. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimizing algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction that allows us to support arbitrary data types and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project to encourage the open development of linear algebra and I/O-minimizing algorithms on reconfigurable hardware platforms.
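To ground the abstract's claim about I/O minimization: for multiplying two n×n matrices with a fast (on-chip) memory of M words, the classical red-blue pebble game result gives a lower bound of Ω(n³/√M) words moved to and from slow memory, which tiled algorithms approach by reusing each block loaded on chip as many times as possible. The sketch below illustrates that principle as a block-tiled kernel written in the Vivado HLS style the paper targets. It is a minimal illustration under stated assumptions, not the paper's generated architecture: the kernel name matmul_tiled, the tile size, the interface bundle names, and the pragma choices are all hypothetical, and n is assumed to be a multiple of TILE.

    // Illustrative sketch only (not the paper's architecture): a block-tiled
    // matrix multiplication C = A * B with on-chip buffering, in Vivado HLS
    // style C++. Assumes n is a multiple of TILE; names and pragmas are
    // hypothetical placeholders.
    constexpr int TILE = 64;  // assumed tile edge; tune to the BRAM budget

    extern "C" void matmul_tiled(const float *A, const float *B, float *C,
                                 int n) {
      // Off-chip memory interfaces (hypothetical bundle names).
      #pragma HLS INTERFACE m_axi port=A bundle=gmem0
      #pragma HLS INTERFACE m_axi port=B bundle=gmem1
      #pragma HLS INTERFACE m_axi port=C bundle=gmem2

      for (int bi = 0; bi < n; bi += TILE) {
        for (int bj = 0; bj < n; bj += TILE) {
          // The C tile lives on chip for the entire k loop, so each of its
          // elements crosses the off-chip boundary exactly once (at the end).
          float acc[TILE][TILE];
          for (int i = 0; i < TILE; ++i)
            for (int j = 0; j < TILE; ++j)
              acc[i][j] = 0.0f;

          for (int bk = 0; bk < n; bk += TILE) {
            // Load one tile of A and one tile of B from off-chip memory;
            // every loaded word is then reused TILE times below.
            float a[TILE][TILE], b[TILE][TILE];
            for (int i = 0; i < TILE; ++i)
              for (int k = 0; k < TILE; ++k)
                a[i][k] = A[(long)(bi + i) * n + bk + k];
            for (int k = 0; k < TILE; ++k)
              for (int j = 0; j < TILE; ++j)
                b[k][j] = B[(long)(bk + k) * n + bj + j];

            // On-chip tile product; each iteration of the pipelined inner
            // loop updates a different acc[i][j], so II=1 is feasible.
            for (int i = 0; i < TILE; ++i)
              for (int k = 0; k < TILE; ++k)
                for (int j = 0; j < TILE; ++j) {
                  #pragma HLS PIPELINE II=1
                  acc[i][j] += a[i][k] * b[k][j];
                }
          }

          // Write the finished C tile back once.
          for (int i = 0; i < TILE; ++i)
            for (int j = 0; j < TILE; ++j)
              C[(long)(bi + i) * n + bj + j] = acc[i][j];
        }
      }
    }

With this blocking, each TILE×TILE tile of A and B is fetched n/TILE times per output tile, for roughly 2n³/TILE words of off-chip reads plus n² words of writes, versus on the order of n³ reads per operand for the naive triple loop. Choosing the largest tile the on-chip memory allows, while keeping the compute units busy, is exactly the kind of resource-constrained trade-off the paper's model resolves.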



Information

Published In

FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2020
346 pages
ISBN: 9781450370998
DOI: 10.1145/3373087
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: this article was originally badged under the Version 1.0 guidelines (https://www.acm.org/publications/policies/artifact-review-badging).

Publication History

Published: 24 February 2020

Author Tags

  1. communication avoiding algorithms
  2. fpga
  3. gemm
  4. high-level synthesis
  5. hls
  6. i/o optimization
  7. linear algebra
  8. matrix multiplication
  9. systolic arrays
  10. vivado hls

Qualifiers

  • Research-article


Conference

FPGA '20
Sponsor: ACM SIGDA

Acceptance Rates

Overall acceptance rate: 125 of 627 submissions (20%)


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 127
  • Downloads (last 6 weeks): 9
Reflects downloads up to 21 Nov 2024

Cited By
  • (2024) CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture. ACM Transactions on Reconfigurable Technology and Systems 17(3), 1-31. DOI: 10.1145/3686163. Online publication date: 5-Aug-2024.
  • (2024) EA4RCA: Efficient AIE Accelerator Design Framework for Regular Communication-Avoiding Algorithm. ACM Transactions on Architecture and Code Optimization 21(4), 1-24. DOI: 10.1145/3678010. Online publication date: 15-Jul-2024.
  • (2024) Weave: Abstraction and Integration Flow for Accelerators of Generated Modules. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(3), 854-867. DOI: 10.1109/TCAD.2023.3325972. Online publication date: Mar-2024.
  • (2023) A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Computing Surveys 56(1), 1-37. DOI: 10.1145/3604802. Online publication date: 17-Jun-2023.
  • (2023) Co-design Hardware and Algorithm for Vector Search. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3581784.3607045. Online publication date: 12-Nov-2023.
  • (2023) CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 153-164. DOI: 10.1145/3543622.3573210. Online publication date: 12-Feb-2023.
  • (2023) Evaluating Embedded Hardware for High-Order Wavefront Sensing and Control. Techniques and Instrumentation for Detection of Exoplanets XI, 60. DOI: 10.1117/12.2677816. Online publication date: 5-Oct-2023.
  • (2023) Algorithm/Hardware Co-Optimization for Sparsity-Aware SpMM Acceleration of GNNs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42(12), 4763-4776. DOI: 10.1109/TCAD.2023.3281714. Online publication date: Dec-2023.
  • (2023) Automatic Generation of Spatial Accelerator for Tensor Algebra. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42(6), 1898-1911. DOI: 10.1109/TCAD.2022.3209949. Online publication date: Jun-2023.
  • (2023) FPGA Framework Improvements for HPC Applications. 2023 International Conference on Field Programmable Technology (ICFPT), 286-287. DOI: 10.1109/ICFPT59805.2023.00048. Online publication date: 12-Dec-2023.
  • Show More Cited By
