
Automated parallel execution of distributed task graphs with FPGA clusters

Published: 18 October 2024 Publication History

Abstract

Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at a low energy cost. However, the differing characteristics, architectures, and network topologies of FPGA clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension of OmpSs-2, that aims to unify FPGA clusters through a message-passing interface compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), in which the user does not call any message-passing API; instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs with a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting application scalability. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that for N-body and Heat the FPGA clusters achieve 2.6x and 2.4x better performance per watt, respectively, than a CPU-only supercomputer.

Highlights

High-level task-based programming model for FPGA clusters with MPI-like communication.
High performance computing applications can be easily adapted to FPGA clusters.
Automatic MPI communication inferred by the runtime; users do not write MPI API calls.
Easily portable code between AMD (Xilinx) and Intel FPGAs, applications tested on both.
N-body, Heat, and Cholesky implementations on cloudFPGA and ESSPER, written in C.



          Published In

          Future Generation Computer Systems  Volume 160, Issue C
          Nov 2024
          966 pages

          Publisher

          Elsevier Science Publishers B. V.

          Netherlands


          Author Tags

          1. FPGA
          2. MPI
          3. Task graphs
          4. Heterogeneous computing
          5. High performance computing
          6. Programming models
          7. Distributed computing

          Qualifiers

          • Review-article
