research-article

SOFF: an OpenCL high-level synthesis framework for FPGAs

Authors:

Jaejin Lee

ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture

Pages 295 - 308

https://doi.org/10.1109/ISCA45697.2020.00034

Published: 23 September 2020

Abstract

OpenCL is emerging as a programming model for energy-efficient FPGA accelerators. However, state-of-the-art OpenCL frameworks for FPGAs suffer from poor performance and usability. This paper proposes SOFF, an OpenCL high-level synthesis framework for FPGAs. SOFF automatically synthesizes a datapath that executes many OpenCL kernel threads in a pipelined manner. It also synthesizes an efficient memory subsystem for the datapath based on the characteristics of OpenCL kernels. Unlike previous high-level synthesis techniques, SOFF handles in a formal way the variable-latency instructions, complex control flows, OpenCL barriers, and atomic operations that appear in real-world OpenCL kernels. SOFF is the first OpenCL framework that correctly compiles and executes all applications in the SPEC ACCEL benchmark suite, except for three applications that require more FPGA resources than are available. In addition, SOFF achieves a speedup of 1.33 over Intel FPGA SDK for OpenCL without any explicit user annotation or source code modification.
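To give a rough sense of why executing many kernel threads in a pipelined manner pays off, the sketch below compares cycle counts under the standard textbook pipeline model. It is not SOFF's datapath or cost model: the function names, the fixed `pipeline_depth`, and the stall-free `initiation_interval` are all assumptions made for illustration, and the model ignores the variable-latency instructions, barriers, and atomic operations that the abstract says a real framework must handle.

```python
def sequential_cycles(num_work_items: int, pipeline_depth: int) -> int:
    """Cycles needed if each work-item fully finishes before the next one starts."""
    return num_work_items * pipeline_depth


def pipelined_cycles(
    num_work_items: int, pipeline_depth: int, initiation_interval: int = 1
) -> int:
    """Cycles for an idealized, stall-free pipeline (assumption, not SOFF's model).

    The first work-item takes pipeline_depth cycles to drain; each later
    work-item completes initiation_interval cycles after its predecessor.
    """
    if num_work_items == 0:
        return 0
    return pipeline_depth + (num_work_items - 1) * initiation_interval


if __name__ == "__main__":
    # Hypothetical numbers: 1024 work-items through a 20-stage datapath.
    seq = sequential_cycles(1024, 20)
    pipe = pipelined_cycles(1024, 20)
    print(f"sequential: {seq} cycles, pipelined: {pipe} cycles")
```

Under these assumed numbers the pipelined count approaches one work-item per cycle once the pipeline is full, which is the effect a pipelined datapath aims for. Any stall, such as a slow memory access, would raise the effective initiation interval.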





Published In

ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture

May 2020, 1152 pages

ISBN: 9781728146614

General Chairs: José Martínez (Cornell University), José Duato (Universitat Politècnica de València)

Program Chair: Lieven Eeckhout (Ghent University)

Sponsor: SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation: IEEE

Publisher: IEEE Press

Conference

ISCA '20: The 47th Annual International Symposium on Computer Architecture

May 30 - June 3, 2020

Virtual Event

