
Research article · DOI: 10.1109/ICCAD.2017.8203844

HLScope+: Fast and accurate performance estimation for FPGA HLS

Published: 13 November 2017

Abstract

High-level synthesis (HLS) tools have vastly increased the productivity of field-programmable gate array (FPGA) programmers through design automation and abstraction. The side effect, however, is that many architectural details are hidden from programmers, so those who wish to improve the performance of their designs often have difficulty identifying the performance bottleneck. Current HLS tools do provide a performance estimate when loop counts are fixed, but they often fail to do so for programs with input-dependent execution behavior. Moreover, their external memory latency model does not accurately reflect the actual bus-based shared memory architecture. This work describes a high-level cycle estimation methodology that addresses these problems. To reduce the time overhead, we propose a cycle estimation process that is combined with HLS software simulation. We also present an automatic code instrumentation technique that accurately identifies the cause of stalls in on-board execution. Experimental results show that our framework provides cycle estimates with average error rates of 1.1% and 5.0% for compute- and DRAM-bound modules, respectively, on the ADM-PCIE-7V3 board. The proposed method is about two orders of magnitude faster than FPGA bitstream generation.


Cited By

  • (2024) HLPerf: Demystifying the Performance of HLS-based Graph Neural Networks with Dataflow Architectures. ACM Transactions on Reconfigurable Technology and Systems, 18(1):1–26. DOI: 10.1145/3655627. Online: 17 Dec 2024.
  • (2022) FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Transactions on Reconfigurable Technology and Systems, 15(4):1–42. DOI: 10.1145/3530775. Online: 21 Apr 2022.
  • (2022) HLS_Profiler. Proceedings of the 2022 ACM/SPEC International Conference on Performance Engineering, 187–198. DOI: 10.1145/3489525.3511684. Online: 9 Apr 2022.


Published In

2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2017, 1049 pages
Publisher: IEEE Press
