On the RTL Implementation of FINN Matrix Vector Unit

Published: 09 November 2023

Abstract

Field-programmable gate array (FPGA)-based accelerators are becoming increasingly popular for deep neural network (DNN) inference due to their ability to scale performance with increasing degrees of specialization, through dataflow architectures or custom data-type precision. To lower the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide a higher level of abstraction than register-transfer level (RTL)-based design. HLS offers faster development time, better maintainability, and more flexibility in code exploration when evaluating several options for multi-dimensional tensors, convolutional layers, or different degrees of parallelism. For these reasons, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml.
In this article, we present an alternative backend library for FINN, leveraging RTL. We investigate and evaluate, across a spectrum of design dimensions, the pros and cons of an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits than HLS. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, by up to around 15%. On the other hand, HLS consistently requires more flip-flops (FFs; with an orders-of-magnitude difference for smaller designs) and block RAMs (BRAMs; 2× more). This also impacts the critical path delay, with RTL producing significantly faster circuits, by up to around 80%. RTL also benefits from at least a 10× reduction in synthesis time. Finally, the results were validated in practice on two real-world use cases: a multi-layer perceptron (MLP) used in network intrusion detection and a convolutional network, ResNet, used in image recognition. Overall, since HLS frameworks code-generate the hardware design, the ease of design entry matters less; the gains in synthesis time, together with the design-dependent resource savings, make the RTL abstraction an attractive alternative.
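For context on the computation under discussion: FINN's matrix-vector unit (MVU) computes the weight-matrix-by-activation-vector products at the core of each quantized layer, folded over a configurable number of processing elements (PE) and input lanes (SIMD). The plain C++ sketch below mirrors that folded loop structure for illustration only; it is a minimal functional model under assumed dimensions (matrix height divisible by PE, width divisible by SIMD), not the actual FINN HLS or RTL library code.

    // Illustrative functional model of a PE/SIMD-folded matrix-vector product,
    // in the style of a FINN MVU. Hypothetical sketch, not FINN library code.
    #include <cstdint>
    #include <vector>

    // Computes y = W * x for an (MH x MW) weight matrix. Each pass over the
    // synapse fold accumulates SIMD inputs for PE output rows at a time.
    // Assumes MH % PE == 0 and MW % SIMD == 0.
    template <int PE, int SIMD>
    std::vector<int32_t> mvu(const std::vector<std::vector<int8_t>>& W,
                             const std::vector<int8_t>& x) {
        const int MH = static_cast<int>(W.size());  // matrix height (outputs)
        const int MW = static_cast<int>(x.size());  // matrix width (inputs)
        std::vector<int32_t> y(MH, 0);
        for (int nf = 0; nf < MH / PE; ++nf) {          // neuron fold
            for (int sf = 0; sf < MW / SIMD; ++sf) {    // synapse fold
                for (int pe = 0; pe < PE; ++pe) {       // parallel PEs in hardware
                    int32_t acc = 0;
                    for (int s = 0; s < SIMD; ++s) {    // parallel SIMD lanes
                        acc += W[nf * PE + pe][sf * SIMD + s] * x[sf * SIMD + s];
                    }
                    y[nf * PE + pe] += acc;
                }
            }
        }
        return y;
    }

In an HLS flow, loops of this shape are annotated with unrolling and pipelining pragmas, whereas an RTL backend instantiates the PE and SIMD parallelism structurally; that difference in how the same folding is realized is where the resource and timing gaps reported above originate.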

Published In

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 6
November 2023
428 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3632298
Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 November 2023
Online AM: 14 July 2022
Accepted: 02 July 2022
Revised: 04 May 2022
Received: 29 December 2021
Published in TECS Volume 22, Issue 6

Author Tags

  1. FINN
  2. convolutional neural network
  3. HLS
  4. RTL
  5. FPGA

Qualifiers

  • Research-article

Funding Sources

  • Science Foundation Ireland
  • European Union’s Horizon 2020 research and innovation programme

Article Metrics

  • Downloads (last 12 months): 270
  • Downloads (last 6 weeks): 21
Reflects downloads up to 20 Nov 2024

Cited By

  • (2024) "Beam Orbital Parameter Prediction Based on the Deployment of Cascaded Neural Networks at Edge Intelligence Acceleration Nodes". Electronics 13, 21, 4189. DOI: 10.3390/electronics13214189. Online publication date: 25-Oct-2024.
  • (2024) "Quantized Neural Network Architecture for Hardware Efficient Real-Time 4K Image Super-Resolution". 2024 28th International Symposium on VLSI Design and Test (VDAT), 1-5. DOI: 10.1109/VDAT63601.2024.10705704. Online publication date: 1-Sep-2024.
  • (2023) "Towards Deploying Highly Quantized Neural Networks on FPGA Using Chisel". 2023 26th Euromicro Conference on Digital System Design (DSD), 161-167. DOI: 10.1109/DSD60849.2023.00032. Online publication date: 6-Sep-2023.
  • (2023) "A Configurable Mixed-Precision Convolution Processing Unit Generator in Chisel". 2023 26th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS), 128-131. DOI: 10.1109/DDECS57882.2023.10139758. Online publication date: 3-May-2023.
  • (2023) "A critical review on the state-of-the-art and future prospects of machine learning for Earth observation operations". Advances in Space Research 71, 12, 4959-4986. DOI: 10.1016/j.asr.2023.02.025. Online publication date: Jun-2023.
  • (2023) "Development an efficient AXI-interconnect unit between set of customized peripheral devices and an implemented dual-core RISC-V processor". The Journal of Supercomputing 79, 15, 17000-17019. DOI: 10.1007/s11227-023-05304-1. Online publication date: 5-May-2023.
