FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA

Published: 11 March 2023

Abstract

With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for three reasons: (1) the different dimensions within same-type layers, (2) the different convolution layer types, especially transposed and dilated convolutions, and (3) CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve a 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, up to 5.98× and 13.42× for transposed and dilated convolutions, respectively, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.

1 Introduction

Convolutional Neural Networks (CNNs) are widely used in many machine learning (ML) applications and have evolved quickly over the years. There is a growing interest in FPGA for accelerating CNN computation due to its high energy efficiency and performance (e.g., References [6, 7, 22, 31, 37, 44, 48, 52, 59, 60, 62]). However, the recent advancement in CNN models and FPGA-based CNN acceleration has brought several new challenges.
Challenge 1: Performance disparity within CNN layers of the same type: In CNNs, layers of the same type (normal convolution layers, for instance) can have different characteristics in terms of their input and output number of channels, feature map size, and kernel size. This changes the computation to communication (CTC) ratio from layer to layer. Therefore, it is important to handle these layers differently given the performance disparity across them. We found that tiling factors play an important role in performance. Zhang et al. [59] showed that the CTC ratio of a single convolution layer varies with different tiling factors. Yang et al. [55] highlighted the importance of choosing proper tiling factors for data reuse in the nearer, faster memory (on-chip storage for FPGAs) for the overall latency and energy efficiency. These studies lead us to consider using different tiling factors across the network. Figure 1 depicts how different tiling factors can affect the performance of each layer in one CNN network. We compare the performance of using a single set of tiling factors (uniform tiling) to using different tiling factors for each layer (dynamic tiling). For the uniform tiling, we chose the tiling factor that minimizes the latency of the entire network. For the dynamic tiling, we focused on each layer and selected the best tiling factor accordingly. Experimental results show that dynamic tiling can speed up the performance of the whole network by \(1.7\times\).
Fig. 1.
Fig. 1. Performance comparison of designs using uniform and dynamic tiling factors for the first 24 convolutional layers in the CNN network in Figure 3.
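To make the dependence of the CTC ratio on tiling factors concrete, the following Python sketch estimates the CTC ratio of a single convolution layer under a candidate tiling. It is a deliberately simplified reuse model with a hypothetical layer shape, not the analytical model used by our design space exploration (Section 6.2): it assumes input tiles are re-read once per output-channel tile and weights once per spatial tile.

```python
import math

def ctc_ratio(H, W, N, M, K, Th, Tw, Tn, Tm):
    """MACs per off-chip word for one tiled convolution layer, assuming input tiles
    are re-read once per output-channel tile and weights once per spatial tile."""
    macs = H * W * N * M * K * K
    in_words = math.ceil(M / Tm) * N * (H + K - 1) * (W + K - 1)
    wt_words = math.ceil(H / Th) * math.ceil(W / Tw) * N * M * K * K
    out_words = M * H * W
    return macs / (in_words + wt_words + out_words)

# The same (hypothetical) layer under three tilings: the CTC ratio changes substantially.
for tiling in [(8, 24, 16, 16), (16, 48, 16, 32), (32, 96, 32, 64)]:
    print(tiling, round(ctc_ratio(96, 96, 96, 128, 3, *tiling), 1))
```

Combined with the on-chip buffer limits and the divisibility effects discussed in Section 4.4, this is why the best tiling choice becomes layer-dependent.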
Challenge 2: The inefficiency of general-purpose CNN accelerators in processing special CNN layers: Many modern CNNs feature complex architecture topologies with different layer types. One of these special layers is a fractionally strided or transposed convolution (T-CONV) layer [21] (also referred to as a deconvolution layer). It is an upsampling layer that uses trained weights to produce enlarged high-resolution feature maps. T-CONV layers are often used in image segmentation networks and generative adversarial networks such as U-Net [40], DCGAN [39], ArtGAN [49], DiscoGAN [30], FSRCNN [20], to name a few. An atrous or dilated convolution (D-CONV) layer is another special layer that maintains the resolution and coverage of feature maps by expanding the receptive fields of convolution filters as discussed in Reference [57]. One famous network that uses D-CONV layers is CSRNet [32]. Some CNNs include a mixture of convolution layers such as E-Net [38], where normal convolution (N-CONV), transposed convolution, dilated convolution, and asymmetric convolution (A-CONV) layers are used. Both T-CONV and D-CONV layers can be naïvely implemented as normal convolution layers. However, such implementations introduce many zeros in the input feature maps of T-CONV layers and in the convolution filters of D-CONV layers, leading to a huge underutilization of the FPGA resources. To tackle this problem, we use a decomposition-based approach (discussed in Section 5) to implement N-CONV, T-CONV, and D-CONV layers efficiently in one versatile systolic array on an FPGA with minimal area overhead. Moreover, other networks such as MobileNetV1 [26] use depth-wise separable convolution layers introduced in Reference [45] to decrease the computation cost. MobileNetV2 [41] introduced residual bottleneck block (RBB) to further reduce the computation complexity. These layers reduce the computation cost but keep the same feature map size; this can make the layer more communication-bound and reduce the computation efficiency.
Challenge 3: Integration overheads of using FPGA in ML frameworks: When processing a CNN application in a modern ML framework such as TensorFlow [5], the complete stack consists of reading the input, computing the CNN, processing the result, and displaying and writing the result. Previous works have only focused on optimizing the CNN kernel on FPGA (e.g., References [6, 7, 22, 31, 48, 52, 59, 62]), since CNN computation is the most time-consuming step of the whole stack; the remaining overheads are simply ignored. While several works [22, 37] have focused on accelerator generation from TensorFlow-described networks, they did not address the challenges of integrating an accelerator into TensorFlow. By integrating our accelerator with TensorFlow, we are able to directly run networks from TensorFlow on an FPGA. Integrating an FPGA into TensorFlow introduces a new set of overheads: the communication between TensorFlow and the FPGA and the communication between the host and the FPGA kernel itself. Figure 2 shows the breakdown of the end-to-end runtime for processing a 384 \(\times\) 384 RGB image using the network in Figure 3. These steps are listed and described in Section 7. The CNN processing time on our accelerator, denoted as the kernel, accounts for only 11.8% of the total runtime. This emphasizes the need for end-to-end SW/HW co-optimization. Our experiments show that this optimization can increase the end-to-end performance of this network from 4.8 FPS to 23.8 FPS, leading to a 5\(\times\) speedup.
Fig. 2.
Fig. 2. Runtime breakdown of an FPGA-based CNN acceleration pipeline in TensorFlow.
Fig. 3.
Fig. 3. OpenPose-V2 CNN architecture.
To solve the challenges above, we propose an FPGA-based CNN framework named FlexCNN. Its architecture employs dynamic tiling, layer fusion, and data layout transformation to adapt to the performance disparity of different CNN layers. Another major component of the architecture is our versatile systolic array, which can efficiently process different convolution layer types. The framework has a compilation flow that takes a CNN as an input, performs design space exploration, and generates an optimized hardware accelerator to run on FPGA. The accelerator is further integrated into a software-hardware pipeline to mitigate the large integration overheads by overlapping the software execution with the hardware computation.
A preliminary version of FlexCNN [47] was published in FPGA 2020. The new contributions in this article include: (1) a novel efficient versatile systolic array for normal, transposed, dilated, and asymmetric convolution layers; (2) ONNX support to handle multiple ML frameworks including TensorFlow, PyTorch, and Caffe; (3) code generation for the new TAPA [15] framework, which is integrated with AutoBridge [25] to improve design frequency; and (4) the implementations of U-Net, E-Net, and VGG-16 CNNs on FPGA using the FlexCNN framework.
In summary, the overall contributions of this work are:
An efficient, flexible, and composable dataflow architecture employing dynamic tiling, layer fusion, and data layout optimization to support a wide variety of CNNs;
A novel versatile systolic array that can efficiently process normal, transposed, dilated, and asymmetric convolution layers;
An automated compilation flow that takes a CNN dataflow graph as an input, maps it to the hardware dataflow graph, and performs a design space exploration to generate an optimized accelerator on FPGA;
A software-hardware pipelining scheme that can improve the end-to-end performance of CNNs;
Real-time efficient implementations of OpenPose, U-Net, E-Net, and VGG-16 CNNs on FPGA.

2 Framework Overview

FlexCNN is an end-to-end framework for automatic hardware acceleration of CNNs on FPGA. FlexCNN implements a flexible and composable dataflow architecture that can be tailored for a variety of complex real-world CNNs. Section 4 discusses the architecture and its multiple optimization techniques, such as dynamic tiling, layer fusion and layer parallelization, and data layout optimization. Another important component of the FlexCNN architecture is our novel versatile systolic array, which we discuss in Section 5. The versatile systolic array can efficiently process N-CONV, T-CONV, D-CONV, and A-CONV layers. Section 6 reviews the automated compilation flow. It takes an ONNX CNN model and an ordered list of FlexCNN modules as inputs, then outputs an optimized FPGA accelerator. The compilation tool maps the CNN dataflow graph to the given FlexCNN architecture, performs design space exploration for the best hardware parameters, and generates the synthesizable code for the architecture. Furthermore, FlexCNN implements a software-hardware pipelining technique (discussed in Section 7) to overlap the software overheads with the hardware execution, reducing the end-to-end runtime of a CNN’s inference.

3 Applications

This section introduces the new layer types and building blocks used in the three real-world CNN applications: OpenPose, U-Net, and E-Net. It then highlights the application domain, architecture, and layer types of each CNN.

3.1 New Layers and Building Blocks

3.1.1 Depthwise Separable Convolution (DSC).

In a normal convolution layer (N-CONV), the feature maps are filtered and combined in one step. The DSC splits this step into two phases. The first phase, depthwise convolution (DW), does the filtering, and the second phase, pointwise convolution (PW), combines the produced filtered feature maps using 1 \(\times\) 1 kernels.
A conv layer takes \(N\) feature maps as the input, each of size \(H \times W\). It uses \(M \times N \times K \times K\) kernels to produce M channels for the output. The total computation cost for this layer is \(M \times N \times H \times W\) \(\times\) \(K \times K\).
However, a DSC uses \(N \times K \times K\) kernels for DW and \(M \times N \times 1 \times 1\) kernels for PW. With this change, the amount of computation is reduced to a fraction \(\frac{1}{M} + \frac{1}{K^2}\) of the original cost [26].
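As a quick numerical check of this ratio (the layer dimensions below are hypothetical, chosen only for illustration):

```python
def conv_macs(H, W, N, M, K):
    """MACs for a standard convolution with stride 1 and 'same' output size."""
    return M * N * H * W * K * K

def dsc_macs(H, W, N, M, K):
    """MACs for a depthwise separable convolution: K x K depthwise + 1 x 1 pointwise."""
    return N * H * W * K * K + M * N * H * W

# Example: a 3 x 3 layer with 256 input and 256 output channels on a 56 x 56 map.
H, W, N, M, K = 56, 56, 256, 256, 3
ratio = dsc_macs(H, W, N, M, K) / conv_macs(H, W, N, M, K)
print(ratio, 1 / M + 1 / K ** 2)   # both ~0.115, i.e., roughly an 8.7x reduction
```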

3.1.2 Residual Bottleneck Block.

Google introduced RBB in MobileNetV2 [41] to reduce the computation cost. It consists of a 1 \(\times\) 1 conv followed by a 3 \(\times\) 3 DW and then another 1 \(\times\) 1 conv, each of which is followed by ReLU and a batch normalization layer. The 1 \(\times\) 1 convolutions are used for dimension reduction or restoration. The nature of this block allows us to reduce the number of input and output channels. This reduces the computation intensity and makes the network more efficient.

3.1.3 Special Convolution Layers.

Recent CNNs have introduced variations of the normal convolution layers such as:
A fractionally strided or transposed convolution (T-CONV) layer [21] (also referred to as a deconvolution layer) is an upsampling layer that uses trained weights to produce enlarged high-resolution feature maps. T-CONV layers are often used in image segmentation networks and generative adversarial networks such as U-Net [40], DCGAN [39], ArtGAN [49], DiscoGAN [30], and FSRCNN [20].
An atrous or dilated convolution (D-CONV) layer is another special layer that maintains the resolution and coverage of feature maps by expanding the receptive fields of convolution filters as discussed in Reference [57]. One famous network that uses D-CONV layers is CSRNet [32].
An asymmetric convolution (A-CONV) layer is a normal convolution layer that uses asymmetric filter sizes such as 1 \(\times\) 5 or 5 \(\times\) 1 filters. In terms of hardware acceleration, this layer requires extra logic to handle each dimension of the filters separately.

3.2 OpenPose

OpenPose [8], the winner of the COCO 2016 Keypoints Challenge, detects 2D poses of multiple people in an image. The OpenPose network first extracts the features of the input image using the first 10 layers of VGG-19 [46]. This is the backbone of the network. These feature maps are the inputs to a two-branch network. The first branch detects confidence maps, representing body part locations, and the second branch detects part affinity fields, a set of 2D vectors showing the location and orientation of the limbs. The results of these two branches are concatenated with the feature maps from the backbone network and form the input for the next stage. After several iterations, these branches produce the final predictions.
This network is interesting to us, since it has an irregular architecture compared to modern CNN-based deep-learning applications. Instead of just a linear forward path where each layer consumes the result of its previous layer, it has concatenation layers that need extra data movement. Moreover, to reduce the computation complexity of the network, we use a modified version of OpenPose [29] that replaces the backbone with a modification of MobileNetV2 [41] and employs DSC [45] for the rest of the network, following the trend in the ML community. Figure 3 depicts the network topology of this version; we call this network OpenPose-V2. Due to the space limitation, we only show the convolutional layers. Each convolution is followed by ReLU and batch normalization layers.

3.3 U-Net

U-Net [40] is a famous CNN used for biomedical image segmentation. It is an encoder-decoder neural network, where the encoder includes four downsampling blocks, and the decoder part is made of four upsampling blocks. Furthermore, the outputs of each downsampling block are added with the inputs of the corresponding upsampling block. Figure 4 illustrates U-Net’s architecture and building blocks. We chose U-Net with its irregular architecture topology and various layer types, including T-CONV layers, to show the effectiveness of our versatile systolic array for a real-world application. Table 1 demonstrates the breakdown of U-Net’s layers and the number of Giga floating-point operations (GFLOPs).
Fig. 4.
Fig. 4. U-Net downsampling and upsampling blocks and CNN architecture.
Table 1.
Layer Type | Layer Count | Actual GFLOPs
N-CONV | 13 | 9.9
T-CONV | 4 | 2.1
MaxPool | 4 | 0
Add | 8 | 0
Batch Normalization | 9 | 0
Leaky ReLU | 13 | 0
Total GFLOPs | | 12.0
Table 1. U-Net Layers and Number of Operations

3.4 E-Net

E-Net [38] is a lightweight CNN used for pixel-wise semantic segmentation in real time. Like U-Net, E-Net has an encoder-decoder structure, where the feature maps are downsampled by a factor of 4 and then upsampled by a factor of 4 to restore the original image size. The network is made of bottleneck blocks. Each bottleneck block gets its input from the preceding block, processes the data in two branches, and then merges the two branches with an Add layer to be sent to the next block. The first branch contains a MaxPooling layer for the encoder blocks, an upsampling layer for the decoder blocks, or it can be empty for the intermediate blocks. The second branch contains N-CONV layers, D-CONV layers, A-CONV layers, or T-CONV layers. Figure 5 illustrates E-Net’s architecture and bottleneck blocks. We chose E-Net to test our framework with such a complex CNN topology and various layer types. E-Net represents a stress test for our compilation framework, which needs to map the complex CNN graph into the FlexCNN architecture. E-Net also contains all four types of convolution layers, which makes it a perfect test case for our versatile systolic array. All the layers used in E-Net and the number of Giga operations (GOPs) are shown in Table 2. The logical GOPs is the number of operations including the zero multiply-accumulate (MAC) operations of T-CONV and D-CONV layers.
Fig. 5.
Fig. 5. E-Net bottleneck blocks and CNN architecture.
Table 2.
Layer Type | Layer Count | Actual GOPs | Logical GOPs\(^{*}\)
N-CONV | 70 | 0.886 | 0.886
D-CONV | 8 | 0.191 | 7.878
T-CONV | 3 | 0.015 | 0.062
A-CONV | 8 | 0.106 | 0.106
MaxPool | 3 | 0 | 0
PReLU | 66 | 0 | 0
Add | 27 | 0 | 0
Batch Normalization | 6 | 0 | 0
ReLU | 15 | 0 | 0
Upsample | 2 | 0 | 0
Concat | 1 | 0 | 0
Total GOPs | | 1.198 | 8.931
Table 2. E-Net Layers and Number of Operations
\(^{*}\)Number of operations including the zeros for T-CONV and D-CONV layers.

4 FlexCNN Architecture

4.1 A Composable Architecture

FlexCNN is a composable dataflow architecture made up of a number of streaming modules that are connected as a directed graph based on the target CNN architecture. It is flexible and composable in the sense that modules can be reordered, new modules can be added, or some modules can be removed, depending on the target CNN graph. This is particularly important, since state-of-the-art CNNs usually come with new special layers that rigid accelerators struggle to process efficiently. Thus, if a CNN layer type is not supported yet, then the user would simply need to develop a single module for that layer. Currently, FlexCNN has modules to support a variety of CNN layers (see Table 3).
Table 3.
Module | Layers Supported
Standard Systolic Array | normal convolution layers
Versatile Systolic Array | normal, transposed, dilated, and asymmetric convolution layers
DW Conv | depth-wise convolution layer
Act & BN | activation (ReLU, ReLU6, PReLU, Leaky ReLU) and batch normalization layers
Add | piece-wise addition of two layers
Concat | concatenation of two layers
Upsample | nearest neighbor or bilinear upsampling layers
Pool | max-pooling or average pooling layers
Table 3. Current Modules and Their Descriptions
Since FlexCNN is a dataflow architecture, it can be thought of as a coarse-grain pipeline where modules are the pipeline stages. To avoid pipeline stalls, we make sure that all modules are fully pipelined with an initiation interval of 1, meaning that each module produces and consumes data every clock cycle. Therefore, the overall latency is calculated as the latency of the longest pipeline stage + the latency of filling and draining the pipeline. Since convolution layers are the most compute-intensive, the longest pipeline stage is the latency of the systolic array modules. Furthermore, FlexCNN supports CNNs in different data types, including float 32-bit, fixed 16-bit, and fixed 8-bit. Figures 6, 7, and 8 show the architectures for OpenPose, U-Net, and E-Net CNNs, respectively.
Fig. 6.
Fig. 6. FlexCNN with a standard SA for OpenPose.
Fig. 7.
Fig. 7. FlexCNN with a versatile SA for U-Net.
Fig. 8.
Fig. 8. FlexCNN with a versatile SA for E-Net.

4.2 Modules

We implement line-buffer-based streaming architectures for the DW Conv, Act & BN, Add, Pool, and Upsample modules using a stencil-based architecture similar to the one in Reference [14]. All these modules are parameterized by the factors shown in Table 4, which are explored by the design space exploration (DSE) engine (covered in Section 6.2) for optimal performance. We apply double buffering in both the Reader modules and the Writer module. Furthermore, if the outputs of the whole layer can fit into the on-chip buffer, then the data will be pushed into on-chip buffers and directly fetched by the Reader to save the off-chip communication time.
Table 4.
Design Parameters | Explanation
\(Th(k), Tw(k), Tn(k), Tm(k)\) | Tiling factors for \(H\), \(W\), \(N\), and \(M\) for layer \(k\)
\(SIMD\) | SIMD lanes for all modules
\(SA\_ROW, SA\_COL\) | Rows and columns of the systolic array kernel
Table 4. Design Parameters and Explanations

4.3 Layer Fusion and Layer Parallelization

Due to the limited fast on-chip FPGA memory (BRAMs and URAMs), it is usually necessary to use the slow off-chip memory (DRAM), especially for large CNN models. Thus, the intermediate tensors of layers are loaded from the DRAM using the Reader modules, processed by the FlexCNN compute modules, and then written back to DRAM using the Writer module. An important feature of FlexCNN is that each module can be enabled or disabled dynamically during runtime to process or bypass the data flowing through that module. This feature allows FlexCNN to employ layer fusion and layer parallelization, where one DRAM read and write can process multiple CNN layers and reduce off-chip communication, thus improving the hardware utilization of the FPGA. Layer fusion applies to sequential layers from the original CNN graph. Layer parallelization applies to layers that are parallel in the original CNN graph. For example, in a downsampling bottleneck block of E-Net (Figure 5), \(L1\) can be fused with the previous or following ReLU layers on the same branch (layer fusion), and \(L1\) can be executed in parallel with \(L2\), the MaxPool layer of the downsampling block (layer parallelization). Section 6.1 examines the mapping of CNN layers to the FlexCNN architecture in detail.

4.4 Dynamic Tiling

Tiling is applied when processing the network to improve data locality and minimize communication. Table 4 summarizes the tiling factors employed in FlexCNN, where \(N\) corresponds to the number of input feature maps, \(H\) and \(W\) to the height and width of the input feature maps, and \(M\) to the number of output feature maps. When the tiling factors are not sub-multiples of the tiled dimensions, redundant computation is introduced, which degrades the performance of the design. As explained in Section 1, in a normal CNN network, the types and configurations of different layers vary from each other. Therefore, the optimal tiling factors will differ as well. We have observed that using a uniform tiling factor for the whole network leads to up to a 1.7\(\times\) performance slowdown compared to the ideal case of using different tiling factors across layers. Therefore, in this work, we apply dynamic tiling by re-configuring the tiling factors of the accelerator on-the-fly for different layers to maximize the performance. Supporting dynamic tiling introduces some hardware overhead; however, this overhead is negligible compared to the performance improvement. Section 8 evaluates the impacts of this technique in detail.
Previous works such as References [44, 51, 62] have also emphasized the need for different tiling factors across layers. Our architecture is distinguished from previous work by changing all the tiling factors of each layer dynamically, whereas previous work only adjusted part of the tiling factors or used several on-chip accelerators, each with its own uniform tiling factors. Equation (1) shows the restrictions on the tiling factors.
\begin{align} \begin{split} Tw(k) &= c_1 \times SA\_COL\\ Tm(k) &= c_2 \times SA\_ROW\\ Tn(k) &= c_3 \times SIMD\\ Tm(k) &= Tn(k+1) \end{split} \end{align}
(1)
In FlexCNN, the width and output channels of the feature maps are mapped to the columns and rows of the SA, respectively. As a result, for each layer, \(Tw(k)\) and \(Tm(k)\) should be multiples of their respective SA dimensions. The reduction over multiple input channels is computed in parallel inside each PE of the SA, which is defined as the SIMD lane. This implies that \(Tn(k)\) should be a multiple of the SIMD lane width. \(Th(k)\) can be any arbitrary value.
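The sketch below illustrates these constraints in Python; it is only an illustration with hypothetical layer dimensions, not the DSE code (which additionally models latency, resources, and the cross-layer constraint \(Tm(k) = Tn(k+1)\)).

```python
import math

def is_legal(Tw, Tm, Tn, Tn_next, sa_col, sa_row, simd):
    """Equation (1): Tw, Tm, Tn must be multiples of the SA columns, SA rows, and
    SIMD width, and Tm of this layer must match Tn of the next layer."""
    return (Tw % sa_col == 0 and Tm % sa_row == 0 and
            Tn % simd == 0 and Tm == Tn_next)

def padding_overhead(W, M, N, Tw, Tm, Tn):
    """Extra-work ratio when the tiled dimensions are not multiples of the tiles."""
    rup = lambda x, t: math.ceil(x / t) * t
    return (rup(W, Tw) * rup(M, Tm) * rup(N, Tn)) / (W * M * N)

# For an 8 x 8 array with SIMD = 8: Tw = 48 divides W = 96 exactly, while Tw = 64
# is also legal but wastes about 33% of the work on this (hypothetical) layer.
print(is_legal(48, 32, 32, 32, 8, 8, 8), padding_overhead(96, 64, 32, 48, 32, 32))
print(is_legal(64, 32, 32, 32, 8, 8, 8), padding_overhead(96, 64, 32, 64, 32, 32))
```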
As mentioned before, the computation in the DW Conv module can be seen as a stencil kernel. Figure 9 depicts the 3 \(\times\) 3 stencil window connected by line buffers. As depicted in the figure, at each cycle, the line buffers fetch one pixel from a feature map and the data are shifted by one location. The length of the first two lines (for a general case, the first \(K-1\) lines with K being the filter size) is determined by \(Tw(k)\). After all the registers in the line buffers are filled with data (\((K-1) \times Tw(k) + K\) cycles), the computation can start by convolving the registers marked in black with the respective filter. Since the SA module needs to fetch SIMD elements in each cycle, the architecture in Figure 9 is duplicated SIMD times with each one fetching the data from a different feature map. As the length of the line buffer determines the \(Tw(k)\), each line should have “\(\max _{k} Tw(k)\)” registers. We realize dynamic tiling by connecting consecutive rows of the line buffer via a MUX, enabling data feeding from different locations.
Fig. 9.
Fig. 9. Architecture support for dynamic tiling in the Depth Conv module for a 3 \(\times\) 3 kernel with Tw of size 6/8/10.
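The following Python snippet is a behavioral model (not the HLS implementation) of the 3 \(\times\) 3 window in Figure 9: the shift register is provisioned for the largest supported Tw, and the tap positions play the role of the MUX that selects where each row of the window starts for the currently configured Tw. Border handling between rows is omitted for brevity.

```python
from collections import deque

def stencil_windows(pixels, Tw, K=3, max_Tw=10):
    """Yield the K x K neighborhood ending at each pixel of a row-major stream
    whose row length is Tw (element order is not significant for this sketch)."""
    depth = (K - 1) * max_Tw + K            # registers provisioned for the largest Tw
    buf = deque([0.0] * depth, maxlen=depth)
    for cycle, px in enumerate(pixels):
        buf.appendleft(px)                  # data shifts by one location per cycle
        if cycle >= (K - 1) * Tw + K - 1:   # line buffers filled for this Tw
            # taps at offset r * Tw + c mimic the MUX-selected row connections
            yield [buf[r * Tw + c] for r in range(K) for c in range(K)]

windows = list(stencil_windows(range(36), Tw=6))   # e.g., a 6-wide tile, 6 rows
```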

4.5 Data Layout Optimization

Data layout optimizations are applied to reduce the number of accesses to DRAM and increase the effective DRAM bandwidth. The first optimization is on the concatenation layers. A CNN network may contain blocks that concatenate the results of several layers. As shown in Figure 3, after each stage in the OpenPose-V2 network, results from two branches will be concatenated with the first outputs from the backbone network. This then serves as the inputs for the following stages. Figure 10 presents the optimized data organization of the network.
Fig. 10.
Fig. 10. Data organization for OpenPose.
The outputs of the backbone (region B) and each stage (region A, C) are placed close to each other, as shown in Figure 10. To be more specific, the outputs of Stage 1 will be written to region A. Regions A and B will serve as the inputs of Stage 2. In Stage 2, the outputs will be written to region C. The regions B and C will serve as the inputs of Stage 3, similarly. The outputs of each stage are written to regions A and C in a round-robin fashion. With this layout, the outputs of stage branches are concatenated on-the-fly, eliminating unnecessary off-chip DRAM movements.
To further improve the effective DRAM bandwidth, we change the data layout of the feature maps from \(N(k) \times H(k) \times \frac{W(k)}{Tw(k)} \times Tw(k)\) to \(\frac{N(k)}{Tn(k)} \times H(k) \times \frac{W(k)}{Tw(k)} \times Tn(k) \times Tw(k)\). This allows us to increase the burst length from \(Tw(k)\) to \(Tn(k) \times Tw(k)\). A DSC layer can easily become communication-bound because of its low computation to communication (CTC) ratio, since it is mostly using 1 \(\times\) 1 convolution kernels. In this case, when the kernel size of the next layer is 1 \(\times\) 1, since there is no overlapped region between different tiles, we further change the data layout to \(\frac{N(k)}{Tn(k)} \times \frac{H(k)}{Th(k)} \times \frac{W(k)}{Tw(k)} \times Tn(k) \times Tw(k) \times Th(k)\). It further increases the burst length for these layers to \(Tn(k) \times Th(k) \times Tw(k)\). For other kernel sizes, padding is applied, because a tile of \(Tn(k) \times Tw(k) \times Th(k)\) does not have all the data needed for the computation. We need to have \((p-1)\) and \(((p-1) \times Th(k) + (p-1)^2)\) extra DRAM accesses with burst length of \(Tn(k) \times Tw(k)\) and \(Tn(k)\), respectively, to fetch all the data (\(p\) denoting the kernel size). This increases the number of DRAM accesses with a burst length of \(Tn(k)\), which further increases the communication time and prevents us from applying this data layout.
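In NumPy terms, these re-layouts are just reshape/transpose operations applied before the tensors are written to DRAM; the sketch below uses hypothetical sizes and is only meant to make the index order explicit.

```python
import numpy as np

def tile_nhw(fm, Tn, Tw):
    """(N, H, W) -> (N/Tn, H, W/Tw, Tn, Tw): burst length grows from Tw to Tn*Tw."""
    N, H, W = fm.shape
    t = fm.reshape(N // Tn, Tn, H, W // Tw, Tw)
    return np.ascontiguousarray(t.transpose(0, 2, 3, 1, 4))

def tile_nhw_1x1(fm, Tn, Th, Tw):
    """(N, H, W) -> (N/Tn, H/Th, W/Tw, Tn, Tw, Th): used only when the next kernel
    is 1 x 1 (no halo needed); burst length grows to Tn*Th*Tw."""
    N, H, W = fm.shape
    t = fm.reshape(N // Tn, Tn, H // Th, Th, W // Tw, Tw)
    return np.ascontiguousarray(t.transpose(0, 2, 4, 1, 5, 3))

fm = np.arange(32 * 48 * 96, dtype=np.float32).reshape(32, 48, 96)  # hypothetical sizes
a = tile_nhw(fm, Tn=8, Tw=24)             # shape (4, 48, 4, 8, 24)
b = tile_nhw_1x1(fm, Tn=8, Th=12, Tw=24)  # shape (4, 4, 4, 8, 24, 12)
```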

5 The Versatile Systolic Array

5.1 Problem Formulation

5.1.1 Transposed Convolution.

At first glance, transposed convolution seems to be a completely different operation from a normal convolution. As shown in Figure 11, one T-CONV operation is a scalar multiplication of an input pixel by a \(K \times K\) filter, and the output result (of size \(K \times K\)) is placed in the output feature map (FM) separated by a distance determined by the T-CONV stride (\(S^\prime\)). The overlapping results in the output feature map are then added together to give the final output feature map.
Fig. 11.
Fig. 11. T-CONV original operation (\(K=3,S^\prime =2\)).
This same operation can be performed as a normal convolution by inserting \(S^\prime -1\) zeros between adjacent pixels of the input feature maps and convolving a reversed filter with stride \(S=1,\) as shown in Figure 12. Note that the gray zeros are part of the padding, which is required for N-CONV layers as well.
Fig. 12.
Fig. 12. Naïve computation of T-CONV (\(K=3,S^\prime =2\)).
Let \(N\), \(I_h\), \(I_w\), \(M\), \(O_h\), and \(O_w\) represent the channels, height, and width of the input and output FMs, respectively. This naïve implementation requires \(K^2 N M O_h O_w\) multiply-accumulate (MAC) operations (\(3^2\) \(\times\) 4 \(\times\) 4 MACs in Figure 12), but the non-zero MAC operations are only \(K^2 N M I_h I_w\) (\(3^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 14). The ideal speedup for T-CONV is given by Equation (2)
\begin{equation} \begin{gathered} Transposed\ Convolution\ Ideal\ Speedup = \frac{O_h O_w}{I_h I_w} = \frac{S^\prime I_h \times S^\prime I_w}{I_h I_w} = S^{\prime 2} \end{gathered} \end{equation}
(2)

5.1.2 Dilated Convolution.

Similar to transposed convolution, dilated convolution can be naïvely implemented as a normal convolution operation by inserting \(d-1\) zeros between the filters’ values (Figure 13), where \(d\) is the D-CONV dilation rate. The number of MAC operations using this method is \((dK-d+1)^2 NO_hO_wM\) (\(3^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 13). However, the effectual non-zero MAC operations are only \(K^2NO_hO_wM\) (\(2^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 15). Equation (3) gives the ideal speedup for D-CONV.
\begin{equation} Dilated\ Convolution\ Ideal\ Speedup = \frac{(dK-d+1)^2}{K^2} \end{equation}
(3)
Fig. 13.
Fig. 13. Naïve computation of D-CONV (\(K=2,d=2\)).
Now, the problem is how to design a versatile SA that can eliminate the ineffectual zero MAC operations to achieve the theoretical ideal speedups with minimal area overhead.
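Equations (2) and (3) are easy to sanity-check numerically; the values printed below correspond to the configurations used later in the layer tests of Section 8.3 (Table 11).

```python
def tconv_ideal_speedup(s):
    """Equation (2): ideal T-CONV speedup depends only on the stride S'."""
    return s ** 2

def dconv_ideal_speedup(K, d):
    """Equation (3): ideal D-CONV speedup for filter size K and dilation rate d."""
    return (d * K - d + 1) ** 2 / K ** 2

print(tconv_ideal_speedup(2), tconv_ideal_speedup(3), tconv_ideal_speedup(4))  # 4, 9, 16
print(round(dconv_ideal_speedup(5, 2), 2),   # 3.24
      round(dconv_ideal_speedup(4, 3), 2),   # 6.25
      round(dconv_ideal_speedup(3, 5), 2))   # 13.44
```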

5.2 Approach

Previous FPGA works attempted to accelerate either T-CONV layers, such as References [19, 33, 34, 56, 58], or D-CONV layers, such as Reference [61], but not both. However, some ASIC works proposed versatile accelerators for T-CONV and D-CONV layers. References [10] and [35] target general sparsity including T-CONV and D-CONV layers. Reference [28] uses a systolic array with delay cells to skip the zero MAC operations. Reference [9] proposes a decomposition approach that decomposes T-CONV and D-CONV layers into dense N-CONV layers. However, none of these previous works discussed the area overhead of supporting T-CONV and D-CONV layers efficiently. Table 5 summarizes these works.
Table 5.
Work | Device | N-CONV Support | T-CONV Support | D-CONV Support | Design Generation
FlexCNN (ours) | FPGA | Yes | Yes | Yes | Automatic
Electronics 2021 [61] | FPGA | Yes | No | Yes | Manual
VLSI 2020 [58] | FPGA | Yes | Yes | No | Automatic
Electronics 2020 [19] | FPGA | Yes | Yes | No | Manual
FCCM 2018 [56] | FPGA | Yes | Yes | No | Automatic
ISCAS 2020 [9] | ASIC | Yes | Yes | Yes | Manual
ISCAS 2019 [28] | ASIC | Yes | Yes | Yes | Manual
VLSI 2020 [10] | ASIC | Yes | Yes | Yes | Manual
ISCAS 2019 [35] | ASIC | Yes | Yes | Yes | Manual
Table 5. Versatile SA Comparison with Other Works
We chose to base our work on the decomposition approach in Reference [9], since it requires the least changes and area overhead to a standard SA. However, their work did not provide enough formulation and details on ideal speedups and filter/feature map decomposition for arbitrary filter size, T-CONV stride (\(S^\prime\)), and dilation rate (\(d\)). In this section, we illustrate the decomposition approach and provide a decomposition algorithm for T-CONV and D-CONV.

5.2.1 Decomposition of T-CONV Operation.

The decomposition of T-CONV operation gets rid of the non-effectual zero MAC operations by decomposing the convolution filters into \(S^{\prime 2}\) sub-filters that convolve over the dense input feature maps producing the same outputs as the naïve implementation, as shown in Figure 14.
Fig. 14.
Fig. 14. Efficient computation of T-CONV (\(K=3,S^\prime =2\)).

5.2.2 Decomposition of D-CONV Operation.

The decomposition of dilated convolution is more straightforward. While filters are decomposed in T-CONV, the input feature maps of D-CONV are decomposed into \(d^2\) sub-feature maps. Each sub-feature map contains non-contiguous pixels separated by a distance \(d-1\), as shown in Figure 15.
Fig. 15.
Fig. 15. Efficient computation of D-CONV (\(K=2,d=2\)).

5.2.3 Unified Decomposition Algorithm.

Algorithm 1 formulates the decomposition of an \(N \times N\) 2-D input matrix \(I\) given a constant \(Z\), where \(I\) is a dense filter and \(Z=S^\prime\) for T-CONV or \(I\) is a dense input FM and \(Z=d\) for D-CONV. The algorithm has two steps: First, it gets the height and width dimensions of each sub-matrix of \(I\). Second, it gets the values of each sub-matrix. After that, it returns the decomposed sub-matrices of \(I\).
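Algorithm 1 itself is not reproduced here; the strided-slicing sketch below is our Python rendering of the two steps it describes (sub-matrix shapes, then values). The offset and alignment bookkeeping, including the filter reversal used in the naïve T-CONV formulation, is omitted.

```python
import numpy as np

def decompose(I, Z):
    """Split a dense 2-D matrix I into Z*Z strided sub-matrices. I is a filter and
    Z = S' for T-CONV, or I is an input feature map and Z = d for D-CONV."""
    subs = []
    for r in range(Z):
        for c in range(Z):
            subs.append(I[r::Z, c::Z])   # rows r, r+Z, ... and columns c, c+Z, ...
    return subs

# T-CONV example matching Figure 14: a 3 x 3 filter with S' = 2 decomposes into
# sub-filters of shape (2, 2), (2, 1), (1, 2), and (1, 1).
f = np.arange(9).reshape(3, 3)
print([s.shape for s in decompose(f, 2)])   # [(2, 2), (2, 1), (1, 2), (1, 1)]
```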

5.3 The Versatile Systolic Array

The versatile systolic array efficiently supports four types of convolutional layers, i.e., N-CONV, T-CONV, D-CONV, and A-CONV layers. Figure 16 illustrates the high-level architecture of the versatile SA. Both T-CONV and D-CONV layers can be naïvely implemented as N-CONV layers by inserting zeros in the input FMs of a T-CONV layer or in the filters of a D-CONV layer, as shown in Figures 12 and 13. However, this naïve implementation leads to huge underutilization of computation resources due to the zero MAC operations. To the best of our knowledge, this is the first efficient FPGA implementation of N-CONV, T-CONV, and D-CONV in one systolic array.
Fig. 16.
Fig. 16. The versatile SA architecture.

5.3.1 The Architecture and Dataflow of the VSA.

To implement the decomposition approach in a systolic array, we used the open-source framework PolySA [16]. The systolic array is output-stationary and is made of \(SA\_COL \times SA\_ROW\) PEs, each of which contains \(SIMD\) MAC engines. The SA has \(SA\_COL\) input-feed modules that feed input feature maps to the top-row PEs, flowing down to the bottom of the SA, and it has \(SA\_ROW\) weight-feed modules that feed filters to the leftmost column of PEs, flowing to the rightmost column of the SA, as shown in Figure 16. Each PE contains a local buffer to accumulate the results of the convolution. Once a PE’s local buffer has the final convolution results, it sends its outputs down the SA to be collected by \(SA\_COL\) out-collect modules.

5.3.2 T-CONV Implementation.

In a normal convolution layer with a 3 \(\times\) 3 filter and one input feature map, each output pixel is computed as the dot product of the filter by the corresponding input pixels. This translates to 9 MAC operations on one of the PE’s local registers. In a T-CONV layer with a 3 \(\times\) 3 filter, \(S^\prime =2\), and one input feature map, the 9 MAC operations are decomposed into \(S^{\prime 2}=4\) N-CONV operations with \((2\) \(\times\) \(2),(2\) \(\times\) \(1),(1\) \(\times\) \(2),\) and \((1\) \(\times\) \(1)\) sub-filters, as illustrated in Figure 14. The MAC operations of the four sub-filters are computed in four different registers. Thus, changing the address of result accumulations is the only modification in the PEs. At this stage, the output pixels in the PEs are not organized. We avoided implementing data reorganization in the PEs and shifted that logic to the out-collect modules (Figure 16), since there are more PEs than out-collect modules in the SA, which minimizes area overheads.

5.3.3 D-CONV Implementation.

The decomposition approach for D-CONV is slightly different from T-CONV. Instead of decomposing the filters, the input feature maps are decomposed, as illustrated in Figure 15. The input-feed modules send non-contiguous input pixels based on the dilation rate, while the weight-feed modules send the dense filters as it does for N-CONV. For example, in a D-CONV operation with a 2 \(\times\) 2 filter, \(d=2\), and one input feature map, each output pixel is computed through 4 MAC operations on the same register using the dense filter but with non-contiguous input pixels. The input pixels are separated by a distance of \(d-1=1\) in this case. D-CONV does not require data reorganization, as all output pixels have the same number of MAC operations.

6 Compilation Framework

The compilation framework takes a CNN graph and an ordered list of the modules needed for that CNN as inputs and generates an optimized FPGA accelerator. The compilation framework has three major components (Figure 17). This section discusses the components of the framework in detail.
Fig. 17.
Fig. 17. Compilation system.

6.1 CNN Layer Mapper

6.1.1 ONNX.

While the original FlexCNN framework supported TensorFlow CNNs only, the updated framework uses Open Neural Network Exchange (ONNX). This is an open-source framework that establishes open standards for representing machine learning algorithms and software tools. The ONNX representation supports multiple famous ML frameworks such as TensorFlow, PyTorch, Caffe, and ScikitLearn, to name a few. ONNX compacts a deep neural network (DNN) model in a single file. This file contains: (1) the DNN’s graph, where each node represents a DNN layer and each edge represents the data flow from one node to another, and (2) the DNN’s parameters, mainly weights and biases.
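For reference, extracting the graph and the parameters from an ONNX file takes only a few lines with the official onnx Python package; the file name below is hypothetical.

```python
import onnx

model = onnx.load("enet.onnx")                     # hypothetical model file
graph = model.graph
# Initializers hold the trained parameters (mainly weights and biases).
params = {init.name for init in graph.initializer}
for node in graph.node:                            # each node is one DNN layer
    data_inputs = [name for name in node.input if name not in params]
    print(node.op_type, data_inputs, "->", list(node.output))
```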

6.1.2 CNN Layer Mapping.

Now, having a compact representation of any CNN, it is easier for the CNN Layer Mapper to map the nodes of the CNN to the ordered list of FlexCNN modules. This is the component that performs layer fusion and layer parallelization. The architecture must have modules to support all the CNN’s layers. In most CNNs, the convolution layers are the most compute-intensive operations, and the SA is the bottleneck module. Our mapping algorithm iterates through each convolution node and checks to see if the predecessor, successor, or parallel nodes of the convolution node can be mapped to the ordered list. It then outputs a list of layer bundles that are sent to the design space exploration, which is discussed next.
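A simplified sketch of the successor-fusion part of this step is shown below; the op-type-to-module map and the graph_succ() helper are placeholders, and the real mapper also handles predecessor and parallel nodes as described above.

```python
FUSIBLE = {"Relu": "Act & BN", "BatchNormalization": "Act & BN",
           "MaxPool": "Pool", "Add": "Add", "Concat": "Concat"}   # illustrative subset

def bundle(conv_node, graph_succ):
    """Attach fusible successors of one convolution node to the same hardware pass.
    graph_succ(node) is a placeholder returning a node's successors in the CNN graph."""
    layers, cur = [conv_node], conv_node
    while True:
        nxt = [n for n in graph_succ(cur) if n.op_type in FUSIBLE]
        if len(nxt) != 1:          # stop at a fork, a join, or an unsupported layer
            break
        layers.append(nxt[0])
        cur = nxt[0]
    return layers
```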

6.2 Design Space Exploration

Given the network, the accelerator architecture, and the FPGA’s resources information, we will perform the design space exploration to select the optimal design parameters that minimize the inference latency of the CNN when run on the target FPGA. Table 4 lists the design parameters to be determined.
Two analytical models, resource_est() and latency_est(), are built for estimating the resource usage and latency of designs. Currently, the resource model estimates block RAM (BRAM) and DSP usage, which are usually the bottleneck of designs. The DSE process sweeps through the design space with all feasible combinations of design parameters. For each design parameter list, the resource usage is examined first. Designs that over-utilize the resources are pruned away. Then, we follow a greedy algorithm to select the optimal tiling factors that minimize the latency layer by layer. The DSE process finishes within minutes on a standard workstation.
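A compact sketch of this loop is shown below, with resource_est() and latency_est() passed in as stand-ins for the analytical models (their internals are not shown here), and tilings_for() as a placeholder that enumerates the Equation (1)-legal tilings of a layer.

```python
import itertools

def dse(layers, budget, sa_rows, sa_cols, simds, tilings_for,
        resource_est, latency_est):
    """Sweep the global parameters, prune over-utilizing designs, then greedily pick
    per-layer tiling factors; returns (best_latency, best_config)."""
    best = (float("inf"), None)
    for sa_row, sa_col, simd in itertools.product(sa_rows, sa_cols, simds):
        used = resource_est(sa_row, sa_col, simd)
        if used["BRAM"] > budget["BRAM"] or used["DSP"] > budget["DSP"]:
            continue                                   # prune infeasible designs
        total, tiles = 0.0, []
        for layer in layers:                           # greedy, layer by layer
            lat, tile = min(((latency_est(layer, t, sa_row, sa_col, simd), t)
                             for t in tilings_for(layer, sa_row, sa_col, simd)),
                            key=lambda p: p[0])
            total += lat
            tiles.append(tile)
        if total < best[0]:
            best = (total, {"SA_ROW": sa_row, "SA_COL": sa_col,
                            "SIMD": simd, "tiles": tiles})
    return best
```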

6.3 Design Generation

This step creates the code that is synthesized into the hardware accelerator. Since we are targeting Xilinx/AMD FPGAs, our design generator creates Xilinx/AMD High-Level Synthesis (HLS) code [53]. Generating the bitstream for such complex architectures has been challenging, especially when using large systolic arrays. The bitstream generation task would usually fail the placement and routing step. For this reason, we recently added support to generate TAPA code [15]. TAPA is a dataflow HLS framework that offers fast compilation, and it generates high-frequency designs with the help of AutoBridge [25]. AutoBridge is a tool targeted at large dataflow architectures. It helps the process of placement and routing by placing the dataflow modules evenly across the FPGA fabric and connecting them with pipelining registers to minimize the critical paths of the design.
Now, having the optimal hardware parameters from the DSE, the user can choose to produce Xilinx/AMD HLS code or TAPA code. The code generation is template-based. The original FlexCNN paper used the PolySA [16] compiler to generate a standard systolic array. To automate the process of generating new versatile SAs with different dimensions based on an application target, we integrated our modifications on the standard SA into the PolySA compilation framework to create new versatile SAs with a push of a button. We also used Algorithm 1 and other scripts to automatically prepare test data to run on FPGA.

7 Software-hardware Pipelining

Figure 2 illustrates the software overheads when integrating an FPGA kernel into a machine learning framework like TensorFlow. These overheads can defeat the purpose of hardware acceleration. To overcome this challenge, we use a software-hardware pipelining technique that overlaps the software execution with the hardware kernel execution. We chose TensorFlow as our ML framework, since it is widely used for inference in the ML community (e.g., References [27, 36]). To invoke the FPGA from TensorFlow, we redefine the nodes in the original computation graph. All computation nodes of the CNN are merged into one node that is implemented by the FPGA. The rest of the graph is still processed on the CPU.
When FPGA is connected to TensorFlow, the whole integration stack consists of the following steps: (1) reading the inputs of CNN, (2) pre-processing including stages such as image resizing, (3) re-organizing the initial data layouts in CPU memory, (4) transferring data from CPU to FPGA device memory, (5) computation on FPGA, (6) fetching the results back via PCIe, (7) reformatting and passing it to TensorFlow, (8) non-CNN computation stages on CPU, (9) processing the results (e.g., estimating the human poses based on the attained results and drawing them for the OpenPose network), and (10) writing out and displaying the results.
Figure 2 shows the breakdown of these stages in the OpenPose application for a 384 \(\times\) 384 RGB input. Among the whole pipeline, which takes 208.8 ms, the FPGA computation in Step 5 only requires 11.8% of the total time. The integration overheads have led to an \(8.45\times\) performance slowdown. To reduce these overheads, we have applied an optimized software/hardware pipelining.
A two-level pipelining is applied on the whole integration stack that enables the simultaneous processing of the aforementioned steps. The first level overlaps TensorFlow’s overheads (steps 1, 2, 9, 10) with the rest of the steps. The second one overlaps FPGA’s computation with data movement steps (steps 3, 4, 6, 7).
Figure 18 illustrates the first level of the pipeline, which is applied at the TensorFlow level. The numbers in the figure show the related step number. Steps 1, 2, 9, and 10 and the rest of the steps are assigned to different processes connected by a queue. Therefore, steps 1, 2, 9, and 10 are overlapped with FPGA-related steps. The overall performance is determined by the stage with the longest latency. Pipelining is enabled by exploiting multiprocessing. In other words, each of the steps is assigned to a separate process. These processes pass the data to each other through queues, as shown in Figure 18.
Fig. 18.
Fig. 18. First level of the pipeline.
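The first pipeline level boils down to standard Python multiprocessing with queues. The sketch below uses trivial placeholder functions in place of the real pre-processing, FPGA invocation, and post-processing stages, so it only demonstrates the process-and-queue structure.

```python
import multiprocessing as mp

# Placeholders standing in for the real stages of the integration stack.
def preprocess(frame):            # steps 1-2: read input, resize
    return frame
def run_fpga(item):               # steps 3-7: layout, transfer, kernel, fetch
    return item
def postprocess(result):          # steps 8-10: non-CNN compute, draw, display
    print("frame done:", result)

def fpga_stage(q_in, q_out):
    while (item := q_in.get()) is not None:
        q_out.put(run_fpga(item))
    q_out.put(None)               # propagate the end-of-stream marker

def post_stage(q_out):
    while (item := q_out.get()) is not None:
        postprocess(item)

if __name__ == "__main__":
    q_in, q_out = mp.Queue(maxsize=4), mp.Queue(maxsize=4)
    workers = [mp.Process(target=fpga_stage, args=(q_in, q_out)),
               mp.Process(target=post_stage, args=(q_out,))]
    for w in workers:
        w.start()
    for frame in range(8):        # the producer (steps 1-2) runs in this process
        q_in.put(preprocess(frame))
    q_in.put(None)
    for w in workers:
        w.join()
```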
To further improve the performance, we fully pipeline the communication and computation of the FPGA, which consists of steps 3 to 7. This builds the second level of the pipeline. To allow pipelining, a batch of images is sent to the FPGA. For a certain batch size, the additional latency incurred by batch processing is hidden when the first level of the pipeline is applied. After the FPGA finishes processing the batch, the results are passed back to TensorFlow and the non-CNN computations are done in parallel for all the images. Figure 19 depicts the redefined graph that we use to achieve such a pipeline. With this optimization, the data movement steps are overlapped with the kernel computation, and the latency for the non-CNN computation (Step 8) is amortized over the whole batch. Note that such deep software+hardware pipelining techniques were also used in References [12, 17] for integrating FPGA accelerators into Spark-based applications.
Fig. 19.
Fig. 19. The overview of the Process Graph stage.

8 Experimental Results

8.1 Experiment Setup

As mentioned before, the FlexCNN architecture is described either in Xilinx/AMD HLS [53] or TAPA HLS [15]. The target platforms are Xilinx/AMD Virtex Ultrascale+ VCU1525 and Alveo U250 and U280 Data Center Accelerator Cards. Table 6 demonstrates the generated designs and the corresponding tools and FPGA platforms used for each design.
Table 6.
Target CNN | Code | Xilinx/AMD Tool | Platform | Systolic Array | Precision
OpenPose-V2 | Vivado HLS | SDAccel 2018.3 | VCU1525 | Standard SA | float 32-bit
Individual Layer Tests | Vivado HLS | SDAccel 2018.3 | VCU1525 | Standard SA | float 32-bit
U-Net | Vivado HLS | SDAccel 2018.3 | VCU1525 | Versatile SA | float 32-bit
E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | float 32-bit
E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 16-bit
E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 8-bit
E-Net | TAPA HLS | Vitis 2021.2 | U280 | Versatile SA | fixed 8-bit
VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | float 32-bit
VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 16-bit
VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 8-bit
Table 6. Experiments’ Setup
Observe that the second design in the table, with a standard systolic array, is used to compare the performance of a standard systolic array against the versatile systolic array on individual layers.

8.2 Hardware Optimization

The target FPGA platforms come with four DDR banks. In our implementations, we use two DDR banks, assigning feature maps and weights (including biases) to separate banks. All the architecture choices are parameterizable and can be adjusted based on the target FPGA. We found that the following configuration works best for the OpenPose-V2 application on the Xilinx/AMD VCU1525: the systolic array for our standard conv module is organized as an 8 \(\times\) 8 array with a SIMD factor of 8. For the rest of the modules, we use the same SIMD factor. Table 7 shows the frequency and resource utilization under this configuration.
Table 7.
Precision | Frequency | LUT | FF | BRAM | URAM | DSP
float 32-bit | 242.9 MHz | 43% | 40% | 60% | 15% | 50%
Table 7. Frequency and Resource Utilization of the OpenPose Accelerator
Table 8 shows the benefits of dynamic tiling and data layout transformation. We can see that these optimizations increase the performance by \(2.3\times\). Figure 1 depicts the performance gain of using dynamic tiling in a layer-by-layer fashion for the first 24 convolutional layers. Table 9 shows how applying dynamic tiling and dynamic data layout affects the tiling factors and effective DRAM bandwidth (BW) for the first layer of the last RBB in OpenPose-V2 compared to a design without these optimizations. The kernel size for this layer is 1 \(\times\) 1, which means it can use the optimized data layout with a burst length of \(Tn(k) \times Tw(k) \times Th(k),\) as described in Section 4.5. This data layout, along with the best tiling factor used for this layer, increases the effective DRAM BW and CTC ratio by \(2.8\times\). This results in \(6.1\times\) performance improvement.
Table 8.
Model | Precision | Frequency (MHz) | Runtime (1) (ms) | Runtime (2) (ms)
All Uniform | float 32-bit | 237 | 57.7 | 41.5
All Dynamic | float 32-bit | 242.9 | 35.6 | 24.7
Table 8. Performance on OpenPose-V2
(1): Without applying DRAM organization for concatenation layers.
(2): With applying DRAM organization for concatenation layers.
Table 9.
Model | Th | Tw | Tn | Tm | Eff. DRAM BW (GB/s) | CTC | Throughput (GFLOP/s)
All Uniform | 12 | 48 | 32 | 32 | 4.31 | 14.9 | 24.4
All Dynamic | 12 | 24 | 48 | 48 | 12.05 | 41.3 | 149.2 (\(6.1\times\))
Table 9. Performance Impacts of Dynamic Tiling/Data Layout Transformation
We further test the DSP efficiency of our design on a given convolution layer. Of all the DSPs, 78.7% are used in the standard SA module and 11.2% in the DW Conv module. We measure DSP efficiency in two ways: against the total number of DSPs in the design and against the number of DSPs of the modules used by that layer. All the tests are on a 256 \(\times\) 384 \(\times\) 384 input, producing 256 output channels. Table 10 summarizes the results. DSC layers require roughly \(K^2\times\) less computation than normal convolution layers, making them communication-bound, as shown in Figure 20. This figure depicts that DSC layers fall in the memory-bound region of the roofline model, since they have a lower CTC ratio. Therefore, we achieve lower computation efficiency in these layers. Additionally, it shows that the data layout optimization for the DSC with the \(1 \times 1\) kernel increases the burst length. This helps to increase the effective DRAM bandwidth, leading to a performance improvement over the \(3 \times 3\) DSC.
Fig. 20.
Fig. 20. Layers in Table 10 under the roofline model.
Table 10.
Layer | Runtime (ms) | Throughput (GFLOP/s) | DSP Eff. (total DSPs) | DSP Eff. (used DSPs)
Conv 3 \(\times\) 3 | 709.3 | 245.2 | 73.8% | 93%
Conv 1 \(\times\) 1 | 80.2 | 240.9 | 72.6% | 91.4%
DSC 3 \(\times\) 3 | 113.4 | 176.3 | 53.1% | 58.6%
DSC 1 \(\times\) 1 | 84.1 | 230.8 | 69.5% | 76.7%
Table 10. Performance on Different Convolutional Layers

8.3 The Versatile Systolic Array

To compare the effectiveness of the decomposition approach and the implementation, we conducted tests on the standard SA and the versatile SA using 10 different layers with various filter sizes, T-CONV strides (\(S^\prime\)), and dilation rates (\(d\)), as shown in Table 11.
Table 11.
T-CONV results:
Layer \((N, M, I_{h/w})\) | \(K, S^\prime\) | Standard SA Latency (ms) | Standard SA DSP Eff.\(^{*}\) | Versatile SA Latency (ms) | Versatile SA DSP Eff.\(^{*}\) | Speedup (Ideal\(^\dagger\))
(16,16,16) | 5, 2 | 0.3 | 5.15% | 0.3 | 4.61% | \(0.91\times (4\times)\)
(16,16,256) | 5, 2 | 14.7 | 24.25% | 5.0 | 69.50% | \(2.91\times (4\times)\)
(16,256,16) | 5, 2 | 1.1 | 19.68% | 0.6 | 34.64% | \(1.79\times (4\times)\)
(16,256,256) | 5, 2 | 228.8 | 24.88% | 74.8 | 74.92% | \(3.06\times (4\times)\)
(256,16,16) | 5, 2 | 1.2 | 18.52% | 0.6 | 37.23% | \(2.04\times (4\times)\)
(256,16,256) | 5, 2 | 228.9 | 24.88% | 57.4 | 97.54% | \(3.98\times (4\times)\)
(256,256,16) | 5, 2 | 14.6 | 24.37% | 3.9 | 90.08% | \(3.76\times (4\times)\)
(256,256,256) | 5, 2 | 3,653.0 | 24.94% | 913.6 | 98.14% | \(4.00\times (4\times)\)
(256,256,256) | 3, 3 | 2,960.7 | 11.08% | 329.5 | 97.95% | \(8.98\times (9\times)\)
(256,256,256) | 4, 4 | 9,352.8 | 6.23% | 585.3 | 98.04% | \(15.98\times (16\times)\)

D-CONV results:
Layer \((N, M, I_{h/w})\) | \(K, d\) | Standard SA Latency (ms) | Standard SA DSP Eff.\(^{*}\) | Versatile SA Latency (ms) | Versatile SA DSP Eff.\(^{*}\) | Speedup (Ideal\(^\dagger\))
(16,16,16) | 5, 2 | 0.3 | 3.99% | 0.3 | 4.83% | \(1.23\times (3.24\times)\)
(16,16,256) | 5, 2 | 11.8 | 30.03% | 5.2 | 67.84% | \(2.30\times (3.24\times)\)
(16,256,16) | 5, 2 | 1.0 | 22.04% | 0.5 | 41.77% | \(1.93\times (3.24\times)\)
(16,256,256) | 5, 2 | 185.3 | 30.72% | 57.5 | 97.54% | \(3.23\times (3.24\times)\)
(256,16,16) | 5, 2 | 1.1 | 20.86% | 0.7 | 29.23% | \(1.42\times (3.24\times)\)
(256,16,256) | 5, 2 | 185.2 | 30.74% | 78.0 | 71.80% | \(2.37\times (3.24\times)\)
(256,256,16) | 5, 2 | 11.8 | 30.24% | 3.9 | 89.41% | \(3.00\times (3.24\times)\)
(256,256,256) | 5, 2 | 2,958.5 | 30.79% | 913.6 | 98.14% | \(3.24\times (3.24\times)\)
(256,256,256) | 4, 3 | 3,652.3 | 15.96% | 584.9 | 98.10% | \(6.24\times (6.25\times)\)
(256,256,256) | 3, 5 | 4,419.3 | 7.42% | 329.4 | 98.00% | \(13.42\times (13.44\times)\)
Table 11. Performance of Different T-CONV and D-CONV Layers
\(^{*}\) DSP efficiency is measured as the actual performance using non-zero MAC operations divided by the peak performance (GFLOP/s) of the SA.
\(^\dagger\) Ideal speedup is based on our analysis in Section 5.1 using our systolic array architectures.
Notice that layers with small \(N\), \(M\), or \(I_{h/w}\) have low computation-to-communication ratios, which makes them communication-bound. This explains the low DSP efficiency for these layers. In contrast, the last three layers are computation-bound, and the DSP efficiency of the T-CONV and D-CONV layers is around \(98\%\), while the DSP efficiency of the standard SA is capped at \(\frac{100}{\text{ideal speedup}}\%\). This matches our ideal speedup analysis in Section 5.1.
Table 12 demonstrates the frequency and resource utilization of the versatile SA design and the standard SA design. In terms of area overhead, the versatile SA requires only about 7% more LUTs, 3% more Flip Flops, and around 3% more DSPs. For on-chip memory, the PEs utilize the BRAMs for local buffers, while the input, weight, and output buffers are implemented using URAMs. The PEs’ local buffers are larger in the versatile SA as the decomposition approach requires \(S^{\prime 2}\times\) the size of buffers for T-CONV decomposition. This explains the 24% increase in BRAM utilization. However, the standard SA needs larger weight buffers to accommodate the zeros inserted in the filters, and this explains the lower URAM utilization for the versatile SA.
Table 12.
Design | \(SA\_COL \times SA\_ROW \times SIMD\) | Frequency | LUT | FF | BRAM | URAM | DSP
Versatile SA | 8 \(\times\) 8 \(\times\) 8 | 233.9 MHz | 48% | 43% | 37% | 44% | 45%
Standard SA | 8 \(\times\) 8 \(\times\) 8 | 230.2 MHz | 41% | 40% | 13% | 51% | 42%
Table 12. Frequency and Resource Utilization of the Layer Tests’ and U-Net Accelerators

8.4 Software-hardware Integration Optimization

In this section, we evaluate the effect of our integration optimization on OpenPose-V2. FlexCNN processes one frame in 24.7 ms, which translates to a peak performance of 40.5 FPS. However, without proper optimization, the direct integration into the TensorFlow framework only leads to a performance of 4.8 FPS, as shown in Table 13. Table 13 summarizes the impacts of the two-level pipelining on the overall performance. We use a batch of 16 for the OpenPose network to enable pipelining on the FPGA, since it produces the best performance and the smoothest output when displaying the result. With two-level pipelining, we achieve up to a 5\(\times\) speedup, which leads to a final performance of 23.8 FPS.
Table 13.
Version | Runtime/frame (ms) | Throughput (FPS) | Speedup
Original | 208.8 | 4.8 | 1
1st pipeline | 97.1 | 10.3 | 2.1
2nd pipeline | 42 | 23.8 | 5
Table 13. Performance Impacts of Integration Optimization

8.5 Applications

In this subsection, we evaluate the performance of the three real-world CNNs we implemented on FlexCNN and compare the results with other works.

8.5.1 OpenPose-V2.

To the best of our knowledge, there is only one work [6] that has implemented a variant of OpenPose on FPGA. However, they take a different approach. They reduce the computation cost of the original network by making the weights sparse and using only two stages after the backbone network. Furthermore, they quantized the data to a 16-bit fixed point and stored feature maps and weights on-chip. After these modifications, they reported neither their network’s computation cost nor their architecture’s resource utilization. Thus, we cannot compare our results to theirs directly. Instead, we have compared our results against the network implementation using TensorFlow on CPU and GPU.
The CPU is a 56-core Intel Xeon CPU E5-2680 v4 that operates at 2.40 GHz. For GPU, we use the NVIDIA Tesla V100 GPU, and it uses cuDNN [13] to run the network. To have a fair comparison of the latency of running the network on different platforms, we measure the runtime of a single image inference using OpenPose-V2 network. Table 14 summarizes the results. The runtime considers only the CNN inference time on RGB images of size 384 \(\times\) 384. For both the FPGA and GPU, the time to transfer the data from host to device and device to host is excluded from the measurement.
Table 14.
Platform | Frequency (GHz) | Throughput (GFLOP/s) | Latency (ms) | Frames Per Second (FPS)
CPU | 2.4 | 29 | 99.3 | 10.07
GPU | 1.4 | 114 | 25.3 | 39.53
FPGA (ours) | 0.243 | 117 | 24.7 | 40.49
Table 14. OpenPose-V2 Performance Comparison of Different Platforms (Batch Size 1)

8.5.2 U-Net.

The U-Net CNN model is made of 51 layers. The breakdown of all the layers is shown in Table 1. The T-CONV layers account for 2.1 Giga floating-point operations (GFLOPs), without counting the inserted zeros.
First, we compared U-Net performance with the TensorFlow implementation of the network on CPU and GPU. The CPU is a 56-core Intel Xeon CPU E5-2680 v4 that operates at 2.40 GHz. For GPU, we ran the network on NVIDIA A100-PCIE-40GB operating at 1.4 GHz. We measured the runtime of a single image inference. Table 15 summarizes the results. Similar to the OpenPose-V2 experiment, the runtime considers only the CNN inference time on RGB images, excluding the data transfer time for both the FPGA and GPU.
| Platform | Frequency (GHz) | Throughput (GFLOP/s) | Latency (ms) | Frames Per Second (FPS) |
| --- | --- | --- | --- | --- |
| CPU | 2.4 | 29 | 401 | 2.49 |
| GPU | 1.4 | 160 | 74.9 | 13.35 |
| FPGA (ours) | 0.234 | 207 | 58.2 | 17.18 |

Table 15. U-Net Performance Comparison on Different Platforms (Batch Size 1)
Second, we found two works [33, 34] that implement U-Net on FPGA. The first work uses two separate accelerators: one for N-CONV layers and one for T-CONV layers. Although this approach eliminates the zero MAC operations in T-CONV, its T-CONV accelerator achieves low performance and low DSP efficiency compared to its N-CONV accelerator, as shown in Table 16. The second work achieves better overall performance and DSP efficiency, since it uses 8-bit fixed-point precision and combines DSP and ALM resources to create denser, higher-performance MAC units. However, it does not report the DSP efficiency or performance of the T-CONV and N-CONV layers individually.
| Measure | FlexCNN | TRETS 2018 [33] | FPL 2019 [34] |
| --- | --- | --- | --- |
| Platform | Xilinx/AMD VCU1525 | Xilinx/AMD XC7Z045 | Intel A10 660 |
| Data Type | float 32-bit | fixed 16-bit | fixed 8-bit |
| Frequency (MHz) | 234 | 200 | 200 |
| N-CONV GOPs | 9.9 | 5.6 | N/A |
| T-CONV GOPs | 2.1 | 0.3 | N/A |
| Total GOPs | 12.0 | 5.9 | 27.4 |
| N-CONV GOP/s | 206.5 | 125 | N/A |
| T-CONV GOP/s | 209.8 | 29 | N/A |
| Total GOP/s | 207.0 | 107 | 1,578 |
| Peak GOP/s | 239.5 | N/A | 1,638 |
| T-CONV support | Yes | Yes | Yes |
| D-CONV support | Yes | No | No |

Table 16. U-Net Evaluation against Other Works

8.5.3 E-Net.

Table 2 shows the breakdown of E-Net's layers. The actual giga operations (GOPs) count excludes the zeros inserted for T-CONV and D-CONV layers. Two previous works [9, 28] instead use a notion of logical GOPs for T-CONV and D-CONV layers, which counts the redundant zero MAC operations as if they were real MACs. Although we do not consider it a meaningful measure, we also report it for consistency and comparison purposes. The FlexCNN architecture of E-Net is shown in Figure 8, and we created three designs with float 32-bit, fixed 16-bit, and fixed 8-bit data types. The clock frequency and resource utilization for each design are shown in Table 17. First, we compared the E-Net performance against the CPU and GPU, using the same experimental setup as the U-Net tests. The comparison results are given in Table 18.
| Design (\(SA\_COL \times SA\_ROW \times SIMD\)) | Data Type | Frequency (MHz) | LUT | FF | BRAM | URAM | DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8 \(\times\) 9 \(\times\) 8 | float 32-bit | 241 | 37.22% | 31.02% | 24.63% | 21.56% | 30.58% |
| 16 \(\times\) 9 \(\times\) 16 | fixed 16-bit | 219 | 40.16% | 29.35% | 29.43% | 12.50% | 27.50% |
| 16 \(\times\) 9 \(\times\) 16 | fixed 8-bit | 229 | 43.23% | 29.77% | 22.43% | 9.69% | 8.75% |

Table 17. E-Net Designs and Hardware Utilization on U250 FPGA
| Platform | Frequency (GHz) | Throughput (GFLOP/s) | Latency (ms) | Frames Per Second (FPS) |
| --- | --- | --- | --- | --- |
| CPU | 2.4 | 9.8 | 122.4 | 8.17 |
| GPU | 1.4 | 16.7 | 71.71 | 13.95 |
| FPGA (ours) | 0.241 | 57.2 | 20.95 | 47.62 |

Table 18. E-Net Performance Comparison on Different Platforms (Batch Size 1)
While we did not find any FPGA implementation of E-Net, there are three ASIC-based implementations. The comparison results are shown in Table 19. Compared to Reference [28], our fixed 8-bit and 16-bit designs achieve lower latencies and higher frames per second (FPS), but our actual performance (GOP/s) is slightly lower than theirs. That article reports only the performance (GOP/s) and FPS, not the network's operation count (GOPs). When we back-calculated the operation count from the given GOP/s and FPS numbers, we found it to be 1.4 GOPs, which is higher than ours (1.2 GOPs). They may have included operations from the non-convolution layers, which we did not; this explains why we have higher FPS but lower GOP/s. Similarly, the second work [9] reports neither the number of operations nor the latency or FPS of its implementation. We used our E-Net model to calculate the number of operations for a 512 \(\times\) 512 input image, which is 3.79 GOPs, and given their reported performance (168 GOP/s), we derived the latency and FPS numbers in Table 19. In terms of FPS, our three designs achieve higher rates, although we use a smaller image size. Their work achieves higher performance in terms of GOP/s, but the ASIC frequency is more than \(2\times\) the frequencies of our designs. For Reference [10], we could not make a meaningful comparison, as it reports only the performance and not the input image size, the GOPs, the latency, or the FPS.
| Work | FlexCNN 8 \(\times\) 9 \(\times\) 8 | FlexCNN 16 \(\times\) 9 \(\times\) 16 | FlexCNN 16 \(\times\) 9 \(\times\) 16 | ISCAS 2019 [28] | ISCAS 2020 [9] | VLSI 2020 [10] |
| --- | --- | --- | --- | --- | --- | --- |
| Platform | FPGA | FPGA | FPGA | ASIC | ASIC | ASIC |
| Frequency (MHz) | 241 | 219 | 229 | 200 | 500 | 200 |
| Image Size | 288 \(\times\) 288 | 288 \(\times\) 288 | 288 \(\times\) 288 | 288 \(\times\) 288 | 512 \(\times\) 512 | N/A |
| Data Type (w/a)\(^{*}\) | float 32-bit | fixed 16-bit | fixed 8-bit | fixed 8-bit | fixed 16-bit | fixed 2/16-bit |
| Latency (ms) | 20.95 | 13.86 | 12.92 | 14.62 | 22.55 | N/A |
| FPS | 47.72 | 72.15 | 77.39 | 68.40 | 44.35 | N/A |
| Actual GOP/s | 57.2 | 86.5 | 92.8 | 96.0 | 168.0 | 196.2 |
| Logical GOP/s | 426.2 | 644.4 | 691.2 | 639.7 | 1,377.0 | N/A |

Table 19. E-Net Comparison with Other Works
\(^{*}\)“w” for weights and “a” for activations.
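To make the actual versus logical operation counts in Table 19 concrete, the sketch below estimates both for a single layer under simplifying assumptions (boundary effects and the exact zero-insertion pattern are ignored); it is meant as an illustration rather than the exact counting used for the reported numbers.

```python
# Operation count of a normal convolution layer (one multiply-accumulate = 2 ops).
def conv_ops(h_out, w_out, k, c_in, c_out):
    return 2 * h_out * w_out * k * k * c_in * c_out

# Transposed convolution with stride s, computed as a normal convolution over a
# zero-inserted input: the logical count treats the inserted zeros as real inputs,
# while only roughly 1/s^2 of the filter taps touch non-zero data.
def tconv_ops(h_out, w_out, k, c_in, c_out, stride):
    logical = conv_ops(h_out, w_out, k, c_in, c_out)
    actual = logical // (stride * stride)
    return actual, logical

# Dilated convolution with dilation d: the logical view uses the zero-filled,
# enlarged (k-1)*d + 1 footprint; the actual view uses only the k*k real weights.
def dconv_ops(h_out, w_out, k, c_in, c_out, dilation):
    k_eff = (k - 1) * dilation + 1
    actual = conv_ops(h_out, w_out, k, c_in, c_out)
    logical = conv_ops(h_out, w_out, k_eff, c_in, c_out)
    return actual, logical

print(tconv_ops(64, 64, 3, 64, 64, stride=2))    # actual is ~4x smaller than logical
print(dconv_ops(64, 64, 3, 64, 64, dilation=2))  # logical uses a 5x5 footprint
```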

8.6 Comparison with Vitis AI

Vitis AI [4] is a Xilinx/AMD library for accelerating AI models on Xilinx FPGAs. The library uses optimized deep-learning processor unit (DPU) cores as an overlay, along with a software stack, to accelerate a variety of DNN models. Different DPUs are optimized for different workloads (such as CNNs, RNNs, and NLPs) and different goals, such as latency or throughput. The FlexCNN architecture, in contrast, mainly targets CNNs and focuses on optimizing the latency of CNN inference. In this subsection, we compare the performance of E-Net on FlexCNN and on Vitis AI. Xilinx/AMD reported E-Net performance on the U280 using two different DPUs, DPUCAHX8H [1] and DPUCAHX8L [2]. DPUCAHX8H is optimized for throughput, while DPUCAHX8L is optimized for latency. Both DPUs use fixed-point 8-bit formats. For a fair comparison, we used FlexCNN to generate an accelerator on the U280 with the same 512 \(\times\) \(1,\!024\) input image size.
Table 20 shows the resource utilization of the Vitis AI DPUs and our FlexCNN-generated design on the U280. First, note that Vitis AI deploys multiple DPU cores on the FPGA (3 for DPUCAHX8H and 2 for DPUCAHX8L). The DPUCAHX8H core can be configured with 3, 4, or 5 processing engines (PENs),\(^{6}\) and the DPUCAHX8L core is configured with 1 PEN. Thus, the DPUCAHX8H design has a total of 14 PENs, and the DPUCAHX8L design has 2 PENs. Each PEN can process a separate image of a batch, allowing the DPU to process multiple images in parallel. FlexCNN, however, is optimized for latency with a single VSA, so it processes the images of a batch sequentially. We notice that for such a low-bit (fixed 8-bit) data format, LUTs and FFs dominate the resource utilization in FlexCNN, as they are used along with the DSPs to implement the compute units of the VSA. The Vitis AI DPUs, however, are designed in RTL and make greater use of the DSPs to implement the arithmetic logic. In terms of on-chip memory, FlexCNN consumes less URAM than both DPUs and slightly more BRAM than the DPUCAHX8H design. In terms of frequency, FlexCNN's design achieves the highest working frequency of 256 MHz.
| Design | Configuration | Frequency (MHz) | LUT | FF | BRAM | URAM | DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DPUCAHX8H [1] | 3 cores (5 + 5 + 4 PENs) | 150 | 52.0% | 45.2% | 13.5% | 93.3% | 82.7% |
| DPUCAHX8L [2] | 2 cores (1 + 1 PENs) | 250 | 32.7% | 23.0% | 22.8% | 65.0% | 54.3% |
| FlexCNN | 16 \(\times\) 8 \(\times\) 16 VSA | 256 | 59.5% | 38.3% | 16.7% | 13.5% | 10.9% |
Table 20. Hardware Utilization of FlexCNN Accelerator and Vitis AI DPUs on U280
Table 21 compares the performance of E-Net on the Vitis AI DPUs and on FlexCNN's design in terms of throughput and latency. First, the E-Net model used in the Vitis AI experiments has a slightly higher operation count (GOPs). After investigation, we found that, unlike the original E-Net model in Reference [38], each pair of asymmetric convolution layers (Figure 5) is implemented as a single convolution layer with 5 \(\times\) 5 filters in the Vitis AI E-Net model, which explains the slight increase in GOP count. The DPUCAHX8H design achieves the highest throughput of 1,057.8 GOP/s, delivering 123 frames/s. However, such a high throughput comes from using a batch size of 14. The inference latency of E-Net on DPUCAHX8H is not reported in Reference [3], but we can calculate lower and upper bounds for it. The lower bound is \(\frac{1}{FPS}\) (8.1 ms), which would mean the 14 PENs run sequentially; this is very unlikely, because it defeats the purpose of deploying 3 cores with 14 PENs. The upper bound is \(\frac{Batch\ Size}{FPS}\) (113.8 ms), which corresponds to the 14 PENs running in parallel and is much more likely. Thus, FlexCNN's design most likely has a comparable or better inference latency than DPUCAHX8H. The same analysis applies to VGG-16 in Table 24. For the DPUCAHX8L design, FlexCNN delivers \(2.7\times\) faster inference. Moreover, while the DPUCAHX8L design achieves lower latencies than DPUCAHX8H for various CNNs [3] (see Table 24 for the VGG-16 results), it surprisingly has the slowest inference for E-Net; FlexCNN's design achieves both higher throughput and lower latency than DPUCAHX8L. Finally, in terms of performance density, DPUCAHX8H achieves the highest GOP/s/kLUT and GOP/s/DSP, while FlexCNN, written in HLS, achieves a higher GOP/s/DSP than DPUCAHX8L.
| Design | Model Complexity (GOPs) | Batch Size | Frames/s | Latency (ms) | Throughput (GOP/s) | GOP/s/kLUT | GOP/s/DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DPUCAHX8H | 8.60 | 14 | 123.0 | 8.1–113.8 | 1,057.8 | 1.5606 | 0.1417 |
| DPUCAHX8L | 8.60 | 2 | 8.1 | 175.4 | 69.7 | 0.1637 | 0.0142 |
| FlexCNN | 7.58 | 1 | 15.1 | 66.01 | 114.6 | 0.1478 | 0.1166 |

Table 21. E-Net Performance Comparison with Vitis AI [3] (Image Size: 512 \(\times\) \(1,\!024\))
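The latency bounds quoted above follow from simple arithmetic on the reported frame rate and batch size; the following snippet reproduces the numbers used for the DPUCAHX8H row of Table 21.

```python
# Bounds on per-image latency when only frames/s and batch size are reported:
# if the batch images are processed one after another, each takes 1/FPS;
# if all of them are processed in parallel, each takes batch_size/FPS.
def latency_bounds_ms(fps, batch_size):
    lower = 1000.0 / fps
    upper = 1000.0 * batch_size / fps
    return lower, upper

lo, hi = latency_bounds_ms(fps=123.0, batch_size=14)
print(f"{lo:.1f} ms - {hi:.1f} ms")  # 8.1 ms - 113.8 ms, as quoted in Table 21
```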

8.7 Comparison with Other Frameworks

In this subsection, we compare FlexCNN with other FPGA-based DNN frameworks in terms of the scope of the frameworks and the performance of their respective architectures.

8.7.1 Scope of the Framework.

The scope and the families of DNNs that a framework can support depend on the architecture it employs. Most previous works, such as DNNWeaver [42], Angel-Eye [23, 24], Caffeine [52, 60], fpgaConvNet [50], DNNBuilder [62], 2D & 3D CNN [43], Cloud-DNN [11], DNNVM [54], DNNExplorer [63], and 3D-VNPU [18], focused on architectures that support normal convolution, fully connected (FC), pooling, and activation and batch normalization (Act & BN) layers. These layers are sufficient for simple sequential CNNs such as AlexNet and VGG-16. Beyond these common layers, however, many CNNs contain additional layer types such as depth-wise convolution, dilated convolution, transposed convolution, upsampling, and bilinear upsampling. Therefore, these previous frameworks cannot support complex CNNs with diverse layer types and branching\(^\dagger\) graph topologies such as OpenPose, U-Net, and E-Net. To accelerate a wide range of real-world CNN applications, FlexCNN supports all the aforementioned layer types except fully connected layers, which have become less common recently; many popular models, such as MobileNet-V2 (used in OpenPose), are employed as feature-extraction backbones without their FC layers. While FlexCNN and these previous works target CNNs, the FP-DNN [22] framework features an architecture that supports recurrent neural networks (RNNs) in addition to CNNs; we will explore this direction in future work. Finally, Vitis AI [4] is a comprehensive AI compiler supporting CNNs, RNNs, and natural language processing models (NLPs). Table 22 summarizes the scope of all these frameworks.
| Framework | DNNs |
| --- | --- |
| DnnWeaver [42] | CNNs |
| Angel-Eye [24] | CNNs |
| DAC'17 [52] | CNNs |
| FP-DNN [22] | CNNs, RNNs |
| Caffeine [60] | CNNs |
| fpgaConvNet [50] | CNNs |
| DNNBuilder [62] | CNNs |
| 2D & 3D CNN [43] | CNNs |
| Cloud-DNN [11] | CNNs |
| DNNVM [54] | CNNs |
| DNNExplorer [63] | CNNs |
| 3D-VNPU [18] | CNNs |
| Vitis AI [4] | CNNs, RNNs, NLPs |
| FlexCNN (ours) | CNNs |

Table 22. Scope of the Frameworks
\(^\dagger\)Branching means that the CNN's graph is not sequential but rather contains multiple branches connected with add or concat layers.
Aside from the types of DNNs, an important aspect of the scope of a framework or architecture is the model size it can handle. Some frameworks, such as DNNBuilder [62], create dedicated hardware modules for each CNN layer, consuming most of the FPGA fabric resources and limiting them to small CNNs with few layers. FlexCNN, in contrast, places no limit on the model size, as it stores the weights off-chip and time-shares the same hardware modules across all layers, as sketched below.
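The sketch below is only a conceptual illustration of this time-sharing idea, with hypothetical `accelerator` and `ddr` placeholder objects; it is not FlexCNN's actual runtime code.

```python
# Conceptual contrast: dedicating a hardware module per layer ties the supported
# model size to on-chip resources, whereas a single time-shared accelerator loops
# over the layer list and streams each layer's weights from off-chip DRAM.
def run_time_shared(layers, accelerator, ddr):
    fmap = ddr.load_input()
    for layer in layers:                      # the same hardware module serves every layer
        weights = ddr.load_weights(layer)     # weights stay in DRAM until the layer runs
        fmap = accelerator.execute(layer, fmap, weights)
    ddr.store_output(fmap)
```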

8.7.2 Performance.

We compare the performance of FlexCNN's generated accelerators with multiple frameworks on the widely used VGG-16 [46] CNN, since it is evaluated by all of these previous frameworks. Similar to some other works, we implemented only the feature extraction part of VGG-16 (the convolution layers) and not the classification part (the last three FC layers), since FlexCNN does not yet have a dedicated FC module; we will consider adding one in future work. For this comparison, we created three designs with various bit widths and data types, detailed in Table 23.
| Design (\(SA\_COL \times SA\_ROW \times SIMD\)) | Data Type | Frequency (MHz) | LUT | FF | BRAM | URAM | DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8 \(\times\) 14 \(\times\) 8 | float 32-bit | 266 | 48.11% | 40.81% | 45.43% | 27.53% | 16.25% |
| 8 \(\times\) 14 \(\times\) 32 | fixed 16-bit | 241 | 39.51% | 26.79% | 45.93% | 25.31% | 37.98% |
| 8 \(\times\) 14 \(\times\) 64 | fixed 8-bit | 198 | 59.77% | 30.51% | 58.43% | 24.69% | 8.70% |

Table 23. VGG-16 Designs and Hardware Utilization on U250 FPGA
We surveyed many previous frameworks targeting CNNs and summarize the results in Table 24. Since we did not implement the FC layers, for a fair comparison each metric in Table 24 uses the format m1 (m2), where m1 refers to feature extraction only (the convolution layers, 30.69 GOPs, 99.6% of VGG-16's operations) and m2 refers to feature extraction plus classification (convolution + FC layers, 30.81 GOPs). Overall, the FlexCNN designs achieve performance better than or comparable to the other frameworks. In terms of throughput, DNNBuilder delivers the highest throughput, followed by DNNVM; Cloud-DNN and DNNExplorer achieve throughput comparable to FlexCNN. In terms of feature-extraction latency, FlexCNN's 8-bit design achieves the lowest latency of 13.18 ms, followed by DNNVM, while DNNBuilder achieves the lowest latency of 15.39 ms for feature extraction plus classification. In terms of performance density, DNNVM has the highest GOP/s/kLUT, followed by the Vitis AI DPUs and DNNBuilder, all of which are implemented and optimized in RTL. FlexCNN's fixed-point designs have GOP/s/kLUT comparable to Caffeine, 2D & 3D CNN, and Cloud-DNN, which are all implemented in Xilinx/AMD HLS. As for DSP performance density, FlexCNN's 8-bit design delivers the highest GOP/s/DSP of 2.179. Finally, an important metric is accelerator efficiency, measured as the ratio between the achieved performance and the peak performance of the accelerator. DNNBuilder achieves the highest efficiency, followed by DNNExplorer, since they exploit layer-level parallelism by deploying an accelerator for each layer (or group of layers) of a CNN model. FlexCNN, in contrast, achieves between 82% and 96% accelerator efficiency (higher than DAC'17, Caffeine, DNNVM, and the Vitis AI DPUs) while using a single systolic array, thanks to dynamic tiling and the other hardware optimizations employed by FlexCNN.
| Framework | Platform | Precision\(^\dagger\) | Frequency (MHz) | Batch Size | Throughput (GOP/s) | Latency (ms) | GOP/s/kLUT | GOP/s/DSP | Actual/Peak Performance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DnnWeaver [42] | Zynq Z020 | FX(16,16) | 150 | 1 | 31.35 (31.38) | - | 0.896 (0.897) | 0.224 (0.224) | - |
| DnnWeaver [42] | Stratix V SGSD5 | FX(16,16) | 200 | 1 | 157.39 (157.51) | - | 1.040 (1.041) | 0.265 (0.265) | - |
| DnnWeaver [42] | Arria 10 GX115 | FX(16,16) | 200 | 1 | 390.02 (361.55) | - | 1.079 (1.000) | 0.290 (0.269) | - |
| Angel-Eye [24] | Zynq Z045 | FX(16,16) | 150 | 1 | 187.80 (136.97) | 163.42 (224.60) | 1.028 (0.750) | 0.241 (0.176) | - |
| DAC'17 [52] | Arria 10 GT115 | FX(16,8) | 232 | 1 | - (1,171.30) | - (26.85) | - (3.742) | - (0.781) | 89.11% (-) |
| Caffeine [60] | UltraScale KU060 | FX(16,16) | 200 | 1 | 310.00 (266.00) | - (101.15) | 3.100 (2.660) | 0.293 (0.251) | 84.93% (72.88%) |
| Caffeine [60] | Virtex 690T | FX(16,16) | 150 | 1 | 488.00 (354.00) | - (65.13) | 1.627 (1.180) | 0.172 (0.125) | 76.72% (55.66%) |
| fpgaConvNet [50] | Zynq Z045 | FX(16,16) | 125 | 1 | 155.81 (-) | 249.50 (-) | - | 0.182 (-) | - |
| DNNBuilder [62] | UltraScale KU115 | FX(16,16) | 235 | 1 | - (2,011.00) | - (15.39) | - (7.799) | - (0.466) | - (99.1%) |
| DNNBuilder [62] | UltraScale KU115 | FX(8,8) | 235 | 2 | - (4,022.00) | - (15.39) | - (15.597) | - (0.931) | - (99.1%) |
| 2D & 3D CNN [43] | Virtex 690T | FX(16,16) | 150 | 1 | - (570.00) | - (54.06) | - (3.257) | - (0.414) | - |
| 2D & 3D CNN [43] | UltraScale VU440 | FX(16,16) | 200 | 1 | - (821.00) | - (37.53) | - (4.829) | - (0.597) | - |
| Cloud-DNN [11] | UltraScale VU9P | FX(16,16) | 125 | 1 | - (1,068.37) | - (28.96) | - (1.397) | - (0.200) | - |
| Cloud-DNN [11] | UltraScale VU9P | FX(16,16) | 214 | 1 | - (1,828.61) | - (16.92) | - (2.645) | - (0.342) | - |
| DNNVM [54] | UltraScale ZU2 | FX(8,8) | 330 | 1 | 334 (-) | 91.90 (-) | 15.215 (-) | 1.722 (-) | 87.9% (-) |
| DNNVM [54] | UltraScale ZU9 | FX(8,8) | 330 | 3 | 2,820 (-) | 17.24 (-) | 23.94 (-) | 1.829 (-) | 69.6% (-) |
| DNNExplorer [63] | UltraScale KU115 | FX(16,16) | 200 | 1 | 1,702.30 (-) | 18.05 (-) | - (-) | 0.363 (-) | 95.8% (-) |
| 3D-VNPU [18] | UltraScale ZCU102 | FX(8,8) | 200 | 1 | 1,150 (-) | 26.69 (-) | - (-) | 1.123 (-) | - |
| DPUCAHX8H [1] | Alveo U280 | FX(8,8) | 150 | 14 | - (5,812.07) | - (5.30–74.23) | - (8.575) | - (0.779) | - (67.6%) |
| DPUCAHX8L [2] | Alveo U280 | FX(8,8) | 250 | 2 | - (3,272.75) | - (18.83) | - (7.688) | - (0.409) | - (40.9%) |
| FlexCNN (ours) | Alveo U250 | FL(32,32) | 266 | 1 | 458.6 (-) | 66.92 (-) | 0.632 (-) | 0.082 (-) | 96.2% (-) |
| FlexCNN (ours) | Alveo U250 | FX(16,16) | 241 | 1 | 1,543.4 (-) | 19.89 (-) | 2.262 (-) | 0.331 (-) | 89.3% (-) |
| FlexCNN (ours) | Alveo U250 | FX(8,8) | 198 | 1 | 2,329.1 (-) | 13.18 (-) | 2.256 (-) | 2.179 (-) | 82.1% (-) |

Table 24. VGG-16 Performance Comparison with Other Frameworks (Image Size: 224 \(\times\) 224)
Note: the m1 (m2) format is used for multiple results, where m1 refers to the Conv. layers only and m2 refers to Conv. + FC layers.
\(^\dagger\)FX/FL(a,w): a = activations, w = weights, FX = fixed-point, and FL = floating-point data types.
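The accelerator-efficiency figures in Table 24 can be reproduced from the array dimensions and clock frequency, assuming peak performance counts two operations per MAC lane per cycle (an assumption consistent with the peak numbers in Tables 16 and 24); the helper names below are ours.

```python
# Peak performance of a SA_COL x SA_ROW x SIMD array: 2 ops (multiply + add) per
# MAC lane per cycle, expressed in GOP/s for a clock frequency given in MHz.
def peak_gops(sa_col, sa_row, simd, freq_mhz):
    return 2 * sa_col * sa_row * simd * freq_mhz / 1e3

def efficiency(achieved_gops, peak):
    return achieved_gops / peak

# Performance density is simply the achieved GOP/s per kLUT or per DSP used.
def density(achieved_gops, kluts, dsps):
    return achieved_gops / kluts, achieved_gops / dsps

# Example with FlexCNN's float design for VGG-16: 8 x 14 x 8 lanes at 266 MHz.
peak = peak_gops(8, 14, 8, 266)          # ~476.7 GOP/s
print(f"{efficiency(458.6, peak):.1%}")  # ~96.2%, matching Table 24
```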

9 Conclusion

In this work, we presented the end-to-end FlexCNN framework for accelerating CNNs on FPGA. Our framework targets the main challenges of accelerating modern CNNs. The first challenge stems from the disparity within layers of the same type, which results in different computation and communication requirements; as a solution, we proposed architectural techniques such as dynamic tiling, layer fusion and layer parallelization, and data layout optimizations. The second challenge arises from convolution variants such as transposed and dilated convolution, which, if not processed efficiently, lead to significant underutilization of the FPGA's computation resources due to the large number of redundant zero operations. For this, we proposed a versatile systolic array that handles all these layer types efficiently with a small area overhead compared to a standard SA. The third challenge is the software overhead in the end-to-end runtime of CNN inference; to mitigate it, we proposed a software-hardware pipelining technique that overlaps these overheads with the hardware kernel execution. Finally, we presented our automated compilation flow that takes a CNN model in ONNX format, maps it to the FlexCNN architecture, finds the best hardware parameters and tiling factors using a DSE, and generates accelerators in either Xilinx/AMD HLS or TAPA HLS.

Footnotes

1. Open Neural Network Exchange.
2. Jason Cong has a financial interest in AMD.
4. Asymmetric convolution layers are the same as N-CONV layers but use non-square filter sizes, like 1 \(\times\) 5 filters.
5. Based on TensorFlow “same” padding.
6. Processing engines are abbreviated as PENs so as not to be confused with the systolic array processing elements (PEs).

References

[1]
DPUCAHX8H Resource Utilization. (n.d.). Retrieved from https://docs.xilinx.com/r/en-US/pg367-dpucahx8h/Resource-Utilization.
[2]
DPUCAHX8L Resource Utilization. (n.d.). Retrieved from https://docs.xilinx.com/r/en-US/pg366-dpucahx8l/Resource-Utilization.
[5]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.
[6]
Jinguji Akira, Tomoya Fujii, Shimpei Sato, and Hiroki Nakahara. 2018. An FPGA realization of OpenPose based on a sparse weight convolutional neural network. In International Conference on Field-Programmable Technology (FPT’18). IEEE, 310–313.
[7]
Lin Bai, Yiming Zhao, and Xinming Huang. 2018. A CNN accelerator on FPGA using depthwise separable convolution. IEEE Trans. Circ. Syst. II: Express Briefs 65, 10 (2018), 1415–1419.
[8]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition. 7291–7299.
[9]
Kuo-Wei Chang and Tian-Sheuan Chang. 2020. Efficient accelerator for dilated and transposed convolution with decomposition. In IEEE International Symposium on Circuits and Systems (ISCAS’20). IEEE, 1–5.
[10]
Qinyu Chen, Yan Huang, Rui Sun, Wenqing Song, Zhonghai Lu, Yuxiang Fu, and Li Li. 2020. An efficient accelerator for multiple convolutions from the sparsity perspective. IEEE Trans. Very Large Scale Integ. Syst. 28, 6 (2020), 1540–1544.
[11]
Yao Chen, Jiong He, Xiaofan Zhang, Cong Hao, and Deming Chen. 2019. Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 73–82.
[12]
Yu-Ting Chen, Jason Cong, Zhenman Fang, Jie Lei, and Peng Wei. 2016. When Spark meets FPGAs: A case study for next-generation DNA sequencing acceleration. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’16).
[13]
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[14]
Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with optimized dataflow architecture. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
[15]
Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending high-level synthesis for task-parallel programs. In IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’21). IEEE, 204–213.
[16]
Jason Cong and Jie Wang. 2018. PolySA: Polyhedral-based systolic array auto-compilation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
[17]
Jason Cong, Peng Wei, and Cody Hao Yu. 2018. From JVM to FPGA: Bridging abstraction hierarchy via optimized deep pipelining. In 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’18).
[18]
Huipeng Deng, Jian Wang, Huafeng Ye, Shanlin Xiao, Xiangyu Meng, and Zhiyi Yu. 2021. 3D-VNPU: A flexible accelerator for 2D/3D CNNs on FPGA. In IEEE 29th Annual International Symposium on Field-programmable Custom Computing Machines (FCCM’21). IEEE, 181–185.
[19]
Xinkai Di, Hai-Gang Yang, Yiping Jia, Zhihong Huang, and Ning Mao. 2020. Exploring efficient acceleration architecture for winograd-transformed transposed convolution of GANs on FPGAs. Electronics 9, 2 (2020), 286.
[20]
Chao Dong, Chen Change Loy, and Xiaoou Tang. 2016. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision. Springer, 391–407.
[21]
Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 (2016).
[22]
Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’17). IEEE, 152–159.
[23]
Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2016. Angel-Eye: A complete design flow for mapping cnn onto customized hardware. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI’16). IEEE, 24–29.
[24]
Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2017. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 37, 1 (2017), 35–47.
[25]
Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, and Jason Cong. 2021. AutoBridge: Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 81–92.
[26]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[27]
Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: Running deep neural networks on android smartphones. In European Conference on Computer Vision (ECCV’18). 0–0.
[28]
Dongseok Im, Donghyeon Han, Sungpill Choi, Sanghoon Kang, and Hoi-Jun Yoo. 2019. DT-CNN: Dilated and transposed convolution neural network accelerator for real-time image segmentation on mobile devices. In IEEE International Symposium on Circuits and Systems (ISCAS’19). IEEE, 1–5.
[29]
Ildoo Kim. 2018. tf-pose-estimation. Retrieved from https://github.com/ildoonet/tf-pose-estimation.
[30]
Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning. PMLR, 1857–1865.
[31]
Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 26th International Conference on Field Programmable Logic and Applications (FPL’16). IEEE, 1–9.
[32]
Yuhong Li, Xiaofan Zhang, and Deming Chen. 2018. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In IEEE Conference on Computer Vision and Pattern Recognition. 1091–1100.
[33]
Shuanglong Liu, Hongxiang Fan, Xinyu Niu, Ho-cheung Ng, Yang Chu, and Wayne Luk. 2018. Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA. ACM Trans. Reconfig. Technol. Syst. 11, 3 (2018), 1–22.
[34]
Shuanglong Liu and Wayne Luk. 2019. Towards an efficient accelerator for DNN-based remote sensing image segmentation on FPGAs. In 29th International Conference on Field Programmable Logic and Applications (FPL’19). IEEE, 187–193.
[35]
Wenjian Liu, Jun Lin, and Zhongfeng Wang. 2019. USCA: A unified systolic convolution array architecture for accelerating sparse neural network. In IEEE International Symposium on Circuits and Systems (ISCAS’19). IEEE, 1–5.
[36]
De G. Matthews, G. Alexander, Mark Van Der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. 2017. GPflow: A Gaussian process library using TensorFlow. J. Mach. Learn. Res. 18, 1 (2017), 1299–1304.
[37]
Daniel H. Noronha, Bahar Salehpour, and Steven J. E. Wilton. 2018. LeFlow: Enabling flexible FPGA high-level synthesis of TensorFlow deep neural networks. In 5th International Workshop on FPGAs for Software Programmers. VDE, 1–8.
[38]
Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016).
[39]
Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
[40]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 234–241.
[41]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[42]
Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
[43]
Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, and Chunyuan Zhang. 2018. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 97–106.
[44]
Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA’17). IEEE, 535–547.
[45]
Laurent Sifre and Stéphane Mallat. 2014. Rigid-motion scattering for image classification. École Normale Supérieure, Département d’Informatique, Ph.D. Dissertation.
[46]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[47]
Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. 2020. End-to-end optimization of deep learning applications. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 133–139.
[48]
Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. ACM, 16–25.
[49]
Wei Ren Tan, Chee Seng Chan, Hernán E. Aguirre, and Kiyoshi Tanaka. 2017. ArtGAN: Artwork synthesis with conditional categorical GANs. In IEEE International Conference on Image Processing (ICIP’17). IEEE, 3760–3764.
[50]
Stylianos I. Venieris and Christos-Savvas Bouganis. 2018. fpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. 30, 2 (2018), 326–342.
[51]
Xuechao Wei, Yun Liang, Xiuhong Li, Cody Hao Yu, Peng Zhang, and Jason Cong. 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
[52]
Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In 54th Annual Design Automation Conference. ACM, 29.
[53]
Xilinx. 2018. Vivado design suite user guide - high-level synthesis (UG902). https://docs.xilinx.com/v/u/2018.2-English/ug902-vivado-high-level-synthesis.
[54]
Yu Xing, Shuang Liang, Lingzhi Sui, Xijie Jia, Jiantao Qiu, Xin Liu, Yushun Wang, Yi Shan, and Yu Wang. 2019. DNNVM: End-to-end compiler leveraging heterogeneous optimizations on FPGA-based CNN accelerators. IEEE Trans. Comput.-Aid. Des. Integ. Circ. Syst. 39, 10 (2019), 2668–2681.
[55]
Xuan Yang, Mingyu Gao, Jing Pu, Ankita Nayak, Qiaoyi Liu, Steven Emberton Bell, Jeff Ou Setter, Kaidi Cao, Heonjae Ha, Christos Kozyrakis et al. 2018. DNN dataflow choice is overrated. arXiv preprint arXiv:1809.04070 (2018).
[56]
Amir Yazdanbakhsh, Michael Brzozowski, Behnam Khaleghi, Soroush Ghodrati, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. 2018. FlexiGAN: An end-to-end solution for FPGA acceleration of generative adversarial networks. In IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’18). IEEE, 65–72.
[57]
Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
[58]
Yunxuan Yu, Tiandong Zhao, Mingyu Wang, Kun Wang, and Lei He. 2020. Uni-OPU: An FPGA-based uniform accelerator for convolutional and transposed convolutional networks. IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 28, 7 (2020), 1545–1556.
[59]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. ACM, 161–170.
[60]
Chen Zhang, Guangyu Sun, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2018. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput.-Aid. Des. Integ. Circ. Syst. 38, 11 (2018), 2072–2085.
[61]
Ning Zhang, Xin Wei, He Chen, and Wenchao Liu. 2021. FPGA implementation for CNN-based optical remote sensing object detection. Electronics 10, 3 (2021), 282.
[62]
Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. 2018. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In International Conference on Computer-aided Design. ACM, 56.
[63]
Xiaofan Zhang, Hanchen Ye, Junsong Wang, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. 2020. DNNExplorer: A framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. In 39th International Conference on Computer-aided Design. 1–9.

