FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA

Published: 11 March 2023

Abstract

With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for three reasons: (1) the different dimensions within same-type layers, (2) the different convolution layer types, especially transposed and dilated convolutions, and (3) CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve a 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, up to 5.98× and 13.42× for transposed and dilated convolutions, respectively, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.

1 Introduction

Convolutional Neural Networks (CNNs) are widely used in many machine learning (ML) applications and have evolved quickly over the years. There is a growing interest in FPGA for accelerating CNN computation due to its high energy efficiency and performance (e.g., References [6, 7, 22, 31, 37, 44, 48, 52, 59, 60, 62]). However, the recent advancement in CNN models and FPGA-based CNN acceleration has brought several new challenges.
Challenge 1: Performance disparity within CNN layers of the same type: In CNNs, layers of the same type (normal convolution layers, for instance) can have different characteristics in terms of their input and output number of channels, feature map size, and kernel size. This changes the computation to communication (CTC) ratio from layer to layer. Therefore, it is important to handle these layers differently given the performance disparity across them. We found that tiling factors play an important role in performance. Zhang et al. [59] showed that the CTC ratio of a single convolution layer varies with different tiling factors. Yang et al. [55] highlighted the importance of choosing proper tiling factors for data reuse in the nearer, faster memory (on-chip storage for FPGAs) for the overall latency and energy efficiency. These studies lead us to consider using different tiling factors across the network. Figure 1 depicts how different tiling factors can affect the performance of each layer in one CNN network. We compare the performance of using a single set of tiling factors (uniform tiling) to using different tiling factors for each layer (dynamic tiling). For the uniform tiling, we chose the tiling factor that minimizes the latency of the entire network. For the dynamic tiling, we focused on each layer and selected the best tiling factor accordingly. Experimental results show that dynamic tiling can speed up the performance of the whole network by \(1.7\times\).
Fig. 1.
Fig. 1. Performance comparison of designs using uniform and dynamic tiling factors for the first 24 convolutional layers in the CNN network in Figure 3.
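To make the dependence of the CTC ratio on tiling factors concrete, the following Python sketch estimates the CTC ratio of a single convolution layer under a candidate tiling. It is a deliberately simplified reuse model with a hypothetical layer shape, not the analytical model used by our design space exploration (Section 6.2): it assumes input tiles are re-read once per output-channel tile and weights once per spatial tile.

```python
import math

def ctc_ratio(H, W, N, M, K, Th, Tw, Tn, Tm):
    """MACs per off-chip word for one tiled convolution layer, assuming input tiles
    are re-read once per output-channel tile and weights once per spatial tile."""
    macs = H * W * N * M * K * K
    in_words = math.ceil(M / Tm) * N * (H + K - 1) * (W + K - 1)
    wt_words = math.ceil(H / Th) * math.ceil(W / Tw) * N * M * K * K
    out_words = M * H * W
    return macs / (in_words + wt_words + out_words)

# The same (hypothetical) layer under three tilings: the CTC ratio changes substantially.
for tiling in [(8, 24, 16, 16), (16, 48, 16, 32), (32, 96, 32, 64)]:
    print(tiling, round(ctc_ratio(96, 96, 96, 128, 3, *tiling), 1))
```

Combined with the on-chip buffer limits and the divisibility effects discussed in Section 4.4, this is why the best tiling choice becomes layer-dependent.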
Challenge 2: The inefficiency of general-purpose CNN accelerators in processing special CNN layers: Many modern CNNs feature complex architecture topologies with different layer types. One of these special layers is a fractionally strided or transposed convolution (T-CONV) layer [21] (also referred to as a deconvolution layer). It is an upsampling layer that uses trained weights to produce enlarged high-resolution feature maps. T-CONV layers are often used in image segmentation networks and generative adversarial networks such as U-Net [40], DCGAN [39], ArtGAN [49], DiscoGAN [30], FSRCNN [20], to name a few. An atrous or dilated convolution (D-CONV) layer is another special layer that maintains the resolution and coverage of feature maps by expanding the receptive fields of convolution filters as discussed in Reference [57]. One famous network that uses D-CONV layers is CSRNet [32]. Some CNNs include a mixture of convolution layers such as E-Net [38], where normal convolution (N-CONV), transposed convolution, dilated convolution, and asymmetric convolution (A-CONV) layers are used. Both T-CONV and D-CONV layers can be naïvely implemented as normal convolution layers. However, such implementations introduce many zeros in the input feature maps of T-CONV layers and in the convolution filters of D-CONV layers, leading to a huge underutilization of the FPGA resources. To tackle this problem, we use a decomposition-based approach (discussed in Section 5) to implement N-CONV, T-CONV, and D-CONV layers efficiently in one versatile systolic array on an FPGA with minimal area overhead. Moreover, other networks such as MobileNetV1 [26] use depth-wise separable convolution layers introduced in Reference [45] to decrease the computation cost. MobileNetV2 [41] introduced residual bottleneck block (RBB) to further reduce the computation complexity. These layers reduce the computation cost but keep the same feature map size; this can make the layer more communication-bound and reduce the computation efficiency.
Challenge 3: Integration overheads of using FPGA in ML frameworks: When processing a CNN application in a modern ML framework such as TensorFlow [5], the complete stack consists of reading the input, computing the CNN, processing the result, and displaying and writing the result. Previous works have only focused on optimizing the CNN kernel on FPGA (e.g., References [6, 7, 22, 31, 48, 52, 59, 62]), since CNN computation is the most time-consuming step of the whole stack; the remaining overheads are simply ignored. While several works [22, 37] have focused on accelerator generation from TensorFlow-described networks, they did not address the challenges of integrating an accelerator into TensorFlow. By integrating our accelerator with TensorFlow, we are able to directly run networks from TensorFlow on an FPGA. Integrating an FPGA into TensorFlow introduces a new set of overheads: the communication between TensorFlow and the FPGA and the communication between the host and the FPGA kernel itself. Figure 2 shows the breakdown of the end-to-end runtime for processing a 384 \(\times\) 384 RGB image using the network in Figure 3. These steps are listed and described in Section 7. The CNN processing time on our accelerator, denoted as the kernel, accounts for only 11.8% of the total runtime. This emphasizes the need for end-to-end SW/HW co-optimization. Our experiments show that this optimization can increase the end-to-end performance of this network from 4.8 FPS to 23.8 FPS, leading to a 5\(\times\) speedup.
Fig. 2.
Fig. 2. Runtime breakdown of an FPGA-based CNN acceleration pipeline in TensorFlow.
Fig. 3.
Fig. 3. OpenPose-V2 CNN architecture.
To solve the challenges above, we propose an FPGA-based CNN framework named FlexCNN. Its architecture employs dynamic tiling, layer fusion, and data layout transformation to adapt to the performance disparity of different CNN layers. Another major component of the architecture is our versatile systolic array, which can efficiently process different convolution layer types. The framework has a compilation flow that takes a CNN as an input, performs design space exploration, and generates an optimized hardware accelerator to run on FPGA. The accelerator is further integrated into a software-hardware pipeline to mitigate the large integration overheads by overlapping the software execution with the hardware computation.
A preliminary version of FlexCNN [47] was published in FPGA 2020. The new contributions in this article include: (1) a novel efficient versatile systolic array for normal, transposed, dilated, and asymmetric convolution layers; (2) ONNX support to handle multiple ML frameworks including TensorFlow, PyTorch, and Caffe; (3) code generation for the new TAPA [15] framework, which is integrated with AutoBridge [25] to improve design frequency; and (4) the implementations of U-Net, E-Net, and VGG-16 CNNs on FPGA using the FlexCNN framework.
In summary, the overall contributions of this work are:
An efficient, flexible, and composable dataflow architecture employing dynamic tiling, layer fusion, and data layout optimization to support a wide variety of CNNs;
A novel versatile systolic array that can efficiently process normal, transposed, dilated, and asymmetric convolution layers;
An automated compilation flow that takes a CNN dataflow graph as an input, maps it to the hardware dataflow graph, and performs a design space exploration to generate an optimized accelerator on FPGA;
A software-hardware pipelining scheme that can improve the end-to-end performance of CNNs;
Real-time efficient implementations of OpenPose, U-Net, E-Net, and VGG-16 CNNs on FPGA.

2 Framework Overview

FlexCNN is an end-to-end framework for automatic hardware acceleration of CNNs on FPGA. FlexCNN implements a flexible and composable dataflow architecture that can be tailored for a variety of complex real-world CNNs. Section 4 discusses the architecture and its multiple optimization techniques, such as dynamic tiling, layer fusion and layer parallelization, and data layout optimization. Another important component of the FlexCNN architecture is our novel versatile systolic array, which we discuss in Section 5. The versatile systolic array can efficiently process N-CONV, T-CONV, D-CONV, and A-CONV layers. Section 6 reviews the automated compilation flow. It takes an ONNX CNN model and an ordered list of FlexCNN modules as inputs, then outputs an optimized FPGA accelerator. The compilation tool maps the CNN dataflow graph to the given FlexCNN architecture, performs design space exploration for the best hardware parameters, and generates the synthesizable code for the architecture. Furthermore, FlexCNN implements a software-hardware pipelining technique (discussed in Section 7) to overlap the software overheads with the hardware execution, reducing the end-to-end runtime of a CNN’s inference.

3 Applications

This section introduces the new layer types and building blocks used in the three real-world CNN applications: OpenPose, U-Net, and E-Net. It then highlights the application domain, architecture, and layer types of each CNN.

3.1 New Layers and Building Blocks

3.1.1 Depthwise Separable Convolution (DSC).

In a normal convolution layer (N-CONV), the feature maps are filtered and combined in one step. The DSC splits this step into two phases. The first phase, depthwise convolution (DW), does the filtering, and the second phase, pointwise convolution (PW), combines the produced filtered feature maps using 1 \(\times\) 1 kernels.
A conv layer takes \(N\) feature maps as the input, each of size \(H \times W\). It uses \(M \times N \times K \times K\) kernels to produce M channels for the output. The total computation cost for this layer is \(M \times N \times H \times W\) \(\times\) \(K \times K\).
However, a DSC uses \(N \times K \times K\) kernels for DW and \(M \times N \times 1 \times 1\) kernels for PW. With this change, the amount of computation is reduced to a fraction \(\frac{1}{M} + \frac{1}{K^2}\) of the original cost [26].
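As a quick numerical check of this ratio (the layer dimensions below are hypothetical, chosen only for illustration):

```python
def conv_macs(H, W, N, M, K):
    """MACs for a standard convolution with stride 1 and 'same' output size."""
    return M * N * H * W * K * K

def dsc_macs(H, W, N, M, K):
    """MACs for a depthwise separable convolution: K x K depthwise + 1 x 1 pointwise."""
    return N * H * W * K * K + M * N * H * W

# Example: a 3 x 3 layer with 256 input and 256 output channels on a 56 x 56 map.
H, W, N, M, K = 56, 56, 256, 256, 3
ratio = dsc_macs(H, W, N, M, K) / conv_macs(H, W, N, M, K)
print(ratio, 1 / M + 1 / K ** 2)   # both ~0.115, i.e., roughly an 8.7x reduction
```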

3.1.2 Residual Bottleneck Block.

Google introduced RBB in MobileNetV2 [41] to reduce the computation cost. It consists of a 1 \(\times\) 1 conv followed by a 3 \(\times\) 3 DW and then another 1 \(\times\) 1 conv, each of which is followed by ReLU and a batch normalization layer. The 1 \(\times\) 1 convolutions are used for dimension reduction or restoration. The nature of this block allows us to reduce the number of input and output channels. This reduces the computation intensity and makes the network more efficient.

3.1.3 Special Convolution Layers.

Recent CNNs have introduced variations of the normal convolution layers such as:
A fractionally strided or transposed convolution (T-CONV) layer [21] (also referred to as a deconvolution layer) is an upsampling layer that uses trained weights to produce enlarged high-resolution feature maps. T-CONV layers are often used in image segmentation networks and generative adversarial networks such as U-Net [40], DCGAN [39], ArtGAN [49], DiscoGAN [30], and FSRCNN [20].
An atrous or dilated convolution (D-CONV) layer is another special layer that maintains the resolution and coverage of feature maps by expanding the receptive fields of convolution filters as discussed in Reference [57]. One famous network that uses D-CONV layers is CSRNet [32].
An asymmetric convolution (A-CONV) layer is a normal convolution layer that uses asymmetric filter sizes such as 1 \(\times\) 5 or 5 \(\times\) 1 filters. In terms of hardware acceleration, this layer requires extra logic to handle each dimension of the filters separately.

3.2 OpenPose

OpenPose [8], the winner of the COCO 2016 Keypoints Challenge, detects 2D poses of multiple people in an image. The OpenPose network first extracts the features of the input image using the first 10 layers of VGG-19 [46]. This is the backbone of the network. These feature maps are the inputs to a two-branch network. The first branch detects confidence maps, representing body part locations, and the second branch detects part affinity fields, a set of 2D vectors showing the location and orientation of the limbs. The results of these two branches are concatenated with the feature maps from the backbone network and form the input for the next stage. After several iterations, these branches produce the final predictions.
This network is interesting to us, since it has an irregular architecture compared to modern CNN-based deep-learning applications. Instead of just a linear forward path where each layer consumes the result of its previous layer, it has concatenation layers that need extra data movement. Moreover, to reduce the computation complexity of the network, we use a modified version of OpenPose [29] that replaces the backbone with a modification of MobileNetV2 [41] and employs DSC [45] for the rest of the network, following the trend in the ML community. Figure 3 depicts the network topology of this version; we call this network OpenPose-V2. Due to the space limitation, we only show the convolutional layers. Each convolution is followed by ReLU and batch normalization layers.

3.3 U-Net

U-Net [40] is a famous CNN used for biomedical image segmentation. It is an encoder-decoder neural network, where the encoder includes four downsampling blocks, and the decoder part is made of four upsampling blocks. Furthermore, the outputs of each downsampling block are added with the inputs of the corresponding upsampling block. Figure 4 illustrates U-Net’s architecture and building blocks. We chose U-Net with its irregular architecture topology and various layer types, including T-CONV layers, to show the effectiveness of our versatile systolic array for a real-world application. Table 1 demonstrates the breakdown of U-Net’s layers and the number of Giga floating-point operations (GFLOPs).
Fig. 4.
Fig. 4. U-Net downsampling and upsampling blocks and CNN architecture.
Table 1.
Layer Type | Layer Count | Actual GFLOPs
N-CONV | 13 | 9.9
T-CONV | 4 | 2.1
MaxPool | 4 | 0
Add | 8 | 0
Batch Normalization | 9 | 0
Leaky ReLU | 13 | 0
Total GFLOPs | | 12.0
Table 1. U-Net Layers and Number of Operations

3.4 E-Net

E-Net [38] is a lightweight CNN used for pixel-wise semantic segmentation in real time. Like U-Net, E-Net has an encoder-decoder structure, where the feature maps are downsampled by a factor of 4 and then upsampled by a factor of 4 to restore the original image size. The network is made of bottleneck blocks. Each bottleneck block gets its input from the preceding block, processes the data in two branches, and then merges the two branches with an Add layer to be sent to the next block. The first branch contains a MaxPooling layer for the encoder blocks, an upsampling layer for the decoder blocks, or it can be empty for the intermediate blocks. The second branch contains N-CONV layers, D-CONV layers, A-CONV layers, or T-CONV layers. Figure 5 illustrates E-Net’s architecture and bottleneck blocks. We chose E-Net to test our framework with such a complex CNN topology and various layer types. E-Net represents a stress test for our compilation framework, which needs to map the complex CNN graph into the FlexCNN architecture. E-Net also contains all four types of convolution layers, which makes it a perfect test case for our versatile systolic array. All the layers used in E-Net and the number of Giga operations (GOPs) are shown in Table 2. The logical GOPs is the number of operations including the zero multiply-accumulate (MAC) operations of T-CONV and D-CONV layers.
Fig. 5.
Fig. 5. E-Net bottleneck blocks and CNN architecture.
Table 2.
Layer Type | Layer Count | Actual GOPs | Logical GOPs\(^{*}\)
N-CONV | 70 | 0.886 | 0.886
D-CONV | 8 | 0.191 | 7.878
T-CONV | 3 | 0.015 | 0.062
A-CONV | 8 | 0.106 | 0.106
MaxPool | 3 | 0 | 0
PReLU | 66 | 0 | 0
Add | 27 | 0 | 0
Batch Normalization | 6 | 0 | 0
ReLU | 15 | 0 | 0
Upsample | 2 | 0 | 0
Concat | 1 | 0 | 0
Total GOPs | | 1.198 | 8.931
Table 2. E-Net Layers and Number of Operations
\(^{*}\)Number of operations including the zeros for T-CONV and D-CONV layers.

4 FlexCNN Architecture

4.1 A Composable Architecture

FlexCNN is a composable dataflow architecture made up of a number of streaming modules that are connected as a directed graph based on the target CNN architecture. It is flexible and composable in the sense that modules can be reordered, new modules can be added, or some modules can be removed, depending on the target CNN graph. This is particularly important, since state-of-the-art CNNs usually come with new special layers that rigid accelerators struggle to process efficiently. Thus, if a CNN layer type is not supported yet, then the user would simply need to develop a single module for that layer. Currently, FlexCNN has modules to support a variety of CNN layers (see Table 3).
Table 3.
Module | Layers Supported
Standard Systolic Array | normal convolution layers
Versatile Systolic Array | normal, transposed, dilated, and asymmetric convolution layers
DW Conv | depth-wise convolution layer
Act & BN | activation (ReLU, ReLU6, PReLU, Leaky ReLU) and batch normalization layers
Add | piece-wise addition of two layers
Concat | concatenation of two layers
Upsample | nearest neighbor or bilinear upsampling layers
Pool | max-pooling or average pooling layers
Table 3. Current Modules and Their Descriptions
Since FlexCNN is a dataflow architecture, it can be thought of as a coarse-grain pipeline where modules are the pipeline stages. To avoid pipeline stalls, we make sure that all modules are fully pipelined with an initiation interval of 1, meaning that each module produces and consumes data every clock cycle. Therefore, the overall latency is calculated as the latency of the longest pipeline stage + the latency of filling and draining the pipeline. Since convolution layers are the most compute-intensive, the longest pipeline stage is the latency of the systolic array modules. Furthermore, FlexCNN supports CNNs in different data types, including float 32-bit, fixed 16-bit, and fixed 8-bit. Figures 6, 7, and 8 show the architectures for OpenPose, U-Net, and E-Net CNNs, respectively.
Fig. 6.
Fig. 6. FlexCNN with a standard SA for OpenPose.
Fig. 7.
Fig. 7. FlexCNN with a versatile SA for U-Net.
Fig. 8.
Fig. 8. FlexCNN with a versatile SA for E-Net.

4.2 Modules

We implement line-buffer-based streaming architectures for the DW Conv, Act & BN, Add, Pool, and Upsample modules using a stencil-based architecture similar to the one in Reference [14]. All these modules are parameterized by the factors shown in Table 4, which are explored by the design space exploration (DSE) engine (covered in Section 6.2) for optimal performance. We apply double buffering in both the Reader modules and the Writer module. Furthermore, if the outputs of the whole layer can fit into the on-chip buffer, then the data will be pushed into on-chip buffers and directly fetched by the Reader to save the off-chip communication time.
Table 4.
Design Parameters | Explanation
\(Th(k), Tw(k), Tn(k), Tm(k)\) | Tiling factors for \(H\), \(W\), \(N\), and \(M\) for layer \(k\)
\(SIMD\) | SIMD lanes for all modules
\(SA\_ROW, SA\_COL\) | Rows and columns of the systolic array kernel
Table 4. Design Parameters and Explanations

4.3 Layer Fusion and Layer Parallelization

Due to the limited fast on-chip FPGA memory (BRAMs and URAMs), it is usually necessary to use the slow off-chip memory (DRAM), especially for large CNN models. Thus, the intermediate tensors of layers are loaded from the DRAM using the Reader modules, processed by the FlexCNN compute modules, and then written back to DRAM using the Writer module. An important feature of FlexCNN is that each module can be enabled or disabled dynamically during runtime to process or bypass the data flowing through that module. This feature allows FlexCNN to employ layer fusion and layer parallelization, where one DRAM read and write can process multiple CNN layers and reduce off-chip communication, thus improving the hardware utilization of the FPGA. Layer fusion applies to sequential layers from the original CNN graph. Layer parallelization applies to layers that are parallel in the original CNN graph. For example, in a downsampling bottleneck block of E-Net (Figure 5), \(L1\) can be fused with the previous or following ReLU layers on the same branch (layer fusion), and \(L1\) can be executed in parallel with \(L2\), the MaxPool layer of the downsampling block (layer parallelization). Section 6.1 examines the mapping of CNN layers to the FlexCNN architecture in detail.

4.4 Dynamic Tiling

Tiling is applied when processing the network to improve data locality and minimize communication. Table 4 summarizes the tiling factors employed in FlexCNN, where \(N\) corresponds to the number of input feature maps, \(H\) and \(W\) to the height and width of the input feature maps, and \(M\) to the number of output feature maps. When the tiling factors are not sub-multiples of the tiled dimensions, redundant computation is introduced, which degrades the performance of the design. As explained in Section 1, in a normal CNN network, the types and configurations of different layers vary from each other. Therefore, the optimal tiling factors will differ as well. We have observed that using a uniform tiling factor for the whole network leads to up to a 1.7\(\times\) performance slowdown compared to the ideal case of using different tiling factors across layers. Therefore, in this work, we apply dynamic tiling by re-configuring the tiling factors of the accelerator on-the-fly for different layers to maximize the performance. Supporting dynamic tiling introduces some hardware overhead; however, this overhead is negligible compared to the performance improvement. Section 8 evaluates the impacts of this technique in detail.
Previous works such as References [44, 51, 62] have also emphasized the need for different tiling factors across layers. Our architecture is distinguished from previous work by changing all the tiling factors of each layer dynamically, whereas previous work only adjusted part of the tiling factors or used several on-chip accelerators, each with its own uniform tiling factors. Equation (1) shows the restrictions on the tiling factors.
\begin{align} \begin{split} Tw(k) &= c_1 \times SA\_COL\\ Tm(k) &= c_2 \times SA\_ROW\\ Tn(k) &= c_3 \times SIMD\\ Tm(k) &= Tn(k+1) \end{split} \end{align}
(1)
In FlexCNN, the width and output channels of the feature maps are mapped to the columns and rows of the SA, respectively. As a result, for each layer, \(Tw(k)\) and \(Tm(k)\) should be multiples of their respective SA dimensions. The reduction over multiple input channels is computed in parallel inside each PE of the SA, which is defined as the SIMD lane. This implies that \(Tn(k)\) should be a multiple of the SIMD lane width. \(Th(k)\) can be any arbitrary value.
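The sketch below illustrates these constraints in Python; it is only an illustration with hypothetical layer dimensions, not the DSE code (which additionally models latency, resources, and the cross-layer constraint \(Tm(k) = Tn(k+1)\)).

```python
import math

def is_legal(Tw, Tm, Tn, Tn_next, sa_col, sa_row, simd):
    """Equation (1): Tw, Tm, Tn must be multiples of the SA columns, SA rows, and
    SIMD width, and Tm of this layer must match Tn of the next layer."""
    return (Tw % sa_col == 0 and Tm % sa_row == 0 and
            Tn % simd == 0 and Tm == Tn_next)

def padding_overhead(W, M, N, Tw, Tm, Tn):
    """Extra-work ratio when the tiled dimensions are not multiples of the tiles."""
    rup = lambda x, t: math.ceil(x / t) * t
    return (rup(W, Tw) * rup(M, Tm) * rup(N, Tn)) / (W * M * N)

# For an 8 x 8 array with SIMD = 8: Tw = 48 divides W = 96 exactly, while Tw = 64
# is also legal but wastes about 33% of the work on this (hypothetical) layer.
print(is_legal(48, 32, 32, 32, 8, 8, 8), padding_overhead(96, 64, 32, 48, 32, 32))
print(is_legal(64, 32, 32, 32, 8, 8, 8), padding_overhead(96, 64, 32, 64, 32, 32))
```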
As mentioned before, the computation in the DW Conv module can be seen as a stencil kernel. Figure 9 depicts the 3 \(\times\) 3 stencil window connected by line buffers. As depicted in the figure, at each cycle, the line buffers fetch one pixel from a feature map and the data are shifted by one location. The length of the first two lines (for a general case, the first \(K-1\) lines with K being the filter size) is determined by \(Tw(k)\). After all the registers in the line buffers are filled with data (\((K-1) \times Tw(k) + K\) cycles), the computation can start by convolving the registers marked in black with the respective filter. Since the SA module needs to fetch SIMD elements in each cycle, the architecture in Figure 9 is duplicated SIMD times with each one fetching the data from a different feature map. As the length of the line buffer determines the \(Tw(k)\), each line should have “\(\max _{k} Tw(k)\)” registers. We realize dynamic tiling by connecting consecutive rows of the line buffer via a MUX, enabling data feeding from different locations.
Fig. 9.
Fig. 9. Architecture support for dynamic tiling in the Depth Conv module for a 3 \(\times\) 3 kernel with Tw of size 6/8/10.
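The following Python snippet is a behavioral model (not the HLS implementation) of the 3 \(\times\) 3 window in Figure 9: the shift register is provisioned for the largest supported Tw, and the tap positions play the role of the MUX that selects where each row of the window starts for the currently configured Tw. Border handling between rows is omitted for brevity.

```python
from collections import deque

def stencil_windows(pixels, Tw, K=3, max_Tw=10):
    """Yield the K x K neighborhood ending at each pixel of a row-major stream
    whose row length is Tw (element order is not significant for this sketch)."""
    depth = (K - 1) * max_Tw + K            # registers provisioned for the largest Tw
    buf = deque([0.0] * depth, maxlen=depth)
    for cycle, px in enumerate(pixels):
        buf.appendleft(px)                  # data shifts by one location per cycle
        if cycle >= (K - 1) * Tw + K - 1:   # line buffers filled for this Tw
            # taps at offset r * Tw + c mimic the MUX-selected row connections
            yield [buf[r * Tw + c] for r in range(K) for c in range(K)]

windows = list(stencil_windows(range(36), Tw=6))   # e.g., a 6-wide tile, 6 rows
```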

4.5 Data Layout Optimization

Data layout optimizations are applied to reduce the number of accesses to DRAM and increase the effective DRAM bandwidth. The first optimization is on the concatenation layers. A CNN network may contain blocks that concatenate the results of several layers. As shown in Figure 3, after each stage in the OpenPose-V2 network, results from two branches will be concatenated with the first outputs from the backbone network. This then serves as the inputs for the following stages. Figure 10 presents the optimized data organization of the network.
Fig. 10.
Fig. 10. Data organization for OpenPose.
The outputs of the backbone (region B) and each stage (region A, C) are placed close to each other, as shown in Figure 10. To be more specific, the outputs of Stage 1 will be written to region A. Regions A and B will serve as the inputs of Stage 2. In Stage 2, the outputs will be written to region C. The regions B and C will serve as the inputs of Stage 3, similarly. The outputs of each stage are written to regions A and C in a round-robin fashion. With this layout, the outputs of stage branches are concatenated on-the-fly, eliminating unnecessary off-chip DRAM movements.
To further improve the effective DRAM bandwidth, we change the data layout of the feature maps from \(N(k) \times H(k) \times \frac{W(k)}{Tw(k)} \times Tw(k)\) to \(\frac{N(k)}{Tn(k)} \times H(k) \times \frac{W(k)}{Tw(k)} \times Tn(k) \times Tw(k)\). This allows us to increase the burst length from \(Tw(k)\) to \(Tn(k) \times Tw(k)\). A DSC layer can easily become communication-bound because of its low computation to communication (CTC) ratio, since it is mostly using 1 \(\times\) 1 convolution kernels. In this case, when the kernel size of the next layer is 1 \(\times\) 1, since there is no overlapped region between different tiles, we further change the data layout to \(\frac{N(k)}{Tn(k)} \times \frac{H(k)}{Th(k)} \times \frac{W(k)}{Tw(k)} \times Tn(k) \times Tw(k) \times Th(k)\). It further increases the burst length for these layers to \(Tn(k) \times Th(k) \times Tw(k)\). For other kernel sizes, padding is applied, because a tile of \(Tn(k) \times Tw(k) \times Th(k)\) does not have all the data needed for the computation. We need to have \((p-1)\) and \(((p-1) \times Th(k) + (p-1)^2)\) extra DRAM accesses with burst length of \(Tn(k) \times Tw(k)\) and \(Tn(k)\), respectively, to fetch all the data (\(p\) denoting the kernel size). This increases the number of DRAM accesses with a burst length of \(Tn(k)\), which further increases the communication time and prevents us from applying this data layout.
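In NumPy terms, these re-layouts are just reshape/transpose operations applied before the tensors are written to DRAM; the sketch below uses hypothetical sizes and is only meant to make the index order explicit.

```python
import numpy as np

def tile_nhw(fm, Tn, Tw):
    """(N, H, W) -> (N/Tn, H, W/Tw, Tn, Tw): burst length grows from Tw to Tn*Tw."""
    N, H, W = fm.shape
    t = fm.reshape(N // Tn, Tn, H, W // Tw, Tw)
    return np.ascontiguousarray(t.transpose(0, 2, 3, 1, 4))

def tile_nhw_1x1(fm, Tn, Th, Tw):
    """(N, H, W) -> (N/Tn, H/Th, W/Tw, Tn, Tw, Th): used only when the next kernel
    is 1 x 1 (no halo needed); burst length grows to Tn*Th*Tw."""
    N, H, W = fm.shape
    t = fm.reshape(N // Tn, Tn, H // Th, Th, W // Tw, Tw)
    return np.ascontiguousarray(t.transpose(0, 2, 4, 1, 5, 3))

fm = np.arange(32 * 48 * 96, dtype=np.float32).reshape(32, 48, 96)  # hypothetical sizes
a = tile_nhw(fm, Tn=8, Tw=24)             # shape (4, 48, 4, 8, 24)
b = tile_nhw_1x1(fm, Tn=8, Th=12, Tw=24)  # shape (4, 4, 4, 8, 24, 12)
```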

5 The Versatile Systolic Array

5.1 Problem Formulation

5.1.1 Transposed Convolution.

At first glance, transposed convolution seems to be a completely different operation from a normal convolution. As shown in Figure 11, one T-CONV operation is a scalar multiplication of an input pixel by a \(K \times K\) filter, and the output result (of size \(K \times K\)) is placed in the output feature map (FM) separated by a distance determined by the T-CONV stride (\(S^\prime\)). The overlapping results in the output feature map are then added together to give the final output feature map.
Fig. 11.
Fig. 11. T-CONV original operation (\(K=3,S^\prime =2\)).
This same operation can be performed as a normal convolution by inserting \(S^\prime -1\) zeros between adjacent pixels of the input feature maps and convolving a reversed filter with stride \(S=1,\) as shown in Figure 12. Note that the gray zeros are part of the padding, which is required for N-CONV layers as well.
Fig. 12.
Fig. 12. Naïve computation of T-CONV (\(K=3,S^\prime =2\)).
Let \(N\), \(I_h\), \(I_w\), \(M\), \(O_h\), and \(O_w\) represent the channels, height, and width of the input and output FMs, respectively. This naïve implementation requires \(K^2 N M O_h O_w\) multiply-accumulate (MAC) operations (\(3^2\) \(\times\) 4 \(\times\) 4 MACs in Figure 12), but the non-zero MAC operations are only \(K^2 N M I_h I_w\) (\(3^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 14). The ideal speedup for T-CONV is given by Equation (2)
\begin{equation} \begin{gathered} Transposed\ Convolution\ Ideal\ Speedup = \frac{O_h O_w}{I_h I_w} = \frac{S^\prime I_h \times S^\prime I_w}{I_h I_w} = S^{\prime 2} \end{gathered} \end{equation}
(2)

5.1.2 Dilated Convolution.

Similar to transposed convolution, dilated convolution can be naïvely implemented as a normal convolution operation by inserting \(d-1\) zeros between the filters’ values (Figure 13), where \(d\) is the D-CONV dilation rate. The number of MAC operations using this method is \((dK-d+1)^2 NO_hO_wM\) (\(3^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 13). However, the effectual non-zero MAC operations are only \(K^2NO_hO_wM\) (\(2^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 15). Equation (3) gives the ideal speedup for D-CONV.
\begin{equation} Dilated\ Convolution\ Ideal\ Speedup = \frac{(dK-d+1)^2}{K^2} \end{equation}
(3)
Fig. 13.
Fig. 13. Naïve computation of D-CONV (\(K=2,d=2\)).
Now, the problem is how to design a versatile SA that can eliminate the ineffectual zero MAC operations to achieve the theoretical ideal speedups with minimal area overhead.
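Equations (2) and (3) are easy to sanity-check numerically; the values printed below correspond to the configurations used later in the layer tests of Section 8.3 (Table 11).

```python
def tconv_ideal_speedup(s):
    """Equation (2): ideal T-CONV speedup depends only on the stride S'."""
    return s ** 2

def dconv_ideal_speedup(K, d):
    """Equation (3): ideal D-CONV speedup for filter size K and dilation rate d."""
    return (d * K - d + 1) ** 2 / K ** 2

print(tconv_ideal_speedup(2), tconv_ideal_speedup(3), tconv_ideal_speedup(4))  # 4, 9, 16
print(round(dconv_ideal_speedup(5, 2), 2),   # 3.24
      round(dconv_ideal_speedup(4, 3), 2),   # 6.25
      round(dconv_ideal_speedup(3, 5), 2))   # 13.44
```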

5.2 Approach

Previous FPGA works attempted to accelerate either T-CONV layers, such as References [19, 33, 34, 56, 58], or D-CONV layers, such as Reference [61], but not both. However, some ASIC works proposed versatile accelerators for T-CONV and D-CONV layers. References [10] and [35] target general sparsity including T-CONV and D-CONV layers. Reference [28] uses a systolic array with delay cells to skip the zero MAC operations. Reference [9] proposes a decomposition approach that decomposes T-CONV and D-CONV layers into dense N-CONV layers. However, none of these previous works discussed the area overhead of supporting T-CONV and D-CONV layers efficiently. Table 5 summarizes these works.
Table 5.
Work | Device | N-CONV Support | T-CONV Support | D-CONV Support | Design Generation
FlexCNN (ours) | FPGA | Yes | Yes | Yes | Automatic
Electronics 2021 [61] | FPGA | Yes | No | Yes | Manual
VLSI 2020 [58] | FPGA | Yes | Yes | No | Automatic
Electronics 2020 [19] | FPGA | Yes | Yes | No | Manual
FCCM 2018 [56] | FPGA | Yes | Yes | No | Automatic
ISCAS 2020 [9] | ASIC | Yes | Yes | Yes | Manual
ISCAS 2019 [28] | ASIC | Yes | Yes | Yes | Manual
VLSI 2020 [10] | ASIC | Yes | Yes | Yes | Manual
ISCAS 2019 [35] | ASIC | Yes | Yes | Yes | Manual
Table 5. Versatile SA Comparison with Other Works
We chose to base our work on the decomposition approach in Reference [9], since it requires the least changes and area overhead to a standard SA. However, their work did not provide enough formulation and details on ideal speedups and filter/feature map decomposition for arbitrary filter size, T-CONV stride (\(S^\prime\)), and dilation rate (\(d\)). In this section, we illustrate the decomposition approach and provide a decomposition algorithm for T-CONV and D-CONV.

5.2.1 Decomposition of T-CONV Operation.

The decomposition of T-CONV operation gets rid of the non-effectual zero MAC operations by decomposing the convolution filters into \(S^{\prime 2}\) sub-filters that convolve over the dense input feature maps producing the same outputs as the naïve implementation, as shown in Figure 14.
Fig. 14.
Fig. 14. Efficient computation of T-CONV (\(K=3,S^\prime =2\)).

5.2.2 Decomposition of D-CONV Operation.

The decomposition of dilated convolution is more straightforward. While filters are decomposed in T-CONV, the input feature maps of D-CONV are decomposed into \(d^2\) sub-feature maps. Each sub-feature map contains non-contiguous pixels separated by a distance \(d-1\), as shown in Figure 15.
Fig. 15.
Fig. 15. Efficient computation of D-CONV (\(K=2,d=2\)).

5.2.3 Unified Decomposition Algorithm.

Algorithm 1 formulates the decomposition of an \(N \times N\) 2-D input matrix \(I\) given a constant \(Z\), where \(I\) is a dense filter and \(Z=S^\prime\) for T-CONV or \(I\) is a dense input FM and \(Z=d\) for D-CONV. The algorithm has two steps: First, it gets the height and width dimensions of each sub-matrix of \(I\). Second, it gets the values of each sub-matrix. After that, it returns the decomposed sub-matrices of \(I\).
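Algorithm 1 itself is not reproduced here; the strided-slicing sketch below is our Python rendering of the two steps it describes (sub-matrix shapes, then values). The offset and alignment bookkeeping, including the filter reversal used in the naïve T-CONV formulation, is omitted.

```python
import numpy as np

def decompose(I, Z):
    """Split a dense 2-D matrix I into Z*Z strided sub-matrices. I is a filter and
    Z = S' for T-CONV, or I is an input feature map and Z = d for D-CONV."""
    subs = []
    for r in range(Z):
        for c in range(Z):
            subs.append(I[r::Z, c::Z])   # rows r, r+Z, ... and columns c, c+Z, ...
    return subs

# T-CONV example matching Figure 14: a 3 x 3 filter with S' = 2 decomposes into
# sub-filters of shape (2, 2), (2, 1), (1, 2), and (1, 1).
f = np.arange(9).reshape(3, 3)
print([s.shape for s in decompose(f, 2)])   # [(2, 2), (2, 1), (1, 2), (1, 1)]
```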

5.3 The Versatile Systolic Array

The versatile systolic array efficiently supports four types of convolutional layers, i.e., N-CONV, T-CONV, D-CONV, and A-CONV layers. Figure 16 illustrates the high-level architecture of the versatile SA. Both T-CONV and D-CONV layers can be naïvely implemented as N-CONV layers by inserting zeros in the input FMs of a T-CONV layer or in the filters of a D-CONV layer, as shown in Figures 12 and 13. However, this naïve implementation leads to huge underutilization of computation resources due to the zero MAC operations. To the best of our knowledge, this is the first efficient FPGA implementation of N-CONV, T-CONV, and D-CONV in one systolic array.
Fig. 16.
Fig. 16. The versatile SA architecture.

5.3.1 The Architecture and Dataflow of the VSA.

To implement the decomposition approach in a systolic array, we used the open-source framework PolySA [16]. The systolic array is output-stationary and is made of \(SA\_COL \times SA\_ROW\) PEs, each of which contains \(SIMD\) MAC engines. The SA has \(SA\_COL\) input-feed modules that feed input feature maps to the top-row PEs, flowing down to the bottom of the SA, and it has \(SA\_ROW\) weight-feed modules that feed filters to the leftmost column of PEs, flowing to the rightmost column of the SA, as shown in Figure 16. Each PE contains a local buffer to accumulate the results of the convolution. Once a PE’s local buffer has the final convolution results, it sends its outputs down the SA to be collected by \(SA\_COL\) out-collect modules.

5.3.2 T-CONV Implementation.

In a normal convolution layer with a 3 \(\times\) 3 filter and one input feature map, each output pixel is computed as the dot product of the filter by the corresponding input pixels. This translates to 9 MAC operations on one of the PE’s local registers. In a T-CONV layer with a 3 \(\times\) 3 filter, \(S^\prime =2\), and one input feature map, the 9 MAC operations are decomposed into \(S^{\prime 2}=4\) N-CONV operations with \((2\) \(\times\) \(2),(2\) \(\times\) \(1),(1\) \(\times\) \(2),\) and \((1\) \(\times\) \(1)\) sub-filters, as illustrated in Figure 14. The MAC operations of the four sub-filters are computed in four different registers. Thus, changing the address of result accumulations is the only modification in the PEs. At this stage, the output pixels in the PEs are not organized. We avoided implementing data reorganization in the PEs and shifted that logic to the out-collect modules (Figure 16), since there are more PEs than out-collect modules in the SA, which minimizes area overheads.

5.3.3 D-CONV Implementation.

The decomposition approach for D-CONV is slightly different from T-CONV. Instead of decomposing the filters, the input feature maps are decomposed, as illustrated in Figure 15. The input-feed modules send non-contiguous input pixels based on the dilation rate, while the weight-feed modules send the dense filters as it does for N-CONV. For example, in a D-CONV operation with a 2 \(\times\) 2 filter, \(d=2\), and one input feature map, each output pixel is computed through 4 MAC operations on the same register using the dense filter but with non-contiguous input pixels. The input pixels are separated by a distance of \(d-1=1\) in this case. D-CONV does not require data reorganization, as all output pixels have the same number of MAC operations.

6 Compilation Framework

The compilation framework takes a CNN graph and an ordered list of the modules needed for that CNN as inputs and generates an optimized FPGA accelerator. The compilation framework has three major components (Figure 17). This section discusses the components of the framework in detail.
Fig. 17.
Fig. 17. Compilation system.

6.1 CNN Layer Mapper

6.1.1 ONNX.

While the original FlexCNN framework supported TensorFlow CNNs only, the updated framework uses Open Neural Network Exchange (ONNX). This is an open-source framework that establishes open standards for representing machine learning algorithms and software tools. The ONNX representation supports multiple famous ML frameworks such as TensorFlow, PyTorch, Caffe, and ScikitLearn, to name a few. ONNX compacts a deep neural network (DNN) model in a single file. This file contains: (1) the DNN’s graph, where each node represents a DNN layer and each edge represents the data flow from one node to another, and (2) the DNN’s parameters, mainly weights and biases.
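For reference, extracting the graph and the parameters from an ONNX file takes only a few lines with the official onnx Python package; the file name below is hypothetical.

```python
import onnx

model = onnx.load("enet.onnx")                     # hypothetical model file
graph = model.graph
# Initializers hold the trained parameters (mainly weights and biases).
params = {init.name for init in graph.initializer}
for node in graph.node:                            # each node is one DNN layer
    data_inputs = [name for name in node.input if name not in params]
    print(node.op_type, data_inputs, "->", list(node.output))
```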

6.1.2 CNN Layer Mapping.

Now, having a compact representation of any CNN, it is easier for the CNN Layer Mapper to map the nodes of the CNN to the ordered list of FlexCNN modules. This is the component that performs layer fusion and layer parallelization. The architecture must have modules to support all the CNN’s layers. In most CNNs, the convolution layers are the most compute-intensive operations, and the SA is the bottleneck module. Our mapping algorithm iterates through each convolution node and checks to see if the predecessor, successor, or parallel nodes of the convolution node can be mapped to the ordered list. It then outputs a list of layer bundles that are sent to the design space exploration, which is discussed next.
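A simplified sketch of the successor-fusion part of this step is shown below; the op-type-to-module map and the graph_succ() helper are placeholders, and the real mapper also handles predecessor and parallel nodes as described above.

```python
FUSIBLE = {"Relu": "Act & BN", "BatchNormalization": "Act & BN",
           "MaxPool": "Pool", "Add": "Add", "Concat": "Concat"}   # illustrative subset

def bundle(conv_node, graph_succ):
    """Attach fusible successors of one convolution node to the same hardware pass.
    graph_succ(node) is a placeholder returning a node's successors in the CNN graph."""
    layers, cur = [conv_node], conv_node
    while True:
        nxt = [n for n in graph_succ(cur) if n.op_type in FUSIBLE]
        if len(nxt) != 1:          # stop at a fork, a join, or an unsupported layer
            break
        layers.append(nxt[0])
        cur = nxt[0]
    return layers
```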

6.2 Design Space Exploration

Given the network, the accelerator architecture, and the FPGA’s resources information, we will perform the design space exploration to select the optimal design parameters that minimize the inference latency of the CNN when run on the target FPGA. Table 4 lists the design parameters to be determined.
Two analytical models, resource_est() and latency_est(), are built for estimating the resource usage and latency of designs. Currently, the resource model estimates block RAM (BRAM) and DSP usage, which are usually the bottleneck of designs. The DSE process sweeps through the design space with all feasible combinations of design parameters. For each design parameter list, the resource usage is examined first. Designs that over-utilize the resources are pruned away. Then, we follow a greedy algorithm to select the optimal tiling factors that minimize the latency layer by layer. The DSE process finishes within minutes on a standard workstation.
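A compact sketch of this loop is shown below, with resource_est() and latency_est() passed in as stand-ins for the analytical models (their internals are not shown here), and tilings_for() as a placeholder that enumerates the Equation (1)-legal tilings of a layer.

```python
import itertools

def dse(layers, budget, sa_rows, sa_cols, simds, tilings_for,
        resource_est, latency_est):
    """Sweep the global parameters, prune over-utilizing designs, then greedily pick
    per-layer tiling factors; returns (best_latency, best_config)."""
    best = (float("inf"), None)
    for sa_row, sa_col, simd in itertools.product(sa_rows, sa_cols, simds):
        used = resource_est(sa_row, sa_col, simd)
        if used["BRAM"] > budget["BRAM"] or used["DSP"] > budget["DSP"]:
            continue                                   # prune infeasible designs
        total, tiles = 0.0, []
        for layer in layers:                           # greedy, layer by layer
            lat, tile = min(((latency_est(layer, t, sa_row, sa_col, simd), t)
                             for t in tilings_for(layer, sa_row, sa_col, simd)),
                            key=lambda p: p[0])
            total += lat
            tiles.append(tile)
        if total < best[0]:
            best = (total, {"SA_ROW": sa_row, "SA_COL": sa_col,
                            "SIMD": simd, "tiles": tiles})
    return best
```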

6.3 Design Generation

This step creates the code that is synthesized into the hardware accelerator. Since we are targeting Xilinx/AMD FPGAs, our design generator creates Xilinx/AMD High-Level Synthesis (HLS) code [53]. Generating the bitstream for such complex architectures has been challenging, especially when using large systolic arrays. The bitstream generation task would usually fail the placement and routing step. For this reason, we recently added support to generate TAPA code [15]. TAPA is a dataflow HLS framework that offers fast compilation, and it generates high-frequency designs with the help of AutoBridge [25]. AutoBridge is a tool targeted at large dataflow architectures. It helps the process of placement and routing by placing the dataflow modules evenly across the FPGA fabric and connecting them with pipelining registers to minimize the critical paths of the design.
Now, having the optimal hardware parameters from the DSE, the user can choose to produce Xilinx/AMD HLS code or TAPA code. The code generation is template-based. The original FlexCNN paper used the PolySA [16] compiler to generate a standard systolic array. To automate the process of generating new versatile SAs with different dimensions based on an application target, we integrated our modifications on the standard SA into the PolySA compilation framework to create new versatile SAs with a push of a button. We also used Algorithm 1 and other scripts to automatically prepare test data to run on FPGA.

7 Software-hardware Pipelining

Figure 2 illustrates the software overheads when integrating an FPGA kernel into a machine learning framework like TensorFlow. These overheads can defeat the purpose of hardware acceleration. To overcome this challenge, we use a software-hardware pipelining technique that overlaps the software execution with the hardware kernel execution. We chose TensorFlow as our ML framework, since it is widely used for inference in the ML community (e.g., References [27, 36]). To invoke the FPGA from TensorFlow, we redefine the nodes in the original computation graph. All computation nodes of the CNN are merged into one node that is implemented by the FPGA. The rest of the graph is still processed on the CPU.
When FPGA is connected to TensorFlow, the whole integration stack consists of the following steps: (1) reading the inputs of CNN, (2) pre-processing including stages such as image resizing, (3) re-organizing the initial data layouts in CPU memory, (4) transferring data from CPU to FPGA device memory, (5) computation on FPGA, (6) fetching the results back via PCIe, (7) reformatting and passing it to TensorFlow, (8) non-CNN computation stages on CPU, (9) processing the results (e.g., estimating the human poses based on the attained results and drawing them for the OpenPose network), and (10) writing out and displaying the results.
Figure 2 shows the breakdown of these stages in the OpenPose application for a 384 \(\times\) 384 RGB input. Among the whole pipeline, which takes 208.8 ms, the FPGA computation in Step 5 only requires 11.8% of the total time. The integration overheads have led to an \(8.45\times\) performance slowdown. To reduce these overheads, we have applied an optimized software/hardware pipelining.
A two-level pipelining is applied on the whole integration stack that enables the simultaneous processing of the aforementioned steps. The first level overlaps TensorFlow’s overheads (steps 1, 2, 9, 10) with the rest of the steps. The second one overlaps FPGA’s computation with data movement steps (steps 3, 4, 6, 7).
Figure 18 illustrates the first level of the pipeline, which is applied at the TensorFlow level. The numbers in the figure show the related step number. Steps 1, 2, 9, and 10 and the rest of the steps are assigned to different processes connected by a queue. Therefore, steps 1, 2, 9, and 10 are overlapped with FPGA-related steps. The overall performance is determined by the stage with the longest latency. Pipelining is enabled by exploiting multiprocessing. In other words, each of the steps is assigned to a separate process. These processes pass the data to each other through queues, as shown in Figure 18.
Fig. 18.
Fig. 18. First level of the pipeline.
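The first pipeline level boils down to standard Python multiprocessing with queues. The sketch below uses trivial placeholder functions in place of the real pre-processing, FPGA invocation, and post-processing stages, so it only demonstrates the process-and-queue structure.

```python
import multiprocessing as mp

# Placeholders standing in for the real stages of the integration stack.
def preprocess(frame):            # steps 1-2: read input, resize
    return frame
def run_fpga(item):               # steps 3-7: layout, transfer, kernel, fetch
    return item
def postprocess(result):          # steps 8-10: non-CNN compute, draw, display
    print("frame done:", result)

def fpga_stage(q_in, q_out):
    while (item := q_in.get()) is not None:
        q_out.put(run_fpga(item))
    q_out.put(None)               # propagate the end-of-stream marker

def post_stage(q_out):
    while (item := q_out.get()) is not None:
        postprocess(item)

if __name__ == "__main__":
    q_in, q_out = mp.Queue(maxsize=4), mp.Queue(maxsize=4)
    workers = [mp.Process(target=fpga_stage, args=(q_in, q_out)),
               mp.Process(target=post_stage, args=(q_out,))]
    for w in workers:
        w.start()
    for frame in range(8):        # the producer (steps 1-2) runs in this process
        q_in.put(preprocess(frame))
    q_in.put(None)
    for w in workers:
        w.join()
```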
To further improve the performance, we fully pipeline the communication and computation of the FPGA, which consists of steps 3 to 7. This builds the second level of the pipeline. To allow pipelining, a batch of images is sent to the FPGA. For a certain batch size, the additional latency incurred by batch processing is hidden when the first level of the pipeline is applied. After the FPGA finishes processing the batch, the results are passed back to TensorFlow and the non-CNN computations are done in parallel for all the images. Figure 19 depicts the redefined graph that we use to achieve such a pipeline. With this optimization, the data movement steps are overlapped with the kernel computation, and the latency for the non-CNN computation (Step 8) is amortized over the whole batch. Note that such deep software+hardware pipelining techniques were also used in References [12, 17] for integrating FPGA accelerators into Spark-based applications.
Fig. 19.
Fig. 19. The overview of the Process Graph stage.

8 Experimental Results

8.1 Experiment Setup

As mentioned before, the FlexCNN architecture is described either in Xilinx/AMD HLS [53] or TAPA HLS [15]. The target platforms are Xilinx/AMD Virtex Ultrascale+ VCU1525 and Alveo U250 and U280 Data Center Accelerator Cards. Table 6 demonstrates the generated designs and the corresponding tools and FPGA platforms used for each design.
Table 6.
Target CNN | Code | Xilinx/AMD Tool | Platform | Systolic Array | Precision
OpenPose-V2 | Vivado HLS | SDAccel 2018.3 | VCU1525 | Standard SA | float 32-bit
Individual Layer Tests | Vivado HLS | SDAccel 2018.3 | VCU1525 | Standard SA | float 32-bit
U-Net | Vivado HLS | SDAccel 2018.3 | VCU1525 | Versatile SA | float 32-bit
E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | float 32-bit
E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 16-bit
E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 8-bit
E-Net | TAPA HLS | Vitis 2021.2 | U280 | Versatile SA | fixed 8-bit
VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | float 32-bit
VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 16-bit
VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 8-bit
Table 6. Experiments’ Setup
Observe that the second design in the table, with a standard systolic array, is used to compare the performance of a standard systolic array against the versatile systolic array on individual layers.

8.2 Hardware Optimization

The target FPGA platforms come with four DDR banks. In our implementations, we use two DDR banks, assigning feature maps and weights (including biases) to separate banks. All the architecture choices are parameterizable and can be adjusted based on the target FPGA. We found that the following configuration works best for the OpenPose-V2 application on the Xilinx/AMD VCU1525: the systolic array for our standard conv module is organized as an 8 \(\times\) 8 array with a SIMD factor of 8. For the rest of the modules, we use the same SIMD factor. Table 7 shows the frequency and resource utilization under this configuration.
Table 7.
Precision | Frequency | LUT | FF | BRAM | URAM | DSP
float 32-bit | 242.9 MHz | 43% | 40% | 60% | 15% | 50%
Table 7. Frequency and Resource Utilization of the OpenPose Accelerator
Table 8 shows the benefits of dynamic tiling and data layout transformation. We can see that these optimizations increase the performance by \(2.3\times\). Figure 1 depicts the performance gain of using dynamic tiling in a layer-by-layer fashion for the first 24 convolutional layers. Table 9 shows how applying dynamic tiling and dynamic data layout affects the tiling factors and effective DRAM bandwidth (BW) for the first layer of the last RBB in OpenPose-V2 compared to a design without these optimizations. The kernel size for this layer is 1 \(\times\) 1, which means it can use the optimized data layout with a burst length of \(Tn(k) \times Tw(k) \times Th(k),\) as described in Section 4.5. This data layout, along with the best tiling factor used for this layer, increases the effective DRAM BW and CTC ratio by \(2.8\times\). This results in \(6.1\times\) performance improvement.
Table 8.
Model | Precision | Frequency (MHz) | Runtime (1) (ms) | Runtime (2) (ms)
All Uniform | float 32-bit | 237 | 57.7 | 41.5
All Dynamic | float 32-bit | 242.9 | 35.6 | 24.7
Table 8. Performance on OpenPose-V2
(1): Without applying DRAM organization for concatenation layers.
(2): With applying DRAM organization for concatenation layers.
Table 9.
Model | Th | Tw | Tn | Tm | Eff. DRAM BW (GB/s) | CTC | Throughput (GFLOP/s)
All Uniform | 12 | 48 | 32 | 32 | 4.31 | 14.9 | 24.4
All Dynamic | 12 | 24 | 48 | 48 | 12.05 | 41.3 | 149.2 (\(6.1\times\))
Table 9. Performance Impacts of Dynamic Tiling/Data Layout Transformation
We further test the DSP efficiency of our design on a given convolution layer. Of all the DSPs, 78.7% are used in the standard SA module and 11.2% in the DW Conv module. We measure DSP efficiency in two ways: against the total number of DSPs in the design and against the number of DSPs of the modules used by that layer. All the tests are on a 256 \(\times\) 384 \(\times\) 384 input, producing 256 output channels. Table 10 summarizes the results. DSC layers require roughly \(K^2\times\) less computation than normal convolution layers, making them communication-bound, as shown in Figure 20. This figure depicts that DSC layers fall in the memory-bound region of the roofline model, since they have a lower CTC ratio. Therefore, we achieve lower computation efficiency in these layers. Additionally, it shows that the data layout optimization for the DSC with the \(1 \times 1\) kernel increases the burst length. This helps to increase the effective DRAM bandwidth, leading to a performance improvement over the \(3 \times 3\) DSC.
Fig. 20.
Fig. 20. Layers in Table 10 under the roofline model.
Table 10.
Layer | Runtime (ms) | Throughput (GFLOP/s) | DSP Eff. (total DSPs) | DSP Eff. (used DSPs)
Conv 3 \(\times\) 3 | 709.3 | 245.2 | 73.8% | 93%
Conv 1 \(\times\) 1 | 80.2 | 240.9 | 72.6% | 91.4%
DSC 3 \(\times\) 3 | 113.4 | 176.3 | 53.1% | 58.6%
DSC 1 \(\times\) 1 | 84.1 | 230.8 | 69.5% | 76.7%
Table 10. Performance on Different Convolutional Layers

8.3 The Versatile Systolic Array

To compare the effectiveness of the decomposition approach and the implementation, we conducted tests on the standard SA and the versatile SA using 10 different layers with various filter sizes, T-CONV strides (\(S^\prime\)), and dilation rates (\(d\)), as shown in Table 11.
Table 11.
T-CONV results:
Layer \((N, M, I_{h/w})\) | \(K, S^\prime\) | Standard SA Latency (ms) | Standard SA DSP Eff.\(^{*}\) | Versatile SA Latency (ms) | Versatile SA DSP Eff.\(^{*}\) | Speedup (Ideal\(^\dagger\))
(16,16,16) | 5, 2 | 0.3 | 5.15% | 0.3 | 4.61% | \(0.91\times (4\times)\)
(16,16,256) | 5, 2 | 14.7 | 24.25% | 5.0 | 69.50% | \(2.91\times (4\times)\)
(16,256,16) | 5, 2 | 1.1 | 19.68% | 0.6 | 34.64% | \(1.79\times (4\times)\)
(16,256,256) | 5, 2 | 228.8 | 24.88% | 74.8 | 74.92% | \(3.06\times (4\times)\)
(256,16,16) | 5, 2 | 1.2 | 18.52% | 0.6 | 37.23% | \(2.04\times (4\times)\)
(256,16,256) | 5, 2 | 228.9 | 24.88% | 57.4 | 97.54% | \(3.98\times (4\times)\)
(256,256,16) | 5, 2 | 14.6 | 24.37% | 3.9 | 90.08% | \(3.76\times (4\times)\)
(256,256,256) | 5, 2 | 3,653.0 | 24.94% | 913.6 | 98.14% | \(4.00\times (4\times)\)
(256,256,256) | 3, 3 | 2,960.7 | 11.08% | 329.5 | 97.95% | \(8.98\times (9\times)\)
(256,256,256) | 4, 4 | 9,352.8 | 6.23% | 585.3 | 98.04% | \(15.98\times (16\times)\)

D-CONV results:
Layer \((N, M, I_{h/w})\) | \(K, d\) | Standard SA Latency (ms) | Standard SA DSP Eff.\(^{*}\) | Versatile SA Latency (ms) | Versatile SA DSP Eff.\(^{*}\) | Speedup (Ideal\(^\dagger\))
(16,16,16) | 5, 2 | 0.3 | 3.99% | 0.3 | 4.83% | \(1.23\times (3.24\times)\)
(16,16,256) | 5, 2 | 11.8 | 30.03% | 5.2 | 67.84% | \(2.30\times (3.24\times)\)
(16,256,16) | 5, 2 | 1.0 | 22.04% | 0.5 | 41.77% | \(1.93\times (3.24\times)\)
(16,256,256) | 5, 2 | 185.3 | 30.72% | 57.5 | 97.54% | \(3.23\times (3.24\times)\)
(256,16,16) | 5, 2 | 1.1 | 20.86% | 0.7 | 29.23% | \(1.42\times (3.24\times)\)
(256,16,256) | 5, 2 | 185.2 | 30.74% | 78.0 | 71.80% | \(2.37\times (3.24\times)\)
(256,256,16) | 5, 2 | 11.8 | 30.24% | 3.9 | 89.41% | \(3.00\times (3.24\times)\)
(256,256,256) | 5, 2 | 2,958.5 | 30.79% | 913.6 | 98.14% | \(3.24\times (3.24\times)\)
(256,256,256) | 4, 3 | 3,652.3 | 15.96% | 584.9 | 98.10% | \(6.24\times (6.25\times)\)
(256,256,256) | 3, 5 | 4,419.3 | 7.42% | 329.4 | 98.00% | \(13.42\times (13.44\times)\)
Table 11. Performance of Different T-CONV and D-CONV Layers
\(^{*}\) DSP efficiency is measured as the actual performance using non-zero MAC operations divided by the peak performance (GFLOP/s) of the SA.
\(^\dagger\) Ideal speedup is based on our analysis in Section 5.1 using our systolic array architectures.
Notice that layers with small \(N\), \(M\), or \(I_{h/w}\) have low computation-to-communication ratios, which makes them communication-bound. This explains the low DSP efficiency for these layers. In contrast, the last three layers are computation-bound, and the DSP efficiency of the T-CONV and D-CONV layers is around \(98\%\), while the DSP efficiency of the standard SA is capped at \(\frac{100}{\text{ideal speedup}}\%\). This matches our ideal speedup analysis in Section 5.1.
Table 12 demonstrates the frequency and resource utilization of the versatile SA design and the standard SA design. In terms of area overhead, the versatile SA requires only about 7% more LUTs, 3% more Flip Flops, and around 3% more DSPs. For on-chip memory, the PEs utilize the BRAMs for local buffers, while the input, weight, and output buffers are implemented using URAMs. The PEs’ local buffers are larger in the versatile SA as the decomposition approach requires \(S^{\prime 2}\times\) the size of buffers for T-CONV decomposition. This explains the 24% increase in BRAM utilization. However, the standard SA needs larger weight buffers to accommodate the zeros inserted in the filters, and this explains the lower URAM utilization for the versatile SA.
Table 12.
Design | \(SA\_COL \times SA\_ROW \times SIMD\) | Frequency | LUT | FF | BRAM | URAM | DSP
Versatile SA | 8 \(\times\) 8 \(\times\) 8 | 233.9 MHz | 48% | 43% | 37% | 44% | 45%
Standard SA | 8 \(\times\) 8 \(\times\) 8 | 230.2 MHz | 41% | 40% | 13% | 51% | 42%
Table 12. Frequency and Resource Utilization of the Layer Tests’ and U-Net Accelerators

8.4 Software-hardware Integration Optimization

In this section, we evaluate the effect of our integration optimization on OpenPose-V2. FlexCNN processes one frame in 24.7 ms, which translates to a peak performance of 40.5 FPS. However, without proper optimization, the direct integration into the TensorFlow framework only leads to a performance of 4.8 FPS, as shown in Table 13. Table 13 summarizes the impacts of the two-level pipelining on the overall performance. We use a batch of 16 for the OpenPose network to enable pipelining on the FPGA, since it produces the best performance and the smoothest output when displaying the result. With two-level pipelining, we achieve up to a 5\(\times\) speedup, which leads to a final performance of 23.8 FPS.
Table 13.
Version | Runtime/frame (ms) | Throughput (FPS) | Speedup
Original | 208.8 | 4.8 | 1
1st pipeline | 97.1 | 10.3 | 2.1
2nd pipeline | 42 | 23.8 | 5
Table 13. Performance Impacts of Integration Optimization

8.5 Applications

In this subsection, we evaluate the performance of the three real-world CNNs we implemented on FlexCNN and compare the results with other works.

8.5.1 OpenPose-V2.

To the best of our knowledge, there is only one work [6] that has implemented a variant of OpenPose on FPGA. However, they take a different approach. They reduce the computation cost of the original network by making the weights sparse and using only two stages after the backbone network. Furthermore, they quantized the data to a 16-bit fixed point and stored feature maps and weights on-chip. After these modifications, they reported neither their network’s computation cost nor their architecture’s resource utilization. Thus, we cannot compare our results to theirs directly. Instead, we have compared our results against the network implementation using TensorFlow on CPU and GPU.
The CPU is a 56-core Intel Xeon CPU E5-2680 v4 that operates at 2.40 GHz. For GPU, we use the NVIDIA Tesla V100 GPU, and it uses cuDNN [13] to run the network. To have a fair comparison of the latency of running the network on different platforms, we measure the runtime of a single image inference using OpenPose-V2 network. Table 14 summarizes the results. The runtime considers only the CNN inference time on RGB images of size 384 \(\times\) 384. For both the FPGA and GPU, the time to transfer the data from host to device and device to host is excluded from the measurement.
Table 14.
Platform | Frequency (GHz) | Throughput (GFLOP/s) | Latency (ms) | Frames Per Second (FPS)
CPU | 2.4 | 29 | 99.3 | 10.07
GPU | 1.4 | 114 | 25.3 | 39.53
FPGA (ours) | 0.243 | 117 | 24.7 | 40.49
Table 14. OpenPose-V2 Performance Comparison of Different Platforms (Batch Size 1)

8.5.2 U-Net.

The U-Net CNN model is made of 51 layers. The breakdown of all the layers is shown in Table 1. The T-CONV layers account for 2.1 Giga floating-point operations (GFLOPs), without counting the inserted zeros.
First, we compared U-Net performance with the TensorFlow implementation of the network on CPU and GPU. The CPU is a 56-core Intel Xeon CPU E5-2680 v4 that operates at 2.40 GHz. For GPU, we ran the network on NVIDIA A100-PCIE-40GB operating at 1.4 GHz. We measured the runtime of a single image inference. Table 15 summarizes the results. Similar to the OpenPose-V2 experiment, the runtime considers only the CNN inference time on RGB images, excluding the data transfer time for both the FPGA and GPU.
| Platform | Frequency (GHz) | Throughput (GFLOP/s) | Latency (ms) | Frames Per Second (FPS) |
| --- | --- | --- | --- | --- |
| CPU | 2.4 | 29 | 401 | 2.49 |
| GPU | 1.4 | 160 | 74.9 | 13.35 |
| FPGA (ours) | 0.234 | 207 | 58.2 | 17.18 |

Table 15. U-Net Performance Comparison on Different Platforms (Batch Size 1)
Second, we found two works [33, 34] that implement U-Net on FPGA. The first work uses two separate accelerators: one for N-CONV layers and one for T-CONV layers. Although this approach eliminates the zero MAC operations in T-CONV, its T-CONV accelerator achieves low performance and low DSP efficiency compared to its N-CONV accelerator, as shown in Table 16. The second work achieves better overall performance and DSP efficiency, since it uses 8-bit fixed-point precision and combines DSP and ALM resources to create denser, higher-performance MAC units. However, it does not report the DSP efficiency or performance of the T-CONV and N-CONV layers individually.
| Measure | FlexCNN | TRETS 2018 [33] | FPL 2019 [34] |
| --- | --- | --- | --- |
| Platform | Xilinx/AMD VCU1525 | Xilinx/AMD XC7Z045 | Intel A10 660 |
| Data Type | float 32-bit | fixed 16-bit | fixed 8-bit |
| Frequency (MHz) | 234 | 200 | 200 |
| N-CONV GOPs | 9.9 | 5.6 | N/A |
| T-CONV GOPs | 2.1 | 0.3 | N/A |
| Total GOPs | 12.0 | 5.9 | 27.4 |
| N-CONV GOP/s | 206.5 | 125 | N/A |
| T-CONV GOP/s | 209.8 | 29 | N/A |
| Total GOP/s | 207.0 | 107 | 1,578 |
| Peak GOP/s | 239.5 | N/A | 1,638 |
| T-CONV support | Yes | Yes | Yes |
| D-CONV support | Yes | No | No |

Table 16. U-Net Evaluation against Other Works

8.5.3 E-Net.

Table 2 shows the breakdown of E-Net's layers. The actual giga operations (GOPs) count excludes the zeros inserted for T-CONV and D-CONV layers. Two previous works [9, 28] instead use a notion of logical GOPs for T-CONV and D-CONV layers, which counts the redundant zero MAC operations as if they were real MACs. Although we do not consider it a meaningful measure, we also report it for consistency and comparison purposes. The FlexCNN architecture of E-Net is shown in Figure 8, and we created three designs with float 32-bit, fixed 16-bit, and fixed 8-bit data types. The clock frequency and resource utilization for each design are shown in Table 17. First, we compared the E-Net performance against the CPU and GPU, using the same experimental setup as the U-Net tests. The comparison results are given in Table 18.
| Design (\(SA\_COL \times SA\_ROW \times SIMD\)) | Data Type | Frequency (MHz) | LUT | FF | BRAM | URAM | DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8 \(\times\) 9 \(\times\) 8 | float 32-bit | 241 | 37.22% | 31.02% | 24.63% | 21.56% | 30.58% |
| 16 \(\times\) 9 \(\times\) 16 | fixed 16-bit | 219 | 40.16% | 29.35% | 29.43% | 12.50% | 27.50% |
| 16 \(\times\) 9 \(\times\) 16 | fixed 8-bit | 229 | 43.23% | 29.77% | 22.43% | 9.69% | 8.75% |

Table 17. E-Net Designs and Hardware Utilization on U250 FPGA
| Platform | Frequency (GHz) | Throughput (GFLOP/s) | Latency (ms) | Frames Per Second (FPS) |
| --- | --- | --- | --- | --- |
| CPU | 2.4 | 9.8 | 122.4 | 8.17 |
| GPU | 1.4 | 16.7 | 71.71 | 13.95 |
| FPGA (ours) | 0.241 | 57.2 | 20.95 | 47.62 |

Table 18. E-Net Performance Comparison on Different Platforms (Batch Size 1)
While we did not find any FPGA implementation of E-Net, there are three ASIC-based implementations. The comparison results are shown in Table 19. Compared to Reference [28], our fixed 8-bit and 16-bit designs achieve lower latencies and higher frames per second (FPS), but our actual performance (GOP/s) is slightly lower than theirs. That article reports only the performance (GOP/s) and FPS, not the network's operation count (GOPs). When we back-calculated the operation count from the given GOP/s and FPS numbers, we found it to be 1.4 GOPs, which is higher than ours (1.2 GOPs). They may have included operations from the non-convolution layers, which we did not; this explains why we have higher FPS but lower GOP/s. Similarly, the second work [9] reports neither the number of operations nor the latency or FPS of its implementation. We used our E-Net model to calculate the number of operations for a 512 \(\times\) 512 input image, which is 3.79 GOPs, and given their reported performance (168 GOP/s), we derived the latency and FPS numbers in Table 19. In terms of FPS, our three designs achieve higher rates, although we use a smaller image size. Their work achieves higher performance in terms of GOP/s, but the ASIC frequency is more than \(2\times\) the frequencies of our designs. For Reference [10], we could not make a meaningful comparison, as it reports only the performance and not the input image size, the GOPs, the latency, or the FPS.
| Work | FlexCNN 8 \(\times\) 9 \(\times\) 8 | FlexCNN 16 \(\times\) 9 \(\times\) 16 | FlexCNN 16 \(\times\) 9 \(\times\) 16 | ISCAS 2019 [28] | ISCAS 2020 [9] | VLSI 2020 [10] |
| --- | --- | --- | --- | --- | --- | --- |
| Platform | FPGA | FPGA | FPGA | ASIC | ASIC | ASIC |
| Frequency (MHz) | 241 | 219 | 229 | 200 | 500 | 200 |
| Image Size | 288 \(\times\) 288 | 288 \(\times\) 288 | 288 \(\times\) 288 | 288 \(\times\) 288 | 512 \(\times\) 512 | N/A |
| Data Type (w/a)\(^{*}\) | float 32-bit | fixed 16-bit | fixed 8-bit | fixed 8-bit | fixed 16-bit | fixed 2/16-bit |
| Latency (ms) | 20.95 | 13.86 | 12.92 | 14.62 | 22.55 | N/A |
| FPS | 47.72 | 72.15 | 77.39 | 68.40 | 44.35 | N/A |
| Actual GOP/s | 57.2 | 86.5 | 92.8 | 96.0 | 168.0 | 196.2 |
| Logical GOP/s | 426.2 | 644.4 | 691.2 | 639.7 | 1,377.0 | N/A |

Table 19. E-Net Comparison with Other Works
\(^{*}\)“w” for weights and “a” for activations.
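To make the actual versus logical operation counts in Table 19 concrete, the sketch below estimates both for a single layer under simplifying assumptions (boundary effects and the exact zero-insertion pattern are ignored); it is meant as an illustration rather than the exact counting used for the reported numbers.

```python
# Operation count of a normal convolution layer (one multiply-accumulate = 2 ops).
def conv_ops(h_out, w_out, k, c_in, c_out):
    return 2 * h_out * w_out * k * k * c_in * c_out

# Transposed convolution with stride s, computed as a normal convolution over a
# zero-inserted input: the logical count treats the inserted zeros as real inputs,
# while only roughly 1/s^2 of the filter taps touch non-zero data.
def tconv_ops(h_out, w_out, k, c_in, c_out, stride):
    logical = conv_ops(h_out, w_out, k, c_in, c_out)
    actual = logical // (stride * stride)
    return actual, logical

# Dilated convolution with dilation d: the logical view uses the zero-filled,
# enlarged (k-1)*d + 1 footprint; the actual view uses only the k*k real weights.
def dconv_ops(h_out, w_out, k, c_in, c_out, dilation):
    k_eff = (k - 1) * dilation + 1
    actual = conv_ops(h_out, w_out, k, c_in, c_out)
    logical = conv_ops(h_out, w_out, k_eff, c_in, c_out)
    return actual, logical

print(tconv_ops(64, 64, 3, 64, 64, stride=2))    # actual is ~4x smaller than logical
print(dconv_ops(64, 64, 3, 64, 64, dilation=2))  # logical uses a 5x5 footprint
```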

8.6 Comparison with Vitis AI

Vitis AI [4] is a Xilinx/AMD library for accelerating AI models on Xilinx FPGAs. The library uses optimized deep-learning processor unit (DPU) cores as an overlay, along with a software stack, to accelerate a variety of DNN models. Different DPUs are optimized for different workloads (such as CNNs, RNNs, and NLPs) and different goals, such as latency or throughput. The FlexCNN architecture, in contrast, mainly targets CNNs and focuses on optimizing the latency of CNN inference. In this subsection, we compare the performance of E-Net on FlexCNN and on Vitis AI. Xilinx/AMD reported E-Net performance on the U280 using two different DPUs, DPUCAHX8H [1] and DPUCAHX8L [2]. DPUCAHX8H is optimized for throughput, while DPUCAHX8L is optimized for latency. Both DPUs use fixed-point 8-bit formats. For a fair comparison, we used FlexCNN to generate an accelerator on the U280 with the same 512 \(\times\) \(1,\!024\) input image size.
Table 20 shows the resource utilization of the Vitis AI DPUs and our FlexCNN-generated design on the U280. First, note that Vitis AI deploys multiple DPU cores on the FPGA (3 for DPUCAHX8H and 2 for DPUCAHX8L). The DPUCAHX8H core can be configured with 3, 4, or 5 processing engines (PENs),\(^{6}\) and the DPUCAHX8L core is configured with 1 PEN. Thus, the DPUCAHX8H design has a total of 14 PENs, and the DPUCAHX8L design has 2 PENs. Each PEN can process a separate image of a batch, allowing the DPU to process multiple images in parallel. FlexCNN, however, is optimized for latency with a single VSA, so it processes the images of a batch sequentially. We notice that for such a low-bit (fixed 8-bit) data format, LUTs and FFs dominate the resource utilization in FlexCNN, as they are used along with the DSPs to implement the compute units of the VSA. The Vitis AI DPUs, however, are designed in RTL and make greater use of the DSPs to implement the arithmetic logic. In terms of on-chip memory, FlexCNN consumes less URAM than both DPUs and slightly more BRAM than the DPUCAHX8H design. In terms of frequency, FlexCNN's design achieves the highest working frequency of 256 MHz.
| Design | Configuration | Frequency (MHz) | LUT | FF | BRAM | URAM | DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DPUCAHX8H [1] | 3 cores (5 + 5 + 4 PENs) | 150 | 52.0% | 45.2% | 13.5% | 93.3% | 82.7% |
| DPUCAHX8L [2] | 2 cores (1 + 1 PENs) | 250 | 32.7% | 23.0% | 22.8% | 65.0% | 54.3% |
| FlexCNN | 16 \(\times\) 8 \(\times\) 16 VSA | 256 | 59.5% | 38.3% | 16.7% | 13.5% | 10.9% |
Table 20. Hardware Utilization of FlexCNN Accelerator and Vitis AI DPUs on U280
Table 21 compares the performance of E-Net on the Vitis AI DPUs and on FlexCNN's design in terms of throughput and latency. First, the E-Net model used in the Vitis AI experiments has a slightly higher operation count (GOPs). After investigation, we found that, unlike the original E-Net model in Reference [38], each pair of asymmetric convolution layers (Figure 5) is implemented as a single convolution layer with 5 \(\times\) 5 filters in the Vitis AI E-Net model, which explains the slight increase in GOP count. The DPUCAHX8H design achieves the highest throughput of 1,057.8 GOP/s, delivering 123 frames/s. However, such a high throughput comes from using a batch size of 14. The inference latency of E-Net on DPUCAHX8H is not reported in Reference [3], but we can calculate lower and upper bounds for it. The lower bound is \(\frac{1}{FPS}\) (8.1 ms), which would mean the 14 PENs run sequentially; this is very unlikely, because it defeats the purpose of deploying 3 cores with 14 PENs. The upper bound is \(\frac{Batch\ Size}{FPS}\) (113.8 ms), which corresponds to the 14 PENs running in parallel and is much more likely. Thus, FlexCNN's design most likely has a comparable or better inference latency than DPUCAHX8H. The same analysis applies to VGG-16 in Table 24. For the DPUCAHX8L design, FlexCNN delivers \(2.7\times\) faster inference. Moreover, while the DPUCAHX8L design achieves lower latencies than DPUCAHX8H for various CNNs [3] (see Table 24 for the VGG-16 results), it surprisingly has the slowest inference for E-Net; FlexCNN's design achieves both higher throughput and lower latency than DPUCAHX8L. Finally, in terms of performance density, DPUCAHX8H achieves the highest GOP/s/kLUT and GOP/s/DSP, while FlexCNN, written in HLS, achieves a higher GOP/s/DSP than DPUCAHX8L.
| Design | Model Complexity (GOPs) | Batch Size | Frames/s | Latency (ms) | Throughput (GOP/s) | GOP/s/kLUT | GOP/s/DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DPUCAHX8H | 8.60 | 14 | 123.0 | 8.1–113.8 | 1,057.8 | 1.5606 | 0.1417 |
| DPUCAHX8L | 8.60 | 2 | 8.1 | 175.4 | 69.7 | 0.1637 | 0.0142 |
| FlexCNN | 7.58 | 1 | 15.1 | 66.01 | 114.6 | 0.1478 | 0.1166 |

Table 21. E-Net Performance Comparison with Vitis AI [3] (Image Size: 512 \(\times\) \(1,\!024\))
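The latency bounds quoted above follow from simple arithmetic on the reported frame rate and batch size; the following snippet reproduces the numbers used for the DPUCAHX8H row of Table 21.

```python
# Bounds on per-image latency when only frames/s and batch size are reported:
# if the batch images are processed one after another, each takes 1/FPS;
# if all of them are processed in parallel, each takes batch_size/FPS.
def latency_bounds_ms(fps, batch_size):
    lower = 1000.0 / fps
    upper = 1000.0 * batch_size / fps
    return lower, upper

lo, hi = latency_bounds_ms(fps=123.0, batch_size=14)
print(f"{lo:.1f} ms - {hi:.1f} ms")  # 8.1 ms - 113.8 ms, as quoted in Table 21
```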

8.7 Comparison with Other Frameworks

In this subsection, we compare FlexCNN with other FPGA-based DNN frameworks in terms of the scope of the frameworks and the performance of their respective architectures.

8.7.1 Scope of the Framework.

The scope and the families of DNNs that a framework can support depend on the architecture it employs. Most previous works, such as DNNWeaver [42], Angel-Eye [23, 24], Caffeine [52, 60], fpgaConvNet [50], DNNBuilder [62], 2D & 3D CNN [43], Cloud-DNN [11], DNNVM [54], DNNExplorer [63], and 3D-VNPU [18], focused on architectures that support normal convolution, fully connected (FC), pooling, and activation and batch normalization (Act & BN) layers. These layers are sufficient for simple sequential CNNs such as AlexNet and VGG-16. Beyond these common layers, however, many CNNs contain additional layer types such as depth-wise convolution, dilated convolution, transposed convolution, upsampling, and bilinear upsampling. Therefore, these previous frameworks cannot support complex CNNs with diverse layer types and branching\(^\dagger\) graph topologies such as OpenPose, U-Net, and E-Net. To accelerate a wide range of real-world CNN applications, FlexCNN supports all the aforementioned layer types except fully connected layers, which have become less common recently; many popular models, such as MobileNet-V2 (used in OpenPose), are employed as feature-extraction backbones without their FC layers. While FlexCNN and these previous works target CNNs, the FP-DNN [22] framework features an architecture that supports recurrent neural networks (RNNs) in addition to CNNs; we will explore this direction in future work. Finally, Vitis AI [4] is a comprehensive AI compiler supporting CNNs, RNNs, and natural language processing models (NLPs). Table 22 summarizes the scope of all these frameworks.
| Framework | DNNs |
| --- | --- |
| DnnWeaver [42] | CNNs |
| Angel-Eye [24] | CNNs |
| DAC'17 [52] | CNNs |
| FP-DNN [22] | CNNs, RNNs |
| Caffeine [60] | CNNs |
| fpgaConvNet [50] | CNNs |
| DNNBuilder [62] | CNNs |
| 2D & 3D CNN [43] | CNNs |
| Cloud-DNN [11] | CNNs |
| DNNVM [54] | CNNs |
| DNNExplorer [63] | CNNs |
| 3D-VNPU [18] | CNNs |
| Vitis AI [4] | CNNs, RNNs, NLPs |
| FlexCNN (ours) | CNNs |

Table 22. Scope of the Frameworks
\(^\dagger\)Branching means that the CNN's graph is not sequential but rather contains multiple branches connected with add or concat layers.
Aside from the types of DNNs, an important aspect of the scope of a framework or architecture is the model size it can handle. Some frameworks, such as DNNBuilder [62], create dedicated hardware modules for each CNN layer, consuming most of the FPGA fabric resources and limiting them to small CNNs with few layers. FlexCNN, in contrast, places no limit on the model size, as it stores the weights off-chip and time-shares the same hardware modules across all layers, as sketched below.
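The sketch below is only a conceptual illustration of this time-sharing idea, with hypothetical `accelerator` and `ddr` placeholder objects; it is not FlexCNN's actual runtime code.

```python
# Conceptual contrast: dedicating a hardware module per layer ties the supported
# model size to on-chip resources, whereas a single time-shared accelerator loops
# over the layer list and streams each layer's weights from off-chip DRAM.
def run_time_shared(layers, accelerator, ddr):
    fmap = ddr.load_input()
    for layer in layers:                      # the same hardware module serves every layer
        weights = ddr.load_weights(layer)     # weights stay in DRAM until the layer runs
        fmap = accelerator.execute(layer, fmap, weights)
    ddr.store_output(fmap)
```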

8.7.2 Performance.

We compare the performance of FlexCNN's generated accelerators with multiple frameworks on the widely used VGG-16 [46] CNN, since it is evaluated by all of these previous frameworks. Similar to some other works, we implemented only the feature extraction part of VGG-16 (the convolution layers) and not the classification part (the last three FC layers), since FlexCNN does not yet have a dedicated FC module; we will consider adding one in future work. For this comparison, we created three designs with various bit widths and data types, detailed in Table 23.
| Design (\(SA\_COL \times SA\_ROW \times SIMD\)) | Data Type | Frequency (MHz) | LUT | FF | BRAM | URAM | DSP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8 \(\times\) 14 \(\times\) 8 | float 32-bit | 266 | 48.11% | 40.81% | 45.43% | 27.53% | 16.25% |
| 8 \(\times\) 14 \(\times\) 32 | fixed 16-bit | 241 | 39.51% | 26.79% | 45.93% | 25.31% | 37.98% |
| 8 \(\times\) 14 \(\times\) 64 | fixed 8-bit | 198 | 59.77% | 30.51% | 58.43% | 24.69% | 8.70% |

Table 23. VGG-16 Designs and Hardware Utilization on U250 FPGA
We surveyed many previous frameworks targeting CNNs and summarize the results in Table 24. Since we did not implement the FC layers, for a fair comparison each metric in Table 24 uses the format m1 (m2), where m1 refers to feature extraction only (the convolution layers, 30.69 GOPs, 99.6% of VGG-16's operations) and m2 refers to feature extraction plus classification (convolution + FC layers, 30.81 GOPs). Overall, the FlexCNN designs achieve performance better than or comparable to the other frameworks. In terms of throughput, DNNBuilder delivers the highest throughput, followed by DNNVM; Cloud-DNN and DNNExplorer achieve throughput comparable to FlexCNN. In terms of feature-extraction latency, FlexCNN's 8-bit design achieves the lowest latency of 13.18 ms, followed by DNNVM, while DNNBuilder achieves the lowest latency of 15.39 ms for feature extraction plus classification. In terms of performance density, DNNVM has the highest GOP/s/kLUT, followed by the Vitis AI DPUs and DNNBuilder, all of which are implemented and optimized in RTL. FlexCNN's fixed-point designs have GOP/s/kLUT comparable to Caffeine, 2D & 3D CNN, and Cloud-DNN, which are all implemented in Xilinx/AMD HLS. As for DSP performance density, FlexCNN's 8-bit design delivers the highest GOP/s/DSP of 2.179. Finally, an important metric is accelerator efficiency, measured as the ratio between the achieved performance and the peak performance of the accelerator. DNNBuilder achieves the highest efficiency, followed by DNNExplorer, since they exploit layer-level parallelism by deploying an accelerator for each layer (or group of layers) of a CNN model. FlexCNN, in contrast, achieves between 82% and 96% accelerator efficiency (higher than DAC'17, Caffeine, DNNVM, and the Vitis AI DPUs) while using a single systolic array, thanks to dynamic tiling and the other hardware optimizations employed by FlexCNN.
| Framework | Platform | Precision\(^\dagger\) | Frequency (MHz) | Batch Size | Throughput (GOP/s) | Latency (ms) | GOP/s/kLUT | GOP/s/DSP | Actual/Peak Performance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DnnWeaver [42] | Zynq Z020 | FX(16,16) | 150 | 1 | 31.35 (31.38) | - | 0.896 (0.897) | 0.224 (0.224) | - |
| DnnWeaver [42] | Stratix V SGSD5 | FX(16,16) | 200 | 1 | 157.39 (157.51) | - | 1.040 (1.041) | 0.265 (0.265) | - |
| DnnWeaver [42] | Arria 10 GX115 | FX(16,16) | 200 | 1 | 390.02 (361.55) | - | 1.079 (1.000) | 0.290 (0.269) | - |
| Angel-Eye [24] | Zynq Z045 | FX(16,16) | 150 | 1 | 187.80 (136.97) | 163.42 (224.60) | 1.028 (0.750) | 0.241 (0.176) | - |
| DAC'17 [52] | Arria 10 GT115 | FX(16,8) | 232 | 1 | - (1,171.30) | - (26.85) | - (3.742) | - (0.781) | 89.11% (-) |
| Caffeine [60] | UltraScale KU060 | FX(16,16) | 200 | 1 | 310.00 (266.00) | - (101.15) | 3.100 (2.660) | 0.293 (0.251) | 84.93% (72.88%) |
| Caffeine [60] | Virtex 690T | FX(16,16) | 150 | 1 | 488.00 (354.00) | - (65.13) | 1.627 (1.180) | 0.172 (0.125) | 76.72% (55.66%) |
| fpgaConvNet [50] | Zynq Z045 | FX(16,16) | 125 | 1 | 155.81 (-) | 249.50 (-) | - | 0.182 (-) | - |
| DNNBuilder [62] | UltraScale KU115 | FX(16,16) | 235 | 1 | - (2,011.00) | - (15.39) | - (7.799) | - (0.466) | - (99.1%) |
| DNNBuilder [62] | UltraScale KU115 | FX(8,8) | 235 | 2 | - (4,022.00) | - (15.39) | - (15.597) | - (0.931) | - (99.1%) |
| 2D & 3D CNN [43] | Virtex 690T | FX(16,16) | 150 | 1 | - (570.00) | - (54.06) | - (3.257) | - (0.414) | - |
| 2D & 3D CNN [43] | UltraScale VU440 | FX(16,16) | 200 | 1 | - (821.00) | - (37.53) | - (4.829) | - (0.597) | - |
| Cloud-DNN [11] | UltraScale VU9P | FX(16,16) | 125 | 1 | - (1,068.37) | - (28.96) | - (1.397) | - (0.200) | - |
| Cloud-DNN [11] | UltraScale VU9P | FX(16,16) | 214 | 1 | - (1,828.61) | - (16.92) | - (2.645) | - (0.342) | - |
| DNNVM [54] | UltraScale ZU2 | FX(8,8) | 330 | 1 | 334 (-) | 91.90 (-) | 15.215 (-) | 1.722 (-) | 87.9% (-) |
| DNNVM [54] | UltraScale ZU9 | FX(8,8) | 330 | 3 | 2,820 (-) | 17.24 (-) | 23.94 (-) | 1.829 (-) | 69.6% (-) |
| DNNExplorer [63] | UltraScale KU115 | FX(16,16) | 200 | 1 | 1,702.30 (-) | 18.05 (-) | - (-) | 0.363 (-) | 95.8% (-) |
| 3D-VNPU [18] | UltraScale ZCU102 | FX(8,8) | 200 | 1 | 1,150 (-) | 26.69 (-) | - (-) | 1.123 (-) | - |
| DPUCAHX8H [1] | Alveo U280 | FX(8,8) | 150 | 14 | - (5,812.07) | - (5.30–74.23) | - (8.575) | - (0.779) | - (67.6%) |
| DPUCAHX8L [2] | Alveo U280 | FX(8,8) | 250 | 2 | - (3,272.75) | - (18.83) | - (7.688) | - (0.409) | - (40.9%) |
| FlexCNN (ours) | Alveo U250 | FL(32,32) | 266 | 1 | 458.6 (-) | 66.92 (-) | 0.632 (-) | 0.082 (-) | 96.2% (-) |
| FlexCNN (ours) | Alveo U250 | FX(16,16) | 241 | 1 | 1,543.4 (-) | 19.89 (-) | 2.262 (-) | 0.331 (-) | 89.3% (-) |
| FlexCNN (ours) | Alveo U250 | FX(8,8) | 198 | 1 | 2,329.1 (-) | 13.18 (-) | 2.256 (-) | 2.179 (-) | 82.1% (-) |

Table 24. VGG-16 Performance Comparison with Other Frameworks (Image Size: 224 \(\times\) 224)
Note: the m1 (m2) format is used for multiple results, where m1 refers to the Conv. layers only and m2 refers to Conv. + FC layers.
\(^\dagger\)FX/FL(a,w): a = activations, w = weights, FX = fixed-point, and FL = floating-point data types.
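The accelerator-efficiency figures in Table 24 can be reproduced from the array dimensions and clock frequency, assuming peak performance counts two operations per MAC lane per cycle (an assumption consistent with the peak numbers in Tables 16 and 24); the helper names below are ours.

```python
# Peak performance of a SA_COL x SA_ROW x SIMD array: 2 ops (multiply + add) per
# MAC lane per cycle, expressed in GOP/s for a clock frequency given in MHz.
def peak_gops(sa_col, sa_row, simd, freq_mhz):
    return 2 * sa_col * sa_row * simd * freq_mhz / 1e3

def efficiency(achieved_gops, peak):
    return achieved_gops / peak

# Performance density is simply the achieved GOP/s per kLUT or per DSP used.
def density(achieved_gops, kluts, dsps):
    return achieved_gops / kluts, achieved_gops / dsps

# Example with FlexCNN's float design for VGG-16: 8 x 14 x 8 lanes at 266 MHz.
peak = peak_gops(8, 14, 8, 266)          # ~476.7 GOP/s
print(f"{efficiency(458.6, peak):.1%}")  # ~96.2%, matching Table 24
```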

9 Conclusion

In this work, we presented the end-to-end FlexCNN framework for accelerating CNNs on FPGA. Our framework targets the main challenges of accelerating modern CNNs. The first challenge stems from the disparity within layers of the same type, which results in different computation and communication requirements; as a solution, we proposed architectural techniques such as dynamic tiling, layer fusion and layer parallelization, and data layout optimizations. The second challenge arises from convolution variants such as transposed and dilated convolution, which, if not processed efficiently, lead to significant underutilization of the FPGA's computation resources due to the large number of redundant zero operations. For this, we proposed a versatile systolic array that handles all these layer types efficiently with a small area overhead compared to a standard SA. The third challenge is the software overhead in the end-to-end runtime of CNN inference; to mitigate it, we proposed a software-hardware pipelining technique that overlaps these overheads with the hardware kernel execution. Finally, we presented our automated compilation flow that takes a CNN model in ONNX format, maps it to the FlexCNN architecture, finds the best hardware parameters and tiling factors using a DSE, and generates accelerators in either Xilinx/AMD HLS or TAPA HLS.

Footnotes

1. Open Neural Network Exchange.
2. Jason Cong has a financial interest in AMD.
4. Asymmetric convolution layers are the same as N-CONV layers but use non-square filter sizes, like 1 \(\times\) 5 filters.
5. Based on TensorFlow “same” padding.
6. Processing engines are abbreviated as PENs so as not to be confused with the systolic array processing elements (PEs).

References

[1]
DPUCAHX8H Resource Utilization. (n.d.). Retrieved from https://docs.xilinx.com/r/en-US/pg367-dpucahx8h/Resource-Utilization.
[2]
DPUCAHX8L Resource Utilization. (n.d.). Retrieved from https://docs.xilinx.com/r/en-US/pg366-dpucahx8l/Resource-Utilization.
[5]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.
[6]
Jinguji Akira, Tomoya Fujii, Shimpei Sato, and Hiroki Nakahara. 2018. An FPGA realization of OpenPose based on a sparse weight convolutional neural network. In International Conference on Field-Programmable Technology (FPT’18). IEEE, 310–313.
[7]
Lin Bai, Yiming Zhao, and Xinming Huang. 2018. A CNN accelerator on FPGA using depthwise separable convolution. IEEE Trans. Circ. Syst. II: Express Briefs 65, 10 (2018), 1415–1419.
[8]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition. 7291–7299.
[9]
Kuo-Wei Chang and Tian-Sheuan Chang. 2020. Efficient accelerator for dilated and transposed convolution with decomposition. In IEEE International Symposium on Circuits and Systems (ISCAS’20). IEEE, 1–5.
[10]
Qinyu Chen, Yan Huang, Rui Sun, Wenqing Song, Zhonghai Lu, Yuxiang Fu, and Li Li. 2020. An efficient accelerator for multiple convolutions from the sparsity perspective. IEEE Trans. Very Large Scale Integ. Syst. 28, 6 (2020), 1540–1544.
[11]
Yao Chen, Jiong He, Xiaofan Zhang, Cong Hao, and Deming Chen. 2019. Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 73–82.
[12]
Yu-Ting Chen, Jason Cong, Zhenman Fang, Jie Lei, and Peng Wei. 2016. When Spark meets FPGAs: A case study for next-generation DNA sequencing acceleration. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’16).
[13]
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[14]
Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with optimized dataflow architecture. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
[15]
Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending high-level synthesis for task-parallel programs. In IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’21). IEEE, 204–213.
[16]
Jason Cong and Jie Wang. 2018. PolySA: Polyhedral-based systolic array auto-compilation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
[17]
Jason Cong, Peng Wei, and Cody Hao Yu. 2018. From JVM to FPGA: Bridging abstraction hierarchy via optimized deep pipelining. In 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’18).
[18]
Huipeng Deng, Jian Wang, Huafeng Ye, Shanlin Xiao, Xiangyu Meng, and Zhiyi Yu. 2021. 3D-VNPU: A flexible accelerator for 2D/3D CNNs on FPGA. In IEEE 29th Annual International Symposium on Field-programmable Custom Computing Machines (FCCM’21). IEEE, 181–185.
[19]
Xinkai Di, Hai-Gang Yang, Yiping Jia, Zhihong Huang, and Ning Mao. 2020. Exploring efficient acceleration architecture for winograd-transformed transposed convolution of GANs on FPGAs. Electronics 9, 2 (2020), 286.
[20]
Chao Dong, Chen Change Loy, and Xiaoou Tang. 2016. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision. Springer, 391–407.
[21]
Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 (2016).
[22]
Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’17). IEEE, 152–159.
[23]
Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2016. Angel-Eye: A complete design flow for mapping cnn onto customized hardware. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI’16). IEEE, 24–29.
[24]
Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2017. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 37, 1 (2017), 35–47.
[25]
Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, and Jason Cong. 2021. AutoBridge: Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 81–92.
[26]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[27]
Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: Running deep neural networks on android smartphones. In European Conference on Computer Vision (ECCV’18). 0–0.
[28]
Dongseok Im, Donghyeon Han, Sungpill Choi, Sanghoon Kang, and Hoi-Jun Yoo. 2019. DT-CNN: Dilated and transposed convolution neural network accelerator for real-time image segmentation on mobile devices. In IEEE International Symposium on Circuits and Systems (ISCAS’19). IEEE, 1–5.
[29]
Ildoo Kim. 2018. tf-pose-estimation. Retrieved from https://github.com/ildoonet/tf-pose-estimation.
[30]
Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning. PMLR, 1857–1865.
[31]
Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 26th International Conference on Field Programmable Logic and Applications (FPL’16). IEEE, 1–9.
[32]
Yuhong Li, Xiaofan Zhang, and Deming Chen. 2018. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In IEEE Conference on Computer Vision and Pattern Recognition. 1091–1100.
[33]
Shuanglong Liu, Hongxiang Fan, Xinyu Niu, Ho-cheung Ng, Yang Chu, and Wayne Luk. 2018. Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA. ACM Trans. Reconfig. Technol. Syst. 11, 3 (2018), 1–22.
[34]
Shuanglong Liu and Wayne Luk. 2019. Towards an efficient accelerator for DNN-based remote sensing image segmentation on FPGAs. In 29th International Conference on Field Programmable Logic and Applications (FPL’19). IEEE, 187–193.
[35]
Wenjian Liu, Jun Lin, and Zhongfeng Wang. 2019. USCA: A unified systolic convolution array architecture for accelerating sparse neural network. In IEEE International Symposium on Circuits and Systems (ISCAS’19). IEEE, 1–5.
[36]
De G. Matthews, G. Alexander, Mark Van Der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. 2017. GPflow: A Gaussian process library using TensorFlow. J. Mach. Learn. Res. 18, 1 (2017), 1299–1304.
[37]
Daniel H. Noronha, Bahar Salehpour, and Steven J. E. Wilton. 2018. LeFlow: Enabling flexible FPGA high-level synthesis of TensorFlow deep neural networks. In 5th International Workshop on FPGAs for Software Programmers. VDE, 1–8.
[38]
Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016).
[39]
Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
[40]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 234–241.
[41]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[42]
Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
[43]
Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, and Chunyuan Zhang. 2018. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 97–106.
[44]
Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA’17). IEEE, 535–547.
[45]
Laurent Sifre and Stéphane Mallat. 2014. Rigid-motion scattering for image classification. École Normale Supérieure, Département d’Informatique, Ph.D. Dissertation.
[46]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[47]
Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. 2020. End-to-end optimization of deep learning applications. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 133–139.
[48]
Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. ACM, 16–25.
[49]
Wei Ren Tan, Chee Seng Chan, Hernán E. Aguirre, and Kiyoshi Tanaka. 2017. ArtGAN: Artwork synthesis with conditional categorical GANs. In IEEE International Conference on Image Processing (ICIP’17). IEEE, 3760–3764.
[50]
Stylianos I. Venieris and Christos-Savvas Bouganis. 2018. fpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. 30, 2 (2018), 326–342.
[51]
Xuechao Wei, Yun Liang, Xiuhong Li, Cody Hao Yu, Peng Zhang, and Jason Cong. 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
[52]
Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In 54th Annual Design Automation Conference. ACM, 29.
[53]
Xilinx. 2018. Vivado design suite user guide - high-level synthesis (UG902). https://docs.xilinx.com/v/u/2018.2-English/ug902-vivado-high-level-synthesis.
[54]
Yu Xing, Shuang Liang, Lingzhi Sui, Xijie Jia, Jiantao Qiu, Xin Liu, Yushun Wang, Yi Shan, and Yu Wang. 2019. DNNVM: End-to-end compiler leveraging heterogeneous optimizations on FPGA-based CNN accelerators. IEEE Trans. Comput.-Aid. Des. Integ. Circ. Syst. 39, 10 (2019), 2668–2681.
[55]
Xuan Yang, Mingyu Gao, Jing Pu, Ankita Nayak, Qiaoyi Liu, Steven Emberton Bell, Jeff Ou Setter, Kaidi Cao, Heonjae Ha, Christos Kozyrakis et al. 2018. DNN dataflow choice is overrated. arXiv preprint arXiv:1809.04070 (2018).
[56]
Amir Yazdanbakhsh, Michael Brzozowski, Behnam Khaleghi, Soroush Ghodrati, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. 2018. FlexiGAN: An end-to-end solution for FPGA acceleration of generative adversarial networks. In IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’18). IEEE, 65–72.
[57]
Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
[58]
Yunxuan Yu, Tiandong Zhao, Mingyu Wang, Kun Wang, and Lei He. 2020. Uni-OPU: An FPGA-based uniform accelerator for convolutional and transposed convolutional networks. IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 28, 7 (2020), 1545–1556.
[59]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. ACM, 161–170.
[60]
Chen Zhang, Guangyu Sun, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2018. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput.-Aid. Des. Integ. Circ. Syst. 38, 11 (2018), 2072–2085.
[61]
Ning Zhang, Xin Wei, He Chen, and Wenchao Liu. 2021. FPGA implementation for CNN-based optical remote sensing object detection. Electronics 10, 3 (2021), 282.
[62]
Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. 2018. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In International Conference on Computer-aided Design. ACM, 56.
[63]
Xiaofan Zhang, Hanchen Ye, Junsong Wang, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. 2020. DNNExplorer: A framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. In 39th International Conference on Computer-aided Design. 1–9.

