1 Introduction
Convolutional Neural Networks (CNNs) are widely used in many machine learning (ML) applications and have evolved quickly over the years. There is a growing interest in FPGA for accelerating CNN computation due to its high energy efficiency and performance (e.g., References [6, 7, 22, 31, 37, 44, 48, 52, 59, 60, 62]). However, the recent advancement in CNN models and FPGA-based CNN acceleration has brought several new challenges.
Challenge 1: Performance disparity within CNN layers of the same type: In CNNs, layers of the same type (normal convolution layers, for instance) can have different characteristics in terms of their number of input and output channels, feature map size, and kernel size. This changes the computation-to-communication (CTC) ratio from layer to layer. Therefore, it is important to handle these layers differently given the performance disparity across them. We found that tiling factors can play an important role in performance. Zhang et al. [59] showed that the CTC ratio of a single convolution layer varies with different tiling factors. Yang et al. [55] highlighted the importance of choosing proper tiling factors for data reuse in the near and faster memory (on-chip storage for FPGAs) for overall latency and energy efficiency. These studies led us to consider using different tiling factors across the network. Figure 1 depicts how different tiling factors can affect the performance of each layer in one CNN network. We compare the performance of using a single set of tiling factors (uniform tiling) to using different tiling factors for each layer (dynamic tiling). For uniform tiling, we chose the tiling factors that minimize the latency of the entire network. For dynamic tiling, we selected the best tiling factors for each layer individually. Experimental results show that dynamic tiling can speed up the whole network by \(1.7\times\).
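To make this trade-off concrete, the sketch below estimates the CTC ratio of a single tiled convolution layer under a simplified off-chip traffic model (stride 1, output partial sums kept on chip). The tiling notation \((T_m, T_n, T_r, T_c)\) and the traffic model follow the spirit of the analysis in Zhang et al. [59], but they are our own simplification, not the exact analytical model used in FlexCNN's design space exploration.

```python
import math

def ctc_ratio(M, N, R, C, K, Tm, Tn, Tr, Tc):
    """Rough computation-to-communication (CTC) estimate for one tiled
    convolution layer (stride 1, padding effects not modeled).

    M, N   : number of output / input channels
    R, C   : output feature-map rows / columns
    K      : kernel size (K x K)
    Tm, Tn : channel tiling factors (output / input)
    Tr, Tc : spatial tiling factors (output rows / columns)
    """
    # Total operations of the layer, counting a MAC as 2 operations.
    ops = 2.0 * M * N * R * C * K * K

    # Number of tiles processed for the whole layer.
    n_tiles = (math.ceil(M / Tm) * math.ceil(N / Tn) *
               math.ceil(R / Tr) * math.ceil(C / Tc))

    # Off-chip words moved per tile: an input tile and a weight tile are
    # loaded for every tile; an output tile is written once per output tile
    # (partial sums are assumed to stay on chip across input-channel tiles).
    in_words  = Tn * (Tr + K - 1) * (Tc + K - 1)
    wt_words  = Tm * Tn * K * K
    out_words = Tm * Tr * Tc
    out_tiles = math.ceil(M / Tm) * math.ceil(R / Tr) * math.ceil(C / Tc)

    traffic = n_tiles * (in_words + wt_words) + out_tiles * out_words
    return ops / traffic

# Example: a hypothetical 3x3 layer with 256 input/output channels on a
# 32x32 feature map; two candidate tilings give very different CTC ratios.
print(ctc_ratio(256, 256, 32, 32, 3, Tm=64, Tn=64, Tr=32, Tc=32))  # ~593 ops/word
print(ctc_ratio(256, 256, 32, 32, 3, Tm=8,  Tn=8,  Tr=8,  Tc=8))   # ~53 ops/word
```

Because the best-scoring tiling differs from layer to layer, evaluating a model like this per layer is what motivates the dynamic tiling used by FlexCNN.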
Challenge 2: The inefficiency of general-purpose CNN accelerators in processing special CNN layers: Many modern CNNs feature complex architecture topologies with different layer types. One of these special layers is the fractionally strided or transposed convolution (T-CONV) layer [21] (also referred to as a deconvolution layer). It is an upsampling layer that uses trained weights to produce enlarged high-resolution feature maps. T-CONV layers are often used in image segmentation networks and generative adversarial networks such as U-Net [40], DCGAN [39], ArtGAN [49], DiscoGAN [30], and FSRCNN [20], to name a few. An atrous or dilated convolution (D-CONV) layer is another special layer that maintains the resolution and coverage of feature maps by expanding the receptive fields of convolution filters, as discussed in Reference [57]. One well-known network that uses D-CONV layers is CSRNet [32]. Some CNNs include a mixture of convolution layers, such as E-Net [38], where normal convolution (N-CONV), transposed convolution, dilated convolution, and asymmetric convolution (A-CONV) layers are used. Both T-CONV and D-CONV layers can be naïvely implemented as normal convolution layers. However, such implementations introduce many zeros in the input feature maps of T-CONV layers and in the convolution filters of D-CONV layers, leading to a large underutilization of FPGA resources. To tackle this problem, we use a decomposition-based approach (discussed in Section 5) to implement N-CONV, T-CONV, and D-CONV layers efficiently in one versatile systolic array on an FPGA with minimal area overhead. Moreover, other networks such as MobileNetV1 [26] use depth-wise separable convolution layers, introduced in Reference [45], to decrease the computation cost. MobileNetV2 [41] introduced the residual bottleneck block (RBB) to further reduce the computation complexity. These layers reduce the computation cost but keep the same feature map size; this can make the layer more communication-bound and reduce the computation efficiency.
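As a concrete illustration of the naïve mapping, the short NumPy sketch below expresses a T-CONV as a normal convolution by inserting zeros between input pixels; even for a stride of 2, the majority of the values fed to the compute engine are zeros, which is exactly the waste the decomposition-based approach avoids (D-CONV suffers analogously from zeros inserted into its filters). The helper name is illustrative only.

```python
import numpy as np

def zero_insert(x, stride):
    """Naive T-CONV lowering: insert (stride - 1) zeros between neighboring
    input pixels so that a normal convolution can be applied afterwards.
    x : single-channel 2-D feature map of shape (H, W).
    """
    H, W = x.shape
    up = np.zeros(((H - 1) * stride + 1, (W - 1) * stride + 1), dtype=x.dtype)
    up[::stride, ::stride] = x
    return up

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
up = zero_insert(x, stride=2)
print(up.shape)           # (5, 5)
print((up == 0).mean())   # 0.64: most multiplications would be by zero
```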
Challenge 3: Integration overheads of using FPGA in ML frameworks: When processing a CNN application in a modern ML framework such as TensorFlow [5], the complete stack consists of reading the input, computing the CNN, processing the result, and displaying and writing the result. Previous works have only focused on optimizing the CNN kernel on FPGA (e.g., References [6, 7, 22, 31, 48, 52, 59, 62]), because CNN computation is the most time-consuming step of the whole stack; hence, the remaining overheads are ignored. While several works [22, 37] have focused on accelerator generation from TensorFlow-described networks, they did not address the challenges of integrating an accelerator into TensorFlow. By integrating our accelerator with TensorFlow, we are able to directly run networks from TensorFlow on an FPGA. Integrating FPGA into TensorFlow introduces a new set of overheads: the communication between TensorFlow and the FPGA, and the communication between the host and the FPGA kernel itself. Figure 2 shows the breakdown of the end-to-end runtime for processing a 384 \(\times\) 384 RGB image using the network in Figure 3. These steps are listed and described in Section 7. The CNN processing time on our accelerator (denoted as the kernel) takes only 11.8% of the total runtime. This emphasizes the need for end-to-end SW/HW co-optimization. Our experiments show that this optimization can increase the end-to-end performance of this network from 4.8 FPS to 23.8 FPS, a \(5\times\) speedup.
To solve the challenges above, we propose an FPGA-based CNN framework named FlexCNN. Its architecture employs dynamic tiling, layer fusion, and data layout transformation to adapt to the performance disparity of different CNN layers. Another major component of the architecture is our versatile systolic array, which can efficiently process different convolution layer types. The framework has a compilation flow that takes a CNN as an input, performs design space exploration, and generates an optimized hardware accelerator to run on FPGA. The accelerator is further integrated into a software-hardware pipeline to mitigate the large integration overheads by overlapping the software execution with the hardware computation.
A preliminary version of FlexCNN [47] was published in FPGA 2020. The new contributions in this article include: (1) a novel efficient versatile systolic array for normal, transposed, dilated, and asymmetric convolution layers; (2) ONNX support to handle multiple ML frameworks, including TensorFlow, PyTorch, and Caffe; (3) code generation for the new TAPA [15] framework, which is integrated with AutoBridge [25] to improve design frequency; and (4) the implementations of U-Net, E-Net, and VGG-16 CNNs on FPGA using the FlexCNN framework.
In summary, the overall contributions of this work are:
• An efficient, flexible, and composable dataflow architecture employing dynamic tiling, layer fusion, and data layout optimization to support a wide variety of CNNs;
• A novel versatile systolic array that can efficiently process normal, transposed, dilated, and asymmetric convolution layers;
• An automated compilation flow that takes a CNN dataflow graph as an input, maps it to the hardware dataflow graph, and performs a design space exploration to generate an optimized accelerator on FPGA;
• A software-hardware pipelining scheme that can improve the end-to-end performance of CNNs;
• Real-time efficient implementations of OpenPose, U-Net, E-Net, and VGG-16 CNNs on FPGA.
7 Software-hardware Pipelining
Figure 2 illustrates the software overheads when integrating an FPGA kernel into a machine learning framework like TensorFlow. These overheads defeat the purpose of hardware acceleration. To overcome this challenge, we use a software-hardware pipelining technique that overlaps the software execution with the hardware kernel execution. We chose TensorFlow as our ML framework, since it is widely used for inference in the ML community (e.g., References [27, 36]). To invoke the FPGA from TensorFlow, we redefine the nodes in the original computation graph. All CNN computation nodes are merged into one node that is implemented by the FPGA. The rest of the graph is still processed on the CPU.
When the FPGA is connected to TensorFlow, the whole integration stack consists of the following steps: (1) reading the inputs of the CNN, (2) pre-processing, including stages such as image resizing, (3) re-organizing the initial data layouts in CPU memory, (4) transferring data from the CPU to the FPGA device memory, (5) computation on the FPGA, (6) fetching the results back via PCIe, (7) reformatting them and passing them to TensorFlow, (8) non-CNN computation stages on the CPU, (9) processing the results (e.g., estimating the human poses based on the obtained results and drawing them for the OpenPose network), and (10) writing out and displaying the results.
Figure 2 shows the breakdown of these stages in the OpenPose application for a 384 \(\times\) 384 RGB input. Among the whole pipeline, which takes 208.8 ms, the FPGA computation in Step 5 only requires 11.8% of the total time. The integration overheads have led to an \(8.45\times\) performance slowdown. To reduce these overheads, we have applied an optimized software/hardware pipelining.
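The slowdown follows directly from the measured breakdown: with the kernel accounting for roughly 11.8% of the 208.8 ms end-to-end time,
\[
\frac{T_{\text{end-to-end}}}{T_{\text{kernel}}} \approx \frac{1}{0.118} \approx 8.5\times,
\]
consistent with the reported \(8.45\times\) when the unrounded kernel time is used.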
A two-level pipelining is applied on the whole integration stack that enables the simultaneous processing of the aforementioned steps. The first level overlaps TensorFlow’s overheads (steps 1, 2, 9, 10) with the rest of the steps. The second one overlaps FPGA’s computation with data movement steps (steps 3, 4, 6, 7).
Figure 18 illustrates the first level of the pipeline, which is applied at the TensorFlow level. The numbers in the figure correspond to the step numbers listed above. Steps 1, 2, 9, and 10 and the remaining steps are assigned to different processes connected by a queue, so steps 1, 2, 9, and 10 are overlapped with the FPGA-related steps. The overall performance is determined by the stage with the longest latency. Pipelining is enabled through multiprocessing: each group of steps is assigned to a separate process, and these processes pass data to each other through queues, as shown in Figure 18.
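The sketch below shows the skeleton of this first pipeline level using Python's multiprocessing, with one process per stage connected by bounded queues; the stage bodies are placeholders, and the exact process granularity in FlexCNN may differ.

```python
from multiprocessing import Process, Queue

def frontend(out_q, num_frames):
    # Steps 1-2: read and pre-process the inputs (placeholder work).
    for i in range(num_frames):
        out_q.put(f"frame-{i}")
    out_q.put(None)                       # sentinel: no more frames

def fpga_stage(in_q, out_q):
    # Steps 3-8: data layout, PCIe transfers, FPGA computation, and the
    # non-CNN CPU stages, all represented by a placeholder here.
    while (frame := in_q.get()) is not None:
        out_q.put(f"result-of-{frame}")
    out_q.put(None)

def backend(in_q):
    # Steps 9-10: process, draw, and write out the results (placeholder).
    while (result := in_q.get()) is not None:
        print(result)

if __name__ == "__main__":
    q1, q2 = Queue(maxsize=4), Queue(maxsize=4)
    stages = [Process(target=frontend,   args=(q1, 8)),
              Process(target=fpga_stage, args=(q1, q2)),
              Process(target=backend,    args=(q2,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()
```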
To further improve the performance, we fully pipeline the communication and computation of the FPGA, which consists of steps 3 to 7. This builds the second level of the pipeline. To allow pipelining, a batch of images is sent to the FPGA. For a certain batch size, the additional latency incurred by batch processing is hidden when the first level of the pipeline is applied. After the FPGA finishes processing the batch, the results are passed back to TensorFlow and the non-CNN computations are done in parallel for all the images. Figure 19 depicts the redefined graph that we use to achieve such a pipeline. With this optimization, the data movement steps are overlapped with the kernel computation and the latency of the non-CNN computation (Step 8) is amortized over the whole batch. Note that such deep software+hardware pipelining techniques were also used in References [12, 17] for integrating FPGA accelerators into Spark-based applications.
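Conceptually, the second pipeline level schedules the per-image stages of a batch so that the transfer of one image overlaps with the computation of another. The skeleton below shows such a three-stage software-pipeline schedule; load, compute, and fetch are placeholders for the actual host-driver calls, which would be issued asynchronously (e.g., on separate command queues) for true overlap.

```python
def process_batch(images, load, compute, fetch):
    """Three-stage software-pipeline schedule over a batch:
    load(i)    -> Steps 3-4: re-layout and copy image i to device memory
    compute(i) -> Step 5:    run the FPGA kernel on image i
    fetch(i)   -> Steps 6-7: copy back and reformat the results of image i
    All three callables are placeholders supplied by the host program.
    """
    results = []
    n = len(images)
    for step in range(n + 2):
        if step < n:
            load(step)                          # newest image enters the pipe
        if 0 <= step - 1 < n:
            compute(step - 1)                   # previous image is computed
        if 0 <= step - 2 < n:
            results.append(fetch(step - 2))     # oldest image drains out
    return results
```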