The proposed development flow and the multi-engine architecture were applied to the design of ResNet50 for classification of \(224\times 224\) images, YOLOv3 for object detection of \(416\times 416\) images and DeepLabV3+ with a Modified Aligned Xception Backbone for image segmentation of \(300\times 300\) images.
All systems were designed using the proposed development flow, and the hardware modules were synthesized with the Xilinx Vitis HLS 2022.1 high-level synthesis tool. The accelerator development targeted an implementation in the PL of the Xilinx Zynq UltraScale+ ZU3EG SoC present on the Ultra96-v2 development board. Any FPGA could have been used, but targeting a low-density FPGA shows that the multi-engine architecture can also be used to design accelerators for embedded devices.
5.1 ResNet50 Model
ResNet50 is a well-known model for image classification and is used as a backbone in many convolutional neural networks. The model was initially quantized with different bitwidths, starting from a pre-trained model (see results in Table 2).
It is not the objective of this work to improve the state-of-the-art quantization accuracy of ResNet50. The variation in accuracy among the quantized solutions is just 1.7 pp, except for the most aggressive quantization (\(4\times 2\)), where there is a drop of 10 pp compared to the original floating-point model. The accuracies of the \(8\times 2\) and \(4\times 4\) quantizations are very close, with \(4\times 4\) slightly better. Comparing the \(4\times 4\) quantization against \(8\times 8,\) there is a drop of 2 pp in accuracy. Since the design targets a low-density FPGA to run a large model, the \(4\times 4\) quantization was chosen. The model design and optimization step was then applied to the full model to run batch fusing, fine-tuning, and extraction of the final weights.
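As a point of reference, a \(b\)-bit uniform quantization of a value \(x\) with scale \(\Delta\) can be written as below; this is only an illustrative formulation, since the exact scheme applied by the quantization tool is not detailed in this section:
\[
q=\operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{\Delta}\right),\,-2^{b-1},\,2^{b-1}-1\right),\qquad \hat{x}=q\,\Delta .
\]
A configuration such as \(4\times 4\) then fixes one such bitwidth for the weights and another for the activations.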
The first layer of ResNet50 is a convolution (\(64\times 7 \times 7\) filters) followed by batch normalization, ReLU, and \(3\times 3\) max pooling. Given the particular characteristics of this layer, a dedicated engine was chosen for it. Since the input has just three channels, the depthwise with accumulation engine was used. The last stage applies average pooling over whole channels followed by a dense layer with 1,000 kernels, so a dense layer engine with average pooling was used. The hidden layers are repetitions of the bottleneck structure of ResNet50 (see Table 3).
All layers, including the downsample layer, can be implemented with the 3D convolution engine with ReLU and shortcut addition. Batch normalization was merged with the weights in the previous step.
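As a brief illustration of this merging (the standard batch-normalization folding; the notation is ours), a convolution with weights \(W\) and bias \(b\) followed by batch normalization with statistics \(\mu,\sigma^{2}\) and learned parameters \(\gamma,\beta\) is equivalent to a single convolution with
\[
W^{\prime}=\frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\,W,\qquad b^{\prime}=\beta+\frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\,(b-\mu),
\]
so no batch-normalization hardware is needed at inference time.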
So, the accelerator includes three engines: one for the first layer, one for the last layer, and one for all hidden layers (see Figure
9).
The three engines work in parallel and can be used in a dataflow streaming fashion to process a sequence of images. The architecture of each engine was configured before scheduling, as specified in the engines in the figure. The configurations were determined based on the execution times of each macro-layer. Knowing the number of MACs, parameters, and map sizes necessary to run each macro-layer, the ratio between layer operations and the peak performance of each engine was found, together with the expected data transfer times, assuming a transfer rate of 16 bytes/CLK (see Table
4).
The 3D core engine has the best expected computation time. However, the non-overlapping data transfers degrade the total peak execution time.
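A minimal sketch of how such estimates can be obtained is shown below; the function and parameter names are ours, the 16 bytes/CLK transfer rate comes from the text, and whether transfers overlap with computation is passed as a flag:

```cpp
#include <cstdint>
#include <algorithm>
#include <cstdio>

// Rough per-layer time estimate used to size the engines (illustrative only).
// Compute time  = MACs / (parallel MACs per cycle) cycles.
// Transfer time = bytes moved / 16 bytes per cycle (assumption from the text).
// With non-overlapping transfers the two terms add up; with full overlap the
// slower of the two would dominate instead.
struct LayerCost {
    uint64_t macs;            // multiply-accumulates of the macro-layer
    uint64_t bytes_moved;     // weights + input/output feature maps (bytes)
};

double estimate_ms(const LayerCost &l, unsigned macs_per_cycle,
                   double clk_mhz, bool overlap_transfers) {
    double compute_cycles  = double(l.macs) / macs_per_cycle;
    double transfer_cycles = double(l.bytes_moved) / 16.0;   // 16 B/CLK
    double total_cycles = overlap_transfers
                              ? std::max(compute_cycles, transfer_cycles)
                              : compute_cycles + transfer_cycles;
    return total_cycles / (clk_mhz * 1e3);                   // cycles -> ms
}

int main() {
    // Hypothetical macro-layer: 0.5 GMAC, 4 MB of traffic, an engine with
    // 512 MACs/cycle at 200 MHz, transfers not overlapped with compute.
    LayerCost layer{500'000'000ULL, 4'000'000ULL};
    std::printf("estimated time: %.2f ms\n",
                estimate_ms(layer, 512, 200.0, /*overlap_transfers=*/false));
}
```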
The hardware/software system with the multi-engine architecture was implemented and tested on the board, and the results were compared with the state of the art. The multi-engine runs at 200 MHz (see Table
5).
The multi-engine architecture is better than the other
\(4\times 4\) solution in terms of GOPS/kLUT and GOPS/DSP. The FPS of the proposed architecture is smaller but it runs ResNet50 while the other work runs ResNet18. Compared to Reference [
25], the main difference in the throughput comes from the limited memory bandwidth and on-chip memory of our FPGA device. It is not evident from the results, but the bottleneck of our solution is data communication.
5.2 YOLOv3 Model
YOLO and all its versions [
15] are one-stage object detectors with a common model topology based on convolutional neural networks. The YOLO detector extracts features using a CNN and returns candidate bounding boxes from those features for three different scales: 52 × 52, 26 × 26, and 13 × 13.
The original CNN model of YOLO has convolutional, shortcut, YOLO, upsample, and route layers. Convolutional layers with stride two replace max-pooling. Batch normalization is applied to all convolutional layers, and all layers use the Leaky ReLU activation function, except the layers immediately before the YOLO layers, which use a linear activation function. YOLO is able to detect objects of different sizes using three different scales: 52 × 52 for small objects, 26 × 26 for medium objects, and 13 × 13 for large objects. YOLOv3-Tiny replaces the convolutions with a stride of two by convolutions followed by max-pooling and does not use shortcut layers.
Table
6 details the sequence of layers with regard to the input, output, and kernel sizes, and the activation function used in each convolutional layer of YOLOv3-Tiny.
The first part of the network is composed of a series of convolutional and maxpool layers. The detection and classification part of the network performs object detection and classification at the (\(13 \times 13\)) and (\(26\times 26\)) grid scales. The detection at the lower resolution is obtained by passing the feature extraction output through \(3\times 3\) and \(1\times 1\) convolutional layers and a YOLO layer at the end.
The detection at the higher resolution follows the same procedure but uses FMs from two layers of the network. The second detection uses intermediate results from the feature extraction layers concatenated with upscaled FMs used for the lower resolution detection.
Following the design flow, the model was quantized with 8 bits, achieving a precision of 30.8 mAP50 on the COCO 2017 test dataset (the original floating-point model scores 32.9 mAP50). Lower bitwidths introduce large errors and therefore were not considered.
The model consists of 3D convolutional layers with maxpool, without maxpool, and with upsample. Different activation functions are used: Leaky ReLU, linear, and sigmoid.
The first layer is a convolution (\(16\times 3 \times 3\) filters) followed by a Leaky ReLU and max pooling. A dedicated engine was also considered for this layer. Since the input has just three channels, the depthwise with accumulation engine was used.
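One possible interpretation of this engine, sketched below as plain software (the actual hardware parallelism and dataflow are not shown here): each of the three input channels is convolved independently, as in a depthwise convolution, and the per-channel partial results are accumulated to obtain the regular convolution output.

```cpp
#include <vector>

// Illustrative software model of a "depthwise with accumulation" first-layer
// convolution: each input channel is filtered independently (depthwise step)
// and the per-channel partial results are summed (accumulation step), which
// is equivalent to a standard convolution when the input has few channels.
// Shapes, strides, and padding handling are simplified assumptions.
std::vector<std::vector<int>> conv_first_layer(
    const std::vector<std::vector<std::vector<int>>> &in,               // [C][H][W]
    const std::vector<std::vector<std::vector<std::vector<int>>>> &w,   // [F][C][K][K]
    int stride) {
    const int C = in.size(), H = in[0].size(), W = in[0][0].size();
    const int F = w.size(), K = w[0][0].size();
    const int Ho = (H - K) / stride + 1, Wo = (W - K) / stride + 1;
    std::vector<std::vector<int>> out(F, std::vector<int>(Ho * Wo, 0));
    for (int f = 0; f < F; ++f)
        for (int c = 0; c < C; ++c)               // depthwise pass per channel
            for (int y = 0; y < Ho; ++y)
                for (int x = 0; x < Wo; ++x) {
                    int acc = 0;
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx)
                            acc += in[c][y * stride + ky][x * stride + kx] *
                                   w[f][c][ky][kx];
                    out[f][y * Wo + x] += acc;    // accumulate across channels
                }
    return out;
}
```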
All the remaining layers are based on 3D convolutions (see Table
7).
All layers can be implemented with the 3D convolution engine. To use a single convolution engine for all layers, it was necessary to design a configurable activation function with support for both Leaky ReLU and Sigmoid.
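A minimal sketch of such a configurable activation is given below; the Q8.8 fixed-point format, the 0.1 leaky slope, and the piecewise-linear sigmoid approximation are our assumptions (the actual engine could, for instance, use a lookup table for the sigmoid):

```cpp
#include <cstdint>

enum class Act : uint8_t { Linear, LeakyReLU, Sigmoid };

// Configurable activation in Q8.8 fixed point (assumed format).
// Leaky ReLU uses a 0.1 slope (~26/256); the sigmoid is a coarse
// piecewise-linear approximation: 0 below -4, 1 above +4, 0.5 + x/8 between.
int32_t activate(int32_t x, Act act) {
    switch (act) {
        case Act::LeakyReLU:
            return x >= 0 ? x : (x * 26) / 256;   // ~0.1 * x for x < 0
        case Act::Sigmoid: {
            const int32_t one  = 1 << 8;          // 1.0 in Q8.8
            const int32_t four = 4 << 8;          // 4.0 in Q8.8
            if (x <= -four) return 0;
            if (x >=  four) return one;
            return (one / 2) + (x / 8);           // 0.5 + x/8
        }
        case Act::Linear:
        default:
            return x;                             // identity
    }
}
```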
So, the accelerator includes two engines: one for the first layer and one for all hidden layers (see Figure
10).
The architecture of each engine was also configured initially as specified in the engines in the figure. Once again, knowing the number of MACs, parameters, and map sizes necessary to run each layer, the ratio between layer operations and the peak performance of each engine was found, together with the expected data transfer times, assuming a transfer rate of 16 bytes/CLK (see Table
8).
The hardware/software system with the multi-engine architecture was implemented and tested on the board, and the results were compared with the state of the art. The multi-engine runs at 200 MHz (see Table
9).
Only one implementation of YOLOv3-Tiny on FPGA with an \(8\times 8\) quantization was found. Therefore, the proposed solution was also compared with previous solutions that use larger bitwidths. Compared to the work from Reference [2], which also uses an 8-bit quantization on the same FPGA, the proposed multi-engine is \(4.3\times\) faster and achieves higher performance-to-area ratios.
5.3 DeepLabV3 Model
Semantic segmentation is the process of assigning a label to each pixel of an input image. This is useful in applications such as object detection, where several distinct objects can be identified. A common architecture for semantic segmentation is the Encoder-Decoder structure [
26,
45]. The Encoder-Decoder structure is a two-stage network: the encoder, usually a network such as ResNet or Xception, compresses its input through several convolutions and captures the semantic information, while the decoder assigns a label to each output pixel from the encoder's output FMs.
The semantic segmentation model considered in this work was the DeepLabV3+ [
13] with a Modified Aligned Xception backbone. DeepLabV3+ is an architecture that employs the Encoder-Decoder structure. A simple decoder is added to its predecessor DeepLabV3, which makes use of the
ASPP (Atrous Spatial Pyramid Pooling). ASPP applies several atrous convolutions with different rates in parallel to compute features at multiple scales, which helps with detecting objects that may be too far from or too close to the camera. Different backbones can be used with DeepLabV3+. Recently, with a modified Xception as a backbone, this model was able to improve on the results of its predecessor DeepLabV3.
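As a small illustration of how the atrous rate enters the computation (a 1-D sketch of ours, not the engine's actual implementation), a dilated convolution simply spaces the kernel taps by the rate:

```cpp
#include <vector>

// 1-D atrous (dilated) convolution: taps are spaced by `rate`, so the same
// 3-tap kernel covers a receptive field of 2*rate + 1 input samples.
// Border samples for which the dilated window does not fit are skipped here
// for simplicity (a real layer would use zero padding instead).
std::vector<float> atrous_conv1d(const std::vector<float> &in,
                                 const std::vector<float> &k, int rate) {
    const int n = in.size(), taps = k.size();
    const int span = (taps - 1) * rate;          // receptive field minus one
    std::vector<float> out;
    for (int i = 0; i + span < n; ++i) {
        float acc = 0.f;
        for (int t = 0; t < taps; ++t)
            acc += in[i + t * rate] * k[t];      // tap spacing = dilation rate
        out.push_back(acc);
    }
    return out;
}
```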
This network consists of 142 convolutions with ReLU as the activation function. The network also contains other layers, such as feature-map additions, average pooling, and interpolations. Figures 11 and 12 show the Modified Aligned Xception and the DeepLabV3+ networks, respectively, with all the convolutional layers listed in order of execution together with their most important parameters.
This network features separable convolutions, atrous convolutions, shortcut connections with addition, average pooling, bilinear interpolations, and map concatenations. Each separable convolution block is composed of a depthwise convolution, a batch normalization layer, a pointwise convolution, another batch normalization layer, and a ReLU layer. A convolution block is composed of a convolution, a batch normalization layer, and a ReLU layer, except for the last convolution, where there is no ReLU layer. The network is divided into several parts: the encoder, which is composed of a backbone network (here, the Modified Aligned Xception, although it could be replaced with another one) and the ASPP block, which gathers features at multiple scales using atrous convolutions; and the decoder block, which follows the encoder and is responsible for decoding the features gathered by the encoder. The Modified Aligned Xception itself is divided into three sections: the Entry flow, which consists of several separable convolutions, some with a stride of 2 to reduce the size of the FMs; the Middle flow, which is repeated 16 times in sequence; and the Exit flow, which contains some convolutions with dilation.
The DeepLabV3+ with an Xception Backbone was trained with the Corsican Fire dataset using only RGB images sized 300 × 300. The Corsican dataset
1 is composed of a total of 1,135 images of fires acquired in the visual range (i.e., RGB values) and in the near infrared (i.e., single-channel images) under various conditions of positioning, weather, vegetation, distance to the fire, and brightness. The dataset was divided randomly into three parts: the training partition, with 682 images used to train the network; the validation partition, with 226 images evaluated after each training epoch to keep track of the training process; and the test partition, with 226 images used to evaluate the network's final mIoU. The training was done with a learning rate of 0.02 and a cross-entropy loss function. The mIoU values reached for validation and testing are 92.45% and 91.5%, respectively.
After training the network, several quantization attempts with varying weight and activation bitwidths were made to determine the best compromise between mIoU and hardware resources. Table
10 shows the results obtained.
Bitwidths from 8 down to 3 were tested for both activations and weights, as lowering them further would yield mIoU results below 80%. These results are probably not the best achievable with these bitwidths, since training was stopped after 30 epochs. However, at 30 epochs the loss variation is small, meaning that only a small improvement would be achieved with more training epochs. Decreasing the weight and activation bitwidths lowers the accuracy of the network, and with an activation bitwidth of 3, the accuracy decreases significantly, independently of the weight bitwidth. Since using weights with a bitwidth that is not a power of 2 wastes memory resources, and since the mIoU of 87.31% achieved with a bitwidth of 4 for both activations and weights is less than 5 pp below the mIoU of the original PyTorch network, the rest of this work uses this bitwidth configuration.
Afterwards, the batch normalization layers were merged with their respective convolutional layers using the developed Batch Normalization Merger tool, which decreased the mIoU to 81.21%. After a further 50 epochs of fine-tuning, the mIoU reached 92.76% (an even higher value than the original PyTorch mIoU of 91.53%). The reason the quantized network performs better than the non-quantized PyTorch network might be the noise associated with the quantization.
The architecture was split into the three different engines: 3D Convolution, Depthwise Convolution, and Dilated Convolution. The 3D Convolution engine also computes the pointwise convolutions.
The 3D convolution engine includes pre-processing with average pooling and post-processing with ReLU, concatenation with addition, and interpolation.
The average pooling is able to average a variable-sized map into a 1 × 1 map by reading the input stream values and adding them to the previously read values. Once all the values are read, the accumulated sum is divided by the total number of pixels. To reduce hardware usage and execution time, the division is replaced by a multiplication by a statically calculated value (\(\frac{2^{25}}{N}\)) followed by a bit shift corresponding to a division by \(2^{25}\), as shown in Equation (12).
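A minimal software sketch of this multiply-and-shift average (the \(2^{25}\) scaling follows the text; the data widths are assumptions):

```cpp
#include <cstdint>

// Global average pooling without a hardware divider: the sum of the N input
// pixels is multiplied by a statically computed factor 2^25 / N and the
// result is shifted right by 25 bits, which approximates sum / N.
int32_t average_pool(const int16_t *pixels, uint32_t n_pixels) {
    int64_t sum = 0;
    for (uint32_t i = 0; i < n_pixels; ++i)   // accumulate the streamed map
        sum += pixels[i];
    const int64_t factor = (int64_t(1) << 25) / n_pixels;  // 2^25 / N, static
    return int32_t((sum * factor) >> 25);                  // (sum*2^25/N)/2^25
}
```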
The post-processing of the 3D convolution engine includes a bilinear interpolation, which was not in the original implementation of the engine used for the first two models. The interpolation works by applying linear interpolation in two directions. The linear interpolation formula can be seen in Equation (13), where \(x_1\) and \(x_2\) are the known coordinates along one dimension, \(Q_1\) and \(Q_2\) are the values associated with those coordinates, and \(P\) is the value to be calculated for the new \(x\). The bilinear interpolation formula, seen in Equation (14), can be deduced from the linear interpolation formula when moving to a 2-dimensional plane.
Implementing this formula with fixed-point numbers requires scaling the positions of the known points so that there are no fractional bits. The scaling factor is related to the size of the bigger map; e.g., if the bigger map has a size of 10 pixels, then the scaling factor is 9. The final division of the interpolation can be replaced by a bit shift by first multiplying by a factor that is the result of dividing a power of two by the actual denominator. This factor is calculated once at the beginning of the interpolation process, since its value is always the same, and the formula can then be rewritten as a multiplication by this factor followed by a right shift.
Before starting the Bilinear Interpolation, the module stores two lines of the map. Afterwards, values are loaded and the interpolation is performed while loading the next line of values.
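The following sketch illustrates the fixed-point interpolation with the precomputed reciprocal factor; the 16-bit shift and the integer widths are our assumptions, not necessarily the values used in the engine:

```cpp
#include <cstdint>

// Linear interpolation between two known samples q1 at x1 and q2 at x2,
// evaluated at x (all positions already scaled to integers, as described in
// the text). The division by (x2 - x1) is replaced by a multiplication with
// a precomputed factor 2^16 / (x2 - x1) followed by a right shift by 16.
struct InterpFactor {
    int32_t factor;   // 2^16 / (x2 - x1), computed once per interpolation
};

InterpFactor make_factor(int32_t x1, int32_t x2) {   // assumes x2 > x1
    return { int32_t((int64_t(1) << 16) / (x2 - x1)) };
}

int32_t lerp(int32_t q1, int32_t q2, int32_t x1, int32_t x2,
             int32_t x, InterpFactor f) {
    // P = (q1*(x2 - x) + q2*(x - x1)) / (x2 - x1)
    int64_t num = int64_t(q1) * (x2 - x) + int64_t(q2) * (x - x1);
    return int32_t((num * f.factor) >> 16);
}

// Bilinear interpolation: two horizontal lerps over the two stored lines,
// followed by one vertical lerp between their results.
int32_t bilerp(int32_t q11, int32_t q21, int32_t q12, int32_t q22,
               int32_t x1, int32_t x2, int32_t y1, int32_t y2,
               int32_t x, int32_t y, InterpFactor fx, InterpFactor fy) {
    int32_t top    = lerp(q11, q21, x1, x2, x, fx);  // along x at y = y1
    int32_t bottom = lerp(q12, q22, x1, x2, x, fx);  // along x at y = y2
    return lerp(top, bottom, y1, y2, y, fy);         // along y
}
```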
The final architecture with the three engines is illustrated in Figure
13.
The three engines work in parallel and can be used in a dataflow streaming fashion to process a sequence of images. The architecture of each engine was configured as specified in the engines in the figure.
The hardware/software system has four DMAs: two DMAs are connected to the 3D convolution engine, one DMA is connected to the depthwise convolution engine, and the other is connected to the dilated convolution engine.
The system was synthesized and implemented, achieving a working frequency of 175 MHz. Table
11 shows the hardware utilization of the whole system.
As shown, no hardware limits are hit, with the highest resource usage being the BRAMs. The FPGA bitstream was then generated and exported, along with the necessary files, to the Vitis IDE to run the network.
The network was run on the Ultra96-v2 board, the 226 test images were processed, and their mIoU was evaluated (92.7%). The input images, weights, and ground-truth images were stored in the board's DDR memory before execution. The input images were saved already normalized, so no pre-processing is needed. Every output image was compared with the corresponding ground-truth image to calculate the mIoU, confirming that the implementation is correct. The execution time was then recorded and averaged over the 226 runs to determine the system's throughput. A throughput of 1.4 frames per second was observed.
The results were compared with the only FPGA implementation of DeepLabV3+ that was found (see Table
12).
The solutions compared in the table target different FPGAs. The work from Reference [
27] considers a high-density FPGA. However, it uses far more resources than our multi-engine architecture while achieving a \(5.4\times\) lower throughput.