The proposed development flow and the multi-engine architecture were applied to the design of ResNet50 for classification of \(224\times 224\) images, YOLOv3 for object detection of \(416\times 416\) images and DeepLabV3+ with a Modified Aligned Xception Backbone for image segmentation of \(300\times 300\) images.
All systems were designed using the proposed development flow, and the hardware modules were synthesized with the Xilinx Vitis HLS 2022.1 high-level synthesis tool. The accelerator development targeted an implementation in the PL of the Xilinx Zynq UltraScale+ ZU3EG SoC present on the Ultra96-v2 development board. Any FPGA could have been used, but targeting a low-density FPGA shows that the multi-engine architecture can also be used to design accelerators for embedded devices.
5.1 ResNet50 Model
ResNet50 is a well-known model for image classification and is used as a backbone in many convolutional neural networks. The model was initially quantized with different bitwidths, starting from a pre-trained model (see results in Table 2).
It is not the objective of this work to improve the state-of-the-art quantization accuracy of ResNet50. The variation in accuracy among the quantized solutions is just 1.7 pp, except for the most aggressive quantization (\(4\times 2\)), where there is a drop of 10 pp compared to the original floating-point model. The accuracies of the \(8\times 2\) and \(4\times 4\) quantizations are very close, with \(4\times 4\) slightly better. Comparing the \(4\times 4\) quantization against \(8\times 8,\) there is a drop of 2 pp in accuracy. Since the design targets a low-density FPGA to run a large model, the \(4\times 4\) quantization was chosen. The model design and optimization step was then applied to the full model to run batch fusing, fine-tuning, and extraction of the final weights.
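As a point of reference, a \(b\)-bit uniform quantization of a value \(x\) with scale \(\Delta\) can be written as below; this is only an illustrative formulation, since the exact scheme applied by the quantization tool is not detailed in this section:
\[
q=\operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{\Delta}\right),\,-2^{b-1},\,2^{b-1}-1\right),\qquad \hat{x}=q\,\Delta .
\]
A configuration such as \(4\times 4\) then fixes one such bitwidth for the weights and another for the activations.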
The first layer of ResNet50 is a convolution (\(64\times 7 \times 7\) filters) followed by batch normalization, ReLU, and \(3\times 3\) max pooling. Given the particular characteristics of this layer, a dedicated engine was chosen for it. Since the input has just three channels, the depthwise with accumulation engine was used. The last stage applies average pooling over whole channels followed by a dense layer with 1,000 kernels, so a dense layer engine with average pooling was used. The hidden layers are repetitions of the bottleneck structure of ResNet50 (see Table 3).
All layers, including the downsample layer, can be implemented with the 3D convolution engine with ReLU and shortcut addition. Batch normalization was merged with the weights in the previous step.
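As a brief illustration of this merging (the standard batch-normalization folding; the notation is ours), a convolution with weights \(W\) and bias \(b\) followed by batch normalization with statistics \(\mu,\sigma^{2}\) and learned parameters \(\gamma,\beta\) is equivalent to a single convolution with
\[
W^{\prime}=\frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\,W,\qquad b^{\prime}=\beta+\frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\,(b-\mu),
\]
so no batch-normalization hardware is needed at inference time.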
So, the accelerator includes three engines: one for the first layer, one for the last layer, and one for all hidden layers (see Figure
9).
The three engines work in parallel and can be used in a dataflow streaming fashion to process a sequence of images. The architecture of each engine was configured before scheduling, as specified in the engines in the figure. The configurations were determined based on the execution times of each macro-layer. Knowing the number of MACs, parameters, and map sizes necessary to run each macro-layer, the ratio between layer operations and the peak performance of each engine was found, together with the expected data transfer times, assuming a transfer rate of 16 bytes/CLK (see Table
4).
The 3D core engine has the best expected computation time. However, the non-overlapping data transfers degrade the total peak execution time.
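A minimal sketch of how such estimates can be obtained is shown below; the function and parameter names are ours, the 16 bytes/CLK transfer rate comes from the text, and whether transfers overlap with computation is passed as a flag:

```cpp
#include <cstdint>
#include <algorithm>
#include <cstdio>

// Rough per-layer time estimate used to size the engines (illustrative only).
// Compute time  = MACs / (parallel MACs per cycle) cycles.
// Transfer time = bytes moved / 16 bytes per cycle (assumption from the text).
// With non-overlapping transfers the two terms add up; with full overlap the
// slower of the two would dominate instead.
struct LayerCost {
    uint64_t macs;            // multiply-accumulates of the macro-layer
    uint64_t bytes_moved;     // weights + input/output feature maps (bytes)
};

double estimate_ms(const LayerCost &l, unsigned macs_per_cycle,
                   double clk_mhz, bool overlap_transfers) {
    double compute_cycles  = double(l.macs) / macs_per_cycle;
    double transfer_cycles = double(l.bytes_moved) / 16.0;   // 16 B/CLK
    double total_cycles = overlap_transfers
                              ? std::max(compute_cycles, transfer_cycles)
                              : compute_cycles + transfer_cycles;
    return total_cycles / (clk_mhz * 1e3);                   // cycles -> ms
}

int main() {
    // Hypothetical macro-layer: 0.5 GMAC, 4 MB of traffic, an engine with
    // 512 MACs/cycle at 200 MHz, transfers not overlapped with compute.
    LayerCost layer{500'000'000ULL, 4'000'000ULL};
    std::printf("estimated time: %.2f ms\n",
                estimate_ms(layer, 512, 200.0, /*overlap_transfers=*/false));
}
```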
The hardware/software system with the multi-engine architecture was implemented and tested on the board, and the results were compared with the state of the art. The multi-engine runs at 200 MHz (see Table
5).
The multi-engine architecture is better than the other
\(4\times 4\) solution in terms of GOPS/kLUT and GOPS/DSP. The FPS of the proposed architecture is smaller but it runs ResNet50 while the other work runs ResNet18. Compared to Reference [
25], the main difference in the throughput comes from the limited memory bandwidth and on-chip memory of our FPGA device. It is not evident from the results, but the bottleneck of our solution is data communication.
5.2 YOLOv3 Model
YOLO and all its versions [
15] are one-stage object detectors with a common model topology based on convolutional neural networks. The YOLO detector extracts features using a CNN and returns candidate bounding boxes from those features for three different scales: 52 × 52, 26 × 26, and 13 × 13.
The original CNN model of YOLO has convolutional, shortcut, YOLO, upsample, and route layers. Convolutional layers with stride two replace max-pooling. Batch normalization is applied to all convolutional layers, and all layers use the Leaky ReLU activation function, except the layers immediately before the YOLO layers, which use a linear activation function. YOLO is able to detect objects of different sizes using three different scales: 52 × 52 for small objects, 26 × 26 for medium objects, and 13 × 13 for large objects. YOLOv3-Tiny replaces the convolutions with a stride of two by convolutions followed by max-pooling and does not use shortcut layers.
Table
6 details the sequence of layers with regard to the input, output, and kernel sizes, and the activation function used in each convolutional layer of YOLOv3-Tiny.
The first part of the network is composed of a series of convolutional and maxpool layers. The detection and classification part of the network performs object detection and classification at the (\(13 \times 13\)) and (\(26\times 26\)) grid scales. The detection at the lower resolution is obtained by passing the feature extraction output through \(3\times 3\) and \(1\times 1\) convolutional layers and a YOLO layer at the end.
The detection at the higher resolution follows the same procedure but uses FMs from two layers of the network. The second detection uses intermediate results from the feature extraction layers concatenated with upscaled FMs used for the lower resolution detection.
Following the design flow, the model was quantized with 8 bits, achieving a precision of 30.8 mAP50 on the COCO 2017 test dataset (the original floating-point model scores 32.9 mAP50). Lower bitwidths introduce large errors and therefore were not considered.
The model consists of 3D convolutional layers with maxpool, without maxpool, and with upsample. Different activation functions are used: Leaky ReLU, linear, and sigmoid.
The first layer is a convolution (\(16\times 3 \times 3\) filters) followed by a Leaky ReLU and max pooling. A dedicated engine was also considered for this layer. Since the input has just three channels, the depthwise with accumulation engine was used.
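One possible interpretation of this engine, sketched below as plain software (the actual hardware parallelism and dataflow are not shown here): each of the three input channels is convolved independently, as in a depthwise convolution, and the per-channel partial results are accumulated to obtain the regular convolution output.

```cpp
#include <vector>

// Illustrative software model of a "depthwise with accumulation" first-layer
// convolution: each input channel is filtered independently (depthwise step)
// and the per-channel partial results are summed (accumulation step), which
// is equivalent to a standard convolution when the input has few channels.
// Shapes, strides, and padding handling are simplified assumptions.
std::vector<std::vector<int>> conv_first_layer(
    const std::vector<std::vector<std::vector<int>>> &in,               // [C][H][W]
    const std::vector<std::vector<std::vector<std::vector<int>>>> &w,   // [F][C][K][K]
    int stride) {
    const int C = in.size(), H = in[0].size(), W = in[0][0].size();
    const int F = w.size(), K = w[0][0].size();
    const int Ho = (H - K) / stride + 1, Wo = (W - K) / stride + 1;
    std::vector<std::vector<int>> out(F, std::vector<int>(Ho * Wo, 0));
    for (int f = 0; f < F; ++f)
        for (int c = 0; c < C; ++c)               // depthwise pass per channel
            for (int y = 0; y < Ho; ++y)
                for (int x = 0; x < Wo; ++x) {
                    int acc = 0;
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx)
                            acc += in[c][y * stride + ky][x * stride + kx] *
                                   w[f][c][ky][kx];
                    out[f][y * Wo + x] += acc;    // accumulate across channels
                }
    return out;
}
```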
All the remaining layers are based on 3D convolutions (see Table
7).
All layers can be implemented with the 3D convolution engine. To use a single convolution engine for all layers, it was necessary to design a configurable activation function with support for both Leaky ReLU and Sigmoid.
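A minimal sketch of such a configurable activation is given below; the Q8.8 fixed-point format, the 0.1 leaky slope, and the piecewise-linear sigmoid approximation are our assumptions (the actual engine could, for instance, use a lookup table for the sigmoid):

```cpp
#include <cstdint>

enum class Act : uint8_t { Linear, LeakyReLU, Sigmoid };

// Configurable activation in Q8.8 fixed point (assumed format).
// Leaky ReLU uses a 0.1 slope (~26/256); the sigmoid is a coarse
// piecewise-linear approximation: 0 below -4, 1 above +4, 0.5 + x/8 between.
int32_t activate(int32_t x, Act act) {
    switch (act) {
        case Act::LeakyReLU:
            return x >= 0 ? x : (x * 26) / 256;   // ~0.1 * x for x < 0
        case Act::Sigmoid: {
            const int32_t one  = 1 << 8;          // 1.0 in Q8.8
            const int32_t four = 4 << 8;          // 4.0 in Q8.8
            if (x <= -four) return 0;
            if (x >=  four) return one;
            return (one / 2) + (x / 8);           // 0.5 + x/8
        }
        case Act::Linear:
        default:
            return x;                             // identity
    }
}
```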
So, the accelerator includes two engines: one for the first layer and one for all hidden layers (see Figure
10).
The architecture of each engine was also configured initially as specified in the engines in the figure. Once again, knowing the number of MACs, parameters, and map sizes necessary to run each layer, the ratio between layer operations and the peak performance of each engine was found, together with the expected data transfer times, assuming a transfer rate of 16 bytes/CLK (see Table
8).
The hardware/software system with the multi-engine architecture was implemented and tested on the board, and the results were compared with the state of the art. The multi-engine runs at 200 MHz (see Table
9).
Only one implementation of YOLOv3-Tiny on FPGA with an \(8\times 8\) quantization was found. Therefore, the proposed solution was also compared with previous solutions that use larger bitwidths. Compared to the work from Reference [2], which also uses an 8-bit quantization on the same FPGA, the proposed multi-engine is \(4.3\times\) faster and achieves higher performance-to-area ratios.
5.3 DeepLabV3 Model
Semantic segmentation is the process of assigning a label to each pixel of an input image. This is useful in applications such as object detection, where several distinct objects can be identified. A common architecture for semantic segmentation is the Encoder-Decoder structure [
26,
45]. The Encoder-Decoder structure is a two-stage network: the encoder, usually a network such as ResNet or Xception, compresses its input through several convolutions and captures the semantic information, while the decoder assigns a label to each output pixel from the encoder's output FMs.
The semantic segmentation model considered in this work was the DeepLabV3+ [
13] with a Modified Aligned Xception backbone. DeepLabV3+ is an architecture that employs the Encoder-Decoder structure. A simple decoder is added to its predecessor DeepLabV3, which makes use of the
ASPP (Atrous Spatial Pyramid Pooling). ASPP applies several atrous convolutions with different rates in parallel to compute features at multiple scales, which helps with detecting objects that may be too far from or too close to the camera. Different backbones can be used with DeepLabV3+. Recently, with a modified Xception as a backbone, this model was able to improve on the results of its predecessor DeepLabV3.
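As a small illustration of how the atrous rate enters the computation (a 1-D sketch of ours, not the engine's actual implementation), a dilated convolution simply spaces the kernel taps by the rate:

```cpp
#include <vector>

// 1-D atrous (dilated) convolution: taps are spaced by `rate`, so the same
// 3-tap kernel covers a receptive field of 2*rate + 1 input samples.
// Border samples for which the dilated window does not fit are skipped here
// for simplicity (a real layer would use zero padding instead).
std::vector<float> atrous_conv1d(const std::vector<float> &in,
                                 const std::vector<float> &k, int rate) {
    const int n = in.size(), taps = k.size();
    const int span = (taps - 1) * rate;          // receptive field minus one
    std::vector<float> out;
    for (int i = 0; i + span < n; ++i) {
        float acc = 0.f;
        for (int t = 0; t < taps; ++t)
            acc += in[i + t * rate] * k[t];      // tap spacing = dilation rate
        out.push_back(acc);
    }
    return out;
}
```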
This network consists of 142 convolutions with ReLU as the activation function. The network also contains other layers, such as feature-map additions, average pooling, and interpolations. Figures 11 and 12 show the Modified Aligned Xception and the DeepLabV3+ networks, respectively, with all the convolutional layers listed in order of execution together with their most important parameters.
This network features separable convolutions, atrous convolutions, shortcut connections with addition, average pooling, bilinear interpolations, and map concatenations. Each separable convolution block is composed of a depthwise convolution, a batch normalization layer, a pointwise convolution, another batch normalization layer, and a ReLU layer. A convolution block is composed of a convolution, a batch normalization layer, and a ReLU layer, except for the last convolution, where there is no ReLU layer. The network is divided into several parts: the encoder, which is composed of a backbone network (here, the Modified Aligned Xception, although it could be replaced with another one) and the ASPP block, which gathers features at multiple scales using atrous convolutions; and the decoder block, which follows the encoder and is responsible for decoding the features gathered by the encoder. The Modified Aligned Xception itself is divided into three sections: the Entry flow, which consists of several separable convolutions, some with a stride of 2 to reduce the size of the FMs; the Middle flow, which is repeated 16 times in sequence; and the Exit flow, which contains some convolutions with dilation.
The DeepLabV3+ with an Xception Backbone was trained with the Corsican Fire dataset using only RGB images sized 300 × 300. The Corsican dataset
1 is composed of a total of 1,135 images of fires acquired in the visual range (i.e., RGB values) and in the near infrared (i.e., single-channel images) under various conditions of positioning, weather, vegetation, distance to the fire, and brightness. The dataset was divided randomly into three parts: the training partition, with 682 images used to train the network; the validation partition, with 226 images evaluated after each training epoch to keep track of the training process; and the test partition, with 226 images used to evaluate the network's final mIoU. The training was done with a learning rate of 0.02 and a cross-entropy loss function. The mIoU values reached for validation and testing are 92.45% and 91.5%, respectively.
After training the network, several quantization attempts with varying weight and activation bitwidths were made to determine the best compromise between mIoU and hardware resources. Table
10 shows the results obtained.
Bitwidths from 8 down to 3 were tested for both activations and weights, as lowering them further would yield mIoU results below 80%. These results are probably not the best achievable with these bitwidths, since training was stopped after 30 epochs. However, at 30 epochs the loss variation is small, meaning that only a small improvement would be achieved with more training epochs. Decreasing the weight and activation bitwidths lowers the accuracy of the network, and with an activation bitwidth of 3, the accuracy decreases significantly, independently of the weight bitwidth. Since using weights with a bitwidth that is not a power of 2 wastes memory resources, and since the mIoU of 87.31% achieved with a bitwidth of 4 for both activations and weights is less than 5 pp below the mIoU of the original PyTorch network, the rest of this work uses this bitwidth configuration.
Afterwards, the batch normalization layers were merged with their respective convolutional layers using the developed Batch Normalization Merger tool, which decreased the mIoU to 81.21%. After a further 50 epochs of fine-tuning, the mIoU reached 92.76% (an even higher value than the original PyTorch mIoU of 91.53%). The reason the quantized network performs better than the non-quantized PyTorch network might be the noise associated with the quantization.
The architecture was split into the three different engines: 3D Convolution, Depthwise Convolution, and Dilated Convolution. The 3D Convolution engine also computes the pointwise convolutions.
The 3D convolution engine includes pre-processing with average pooling and post-processing with ReLU, concatenation with addition, and interpolation.
The average pooling is able to average a variable-sized map into a 1 × 1 map by reading the input stream values and adding them to the previously read values. Once all the values are read, the accumulated sum is divided by the total number of pixels. To reduce hardware usage and execution time, the division is replaced by a multiplication by a statically calculated value (\(\frac{2^{25}}{N}\)) followed by a bit shift corresponding to a division by \(2^{25}\), as shown in Equation (12).
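A minimal software sketch of this multiply-and-shift average (the \(2^{25}\) scaling follows the text; the data widths are assumptions):

```cpp
#include <cstdint>

// Global average pooling without a hardware divider: the sum of the N input
// pixels is multiplied by a statically computed factor 2^25 / N and the
// result is shifted right by 25 bits, which approximates sum / N.
int32_t average_pool(const int16_t *pixels, uint32_t n_pixels) {
    int64_t sum = 0;
    for (uint32_t i = 0; i < n_pixels; ++i)   // accumulate the streamed map
        sum += pixels[i];
    const int64_t factor = (int64_t(1) << 25) / n_pixels;  // 2^25 / N, static
    return int32_t((sum * factor) >> 25);                  // (sum*2^25/N)/2^25
}
```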
The post-processing of the 3D convolution engine includes a bilinear interpolation, which was not in the original implementation of the engine used for the first two models. The interpolation works by applying linear interpolation in two directions. The linear interpolation formula can be seen in Equation (13), where \(x_1\) and \(x_2\) are the known coordinates along one dimension, \(Q_1\) and \(Q_2\) are the values associated with those coordinates, and \(P\) is the value to be calculated for the new \(x\). The bilinear interpolation formula, seen in Equation (14), can be deduced from the linear interpolation formula when moving to a 2-dimensional plane.
Implementing this formula with fixed-point numbers requires scaling the positions of the known points so that there are no fractional bits. The scaling factor is related to the size of the bigger map; e.g., if the bigger map has a size of 10 pixels, then the scaling factor is 9. The final division of the interpolation can be replaced by a bit shift by first multiplying by a factor that is the result of dividing a power of two by the actual denominator. This factor is calculated once at the beginning of the interpolation process, since its value is always the same, and the formula can then be rewritten as a multiplication by this factor followed by a right shift.
Before starting the Bilinear Interpolation, the module stores two lines of the map. Afterwards, values are loaded and the interpolation is performed while loading the next line of values.
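The following sketch illustrates the fixed-point interpolation with the precomputed reciprocal factor; the 16-bit shift and the integer widths are our assumptions, not necessarily the values used in the engine:

```cpp
#include <cstdint>

// Linear interpolation between two known samples q1 at x1 and q2 at x2,
// evaluated at x (all positions already scaled to integers, as described in
// the text). The division by (x2 - x1) is replaced by a multiplication with
// a precomputed factor 2^16 / (x2 - x1) followed by a right shift by 16.
struct InterpFactor {
    int32_t factor;   // 2^16 / (x2 - x1), computed once per interpolation
};

InterpFactor make_factor(int32_t x1, int32_t x2) {   // assumes x2 > x1
    return { int32_t((int64_t(1) << 16) / (x2 - x1)) };
}

int32_t lerp(int32_t q1, int32_t q2, int32_t x1, int32_t x2,
             int32_t x, InterpFactor f) {
    // P = (q1*(x2 - x) + q2*(x - x1)) / (x2 - x1)
    int64_t num = int64_t(q1) * (x2 - x) + int64_t(q2) * (x - x1);
    return int32_t((num * f.factor) >> 16);
}

// Bilinear interpolation: two horizontal lerps over the two stored lines,
// followed by one vertical lerp between their results.
int32_t bilerp(int32_t q11, int32_t q21, int32_t q12, int32_t q22,
               int32_t x1, int32_t x2, int32_t y1, int32_t y2,
               int32_t x, int32_t y, InterpFactor fx, InterpFactor fy) {
    int32_t top    = lerp(q11, q21, x1, x2, x, fx);  // along x at y = y1
    int32_t bottom = lerp(q12, q22, x1, x2, x, fx);  // along x at y = y2
    return lerp(top, bottom, y1, y2, y, fy);         // along y
}
```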
The final architecture with the three engines is illustrated in Figure
13.
The three engines work in parallel and can be used in a dataflow streaming fashion to process a sequence of images. The architecture of each engine was configured as specified in the engines in the figure.
The hardware/software system has four DMAs: two DMAs are connected to the 3D convolution engine, one DMA is connected to the depthwise convolution engine, and the other is connected to the dilated convolution engine.
The system was synthesized and implemented, achieving a working frequency of 175 MHz. Table
11 shows the hardware utilization of the whole system.
As shown, no hardware limits are hit, with the highest resource usage being the BRAMs. The FPGA bitstream was then generated and exported, along with the necessary files, to the Vitis IDE to run the network.
The network was run on the Ultra96-v2 board, the 226 test images were processed, and their mIoU was evaluated (92.7%). The input images, weights, and ground-truth images were stored in the board's DDR memory before execution. The input images were saved already normalized, so no pre-processing is needed. Every output image was compared with the corresponding ground-truth image to calculate the mIoU, confirming that the implementation is correct. The execution time was then recorded and averaged over the 226 runs to determine the system's throughput. A throughput of 1.4 frames per second was observed.
The results were compared with the only FPGA implementation of DeepLabV3+ that was found (see Table
12).
The solutions compared in the table target different FPGAs. The work from Reference [
27] considers a high-density FPGA. However, it uses far more resources than our multi-engine architecture while achieving a \(5.4\times\) lower throughput.