
An Intermediate-Centric Dataflow for Transposed Convolution Acceleration on FPGA

Published: 09 November 2023

Abstract

Transposed convolution has become prevalent in convolutional neural networks (CNNs), playing an important role in multiple scenarios such as image segmentation and the back-propagation process of training CNNs. This mainly stems from its ability to up-sample the input feature maps by interpolating new information from the input feature pixels. However, the backward-stencil computation constrains its performance and hinders its wide application across diverse platforms. Moreover, in contrast to the extensive efforts on accelerating convolution, there has been little investigation into accelerating transposed convolution, which is just as compute-intensive.
To accelerate transposed convolution, we propose an intermediate-centric dataflow scheme, in which we decouple the generation of the intermediate patches from their further processing, aiming to perform the backward-stencil computation efficiently. The intermediate-centric dataflow breaks the transposed convolution into several phases/stages, so that feeding the input feature maps and performing the backward-stencil computation proceed in a pipelined manner. It also provides four-degree computation parallelism and efficient data reuse of input feature maps and weights. Furthermore, we theoretically analyze the irregular data dependence that constrains the parallel computation of transposed convolution, leveraging the polyhedral model. Additionally, we formulate an optimization problem to explore the design space and automatically generate the optimal design configurations for different transposed convolutional layers and hardware platforms. Selecting representative transposed convolutional layers from DCGAN, FSRCNN, and FCN, we generate the corresponding accelerator arrays of the intermediate-centric dataflow on the Xilinx Alveo U200 platform and reach 3.92 TOPS, 2.72 TOPS, and 4.76 TOPS, respectively.

1 Introduction

Recently, transposed convolutional layers have gradually become prevalent in several kinds of convolutional neural networks (CNNs), such as generative adversarial networks (GANs) and fully convolutional networks (FCNs), which are widely applied in image segmentation [11], style transfer [8], and super-resolution applications [6]. Typically, both transposed convolution and convolution are computation-intensive operators in CNNs. Convolution extracts useful feature information and down-samples the input feature maps. On the contrary, transposed convolution typically up-samples the feature maps by interpolating useful information and generates larger output features, as plotted in Figure 1. As the reverse process of convolution, transposed convolution is also widely used in the back-propagation algorithm during the training phase.
Fig. 1. Illustration of transposed convolution.
In contrast to convolution, which has been extensively optimized in many prior works [13, 24, 26], transposed convolution has received far less attention. As transposed convolution occupies a large portion of these CNNs, it gradually becomes the performance bottleneck when running the networks. For instance, the convolutional layers and transposed convolutional layers of DCGAN [14] have almost identical computational complexity. Typically, the acceleration of transposed convolution is limited by the backward-stencil computation. That is, while executing the transposed convolution operation, we multiply each input feature element by the whole weight kernel, generating up-sampled intermediate patches that partly overlap with adjacent ones. Then, we need to accumulate the horizontal and vertical overlaps of the intermediate patches to obtain the final results. Typically, the overlaps between intermediate patches exhibit complex data inter-dependence and irregular computation patterns, which can severely limit computation parallelism and performance.
To perform the backward-stencil computation of transposed convolutional layers, we propose an intermediate-centric dataflow in this article, as plotted in Figure 3. We break the transposed convolution into several pipelined phases. The generation of intermediate patches is decoupled from the overlap manipulation, which enables pipelined computation of the overall transposed convolution. In this way, we can feed input feature elements and weights into the accelerator array while concurrently calculating the horizontal and vertical overlaps. We compute the intermediate patches along array columns and derive output-reuse paths that handle the overlaps by data movement and accumulation within the array. Furthermore, we qualitatively analyze the irregular data dependence introduced by the backward-stencil computation leveraging the polyhedral model, which inspires the intermediate-centric dataflow. Finally, we propose an automatic framework, in which we formulate an optimization problem to explore the design space of the accelerator array with analytical models and automatically generate an optimal design configuration given the hardware resources and layer parameters.
Fig. 2. Mapping transposed convolution to the systolic array.
Fig. 3. Overview of the intermediate-centric dataflow.
Our contributions in this article mainly include:
(1)
We analyze the irregular data dependence incurred by the backward-stencil computation leveraging the polyhedral model, which impedes the parallel computation of transposed convolution.
(2)
We propose the intermediate-centric dataflow, which decouples the generation of the intermediate patches from their further manipulation, enabling transposed convolution to be accelerated in a pipelined way.
(3)
We devise an automatic framework to explore the design space with analytical models and to generate the accelerator array from hardware templates, given network parameters and hardware resources.
The rest of the article is organized as follows: We introduce the backward-stencil computation and the non-uniform dependence in Section 2. We present the overview of the intermediate-centric dataflow in Section 3.1 and the corresponding hardware design in Section 3.2. Then, we describe the proposed dataflow in detail in Sections 3.3 through 3.5. In Section 4, we introduce the design space exploration and the analytical model. To demonstrate the effectiveness of the intermediate-centric dataflow, we present experiments on different transposed convolutional layers in Section 5. Finally, we discuss related work in Section 6 and conclude in Section 7.

2 Background and Motivation

We utilize the parameter tuple \((K, S, I_{H}, I_{W}, I_{C}, O_{C})\) to denote one transposed convolutional layer. Among these parameters, \(I_{H}\) and \(I_{W}\) denote the height and width of the input feature maps, respectively. The kernel size is \(K \cdot K\), and the stride size is \(S\). Additionally, \(I_{C}\) and \(O_{C}\) stand for the numbers of input and output channels, respectively. The height \(O_H\) and width \(O_W\) of an output feature map can be deduced from this parameter tuple, as shown in Equations (1) and (2):
\begin{equation} O_{H} = (I_{H} - 1) \cdot S + K, \end{equation}
(1)
\begin{equation} O_{W} = (I_{W} - 1) \cdot S + K. \end{equation}
(2)
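For example, for a layer with \(K=3\), \(S=2\), and \(I_{H} = I_{W} = 16\) (the configuration used as a running example in Section 3.2), Equations (1) and (2) give \(O_{H} = O_{W} = (16-1) \cdot 2 + 3 = 33\).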

2.1 Backward-Stencil Computation

The transposed convolution algorithm is depicted in Figure 1, and its nested loops and MAC operations demonstrate its computational intensity. The process proceeds in two steps: (i) We multiply each input feature element by the weight kernel, generating an intermediate patch. The size and the number of intermediate patches equal the size of the weight kernel and the number of input feature elements, i.e., \(K \cdot K\) and \(I_{W} \cdot I_{H} \cdot I_{C}\), respectively. Then, we accumulate the intermediate patches produced by input pixels with identical \((I_{h}, I_{w})\) indices to flatten the input channel dimension. (ii) These flattened intermediate patches overlap with adjacent ones in both the vertical and horizontal directions, since \({S}\) is typically smaller than \({K}\). It is necessary to add up the overlapped parts in these two directions, as depicted by the colored areas in Figure 1, leading to an output feature map of size \(O_H \cdot O_W\). This is the backward-stencil computation of transposed convolution, and we present the proposed solution in Section 3. Generally, the overlapped red and blue pixels are produced by horizontally and vertically adjacent intermediate patches, respectively, which are in turn generated by horizontally and vertically adjacent input feature pixels. Besides, the overlapped purple output pixels are generated jointly by intermediate patches adjoining in both directions.
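To make the two steps concrete, the following minimal NumPy sketch (our own reference formulation, not the accelerator dataflow; array shapes and loop order are assumptions for illustration) computes a transposed convolution exactly as described above: each input pixel is multiplied by the whole kernel to form a \(K \times K\) intermediate patch, and overlapping patch elements are accumulated into the output.

```python
import numpy as np

def transposed_conv(inp, wt, S):
    """Reference backward-stencil transposed convolution (illustrative sketch).

    inp: input feature maps, shape (I_C, I_H, I_W)
    wt : weights, shape (O_C, I_C, K, K)
    S  : stride
    Returns an output of shape (O_C, (I_H-1)*S+K, (I_W-1)*S+K).
    """
    I_C, I_H, I_W = inp.shape
    O_C, _, K, _ = wt.shape
    O_H, O_W = (I_H - 1) * S + K, (I_W - 1) * S + K
    out = np.zeros((O_C, O_H, O_W), dtype=inp.dtype)
    for oc in range(O_C):
        for ih in range(I_H):
            for iw in range(I_W):
                # Step (i): one K x K intermediate patch per input pixel,
                # already flattened (summed) over the input-channel dimension.
                patch = np.tensordot(inp[:, ih, iw], wt[oc], axes=([0], [0]))
                # Step (ii): accumulate the patch into the output; adjacent
                # patches overlap by K - S pixels when S < K.
                out[oc, ih * S:ih * S + K, iw * S:iw * S + K] += patch
    return out

# Example with the layer (K, S, I_H, I_W, I_C, O_C) = (3, 2, 16, 16, 3, 5):
y = transposed_conv(np.ones((3, 16, 16)), np.ones((5, 3, 3, 3)), S=2)
print(y.shape)  # (5, 33, 33)
```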

2.2 Non-uniform Dependence Distance

In this section, we explain the non-uniform dependence distance incurred by the backward-stencil computation. Typically, the dependence distance denotes the distance between dependent iterations, e.g., \(i \rightarrow i+1\) and \(i \rightarrow 2i\) in the polyhedral model. Initially, we attempted to map the transposed convolutional layer onto a systolic array in the same way as convolution [17], leveraging the state-of-the-art compiler AutoSA [16]. However, the mapping breaks down owing to the non-uniform dependence.
When mapping programs containing nested loops onto a standard systolic array leveraging the polyhedral model [3], we generally select two loops as the space loops and one loop as the time loop. The dependence (flow, anti, output, and read) distances should be kept constant, and the dependence distances on the space loops should be no larger than 1 so that data communication occurs only between adjacent Processing Elements (PEs). In the transposed convolutional layer of Figure 1, we can assign one of \(k_{h}\), \(k_{w}\), and \(o_{c}\) as the space loop for accessing weights, and the read-after-read (RAR) dependence distances of these assignments can be 1. Similarly, we can assign one of \(i_{h}\) and \(i_{w}\) as the space loop for accessing input feature maps, with corresponding RAR dependence distances of 1. Then, we set \(i_{c}\) as the time loop so that all PEs perform MAC operations along the input channel dimension. Consequently, we can map the transposed convolution onto the systolic array as plotted in Figure 2(a), where the weight elements and input feature maps are reused along the input height dimension and kernel height dimension, respectively. The array size can be \(I_{H} \cdot K_{H}\), supposing that there are abundant hardware resources.
Adopting the mapping strategy shown in Figure 2(a), in which \(S\) is set to 2, we separately feed one column of the input feature map and one column of weights into the array. The input feature element \((i_{h}, i_{w})\) performs MAC operations with the weight element \((k_{h}, k_{w})\) at PE\((i_{h}, k_{h})\), and the intermediate MAC result is just a part of the final output result \(o(S \cdot i_{h} + k_{h}, S \cdot i_{w} + k_{w}, o_{c})\). However, the next input feature element \((i_{h^{\prime }}, i_{w^{\prime }})\), where \(i_{h^{\prime }} = i_{h}+1\) and \(i_{w^{\prime }}= i_{w}\) as shown in Equation (3), performs its MAC operation with the weight element \((k_{h^{\prime }}, k_{w^{\prime }})\), where \(k_{h^{\prime }}=k_{h}-S\) and \(k_{w^{\prime }}=k_{w}\) as shown in Equation (4). The produced intermediate MAC result must be added to the one produced in PE\((i_{h}, k_{h})\) to form the final result, as shown in Figure 2(b). That is, while computing the output result, the distance between two dependent \(i_{h}\) iterations is uniform, i.e., \(i_{h} \rightarrow i_{h}+1\). However, the distance between dependent \(k_{h}\) iterations changes from \(k_{h} \rightarrow k_{h}+1\) to \(k_{h} \rightarrow k_{h}-S\), resulting in a non-uniform read-after-write (RAW) dependence. This breaks the second constraint [3, 16] when leveraging the polyhedral model to map the program onto a systolic array.
As can be seen, the backward-stencil computation introduces an irregular computation pattern and non-uniform dependence, severely limiting the parallel computation of transposed convolution. To perform the backward-stencil computation or speed up transposed convolution, researchers have mainly proposed two kinds of methods. The works [5, 12] transform the transposed convolution into multiple convolution operations, which avoids the backward-stencil computation and allows convolution acceleration schemes to be adopted. However, this method performs a complex transformation in software and introduces extra computation and transformation overhead. On the other hand, the works [15, 19, 20] temporarily store the intermediate patches in large on-chip memories to perform the backward-stencil computation.
\begin{equation} \begin{aligned}& i_\mathit {h^{\prime }} = i_\mathit {h} + 1, \\ & i_\mathit {w^{\prime }} = i_\mathit {w,} \end{aligned} \end{equation}
(3)
\begin{equation} \begin{aligned}& k_\mathit {h^{\prime }} = k_\mathit {h} - S, \\ & k_\mathit {w^{\prime }} = k_\mathit {w.} \end{aligned} \end{equation}
(4)
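The non-uniform jump in \(k_{h}\) can be seen by enumerating, for a fixed output row \(o_{h} = S \cdot i_{h} + k_{h}\), all \((i_{h}, k_{h})\) pairs that accumulate into it. The short sketch below (illustrative only; the helper name is ours) prints these pairs: between consecutive contributing iterations, \(i_{h}\) advances by 1 while \(k_{h}\) decreases by \(S\), which is exactly the dependence of Equations (3) and (4).

```python
K, S, I_H = 3, 2, 16  # kernel size, stride, input height

def contributors(o_h):
    """All (i_h, k_h) pairs whose partial products accumulate into output row o_h."""
    return [(i_h, o_h - S * i_h)
            for i_h in range(I_H)
            if 0 <= o_h - S * i_h < K]

print(contributors(4))  # [(1, 2), (2, 0)]: i_h advances by +1, k_h jumps by -S
print(contributors(6))  # [(2, 2), (3, 0)]
```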

3 Intermediate-Centric Dataflow

3.1 Overview

To perform the backward-stencil computation of transposed convolution, we propose the intermediate-centric dataflow scheme, which decouples the generation of intermediate patches from their further processing to realize the overall transposed convolution. The proposed dataflow includes two phases: the Flattening Phase and the Overlapping Phase, as shown in Figure 3. The Flattening Phase generates the intermediate patch elements by multiplying input feature elements by weights. Then, we transfer and accumulate the overlapped intermediate patch elements in the Overlapping Phase. First, we feed the weight kernel and the \((N-1)\)-th row of the input feature map with \(I_{C}\) input channels into the accelerator array. In the Flattening Phase, we multiply and accumulate the input feature elements with the weights to generate the intermediate patches. These intermediate patch elements, plotted in dark blue in Figure 3, are flattened and distributed across the array PEs. At this point, the intermediate patch elements are ready for the Overlapping Phase, and we can concurrently feed the \(N\)th row of the input feature map into the array for the next Flattening Phase. Regarding the intermediate patches in the Overlapping Phase, we move and add up the intermediate patch elements stored in the PEs linked by green oblique lines in the first stage. Then, we move the vertically overlapped intermediate patch elements between PEs linked by purple vertical lines in the second stage. Generally, we can pipeline the Flattening Phase and the Overlapping Phase over distinct rows of the input feature map.

3.2 Hardware Support

3.2.1 Accelerator Overview.

To support the proposed intermediate-centric dataflow, we devise a matching accelerator, as plotted in Figure 4. The proposed architecture mainly contains an array of PEs and the associated interconnect, which consists of mesh networks and the output-reuse path. Generally, in the Flattening Phase, input feature elements and weight elements flow along the mesh networks through the array PEs, which perform the MAC operations between the two kinds of elements. Then, during the Overlapping Phase, we transfer the overlapped intermediate patch elements along the output-reuse path to their destination PEs, where the overlapped elements are accumulated. The array size is determined by the given transposed convolutional layer \((K, S, I_{H}, I_{W}, I_{C}, O_{C})\). To implement transposed convolution on the accelerator array, we unroll and map the kernel height and kernel width dimensions to the array rows. That is, the number of rows of the proposed accelerator array equals \(K_{H} \cdot K_{W}\). Since we map the input width dimension to the array columns, the number of array columns equals \(I_{W}\). Additionally, as there is no data dependence between output channels, we compute them concurrently within the PEs with parallel degree \(O_{C}\). Consequently, one accelerator array \(({K}^2, {I}_{W}, O_{C})\) is devised for the target transposed convolution: the numbers of rows and columns equal \(K^2\) and \(I_{W}\), respectively, and the parallel factor equals \(O_{C}\). For instance, for the transposed convolutional layer \((3,2,16,16,3,5)\), we devise the corresponding accelerator array \((9, 16, 5)\). In this way, we achieve four-degree parallelism in the accelerator array with the associated dataflow strategy explained below.
Fig. 4. (a) Overall architecture of the proposed accelerator architecture, in which \((K, S, I_{W})\) equals (3, 2, 5). (b) Hardware design of array PEs.
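To make this sizing rule concrete, the short sketch below (Python; the function name and the optional \(O_{TC}\) argument are our own illustration) maps a layer tuple to the array shape \((K^{2}, I_{W}, O_{TC})\):

```python
def array_shape(K, S, I_H, I_W, I_C, O_C, O_TC=None):
    """Map a transposed-convolution layer to an accelerator array (rows, columns, parallel factor).

    Rows unroll the kernel height/width dimensions, columns unroll the input
    width, and each PE computes O_TC output channels in parallel (O_TC = O_C
    unless tiling is applied in the output-channel dimension).
    """
    rows = K * K            # kernel height x kernel width
    cols = I_W              # input width dimension
    par = O_TC if O_TC is not None else O_C
    return rows, cols, par

print(array_shape(3, 2, 16, 16, 3, 5))               # (9, 16, 5), the example above
print(array_shape(5, 2, 16, 16, 256, 128, O_TC=10))  # (25, 16, 10), DCGAN 3rd layer (Table 4)
```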

3.2.2 Processing Element.

The accelerator PEs are designed as shown in Figure 4(b), in which the parallel factor \(O_{C}\) equals 1 for simplicity, and the functions of the PE registers are listed in Table 1. During the Flattening Phase, we multiply the input feature pixels by weight elements in different input channels and accumulate the products using the multipliers and adders located on the left side. We store this accumulated product, i.e., the intermediate result, in register \(R0\) for further processing. On the right side of the PE, the adder accumulates the horizontally and vertically overlapped intermediate results (described in Section 2) during the Overlapping Phase. Accordingly, the registers \(R1\) and \(R2\) receive the intermediate results of horizontal and vertical overlaps, respectively, which are transferred from other PEs. We store the accumulated overlapped intermediate results in register \(R3\). Alternatively, \(R3\) can also function as a router in the output-reuse path, which is introduced below.
Register | Function
\(R0\) | Store the intermediate results generated in the first phase
\(R1\) | Receive intermediate results transferred along the oblique lines
\(R2\) | Receive intermediate results transferred along the vertical lines
\(R3\) | Store the accumulated intermediate results
\(R4\) | Pass the input feature maps to the PEs below
\(R5\) | Pass the weight elements to the right-side PEs
Table 1. Function of PE Registers

3.2.3 Interconnect Network.

The interconnect network in the accelerator array mainly consists of two parts: the mesh network and the output-reuse path. Adjacent PEs are directly connected by the mesh network, along which input feature pixels and weight elements flow horizontally and vertically through the accelerator array in the Flattening Phase. By leveraging the mesh network, we realize efficient data reuse of input feature maps and weights in both directions. Besides, the output-reuse path consists of the green oblique lines and purple vertical lines, both of which represent the data dependences between the associated PEs. Along the oblique lines and vertical lines, we shift and accumulate the horizontal and vertical overlaps between intermediate patches, respectively, in the Overlapping Phase. The structure of the output-reuse path depends on the kernel size \(K\) and stride size \(S\). We present the output-reuse path in detail together with the dataflow strategy in the next section.

3.2.4 Array Buffer.

To avoid the long latency of direct access to off-chip storage, we implement the green Input Buffer and yellow Weight Buffer around the array to cache input feature maps and weights in a ping-pong manner. To match the accelerator array, the Weight Buffer contains \(K_{H} \cdot K_{W}\) buffers, each of which includes \(O_{C}\) banks; thus, the accelerator can fetch weights from the Weight Buffer with high on-chip bandwidth. Similarly, we devise the Input Buffer with \(I_{W}\) buffers. Additionally, the Output Buffer is implemented with \(O_{W}\) buffers, each of which has \(O_{C}\) banks. We temporarily write the generated output pixels into the Output Buffer in a ping-pong manner, after which we store the output results to off-chip storage.

3.3 Flattening Phase

We obtain the intermediate results by performing MAC operations between input feature maps and weights from different input channels in the Flattening Phase. As we map the kernel height and kernel width dimensions to array rows, the input width dimension to array columns, and the output channel dimension to parallel units within the PEs, each input feature element is multiplied by multiple weight kernels. Data reuse of input feature elements and weights is achieved in the vertical and horizontal directions, respectively.
As shown in Equation (5), the weight stream containing \(w(m, n, 0), \ldots , w(m, n, I_{C}-1)\) flows horizontally into the \(y\)th array row, where \(y = m \cdot K + n\). Each stream element \(w(m, n, i_{c})\) is actually a weight vector, which packs the \(w(m, n, i_{c}, o_{c})\) of several output channels \(o_{c}\). Correspondingly, the input pixel stream \(a(i_{h}, i_{w}, 0), \ldots , a(i_{h}, i_{w}, I_{C}-1)\), which belongs to the \(i_{h}\)th row of the input feature map, flows into the \(i_{w}\)th array column. In this way, the PEs simultaneously receive the weight vectors and associated input feature elements and perform MAC operations. We formulate the computation of the Flattening Phase as Equation (5):
\[\begin{gather} {R0}_{x,y} = \sum _{i_{c}=0}^{I_{C}-1} a(i_{h},x,i_{c}) \cdot w(m,n,i_{c}), \\ \text{where}\ y = m \cdot K + n, \nonumber \end{gather}\]
(5)
in which the MAC result is stored in register \({R0}_{x,y}\) of \(\mathit {PE}_{x,y}\). Each time a PE receives a weight vector and an input pixel, it performs the parallel MAC operations and updates the result.
When the input feature elements and weights have flowed out of the array, the Flattening Phase ends, and the calculated results are stored in \({R0}_{i,j}\) of \({PE}_{i,j}\). Then, the calculated results are shifted from \(R0\) to \(R3\) for the next operation. As this phase corresponds to step (i) described in Section 2, these results are the intermediate results that form the intermediate patches. In the accelerator architecture, each intermediate patch is distributed within one array column during this phase.
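A behavioral sketch of the Flattening Phase for one input row is given below (Python, for a single output channel; the array is modeled as a matrix of \(R0\) registers and mirrors Equation (5) rather than the cycle-level systolic movement):

```python
import numpy as np

def flattening_phase(a_row, wt, K):
    """Compute R0[x][y] for one row of the input feature map, per Equation (5).

    a_row: one row of the input feature map, shape (I_W, I_C)
    wt   : weights of one output channel, shape (K, K, I_C)
    Returns R0 of shape (I_W, K*K): column x of the array holds the flattened
    K*K intermediate patch of input pixel x.
    """
    I_W, I_C = a_row.shape
    R0 = np.zeros((I_W, K * K))
    for x in range(I_W):                   # array column <-> input width
        for m in range(K):                 # kernel height
            for n in range(K):             # kernel width
                y = m * K + n              # array row
                R0[x, y] = np.dot(a_row[x], wt[m, n])  # accumulate over i_c
    return R0
```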

3.4 Overlapping Phase

Generally, the Overlapping Phase contains two stages, i.e., the accumulation stage and the transfer stage, in which we accumulate the overlapped intermediate results and move the intermediate results upward, respectively.

3.4.1 Accumulation Stage.

We calculate the horizontal and vertical overlaps by accumulating the corresponding intermediate results in the accumulation stage. As shown in Figure 3(a), the intermediate patches generated by adjacent pixels within the same row of the input feature map form horizontal overlaps when \(S\) is smaller than \(K\); the width of the horizontal overlap between patches in the same row equals \(K-S\). We use \({I}^{k}_{x}\) to denote element \(x\) of the \(k\)th intermediate patch. In Figure 5, the elements \({I}^{0}_{2}, {I}^{0}_{5}, {I}^{0}_{8}\) of the 0th intermediate patch overlap with \({I}^{1}_{0}, {I}^{1}_{3}, {I}^{1}_{6}\), respectively. These overlapped intermediate elements, distributed in separate array columns, should be accumulated. On the other hand, by adopting the intermediate-centric dataflow, we map the \(k\)th intermediate patch to the \(k\)th array column, and the array can be partitioned into \(K\) row blocks, each of which contains \(K\) rows. The \(t\)th row of each intermediate patch belongs to the \(t\)th row block. Consequently, the horizontally overlapped elements of the \(t\)th patch row reside in different columns of the \(t\)th row block.
Fig. 5. The horizontally overlapped intermediate results are mapped to corresponding PEs within the same row block. The m/n and x/y denote the coordinates of the intermediate result and of the PE array, respectively, where K = 3 and S = 2.
To calculate the horizontal overlaps inside the row blocks, we use oblique lines to logically denote the associated PEs that store the horizontally overlapped intermediate results. In the actual implementation, the horizontally overlapped data are transferred along a dog-leg path, which is part of the output-reuse path, as depicted in Figure 6. The register \(R3\) of the other PEs that the oblique line passes through functions as a router to forward the intermediate results. Along the oblique line, we transfer the intermediate results from \({PE}_{x,y}\) to \({PE}_{x+1,y-S}\). After accumulating \({R3}_{x,y}\) into \({R3}_{x+1, y-S}\), the generated sum continues to move to \({PE}_{x+2, y- 2 \cdot S}\). In this way, we obtain the final horizontally overlapped results in the Converging PEs, which reside in the 0th row and the \((I_{W}-1)\)-th column within each row block. Additionally, there is another kind of PE, the Static PE, which is not connected by any oblique line when \(S \gt 1\). Typically, the total number of Converging PEs and Static PEs within each row block equals \((I_{W}-1) \cdot S + K\).
Fig. 6. The actual implementation of oblique line between PEs when (a) K = 3, S = 1. (b) K = 3, S = 2.
We now present the computation process within the row blocks, which determines the topological structure of the oblique lines. From the perspective of the Converging PE of the \(t\)th row block, the final horizontally overlapped result is given by Equation (9):
\begin{align} {R3}^{t}_{x,y} & \leftarrow {R0}^{t}_{x,y}, \end{align}
(6)
\begin{align} {R1}^{t}_{x,y} & \leftarrow {R1}^{t}_{x-1,y+S} & \text{for}\ 0 \le t \lt K,\ y\%K+S \lt K, \end{align}
(7)
\begin{align} {R3}^{t}_{x,y} & \leftarrow {R3}^{t}_{x,y} + {R1}^{t}_{x,y} & \text{for}\ 0 \le t \lt K. \end{align}
(8)
And thus,
\[\begin{gather} {R3}^{t}_{x,y} \leftarrow {R3}^{t}_{x,y} + \sum _{{\begin{matrix}0 \le i \lt \lceil K/S \rceil \\ x-i \ge 0 \\ y\%K+S\lt K\end{matrix}}} {R0}^{t}_{x-i, y+i \cdot S,} \end{gather}\]
(9)
in which \(R3\) stores the horizontally overlapped result of the Converging \({PE}_{x,y}\). It is calculated by accumulating the intermediate results of the associated PEs, which can be deduced from Equation (9); the constraints bound these associated PEs to the same row block. Furthermore, Equation (9) determines the end nodes and lengths of the oblique lines. Generally, the oblique lines of all row blocks are identical, since the horizontal overlaps of all intermediate patch rows form identically. Notably, the numbers of accumulation operations differ among Converging PEs; consequently, the accumulation stage lasts until all accumulation operations for the horizontal overlaps finish. Besides, we need not transfer the intermediate results of the Static PEs, since they are not connected by any oblique line. Finally, we obtain the horizontally overlapped results in the Converging PEs of each row block.
Before presenting the vertical overlaps, we first define the Interface PEs, which can interact with the peripheral memory and are plotted in grey in Figure 4(a). The Interface PEs mainly consist of the Converging PEs and Static PEs in the top \(S\) row blocks, and their total number is \(((I_{W}-1) \cdot S + K) \cdot S\).
We now explain how the vertical overlaps are calculated and, along with them, the computation process of the PEs in this phase. For Converging PEs that are the terminal nodes of both kinds of lines, we accumulate register \(R0\) with \(R2\) and store the sum in \(R3\); this addition calculates the vertical overlaps, and \(R2\) caches the vertically overlapped intermediate results transferred along the vertical lines, as explained in the next section. Then we accumulate \(R3\) with \(R1\) and store the result in \(R3\); this addition corresponds to the accumulation \({R3}^{t}_{x,y} \leftarrow {R1}^{t}_{x,y} + {R2}^{t}_{x,y} + {R3}^{t}_{x,y}\) presented in Equation (10). The hybrid purple pixels are calculated by these two additions. For Converging PEs that are the terminal nodes of only one kind of line, we simply perform the corresponding addition. Moreover, for Static PEs belonging to the Interface PEs, when they are the terminal node of a vertical line, they accumulate register \(R0\) with \(R2\), producing the accumulated vertical overlaps; when they are not connected by any vertical line, their intermediate results can be stored directly to on-chip memory.
\begin{align} {R3}^{t}_{x,y} & \leftarrow {R1}^{t}_{x,y} + {R2}^{t}_{x,y} + {R3}^{t}_{x,y} & \text{for}\ 0 \le t \lt K. \end{align}
(10)
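To make the oblique-line accumulation concrete, the sketch below enumerates, under our reading of the constraints in Equation (9) (every source stays within the same \(K\)-row block as the destination and cannot fall left of column 0), which \(R0\) registers are accumulated into the \(R3\) of a given PE:

```python
from math import ceil

K, S = 3, 2  # kernel size and stride of Figure 5

def oblique_sources(x, y):
    """R0 registers accumulated into R3 of PE(x, y) along the oblique line (Eq. (9)).

    x is the array column (intermediate-patch index), y the array row
    (patch-element index); y % K is the local row within the row block.
    """
    return [(x - i, y + i * S)
            for i in range(ceil(K / S))
            if x - i >= 0 and (y % K) + i * S < K]

# PE(1, 0) accumulates patch 1's element 0 with patch 0's element 2,
# i.e., the horizontal overlap I^1_0 + I^0_2 of Figure 5.
print(oblique_sources(1, 0))  # [(1, 0), (0, 2)]
print(oblique_sources(1, 3))  # [(1, 3), (0, 5)]
print(oblique_sources(0, 1))  # [(0, 1)]: only its own R0, nothing flows in
```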

3.4.2 Transfer Stage.

In this stage, we explain the process of transferring the vertically overlapped intermediate results. Contrary to the horizontal overlaps, the vertical overlaps are formed by vertically adjoining intermediate patches, which are generated by vertically adjacent input feature pixels. As shown in Figure 7, two intermediate patches \({I}^{0}\) and \({I}^{1}\) are produced by the input pixels \(a(p,q)\) and \(a(p+1,q)\), respectively. The elements \({I}^{0}_{6}, {I}^{0}_{7}, {I}^{0}_{8}\) overlap with \({I}^{1}_{0}, {I}^{1}_{1}, {I}^{1}_{2}\), respectively; in general, the first \(K-S\) rows of \({I}^{1}\) totally overlap with the last \(K-S\) rows of \({I}^{0}\). As intermediate patches are flattened along array columns in the intermediate-centric dataflow, \({I}^{0}\) and \({I}^{1}\) are produced within the same column in time steps \(t_{1}\) and \(t_{2}\), respectively. While the elements \({I}^{1}_{0}, {I}^{1}_{1}, {I}^{1}_{2}\) are generated in the 0th row block, the elements \({I}^{0}_{6}, {I}^{0}_{7}, {I}^{0}_{8}\) reside in the \(S\)th row block and need to move upward for further accumulation.
Fig. 7. (a) Two vertically adjoining intermediate patches when K = 3, S = 2. (b) The vertically overlapped intermediate results are generated within one array column in different time steps.
Before the transfer stage, the horizontal overlaps have already been calculated and stored in the Converging PEs. Then, the accumulated horizontal overlaps of the Converging PEs and the intermediate results of the other PEs are transferred along the vertical lines from \({PE}_{x,y}\) to \({PE}_{x,y - K \cdot S}\) and stored in registers \(R2\). The vertical lines form the other part of the output-reuse path. Similar to the oblique lines, the end nodes and lengths of the vertical lines depend on the parameters \(K\) and \(S\), as shown in Figure 8. To avoid degrading the frequency owing to long wires, we leverage the registers \(R3\) of the PEs that the vertical lines pass through to route the intermediate elements.
Fig. 8. Shifting the mapped vertically adjoining intermediate patches upward along the vertical line within an array column when (a) K = 3, S = 1 and (b) K = 3, S = 2.
We can then generalize the transfer stage as \({R2}^{t}_{x,y} \leftarrow {R2}^{t+S}_{x, y + K \cdot S}\) in Equation (11), where \(R2\) denotes the value of register \(R2\) of an Interface PE. When the \(p\)th row of the input feature map is fed into the array, the final result in \({R2}^{t}_{x,y}\) is attained by moving upward the overlapped intermediate elements produced by prior rows in the downstream PEs.
\begin{align} {R2}^{t}_{x,y} & \leftarrow {R2}^{t+S}_{x, y + K \cdot S} & \text{for}\ t+S \lt K. \end{align}
(11)
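Correspondingly, a small sketch of Equation (11) (again just our enumeration of the indices, with \(t = \lfloor y / K \rfloor\) the row block of the destination) gives the source PE whose result moves up the vertical line into a given PE:

```python
K, S = 3, 2  # kernel size and stride of Figure 7

def vertical_source(x, y):
    """Source PE whose R2 value moves up the vertical line into PE(x, y) (Eq. (11)).

    A PE in row block t receives from the PE S row blocks below it in the
    same column, provided t + S < K; otherwise nothing moves in.
    """
    t = y // K  # row block of the destination
    return (x, y + K * S) if t + S < K else None

# With K = 3, S = 2, the top row block (rows 0..2) receives the bottom
# row block's results (rows 6..8): I^0_6..8 line up with I^1_0..2.
print(vertical_source(0, 0))  # (0, 6)
print(vertical_source(0, 1))  # (0, 7)
print(vertical_source(0, 3))  # None: t + S = 3 is not < K, so no incoming transfer
```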

3.5 Dataflow Schedule

We now explain how the two phases are scheduled to perform the transposed convolution on the accelerator architecture. For each row of the input feature maps, we multiply the input pixels by the weight kernels and generate the intermediate patches in the Flattening Phase. Then, the intermediate results move along the output-reuse path and are accumulated with others in the accumulation stage and transfer stage of the Overlapping Phase. With the output-reuse path and register \(R0\) of the PEs, the Flattening Phase can run concurrently with the other phase. In other words, we perform the accumulation stage and transfer stage of the Overlapping Phase for the \((p{-}1)\)-th row of the input feature maps while undertaking the Flattening Phase of the \(p\)th row, keeping the PEs busy and improving resource utilization.

In the current hardware design of the accelerator array, we implement a global FSM and inner FSMs inside the PEs to control the overall circuit. The global FSM contains several states, including S_G_IDLE, S_G_IMAGE_START, S_G_NEXT_ROW, S_G_PAUSE, S_G_IMAGE_END, and S_G_END; in this way, we control the feeding of the input feature maps and their rows into the accelerator array. Besides, inside the PEs, we implement inner FSMs, including a MAC-operation FSM and an accumulation FSM, to control the MAC operations on the input feature maps/weights and the overlap accumulation operations, respectively. As we devise a specific accelerator array for each transposed convolutional layer, we generate the bitstream and implement the specific hardware design on the FPGA board leveraging the EDA platform, instead of designing general instructions; designing an effective instruction set for a uniform hardware circuit is left for future work. Typically, across the accelerator arrays for specific transposed convolutional layers, the FSMs can be slightly different, as it takes a distinct number of cycles to finish the MAC operations between input feature elements and weights. Once the multipliers and accumulators generate the intermediate results, which are stored in the R0 registers, they immediately continue to perform MAC operations on new input data. The newly generated intermediate results are then sequentially added to R1 and R2 under the control of the inner FSMs, and the results are stored in R3.
The latency and throughput of each phase and stage are summarized in Table 2. For one row of the input feature maps, the latency of the Flattening Phase is \(K^{2} + I_{W} + I_{C}\). As we use register \(R3\) along the output-reuse path as a router, the latencies of the accumulation stage and transfer stage are \(2 \cdot (\lceil K/S \rceil - 1) \cdot (S+1)\) and \(2 \cdot K \cdot S\), respectively. Generally, the latency of the Flattening Phase is larger than the sum of the latencies of the accumulation stage and transfer stage, which means there is adequate time to transfer intermediate results along the output-reuse path and compute the final overlapped results.
Phase/Stage | Latency | Throughput (II)
Flattening phase | \(K^{2} + I_{W} + I_{C}\) | \(I_{C}\)
Accumulation stage in overlapping phase | \(2 \cdot (\lceil K/S \rceil - 1) \cdot (S+1)\) | \(2 \cdot (\lceil K/S \rceil - 1) \cdot (S+1)\)
Transfer stage in overlapping phase | \(2 \cdot K \cdot S\) | 1
Table 2. Latency and Throughput of these Phases and Stages
On the other hand, we adopt the Initiation Interval (\(II\)) to denote the throughput of these phases and stages. Typically, one row of the input feature map, consisting of \(I_{C}\) input channels, flows into the array each time. It takes \(I_{C}\) cycles to complete the Flattening Phase, after which we can feed the next row; correspondingly, the \({II}_{f}\) of the Flattening Phase is \(I_{C}\). In the accumulation stage, it typically takes \({Latency}_{o}\) cycles to finish the shift operations and generate the horizontal overlaps, after which we can execute the Overlapping Phase of the next round. Thus, \({II}_{o}\) equals \({Latency}_{o}\), i.e., \(2 \cdot (\lceil K/S \rceil - 1) \cdot (S+1)\). Regarding the transfer stage, \({II}_{s}\) equals 1, as the horizontally overlapped results can move upward immediately after the previous stage ends.
Observing Equations (9) and (11) of the accumulation stage and transfer stage, we find that an intermediate element \({I}_{x}\) can participate in several accumulation operations for the overlapped results; in other words, \({I}_{x}\) is reused across several accumulation processes in both stages. To support this intermediate element reuse, we introduce the output-reuse path in the array, whose topological structure is determined by Equations (9) and (11). We perform the backward-stencil computation by transferring and accumulating the associated intermediate results along the output-reuse path in a regular way while performing MAC operations on the input feature maps and weights.
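For reference, the expressions of Table 2 can be evaluated directly; the small helper below (illustrative only) also checks the pipelining condition discussed above, i.e., that the Flattening Phase latency covers the accumulation and transfer stages.

```python
from math import ceil

def phase_timing(K, S, I_W, I_C):
    """Latency/II expressions of Table 2 for one row of the input feature map."""
    flatten_lat = K * K + I_W + I_C
    accum_lat = 2 * (ceil(K / S) - 1) * (S + 1)
    transfer_lat = 2 * K * S
    return {
        "flatten": (flatten_lat, I_C),          # (latency, II)
        "accumulate": (accum_lat, accum_lat),
        "transfer": (transfer_lat, 1),
        "overlap_hidden": flatten_lat >= accum_lat + transfer_lat,
    }

# DCGAN 3rd layer: (K, S, I_W, I_C) = (5, 2, 16, 256)
print(phase_timing(5, 2, 16, 256))
# {'flatten': (297, 256), 'accumulate': (12, 12), 'transfer': (20, 1), 'overlap_hidden': True}
```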

4 Automatic Flow

4.1 Overview

Although the accelerator implementation of the intermediate-centric dataflow has been presented, it still takes considerable labor and time to design the accelerator for a target network under given resources. To shorten this tedious design process, we propose the automatic framework plotted in Figure 9 for automatic implementation of the accelerator array. Initially, the framework takes the input parameters, including the layer parameters and the FPGA device type, from which we obtain the resource budget of the FPGA device in the Parameter Analysis module. Then, we explore the design space in the Design Space Exploration module to obtain the optimal design parameters, leveraging the proposed analytical model to estimate the design parameters and expedite the exploration. The parameters \(N_{i}\) and \(N_{o}\) denote the numbers of cached input feature maps and output results, respectively, and \(O_{TC}\) denotes the parallel factor adopted within the PEs; these parameters together determine the resource cost of the array. Additionally, we assign the parameter \(F_{freq}\) a fixed value to help estimate the performance. In the Hardware Generator module, we realize the agile design of the accelerator array based on the design parameters and hardware templates. We devise parameterized PE templates in Chisel [1] for the design automation of the arrays [7], since the hardware design of the PE is fixed and the FSM/output-reuse path is regularly determined by the layer parameters \((K, S, I_{H}, I_{W}, I_{C}, O_{C})\). Besides, the associated buffers and global controller are generated, and the resulting accelerator array is implemented on the target device with the Xilinx Vitis platform. Thus, an end-to-end framework is devised for the automatic generation of accelerator architectures for transposed convolution. Leveraging the automatic framework, we can obtain the optimal accelerator within a given hardware resource budget.
\begin{equation} \begin{aligned}& \underset{N_i, N_o, O_{TC}}{\text{minimize}} \quad & \max (T_\mathit {comp}, T_\mathit {load}), \\ & \text{subject to} & {DSP}_{cost} \le \mathit {DSP}_\mathit {capacity} \\ & & {Mem}_{cost} \le \mathit {Mem}_\mathit {capacity} \end{aligned}. \end{equation}
(12)
Fig. 9. Overall flow of the automatic framework.

4.2 Design Space Exploration

The problem of accelerator design can be formulated as Equation (12). \(T_{comp}\) denotes the latency of computing one input feature map in the array, and \(T_{load}\) denotes the latency of loading one input feature map from off-chip storage. The objective is to minimize the maximum of \(T_{comp}\) and \(T_{load}\) without exceeding the on-chip resource limitations. To solve the problem of Equation (12), we further propose the related resource model and performance model as presented below.

4.2.1 Resource Model.

We propose the resource model to estimate the consumed on-chip memory and DSPs. Since we mainly utilize DSPs to perform MAC operations in the array, we estimate the DSP cost as \(O_{TC} \cdot I_{W} \cdot K_{W} \cdot K_{H}\), as shown in Equation (13): the array contains \(K_{H} \cdot K_{W} \cdot I_{W}\) PEs and each PE contains \(O_{TC}\) computation units, so the DSP cost is the product of these four parameters.
\begin{equation} \mathit {DSP}_\mathit {cost} = O_{TC} \cdot I_{W} \cdot K_{W} \cdot K_{H}. \end{equation}
(13)
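For instance, for the third transposed convolutional layer of DCGAN in Table 4, with \(O_{TC} = 10\), \(I_{W} = 16\), and \(K_{H} = K_{W} = 5\), Equation (13) gives \(\mathit{DSP}_\mathit{cost} = 10 \cdot 16 \cdot 5 \cdot 5 = 4{,}000\), which matches the DSP usage reported for that layer in Table 5.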
Typically, we utilize the Input Buffer and Weight Buffer to cache the input feature maps and weights of the transposed convolutional layer in a double-buffered way. Due to the limited capacity of on-chip memory, we generally cannot store all the input feature maps and weights on chip. We therefore fetch \(N_{i}\) input feature maps at a time from external storage while processing the cached input feature maps in the array. We estimate the volume of the Input Buffer by Equation (15), in which \(BW\) denotes the data bitwidth of the input elements; the size of one input feature map is \(I_{H} \cdot I_{W} \cdot I_{C}\), and we double the size of the \(N_{i}\) input feature maps since the Input Buffer is implemented in a double-buffered manner. Similarly, when the whole set of weights of a transposed convolutional layer cannot be stored on chip, we cache part of the weights, i.e., \(O_{TC} \cdot I_{C} \cdot K_{H} \cdot K_{W}\) elements, in the Weight Buffer each time, as depicted in Equation (16), and we double the capacity of the Weight Buffer for double-buffered access. We estimate the capacity of the Output Buffer in Equation (17), in which \(O_{W} \cdot O_{H} \cdot O_{TC}\) stands for the size of one output result, and we assume that \(N_{o}\) output results are cached in the Output Buffer.
\begin{align} \mathit {Mem}_\mathit {cost} & = \mathit {Mem}_\mathit {in} + \mathit {Mem}_\mathit {out} + \mathit {Mem}_\mathit {wt}, \end{align}
(14)
\begin{align} \mathit {Mem}_\mathit {in} & = 2 \cdot N_{i} \cdot (I_{W} \cdot I_{H} \cdot I_C) \cdot BW, \end{align}
(15)
\begin{align} \mathit {Mem}_\mathit {wt} & = 2 \cdot I_C \cdot O_{TC} \cdot K_H \cdot K_W \cdot BW, \end{align}
(16)
\begin{align} \mathit {Mem}_\mathit {out} & = 2 \cdot N_{o} \cdot (O_W \cdot O_H \cdot O_{TC}) \cdot BW. \end{align}
(17)

4.2.2 Performance Model.

We formulate \(T_{load}\) and \(T_{comp}\) to set up the performance model. \(T_{load}\) is formulated as Equation (18), in which the size of one input feature map is \(I_{W} \cdot I_{H} \cdot I_{C} \cdot \frac{BW}{8}\) bytes and \(Bandwidth\) is the theoretical bandwidth of the off-chip storage. Consequently, \(T_{load}\) measures the latency of transferring one input feature map from the external storage to the FPGA chip.
Considering the intermediate-centric dataflow, the Overlapping Phase (both the accumulation stage and the transfer stage) of one input feature row runs concurrently with the Flattening Phase of the next row, so \(T_{comp}\) is contributed mainly by the computation in the array, as shown in Equation (19). Typically, it takes \(K^{2} + I_{W} + I_{C}\) cycles for one row of the input feature map to flow through the array, as shown in Table 2; thus, computing one whole input feature map costs \((K^{2} + I_{W} + I_{C}) \cdot I_{H}\) cycles. Additionally, we assign \(O_{TC}\) as the parallel factor of the PEs, considering the limits of fan-out and on-chip resources, and we partition the whole transposed convolution into \(\lceil \frac{O_C}{O_{TC}} \rceil\) tiles to be executed on the array.
\begin{equation} T_\mathit {load} = \frac{I_H \cdot I_W \cdot I_C \cdot \frac{{BW}_{i}}{8}}{\mathit {Bandwidth}}, \end{equation}
(18)
\begin{equation} T_\mathit {comp} = \frac{ (K^{2} + I_{W} + I_C) \cdot I_H }{F_{freq}} \cdot \left\lceil \frac{O_C}{O_{TC}} \right\rceil . \end{equation}
(19)
Within the proposed resource model and performance model, the parameters \((N_{i}, N_{o}, O_{TC})\) form the design space of the accelerator, and we traverse this design space to obtain the optimal solution to the problem in Equation (12). Primarily, we prioritize applying tiling in the output channel dimension to satisfy the limited resource budget. Additionally, the number of cached input feature maps \(N_{i}\) is generally set equal to \(N_{o}\). Leveraging these optimization strategies, we can shrink the design space and explore it with guidance. By enumerating the possible values of these parameters, we obtain the optimal design options for a transposed convolution on a given FPGA platform.
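A minimal sketch of this exploration loop is shown below (Python, under our own simplifying assumptions: \(N_{i} = N_{o}\), tiling only along the output-channel dimension, a single bitwidth \(BW\), brute-force enumeration, and placeholder bandwidth/frequency values; routing and fan-out limits, which cap \(O_{TC}\) in practice, are not modeled). It minimizes \(\max(T_\mathit{comp}, T_\mathit{load})\) of Equation (12) subject to Equations (13)–(19).

```python
from math import ceil

def explore(K, S, I_H, I_W, I_C, O_C, BW=16,
            dsp_cap=6840, mem_cap_bits=75.9e6,       # Alveo U200 DSPs / BRAM bits (approx.)
            bandwidth=16e9, freq=200e6):             # placeholder off-chip bandwidth (B/s), clock (Hz)
    O_H, O_W = (I_H - 1) * S + K, (I_W - 1) * S + K
    best = None
    for O_TC in range(1, O_C + 1):                   # tiling factor in the output-channel dimension
        if O_TC * I_W * K * K > dsp_cap:             # Eq. (13): DSP cost
            continue
        for N in range(1, 65):                       # N_i = N_o; search range is arbitrary here
            mem = (2 * N * I_W * I_H * I_C * BW      # Eq. (15): Input Buffer
                   + 2 * I_C * O_TC * K * K * BW     # Eq. (16): Weight Buffer
                   + 2 * N * O_W * O_H * O_TC * BW)  # Eq. (17): Output Buffer
            if mem > mem_cap_bits:
                continue
            t_load = (I_H * I_W * I_C * BW / 8) / bandwidth               # Eq. (18)
            t_comp = (K * K + I_W + I_C) * I_H / freq * ceil(O_C / O_TC)  # Eq. (19)
            cost = max(t_comp, t_load)
            if best is None or cost < best[0]:
                best = (cost, N, N, O_TC)
    return best  # (objective, N_i, N_o, O_TC)

print(explore(5, 2, 16, 16, 256, 128))  # e.g., explore the DCGAN 3rd layer
```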

5 Experiments

5.1 Experimental Setup

We evaluate our intermediate-centric dataflow and the corresponding accelerator design on the Alveo U200 board, which contains 2,586 K logic cells, 6,840 DSPs, 75.9 Mb of block RAM (BRAM), and 270 Mb of UltraRAM (URAM). The hardware accelerator is synthesized with the Xilinx Vitis 2020.1 platform. We select several transposed convolutional layers from widely applied networks, i.e., FSRCNN, DCGAN, and FCN, as the benchmark to verify our intermediate-centric dataflow and accelerator design, as shown in Tables 3 and 4. The FSRCNN used in our experiments contains five layers, the last of which is a transposed convolutional layer that expands low-resolution images of the CIFAR-10 and Set5 [2] datasets according to different stride sizes, such as 2, 3, and 4. The DCGAN consists of four convolutional layers and four transposed convolutional layers, all with kernel size 5. Following [19], we run DCGAN on the LSUN dataset for scene modeling. The FCN has three transposed convolutional layers, whose kernel sizes are 4, 4, and 16, and we run it on the KITTI dataset [19]. We quantize the input feature elements to 16 bits and the weight elements to 8 bits. As the accelerator array is proposed for transposed convolution acceleration, we mainly focus on the transposed convolutional layers within these networks.
Layer | Convolutional Layers \((K, S, I_{H}, I_{W}, I_{C}, O_{C})\) | Transposed Convolutional Layers \((K, S, I_{H}, I_{W}, I_{C}, O_{C})\)
1st | \(({5, 2, 64, 64, 3, 128})\) | \(({5, 2, 4, 4, 1024, 512})\)
2nd | \(({5, 2, 32, 32, 128, 256})\) | \(({5, 2, 8, 8, 512, 256})\)
3rd | \(({5, 2, 16, 16, 256, 512})\) | \(({5, 2, 16, 16, 256, 128})\)
4th | \(({5, 2, 8, 8, 512, 1024})\) | \(({5, 2, 32, 32, 128, 3})\)
Table 3. Parameters of DCGAN Layers
Network | Layer \((K, S, I_{H}, I_{W}, I_{C}, O_{C})\) | Accelerator array \((K^{2}, I_{W}, O_{TC})\)
DCGAN | \(({5, 2, 16, 16, 256, 128})\) | \(({25, 16, 10})\)
FCN | \(({16, 2, 10, 10, 21, 21})\) | \(({256, 10, 2})\)
FSRCNN | \(({9, S, 32, 32, 32, 1})\) | \(({81, 32, 1})\)
Table 4. Samples of Several Transposed Convolutional Layers Including the 3rd Transposed Convolutional Layer within DCGAN, FCN, and FSRCNN

5.2 Results and Analysis on DCGAN

In recent years, DCGAN has made great progress in the field of generative adversarial networks, owing to its unique network structure. Generally, DCGAN consists of four convolutional layers and four transposed convolutional layers without any Fully Connected (FC) layers, and their parameters, as described in Section 2, are listed in Table 3.
Initially, we design a specific accelerator array for each transposed convolutional layer within DCGAN. For the first transposed convolutional layer, whose size is \(({5, 2, 4, 4, 1024, 512})\), the numbers of array rows and columns are assigned as \({K}^2\) and \({I}_{W}\), i.e., 25 and 4. Then, we assign \({O}_{TC}\) as 10 via the analytical model under the DSP resource limitation, considering routing feasibility. Consequently, we devise the corresponding accelerator array \((25, 4, 10)\) for the first transposed convolutional layer of DCGAN. To determine the sizes of the Input Buffer and Output Buffer, we figure out the parameters \(N_i\) and \(N_o\) by solving the problem formulated in Section 4. As the sizes of one input feature map and one output result are 256 Kb and 512 Kb, respectively, we set \(N_i\) and \(N_o\) to 4 from the results of the formulated problem under the memory constraints. Hence, the sizes of the Input Buffer and Output Buffer are assigned as 1 Mb and 2 Mb, respectively, and the Weight Buffer size equals 4 Mb with \(O_{TC}\) equal to 10. The performance of the devised accelerator array reaches 1101.34 GOPS when implemented on the U200 platform at a 200 MHz frequency, as shown in Table 5.
Layers | 1st layer | 2nd layer | 3rd layer | 4th layer | Uniform Array
DSP | 1,000 | 2,000 | 4,000 | 2,400 | 2,000
LUT | 46 K | 72 K | 157 K | 101 K | 72 K
FF | 51 K | 87 K | 132 K | 92 K | 87 K
BRAM | 7 Mb | 5 Mb | 7 Mb | 4.5 Mb | 5 Mb
Perf. (GOPS) | 1101.34 | 2012.71 | 3920.24 | 2520.66 | 2012.71
Perf. (GOPS/DSP) | 1.10 | 1.06 | 0.98 | 1.05 | 1.06
Table 5. Experiments of Transposed Convolutional Layers within DCGAN, in which the Input Feature Elements and Weights are Quantized to 16-bit and 8-bit, Under 200 MHz
Then, we present the accelerator array for the second transposed convolutional layer of DCGAN, whose layer size is \(({5, 2, 8, 8, 512, 256})\). The numbers of array rows and columns are set as 25 and 8, respectively, and we again assign the parallel factor \(O_{TC}\) as 10 for routing. In this way, we obtain the corresponding accelerator array for the second transposed convolutional layer. As the sizes of the input feature map and output result are 512 Kb and 1 Mb, respectively, we set \(N_i\) and \(N_o\) by the design space exploration, and implement the Input Buffer, Output Buffer, and Weight Buffer with sizes of 1 Mb, 2 Mb, and 2 Mb, respectively, on the U200 platform. The performance of the devised accelerator array reaches 2012.71 GOPS.
For the third layer \(({5, 2, 16, 16, 256, 128})\) within DCGAN, we generate the corresponding accelerator array leveraging the analytical model. As the kernel size \(K\) equals 5, the input width \(I_{W}\) equals 16, and the output channel count \(O_{C}\) reaches 128, it is necessary to adopt tiling in the output channel dimension, i.e., \(O_{TC}\). Due to the resource constraints and fan-out limits, we assign \(O_{TC}\) as 10 with the help of the analytical model. We still load the input feature maps and weights from off-chip memory in a double-buffered manner to hide the long-latency memory accesses. Since \(T_{comp}\) is larger than \(T_{load}\) when \(O_{TC}\) equals 10, we assign \(N_{i}\) and \(N_{o}\) as 2 to satisfy the resource limitation via the analytical model. We iteratively load the rows of input feature maps to multiply the weights concurrently across the output channels, and we implement the Input Buffer, Output Buffer, and Weight Buffer, whose sizes are 1 Mb, 2 Mb, and 1 Mb, respectively, around the accelerator array. Consuming 4,000 DSPs running at 200 MHz, we achieve 3920.24 GOPS by implementing the accelerator array on the U200 platform, as listed in Table 5.
For the fourth layer \(({5, 2, 32, 32, 128, 3})\), we propose an accelerator array that contains 25 rows and 32 columns. Since the number of output channels equals 3, we assign the parallel factor of the PEs the same value, without partitioning the \(O_{C}\) dimension. The sizes of the input feature maps and output results of this layer are 2 Mb and 192 Kb, respectively. By solving the formulated problem, we set \(N_{i}\) and \(N_{o}\) as 2 and implement a 4 Mb Input Buffer and a 0.4 Mb Output Buffer, respectively. Besides, we deploy a Weight Buffer whose size equals 154 Kb. Deploying the designed accelerator array on the target platform, we obtain an acceleration performance of 2520.66 GOPS.
As can be seen, we first select the transposed convolutional layers within DCGAN to demonstrate the effectiveness of the accelerator architecture. Since the kernel size and stride size of the four layers are identical, we only need to focus on the output channel dimension, to which we apply the tiling method. By solving the min–max problem proposed in the design space exploration, we can devise a specific accelerator array with an appropriate tiling factor and buffer sizes. Benefiting from the systolic structure, the devised array can run at a much higher frequency; however, we keep the frequency identical to that adopted in the next section for a fair comparison. Moreover, we can also implement several accelerator arrays concurrently on the selected platform under the resource constraints and map different tiles, split along the output channel dimension, to these arrays. In this way, we can fully utilize the available resources to reap higher performance.
Then, to demonstrate the reusability of the proposed accelerator array, we devise one uniform array onto which all these transposed convolutional layers can be mapped. The size of the uniform array is set as \((25, 8, 10)\), as the transposed convolutional layers have the same kernel size. This means that we apply tiling in the output channel dimension of these layers with \(O_{TC}\) equal to 10. Additionally, we also need to apply tiling in the input width dimension of the 3rd and 4th transposed convolutional layers, i.e., the tiling size of the input width dimension \(I_{TW}\) equals 8. The intermediate results obtained from the tiles split along the input width dimension are then accumulated to form the final results. By applying tiling in the output channel and input width dimensions, we can execute all these transposed convolutional layers on the uniform accelerator array.

5.3 Results and Comparison with Baseline Design

We first re-implement the basic accelerator design for transposed convolution [18], which is similar to the methods in [15, 20], on the Xilinx Alveo U200 platform. Typically, it maps the output channel dimension and input width dimension to the array and leverages on-chip buffers beside the PEs to perform the overlap-and-accumulate operations on the intermediate patches. With an array size of \(16 \times 16\), it occupies 256 DSPs and achieves 0.32 GOPS/DSP, which is relatively lower than other works. This is mainly because the generation of the intermediate patches and the subsequent backward-stencil operation execute sequentially: only after the backward-stencil operations on the intermediate patches currently stored in the on-chip buffers have finished can the next row of input feature maps be fed in to generate new patches to be written into the on-chip buffers.

5.4 Results and Comparison with Previous Works

In this section, we investigate the effects of the kernel size and stride size on the acceleration performance. We select the third transposed convolutional layer of FCN and the last layer of FSRCNN, which have different kernel sizes. Specifically, we assign the stride size of the last layer of FSRCNN distinct values to figure out the effect of the stride size on the accelerator array. We also select the 3rd transposed convolutional layer within DCGAN. The related parameters of the three layers are presented in Table 4.
Benefiting from the reconfigurability of the FPGA, we can re-implement the specific accelerator array for FCN. For the third transposed convolutional layer \(({16, 2, 10, 10, 21, 21})\) in FCN, as its kernel size \(K\) and input width \(I_{W}\) are 16 and 10, respectively, we need to apply tiling in the output channel dimension owing to the resource limitation. Leveraging the resource model devised above, we assign \(O_{TC}\) as 2 without exceeding the DSP capacity and consequently devise the accelerator array \((256, 10, 2)\). We can load the weights of this layer, 0.9 Mb in total, into the on-chip Weight Buffer. Using the analytical model, we assign \(N_{i}\) and \(N_{o}\) as 200 while satisfying the memory limitation. Then, we iteratively multiply the rows of the feature maps with the corresponding weights. In this way, we achieve 4761.64 GOPS at the cost of 5,120 DSPs. For the first transposed convolutional layer \((4, 2, 1, 1, 21, 21)\) of FCN, we design the corresponding accelerator array \((16, 1, 10)\) by applying tiling in the output channel dimension. According to the proposed design space exploration method, we store the 6.9 Kb of weights on chip and set \(N_{i}/N_{o}\) as 1,000 under the memory limitation, obtaining 184.20 GOPS with 160 DSPs. Similarly, for the second transposed convolutional layer \((4, 2, 4, 4, 21, 21)\) of FCN, we implement the array \((16, 4, 10)\), store the 6.9 Kb of weights in on-chip memory, and assign \(N_{i}/N_{o}\) as 500. Implementing this array on the FPGA board, we obtain 742.40 GOPS at the cost of 640 DSPs.
For the target layer \(({9, S, 32, 32, 32, 1})\) of FSRCNN on the CIFAR-10 dataset, we devise the corresponding accelerator array with 81 rows and 32 columns. As the output channel count \(O_{C}\) is 1, we need not apply tiling. The weight volume of this layer is relatively small, and we can store these weights in on-chip memory. We assign \(N_{i}\) and \(N_{o}\) as 4 while satisfying the memory constraint and computation needs. When handling this layer, we set the stride size \(S\) to different values, such as 2, 3, and 4. As the experimental results demonstrate, the distinct stride sizes have no obvious effect on the overall performance, in contrast to the previous works [5, 12]. This is because the Flattening Phase runs concurrently with the Overlapping Phase, and since the kernel size is larger than these stride sizes, the Flattening Phase takes more cycles than the Overlapping Phase, hiding the latter's latency.
On the other hand, we additionally execute FSRCNN on the Set5 dataset, where the images are larger than those of the CIFAR-10 dataset. For larger images, we typically adopt tiling in the output channel and input width dimensions, and the adjacent tile outputs have \(K-S\) overlapping elements, which should be accumulated to generate the final results. When taking an image of size \(256 \times 256\) as the input, the parameters of the transposed convolutional layer are \((9, S, 256, 256, 32, 1)\). Instead of applying tiling in the output channel dimension, whose size is 1, we apply tiling in the input width dimension to execute the layer. By assigning \(I_{TW}\) as 32, we implement the accelerator array of size \((81, 32, 1)\) to execute the transposed convolution tiles. As each tile output overlaps with the adjacent one by \(K-S\) elements, we only need to accumulate these overlapping elements to obtain the final results. Generally, we can store the weights on chip and set \(N_i/N_o\) as 2. Finally, we achieve nearly identical performance to that on the CIFAR-10 dataset.
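The tile-stitching step can be sketched as follows. The NumPy layout and the function stitch_tiles are illustrative assumptions, but the overlap of \(K-S\) columns between adjacent tile outputs follows directly from the per-tile output width \((I_{TW}-1)S+K\).

```python
# Minimal sketch of stitching adjacent input-width tiles: each tile produces an output
# strip of width (I_TW - 1) * S + K, and neighbouring strips overlap by K - S columns,
# which are summed. Shapes and names are illustrative assumptions.
import numpy as np

def stitch_tiles(tiles, K, S, I_TW):
    tile_out_w = (I_TW - 1) * S + K          # output width produced by one tile
    step = I_TW * S                           # horizontal offset between tile outputs
    total_w = step * (len(tiles) - 1) + tile_out_w
    out = np.zeros((tiles[0].shape[0], total_w))
    for idx, t in enumerate(tiles):
        out[:, idx * step: idx * step + tile_out_w] += t   # K - S wide overlaps accumulate
    return out

if __name__ == "__main__":
    K, S, I_TW = 9, 2, 32
    tiles = [np.ones((4, (I_TW - 1) * S + K)) for _ in range(8)]   # 256 / 32 = 8 tiles wide
    print(stitch_tiles(tiles, K, S, I_TW).shape)   # (4, 519): (256 - 1) * 2 + 9 columns
```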
The works [5, 12], both of which transform the transposed convolution into multiple convolution operations off-line, exhibit performance that is sensitive to the stride size when accelerating FSRCNN. The proposed intermediate-centric dataflow, in contrast, reaps steady acceleration performance and avoids the enormous transformation overhead. Furthermore, by avoiding the off-line transformation overhead and frequent communication between host/client devices, the intermediate-centric dataflow is more efficient for accelerating the transposed convolution of back propagation when performing on-device training in federated learning applications [18]. The work [12] achieves higher DSP performance because it adopts the FIR algorithm to perform the transformed convolution kernels, so that the multiplication operations can be reduced (see Table 6). However, it demands iteratively transforming the kernel in software, which impedes its application to accelerating the error propagation of the training process. Moreover, it directly omits the padded zeros surrounding the input feature maps after converting the transposed convolution, which can also degrade the result quality. The proposed intermediate-centric dataflow directly performs the backward-stencil computation and speeds up the transposed convolution on-the-fly without any approximation, while keeping reasonable acceleration performance. On the other hand, by solving the formulated problem of Section 4 in these experiments, we find that \(T_{comp}\) is generally larger than \(T_{load}\), which means that the accelerator array is mainly computation-bound for the selected transposed convolutional layers. Consequently, by effectively leveraging the DSP resources at high frequency, the accelerator architecture can attain better performance as the computation resources increase.
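The compute-bound observation can be reproduced with a back-of-the-envelope check; the memory bandwidth and byte counts below are rough assumptions for illustration, not measured values.

```python
# Sketch of the T_comp vs. T_load check: a batch of N_i inputs is compute-bound when the
# MAC time on the array exceeds the time to stream the (reused) weights plus the inputs.

def compute_vs_load(macs_per_image, n_i, pe_count, freq_hz,
                    weight_bytes, input_bytes_per_image, bw_bytes_per_s):
    t_comp = n_i * macs_per_image / (pe_count * freq_hz)          # one MAC per PE per cycle
    t_load = (weight_bytes + n_i * input_bytes_per_image) / bw_bytes_per_s
    return t_comp, t_load

if __name__ == "__main__":
    # Rough numbers for the third FCN layer on the (256, 10, 2) array at 200 MHz, N_i = 200.
    t_comp, t_load = compute_vs_load(
        macs_per_image=16 * 16 * 10 * 10 * 21 * 21,   # K^2 * I_H * I_W * I_C * O_C
        n_i=200, pe_count=5120, freq_hz=200e6,
        weight_bytes=int(0.9e6 / 8),                  # 0.9 Mb of weights, loaded once
        input_bytes_per_image=10 * 10 * 21 * 2,       # 16-bit input feature maps (assumed)
        bw_bytes_per_s=10e9)                          # assumed off-chip bandwidth
    print(f"T_comp = {t_comp * 1e3:.2f} ms, T_load = {t_load * 1e3:.2f} ms")  # compute-bound
```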
Table 6. Comparison with State-of-the-Art Accelerators

Work | Year | DNN Model | FPGA | Frequency (MHz) | Precision | DSP Usage | LUTs Usage | FFs Usage | BRAM Usage | URAM Usage | Perf. (GOPS) | Perf. (GOPS/DSP)
[19] | 2018 | DCGAN | Alveo U200 | 200 | 16-8 bit | 256 | 107 K | 94 K | 9.38% | 0 | 81.52 | 0.32
[10] | 2018 | FCN | Zynq XCZC7045 | 200 | 16 bit | 640 | 85.68 K | 110.64 K | 67% | N/A | 107 | 0.17
[5] | 2020 | FSRCNN | Kintex-7 XC7K410T | 130 | 13 bit | 1,512 | 167 K | 158 K | 24% | N/A | 780 (s=2), 1,576.3 (s=3), 2,691 (s=4) | 0.51 (s=2), 1.04 (s=3), 1.77 (s=4)
[12] | 2020 | FSRCNN | Virtex-UltraScale VCU108 | 200 | 16-8 bit | 576 | 42 K | 20 K | 10.95% | N/A | 605.56 (s=2), 1,086.07 (s=3), 1,868.78 (s=4) | 1.05 (s=2), 1.88 (s=3), 3.24 (s=4)
[12] | 2020 | DCGAN | Virtex-UltraScale VCU108 | 200 | 16-8 bit | 576 | 42 K | 20 K | 10.95% | N/A | 823.8 | 1.43
Ours | 2022 | FSRCNN | Alveo U200 | 200 | 16-8 bit | 2,592 | 96 K | 82 K | 24.5% | 0 | 2,721.64 (s=2), 2,617.92 (s=3), 2,643.84 (s=4) | 1.06 (s=2), 1.01 (s=3), 1.02 (s=4)
Ours | 2022 | DCGAN | Alveo U200 | 200 | 16-8 bit | 4,000 | 157 K | 132 K | 9.22% | 0 | 3,920.24 | 0.98
Ours | 2022 | FCN | Alveo U200 | 200 | 16-8 bit | 5,120 | 181 K | 151 K | 16.4% | 0 | 4,761.64 | 0.93
Furthermore, we discuss the scalability of the intermediate-centric dataflow and the hardware architecture. When we implement the accelerator array on the FPGA board, the scalability can be bounded by several factors: 1) The DSPs are distributed in columns in the FPGA chips; for example, the 6,840 DSPs are distributed in 32 columns in the U200 board. Owing to this column-wise DSP distribution, the layout of a square array is generally distorted after placement and routing by the EDA tools. As the proposed accelerator arrays are designed in a rectangular shape, we can still keep the shape almost unchanged after placement and routing by leveraging the optimization methods mentioned in the work [25]. 2) The chips of state-of-the-art FPGA boards generally contain several dies, each of which holds a portion of the resources. When the accelerator array grows larger, it may be implemented across dies, which can limit the frequency. As the proposed dataflow itself works efficiently, these limitations are mainly caused by the physical design of the FPGA devices. On the other hand, given that the Google TPUs [9] contain \(512 \times 512\) and \(256 \times 256\) matrix arrays, our dataflow can probably avoid such resource limitations with ASIC technology.
As we aim at proposing the intermediate-centric dataflow to accelerate the transposed convolution and perform its backward-stencil computation, we directly adopt the same commonly used data types as the works [19, 20]. The possible effects of quantization on accuracy have been presented in the work [20].

6 Related Work

Recently, researchers have proposed several approaches to accelerate transposed convolutional layers. Primarily, the transposed convolution can be converted to convolution by inserting and padding zeros in the input feature maps [19]. However, the obtained convolution can lead to 75% useless computation due to the zero elements [19]. To skip the inserted zeros and improve the resource utilization, FlexiGAN [21] and GANAX [22] implement distributed on-chip buffers in the PE array and a dedicated compiler to schedule the acceleration process in a MIMD-SIMD manner, which introduces non-trivial von Neumann overhead and resource cost. After converting the transposed convolution to convolution by inserting zero elements, F-DNA [12] adopts the Deconvolution Transforming (DT) method to iteratively partition the converted feature maps and generate approximately uniform convolution kernels, avoiding the computation on zero elements and the workload imbalance. However, this method mainly deals with the zeros inserted between input feature pixels, without considering the padded zeros surrounding the input feature maps, leading to information loss at the edges of the input feature maps. To avoid calculating the zero elements, Uni-OPU [23] transforms the transposed convolution into the same computation paradigm as convolution by leveraging an auxiliary kernel, which introduces considerable transformation overhead and extra computation.
On the other hand, the work [5] devises the Transforming Deconvolution to Convolution (TDC) method to transform the transposed convolution into several convolution kernels by inverse coefficient mapping. This method introduces considerable transformation overhead, and different stride sizes seriously affect the final performance. Additionally, GNA [20] and FCN-Engine [19] propose to implement transposed convolution in the PE array with similar mapping strategies. They both implement on-chip buffers beside the PEs to store the generated intermediate elements, so the memory cost grows rapidly as the accelerator array enlarges, and there remains large room to improve the computation parallelism and efficiency for higher performance.

7 Conclusion and Future Extensions

In this article, we propose the intermediate-centric dataflow to perform the backward-stencil computation of transposed convolution. The intermediate-centric dataflow partitions the transposed convolution into two pipelining phases to decouple the generation of intermediate patches from the further processing of these patches, accelerating the transposed convolution with reasonable performance. To demonstrate the effectiveness of the proposed intermediate-centric dataflow on different kinds of transposed convolutional layers, we separately explore the design space and implement the corresponding accelerator arrays for the selected layers. Then, we can implement a uniform accelerator array to support these transposed convolutional layers and reuse the hardware resources [4]; the size of the uniform accelerator array can be determined according to the target convolutional networks under the resource budget. Besides, to map these layers onto the uniform array, we can additionally apply tiling in the input width and kernel width/height dimensions, and propose an automatic tiling strategy to implement the transposed convolutional layers on target platforms with different resource budgets. In this case, each input tile can be handled in the same way proposed in this article. Additionally, as the proposed intermediate-centric dataflow and the weight-stationary dataflow of convolutional layers have similar mapping schemes, both of which flatten the kernel height and kernel width dimensions, we can further devise a unified architecture supporting both layer types to accelerate the inference of generative networks and the on-device training of CNNs.

Footnotes

1. As \(K_{W}\) generally equals \(K_{H}\), we denote them as \(K\) in this article.
2. To simplify the narrative, we assume a cropping size of 0 in this article.

References

[1]
Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing hardware in a scala embedded language. In Proceedings of the Design Automation Conference.
[2]
Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. 2012. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference.
[3]
Uday Bondhugula. 2013. Compiling affine loop nests for distributed-memory parallel architectures. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1–12.
[4]
Long Chung Chan, Gurshaant Malik, and Nachiket Kapre. 2019. Partitioning FPGA-optimized systolic arrays for fun and profit. In Proceedings of the International Conference on Field-Programmable Technology.
[5]
Jung-Woo Chang, Keon-Woo Kang, and Suk-Ju Kang. 2020. An energy-efficient FPGA-based deconvolutional neural networks accelerator for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology 30, 1 (2020), 281–295.
[6]
Chao Dong, Chen Change Loy, and Xiaoou Tang. 2016. Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision.
[7]
Yijin Guan, Liang Hao, Ning Xu, Wenqing Wang, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In Proceedings of the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines.
[8]
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision.
[9]
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture.
[10]
Shuanglong Liu, Hongxiang Fan, Xinyu Niu, Ho-cheung Ng, Yang Chu, and Wayne Luk. 2018. Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA. ACM Transactions on Reconfigurable Technology and Systems 11, 3 (2018), 1–22.
[11]
Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.
[12]
Wendong Mao, Jun Lin, and Zhongfeng Wang. 2020. F-DNA: Fast convolution architecture for deconvolutional network acceleration. IEEE Transactions on Very Large Scale Integration Systems 28, 8 (2020), 1867–1880.
[13]
Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the International Symposium on Field-Programmable Gate Arrays.
[14]
Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 Retrieved from https://arxiv.org/abs/1511.06434.
[15]
Deguang Wang, Junzhong Shen, Mei Wen, and Chunyuan Zhang. 2019. Towards a uniform architecture for the efficient implementation of 2D and 3D deconvolutional neural networks on FPGAs. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems. 1–5.
[16]
Jie Wang, Licheng Guo, and Jason Cong. 2021. AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA. In Proceedings of the International Symposium on Field-Programmable Gate Arrays.
[17]
Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the Design Automation Conference.
[18]
Qi Xia, Winson Ye, Zeyi Tao, Jindi Wu, and Qun Li. 2021. A survey of federated learning for edge computing: Research problems and solutions. High-Confidence Computing 1, 1 (2021), 100008.
[19]
Dawen Xu, Kaijie Tu, Ying Wang, Cheng Liu, Bingsheng He, and Huawei Li. 2018. FCN-engine: Accelerating deconvolutional layers in classic CNN processors. In Proceedings of the International Conference on Computer-Aided Design.
[20]
Jiale Yan, Shouyi Yin, Fengbin Tu, Leibo Liu, and Shaojun Wei. 2018. GNA: Reconfigurable and efficient architecture for generative network acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2519–2529.
[21]
Amir Yazdanbakhsh, Michael Brzozowski, Behnam Khaleghi, Soroush Ghodrati, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. 2018. FlexiGAN: An end-to-end solution for FPGA acceleration of generative adversarial networks. In Proceedings of the International Symposium on Field-Programmable Custom Computing Machines.
[22]
Amir Yazdanbakhsh, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. 2018. GANAX: A unified MIMD-SIMD acceleration for generative adversarial networks. In Proceedings of the International Symposium on Computer Architecture.
[23]
Yunxuan Yu, Tiandong Zhao, Mingyu Wang, Kun Wang, and Lei He. 2020. Uni-OPU: An FPGA-based uniform accelerator for convolutional and transposed convolutional networks. IEEE Transactions on Very Large Scale Integration Systems 28, 7 (2020), 1545–1556.
[24]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the International Symposium on Field-Programmable Gate Arrays.
[25]
Jiaxi Zhang, Wentai Zhang, Guojie Luo, Xuechao Wei, Yun Liang, and Jason Cong. 2019. Frequency improvement of systolic array-based CNNs on FPGAs. In Proceedings of the International Symposium on Circuits and Systems.
[26]
Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. 2018. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design.

