5.2 Results and Analysis on DCGAN
In recent years, DCGAN has made great progress in the field of generative adversarial networks owing to its unique network structure. Generally, DCGAN consists of four convolutional layers and four transposed convolutional layers without any Fully Connected (FC) layers, and their parameters, as described in Section 2, are listed in Table 3.
Initially, we design a specific accelerator array for each layer within DCGAN. For the first transposed convolutional layer, whose size equals \(({5, 2, 4, 4, 1024, 512})\), the array rows and columns can be assigned as \({K}^2\) and \({I}_{W}\), i.e., 25 and 4. Then, we can assign \({O}_{TC}\) as 10 via the analytical model under the DSP resource limitation, considering the routing feasibility. Consequently, we can devise the corresponding accelerator array \((25, 4, 10)\) for the first transposed convolutional layer of DCGAN. To determine the sizes of the Input Buffer and Output Buffer, we then figure out the parameters \(N_i\) and \(N_o\) by solving the problem formulated in Section 4. Since the sizes of one input feature map and one output result are 256 Kb and 512 Kb, we can set \(N_i\) and \(N_o\) as 4 from the results of the formulated problem under the memory constraints. Hence, the sizes of the Input Buffer and Output Buffer can be assigned as 1 Mb and 2 Mb, respectively, and the Weight Buffer size equals 4 Mb with \(O_{TC}\) equaling 10. The performance of the devised accelerator array reaches 1101.34 GOPS when we implement it on the U200 platform at 200 MHz, as shown in Table 5.
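To make the buffer-sizing arithmetic above concrete, the following sketch recomputes the reported sizes from the layer tuple \((K, S, I_W, I_H, I_C, O_C)\); the 16-bit operand width and the output width \(S \cdot I_W\) are assumptions made here for illustration, not values stated in the text.

```python
# Minimal sketch (not the paper's analytical model): buffer sizes and DSP count
# for an accelerator array (K^2, I_W, O_TC), assuming 16-bit operands and an
# output width of S * I_W for the transposed convolution.
BITS = 16  # assumed operand width

def buffer_sizes(K, S, I_W, I_H, I_C, O_C, O_TC, N_i, N_o):
    in_map = I_W * I_H * I_C * BITS                  # one input feature map (bits)
    out_map = (S * I_W) * (S * I_H) * O_C * BITS     # one output result (bits)
    return {
        "input_buffer": N_i * in_map,                # bits
        "output_buffer": N_o * out_map,              # bits
        "weight_buffer": K * K * I_C * O_TC * BITS,  # bits
        "dsp_count": K * K * I_W * O_TC,             # rows x columns x parallel factor
    }

# First transposed convolutional layer of DCGAN: (5, 2, 4, 4, 1024, 512)
sizes = buffer_sizes(5, 2, 4, 4, 1024, 512, O_TC=10, N_i=4, N_o=4)
# -> roughly the 1 Mb / 2 Mb / ~4 Mb buffers reported above, and 1000 DSPs
```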
Then, we present the accelerator array for the second transposed convolutional layer of DCGAN, whose layer size is \(({5, 2, 8, 8, 512, 256})\). Primarily, the array row and column sizes can be set as 25 and 8, respectively. Similarly, we can assign the parallel factor \(O_{TC}\) as 10 for routing. In this way, we obtain the corresponding accelerator array for the second transposed convolutional layer. As the sizes of the input feature map and output result are 512 Kb and 1 Mb, respectively, we also set \(N_i\) and \(N_o\) by the design space exploration. We implement the Input Buffer, Output Buffer, and Weight Buffer, whose sizes equal 1 Mb, 2 Mb, and 2 Mb, respectively, on the U200 platform. The performance of the devised accelerator array reaches 2012.71 GOPS.
As for the third layer \(({5, 2, 16, 16, 256, 128})\) within DCGAN, we can generate the corresponding accelerator array leveraging the analytical model. The kernel size \(K\) equals 5 and the input width \(I_{W}\) equals 16, while the output channel \(O_{C}\) reaches 128, so it is necessary to adopt tiling in the output channel dimension, i.e., \(O_{TC}\). Due to the resource constraints and fan-out limits, we assign \(O_{TC}\) as 10 with the help of the analytical model. Then, we still load the input feature maps and weights from off-chip memory in a double-buffer manner to hide the long-latency memory access. Since \(T_{comp}\) can be larger than \(T_{load}\) when \(O_{TC}\) equals 10, we can assign \(N_{i}\) and \(N_{o}\) as 2 to satisfy the resource limitation via the analytical model. We then iteratively load the rows of input feature maps and multiply them with weights concurrently across the output channels. We implement the Input Buffer, Output Buffer, and Weight Buffer, whose sizes equal 1 Mb, 2 Mb, and 1 Mb, respectively, around the accelerator array. By consuming 4,000 DSPs running at 200 MHz, we achieve 3920.24 GOPS when implementing the accelerator array on the U200 platform, as listed in Table 5.
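As a brief illustration of the double-buffer loading mentioned above, the simplified timing model below shows why the memory latency stays hidden whenever \(T_{comp} \ge T_{load}\); it is only a sketch of the overlap argument, not the exact formulation of Section 4.

```python
# Minimal sketch of the double-buffer timing model: with ping-pong buffers,
# the load of tile i+1 overlaps the computation of tile i, so the per-tile
# latency becomes max(T_comp, T_load) instead of T_comp + T_load.
def total_cycles(num_tiles, t_comp, t_load, double_buffered=True):
    if double_buffered:
        # only the very first load cannot be hidden
        return t_load + num_tiles * max(t_comp, t_load)
    return num_tiles * (t_comp + t_load)

# When T_comp >= T_load (as for O_TC = 10 here), the overlapped schedule is
# dominated by computation, i.e., roughly num_tiles * t_comp cycles in total.
```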
Towards the fourth layer \(({5, 2, 32, 32, 128, 3})\), we specifically propose an accelerator array that contains 25 rows and 32 columns. Since the number of output channels equals 3, we can assign the parallel factor of the PEs to this same value, without partitioning in the \(O_{C}\) dimension. The sizes of the input feature maps and output results of this layer are 2 Mb and 192 Kb, respectively. By solving the formulated problem, we set \(N_{i}\) and \(N_{o}\) as 2, and implement a 4 Mb Input Buffer and a 0.4 Mb Output Buffer, respectively. Besides, we also deploy the Weight Buffer, whose size equals 154 Kb. By deploying the designed accelerator array on the target platform, we obtain an acceleration performance of 2520.66 GOPS.
As can be seen, we first select the transposed convolutional layers within DCGAN to demonstrate the effectiveness of the accelerator architecture. Since the kernel size and stride size of the four layers are identical, we only need to focus on the output channel dimension, in which we apply the tiling method. By resolving the min–max problem formulated in the design space exploration, we can devise the specific accelerator array with the appropriate tiling factor and buffer sizes. Benefiting from the systolic structure, the devised array can run at a much higher frequency; however, we set the frequency identical to that adopted in the next section for a fair comparison. Moreover, we can also implement several accelerator arrays concurrently on the selected platform under the resource constraints, and map different tiles split in the output channel dimension to these arrays, as sketched below. In this way, we can fully utilize the available resources to reap higher performance.
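A minimal sketch of this multi-array mapping is given below; the round-robin assignment and the number of arrays are illustrative assumptions rather than the exact scheduling used here.

```python
# Minimal sketch: split the output channels into O_TC-wide tiles and assign
# them round-robin to several concurrently implemented accelerator arrays.
def map_output_channel_tiles(O_C, O_TC, num_arrays):
    tiles = [(start, min(start + O_TC, O_C)) for start in range(0, O_C, O_TC)]
    assignment = {a: [] for a in range(num_arrays)}
    for idx, tile in enumerate(tiles):
        assignment[idx % num_arrays].append(tile)  # round-robin over arrays
    return assignment

# Example: the third DCGAN layer (O_C = 128) with O_TC = 10 yields 13 tiles,
# which two arrays would process in 7 rounds instead of 13.
print(map_output_channel_tiles(128, 10, 2))
```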
Then, to demonstrate the reuse of the proposed accelerator array, we devise one uniform array onto which we can map these transposed convolutional layers. Generally, the size of the uniform array can be set as \((25, 8, 10)\), as the transposed convolutional layers have the same kernel size. This means that we apply tiling in the output channel dimension of these layers, with \(O_{TC}\) equal to 10. Additionally, we also need to apply tiling in the input width dimension of the third and fourth transposed convolutional layers, i.e., the tiling size of the input width dimension \(I_{TW}\) equals 8. Then, the intermediate results obtained from the tiles split in the input width dimension should be accumulated to form the final results. By applying tiling in the output channel and input width dimensions, we can execute these transposed convolutional layers on the uniform accelerator array.
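The sketch below only enumerates the tiles that the uniform \((25, 8, 10)\) array would iterate over for a given layer; the tuple layout and loop order are assumptions for illustration, and the accumulation of the partial results is omitted.

```python
# Minimal sketch: enumerate the (input-width, output-channel) tiles executed on
# the uniform (25, 8, 10) array.  Only the loop structure is shown; partial
# results of adjacent input-width tiles still have to be accumulated.
def enumerate_tiles(I_W, O_C, I_TW=8, O_TC=10):
    for oc_start in range(0, O_C, O_TC):        # tiling in the output channel dim
        for iw_start in range(0, I_W, I_TW):    # tiling in the input width dim
            yield (iw_start, min(iw_start + I_TW, I_W),
                   oc_start, min(oc_start + O_TC, O_C))

# Third DCGAN layer (I_W = 16, O_C = 128): 2 x 13 = 26 tiles on the uniform array.
print(len(list(enumerate_tiles(16, 128))))
```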
5.4 Results and Comparison with Previous Works
In this section, we investigate the effects of the kernel size and stride size on the acceleration performance. Generally, we select the third transposed convolutional layer of FCN and the last layer of FSRCNN, which have different kernel sizes. Specifically, we can assign the stride size of the last layer of FSRCNN as distinct values to figure out the effect of the stride size on the accelerator array. We also select the third transposed convolutional layer within DCGAN. We present the related parameters of the three layers in Table 4.
Benefiting from the reconfigurability of FPGA, we can re-implement the specific accelerator array for FCN. For the third transposed convolutional layer \(({16, 2, 10, 10, 21, 21})\) in FCN, as its kernel size \(K\) and input width \(I_{W}\) are 16 and 10, we need to apply tiling in the output channel owing to the resource limitation. Leveraging the resource model devised above, we can assign \(O_{TC}\) as 2 without exceeding the DSP capacity, and consequently devise the accelerator array \((256, 10, 2)\). Typically, we can load the weights of this layer, 0.9 Mb in total, into the on-chip Weight Buffer. By the analytical model, we can assign \(N_{i}\) and \(N_{o}\) as 200 while satisfying the memory limitation. Then, we iteratively multiply the rows of feature maps with the corresponding weights. In this way, we achieve 4761.64 GOPS at the cost of 5,120 DSPs. For the first transposed convolutional layer \((4, 2, 1, 1, 21, 21)\) of FCN, we can design the corresponding accelerator array \((16, 1, 10)\) by applying tiling in the output channel dimension. According to the proposed design space exploration method, we can store the 6.9 Kb of weights on chip and set \(N_{i}/N_{o}\) as 1,000 under the memory limitation. Then, we obtain 184.20 GOPS by utilizing 160 DSPs. Similarly, for the second transposed convolutional layer \((4, 2, 4, 4, 21, 21)\) of FCN, we implement the array \((16, 4, 10)\). We store the 6.9 Kb of weights in the on-chip memory and assign \(N_{i}/N_{o}\) as 500. By implementing the array on the FPGA board, we obtain 742.40 GOPS at the cost of 640 DSPs.
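The DSP budgets quoted above follow directly from the array shapes \((K^2, \text{columns}, O_{TC})\); the small check below reproduces them, assuming one DSP per PE.

```python
# Minimal sketch: the DSP cost of an accelerator array (rows, cols, O_TC) is
# simply the product of its three dimensions, assuming one DSP per PE.
def dsp_count(rows, cols, o_tc):
    return rows * cols * o_tc

assert dsp_count(256, 10, 2) == 5120  # 3rd transposed conv layer of FCN
assert dsp_count(16, 1, 10) == 160    # 1st transposed conv layer of FCN
assert dsp_count(16, 4, 10) == 640    # 2nd transposed conv layer of FCN
```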
For the target layer \(({9, S, 32, 32, 32, 1})\) of FSRCNN based on the CIFAR-10 dataset, we devise the corresponding accelerator array with 81 rows and 32 columns. As the output channel \(O_{C}\) is 1, we need not apply tiling. The weight volume of this layer is relatively small, so we can store these weights in the on-chip memory, and we assign \(N_{i}\) and \(N_{o}\) as 4 while satisfying the memory constraint and computation need. When handling this layer, we set the stride size \(S\) as different values, such as 2, 3, and 4. As the experimental results demonstrate, the distinct stride sizes have no obvious effect on the overall performance, different from the previous works [5, 12]. This is because the Flattening Phase can be undertaken concurrently with the other phase, and it typically takes more cycles than the Overlapping Phase when the kernel size is larger than these stride sizes, so the latency of the Overlapping Phase is hidden.
On the other hand, we additionally execute the FSRCNN on the Set5 dataset, where the image size can be larger than that of the CIFAR-10 dataset. For larger images, we typically adopt tiling in the output channel and input width dimensions, and the adjacent tile outputs have \(K-S\) overlapping elements, which should be accumulated to generate the final results. When taking an image of size \(256 \times 256\) as the input, the parameters of the transposed convolutional layer become \((9, S, 256, 256, 32, 1)\). Instead of applying tiling in the output channel dimension, whose size is 1, we apply tiling in the input width dimension to execute the layer. By assigning \(I_{TW}\) as 32, we can implement the accelerator array of size \((81, 32, 1)\) to execute the transposed convolution tiles. As each tile output overlaps with the adjacent one by \(K-S\) elements, we just need to accumulate these overlapping elements to obtain the final results. Generally, we can store the weights on chip and set \(N_i/N_o\) as 2. Finally, we achieve nearly identical performance to that of performing this layer on the CIFAR-10 dataset.
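To illustrate how the \(K-S\) overlapping elements of adjacent tile outputs are combined, the sketch below stitches 1-D tile outputs by accumulating the overlap region; the 1-D view and the helper name are simplifying assumptions.

```python
# Minimal 1-D sketch: adjacent transposed-convolution tile outputs overlap by
# K - S elements, so that overlap region is accumulated when the tiles are
# stitched into the final output row.
def stitch_tile_outputs(tile_outputs, K, S):
    overlap = K - S
    result = list(tile_outputs[0])
    for tile in tile_outputs[1:]:
        for j in range(overlap):          # accumulate the overlapping part
            result[-overlap + j] += tile[j]
        result.extend(tile[overlap:])     # append the non-overlapping part
    return result

# Example with K = 9, S = 2: two tiles of 16 elements overlap by 7 elements.
print(stitch_tile_outputs([[1] * 16, [1] * 16], K=9, S=2))
```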
The works [5, 12], both of which transform the transposed convolution into multiple convolution operations in an off-line manner when accelerating FSRCNN, show performance that is sensitive to the stride size. The proposed intermediate-centric dataflow can reap steady acceleration performance and avoid the enormous transformation overhead. Furthermore, by avoiding the off-line transformation overhead and the frequent communication between host/client devices, the intermediate-centric dataflow is more efficient for accelerating the transposed convolution of Back Propagation when performing on-device training in federated learning applications [18]. Generally, the work [12] achieves higher DSP performance because it adopts the FIR algorithm to perform the transformed convolution kernels, so that the multiplication operations can be reduced (see Table 6). However, this approach demands iteratively transforming the kernel in software, which impedes its application to accelerating the Error Propagation of the training process. Moreover, it directly omits the zeros padded around the input feature maps after converting the transposed convolution, which can also damage the result quality. The proposed intermediate-centric dataflow can directly perform the backward-stencil computation and speed up the transposed convolution on the fly without any approximation, while keeping reasonable acceleration performance. On the other hand, by solving the problem formulated in Section 4 in these experiments, we find that \(T_{comp}\) is generally larger than \(T_{load}\), which means that the accelerator array is mainly computation-bound for the selected transposed convolutional layers. Consequently, by effectively leveraging the DSP resources at high frequency, we can attain better performance from the accelerator architecture as the computation resources increase.
Furthermore, we discuss the scalability of the intermediate-centric dataflow and the hardware architecture. When we implement the accelerator array on the FPGA board, the scalability can be bounded by several factors: 1) The DSPs are distributed in columns in the FPGA chips; for example, the 6,840 DSPs are distributed in 32 columns on the U200 board. A square array therefore generally yields a distorted layout after placement and routing with the EDA tools because of this rectangular DSP distribution. As the proposed accelerator arrays are designed in a rectangular shape, we can still keep the shape almost unchanged after placement and routing by leveraging the optimization methods mentioned in the work [25]. 2) The chips in state-of-the-art FPGA boards generally contain several dies, each of which holds a part of the resources. When the size of the accelerator array grows larger, the array can be implemented across the dies, which can limit the frequency. As our proposed dataflow indeed works efficiently, these limitations are mainly caused by the physical design of the FPGA boards. On the other hand, while the Google TPUs [9] contain \(512\,\times\,512\) and \(256\,\times\,256\) matrix arrays, our dataflow can likely avoid these resource limitations with the ASIC technique.
As we aim at proposing the intermediate-centric dataflow to accelerate the transposed convolution and perform its backward-stencil computation, we directly adopt the commonly used data types, as in the works [19, 20]. The possible effects of quantization on accuracy have been presented in the work [20].