High-Performance Acceleration of 2-D and 3-D CNNs On FPGAs Using Static Block Floating Point
Abstract— Over the past few years, 2-D convolutional neural networks (CNNs) have demonstrated their great success in a wide range of 2-D computer vision applications, such as image classification and object detection. At the same time, 3-D CNNs, as a variant of 2-D CNNs, have shown their excellent ability to analyze 3-D data, such as video and geometric data. However, the heavy algorithmic complexity of 2-D and 3-D CNNs imposes a substantial overhead on the speed of these networks, which limits their deployment in real-life applications. Although various domain-specific accelerators have been proposed to address this challenge, most of them only focus on accelerating 2-D CNNs, without considering their computational efficiency on 3-D CNNs. In this article, we propose a unified hardware architecture to accelerate both 2-D and 3-D CNNs with high hardware efficiency. Our experiments demonstrate that the proposed accelerator can achieve up to 92.4% and 85.2% multiply-accumulate efficiency on 2-D and 3-D CNNs, respectively. To improve the hardware performance, we propose a hardware-friendly quantization approach called static block floating point (BFP), which eliminates the frequent representation conversions required in traditional dynamic BFP arithmetic. Compared with integer linear quantization using zero-point, the static BFP quantization can decrease the logic resource consumption of the convolutional kernel design by nearly 50% on a field-programmable gate array (FPGA). Without time-consuming retraining, the proposed static BFP quantization is able to quantize the precision to 8-bit mantissa with negligible accuracy loss. As different CNNs on our reconfigurable system require different hardware and software parameters to achieve optimal hardware performance and accuracy, we also propose an automatic tool for parameter optimization. Based on our hardware design and optimization, we demonstrate that the proposed accelerator can achieve 3.8–5.6 times higher energy efficiency than a graphics processing unit (GPU) implementation. Compared with state-of-the-art FPGA-based accelerators, our design achieves higher generality and up to 1.4–2.2 times higher resource efficiency on both 2-D and 3-D CNNs.

Index Terms— Field-programmable gate array (FPGA), static block floating point (BFP), three-dimensional convolutional neural network (3-D CNN).

Manuscript received 5 February 2021; revised 14 June 2021; accepted 17 September 2021. Date of publication 13 October 2021; date of current version 4 August 2023. This work was supported in part by the U.K. EPSRC under Grant EP/L016796/1, Grant EP/N031768/1, Grant EP/P010040/1, Grant EP/V028251/1, and Grant EP/S030069/1; in part by the National Natural Science Foundation of China under Grant 62001165; in part by the Hunan Provincial Natural Science Foundation of China under Grant 2021JJ40357; in part by the Changsha Municipal Natural Science Foundation under Grant kq2014079; and in part by funds from Corerain, Maxeler, Intel, Xilinx, and the State Key Laboratory of Space-Ground Integrated Information Technology (SGIIT). (Corresponding author: Shuanglong Liu.)

Hongxiang Fan, Zhiqiang Que, and Wayne Luk are with the Department of Computing, Imperial College London, London SW7 2AZ, U.K.
Shuanglong Liu is with the School of Physics and Electronics, Hunan Normal University, Changsha 410081, China (e-mail: liu.shuanglong@hunnu.edu.cn).
Xinyu Niu is with Shenzhen Corerain Technologies Company Ltd., Shenzhen 518048, China.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3116302.
Digital Object Identifier 10.1109/TNNLS.2021.3116302

I. INTRODUCTION

IN RECENT years, deep neural networks (DNNs), especially convolutional neural networks (CNNs), have demonstrated their great potential in various computer vision (CV) applications. In particular, 2-D CNNs, which perform 2-D convolution in the spatial domain to extract 2-D features, have achieved state-of-the-art accuracy in a wide range of CV tasks, including image classification [1] and object detection [2]. Besides 2-D CNNs, 3-D CNNs [3], due to their ability to incorporate 3-D information based on 3-D convolution, have also been adopted in various 3-D CV scenarios, such as human action recognition [4] and 3-D medical imaging segmentation [5].

Nevertheless, the memory and computational complexity of 2-D and 3-D convolutions put a heavy burden on their hardware performance on general-purpose processors, which restrains their application in real-life scenarios [6]. For instance, a classical 3-D CNN designed for human action recognition called C3D requires nearly 78 GOPs, and thus achieves only 951 ms per inference for a 16-frame video on an Intel i5 CPU, which cannot meet the requirement of real-time processing [7]. Therefore, there is a great demand for domain-specific accelerators for both 2-D and 3-D CNNs.

Different hardware platforms, including graphics processing units (GPUs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs), have been used to accelerate both 2-D and 3-D CNNs. Among all these hardware platforms, FPGAs are gaining popularity because of their better flexibility than ASICs and higher energy efficiency than GPUs [8], [9]. In spite of these advantages, there are several challenges when accelerating both 2-D and 3-D CNNs on FPGA:
1) To improve the hardware performance, most FPGA-based accelerators tend to utilize low-precision weights or activations [8]. However, previous work either shows significant accuracy loss, such as fixed-point [10] and logarithm arithmetic [11], or requires a large amount of hardware resources for implementing quantization modules, such as linear quantization [12] and dynamic BFP quantization [13].
2) Convolution operations, especially 3-D convolution, are highly memory- and computation-intensive, making it
2162-237X © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
4474 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 8, AUGUST 2023
FAN et al.: HIGH-PERFORMANCE ACCELERATION OF 2-D AND 3-D CNNs ON FPGAs USING STATIC BFP 4475
put matrices. Liu et al. [10] proposed a uniform architecture based on a 2-D multiply-accumulate (MAC) array. Nevertheless, the design uses low-bitwidth fixed-point arithmetic and the accuracy result cannot be guaranteed.

2) Low-Bitwidth Quantization: Quantization [27] and sparsity exploitation [28] are two mainstream techniques to reduce the algorithmic complexity of CNNs. Since this article mainly focuses on quantization, we refer the reader to a review of sparsity exploitation by Wang et al. [8].

A comprehensive survey on quantization methods for DNNs has been summarized in [29]. There are three mainstream quantization approaches, i.e., fixed-point, logarithm, and linear quantization. Courbariaux et al. [30] explore the use of fixed-point arithmetic on CNNs for both training and inference. However, it is only validated on small datasets and may introduce significant accuracy loss on large datasets such as ImageNet [31]. Miyashita et al. [11] proposed a logarithmic quantization method which decreased the precision of VGG-16 [32] and AlexNet [1] to low bitwidth without significant accuracy loss. Nevertheless, this quantization approach has not been evaluated on InceptionV4 and MobileNetv2, so its effectiveness on these lightweight models remains undemonstrated. Jacob et al. [12] proposed an integer-only quantization using zero-point to maintain accuracy. They demonstrated in their experiments that the integer quantization with zero-point only introduced negligible accuracy loss on a wide range of CNN models. However, the use of zero-point also introduces a heavy burden on the memory and computational resources, which limits the overall hardware performance. Jain et al. [33] proposed a power-of-two scaling quantization with trainable quantization thresholds. However, the process requires time-consuming retraining and its performance on 3-D CNNs is unknown.

The dynamic BFP quantization mentioned in Section II-B was first proposed in [17] and applied to CNN inference. It is able to compress both the activations and weights to 8 bits for ResNet-50 [16] with only negligible accuracy loss on small datasets. Based on the dynamic BFP quantization, Lian et al. [13] proposed a high-performance CNN accelerator on FPGA. Although the design uses 8-bit mantissa BFP for the main computation, the precision used in the on-chip and off-chip communication is still 16-bit and it requires frequent conversion between BFP and FP, which brings a heavy burden on the memory usage and bandwidth resources. Different from these works, our proposed quantization scheme uses a fixed shared exponent for different inputs to eliminate the frequent conversion between BFP and FP. The shared exponent is determined before CNN inference by minimizing the KL divergence. We also introduce an automatic tool that optimizes the shared exponents, the bitwidth of the mantissa, and the exponent by balancing the tradeoff between accuracy and hardware performance. Our prior work [34] explored the application of static BFP quantization on 2-D CNNs. However, its accuracy performance on 3-D CNNs was unknown. In addition, that article only evaluated the kernel design for 2-D convolution without running the actual CNN models, and it did not study the unified hardware architecture for both 2-D and 3-D CNNs.

III. STATIC BLOCK FLOATING-POINT QUANTIZATION

In this section, we first introduce the quantization approach and blocking strategy of our static BFP quantization. The operations of 2-D and 3-D CNNs under static BFP will then be presented.

A. Quantization Approach

The quantization approach should consider not only the accuracy performance but also the hardware implementation and performance. As mentioned in Section II-B2, the frequent FP-BFP and BFP-FP conversions required by the dynamic BFP quantization put a heavy overhead on its resource consumption and hardware performance. To eliminate the process of finding the shared exponents at runtime as in dynamic BFP quantization, our static quantization scheme fixes the shared exponent across different inputs and determines the shared exponents before CNN inference. To achieve this, it is required to collect a certain amount of intermediate results by running the CNNs on different inputs and to find the proper shared exponents by minimizing the precision loss. Although simply using the maximal exponents in the collected intermediate statistics is one approach to determine the shared exponents, we found that, on the ImageNet dataset [31], this maximal strategy will cause a significant accuracy drop on CNN models using depthwise convolution, such as MobileNetv2 [35]. To address this issue, we propose another strategy that determines the shared exponent by minimizing the KL divergence [36], which describes the difference between two distributions.

Algorithm 1 Static BFP Quantization Using KL Divergence
1:  Run the FP-based CNNs using different inputs
2:  for each block b do
3:    Fetch the FP statistics and build the target distribution D_t
4:    Find the maximum exponent e_max
5:    Use the maximal exponent, e_opt^b = e_max
6:    Initialize D_max as the maximal FP value
7:    for e_offset = 0 to i do
8:      e_cur = e_max - e_offset
9:      Apply BFP quantization using e_cur
10:     Build the BFP-quantized distribution D_q
11:     Compute the KL divergence D_KL between D_t and D_q
12:     if D_KL < D_max then
13:       e_opt^b = e_cur, D_max = D_KL

As illustrated in Algorithm 1, in the beginning, we separate the intermediate results into several blocks according to the block size. Note that the block size is a hyperparameter in our static BFP quantization and we will discuss it in Section III-B. For each block, a histogram can be drawn based on the collected intermediate FP statistics by running CNNs on example inputs, which is used to record the original distribution. Then, the BFP quantization is applied with i different exponents, e_max to e_max - i + 1, which produces i different BFP-quantized distributions. Our experiments demonstrate that i = 3 is enough to find the proper exponents in most cases.
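The exponent search in Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the 256-bin histograms, and the signed-mantissa rounding scheme are all assumptions made for the sketch.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # KL divergence between two histograms, normalized to distributions;
    # eps guards against empty bins.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def bfp_quantize(x, shared_exp, mantissa_bits=8):
    # Round a block to signed mantissas that share one exponent:
    # one sign bit plus (mantissa_bits - 1) magnitude bits.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    return np.clip(np.round(x / scale), lo, hi) * scale

def find_shared_exponent(block, mantissa_bits=8, search_range=3, bins=256):
    # Try e_max, e_max - 1, ..., e_max - search_range and keep the exponent
    # whose quantized histogram is closest (in KL divergence) to the original.
    e_max = int(np.floor(np.log2(np.max(np.abs(block)) + 1e-30))) + 1
    hist_t, edges = np.histogram(block, bins=bins)
    best_e, best_kl = e_max, np.inf
    for offset in range(search_range + 1):
        e = e_max - offset
        q = bfp_quantize(block, e, mantissa_bits)
        hist_q, _ = np.histogram(q, bins=edges)
        kl = kl_divergence(hist_t.astype(float), hist_q.astype(float))
        if kl < best_kl:
            best_e, best_kl = e, kl
    return best_e
```

Because the search runs offline on collected statistics, its cost is irrelevant to inference; only the chosen exponents are loaded onto the accelerator.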
B. Blocking Strategy

The block size decides the tradeoff between the precision loss and hardware performance. Although a large block size can decrease the number of shared exponents and the memory consumption, it may also increase the precision loss since the variance within one block becomes large. Therefore, our proposed blocking strategy aims at achieving a balance between precision loss and hardware performance.

In a typical convolutional layer, the shape of the weight tensor is Nc x Nf x Ks x Ks x Kt and the activation tensor has the size Nl x Nc x H x W. In terms of activations, we can block the tensor along the spatial (H x W), channel (Nc), and temporal (Nl) dimensions. However, the convolution needs to accumulate data from different spatial positions and channels, which means different exponents in these two dimensions may cause frequent exponent realignment and thus degrade the hardware performance. Since the spatial and channel dimensions are not suitable for blocking, this article blocks the activations along the temporal dimension, which generates Nl shared exponents for each activation tensor. For weights, the blocking can be performed along the kernel (Ks x Ks x Kt), channel (Nc), and filter (Nf) dimensions. Since the weights from different kernel positions and channels need to be accumulated together, we only block the weights along the filter dimension, which produces Nf shared exponents for every weight tensor. Using our proposed blocking strategy, there is no need to perform exponent realignment within a single convolutional layer.

To visualize the effect of static BFP quantization while using the proposed exponent strategies (Section III-A) and blocking strategy, Fig. 3 presents the normalized histograms of the output activations of the ninth and 29th convolutional layers in ResNet-50 using original FP data and quantized data. To compare the two exponent strategies, we quantize the activations using the maximal exponent (max) and the shared exponents obtained by minimizing the KL divergence (kl). As we can see, naively using the maximal exponents causes significant precision loss in the output activations of the ninth convolutional layer, making it a different distribution from the original FP data. However, when using the KL divergence to determine the shared exponents, the quantized activations follow a similar distribution to the FP data.

C. CNNs With Static BFP Quantization

Most operations in modern CNNs, such as the shortcut connection (SC) and 2-D and 3-D convolutions, have different BFP-based implementations from their FP counterparts. This section introduces how these operations are implemented using the static BFP quantization scheme.

1) Convolution: Fig. 2(b) shows the basic operations of both 2-D and 3-D convolutions when the static BFP quantization is applied. There are two improvements in comparison with the dynamic BFP quantization.
1) Because the proposed quantization scheme determines the shared exponents before runtime, the frequent FP-BFP and BFP-FP conversions can be replaced by a simple shift operation. For instance, given two consecutive convolutional layers with the shared exponents a and b, it only needs to perform the exponent realignment that shifts the mantissa parts by a - b bits, which avoids the trivial data conversions.
2) Since the shared exponents are already known, the required shifting bits for each layer can be precomputed before runtime. Therefore, the need for calculating exponents can be eliminated under the static BFP quantization scheme, which significantly decreases the usage of the memory and computational resources.

2) Shortcut Addition: In modern CNNs, such as ResNet, the SC [16], [37] has been widely used for residual learning. The computation of a typical ResNet-like block with FP arithmetic is presented in Fig. 4. The SC adds the outputs of the second convolution and the original inputs together to obtain the final results. However, since the exponents of the inputs and outputs are different under the static BFP quantization scheme, the addition cannot simply be performed in the original way. To implement SC under static BFP quantization, a shift operation is required before the SC to align the exponents of the two tensors, as presented in Fig. 4.

To find the proper shared exponents for BFP-based SC, one straightforward approach is to simply use the maximal exponents in the output tensor. However, we found that this brings a significant accuracy drop on CNN models using depthwise convolution, such as MobileNetv2. To address this issue, we observe that the shared exponents of the SC should be determined based on the output tensor as well as the original input tensor. Therefore, both inputs and outputs are concatenated into one tensor, and the shared exponents are determined by minimizing the KL divergence of this concatenated tensor.

3) Batch Normalization, Pooling, and ReLU: BN has been widely used in modern CNNs to address the internal covariate shift.
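The blocking scheme above (temporal blocking for activations, filter blocking for weights) can be sketched as follows. This is a minimal NumPy illustration under the tensor shapes given in the text; for brevity it picks each block's exponent naively as the maximal exponent, whereas the paper prefers the KL-divergence search of Algorithm 1. All function names are hypothetical.

```python
import numpy as np

def max_exponent(block):
    # Smallest e such that |x| <= 2**e for every x in the block
    # (a naive choice; the KL-based search may select a smaller e).
    return int(np.ceil(np.log2(np.max(np.abs(block)) + 1e-30)))

def block_activations(act):
    """act has shape (Nl, Nc, H, W). One shared exponent per temporal
    slice, so spatial/channel accumulations inside a convolution never
    need exponent realignment."""
    return [max_exponent(act[l]) for l in range(act.shape[0])]

def block_weights(w):
    """w has shape (Nc, Nf, Ks, Ks, Kt). One shared exponent per filter,
    since kernel positions and channels are accumulated together."""
    return [max_exponent(w[:, f]) for f in range(w.shape[1])]
```

An activation tensor thus carries Nl shared exponents and a weight tensor Nf, matching the counts stated in the text.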
Fig. 4. SC using FP and BFP. (a) SC using floating point. (b) SC using BFP.
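The exponent realignment that Fig. 4(b) places before the BFP shortcut addition reduces to integer mantissa arithmetic once both shared exponents are fixed offline. A minimal sketch, assuming each tensor is represented as integer mantissas plus one precomputed shared exponent; the function name is hypothetical.

```python
def shortcut_add_bfp(m_x, e_x, m_y, e_y):
    """Add two BFP blocks given as integer mantissa lists with shared
    exponents e_x and e_y. The block with the smaller exponent is
    right-shifted so both sides share the larger exponent, then the
    mantissas are added as plain integers."""
    if e_x < e_y:
        m_x, e_x, m_y, e_y = m_y, e_y, m_x, e_x
    shift = e_x - e_y  # precomputable offline, since exponents are static
    m_sum = [mx + (my >> shift) for mx, my in zip(m_x, m_y)]
    return m_sum, e_x
```

For example, `shortcut_add_bfp([4, 8], 3, [16, 32], 1)` shifts the second block right by 2 bits and returns `([8, 16], 3)`. Because the shift amount is a compile-time constant per layer, it costs only wiring on the FPGA, with no runtime exponent logic.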
Fig. 11. MAC design with two different modes. (a) MAC under PC mode. (b) MAC under PC&PS mode.
TABLE III
ACCURACY OF THE STATIC BFP QUANTIZATION USING DIFFERENT STRATEGIES AND TIME COST OF TOOL
TABLE IV
ACCURACY OF CNN MODELS UNDER DIFFERENT QUANTIZATION SCHEMES

TABLE V
RESOURCE CONSUMPTION OF THE CONVOLUTIONAL KERNEL MODULE USING DIFFERENT QUANTIZATION ON INTEL ARRIA 10 GX1150

TABLE VI
RESOURCE CONSUMPTION OF THE WHOLE DESIGN ON ARRIA 10 GX1150
TABLE VII
HARDWARE PERFORMANCE OF OUR DESIGN ON DIFFERENT 2-D AND 3-D CNN MODELS

TABLE VIII
PERFORMANCE COMPARISON OF OUR FINAL FPGA DESIGN VERSUS CPU AND GPU PLATFORMS

TABLE IX
PERFORMANCE COMPARISON OF OUR FINAL FPGA DESIGN VERSUS OTHER FPGA DESIGNS
3) By analyzing the computation of a variety of 2-D and 3-D CNNs, a unified computational pattern is proposed in this article to improve the resource efficiency.
4) By utilizing the reconfigurability of our accelerator, the automatic tool is able to deeply optimize the hardware designs for different CNN models case by case.

D. Performance Comparison

1) Comparison With CPU and GPU: We also compare our FPGA-based design with CPU and GPU implementations, as shown in Table VIII. ResNet-50 and R3D-18 are chosen as our benchmark models to represent 2-D and 3-D CNNs, respectively. The batch size on all three implementations is set to one for a fair comparison. Compared with the CPU implementation, our accelerator achieves 6–70 times higher throughput on both 2-D and 3-D CNNs. Compared with the GPU implementation, our design is 1.5 times more energy efficient. Although the GPU is faster on R3D-18 using the 16-nm technology, our FPGA design (20 nm) can also achieve 45.57 ms when we scale the performance to the 16-nm technology by 16/20 times.
2) Comparison With Other FPGA Designs: Table IX [5] H. Lu, H. Wang, Q. Zhang, S. W. Yoon, and D. Won, “A 3D convo-
presents the comparison results between our accelerator with lutional neural network for volumetric image semantic segmentation,”
Proc. Manuf., vol. 39, pp. 422–428, Jan. 2019.
the state-of-the-art FPGA designs in terms of latency and [6] H. Fan, H.-C. Ng, S. Liu, Z. Que, X. Niu, and W. Luk, “Reconfigurable
throughput. Because these designs are implemented on dif- acceleration of 3D-CNNs for human action recognition with block
ferent platforms with different hardware resources, their DSP floating-point representation,” in Proc. 28th Int. Conf. Field Program.
Log. Appl. (FPL), Aug. 2018, pp. 287–2877.
consumption may vary from each one. Therefore, we measure [7] H. Fan et al., “F-E3D: FPGA-based acceleration of an efficient 3D
the GOP/s/DSP of these designs for a fair comparison. The convolutional neural network for human action recognition,” in Proc.
GOP/s/DSP is the platform-independent metric to evaluate IEEE 30th Int. Conf. Appl.-Specific Syst., Architectures Processors
(ASAP), Jul. 2019, pp. 1–8.
the quality of hardware architecture, which represents the
[8] E. Wang et al., “Deep neural network approximation for custom hard-
computing ability provided by one DSP. Compared with ware: Where we’ve been, where We’re going,” 2019, arXiv:1901.06955.
previous designs, our accelerator supports a wider range of [Online]. Available: https://arxiv.org/abs/1901.06955
benchmark models, including different 2-D and 3-D CNNs, [9] S. Liu, H. Fan, X. Niu, H.-C. Ng, Y. Chu, and W. Luk, “Optimizing
CNN-based segmentation with deeply customized convolutional and
which demonstrates its higher generality. At the same time, deconvolutional architectures on FPGA,” ACM Trans. Reconfigurable
our accelerator also shows higher throughput and resource Technol. Syst., vol. 11, no. 3, pp. 1–22, Dec. 2018.
efficiency than all these state-of-the-art designs. In comparison [10] Z. Liu, P. Chow, J. Xu, J. Jiang, Y. Dou, and J. Zhou, “A uniform
architecture design for accelerating 2D and 3D CNNs on FPGAs,”
with [24] which has the highest performance on ResNet-50, Electronics, vol. 8, no. 1, p. 65, Jan. 2019.
our design can achieve higher throughput and nearly 1.4 [11] D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional neural net-
times higher resource efficiency. Note that [24] can only works using logarithmic data representation,” 2016, arXiv:1603.01025.
support 2-D CNNs and its performance for 3-D CNNs is [Online]. Available: https://arxiv.org/abs/1603.01025
[12] B. Jacob et al., “Quantization and training of neural networks for
unknown. Comparing with the unified accelerator which sup- efficient integer-arithmetic-only inference,” in Proc. IEEE/CVF Conf.
ports both 2-D and 3-D CNNs, we can achieve 1.6–2 and Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704–2713.
1.9–2.2 times higher throughput and resource efficiency [13] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, “High-performance
FPGA-based CNN accelerator with block-floating-point arithmetic,”
depending on different CNN models. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8,
pp. 1874–1885, Aug. 2019.
VII. C ONCLUSION [14] J. Shen, Y. Huang, Z. Wang, Y. Qiao, M. Wen, and C. Zhang, “Towards
a uniform template-based architecture for accelerating 2D and 3D
This work proposes a uniform hardware architecture to CNNs on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate
accelerate both 2-D and 3-D CNNs with high hardware Arrays, 2018, pp. 97–106.
efficiency. The design is based on a hardware-friendly quan- [15] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift,” 2015,
tization method call static BFP. The proposed static BFP arXiv:1502.03167. [Online]. Available: https://arxiv.org/abs/1502.
eliminates the frequent representation conversions required in 03167
traditional dynamic BFP arithmetic. Without using zero point, [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
static BFP can achieve up to 50% logic resources saving on Jun. 2016, pp. 770–778.
an FPGA compared with conventional integer linear quanti- [17] Z. Song, Z. Liu, and D. Wang, “Computation error analysis of
zation. Extensive experiments on various 2-D and 3-D CNNs block floating point arithmetic oriented convolution neural network
accelerator design,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018,
demonstrate that the static BFP can decrease the bitwidth of pp. 1–8.
mantissa to 8 with negligible accuracy loss. An automatic [18] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, and K. Keutzer,
tool is also proposed to optimize the accuracy and hardware “Mixed precision quantization of ConvNets via differentiable neural
performance by determining the proper software and hardware architecture search,” 2018, arXiv:1812.00090. [Online]. Available:
https://arxiv.org/abs/1812.00090
parameters. Our hardware design together with optimizations [19] G. Lacey, G. W. Taylor, and S. Areibi, “Stochastic layer-wise precision
achieves 3.8–5.6 times higher energy efficiency than GPU in deep neural networks,” 2018, arXiv:1807.00942. [Online]. Available:
implementation. Compared with the state-of-the-art FPGA- https://arxiv.org/abs/1807.00942
[20] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A survey of FPGA-
based accelerators, our design can also achieve up to 1.4–2.2 based neural network accelerator,” 2017, arXiv:1712.08934. [Online].
times higher resource efficiency and higher generality on both Available: https://arxiv.org/abs/1712.08934
2-D and 3-D CNNs. Further work includes extending our [21] S. I. Venieris and C.-S. Bouganis, “FPGAConvNet: Mapping regular and
hardware design to support the transformer and other neural irregular convolutional neural networks on FPGAs,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 30, no. 2, pp. 326–342, Jul. 2018.
networks and exploring the mixed-precision BFP on these [22] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
networks to further improve the performance. FPGA-based accelerator design for deep convolutional neural networks,”
in Proc. 2015 ACM/SIGDA Int. Symp. Field-Program. Gate Arrays,
2015, pp. 161–170.
R EFERENCES [23] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Optimizing the convolution
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification operation to accelerate deep neural networks on FPGA,” IEEE Trans.
with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367,
Process. Syst., 2012, pp. 1097–1105. Jul. 2018.
[2] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. [24] Y. Xing et al., “DNNVM: End-to-end compiler leveraging heterogeneous
Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37. optimizations on FPGA-based CNN accelerators,” IEEE Trans. Comput.-
[3] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Aided Design Integr. Circuits Syst., vol. 39, no. 10, pp. 2668–2681,
spatiotemporal features with 3D convolutional networks,” in Proc. IEEE Oct. 2020.
Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497. [25] M. Sun, P. Zhao, M. Gungor, M. Pedram, M. Leeser, and X. Lin,
[4] H. Fan, X. Niu, Q. Liu, and W. Luk, “F-C3D: FPGA-based 3- “3D CNN acceleration on FPGA using hardware-aware pruning,”
dimensional convolutional neural network,” in Proc. 27th Int. Conf. Field in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2020,
Program. Log. Appl. (FPL), Sep. 2017, pp. 1–4. pp. 1–6.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on January 29,2024 at 19:06:12 UTC from IEEE Xplore. Restrictions apply.
FAN et al.: HIGH-PERFORMANCE ACCELERATION OF 2-D AND 3-D CNNs ON FPGAs USING STATIC BFP 4487
[26] C. Yang, Y. Wang, X. Wang, and L. Geng, “WRA: A 2.2-to-6.3 TOPS highly unified dynamically reconfigurable accelerator using a novel Winograd decomposition algorithm for convolutional neural networks,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 9, pp. 3480–3493, Sep. 2019.
[27] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” 2015, arXiv:1510.00149.
[28] A. Aimar et al., “NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644–656, Mar. 2019.
[29] T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang, “Pruning and quantization for deep neural network acceleration: A survey,” 2021, arXiv:2101.09671.
[30] M. Courbariaux, Y. Bengio, and J.-P. David, “Training deep neural networks with low precision multiplications,” 2014, arXiv:1412.7024.
[31] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[32] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[33] S. R. Jain, A. Gural, M. Wu, and C. H. Dick, “Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks,” 2019, arXiv:1903.08066.
[34] H. Fan, G. Wang, M. Ferianc, X. Niu, and W. Luk, “Static block floating-point quantization for convolutional neural networks on FPGA,” in Proc. Int. Conf. Field-Program. Technol. (ICFPT), Dec. 2019, pp. 28–35.
[35] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[36] Wikipedia Contributors. (2020). Kullback–Leibler Divergence. [Online]. Available: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 630–645.
[38] H. Fan et al., “A real-time object detection accelerator with compressed SSDLite on FPGA,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2018, pp. 14–21.
[39] M. R. Pillmeier, M. J. Schulte, and E. G. Walters III, “Design alternatives for barrel shifters,” Proc. SPIE, vol. 4791, pp. 436–447, Dec. 2002.
[40] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8024–8035.
[41] ONNX Framework. Accessed: Feb. 1, 2021. [Online]. Available: https://github.com/onnx/onnx
[42] M. Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: https://www.tensorflow.org/
[43] T. Chen et al., “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” 2015, arXiv:1512.01274.
[44] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” 2012, arXiv:1212.0402.
[45] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 1–7.
[46] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6546–6555.
[47] V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” Univ. Massachusetts, Amherst, MA, USA, Tech. Rep. UM-CS-2010-009, 2010.
[48] Y. Ma, T. Zheng, Y. Cao, S. Vrudhula, and J.-S. Seo, “Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2018, pp. 1–8.
[49] X. Zhang et al., “DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2018, pp. 1–8.

Hongxiang Fan received the B.S. degree in electronic engineering from Tianjin University, Tianjin, China, in 2017, and the master’s degree from the Department of Computing, Imperial College London, London, U.K., in 2018, where he is currently pursuing the Ph.D. degree in machine learning and high-performance computing. His current research interests include efficient algorithms and acceleration for machine learning applications.

Shuanglong Liu received the B.Sc. and M.Sc. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2010 and 2013, respectively, and the Ph.D. degree in electrical engineering from Imperial College London, London, U.K., in 2017. From 2017 to 2020, he was a Research Associate with the Department of Computing, Imperial College London. He is currently a Distinguished Professor with the School of Physics and Electronics, Hunan Normal University, Changsha, China. His current research interests include reconfigurable and high-performance computing for deep neural networks.

Zhiqiang Que received the B.S. degree in microelectronics and the M.S. degree in computer science from Shanghai Jiao Tong University, Shanghai, China, in 2008 and 2011, respectively. He is currently pursuing the Ph.D. degree with the Department of Computing, Imperial College London, London, U.K. From 2011 to 2016, he worked on the microarchitecture design and verification of ARM CPUs with Marvell Semiconductor Ltd., Shanghai. He is currently a Research Assistant with the Department of Computing, Imperial College London. His research interests include computer architectures, high-performance computing, and computer-aided design tools for hardware design optimization.

Xinyu Niu received the B.A. degree from Fudan University, Shanghai, China, in 2010, and the M.Sc. and D.Phil. degrees in computing science from Imperial College London, London, U.K., in 2011 and 2015, respectively. He is currently a Co-Founder and the CEO of Shenzhen Corerain Technologies Company, Ltd., Shenzhen, China. His current research interests include developing applications and tools for reconfigurable computing that involve runtime reconfiguration.

Wayne Luk (Fellow, IEEE) received the B.A., M.Sc., and D.Phil. degrees in engineering and computing science from the University of Oxford, Oxford, U.K., in 1984, 1985, and 1989, respectively. He founded and currently leads the Custom Computing Group, Department of Computing, Imperial College London, London, U.K., where he is also a Professor of computer engineering. He was a Visiting Professor with Stanford University, Stanford, CA, USA. Dr. Luk is a fellow of the Royal Academy of Engineering and the British Computer Society (BCS). Fifteen of his papers have received awards from international conferences, and he has served on the steering and program committees of various international conferences. He received a Research Excellence Award from Imperial College London, and he was the Founding Editor-in-Chief of the ACM Transactions on Reconfigurable Technology and Systems.