Abstract
This paper presents the design and implementation of a convolutional neural network (CNN) accelerator for embedded and edge computing systems. Specifically, a novel processing flow is proposed that maximally reuses the data already stored in the accelerator. This greatly reduces both the on-chip storage requirements and the number of off-chip memory accesses, which in turn yields significant reductions in memory-access delay and area complexity. Based on the proposed data processing flow, a highly efficient VLSI architecture is designed and implemented. The architecture is pipelined and maximizes the utilization of its hardware components. The implemented circuit is synthesized, placed, and routed in TSMC 90 nm technology, and its performance and area complexity are evaluated from post-layout estimates. The experimental results show that the proposed CNN accelerator achieves a throughput of 44.06 Giga-MAC/s with a complexity of 5909 KGEs. Furthermore, the design delivers 79.1 frames per second (fps) at a clock frequency of 250 MHz. Compared to state-of-the-art accelerators, the proposed architecture achieves a significant enhancement in efficiency.
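As a rough reading aid, the headline figures can be combined into a few derived quantities. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted in the abstract. It assumes that the 44.06 Giga-MAC/s throughput is sustained at the 250 MHz clock and that the throughput and the frame rate refer to the same workload, neither of which the abstract states explicitly. The function and key names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract.
THROUGHPUT_MAC_PER_S = 44.06e9  # 44.06 Giga-MAC/s
CLOCK_HZ = 250e6                # 250 MHz
FRAMES_PER_S = 79.1             # 79.1 fps
AREA_KGE = 5909                 # 5909 KGEs (kilo gate equivalents)


def derived_metrics(throughput, clock_hz, fps, area_kge):
    """Combine the quoted figures into derived quantities.

    Assumption (not stated in the abstract): the throughput is sustained
    at the given clock and applies to the same workload as the frame rate.
    """
    return {
        # Average MAC operations completed per clock cycle.
        "macs_per_cycle": throughput / clock_hz,
        # Implied MAC operations per processed frame.
        "macs_per_frame": throughput / fps,
        # Throughput normalized by area, in MAC/s per KGE.
        "macs_per_s_per_kge": throughput / area_kge,
    }


metrics = derived_metrics(
    THROUGHPUT_MAC_PER_S, CLOCK_HZ, FRAMES_PER_S, AREA_KGE
)

for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

Under these assumptions the figures work out to roughly 176 MACs per cycle, about 557 million MACs per frame, and about 7.46 million MAC/s per KGE. These are derived estimates, not values reported by the authors.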
Data Availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work is supported in part by the Ministry of Science and Technology, Taiwan, under grants MOST 109-2221-E-011-142 and 110-2221-E-011-155. The authors would like to thank Prof. Gerd Ascheid and Dr. Andreas Bytyn of RWTH Aachen University for their valuable input regarding the design of the CNN accelerator.
About this article
Cite this article
Lin, HJ., Shen, CA. The Data Flow and Architectural Optimizations for a Highly Efficient CNN Accelerator Based on the Depthwise Separable Convolution. Circuits Syst Signal Process 41, 3547–3569 (2022). https://doi.org/10.1007/s00034-022-01952-5
DOI: https://doi.org/10.1007/s00034-022-01952-5