
An Efficient CNN Accelerator for Low-Cost Edge Systems

Published: 23 August 2022

Abstract

Customized hardware-based convolutional neural network (CNN, or ConvNet) accelerators have attracted significant attention for applications in low-cost edge computing systems. However, little research has sought to optimize at both the algorithm and hardware levels simultaneously on resource-constrained FPGA systems. In this paper, we first analyze ConvNet models to find the one best suited to a low-cost FPGA implementation. Based on this analysis, we select MobileNetV2 as the backbone of our research due to its hardware-friendly structure. We use a quantized implementation with 4-bit precision and optimize further with a smaller input resolution of 192 × 192 to obtain 68.8% classification accuracy on ImageNet, which represents only a 3.2% accuracy loss compared to a floating-point model that uses the full input size. We then develop a hardware implementation on a low-cost FPGA. To accelerate the depthwise separable ConvNet and utilize DRAM resources efficiently with parallel processing, we propose a novel scoreboard architecture that dynamically schedules DRAM data requests in order to maintain high hardware utilization. The number of DSP blocks used is about six times smaller than in prior work, and internal block RAM utilization is approximately nine times more efficient than in prior work. Our proposed design achieves 3.07 frames per second (FPS) on the low-cost, resource-constrained FPGA system.
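As context for why a MobileNetV2-style backbone suits a resource-constrained accelerator, the sketch below (an illustration only, not the paper's implementation; the layer shape is a hypothetical example) compares multiply-accumulate counts of a standard 3 × 3 convolution against its depthwise separable factorization:

```python
# Compare MAC counts for a standard 3x3 convolution versus its
# depthwise separable factorization (3x3 depthwise + 1x1 pointwise),
# the building block MobileNetV2 relies on. Layer shape is illustrative.

def standard_conv_macs(h, w, cin, cout, k=3):
    # Each of the h*w output pixels needs k*k*cin MACs per output filter.
    return h * w * cout * k * k * cin

def separable_conv_macs(h, w, cin, cout, k=3):
    depthwise = h * w * cin * k * k   # one k*k filter per input channel
    pointwise = h * w * cin * cout    # 1x1 convolution mixes channels
    return depthwise + pointwise

# Hypothetical feature map, e.g. from a 192x192 input after a stride-2 stage.
h = w = 96
cin, cout = 32, 64

std = standard_conv_macs(h, w, cin, cout)
sep = separable_conv_macs(h, w, cin, cout)
print(f"standard: {std:,} MACs, separable: {sep:,} MACs, "
      f"saving: {std / sep:.1f}x")
```

For this shape the factorization needs roughly 8× fewer MACs, which is what makes the depthwise separable layers worth accelerating with dedicated scheduling hardware.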





    Published In

    ACM Transactions on Embedded Computing Systems  Volume 21, Issue 4
    July 2022
    330 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/3551651
    • Editor:
    • Tulika Mitra

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 August 2022
    Online AM: 26 May 2022
    Accepted: 01 May 2022
    Revised: 01 May 2022
    Received: 01 August 2021
    Published in TECS Volume 21, Issue 4


    Author Tags

    1. Convolutional neural network (CNN)
    2. EfficientNet
    3. MobileNet
    4. hardware accelerator
    5. embedded system
    6. FPGA

    Qualifiers

    • Research-article
    • Refereed


    Cited By

    • (2024) Image Processing for Smart Agriculture Applications Using Cloud-Fog Computing. Sensors 24, 18 (5965). DOI: 10.3390/s24185965. Online publication date: 14-Sep-2024.
    • (2024) Optimization of Convolution Operators' Backpropagation for Domestic Accelerators: Convolution Kernel Operation Backpropagation Optimization to Enhance Performance and Efficiency in Convolutional Neural Networks. Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering, 358–363. DOI: 10.1145/3672758.3672817. Online publication date: 26-Jan-2024.
    • (2024) RAMAN: A Reconfigurable and Sparse tinyML Accelerator for Inference on Edge. IEEE Internet of Things Journal 11, 14 (24831–24845). DOI: 10.1109/JIOT.2024.3386832. Online publication date: 15-Jul-2024.
    • (2024) FPGA-based UAV and UGV for search and rescue applications: A case study. Computers and Electrical Engineering 119 (109491). DOI: 10.1016/j.compeleceng.2024.109491. Online publication date: Oct-2024.
    • (2023) A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator. Electronics 12, 7 (1571). DOI: 10.3390/electronics12071571. Online publication date: 27-Mar-2023.
    • (2023) Overflow-free Compute Memories for Edge AI Acceleration. ACM Transactions on Embedded Computing Systems 22, 5s (1–23). DOI: 10.1145/3609387. Online publication date: 9-Sep-2023.
    • (2023) The Design of Efficient Data Flow and Low-Complexity Architecture for a Highly Configurable CNN Accelerator. Circuits, Systems, and Signal Processing 42, 8 (4759–4783). DOI: 10.1007/s00034-023-02331-4. Online publication date: 6-Mar-2023.
    • (2022) A Resource-Limited FPGA-based MobileNetV3 Accelerator. 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 619–622. DOI: 10.23919/APSIPAASC55919.2022.9980265. Online publication date: 7-Nov-2022.
