
DOI: 10.1145/3020078.3021698

Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network

Published: 22 February 2017

Abstract

OpenCL FPGA has recently gained great popularity with emerging needs for workload acceleration such as the Convolutional Neural Network (CNN), which is the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGAs, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources of the FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve the desired performance for workloads that are both compute-intensive and data-intensive, such as convolutional neural networks.
In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the resources available on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design that effectively addresses this bandwidth limitation and provides an optimal balance between computation, on-chip memory access, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator on an Altera Arria 10 GX1150 board. We achieve 866 Gop/s of floating-point performance at a 370 MHz working frequency and 1.79 Top/s of 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density among existing work.
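To put the reported numbers in perspective, the Python sketch below is a back-of-the-envelope illustration, not the paper's analytical model; the 16-bit operand width comes from the abstract, while the assumption that every multiply-accumulate (MAC) unit fetches both operands from on-chip RAM each cycle and the 20 GB/s off-chip bandwidth figure are illustrative assumptions. It derives the degree of MAC parallelism implied by the reported fixed-point throughput and the raw on-chip read bandwidth such a design would need without any data reuse.

    # Back-of-the-envelope illustration (assumptions noted above, not the paper's model).
    FREQ_HZ = 385e6            # reported kernel frequency of the fixed-point design
    THROUGHPUT_OPS = 1.79e12   # reported 1.79 Top/s; one MAC counts as 2 ops
    BYTES_PER_OPERAND = 2      # 16-bit fixed point, as in the abstract

    ops_per_cycle = THROUGHPUT_OPS / FREQ_HZ
    macs_per_cycle = ops_per_cycle / 2
    print(f"parallel MACs implied by the reported throughput: ~{macs_per_cycle:.0f}")

    # If every MAC naively fetched both operands (an activation and a weight)
    # from on-chip RAM every cycle, the required on-chip read bandwidth would be:
    naive_onchip_bw = macs_per_cycle * 2 * BYTES_PER_OPERAND * FREQ_HZ
    print(f"naive on-chip read bandwidth: ~{naive_onchip_bw / 1e12:.1f} TB/s")

    # Compared with an assumed (illustrative) 20 GB/s of off-chip DDR bandwidth,
    # this is two to three orders of magnitude larger, which is why the kernel
    # design must rely on on-chip data reuse rather than raw memory bandwidth.
    ASSUMED_DDR_BW = 20e9      # bytes/s, illustrative assumption
    print(f"ratio to the assumed DDR bandwidth: ~{naive_onchip_bw / ASSUMED_DDR_BW:.0f}x")

The exact figures matter less than the gap they expose: feeding thousands of parallel MACs is chiefly a question of on-chip buffer organization and data reuse, which is the balance the proposed kernel design targets.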

      Published In

      FPGA '17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
      February 2017
      312 pages
      ISBN: 9781450343541
      DOI: 10.1145/3020078

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. convolutional neural networks
      2. fpga
      3. hardware accelerator
      4. opencl

      Qualifiers

      • Research-article

      Conference

      FPGA '17

      Acceptance Rates

      FPGA '17 Paper Acceptance Rate: 25 of 101 submissions, 25%
      Overall Acceptance Rate: 125 of 627 submissions, 20%
