short-paper

Open access

End-to-End Optimization of Deep Learning Applications

Authors:

Atefeh Sohrabizadeh,

Jason Cong

FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Pages 133 - 139

https://doi.org/10.1145/3373087.3375321

Published: 24 February 2020

Abstract

Recent Convolutional Neural Network (CNN) models are irregular: extensive network pruning and simplification leave them with less data reuse and parallelism, which creates new challenges for FPGA acceleration. Furthermore, without proper optimization, integrating FPGAs into existing machine learning frameworks such as TensorFlow can incur significant overheads. Previous studies have mostly overlooked this problem, yet our study shows that a naive FPGA integration into TensorFlow can degrade performance by up to 8.45x. To address these challenges, we propose several SW/HW co-design approaches for end-to-end optimization of deep learning applications. We present a flexible and composable architecture called FlexCNN, which delivers high computation efficiency for different types of convolution layers using techniques including dynamic tiling and data layout optimization. FlexCNN is further integrated into the TensorFlow framework with a fully pipelined software-hardware integration flow, which alleviates the high overheads of the TensorFlow-FPGA handshake and other non-CNN processing stages. We use OpenPose, a popular CNN-based application for human pose recognition, as a case study. Experimental results show that the FlexCNN architecture optimizations achieve a 2.3x performance improvement, and the pipelined integration stack yields a further 5x speedup. Overall, the SW/HW co-optimization produces a speedup of 11.5x and an end-to-end performance of 23.8 FPS for OpenPose with floating-point precision, which is the highest performance reported for this application on FPGA in the literature.
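The abstract's argument rests on two simple relationships: overlapping the stages of a per-frame workflow raises throughput to the rate of the slowest stage, and independent speedups multiply. The Python sketch below illustrates both under stated assumptions. The stage names and timings are hypothetical and are not measurements from the paper; the only figures taken from the abstract are the 2.3x and 5x factors and their 11.5x product. The function names are likewise illustrative and do not correspond to any FlexCNN or TensorFlow API.

```python
# Illustrative throughput model only. Stage timings are hypothetical,
# not measurements reported for FlexCNN or OpenPose.

def sequential_fps(stage_seconds):
    """Frames per second when every frame runs all stages back to back."""
    return 1.0 / sum(stage_seconds)


def pipelined_fps(stage_seconds):
    """Steady-state frames per second when stages overlap across frames.

    In a full pipeline the slowest stage sets the output rate.
    """
    return 1.0 / max(stage_seconds)


def combined_speedup(*factors):
    """Speedups applied one after another multiply."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


# Hypothetical per-frame stage times in seconds:
# pre-processing, host-to-FPGA transfer, CNN on FPGA, post-processing.
stages = [0.02, 0.01, 0.04, 0.03]

if __name__ == "__main__":
    print("sequential FPS:", round(sequential_fps(stages), 2))
    print("pipelined FPS:", round(pipelined_fps(stages), 2))
    # Factors quoted in the abstract: architecture (2.3x) and pipelining (5x).
    print("combined speedup:", round(combined_speedup(2.3, 5.0), 2))
```

With these made-up timings the pipelined rate is bounded by the 0.04 s stage rather than by the 0.1 s total, which is the general reason a pipelined integration flow can hide handshake and non-CNN processing time.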

Information & Contributors

Information

Published In

FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 2020

346 pages

ISBN:9781450370998

DOI:10.1145/3373087

General Chair: Stephen Neuendorffer, Xilinx, USA

Program Chair: Lesley Shannon, Simon Fraser University, Canada

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

NSF
Intel

Conference

FPGA '20

Sponsor:

SIGDA

FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 23 - 25, 2020

Seaside, CA, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%
