Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3373087.3375321acmconferencesArticle/Chapter ViewAbstractPublication PagesfpgaConference Proceedingsconference-collections
short-paper
Open access

End-to-End Optimization of Deep Learning Applications

Published: 24 February 2020 Publication History

Abstract

The irregularity of recent Convolutional Neural Network (CNN) models such as less data reuse and parallelism due to the extensive network pruning and simplification creates new challenges for FPGA acceleration. Furthermore, without proper optimization, there could be significant overheads when integrating FPGAs into existing machine learning frameworks like TensorFlow. Such a problem is mostly overlooked by previous studies. However, our study shows that a naive FPGA integration into TensorFlow could lead to up to 8.45x performance degradation. To address the challenges mentioned above, we propose several SW/HW co-design approaches to perform the end-to-end optimization of deep learning applications. We present a flexible and composable architecture called FlexCNN. It can deliver high computation efficiency for different types of convolution layers using techniques including dynamic tiling and data layout optimization. FlexCNN is further integrated into the TensorFlow framework with a fully-pipelined software-hardware integration flow. This alleviates the high overheads of TensorFlow-FPGA handshake and other non-CNN processing stages. We use OpenPose, a popular CNN-based application for human pose recognition, as a case study. Experimental results show that with the FlexCNN architecture optimizations, we can achieve 2.3x performance improvement. The pipelined integration stack leads to a further 5x speedup. Overall, the SW/HW co-optimization produces a speedup of 11.5x and results in an end-to-end performance of 23.8FPS for OpenPose with floating-point precision, which is the highest performance reported for this application on FPGA in the literature.

References

[1]
Mart'in Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et almbox. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 265--283.
[2]
Jinguji Akira, Tomoya Fujii, Shimpei Sato, and Hiroki Nakahara. 2018. An FPGA Realization of OpenPose Based on a Sparse Weight Convolutional Neural Network. In 2018 International Conference on Field-Programmable Technology (FPT). IEEE, 310--313.
[3]
Lin Bai, Yiming Zhao, and Xinming Huang. 2018. A CNN accelerator on FPGA using depthwise separable convolution. IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 65, 10 (2018), 1415--1419.
[4]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . 7291--7299.
[5]
Yu-Ting Chen, Jason Cong, Zhenman Fang, Jie Lei, and Peng Wei. 2016. When Spark meets FPGAs: A case study for next-generation DNA sequencing acceleration. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) .
[6]
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[7]
Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with optimized dataflow architecture. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1--8.
[8]
Jason Cong and Jie Wang. 2018. PolySA: Polyhedral-based systolic array auto-compilation. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1--8.
[9]
Jason Cong, Peng Wei, and Cody Hao Yu. 2018. From JVM to FPGA: Bridging abstraction hierarchy via optimized deep pipelining. In 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) .
[10]
Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 152--159.
[11]
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[12]
Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: Running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV). 0--0.
[13]
Ildoo Kim. 2018. tf-pose-estimation. https://github.com/ildoonet/tf-pose-estimation . (2018).
[14]
Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 1--9.
[15]
De G Matthews, G Alexander, Mark Van Der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. 2017. GPflow: A Gaussian process library using TensorFlow. The Journal of Machine Learning Research, Vol. 18, 1 (2017), 1299--1304.
[16]
Daniel H Noronha, Bahar Salehpour, and Steven JE Wilton. 2018. LeFlow: Enabling flexible FPGA high-level synthesis of TensorFlow deep neural networks. In FSP Workshop 2018; Fifth International Workshop on FPGAs for Software Programmers. VDE, 1--8.
[17]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510--4520.
[18]
Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 535--547.
[19]
Laurent Sifre and Stéphane Mallat. 2014. Rigid-motion scattering for image classification. Ph. D. dissertation (2014).
[20]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[21]
Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 16--25.
[22]
Xuechao Wei, Yun Liang, Xiuhong Li, Cody Hao Yu, Peng Zhang, and Jason Cong. 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1--8.
[23]
Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 29.
[24]
Xilinx. 2018. Vivado Design Suite User Guide - High-Level Synthesis (UG902). (2018).
[25]
Xuan Yang, Mingyu Gao, Jing Pu, Ankita Nayak, Qiaoyi Liu, Steven Emberton Bell, Jeff Ou Setter, Kaidi Cao, Heonjae Ha, Christos Kozyrakis, et almbox. 2018. DNN dataflow choice is overrated. arXiv preprint arXiv:1809.04070 (2018).
[26]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 161--170.
[27]
Chen Zhang, Guangyu Sun, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2018a. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018).
[28]
Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. 2018b. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design. ACM, 56.

Cited By

View all
  • (2024)SA4: A Comprehensive Analysis and Optimization of Systolic Array Architecture for 4-bit Convolutions2024 34th International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL64840.2024.00036(204-212)Online publication date: 2-Sep-2024
  • (2023)Towards a comprehensive benchmark for high-level synthesis targeted to FPGAsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668084(45288-45299)Online publication date: 10-Dec-2023
  • (2023)TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical DesignACM Transactions on Reconfigurable Technology and Systems10.1145/360933516:4(1-31)Online publication date: 18-Sep-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2020
346 pages
ISBN:9781450370998
DOI:10.1145/3373087
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. cnn
  2. fpga
  3. integration
  4. openpose
  5. tensorflow
  6. tiling

Qualifiers

  • Short-paper

Funding Sources

Conference

FPGA '20
Sponsor:

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)497
  • Downloads (Last 6 weeks)153
Reflects downloads up to 18 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)SA4: A Comprehensive Analysis and Optimization of Systolic Array Architecture for 4-bit Convolutions2024 34th International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL64840.2024.00036(204-212)Online publication date: 2-Sep-2024
  • (2023)Towards a comprehensive benchmark for high-level synthesis targeted to FPGAsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668084(45288-45299)Online publication date: 10-Dec-2023
  • (2023)TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical DesignACM Transactions on Reconfigurable Technology and Systems10.1145/360933516:4(1-31)Online publication date: 18-Sep-2023
  • (2023)Efficient Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors Targeting Neural Network Inference and Utilizing High-level SynthesisACM Transactions on Reconfigurable Technology and Systems10.1145/355954316:2(1-25)Online publication date: 11-Mar-2023
  • (2023)Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient SolverProceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays10.1145/3543622.3573182(247-258)Online publication date: 12-Feb-2023
  • (2023)Agamotto: A Performance Optimization Framework for CNN Accelerator With Row Stationary DataflowIEEE Transactions on Circuits and Systems I: Regular Papers10.1109/TCSI.2023.325841170:6(2487-2496)Online publication date: Jun-2023
  • (2023)Algorithm/Hardware Co-Optimization for Sparsity-Aware SpMM Acceleration of GNNsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.328171442:12(4763-4776)Online publication date: Dec-2023
  • (2023)AOS: An Automated Overclocking System for High-Performance CNN Accelerator Through Timing Delay Measurement on FPGAIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.323580342:9(2952-2965)Online publication date: Sep-2023
  • (2023)Robust GNN-Based Representation Learning for HLS2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD)10.1109/ICCAD57390.2023.10323853(1-9)Online publication date: 28-Oct-2023
  • (2023)PASS: Exploiting Post-Activation Sparsity in Streaming Architectures for CNN Acceleration2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL60245.2023.00049(288-293)Online publication date: 4-Sep-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media