A Tradeoff Analysis of FPGAs, GPUs, and Multicores for Sliding-Window Applications

Authors:

Greg Stitt

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 8, Issue 1

Article No.: 2, Pages 1 - 24

https://doi.org/10.1145/2659000

Published: 06 March 2015

Abstract

The increasing usage of hardware accelerators such as Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) has significantly increased application design complexity. Such complexity results from a larger design space created by numerous combinations of accelerators, algorithms, and hw/sw partitions. Exploration of this increased design space is critical due to widely varying performance and energy consumption for each accelerator when used for different application domains and different use cases. To address this problem, numerous studies have evaluated specific applications across different architectures. In this article, we analyze an important domain of applications, referred to as sliding-window applications, implemented on FPGAs, GPUs, and multicore CPUs. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that, for large input sizes, FPGAs can achieve speedups of up to 5.6× and 58× compared to GPUs and multicore CPUs, respectively, while also using up to an order of magnitude less energy. For small input sizes and applications with frequency-domain algorithms, GPUs generally provide the best performance and energy.
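
To make the term "sliding-window application" concrete, the following minimal sketch slides a small template over a 2D input and computes one score per window position. The sum-of-absolute-differences (SAD) kernel, the function name `sliding_window_sad`, and the sample data are illustrative assumptions, not the article's benchmarks or implementation. Each output position depends only on its own window, which is the kind of independent, regular work that parallel devices can exploit.

```python
def sliding_window_sad(image, template):
    """Return a 2D list of sum-of-absolute-differences (SAD) scores.

    One score is produced for every position where the template fits
    entirely inside the image. SAD is an illustrative kernel chosen for
    this sketch (an assumption), not necessarily the article's workload.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    scores = []
    # Outer two loops: slide the window over every valid position.
    for r in range(img_h - tpl_h + 1):
        row = []
        for c in range(img_w - tpl_w + 1):
            # Inner two loops: compare the window with the template.
            score = 0
            for i in range(tpl_h):
                for j in range(tpl_w):
                    score += abs(image[r + i][c + j] - template[i][j])
            row.append(score)
        scores.append(row)
    return scores


# Hypothetical sample data: a 3x4 image and a 2x2 template.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
template = [
    [6, 7],
    [10, 11],
]

scores = sliding_window_sad(image, template)
print(scores)  # prints [[20, 16, 12], [4, 0, 4]]
```

The score of 0 at row 1, column 1 marks the position where the window matches the template exactly. The work grows with both the number of window positions and the template size, so large inputs quickly become expensive on a single core.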

Index Terms

A Tradeoff Analysis of FPGAs, GPUs, and Multicores for Sliding-Window Applications
1. Computer systems organization
  1. Embedded and cyber-physical systems
  2. Real-time systems

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 8, Issue 1

February 2015

127 pages

ISSN: 1936-7406

EISSN: 1936-7414

DOI: 10.1145/2744082

Editor: Steve Wilton, Department of Electrical and Computer Engineering, University of British Columbia, Kaiser 4112, 5500-2332 Main Mall, Vancouver, BC V6T 1Z4, Canada

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 March 2015

Accepted: 01 July 2014

Revised: 01 May 2014

Received: 01 December 2013

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation

Bibliometrics

Total Citations: 21
Total Downloads: 581
Downloads (last 12 months): 17
Downloads (last 6 weeks): 1

Reflects downloads up to 28 Sep 2024

Cited By

Hudomalj U, Mandla C, Plattner M (2021). FPGA Implementations of Algorithms for Preprocessing of High Frame Rate and High Resolution Image Streams in Real Time. Annals of Emerging Technologies in Computing 5:2, pages 50-61. Online publication date: 1-Apr-2021.
https://doi.org/10.33166/AETiC.2021.02.005

Gong S, Li J, Lu W, Yan G, Li X (2021). ShuntFlowPlus: An Efficient and Scalable Dataflow Accelerator Architecture for Stream Applications. ACM Journal on Emerging Technologies in Computing Systems 17:4, pages 1-24. Online publication date: 30-Jun-2021.
https://dl.acm.org/doi/10.1145/3453164

Choi Y, Chi Y, Qiao W, Samardzic N, Cong J, Shannon L, Adler M (2021). HBM Connect: High-Performance HLS Interconnect for FPGA HBM. The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 116-126. Online publication date: 17-Feb-2021.
https://dl.acm.org/doi/10.1145/3431920.3439301

Yu X, Di S, Gok A, Tao D, Cappello F (2021). cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions. 2021 IEEE International Conference on Cluster Computing (CLUSTER), pages 307-319. Online publication date: Sep-2021.
https://doi.org/10.1109/Cluster48925.2021.00065

Pervan B, Knezovic J (2020). A Survey on Parallel Architectures and Programming Models. 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), pages 999-1005. Online publication date: 28-Sep-2020.
https://doi.org/10.23919/MIPRO48935.2020.9245341

Chamberlain R (2020). Architecturally truly diverse systems: A review. Future Generation Computer Systems. Online publication date: Apr-2020.
https://doi.org/10.1016/j.future.2020.03.061

Ali K, Ben Atitallah R, Ait El Cadi A, Fakhfakh N, Dekeyser J (2019). ViPar. International Journal of Reconfigurable Computing 2019. Online publication date: 1-Jan-2019.
https://dl.acm.org/doi/10.1155/2019/4298013

Gong S, Li J, Lu W, Yan G, Li X (2019). ShuntFlow. Proceedings of the 56th Annual Design Automation Conference 2019, pages 1-6. Online publication date: 2-Jun-2019.
https://dl.acm.org/doi/10.1145/3316781.3317910

Qasaimeh M, Denolf K, Lo J, Vissers K, Zambreno J, Jones P (2019). Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels. 2019 IEEE International Conference on Embedded Software and Systems (ICESS), pages 1-8. Online publication date: Jun-2019.
https://doi.org/10.1109/ICESS.2019.8782524

Erol A, Yazar A, Schmidt E (2019). OpenStack Generalization for Hardware Accelerated Clouds. 2019 28th International Conference on Computer Communication and Networks (ICCCN), pages 1-8. Online publication date: Jul-2019.
https://doi.org/10.1109/ICCCN.2019.8847115
