A Tradeoff Analysis of FPGAs, GPUs, and Multicores for Sliding-Window Applications

Published: 06 March 2015

Abstract

The increasing usage of hardware accelerators such as Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) has significantly increased application design complexity. Such complexity results from a larger design space created by numerous combinations of accelerators, algorithms, and hw/sw partitions. Exploration of this increased design space is critical due to widely varying performance and energy consumption for each accelerator when used for different application domains and different use cases. To address this problem, numerous studies have evaluated specific applications across different architectures. In this article, we analyze an important domain of applications, referred to as sliding-window applications, implemented on FPGAs, GPUs, and multicore CPUs. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that, for large input sizes, FPGAs can achieve speedups of up to 5.6× and 58× compared to GPUs and multicore CPUs, respectively, while also using up to an order of magnitude less energy. For small input sizes and applications with frequency-domain algorithms, GPUs generally provide the best performance and energy.
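As a rough illustration of what a sliding-window computation looks like, the sketch below slides a small kernel over a 2D image and computes the sum of absolute differences (SAD) at every fully overlapping position. SAD is used here only as one common example of the pattern; the function name `sliding_window_sad`, the list-of-lists image layout, and the sample data are assumptions made for illustration and are not taken from the article.

```python
def sliding_window_sad(image, kernel):
    """Return the SAD of `kernel` against every fully overlapping window of `image`.

    `image` and `kernel` are rectangular lists of lists of numbers
    (a hypothetical layout chosen for this sketch).
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    # One output value per window position; windows overlap heavily,
    # so neighboring positions reuse most of the same input pixels.
    for y in range(img_h - k_h + 1):
        row = []
        for x in range(img_w - k_w + 1):
            sad = 0
            for ky in range(k_h):
                for kx in range(k_w):
                    sad += abs(image[y + ky][x + kx] - kernel[ky][kx])
            row.append(sad)
        out.append(row)
    return out


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
kernel = [
    [6, 7],
    [10, 11],
]
result = sliding_window_sad(image, kernel)
print(result)  # [[20, 16, 12], [4, 0, 4]]
```

Each output value depends only on its own window, so all window positions can be computed independently. That independence, together with the input reuse between neighboring windows, is what makes this pattern a natural target for parallel devices.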

      Published In

      ACM Transactions on Reconfigurable Technology and Systems  Volume 8, Issue 1
      February 2015
      127 pages
      ISSN:1936-7406
      EISSN:1936-7414
      DOI:10.1145/2744082
      • Editor: Steve Wilton
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 06 March 2015
      Accepted: 01 July 2014
      Revised: 01 May 2014
      Received: 01 December 2013
      Published in TRETS Volume 8, Issue 1

      Author Tags

      1. FPGA
      2. GPU
      3. multicore
      4. parallelism
      5. sliding window
      6. speedup

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Science Foundation

      Cited By

      • (2021) FPGA Implementations of Algorithms for Preprocessing of High Frame Rate and High Resolution Image Streams in Real Time. Annals of Emerging Technologies in Computing 5:2 (50-61). DOI: 10.33166/AETiC.2021.02.005. Online publication date: 1-Apr-2021
      • (2021) ShuntFlowPlus: An Efficient and Scalable Dataflow Accelerator Architecture for Stream Applications. ACM Journal on Emerging Technologies in Computing Systems 17:4 (1-24). DOI: 10.1145/3453164. Online publication date: 30-Jun-2021
      • (2021) HBM Connect: High-Performance HLS Interconnect for FPGA HBM. The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (116-126). DOI: 10.1145/3431920.3439301. Online publication date: 17-Feb-2021
      • (2021) cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions. 2021 IEEE International Conference on Cluster Computing (CLUSTER) (307-319). DOI: 10.1109/Cluster48925.2021.00065. Online publication date: Sep-2021
      • (2020) A Survey on Parallel Architectures and Programming Models. 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO) (999-1005). DOI: 10.23919/MIPRO48935.2020.9245341. Online publication date: 28-Sep-2020
      • (2020) Architecturally truly diverse systems: A review. Future Generation Computer Systems. DOI: 10.1016/j.future.2020.03.061. Online publication date: Apr-2020
      • (2019) ViPar. International Journal of Reconfigurable Computing 2019. DOI: 10.1155/2019/4298013. Online publication date: 1-Jan-2019
      • (2019) ShuntFlow. Proceedings of the 56th Annual Design Automation Conference 2019 (1-6). DOI: 10.1145/3316781.3317910. Online publication date: 2-Jun-2019
      • (2019) Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels. 2019 IEEE International Conference on Embedded Software and Systems (ICESS) (1-8). DOI: 10.1109/ICESS.2019.8782524. Online publication date: Jun-2019
      • (2019) OpenStack Generalization for Hardware Accelerated Clouds. 2019 28th International Conference on Computer Communication and Networks (ICCCN) (1-8). DOI: 10.1109/ICCCN.2019.8847115. Online publication date: Jul-2019
