research-article
Open access

A performance and energy comparison of convolution on GPUs, FPGAs, and multicore processors

Published: 20 January 2013

Abstract

Recent architectural trends have focused on increased parallelism via multicore processors and increased heterogeneity via accelerator devices (e.g., graphics-processing units, field-programmable gate arrays). Although these architectures have significant performance and energy potential, application designers face many device-specific challenges when choosing an appropriate accelerator or when customizing an algorithm for an accelerator. To help address this problem, in this article we thoroughly evaluate convolution, one of the most common operations in digital-signal processing, on multicores, graphics-processing units, and field-programmable gate arrays. Whereas many previous application studies evaluate a specific usage of an application, this article assists designers with design space exploration for numerous use cases by analyzing effects of different input sizes, different algorithms, and different devices, while also determining Pareto-optimal trade-offs between performance and energy.
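As a rough illustration of two ideas the abstract names, the sketch below shows a textbook time-domain convolution and a simple Pareto filter over (execution time, energy) pairs. It is not taken from the article: the function names `convolve_direct` and `pareto_front`, the tuple layout, and the assumption that lower time and lower energy are both better are hypothetical choices made for this example only.

```python
def convolve_direct(signal, kernel):
    """Full linear convolution from the textbook definition:
    y[n] = sum over k of signal[k] * kernel[n - k].
    Illustrative only; not the implementation evaluated in the article."""
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            # Each input sample contributes to output index i + j.
            out[i + j] += s * k
    return out


def pareto_front(points):
    """Return the names of the non-dominated points.

    `points` is a list of (name, time, energy) tuples (a hypothetical
    layout). A point is dominated when some other point is no worse in
    both time and energy and strictly better in at least one of them.
    """
    front = []
    for name, t, e in points:
        dominated = any(
            (t2 <= t and e2 <= e) and (t2 < t or e2 < e)
            for _, t2, e2 in points
        )
        if not dominated:
            front.append(name)
    return front
```

The direct form costs on the order of len(signal) × len(kernel) multiply-adds, which is one reason input size and choice of algorithm interact. The Pareto filter keeps every configuration that no other configuration beats on both axes at once.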

Published In

ACM Transactions on Architecture and Code Optimization  Volume 9, Issue 4
Special Issue on High-Performance Embedded Architectures and Compilers
January 2013
876 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2400682
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 January 2013
Accepted: 01 August 2012
Revised: 01 May 2012
Received: 01 July 2011
Published in TACO Volume 9, Issue 4

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 162
  • Downloads (Last 6 weeks): 29
Reflects downloads up to 10 Nov 2024

Citations

Cited By

  • (2023) Review of Energy-Efficient Embedded System Acceleration of Convolution Neural Networks for Organic Weeding Robots. Agriculture 13:11 (2103). DOI: 10.3390/agriculture13112103. Online publication date: 6-Nov-2023
  • (2023) Bringing Energy Efficiency Closer to Application Developers: An Extensible Software Analysis Framework. IEEE Transactions on Sustainable Computing 8:2 (180-193). DOI: 10.1109/TSUSC.2022.3222409. Online publication date: 1-Apr-2023
  • (2022) Design of high‐speed software defined radar with GPU accelerator. IET Radar, Sonar & Navigation 16:7 (1083-1094). DOI: 10.1049/rsn2.12244. Online publication date: 28-Feb-2022
  • (2021) Mixed Noise Estimation Model for Optimized Kernel Minimum Noise Fraction Transformation in Hyperspectral Image Dimensionality Reduction. Remote Sensing 13:13 (2607). DOI: 10.3390/rs13132607. Online publication date: 2-Jul-2021
  • (2020) GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory. ACM Transactions on Architecture and Code Optimization 17:3 (1-20). DOI: 10.1145/3394116. Online publication date: 3-Aug-2020
  • (2020) Fast Calculation of Cross-Correlation Function with Video Cards in Coherent Radar. 2020 9th Mediterranean Conference on Embedded Computing (MECO) (1-5). DOI: 10.1109/MECO49872.2020.9134357. Online publication date: Jun-2020
  • (2019) Layup. ACM Transactions on Architecture and Code Optimization 16:4 (1-23). DOI: 10.1145/3357238. Online publication date: 11-Oct-2019
  • (2019) FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL. ACM Transactions on Reconfigurable Technology and Systems 11:4 (1-25). DOI: 10.1145/3268933. Online publication date: 9-Jan-2019
  • (2018) Scalable Window Generation for the Intel Broadwell+Arria 10 and High-Bandwidth FPGA Systems. Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (173-182). DOI: 10.1145/3174243.3174262. Online publication date: 15-Feb-2018
  • (2018) Optimisation of Convolution of Multiple Different Sized Filters in SKA Pulsar Search Engine. 2018 International Conference on Field-Programmable Technology (FPT) (358-361). DOI: 10.1109/FPT.2018.00073. Online publication date: Dec-2018
