research-article

Open access

A performance and energy comparison of convolution on GPUs, FPGAs, and multicore processors

Authors:

Greg Stitt

ACM Transactions on Architecture and Code Optimization (TACO), Volume 9, Issue 4

Article No.: 25, Pages 1 - 21

https://doi.org/10.1145/2400682.2400684

Published: 20 January 2013

Abstract

Recent architectural trends have focused on increased parallelism via multicore processors and increased heterogeneity via accelerator devices (e.g., graphics-processing units, field-programmable gate arrays). Although these architectures have significant performance and energy potential, application designers face many device-specific challenges when choosing an appropriate accelerator or when customizing an algorithm for an accelerator. To help address this problem, in this article we thoroughly evaluate convolution, one of the most common operations in digital-signal processing, on multicores, graphics-processing units, and field-programmable gate arrays. Whereas many previous application studies evaluate a specific usage of an application, this article assists designers with design space exploration for numerous use cases by analyzing effects of different input sizes, different algorithms, and different devices, while also determining Pareto-optimal trade-offs between performance and energy.
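
To make the notion of a Pareto-optimal trade-off between performance and energy concrete, the following is a minimal Python sketch, not the article's method. It assumes each candidate implementation is summarized by one measured execution time and one measured energy, both to be minimized. The `Design` type, the `pareto_front` helper, and the sample values are hypothetical and are not taken from the article.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    """One candidate implementation (hypothetical record layout)."""
    name: str        # illustrative label only
    time_s: float    # execution time in seconds, lower is better
    energy_j: float  # energy in joules, lower is better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design beats on both time and energy."""
    front = []
    best_energy = float("inf")
    # Visit designs from fastest to slowest; ties in time are ordered by energy.
    for d in sorted(designs, key=lambda d: (d.time_s, d.energy_j)):
        # A slower design stays on the front only if it uses strictly less
        # energy than every faster (or equally fast) design seen so far.
        if d.energy_j < best_energy:
            front.append(d)
            best_energy = d.energy_j
    return front


# Made-up measurements, for illustration only.
designs = [
    Design("A", time_s=4.0, energy_j=400.0),
    Design("B", time_s=1.0, energy_j=250.0),
    Design("C", time_s=2.0, energy_j=60.0),
    Design("D", time_s=3.0, energy_j=300.0),
]

front_names = [d.name for d in pareto_front(designs)]
print(front_names)
```

In this made-up data set, designs A and D are dominated because C is both faster and more energy-efficient than either, so only B and C remain as trade-off points a designer would need to choose between.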

Published In

ACM Transactions on Architecture and Code Optimization, Volume 9, Issue 4

Special Issue on High-Performance Embedded Architectures and Compilers

January 2013

876 pages

ISSN: 1544-3566

EISSN: 1544-3973

DOI: 10.1145/2400682

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 January 2013

Accepted: 01 August 2012

Revised: 01 May 2012

Received: 01 July 2011

Qualifiers

Research-article
Research
Refereed

Funding Sources

Division of Computer and Network Systems
