A large-scale cross-architecture evaluation of thread-coarsening

Authors:

Christophe Dubach,

Michael F. P. O'Boyle

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 11, Pages 1 - 11

https://doi.org/10.1145/2503210.2503268

Published: 17 November 2013

Abstract

OpenCL has become the de facto data-parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed to guarantee program portability across hardware from different vendors. However, achieving good performance is hard: it requires manual tuning of the program and expert knowledge of each target device.

In this paper we consider a data-parallel compiler transformation, thread-coarsening, and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters, giving over 43,000 different experiments. We achieve speedups of over 9x on individual applications and average speedups ranging from 1.15x on the Nvidia Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyse and explain program performance in terms of hardware-based performance counters.
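As a rough illustration of what thread-coarsening means, the sketch below simulates a one-dimensional kernel launch in plain Python and compares an original kernel with a coarsened one. It is not the paper's compiler or its OpenCL output: the kernel (`square_kernel`), the `launch` helper, the `factor` parameter and the strided index mapping are all assumptions made for this example.

```python
def square_kernel(gid, src, dst):
    """Original kernel body: one work-item squares one element."""
    dst[gid] = src[gid] * src[gid]


def square_kernel_coarsened(gid, src, dst, factor, global_size):
    """Coarsened kernel body: one work-item does the work of `factor`
    original work-items. The strided mapping (an assumption of this
    sketch) keeps neighbouring work-items on neighbouring elements."""
    for k in range(factor):
        idx = gid + k * global_size
        dst[idx] = src[idx] * src[idx]


def launch(kernel, global_size, *args):
    """Hypothetical stand-in for a 1-D kernel launch: run the kernel
    body once per work-item id, sequentially."""
    for gid in range(global_size):
        kernel(gid, *args)


def run_original(src):
    """Launch one work-item per input element."""
    dst = [0] * len(src)
    launch(square_kernel, len(src), src, dst)
    return dst


def run_coarsened(src, factor):
    """Launch len(src) / factor work-items, each handling `factor` elements."""
    n = len(src)
    if factor < 1 or n % factor != 0:
        raise ValueError("factor must be positive and divide the input size")
    global_size = n // factor
    dst = [0] * n
    launch(square_kernel_coarsened, global_size, src, dst, factor, global_size)
    return dst
```

With 16 input elements and `factor = 4`, the coarsened version launches 4 work-items instead of 16 and produces the same output. Whether fewer, heavier work-items run faster depends on the device, which is the question the evaluation addresses.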

Index Terms

Software and its engineering > Software notations and tools > Compilers

Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2013

1123 pages

ISBN:9781450323789

DOI:10.1145/2503210

General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC13

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 72

Total Downloads: 687

Downloads (Last 12 months): 14

Downloads (Last 6 weeks): 2

Reflects downloads up to 21 Sep 2024

Cited By

Ivanov I, Zinenko O, Domke J, Endo T, Moses W (2024). Retargeting and Respecializing GPU Workloads for Performance Portability. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 119-132. DOI: 10.1109/CGO57630.2024.10444828. Online publication date: 2-Mar-2024.
https://doi.org/10.1109/CGO57630.2024.10444828
Bastidas Fuertes A, Pérez M, Meza Hormaza J (2023). Transpilers: A Systematic Mapping Review of Their Usage in Research and Industry. Applied Sciences, 13:6, article 3667. DOI: 10.3390/app13063667. Online publication date: 13-Mar-2023.
https://doi.org/10.3390/app13063667
Andelfinger P, Cai W (2022). Advanced Tutorial. Proceedings of the Winter Simulation Conference, pages 268-282. DOI: 10.5555/3586210.3586232. Online publication date: 11-Dec-2022.
https://dl.acm.org/doi/10.5555/3586210.3586232
VenkataKeerthy S, Andaluri Y, Dey S, Shah R, Tammana P, Upadrasta R (2022). Packet Processing Algorithm Identification using Program Embeddings. Proceedings of the 6th Asia-Pacific Workshop on Networking, pages 76-82. DOI: 10.1145/3542637.3542649. Online publication date: 1-Jul-2022.
https://dl.acm.org/doi/10.1145/3542637.3542649
Andelfinger P, Cai W (2022). Advanced Tutorial: Parallel and Distributed Methods for Scalable Discrete Simulation. 2022 Winter Simulation Conference (WSC), pages 268-282. DOI: 10.1109/WSC57314.2022.10015291. Online publication date: 11-Dec-2022.
https://doi.org/10.1109/WSC57314.2022.10015291
Rafi M, Qasem A (2022). Optimal Launch Bound Selection in CPU-GPU Hybrid Graph Applications with Deep Learning. 2022 IEEE 13th International Green and Sustainable Computing Conference (IGSC), pages 1-7. DOI: 10.1109/IGSC55832.2022.9969364. Online publication date: 24-Oct-2022.
https://doi.org/10.1109/IGSC55832.2022.9969364
Sadrosadati M, Mirhosseini A, Hajiabadi A, Ehsani S, Falahati H, Sarbazi-Azad H, Drumond M, Falsafi B, Ausavarungnirun R, Mutlu O (2021). Highly Concurrent Latency-tolerant Register Files for GPUs. ACM Transactions on Computer Systems, 37:1-4, pages 1-36. DOI: 10.1145/3419973. Online publication date: 4-Jan-2021.
https://dl.acm.org/doi/10.1145/3419973
Alcaide S, Kosmidis L, Hernandez C, Abella J (2021). Achieving diverse redundancy for GPU Kernels. IEEE Transactions on Emerging Topics in Computing, pages 1-1. DOI: 10.1109/TETC.2021.3101922. Online publication date: 2021.
https://doi.org/10.1109/TETC.2021.3101922
Zarch M, Neff R, Becchi M (2021). Exploring Thread Coarsening on FPGA. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 436-441. DOI: 10.1109/HiPC53243.2021.00062. Online publication date: Dec-2021.
https://doi.org/10.1109/HiPC53243.2021.00062
Zheng R, Pai S, Lee J (2021). Efficient execution of graph algorithms on CPU with SIMD extensions. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, pages 262-276. DOI: 10.1109/CGO51591.2021.9370326. Online publication date: 27-Feb-2021.
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370326
