research-article

Cost-driven thread coarsening for GPU kernels

Authors: Prithayan Barua, Vivek Sarkar

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Article No.: 32, Pages 1 - 14

https://doi.org/10.1145/3243176.3243196

Published: 01 November 2018

Abstract

Directive-based programming models like OpenACC provide a higher-level abstraction and a low-overhead approach to porting existing applications to GPGPUs and other heterogeneous HPC hardware. Such programming models widen the design space that the compiler can explore to exploit specific features of different architectures. We observed that traditional applications designed for latency-optimized, out-of-order pipelined CPUs do not exploit the throughput-optimized, in-order pipelined GPU architecture efficiently. In this paper, we develop a model to estimate the memory throughput of a given application. We then use the loop interleave transformation to improve the memory bandwidth utilization of a given kernel.
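
As a rough illustration of what a loop interleave (thread coarsening) transformation can look like, the C sketch below rewrites a simple OpenACC loop so that each logical thread handles several iterations. The function names, the SAXPY loop body, the `factor` parameter, and the strided assignment of iterations to threads are all assumptions made for this example; the abstract does not describe how OpenARC actually performs the transformation.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: each loop iteration becomes one logical GPU thread.
 * Without an OpenACC compiler the pragmas are ignored and the code
 * runs sequentially with the same result. */
void saxpy_baseline(size_t n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Interleaved (coarsened) variant: `factor` is a hypothetical
 * interleave factor. Only ceil(n / factor) logical threads are
 * created, and each one issues up to `factor` independent memory
 * accesses. */
void saxpy_interleaved(size_t n, float a, const float *x, float *y,
                       size_t factor)
{
    assert(factor > 0);
    size_t threads = (n + factor - 1) / factor; /* ceil(n / factor) */

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t t = 0; t < threads; ++t) {
        /* Thread t handles t, t + threads, t + 2 * threads, ...
         * so neighbouring threads still touch neighbouring addresses. */
        #pragma acc loop seq
        for (size_t k = 0; k < factor; ++k) {
            size_t i = t + k * threads;
            if (i < n) /* guard the tail when factor does not divide n */
                y[i] = a * x[i] + y[i];
        }
    }
}
```

Both functions compute the same result for any `factor > 0`. The strided assignment used here is one common choice for keeping accesses from adjacent threads contiguous; a blocked assignment, where thread `t` handles `t * factor` through `t * factor + factor - 1`, would be an equally valid interleaving with different memory behaviour.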

We developed a heuristic to estimate the optimal loop interleave factor and implemented it in the OpenARC compiler for OpenACC. We evaluated our approach on over 216 kernels and achieved a geometric-mean speedup of 1.32×.
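
For readers unfamiliar with the metric, the geometric mean of per-kernel speedups is conventionally defined as below. The symbols are assumptions introduced here: $N$ is the number of kernels, and $s_i$ is the speedup of kernel $i$, taken to be baseline time divided by optimized time. The abstract does not state how the individual speedups were measured.

```latex
\[
  \mathrm{GM}
    = \Bigl(\prod_{i=1}^{N} s_i\Bigr)^{1/N}
    = \exp\Bigl(\frac{1}{N}\sum_{i=1}^{N} \ln s_i\Bigr),
  \qquad
  s_i = \frac{t_i^{\mathrm{base}}}{t_i^{\mathrm{opt}}}
\]
```

As a made-up example, two kernels with speedups 2.0 and 0.5 have a geometric mean of 1.0, whereas their arithmetic mean would be 1.25. This is why the geometric mean is usually preferred for summarizing ratios.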

Our compiler optimization aims to provide the right balance between performance, portability and productivity.

Index Terms

Cost-driven thread coarsening for GPU kernels
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

November 2018

494 pages

ISBN:9781450359863

DOI:10.1145/3243176

General Chair: Skevos Evripidou (University of Cyprus, Cyprus)

Program Chairs: Per Stenström (Chalmers University of Technology, Sweden), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IFIP WG 10.3
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '18: International Conference on Parallel Architectures and Compilation Techniques

November 1 - 4, 2018

Limassol, Cyprus

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
