Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3243176.3243196acmconferencesArticle/Chapter ViewAbstractPublication PagespactConference Proceedingsconference-collections
research-article

Cost-driven thread coarsening for GPU kernels

Published: 01 November 2018 Publication History

Abstract

Directive-based programming models like OpenACC provide a higher level abstraction and low overhead approach of porting existing applications to GPGPUs and other heterogeneous HPC hardware. Such programming models increase the design space exploration possible at the compiler level to exploit specific features of different architectures. We observed that traditional applications designed for latency optimized out-of-order pipelined CPUs do not exploit the throughput optimized in-order pipelined GPU architecture efficiently. In this paper we develop a model to estimate the memory throughput of a given application. Then we use the loop interleave transformation to improve the memory bandwidth utilization of a given kernel.
We developed a heuristic to estimate the optimal loop interleave factor, and implemented it in the OpenARC compiler for OpenACC. We evaluated our approach on over 216 kernels to achieve a Geo-mean speedup of 1.32×.
Our compiler optimization aims to provide the right balance between performance, portability and productivity.

References

[1]
2018. OpenACC. https://www.openacc.org/
[2]
2018. OpenMP. https://www.openmp.org/
[3]
Hansang Bae, Dheya Mustafa, Jae-Woo Lee, Aurangzeb, Hao Lin, Chirag Dave, Rudolf Eigenmann, and Samuel P. Midkiff. 2013. The Cetus Source-to-Source Compiler Infrastructure: Overview and Evaluation. Int. J. Parallel Program. 41, 6 (Dec. 2013), 753--767.
[4]
D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks---Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158--165.
[5]
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC) (IISWC '09). IEEE Computer Society, Washington, DC, USA, 44--54.
[6]
C. Cummins, P. Petoumenos, Z. Wang, and H. Leather. 2017. End-to-End Deep Learning of Optimization Heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 219--232.
[7]
Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst. 9, 3 (July 1987), 319--349.
[8]
Sunpyo Hong and Hyesoon Kim. 2009. An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09). ACM, New York, NY, USA, 152--163.
[9]
Q. Jia and H. Zhou. 2016. Tuning Stencil codes in OpenCL for FPGAs. In 2016 IEEE 34th International Conference on Computer Design (ICCD). 249--256.
[10]
Jungwon Kim, Seyong Lee, and Jeffrey S. Vetter. 2015. An OpenACC-based Unified Programming Model for Multi-accelerator Systems. SIGPLAN Not. 50, 8 (Jan. 2015), 257--258.
[11]
S. Lee and J. S. Vetter. 2014. OpenARC: Extensible OpenACC Compiler Framework for Directive-Based Accelerator Programming Study. In 2014 First Workshop on Accelerator Programming using Directives. 1--11.
[12]
Seyong Lee and Jeffrey S Vetter. 2014. OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing. In HPDC Proceedings of the ACM Symposium on High-Performance Parallel and Distributed Computing, Short Paper.
[13]
John D. C. Little. 2011. OR FORUM---Little's Law As Viewed on Its 50th Anniversary. Oper. Res. 59, 3 (May 2011), 536--549.
[14]
Alberto Magni, Christophe Dubach, and Michael O'Boyle. 2014. Automatic Optimization of Thread-coarsening for Graphics Processors. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM, New York, NY, USA, 455--466.
[15]
Alberto Magni, Christophe Dubach, and Michael F. P. O'Boyle. 2013. A Large-scale Cross-architecture Evaluation of Thread-coarsening. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13). ACM, New York, NY, USA, Article 11, 11 pages.
[16]
NVIDIA. 2018. Cuda Programming Guide. (2018). http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[17]
Vivek Sarkar. 2000. Optimized Unrolling of Nested Loops. In Proceedings of the 14th International Conference on Super-computing (ICS '00). ACM, New York, NY, USA, 153--166.
[18]
Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard Vuduc. 2012. A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12). ACM, New York, NY, USA, 11--22.
[19]
John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. IEEE Des. Test 12, 3 (May 2010), 66--73.
[20]
Swapneela Unkule, Christopher Shaltz, and Apan Qasem. 2012. Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality. In Proceedings of the 21st International Conference on Compiler Construction (CC'12). Springer-Verlag, Berlin, Heidelberg, 21--40.
[21]
Vasily Volkov. 2016. Understanding Latency Hiding on GPUs. Ph.D. Dissertation. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html
[22]
V. Volkov and J.W. Demmel. 2008. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE Press, 111.

Cited By

View all
  • (2024)GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00011(1-16)Online publication date: 29-Jun-2024
  • (2024)Retargeting and Respecializing GPU Workloads for Performance PortabilityProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO57630.2024.10444828(119-132)Online publication date: 2-Mar-2024
  • (2021)JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)10.1109/HiPC53243.2021.00032(182-191)Online publication date: Dec-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
November 2018
494 pages
ISBN:9781450359863
DOI:10.1145/3243176
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IFIP WG 10.3: IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 November 2018

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

PACT '18
Sponsor:

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)26
  • Downloads (Last 6 weeks)2
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00011(1-16)Online publication date: 29-Jun-2024
  • (2024)Retargeting and Respecializing GPU Workloads for Performance PortabilityProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO57630.2024.10444828(119-132)Online publication date: 2-Mar-2024
  • (2021)JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)10.1109/HiPC53243.2021.00032(182-191)Online publication date: Dec-2021
  • (2020)GOPipeProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414656(43-54)Online publication date: 30-Sep-2020
  • (2020)Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00108(1018-1029)Online publication date: May-2020

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media