Research article · Open access

Predictable Thread Coarsening

Published: 12 June 2018

Abstract

Thread coarsening on GPUs combines the work of several threads into one. We show how thread coarsening can be implemented as a fully automated compile-time optimisation that estimates the optimal coarsening factor based on a low-cost, approximate static analysis of cache-line re-use and an occupancy prediction model. We evaluate two coarsening strategies on three different NVIDIA GPU architectures. For NVIDIA's reduction kernels we achieve a maximum speedup of 5.08x, and for the Rodinia benchmarks we achieve a mean speedup of 1.30x across the 8 of the 19 kernels that were determined safe to coarsen.
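
To make the transformation concrete, the sketch below shows thread coarsening by a factor of 2 applied by hand to a simple CUDA vector-add kernel. It illustrates the general technique only, not the compiler pass or the factor-selection model described in the paper; the names vecAdd, vecAddCoarsened, and COARSEN are illustrative.

#include <cuda_runtime.h>

#define COARSEN 2  // illustrative coarsening factor; the paper's analysis would estimate this value statically

// Baseline kernel: one output element per thread.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Coarsened kernel: each thread produces COARSEN elements, so the grid is
// launched with COARSEN times fewer threads. Iterations are strided by
// blockDim.x so that the threads of a warp still touch contiguous,
// coalesced cache lines on every iteration.
__global__ void vecAddCoarsened(const float *a, const float *b, float *c, int n) {
    int base = blockIdx.x * blockDim.x * COARSEN + threadIdx.x;
    #pragma unroll
    for (int k = 0; k < COARSEN; ++k) {
        int i = base + k * blockDim.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

Because the coarsened grid contains COARSEN times fewer threads, a launch such as vecAddCoarsened<<<(n + 256 * COARSEN - 1) / (256 * COARSEN), 256>>>(a, b, c, n) replaces the original one. This reduced thread count is precisely the trade-off the paper's model targets: more work and register pressure per thread against lower occupancy and changed cache-line re-use.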


Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 2
June 2018, 251 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3212710
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2018
Accepted: 01 March 2018
Revised: 01 February 2018
Received: 01 August 2017
Published in TACO Volume 15, Issue 2

Author Tags

  1. GPU
  2. compiler optimisations

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • EPSRC



Cited By

  • (2024) GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1-16. DOI: 10.1109/ISCA59077.2024.00011 (29 June 2024)
  • (2024) Retargeting and Respecializing GPU Workloads for Performance Portability. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 119-132. DOI: 10.1109/CGO57630.2024.10444828 (2 March 2024)
  • (2022) Future aware Dynamic Thermal Management in CPU-GPU Embedded Platforms. 2022 IEEE Real-Time Systems Symposium (RTSS), 396-408. DOI: 10.1109/RTSS55097.2022.00041 (December 2022)
  • (2022) A compiler framework for optimizing dynamic parallelism on GPUs. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 1-13. DOI: 10.1109/CGO53902.2022.9741284 (2 April 2022)
  • (2021) Orchestration of Perception Systems for Reliable Performance in Heterogeneous Platforms. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1757-1762. DOI: 10.23919/DATE51398.2021.9474186 (1 February 2021)
  • (2021) Achieving diverse redundancy for GPU Kernels. IEEE Transactions on Emerging Topics in Computing. DOI: 10.1109/TETC.2021.3101922 (2021)
  • (2021) Exploring Thread Coarsening on FPGA. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), 436-441. DOI: 10.1109/HiPC53243.2021.00062 (December 2021)
  • (2020) Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1018-1029. DOI: 10.1109/IPDPS47924.2020.00108 (May 2020)
