Research article · Open access

Predictable Thread Coarsening

Published: 12 June 2018

Abstract

Thread coarsening on GPUs combines the work of several threads into one. We show how thread coarsening can be implemented as a fully automated compile-time optimisation that estimates the optimal coarsening factor based on a low-cost, approximate static analysis of cache-line re-use and an occupancy prediction model. We evaluate two coarsening strategies on three different NVIDIA GPU architectures. For NVIDIA's reduction kernels we achieve a maximum speedup of 5.08x, and for the Rodinia benchmarks we achieve a mean speedup of 1.30x across the 8 of the 19 kernels that were determined safe to coarsen.
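
To make the transformation concrete, the sketch below shows thread coarsening by a factor of 2 applied by hand to a simple CUDA vector-add kernel. It illustrates the general technique only, not the compiler pass or the factor-selection model described in the paper; the names vecAdd, vecAddCoarsened, and COARSEN are illustrative.

#include <cuda_runtime.h>

#define COARSEN 2  // illustrative coarsening factor; the paper's analysis would estimate this value statically

// Baseline kernel: one output element per thread.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Coarsened kernel: each thread produces COARSEN elements, so the grid is
// launched with COARSEN times fewer threads. Iterations are strided by
// blockDim.x so that the threads of a warp still touch contiguous,
// coalesced cache lines on every iteration.
__global__ void vecAddCoarsened(const float *a, const float *b, float *c, int n) {
    int base = blockIdx.x * blockDim.x * COARSEN + threadIdx.x;
    #pragma unroll
    for (int k = 0; k < COARSEN; ++k) {
        int i = base + k * blockDim.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

Because the coarsened grid contains COARSEN times fewer threads, a launch such as vecAddCoarsened<<<(n + 256 * COARSEN - 1) / (256 * COARSEN), 256>>>(a, b, c, n) replaces the original one. This reduced thread count is precisely the trade-off the paper's model targets: more work and register pressure per thread against lower occupancy and changed cache-line re-use.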


Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 2
June 2018, 251 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3212710
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2018
Accepted: 01 March 2018
Revised: 01 February 2018
Received: 01 August 2017
Published in TACO Volume 15, Issue 2

Author Tags

  1. GPU
  2. compiler optimisations

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • EPSRC



Cited By

  • (2024) GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1-16. DOI: 10.1109/ISCA59077.2024.00011 (29 June 2024)
  • (2024) Retargeting and Respecializing GPU Workloads for Performance Portability. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 119-132. DOI: 10.1109/CGO57630.2024.10444828 (2 March 2024)
  • (2022) Future aware Dynamic Thermal Management in CPU-GPU Embedded Platforms. 2022 IEEE Real-Time Systems Symposium (RTSS), 396-408. DOI: 10.1109/RTSS55097.2022.00041 (December 2022)
  • (2022) A compiler framework for optimizing dynamic parallelism on GPUs. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 1-13. DOI: 10.1109/CGO53902.2022.9741284 (2 April 2022)
  • (2021) Orchestration of Perception Systems for Reliable Performance in Heterogeneous Platforms. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1757-1762. DOI: 10.23919/DATE51398.2021.9474186 (1 February 2021)
  • (2021) Achieving diverse redundancy for GPU Kernels. IEEE Transactions on Emerging Topics in Computing. DOI: 10.1109/TETC.2021.3101922 (2021)
  • (2021) Exploring Thread Coarsening on FPGA. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), 436-441. DOI: 10.1109/HiPC53243.2021.00062 (December 2021)
  • (2020) Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1018-1029. DOI: 10.1109/IPDPS47924.2020.00108 (May 2020)
