Abstract
Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming
Multiprocessors (SMs), each of which consists of a few tens of SIMD cores. A kernel is launched on the GPU with an execution
configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated
to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive threads,
called warps. For various reasons, such as differing amounts of work and memory access latencies, the warps of a TB may
finish kernel execution at different points in time, causing the faster warps to wait for their slower sibling warps. This, in
effect, reduces the utilization of SM resources and hence the performance of the GPU.
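For concreteness, a minimal CUDA launch illustrating these terms (the kernel and the sizes below are illustrative, not taken from the evaluated benchmarks):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread processes one element.
__global__ void scale(float *data, unsigned int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main() {
    const unsigned int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // Execution configuration (grid): 4096 thread blocks of 256 threads.
    // Each TB is allocated to an SM as a unit and executed there as
    // 256 / 32 = 8 warps of 32 consecutive threads.
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```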
We propose a simple and elegant technique to eliminate the waiting time of warps at the end of kernel execution and
thereby improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual warps, and
enables warps that finish earlier to execute the kernel again for another logical (user-specified) thread block, without waiting
for their sibling warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual warps.
Further, this technique enables us to design a warp scheduling algorithm that is aware of the progress made by the virtual
thread blocks and virtual warps, and uses this knowledge to prioritise warps effectively. Evaluation of a diverse set of
kernels from the Rodinia, Parboil, and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean
improvement of 1.06x over a baseline architecture that uses the Greedy Then Oldest (GTO) warp scheduler and 1.09x over the Loose
Round Robin (LRR) warp scheduler.