Taming warp divergence

Published: 04 February 2017

Abstract

Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming
Multiprocessors (SMs), each consisting of a few tens of SIMD cores. A kernel is launched on the GPU with an execution
configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are
allocated to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive
threads, called warps. For various reasons, such as differing amounts of work or memory access latencies, the warps of a TB
may finish kernel execution at different points in time, causing the faster warps to wait for their slower sibling warps. This
waiting reduces the utilization of SM resources and hence the performance of the GPU.
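
To make the grid/TB/warp hierarchy concrete, here is a minimal CUDA sketch; the kernel and the configuration values are hypothetical, chosen only for illustration. It launches a grid of 80 TBs of 256 threads each; every TB is allocated to an SM whole and runs there as 256/32 = 8 warps.

    #include <cstdio>

    // Hypothetical kernel: each thread writes its global id. The point is
    // the execution configuration, not the per-thread work itself.
    __global__ void fill_ids(int *out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
        out[tid] = tid;                                   // trivial per-thread work
    }

    int main() {
        const int threadsPerBlock = 256;  // TB size: 8 warps of 32 threads
        const int numBlocks = 80;         // grid size: 80 TBs
        int *d_out;
        cudaMalloc(&d_out, numBlocks * threadsPerBlock * sizeof(int));
        // The <<<grid, block>>> execution configuration defines the grid.
        fill_ids<<<numBlocks, threadsPerBlock>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }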
We propose a simple and elegant technique to eliminate the waiting time of warps at the end of kernel execution and thereby
improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual warps, enabling
warps that finish early to execute the kernel again for another logical (user-specified) thread block, without waiting
for their sibling warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual warps.
Further, this technique enables us to design a warp scheduling algorithm that is aware of the progress made by the virtual
thread blocks and virtual warps, and uses this knowledge to prioritise warps effectively. Evaluation on a diverse set of
kernels from the Rodinia, Parboil and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean
improvement of 1.06x over the baseline architecture that uses the Greedy-Then-Oldest (GTO) warp scheduler and 1.09x over the
Loose Round Robin (LRR) warp scheduler.
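
The following is a minimal sketch of the persistent-threads pattern the abstract describes, not the authors' actual transformation; the function names, the global counter, and the launch geometry are all hypothetical. Each physical warp loops, claiming the next logical unit of work from a global counter, so a warp that finishes early simply takes more work instead of idling. For simplicity the logical unit here is one warp's worth of work (a virtual warp); the paper additionally groups these into virtual thread blocks and feeds their progress to the warp scheduler.

    #include <cstdio>

    __device__ int nextLogicalBlock;  // global work counter shared by all warps

    // Stand-in for the original kernel body, reindexed by a logical block id.
    __device__ void kernel_body(int logicalBlock, int lane, float *data) {
        int tid = logicalBlock * 32 + lane;  // thread id in the logical grid
        data[tid] += 1.0f;
    }

    // Persistent kernel: each physical warp claims logical blocks until the
    // logical grid is exhausted, never waiting for its sibling warps.
    __global__ void persistent_kernel(float *data, int numLogicalBlocks) {
        int lane = threadIdx.x % 32;
        while (true) {
            int lb = 0;
            if (lane == 0)
                lb = atomicAdd(&nextLogicalBlock, 1);  // one claim per warp
            lb = __shfl_sync(0xffffffffu, lb, 0);      // broadcast to all lanes
            if (lb >= numLogicalBlocks) break;         // logical grid exhausted
            kernel_body(lb, lane, data);
        }
    }

    int main() {
        const int numLogicalBlocks = 4096;  // user-specified logical grid
        float *d_data;
        cudaMalloc(&d_data, numLogicalBlocks * 32 * sizeof(float));
        cudaMemset(d_data, 0, numLogicalBlocks * 32 * sizeof(float));
        int zero = 0;
        cudaMemcpyToSymbol(nextLogicalBlock, &zero, sizeof(int));
        // Launch only enough physical TBs to fill the SMs; they persist and loop.
        persistent_kernel<<<60, 256>>>(d_data, numLogicalBlocks);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }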

References

[1] J. Anantpur and R. Govindarajan. PRO: Progress Aware GPU Warp Scheduling Algorithm. IPDPS-2015.
[2] J. Anantpur and R. Govindarajan. Taming Control Divergence in GPUs through Control Flow Linearization. CC-2014.
[3] M. Awatramani, X. Zhu, J. Zambreno and D. Rover. Phase Aware Warp Scheduling: Mitigating Effects of Phase Behavior in GPGPU Applications. PACT-2015.
[4] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. ISPASS-2009.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. IISWC-2009.
[6] G. Chen and X. Shen. Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse. MICRO-2015.
[7] CUDA. CUDA C Programming Guide.
[8] G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu and S. Yalamanchili. SIMD Re-Convergence At Thread Frontiers. MICRO-2011.
[9] Fermi. http://www.nvidia.in/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[10] W. Fung and T. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. HPCA-2011.
[11] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, K. Skadron. Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors. ISCA-2011.
[12] K. Gupta, J. A. Stuart and J. D. Owens. A Study of Persistent Threads Style GPU Programming for GPGPU Workloads. InPar-2012.
[13] A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, C. R. Das. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. ASPLOS-2013.
[14] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. Orchestrated Scheduling and Prefetching for GPGPUs. ISCA-2013.
[15] O. Kayiran, A. Jog, M. T. Kandemir and C. R. Das. Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs. PACT-2013.
[16] Kepler. http://www.nvidia.in/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
[17] F. Khorasani, R. Gupta and L. N. Bhuyan. Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection. MICRO-2015.
[18] K. Kim, S. Lee, M. K. Yoon, G. Koo, W. W. Ro, M. Annavaram. Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding. HPCA-2016.
[19] M. Lee, G. Kim, J. Kim, W. Seo, Y. Cho, S. Ryu. iPAWS: Instruction-Issue Pattern-based Adaptive Warp Scheduling for GPGPUs. HPCA-2016.
[20] S. Lee and C. Wu. CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads. PACT-2014.
[21] S. Lee, A. Arunkumar and C. Wu. CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads. ISCA-2015.
[22] Y. Lee, R. Krashinsky, V. Grover, S. W. Keckler and K. Asanovic. Convergence and Scalarization for Data-Parallel Architectures. CGO-2013.
[23] J. Meng, D. Tarjan and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. ISCA-2010.
[24] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, Y. N. Patt. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. MICRO-2011.
[25] OpenCL. www.khronos.org/opencl.
[26] S. Pai, M. J. Thazhuthaveetil and R. Govindarajan. Improving GPGPU Concurrency with Elastic Kernels. ASPLOS-2013.
[27] M. Rhu and M. Erez. CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures. ISCA-2012.
[28] M. Rhu and M. Erez. The Dual-Path Execution Model for Efficient GPU Control Flow. HPCA-2013.
[29] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-Conscious Wavefront Scheduling. MICRO-2012.
[30] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Divergence-Aware Warp Scheduling. MICRO-2013.
[31] A. Sethia, D. A. Jamshidi, S. Mahlke. Mascar: Speeding up GPU Warps by Reducing Memory Pitstops. HPCA-2015.
[32] J. A. Stratton, S. S. Stone, and W. M. Hwu. MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs. LCPC-2008.
[33] J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu and W. W. Hwu. Efficient Compilation of Fine-grained SPMD-threaded Programs for Multicore CPUs. CGO-2010.
[34] J. A. Stratton, C. Rodrigues, I. Sung, N. Obeid, L. Chang, N. Anssari, G. D. Liu, W. W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. UIUC, Tech. Rep. IMPACT-12-01, March 2012.
[35] W. Wu, G. Chen, D. Li, X. Shen and J. Vetter. Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations. ICS-2015.
[36] P. Xiang, Y. Yang, H. Zhou. Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation. HPCA-2014.
[37] M. K. Yoon, K. Kim, S. Lee, W. W. Ro, and M. Annavaram. Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit. ISCA-2016.

Cited By

  • (2019) A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 38-46. https://doi.org/10.1145/3315508.3329976. Online publication date: 22-Jun-2019.



Published In

CGO '17: Proceedings of the 2017 International Symposium on Code Generation and Optimization
February 2017
317 pages
ISBN: 9781509049318


Publisher

IEEE Press


Author Tags

  1. Divergence
  2. GPU
  3. Warp Scheduling

Acceptance Rates

CGO '17 Paper Acceptance Rate: 26 of 116 submissions, 22%
Overall Acceptance Rate: 312 of 1,061 submissions, 29%
