DOI: 10.1145/2540708.2540718

Divergence-aware warp scheduling

Published: 07 December 2013

Abstract

This paper uses hardware thread scheduling to improve the performance and energy efficiency of divergent applications on GPUs. We propose Divergence-Aware Warp Scheduling (DAWS), which introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture intra-warp locality in loops. Predictor estimates are created from an online characterization of memory divergence and runtime information about the level of control flow divergence in warps. Unlike prior work on Cache-Conscious Wavefront Scheduling, which makes reactive scheduling decisions based on detected cache thrashing, DAWS makes proactive scheduling decisions based on cache usage predictions. DAWS uses these predictions to schedule warps such that data reused by active scalar threads is unlikely to exceed the capacity of the L1 data cache. DAWS attempts to shift the burden of locality management from software to hardware, increasing the performance of simpler and more portable code on the GPU. We compare the execution time of two Sparse Matrix Vector Multiply implementations and show that DAWS is able to run a simple, divergent version within 4% of a performance-optimized version that has been rewritten to make use of the on-chip scratchpad and to reduce memory divergence. We show that DAWS achieves a harmonic mean 26% performance improvement over Cache-Conscious Wavefront Scheduling on a diverse selection of highly cache-sensitive applications, with minimal additional hardware.
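
To make the proactive mechanism concrete, the sketch below models the scheduling decision in software. It is a minimal illustration under assumptions of my own, not the paper's hardware design: the structure names, the one-cache-line-per-active-thread footprint estimate, and the greedy admission loop are all hypothetical. The idea it demonstrates is the abstract's: a warp executing in a loop gets a predicted L1 data cache footprint that scales with its active-thread count and its number of memory-divergent loads, and warps are allowed to issue only while the pooled prediction fits in the cache.

```cuda
// daws_sketch.cu -- illustrative host-side model only; names and the
// footprint heuristic are assumptions, not the paper's hardware design.
#include <cstdio>
#include <vector>

// Per-warp state a DAWS-style predictor would track.
struct WarpState {
    int  active_threads;   // from the SIMT mask (control-flow divergence)
    int  divergent_loads;  // loop loads characterized as memory-divergent
    bool in_loop;          // footprints are only predicted inside loops
};

// Predicted L1D footprint in cache lines: assume each memory-divergent
// load touches roughly one line per active scalar thread per iteration.
int predictFootprint(const WarpState& w) {
    if (!w.in_loop) return 0;
    return w.active_threads * w.divergent_loads;
}

// Greedy admission: warps may issue only while the sum of predicted
// footprints still fits in the L1 data cache.
std::vector<int> schedulable(const std::vector<WarpState>& warps, int l1_lines) {
    std::vector<int> pool;
    int used = 0;
    for (int i = 0; i < (int)warps.size(); ++i) {
        int fp = predictFootprint(warps[i]);
        if (used + fp <= l1_lines) {
            pool.push_back(i);
            used += fp;
        }
    }
    return pool;
}

int main() {
    // 32KB L1 with 128B lines -> 256 lines (a typical simulated config).
    std::vector<WarpState> warps = {
        {32, 4, true}, {16, 4, true}, {32, 4, true}, {8, 4, true}};
    for (int i : schedulable(warps, 256))
        std::printf("warp %d may issue\n", i);
    return 0;
}
```

Note how control-flow divergence feeds the prediction: the 8-thread warp in the example fits in the remaining capacity where a third full warp would not, which is exactly the extra scheduling headroom a divergence-aware scheduler can extract from divergent code.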
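
For context on the SpMV comparison, the "simple, divergent version" is plausibly a scalar one-thread-per-row CSR kernel along the lines of the sketch below (a reconstruction for illustration, not the paper's code). Row lengths differ across the threads of a warp, causing control-flow divergence, and each thread walks its own column indices, causing memory divergence; the repeated reads of the dense vector x are the intra-warp locality that DAWS tries to keep resident in the L1.

```cuda
// Scalar CSR SpMV, one thread per row (illustrative reconstruction).
__global__ void spmv_csr_scalar(int num_rows,
                                const int*   __restrict__ row_ptr,
                                const int*   __restrict__ col_idx,
                                const float* __restrict__ vals,
                                const float* __restrict__ x,
                                float*       __restrict__ y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        // Trip count varies per thread: control-flow divergence in the warp.
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];  // uncoalesced, but x is reused
        y[row] = sum;
    }
}
```

The optimized version the abstract mentions would instead stage data through the on-chip scratchpad and coalesce its accesses; DAWS's claim is that the simple kernel above, under divergence-aware scheduling, comes within 4% of it.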

Published In

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
December 2013
498 pages
ISBN: 9781450326384
DOI: 10.1145/2540708
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 December 2013

Author Tags

  1. GPU
  2. caches
  3. divergence
  4. scheduling

Qualifiers

  • Research-article

Conference

MICRO-46

Acceptance Rates

MICRO-46 paper acceptance rate: 39 of 239 submissions, 16%.
Overall acceptance rate: 484 of 2,242 submissions, 22%.

Article Metrics

  • Downloads (last 12 months): 80
  • Downloads (last 6 weeks): 5
Reflects downloads up to 18 Nov 2024

Cited By

  • (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024.
  • (2024) Nearest data processing in GPU. Sustainable Computing: Informatics and Systems, 44, article 101047. DOI: 10.1016/j.suscom.2024.101047. Online publication date: Dec-2024.
  • (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 124-136. DOI: 10.1109/PACT58117.2023.00019. Online publication date: 21-Oct-2023.
  • (2023) Mitigating GPU Core Partitioning Performance Effects. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 530-542. DOI: 10.1109/HPCA56546.2023.10070957. Online publication date: Feb-2023.
  • (2023) GPU thread throttling for page-level thrashing reduction via static analysis. The Journal of Supercomputing, 80(7), pages 9829-9847. DOI: 10.1007/s11227-023-05787-y. Online publication date: 16-Dec-2023.
  • (2023) GPU Architecture. Handbook of Computer Architecture, pages 1-29. DOI: 10.1007/978-981-15-6401-7_66-2. Online publication date: 25-Jun-2023.
  • (2023) GPU Architecture. Handbook of Computer Architecture, pages 1-29. DOI: 10.1007/978-981-15-6401-7_66-1. Online publication date: 16-May-2023.
  • (2022) Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs. ACM Transactions on Architecture and Code Optimization, 19(4), pages 1-21. DOI: 10.1145/3547301. Online publication date: 16-Sep-2022.
  • (2022) Accelerating Backward Aggregation in GCN Training with Execution Path Preparing on GPUs. IEEE Transactions on Parallel and Distributed Systems, pages 1-13. DOI: 10.1109/TPDS.2022.3205642. Online publication date: 2022.
  • (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(12), pages 3415-3429. DOI: 10.1109/TPDS.2022.3154315. Online publication date: 24-Feb-2022.
