DOI: 10.1145/2540708.2540718

Divergence-aware warp scheduling

Published: 07 December 2013

Abstract

This paper uses hardware thread scheduling to improve the performance and energy efficiency of divergent applications on GPUs. We propose Divergence-Aware Warp Scheduling (DAWS), which introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture intra-warp locality in loops. Predictor estimates are created from an online characterization of memory divergence and runtime information about the level of control flow divergence in warps. Unlike prior work on Cache-Conscious Wavefront Scheduling, which makes reactive scheduling decisions based on detected cache thrashing, DAWS makes proactive scheduling decisions based on cache usage predictions. DAWS uses these predictions to schedule warps such that data reused by active scalar threads is unlikely to exceed the capacity of the L1 data cache. DAWS attempts to shift the burden of locality management from software to hardware, increasing the performance of simpler and more portable code on the GPU. We compare the execution time of two Sparse Matrix Vector Multiply implementations and show that DAWS is able to run a simple, divergent version within 4% of a performance-optimized version that has been rewritten to make use of the on-chip scratchpad and to reduce memory divergence. We show that DAWS achieves a harmonic mean 26% performance improvement over Cache-Conscious Wavefront Scheduling on a diverse selection of highly cache-sensitive applications, with minimal additional hardware.
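
To make the proactive mechanism concrete, the sketch below models the scheduling decision in software. It is a minimal illustration under assumptions of my own, not the paper's hardware design: the structure names, the one-cache-line-per-active-thread footprint estimate, and the greedy admission loop are all hypothetical. The idea it demonstrates is the abstract's: a warp executing in a loop gets a predicted L1 data cache footprint that scales with its active-thread count and its number of memory-divergent loads, and warps are allowed to issue only while the pooled prediction fits in the cache.

```cuda
// daws_sketch.cu -- illustrative host-side model only; names and the
// footprint heuristic are assumptions, not the paper's hardware design.
#include <cstdio>
#include <vector>

// Per-warp state a DAWS-style predictor would track.
struct WarpState {
    int  active_threads;   // from the SIMT mask (control-flow divergence)
    int  divergent_loads;  // loop loads characterized as memory-divergent
    bool in_loop;          // footprints are only predicted inside loops
};

// Predicted L1D footprint in cache lines: assume each memory-divergent
// load touches roughly one line per active scalar thread per iteration.
int predictFootprint(const WarpState& w) {
    if (!w.in_loop) return 0;
    return w.active_threads * w.divergent_loads;
}

// Greedy admission: warps may issue only while the sum of predicted
// footprints still fits in the L1 data cache.
std::vector<int> schedulable(const std::vector<WarpState>& warps, int l1_lines) {
    std::vector<int> pool;
    int used = 0;
    for (int i = 0; i < (int)warps.size(); ++i) {
        int fp = predictFootprint(warps[i]);
        if (used + fp <= l1_lines) {
            pool.push_back(i);
            used += fp;
        }
    }
    return pool;
}

int main() {
    // 32KB L1 with 128B lines -> 256 lines (a typical simulated config).
    std::vector<WarpState> warps = {
        {32, 4, true}, {16, 4, true}, {32, 4, true}, {8, 4, true}};
    for (int i : schedulable(warps, 256))
        std::printf("warp %d may issue\n", i);
    return 0;
}
```

Note how control-flow divergence feeds the prediction: the 8-thread warp in the example fits in the remaining capacity where a third full warp would not, which is exactly the extra scheduling headroom a divergence-aware scheduler can extract from divergent code.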
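
For context on the SpMV comparison, the "simple, divergent version" is plausibly a scalar one-thread-per-row CSR kernel along the lines of the sketch below (a reconstruction for illustration, not the paper's code). Row lengths differ across the threads of a warp, causing control-flow divergence, and each thread walks its own column indices, causing memory divergence; the repeated reads of the dense vector x are the intra-warp locality that DAWS tries to keep resident in the L1.

```cuda
// Scalar CSR SpMV, one thread per row (illustrative reconstruction).
__global__ void spmv_csr_scalar(int num_rows,
                                const int*   __restrict__ row_ptr,
                                const int*   __restrict__ col_idx,
                                const float* __restrict__ vals,
                                const float* __restrict__ x,
                                float*       __restrict__ y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        // Trip count varies per thread: control-flow divergence in the warp.
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];  // uncoalesced, but x is reused
        y[row] = sum;
    }
}
```

The optimized version the abstract mentions would instead stage data through the on-chip scratchpad and coalesce its accesses; DAWS's claim is that the simple kernel above, under divergence-aware scheduling, comes within 4% of it.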

Published In

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
December 2013
498 pages
ISBN: 9781450326384
DOI: 10.1145/2540708
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 December 2013

Author Tags

  1. GPU
  2. caches
  3. divergence
  4. scheduling

Qualifiers

  • Research-article

Conference

MICRO-46

Acceptance Rates

MICRO-46 paper acceptance rate: 39 of 239 submissions, 16%.
Overall acceptance rate: 484 of 2,242 submissions, 22%.

Article Metrics

  • Downloads (last 12 months): 80
  • Downloads (last 6 weeks): 5
Reflects downloads up to 18 Nov 2024

Cited By

  • (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024.
  • (2024) Nearest data processing in GPU. Sustainable Computing: Informatics and Systems, 44, article 101047. DOI: 10.1016/j.suscom.2024.101047. Online publication date: Dec-2024.
  • (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 124-136. DOI: 10.1109/PACT58117.2023.00019. Online publication date: 21-Oct-2023.
  • (2023) Mitigating GPU Core Partitioning Performance Effects. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 530-542. DOI: 10.1109/HPCA56546.2023.10070957. Online publication date: Feb-2023.
  • (2023) GPU thread throttling for page-level thrashing reduction via static analysis. The Journal of Supercomputing, 80(7), pages 9829-9847. DOI: 10.1007/s11227-023-05787-y. Online publication date: 16-Dec-2023.
  • (2023) GPU Architecture. Handbook of Computer Architecture, pages 1-29. DOI: 10.1007/978-981-15-6401-7_66-2. Online publication date: 25-Jun-2023.
  • (2023) GPU Architecture. Handbook of Computer Architecture, pages 1-29. DOI: 10.1007/978-981-15-6401-7_66-1. Online publication date: 16-May-2023.
  • (2022) Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs. ACM Transactions on Architecture and Code Optimization, 19(4), pages 1-21. DOI: 10.1145/3547301. Online publication date: 16-Sep-2022.
  • (2022) Accelerating Backward Aggregation in GCN Training with Execution Path Preparing on GPUs. IEEE Transactions on Parallel and Distributed Systems, pages 1-13. DOI: 10.1109/TPDS.2022.3205642. Online publication date: 2022.
  • (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(12), pages 3415-3429. DOI: 10.1109/TPDS.2022.3154315. Online publication date: 24-Feb-2022.
