DOI: 10.1145/2749469.2750418
research-article

CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads

Published: 13 June 2015

Abstract

Graphics processing unit (GPU) architectures have become ubiquitous and efficient alternatives to chip multiprocessors for parallel workloads. GPUs achieve high performance by using massive multithreading and fast context switching to hide pipeline stalls and memory access latency. However, recent characterization results show that general-purpose GPU (GPGPU) applications commonly encounter long stall latencies that cannot easily be hidden, even with a large number of concurrent threads/warps. The result is a disparity in execution time between parallel warps that hurts overall GPU performance: the warp criticality problem.
To tackle the warp criticality problem, we propose a coordinated solution, criticality-aware warp acceleration (CAWA), that efficiently manages compute and memory resources to accelerate critical warp execution. Specifically, we design (1) an instruction-based and stall-based criticality predictor that identifies the critical warp in a thread block, (2) a criticality-aware warp scheduler that preferentially allocates more time resources to the critical warp, and (3) a criticality-aware cache reuse predictor that assists critical warp acceleration by retaining latency-critical and useful cache blocks in the L1 data cache. CAWA aims to remove the significant execution time disparity and thereby improve resource utilization for GPGPU workloads. Our evaluation shows that the proposed coordinated scheduling and cache prioritization scheme improves the performance of GPGPU workloads by 23%, whereas the state-of-the-art GTO and 2-level schedulers change performance by 16% and -2%, respectively.
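As a rough illustration of components (1) and (2), the toy Python sketch below scores each warp from an instruction term plus a stall term and always issues from the ready warp with the highest score. It is not the paper's predictor or scheduler: the abstract gives no formulas, so the `Warp` fields, the additive `criticality` score, the `cpi_estimate` parameter, and the tie-breaking rule are all assumptions made for illustration. Stall counts are also held fixed here, whereas real hardware would update them every cycle.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    """Toy per-warp state; field names are illustrative, not from the paper."""
    warp_id: int
    remaining_insts: int   # assumed estimate of instructions left to execute
    stall_cycles: int = 0  # cycles this warp has spent stalled so far
    ready: bool = True     # whether the warp can issue this cycle

    def criticality(self, cpi_estimate: float = 1.0) -> float:
        # Assumed combination: an instruction term (work left, scaled by a
        # guessed cycles-per-instruction value) plus a stall term.
        return self.remaining_insts * cpi_estimate + self.stall_cycles


def pick_warp(warps):
    """Return the ready warp with the highest criticality, or None."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    # Ties go to the lowest warp_id so the choice is deterministic.
    return max(ready, key=lambda w: (w.criticality(), -w.warp_id))


def simulate(warps, cycles):
    """Issue one instruction per cycle from the chosen warp; return issue order."""
    order = []
    for _ in range(cycles):
        chosen = pick_warp(warps)
        if chosen is None:
            break
        chosen.remaining_insts -= 1
        if chosen.remaining_insts == 0:
            chosen.ready = False  # finished warps stop competing
        order.append(chosen.warp_id)
    return order


if __name__ == "__main__":
    block = [Warp(0, 2), Warp(1, 5, stall_cycles=3), Warp(2, 2)]
    print(simulate(block, 9))
```

In this toy run, warp 1 has the most remaining work and the most accumulated stall time, so it is issued first until it finishes, after which the two shorter warps alternate. The cache reuse predictor of component (3) is not modeled here.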




Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 543 of 3,203 submissions, 17%

Article Metrics

• Downloads (Last 12 months): 103
• Downloads (Last 6 weeks): 27

Reflects downloads up to 23 Nov 2024

Cited By

• (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024
• (2024) WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1-16. DOI: 10.1109/HPCA57654.2024.00086. Online publication date: 2-Mar-2024
• (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 124-136. DOI: 10.1109/PACT58117.2023.00019. Online publication date: 21-Oct-2023
• (2023) Mitigating GPU Core Partitioning Performance Effects. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 530-542. DOI: 10.1109/HPCA56546.2023.10070957. Online publication date: Feb-2023
• (2023) GPU Architecture. Handbook of Computer Architecture, pp. 1-29. DOI: 10.1007/978-981-15-6401-7_66-2. Online publication date: 25-Jun-2023
• (2023) GPU Architecture. Handbook of Computer Architecture, pp. 1-29. DOI: 10.1007/978-981-15-6401-7_66-1. Online publication date: 16-May-2023
• (2022) Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs. ACM Transactions on Architecture and Code Optimization 19:4, pp. 1-21. DOI: 10.1145/3547301. Online publication date: 16-Sep-2022
• (2022) Cache-locality Based Adaptive Warp Scheduling for Neural Network Acceleration on GPGPUs. 2022 IEEE 35th International System-on-Chip Conference (SOCC), pp. 1-6. DOI: 10.1109/SOCC56010.2022.9908120. Online publication date: 5-Sep-2022
• (2022) DTexL: Decoupled Raster Pipeline for Texture Locality. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 213-227. DOI: 10.1109/MICRO56248.2022.00028. Online publication date: Oct-2022
• (2022) ACWS: Adaptive Cache-state Aware Warp Scheduling Based on Cache Feature Analysis. 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC), pp. 599-603. DOI: 10.1109/ICFTIC57696.2022.10075135. Online publication date: 2-Dec-2022
