
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Published: 16 March 2013

Abstract

Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, the available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunities to improve performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies.
In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior and are, as a result, inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency hiding capability. The third scheme, bank-level parallelism aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide a 33% average performance improvement compared to the commonly employed round-robin warp scheduling policy.
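
To make the contrast between a CTA-agnostic and a CTA-aware policy concrete, the following is a minimal Python sketch of the general idea behind two-level, CTA-grouped warp selection next to a plain round-robin baseline. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how CTAs are grouped or how priority moves between groups, so the `Warp` class, the grouping of consecutive CTA ids in `group_of`, and the fallback rule in `pick_cta_aware` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    cta_id: int
    stalled: bool = False  # True while waiting on a long-latency memory access


def group_of(warp, ctas_per_group):
    """Map a warp to a CTA group (assumption: consecutive CTA ids share a group)."""
    return warp.cta_id // ctas_per_group


def pick_round_robin(warps, last_idx):
    """Baseline: scan all warps in circular order, ignoring CTA membership.

    Returns the index of the next ready warp, or None if every warp is stalled.
    """
    n = len(warps)
    for step in range(1, n + 1):
        idx = (last_idx + step) % n
        if not warps[idx].stalled:
            return idx
    return None


def pick_cta_aware(warps, last_idx, ctas_per_group, num_groups, active_group):
    """Two-level pick (hypothetical rule): prefer ready warps of the active CTA
    group; only when every warp of that group is stalled, try the next group.

    Returns (warp index, group it came from), or (None, active_group) if all
    warps are stalled.
    """
    n = len(warps)
    for offset in range(num_groups):
        g = (active_group + offset) % num_groups
        for step in range(1, n + 1):
            idx = (last_idx + step) % n
            w = warps[idx]
            if group_of(w, ctas_per_group) == g and not w.stalled:
                return idx, g
    return None, active_group


if __name__ == "__main__":
    # 4 CTAs with 2 warps each; 2 CTAs per group gives 2 groups of 4 warps.
    demo = [Warp(i, i // 2) for i in range(8)]
    print(pick_round_robin(demo, 3))
    print(pick_cta_aware(demo, 3, 2, 2, 0))
```

In this toy setup, round-robin moves on to every warp in turn, so all warps tend to reach their long-latency accesses at about the same time. The grouped policy keeps issuing from one group of CTAs until that group stalls, which leaves the other groups available to hide the latency.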


Published In

ACM SIGPLAN Notices, Volume 48, Issue 4
ASPLOS '13
April 2013
540 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2499368

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
March 2013
574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPGPUs
  2. latency tolerance
  3. prefetching
  4. scheduling

Qualifiers

  • Research-article

