DOI: 10.5555/3199700.3199702

Exploring cache bypassing and partitioning for multi-tasking on GPUs

Published: 13 November 2017

Abstract

Graphics Processing Unit (GPU) computing has become ubiquitous in embedded systems, as evidenced by its wide adoption for general-purpose applications. As more applications are accelerated by GPUs, multi-tasking scenarios begin to emerge. Multi-tasking allows multiple applications to execute simultaneously on the same GPU and share its resources. This raises new challenges due to contention among the applications for shared resources such as caches. However, the caches on GPUs are difficult to use well; if used inappropriately, they may hurt performance rather than improve it.
In this paper, we propose cache partitioning combined with cache bypassing as the shared-cache management mechanism for multi-tasking on GPUs. The combined approach aims to reduce interference among the tasks while preserving locality for each task. However, the interplay between cache partitioning and bypassing poses further challenges: on one hand, the cache space partitioned to a task affects its bypassing decisions; on the other hand, bypassing affects the cache capacity each task requires. To address this, we propose a two-step approach. First, we use cache partitioning to assign dedicated cache space to each task, reducing inter-task interference; during this step we compare cache partitioning against coarse-grained cache bypassing. Second, we use fine-grained cache bypassing to selectively bypass certain data requests and threads within each task. We explore different cache partitioning and bypassing designs and demonstrate the potential benefits of this approach. Experiments on a wide range of applications show that our technique improves overall system throughput by 52% on average compared to the default multi-tasking solution on GPUs.
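The two-step approach described above can be illustrated with a minimal sketch. This is not the paper's implementation: the utility curves, reuse-distance estimates, and function names below are hypothetical inputs invented for illustration. Step 1 greedily hands out cache ways by marginal utility (a common utility-based partitioning heuristic); step 2 bypasses requests whose estimated reuse distance exceeds the task's partitioned capacity.

```python
# Illustrative sketch only -- NOT the authors' algorithm. The utility curves
# and reuse-distance inputs are made-up values for demonstration.

def partition_cache(total_ways, utility):
    """Step 1: greedily assign cache ways to tasks by marginal utility.
    `utility[t][w]` is the (hypothetical) cumulative hit benefit task t
    gets from w cache ways."""
    alloc = {t: 0 for t in utility}

    def gain(t):
        # marginal benefit of giving task t one more way
        w = alloc[t]
        if w + 1 < len(utility[t]):
            return utility[t][w + 1] - utility[t][w]
        return 0

    for _ in range(total_ways):
        alloc[max(alloc, key=gain)] += 1
    return alloc

def should_bypass(est_reuse_distance, allocated_ways):
    """Step 2: fine-grained bypassing -- skip cache insertion for a request
    whose estimated reuse distance exceeds the task's partitioned capacity,
    since caching it would only thrash the partition."""
    return est_reuse_distance > allocated_ways

# Two synthetic tasks: A has high locality, B is streaming with little reuse.
utility = {
    "A": [0, 40, 70, 90, 100, 105, 108, 110, 111],  # diminishing returns
    "B": [0, 5, 9, 12, 14, 15, 16, 17, 18],         # little cache benefit
}
alloc = partition_cache(8, utility)
print(alloc)  # most ways go to the high-locality task A
print(should_bypass(16, alloc["B"]))  # long-reuse streaming request: bypass
```

The key interplay the abstract highlights is visible even here: the bypass decision for task B depends on `alloc["B"]`, which in turn was set by the partitioning step.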


Cited By

  • (2019) Cache Reconfiguration Using Machine Learning for Vulnerability-aware Energy Optimization. ACM Transactions on Embedded Computing Systems 18(2):1-24. DOI: 10.1145/3309762. 2 Apr 2019.
  • (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 229-241. DOI: 10.1145/3293883.3295734. 16 Feb 2019.
  • (2018) cuMBIR. Proceedings of the 2018 International Conference on Supercomputing, 184-194. DOI: 10.1145/3205289.3205309. 12 Jun 2018.


Published In

ICCAD '17: Proceedings of the 36th International Conference on Computer-Aided Design
November 2017
1077 pages

In-Cooperation

  • IEEE-EDS: Electronic Devices Society

Publisher

IEEE Press


Author Tags

  1. GPU
  2. cache
  3. cache bypassing
  4. cache partitioning

Qualifiers

  • Research-article

Conference

ICCAD '17

Acceptance Rates

Overall Acceptance Rate 457 of 1,762 submissions, 26%

