DOI: 10.1145/2967938.2967952

Automatically Exploiting Implicit Pipeline Parallelism from Multiple Dependent Kernels for GPUs

Published: 11 September 2016

Abstract

Execution of GPGPU workloads consists of several stages: data I/O on the CPU, memory copies between the CPU and GPU, and kernel execution. Because the GPU remains idle during I/O and memory copies, prior work has shown that overlapping data movement (I/O and memory copies) with kernel execution can improve performance. However, when a workload contains multiple dependent kernels, their execution is serialized and the benefit of overlapping data movement is limited. To improve the performance of such workloads, we propose to automatically overlap the execution of kernels by exploiting implicit pipeline parallelism. We first propose Coarse-grained Reference Counting-based Scoreboarding (CRCS) to guarantee correctness during overlapped execution of multiple kernels. However, CRCS alone does not necessarily improve overall performance if thread blocks (or CTAs) are scheduled sequentially. Thus, we also propose an alternative CTA scheduler, the Pipeline Parallelism-aware CTA Scheduler (PPCS), which takes available pipeline parallelism into account in CTA scheduling to maximize overlap and improve overall performance. Our evaluation shows that the proposed mechanisms improve performance by up to 67% (33% on average). To the best of our knowledge, this is one of the first works to enable overlapped execution of multiple dependent kernels without any kernel modification or explicit dependency annotations by the programmer.



Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN: 9781450341219
DOI: 10.1145/2967938


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpgpu
  2. overlapping kernels
  3. pipeline parallelism
  4. thread block scheduling

Qualifiers

  • Research-article

Conference

PACT '16
Sponsor:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 paper acceptance rate: 31 of 119 submissions (26%)
Overall acceptance rate: 121 of 471 submissions (26%)


Cited By

  • (2023) RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks With Fine-Grain Utilization. IEEE Transactions on Parallel and Distributed Systems, 34(5):1450-1465, May 2023. DOI: 10.1109/TPDS.2023.3235439
  • (2021) Pipelined CPU-GPU Scheduling to Reduce Main Memory Accesses. In Proceedings of the International Symposium on Memory Systems, pages 1-10, September 2021. DOI: 10.1145/3488423.3519319
  • (2021) BlockMaestro. In Proceedings of the 48th Annual International Symposium on Computer Architecture, pages 333-346, June 2021. DOI: 10.1109/ISCA52012.2021.00034
  • (2020) GOPipe. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pages 43-54, September 2020. DOI: 10.1145/3410463.3414656
  • (2019) A case study on machine learning for synthesizing benchmarks. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 38-46, June 2019. DOI: 10.1145/3315508.3329976
  • (2019) HiWayLib. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 153-166, April 2019. DOI: 10.1145/3297858.3304032
  • (2019) SMQoS: Improving Utilization and Energy Efficiency with QoS Awareness on GPUs. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1-5, September 2019. DOI: 10.1109/CLUSTER.2019.8891047
  • (2019) A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing, January 2019. DOI: 10.1016/j.jpdc.2018.11.012
  • (2018) Data motifs. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-14, November 2018. DOI: 10.1145/3243176.3243190
  • (2018) Scheduling Methods to Optimize Dependent Programs for GPU Architecture. In Workshop Proceedings of the 47th International Conference on Parallel Processing, pages 1-8, August 2018. DOI: 10.1145/3229710.3229723
