DOI: 10.1145/2967938.2967952

Automatically Exploiting Implicit Pipeline Parallelism from Multiple Dependent Kernels for GPUs

Published: 11 September 2016

Abstract

Execution of GPGPU workloads consists of several stages: data I/O on the CPU, memory copies between the CPU and GPU, and kernel execution. Because the GPU remains idle during I/O and memory copies, prior work has shown that overlapping data movement (I/O and memory copies) with kernel execution can improve performance. However, when a workload contains multiple dependent kernels, their execution is serialized and the benefit of overlapping data movement is limited. To improve the performance of such workloads, we propose to automatically overlap the execution of kernels by exploiting implicit pipeline parallelism. We first propose Coarse-grained Reference Counting-based Scoreboarding (CRCS) to guarantee correctness during overlapped execution of multiple kernels. However, CRCS alone does not necessarily improve overall performance if thread blocks (or CTAs) are scheduled sequentially. Thus, we also propose an alternative CTA scheduler, the Pipeline Parallelism-aware CTA Scheduler (PPCS), which takes available pipeline parallelism into account in CTA scheduling to maximize overlap and improve overall performance. Our evaluation shows that the proposed mechanisms improve performance by up to 67% (33% on average). To the best of our knowledge, this is one of the first works to enable overlapped execution of multiple dependent kernels without any kernel modification or explicit dependency annotations by the programmer.



Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN: 9781450341219
DOI: 10.1145/2967938


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpgpu
  2. overlapping kernels
  3. pipeline parallelism
  4. thread block scheduling

Qualifiers

  • Research-article

Conference

PACT '16
Sponsor:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 paper acceptance rate: 31 of 119 submissions (26%)
Overall acceptance rate: 121 of 471 submissions (26%)


Cited By

  • (2023) RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks With Fine-Grain Utilization. IEEE Transactions on Parallel and Distributed Systems, 34(5):1450-1465, May 2023. DOI: 10.1109/TPDS.2023.3235439
  • (2021) Pipelined CPU-GPU Scheduling to Reduce Main Memory Accesses. In Proceedings of the International Symposium on Memory Systems, pages 1-10, September 2021. DOI: 10.1145/3488423.3519319
  • (2021) BlockMaestro. In Proceedings of the 48th Annual International Symposium on Computer Architecture, pages 333-346, June 2021. DOI: 10.1109/ISCA52012.2021.00034
  • (2020) GOPipe. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pages 43-54, September 2020. DOI: 10.1145/3410463.3414656
  • (2019) A case study on machine learning for synthesizing benchmarks. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 38-46, June 2019. DOI: 10.1145/3315508.3329976
  • (2019) HiWayLib. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 153-166, April 2019. DOI: 10.1145/3297858.3304032
  • (2019) SMQoS: Improving Utilization and Energy Efficiency with QoS Awareness on GPUs. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1-5, September 2019. DOI: 10.1109/CLUSTER.2019.8891047
  • (2019) A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing, January 2019. DOI: 10.1016/j.jpdc.2018.11.012
  • (2018) Data motifs. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-14, November 2018. DOI: 10.1145/3243176.3243190
  • (2018) Scheduling Methods to Optimize Dependent Programs for GPU Architecture. In Workshop Proceedings of the 47th International Conference on Parallel Processing, pages 1-8, August 2018. DOI: 10.1145/3229710.3229723
