
FLEP: Enabling Flexible and Efficient Preemption on GPUs

Published: 04 April 2017

Abstract

GPUs are widely adopted in HPC and cloud-computing platforms to accelerate general-purpose workloads. However, modern GPUs do not support flexible preemption, which leads to performance degradation and priority inversion in multi-tasking environments.
In this paper, we propose and develop FLEP, the first software system that enables flexible kernel preemption and kernel scheduling on commodity GPUs. The FLEP compilation engine transforms a GPU program into a preemptable form that can be interrupted during execution to yield all or part of the streaming multiprocessors (SMs) of the GPU. The FLEP runtime engine intercepts all kernel invocations and decides which kernels should be preempted and how they should be scheduled. Experimental results on two-kernel co-runs show up to a 24.2X speedup for high-priority kernels and up to a 27X improvement in normalized average turnaround time for kernels of the same priority. When the waiting kernels need only a few SMs, FLEP reduces preemption latency by up to 41% compared with yielding the whole GPU. FLEP delivers these benefits with only 2.5% runtime overhead, substantially lower than the kernel-slicing approach.
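The block-level yielding the abstract describes can be illustrated with a small sketch. This is a host-side analogue in plain Python threads, not real CUDA and not FLEP's actual API: a fixed pool of "persistent workers" stands in for the SMs, pulls thread-block indices from a shared counter, and polls a preempt flag at block boundaries, so the kernel can be interrupted without hardware preemption support. The names (`PreemptableKernel`, `work_fn`, `launch`) are illustrative assumptions.

```python
# Host-side sketch (plain Python threads, NOT real CUDA) of block-level
# voluntary preemption: persistent workers fetch block indices from a shared
# counter and check a preempt flag between blocks, so a "kernel" can yield
# at block granularity rather than running to completion.
import itertools
import threading


class PreemptableKernel:
    def __init__(self, num_blocks, work_fn):
        self.num_blocks = num_blocks
        self.work_fn = work_fn
        self._next_block = itertools.count()  # shared block-index counter
        self.preempt = threading.Event()      # set by the runtime to preempt
        self.done_blocks = []                 # blocks completed so far
        self._lock = threading.Lock()

    def _worker(self):
        # Each worker plays the role of one SM running persistent threads.
        while not self.preempt.is_set():      # voluntary check per block
            blk = next(self._next_block)
            if blk >= self.num_blocks:
                return                        # kernel ran to completion
            self.work_fn(blk)
            with self._lock:
                self.done_blocks.append(blk)

    def launch(self, num_workers=4):
        """Run until all blocks finish or the preempt flag is raised."""
        workers = [threading.Thread(target=self._worker)
                   for _ in range(num_workers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return len(self.done_blocks)
```

If the runtime raises `preempt` mid-run, each worker drains its current block and exits, so the preemption latency is bounded by one block's execution time instead of the whole kernel's; the runtime can later relaunch the kernel over the remaining blocks.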




Published In

ACM SIGPLAN Notices, Volume 52, Issue 4 (ASPLOS '17), April 2017, 811 pages.
ISSN: 0362-1340; EISSN: 1558-1160; DOI: 10.1145/3093336.

Also in: ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, April 2017, 856 pages.
ISBN: 9781450344654; DOI: 10.1145/3037697.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 04 April 2017, in SIGPLAN Volume 52, Issue 4


Author Tags

  1. gpgpu
  2. kernel scheduling
  3. multi-tasking
  4. preemption

Qualifiers

  • Research-article


Article Metrics

  • Downloads (last 12 months): 344
  • Downloads (last 6 weeks): 92
Reflects downloads up to 26 Sep 2024

Cited By
  • (2023) Kernel-as-a-Service. Proceedings of the 24th International Middleware Conference, pp. 192-206. DOI: 10.1145/3590140.3629115. Online: 27-Nov-2023.
  • (2023) Secure and Timely GPU Execution in Cyber-physical Systems. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp. 2591-2605. DOI: 10.1145/3576915.3623197. Online: 15-Nov-2023.
  • (2022) RAISE: Efficient GPU Resource Management via Hybrid Scheduling. 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 685-695. DOI: 10.1109/CCGrid54584.2022.00078. Online: May-2022.
  • (2022) QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU. Parallel Computing, 113:102958. DOI: 10.1016/j.parco.2022.102958. Online: Oct-2022.
  • (2022) gShare. Future Generation Computer Systems, 130(C):181-192. DOI: 10.1016/j.future.2021.12.016. Online: 1-May-2022.
  • (2022) FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs. The Journal of Supercomputing, 78(1):43-71. DOI: 10.1007/s11227-021-03819-z. Online: 1-Jan-2022.
  • (2022) Interruptible Nodes: Reducing Queueing Costs in Irregular Streaming Dataflow Applications on Wide-SIMD Architectures. International Journal of Parallel Programming, 51(1):43-60. DOI: 10.1007/s10766-022-00745-2. Online: 5-Dec-2022.
  • (2021) GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments. Symmetry, 13(3):508. DOI: 10.3390/sym13030508. Online: 20-Mar-2021.
  • (2021) Gost: Enabling Efficient Spatio-Temporal GPU Sharing for Network Function Virtualization. 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), pp. 1-10. DOI: 10.1109/IWQOS52092.2021.9521266. Online: 25-Jun-2021.
  • (2021) CTXBack: Enabling Low Latency GPU Context Switching via Context Flashback. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 121-130. DOI: 10.1109/IPDPS49936.2021.00021. Online: May-2021.
