research-article

RSVM: a region-based software virtual memory for GPU

Authors:

Xiaosong Ma

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Pages 269 - 278

Published: 07 October 2013

Abstract

While Graphics Processing Units (GPUs) have gained much success in general-purpose computing in recent years, programming them remains difficult, particularly because GPU memory is explicitly managed and CPU-GPU data transfer is manual. Despite recent calls for managing GPU resources as first-class citizens in the operating system, a mature GPU memory management mechanism is still missing, which leads to reinventing the wheel in various GPU system software. Meanwhile, with ever-growing problem sizes, a system-level mechanism for unified CPU-GPU memory management is urgently needed.

In this work, we present the design of Region-based Software Virtual Memory (RSVM), a software virtual memory that runs on both the CPU and the GPU in a distributed and cooperative way. In addition to automatic GPU memory management and GPU-CPU data transfer, RSVM offers two novel features: 1) GPU kernel-issued on-demand data fetching from the host into GPU memory, and 2) transparent intra-kernel swapping of GPU memory out to main memory. Our study reveals important insights into the challenges and opportunities of building unified virtual memory systems for heterogeneous computing. Experimental results on real GPU benchmarks demonstrate that, at the cost of a small overhead, RSVM can transparently scale GPU kernels to problem sizes exceeding the device memory limit; developers write the same code for different problem sizes, yet can still optimize the data layout definition accordingly. Our evaluation also identifies GPU architecture features that are currently missing and would improve system software efficiency.
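The two features named in the abstract, on-demand fetching and swapping when device memory is full, can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not RSVM's actual design or API: the class `RegionStore`, the methods `create`, `map_region`, and `unmap_region`, and the least-recently-used eviction policy are all hypothetical, and the code runs sequentially in Python rather than inside a GPU kernel.

```python
from collections import OrderedDict


class RegionStore:
    """Toy model of a region-based software virtual memory.

    The host keeps a backing copy of every region; the "device" is a
    small cache limited to `capacity` bytes. All names are hypothetical.
    """

    def __init__(self, capacity):
        self.capacity = capacity       # device memory budget in bytes
        self.host = {}                 # region id -> bytearray (backing copy)
        self.device = OrderedDict()    # resident regions, least recently used first
        self.dirty = set()             # regions modified while resident
        self.pinned = set()            # regions currently mapped by a user
        self.fetches = 0               # host-to-device copies performed
        self.swaps = 0                 # regions evicted from the device

    def create(self, rid, size):
        """Allocate a zero-filled region on the host side only."""
        self.host[rid] = bytearray(size)

    def _used(self):
        return sum(len(buf) for buf in self.device.values())

    def _make_room(self, size):
        """Evict unmapped regions (assumed LRU order) until `size` bytes fit."""
        while self._used() + size > self.capacity:
            victim = next((r for r in self.device if r not in self.pinned), None)
            if victim is None:
                raise MemoryError("every resident region is still mapped")
            buf = self.device.pop(victim)
            if victim in self.dirty:   # write back only modified regions
                self.host[victim] = buf
                self.dirty.discard(victim)
            self.swaps += 1

    def map_region(self, rid, write=False):
        """Return the device copy of a region, fetching it on demand."""
        if rid not in self.device:
            size = len(self.host[rid])
            if size > self.capacity:
                raise MemoryError("region is larger than device memory")
            self._make_room(size)
            self.device[rid] = bytearray(self.host[rid])
            self.fetches += 1
        self.device.move_to_end(rid)   # mark as most recently used
        self.pinned.add(rid)
        if write:
            self.dirty.add(rid)
        return self.device[rid]

    def unmap_region(self, rid):
        """Release a mapping so the region becomes eligible for eviction."""
        self.pinned.discard(rid)
```

In this sketch, a caller that creates four 4-byte regions against an 8-byte device budget and maps them one after another never sees an allocation failure: older unmapped regions are written back to the host and evicted as newer ones are fetched, which is the sense in which the same code can run on a working set larger than device memory.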


Index Terms

RSVM: a region-based software virtual memory for GPU
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Virtual memory

Published In

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

October 2013

422 pages

ISBN:9781479910212

Conference Chair: Christian Fensch (University of Edinburgh, UK)
General Chair: Michael O'Boyle (University of Edinburgh, UK)
Program Chairs: André Seznec (INRIA Rennes, France), François Bodin (IRISA/CAPS Entreprise, France)

Sponsors

IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

IEEE Press

Qualifiers

Research-article

Acceptance Rates

PACT '13 Paper Acceptance Rate 36 of 208 submissions, 17%;

Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

Total Citations: 20
Total Downloads: 407
Downloads (last 12 months): 6
Downloads (last 6 weeks): 1

Reflects downloads up to 29 Nov 2024

Cited By

Yu H, Peters A, Akshintala A, Rossbach C, Larus J, Ceze L, Strauss K (2020). AvA: Accelerated Virtualization of Accelerators. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 807-825. DOI: 10.1145/3373376.3378466. Online publication date: 9-Mar-2020.
https://dl.acm.org/doi/10.1145/3373376.3378466
Yu H, Peters A, Akshintala A, Rossbach C (2019). Automatic Virtualization of Accelerators. Proceedings of the Workshop on Hot Topics in Operating Systems, pp. 58-65. DOI: 10.1145/3317550.3321423. Online publication date: 13-May-2019.
https://dl.acm.org/doi/10.1145/3317550.3321423
Goens A, Brauckmann A, Ertel S, Cummins C, Leather H, Castrillon J, Mattson T, Muzahid A, Solar-Lezama A (2019). A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 38-46. DOI: 10.1145/3315508.3329976. Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3315508.3329976
Shahar S, Bergman S, Silberstein M (2018). ActivePointers. ACM SIGOPS Operating Systems Review, 52:1, pp. 84-95. DOI: 10.1145/3273982.3273990. Online publication date: 28-Aug-2018.
https://dl.acm.org/doi/10.1145/3273982.3273990
Tanasic I, Gelado I, Jorda M, Ayguade E, Navarro N, Hunter H, Moreno J, Emer J, Sanchez D (2017). Efficient exception handling support for GPUs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 109-122. DOI: 10.1145/3123939.3123950. Online publication date: 14-Oct-2017.
https://dl.acm.org/doi/10.1145/3123939.3123950
Shahar S, Bergman S, Silberstein M (2016). ActivePointers. ACM SIGARCH Computer Architecture News, 44:3, pp. 596-608. DOI: 10.1145/3007787.3001200. Online publication date: 18-Jun-2016.
https://dl.acm.org/doi/10.1145/3007787.3001200
Silberstein M, Kim S, Huh S, Zhang X, Hu Y, Wated A, Witchel E (2016). GPUnet. ACM Transactions on Computer Systems, 34:3, pp. 1-31. DOI: 10.1145/2963098. Online publication date: 17-Sep-2016.
https://dl.acm.org/doi/10.1145/2963098
Shahar S, Silberstein M (2016). Supporting data-driven I/O on GPUs using GPUfs. Proceedings of the 9th ACM International on Systems and Storage Conference, pp. 1-11. DOI: 10.1145/2928275.2928276. Online publication date: 6-Jun-2016.
https://dl.acm.org/doi/10.1145/2928275.2928276
Shahar S, Bergman S, Silberstein M, Min S, Loh G (2016). ActivePointers. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 596-608. DOI: 10.1109/ISCA.2016.58. Online publication date: 18-Jun-2016.
https://dl.acm.org/doi/10.1109/ISCA.2016.58
Kwon S, Kim S, Kim J, Jeong J, Girault A, Guan N (2015). Managing GPU buffers for caching more apps in mobile systems. Proceedings of the 12th International Conference on Embedded Software, pp. 207-216. DOI: 10.5555/2830865.2830888. Online publication date: 4-Oct-2015.
https://dl.acm.org/doi/10.5555/2830865.2830888
