research-article

Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Authors:

Bharath Pichai,

Abhishek Bhattacharjee

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Pages 743 - 758

https://doi.org/10.1145/2541940.2541942

Published: 24 February 2014

Abstract

The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units to obtain the programmability benefits of virtual memory.

To this end, we are the first to explore GPU Memory Management Units (MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) for address translation in unified heterogeneous systems. We show the performance challenges posed by GPU warp schedulers on TLBs accessed in parallel with L1 caches, which provide many well-known programmability benefits. In response, we propose modest TLB and PTW augmentations that recover most of the performance lost by introducing L1 parallel TLB access. We also show that a little TLB-awareness can make other GPU performance enhancements (e.g., cache-conscious warp scheduling and dynamic warp formation on branch divergence) feasible in the face of cache-parallel address translation, bringing overheads in the range deemed acceptable for CPUs (10-15% of runtime). We presume this initial design leaves room for improvement but anticipate that our bigger insight, that a little TLB-awareness goes a long way in GPUs, will spur further work in this fruitful area.
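
As a rough illustration of the TLB and page table walker pairing mentioned above, the following Python sketch models a toy MMU and counts how many page table walks one warp's memory accesses trigger. It is not the paper's design: the `TinyMMU` class, the `warp_walks` helper, the 4 KiB page size, the 4-entry LRU TLB, the 32-lane warp, and the flat dictionary standing in for a page table are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages


class TinyMMU:
    """Toy MMU (hypothetical): a small LRU TLB backed by a page table walk."""

    def __init__(self, page_table, tlb_entries=4):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.tlb = OrderedDict()       # cached translations, least recently used first
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.walks = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            # TLB hit: refresh the entry's LRU position.
            self.hits += 1
            self.tlb.move_to_end(vpn)
        else:
            # TLB miss: "walk" the page table, then cache the translation.
            self.walks += 1
            self.tlb[vpn] = self.page_table[vpn]
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)  # evict the least recently used entry
        return self.tlb[vpn] * PAGE_SIZE + offset


def warp_walks(stride, lanes=32):
    """Count page table walks when each lane accesses address lane * stride."""
    mmu = TinyMMU({vpn: vpn + 100 for vpn in range(64)})
    for lane in range(lanes):
        mmu.translate(lane * stride)
    return mmu.walks
```

Under these assumptions, `warp_walks(4)` needs a single walk because all 32 lanes touch the same page, while `warp_walks(PAGE_SIZE)` needs 32 walks because every lane touches a different page. This is only meant to convey why divergent per-lane addresses can stress translation hardware; it says nothing about the specific augmentations the paper proposes.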

Index Terms

Computer systems organization > Architectures > Parallel architectures > Single instruction, multiple data

Published In

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

February 2014

780 pages

ISBN:9781450323055

DOI:10.1145/2541940

General Chairs: Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah)

Program Chair: Sarita Adve (University of Illinois at Urbana-Champ

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1
ASPLOS '14
March 2014
729 pages
ISSN:0163-5964
DOI:10.1145/2654822
Editor: Doug DeGroot

ACM SIGPLAN Notices, Volume 49, Issue 4
ASPLOS '14
April 2014
729 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2644865
Editors: Mark W. Bailey (Hamilton College, Clinton, NY), Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah), Sarita Adve (University of Illinois at Urbana-Champ

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '14: Architectural Support for Programming Languages and Operating Systems

March 1 - 5, 2014

Salt Lake City, Utah, USA

Acceptance Rates

ASPLOS '14 Paper Acceptance Rate 49 of 217 submissions, 23%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%
