DOI: 10.1145/2541940.2541942
research-article

Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Published: 24 February 2014

Abstract

The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units to obtain the programmability benefits of virtual memory.
To this end, we are the first to explore GPU Memory Management Units (MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) for address translation in unified heterogeneous systems. We show the performance challenges posed by GPU warp schedulers on TLBs accessed in parallel with L1 caches, which provide many well-known programmability benefits. In response, we propose modest TLB and PTW augmentations that recover most of the performance lost by introducing L1-parallel TLB access. We also show that a little TLB-awareness can make other GPU performance enhancements (e.g., cache-conscious warp scheduling and dynamic warp formation on branch divergence) feasible in the face of cache-parallel address translation, bringing overheads into the range deemed acceptable for CPUs (10-15% of runtime). We presume this initial design leaves room for improvement but anticipate that our bigger insight, that a little TLB-awareness goes a long way in GPUs, will spur further work in this fruitful area.
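As a rough illustration of why warp-wide memory accesses can stress a TLB, the following Python sketch counts how many distinct page translations a single warp instruction may need and feeds them to a small LRU TLB. It is not the paper's simulator or design: the 4 KiB page size, the 32-thread warp, the fully associative LRU organization, and the names `pages_touched`, `TinyTLB`, and `warp_access` are all assumptions made for this example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages
WARP_SIZE = 32    # assumed number of threads per warp


def pages_touched(addresses, page_size=PAGE_SIZE):
    """Return the distinct virtual page numbers in one warp access, in first-seen order."""
    return list(OrderedDict.fromkeys(addr // page_size for addr in addresses))


class TinyTLB:
    """Hypothetical fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()  # virtual page number -> present
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        """Return True on a hit; on a miss, install the entry and evict the LRU one if full."""
        if vpn in self.map:
            self.map.move_to_end(vpn)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)  # evict least recently used
        self.map[vpn] = True
        return False


def warp_access(tlb, addresses):
    """Translate one warp-wide memory instruction; return the number of TLB misses it caused."""
    before = tlb.misses
    for vpn in pages_touched(addresses):
        tlb.lookup(vpn)
    return tlb.misses - before


# One address per thread: a coalesced access stays on one page,
# while a page-strided access needs a separate translation per thread.
coalesced = [4 * i for i in range(WARP_SIZE)]
strided = [PAGE_SIZE * i for i in range(WARP_SIZE)]
```

Under these assumptions, `coalesced` needs one translation while `strided` needs 32, so a single divergent warp instruction can miss many times in a small TLB. Each such miss would require a page table walk in a real MMU, which this sketch does not model.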

    Published In

    ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
    February 2014
    780 pages
    ISBN: 9781450323055
    DOI: 10.1145/2541940
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. gpus
    2. mmus
    3. tlbs
    4. unified address space

    Qualifiers

    • Research-article

    Conference

    ASPLOS '14

    Acceptance Rates

    ASPLOS '14 Paper Acceptance Rate 49 of 217 submissions, 23%;
    Overall Acceptance Rate 535 of 2,713 submissions, 20%
