research-article
Open access

DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems

Published: 08 March 2019

Abstract

Conventional on-chip TLB hierarchies are unable to fully cover the growing application working-set sizes. To make matters worse, Last-Level TLB (LLT) misses require multiple accesses to the page table even with the use of page walk caches. Consequently, LLT misses incur long address translation latency and hurt performance. This article proposes two low-overhead hardware mechanisms for reducing the frequency and penalty of on-die LLT misses. The first, Unified CAche and TLB (UCAT), enables the conventional on-die Last-Level Cache to store cache lines and TLB entries in a single unified structure, significantly increasing on-die TLB capacity. The second, DRAM-TLB, memoizes virtual-to-physical address translations in DRAM and reduces the LLT miss penalty when UCAT cannot fully cover the total application working set. DRAM-TLB serves as the next larger level in the TLB hierarchy and significantly increases TLB coverage relative to on-chip TLBs. The combination of these two mechanisms, DUCATI, is an address translation architecture that improves GPU performance by 81% (up to 4.5×) while requiring minimal changes to the existing system design. We show that DUCATI is within 20%, 5%, and 2% of the performance of a perfect LLT system when using 4KB, 64KB, and 2MB pages, respectively.
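The lookup flow the abstract describes — probe the on-chip TLB hierarchy, then consult a DRAM-resident translation store before paying for a full multi-access page walk — can be illustrated with a toy latency model. This is a conceptual sketch only, not the paper's implementation: the class names, latency values, and the FIFO eviction policy are all assumptions made for illustration.

```python
# Toy model of the translation flow: on-chip TLB levels, then a
# DRAM-resident translation table (the DRAM-TLB idea), then a full
# page-table walk. All parameters below are illustrative, not measured.

PAGE_SHIFT = 12  # 4KB pages

class TLBLevel:
    def __init__(self, capacity, hit_latency):
        self.capacity = capacity
        self.hit_latency = hit_latency
        self.entries = {}  # vpn -> ppn (fully associative, FIFO eviction)

    def lookup(self, vpn):
        return self.entries.get(vpn)

    def insert(self, vpn, ppn):
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict oldest entry
        self.entries[vpn] = ppn

def translate(vaddr, levels, dram_tlb, page_table,
              dram_latency=200, walk_latency=800):
    """Return (paddr, cycles); the DRAM-TLB shortcuts the full walk."""
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    cycles = 0
    for lvl in levels:                    # probe on-chip TLB hierarchy
        cycles += lvl.hit_latency
        ppn = lvl.lookup(vpn)
        if ppn is not None:
            return (ppn << PAGE_SHIFT) | offset, cycles
    cycles += dram_latency                # one DRAM access to the DRAM-TLB
    ppn = dram_tlb.get(vpn)
    if ppn is None:                       # DRAM-TLB miss: pay the full walk
        cycles += walk_latency
        ppn = page_table[vpn]
        dram_tlb[vpn] = ppn               # memoize the translation in DRAM
    for lvl in levels:                    # refill the on-chip levels
        lvl.insert(vpn, ppn)
    return (ppn << PAGE_SHIFT) | offset, cycles
```

In this model, a cold access pays every level's probe latency plus the DRAM-TLB access and the page walk, while any later access to the same page hits the first on-chip level; the benefit of the DRAM-TLB shows up on pages evicted from on-chip TLBs but still memoized in DRAM, which pay only `dram_latency` instead of the full walk.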






Published In

ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 1
March 2019
157 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3313806
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 March 2019
Accepted: 01 January 2019
Revised: 01 January 2019
Received: 01 February 2018
Published in TACO Volume 16, Issue 1


Author Tags

  1. GPU
  2. TLB
  3. caches
  4. high bandwidth memory
  5. virtual memory

Qualifiers

  • Research-article
  • Research
  • Refereed


Article Metrics

  • Downloads (last 12 months): 727
  • Downloads (last 6 weeks): 132
Reflects downloads up to 22 Nov 2024.


Cited By

  • (2024) VPRI: Efficient I/O Page Fault Handling via Software-Hardware Co-Design for IaaS Clouds. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 541–557. https://doi.org/10.1145/3694715.3695957
  • (2024) Rethinking Page Table Structure for Fast Address Translation in GPUs: A Fixed-Size Hashed Page Table. In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 325–337. https://doi.org/10.1145/3656019.3676900
  • (2024) Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 834–847. https://doi.org/10.1109/ISCA59077.2024.00065
  • (2024) GPU Scale-Model Simulation. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1125–1140. https://doi.org/10.1109/HPCA57654.2024.00088
  • (2023) GPU Performance Acceleration via Intra-Group Sharing TLB. In Proceedings of the 52nd International Conference on Parallel Processing, 705–714. https://doi.org/10.1145/3605573.3605593
  • (2023) SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1195–1207. https://doi.org/10.1109/HPCA56546.2023.10071063
  • (2023) Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 456–470. https://doi.org/10.1109/HPCA56546.2023.10071054
  • (2022) Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. In Proceedings of the 2022 International Conference on Management of Data, 1017–1032. https://doi.org/10.1145/3514221.3517911
  • (2022) Qilin. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 1–9. https://doi.org/10.1145/3508352.3549431
  • (2022) Designing Virtual Memory System of MCM GPUs. In Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, 404–422. https://doi.org/10.1109/MICRO56248.2022.00036
