
Mosaic: Enabling Application-Transparent Support for Multiple Page Sizes in Throughput Processors

Published: 28 August 2018

Abstract

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.
In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide.

In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages.
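The contiguity-conserving allocation idea can be illustrated with a toy model: if every 2MB-aligned virtual region is always backed by a single free 2MB physical frame, then the 512 base pages inside it are physically contiguous by construction, and "coalescing" reduces to a page-table status change rather than a data copy. The sketch below is our own simplified illustration of this principle, not Mosaic's actual interface; all class and method names are hypothetical.

```python
LARGE_PAGE = 2 * 1024 * 1024            # 2MB large page
BASE_PAGE = 4 * 1024                    # 4KB base page
BASES_PER_LARGE = LARGE_PAGE // BASE_PAGE  # 512 base pages per large frame

class ContiguityAwareAllocator:
    """Toy allocator: each virtual 2MB frame is backed by exactly one
    physical 2MB frame, so virtually contiguous base pages are always
    physically contiguous (hypothetical names, for illustration only)."""

    def __init__(self, num_large_frames):
        self.free_frames = list(range(num_large_frames))
        self.pframe_of_vframe = {}  # virtual 2MB frame -> physical 2MB frame
        self.coalesced = set()      # virtual frames currently mapped as 2MB pages

    def alloc_base(self, vaddr):
        """Allocate the 4KB base page containing vaddr; returns its
        physical base-page number."""
        vframe = vaddr // LARGE_PAGE
        if vframe not in self.pframe_of_vframe:
            # First touch of this 2MB virtual region: reserve a whole
            # physical large frame for it.
            self.pframe_of_vframe[vframe] = self.free_frames.pop()
        pframe = self.pframe_of_vframe[vframe]
        offset = (vaddr % LARGE_PAGE) // BASE_PAGE
        return pframe * BASES_PER_LARGE + offset

    def coalesce(self, vframe):
        # The 512 base pages are already physically contiguous, so no
        # data migration is needed -- only a mapping update (modeled here
        # as a flag).
        self.coalesced.add(vframe)

    def splinter(self, vframe):
        # Likewise, splitting back into base pages moves no data.
        self.coalesced.discard(vframe)
```

For example, two base-page allocations at consecutive virtual addresses land on consecutive physical base pages, so the enclosing 2MB region can later be coalesced without copying:

```python
alloc = ContiguityAwareAllocator(num_large_frames=4)
p0 = alloc.alloc_base(0)
p1 = alloc.alloc_base(4096)
assert p1 == p0 + 1        # physically contiguous
alloc.coalesce(0)          # no migration required
```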




Published In

ACM SIGOPS Operating Systems Review, Volume 52, Issue 1 (Special Topics), July 2018, 133 pages
ISSN: 0163-5980
DOI: 10.1145/3273982

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPGPU applications
  2. address translation
  3. demand paging
  4. graphics processing units
  5. large pages
  6. virtual memory management

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 3

Reflects downloads up to 13 Nov 2024

Cited By

  • (2023) Accelerating Extra Dimensional Page Walks for Confidential Computing. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 654-669. DOI: 10.1145/3613424.3614293. Online publication date: 28-Oct-2023.
  • (2023) Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUs. IEEE Transactions on Parallel and Distributed Systems, pp. 1-14. DOI: 10.1109/TPDS.2023.3268662. Online publication date: 2023.
  • (2021) Grus. ACM Transactions on Architecture and Code Optimization, 18(2), pp. 1-25. DOI: 10.1145/3444844. Online publication date: 9-Feb-2021.
  • (2021) Modeling and Analysis of the Page Sizing Problem for NVM Storage in Virtualized Systems. IEEE Access, 9, pp. 52839-52850. DOI: 10.1109/ACCESS.2021.3069966. Online publication date: 2021.
  • (2020) A Comprehensive Analysis of Superpage Management Mechanisms and Policies. Proceedings of the 2020 USENIX Annual Technical Conference, pp. 829-842. DOI: 10.5555/3489146.3489203. Online publication date: 15-Jul-2020.
  • (2020) Post-Render Warp with Late Input Sampling Improves Aiming Under High Latency Conditions. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 3(2), pp. 1-18. DOI: 10.1145/3406187. Online publication date: 26-Aug-2020.
  • (2020) Concurrent Binary Trees (with Application to Longest Edge Bisection). Proceedings of the ACM on Computer Graphics and Interactive Techniques, 3(2), pp. 1-20. DOI: 10.1145/3406186. Online publication date: 26-Aug-2020.
  • (2020) Hardware-Accelerated Dual-Split Trees. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 3(2), pp. 1-21. DOI: 10.1145/3406185. Online publication date: 26-Aug-2020.
  • (2020) Efficient Adaptive Deferred Shading with Hardware Scatter Tiles. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 3(2), pp. 1-17. DOI: 10.1145/3406184. Online publication date: 26-Aug-2020.
  • (2020) Sub-Triangle Opacity Masks for Faster Ray Tracing of Transparent Objects. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 3(2), pp. 1-12. DOI: 10.1145/3406180. Online publication date: 26-Aug-2020.
