Page Placement Strategies for GPUs within Heterogeneous Memory Systems

Published: 14 March 2015

Abstract

Systems from smartphones to supercomputers are increasingly heterogeneous, being composed of both CPUs and GPUs. To maximize cost and energy efficiency, these systems will increasingly use globally-addressable heterogeneous memory systems, making choices about memory page placement critical to performance. In this work we show that current page placement policies are not sufficient to maximize GPU performance in these heterogeneous memory systems. We propose two new page placement policies that improve GPU performance: one application agnostic and one using application profile information. Our application agnostic policy, bandwidth-aware (BW-AWARE) placement, maximizes GPU throughput by balancing page placement across the memories based on the aggregate memory bandwidth available in a system. Our simulation-based results show that BW-AWARE placement outperforms the existing Linux INTERLEAVE and LOCAL policies by 35% and 18% on average for GPU compute workloads. We build upon BW-AWARE placement by developing a compiler-based profiling mechanism that provides programmers with information about GPU application data structure access patterns. Combining this information with simple program-annotated hints about memory placement, our hint-based page placement approach performs within 90% of oracular page placement on average, largely mitigating the need for costly dynamic page tracking and migration.
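
To make the BW-AWARE idea concrete, the sketch below shows one way a bandwidth-proportional placement policy could be expressed. It is a minimal user-space illustration, not the paper's implementation (which operates inside the OS page allocator): the zone names, bandwidth figures, and the weighted round-robin selection are assumptions chosen for clarity.

/* Minimal sketch of a bandwidth-aware (BW-AWARE) placement policy.
 * Hypothetical zone names and bandwidths, for illustration only; the
 * paper's policy lives inside the OS page allocator, not in user code.
 * Pages are spread across memories in proportion to each memory's
 * bandwidth, e.g. 200 GB/s GPU memory and 80 GB/s CPU memory yield a
 * 5:2 placement ratio.
 */
#include <stdio.h>

struct mem_zone {
    const char *name;
    double bandwidth_gbps;   /* peak bandwidth of this memory */
    long   pages_placed;     /* pages assigned to it so far */
};

/* Pick the zone whose current share of pages is furthest below its
 * bandwidth-proportional target (a weighted round-robin). */
static int bw_aware_pick(struct mem_zone *zones, int nzones, long total_pages)
{
    double total_bw = 0.0;
    for (int i = 0; i < nzones; i++)
        total_bw += zones[i].bandwidth_gbps;

    int best = 0;
    double best_deficit = -1.0;
    for (int i = 0; i < nzones; i++) {
        double target  = (zones[i].bandwidth_gbps / total_bw) * (total_pages + 1);
        double deficit = target - zones[i].pages_placed;
        if (deficit > best_deficit) {
            best_deficit = deficit;
            best = i;
        }
    }
    return best;
}

int main(void)
{
    /* Hypothetical two-zone system: GPU-attached and CPU-attached memory. */
    struct mem_zone zones[] = {
        { "GPU-GDDR5", 200.0, 0 },
        { "CPU-DDR4",   80.0, 0 },
    };
    const long npages = 1000;

    for (long p = 0; p < npages; p++)
        zones[bw_aware_pick(zones, 2, p)].pages_placed++;

    for (int i = 0; i < 2; i++)
        printf("%s: %ld pages (%.1f%%)\n", zones[i].name,
               zones[i].pages_placed,
               100.0 * zones[i].pages_placed / npages);
    return 0;
}

With the assumed 200 GB/s GPU memory and 80 GB/s CPU memory, the sketch converges to roughly a 71%/29% split of pages, i.e. placement in proportion to each memory's share of the aggregate bandwidth. The paper's hint-based policy then refines such a split using per-data-structure access information gathered by the compiler-based profiling pass.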

Information & Contributors

Published In

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN: 9781450328357
DOI: 10.1145/2694344

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bandwidth
  2. linux
  3. page placement
  4. program annotation

Qualifiers

  • Research-article

Funding Sources

  • US Department of Energy
  • NSF

Conference

ASPLOS '15

Acceptance Rates

ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%

