Article

Hardware profile-guided automatic page placement for ccNUMA systems

Authors:

Jaydeep Marathe,

Frank Mueller

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 90 - 99

https://doi.org/10.1145/1122971.1122987

Published: 29 March 2006

Abstract

Cache-coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing platforms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space, but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) result in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce the number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time.

In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates each page near the processors that access it most frequently. The scheme leverages the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation.

We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources, long-latency loads and TLB misses, with respect to the cost (e.g., time to trace, number of records in the profile) vs. the accuracy of the profile and the corresponding savings in wall-clock execution time. We show that long-latency loads provide a better indicator for page placement than TLB misses.

Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% of the original program's wall-clock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user-mode, interrupt-driven, profile-guided page placement scheme that requires no special compiler, operating system, or network interconnect support.
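The affinity decision described in the abstract (bind each page to the node that accesses it most often) can be illustrated with a small sketch. This is not the authors' implementation: the trace format of `(node_id, address)` pairs, the 16 KB page size, the tie-breaking rule, and the function names `page_affinity` and `remote_fraction` are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Assumed page size in bytes; the real value is platform-dependent.
PAGE_SIZE = 16 * 1024


def page_affinity(trace, page_size=PAGE_SIZE):
    """Map each page number to the node that accessed it most often.

    `trace` is an iterable of (node_id, address) pairs, a hypothetical
    stand-in for sampled hardware profile records.
    """
    counts = defaultdict(Counter)
    for node_id, address in trace:
        counts[address // page_size][node_id] += 1

    affinity = {}
    for page, per_node in counts.items():
        # Highest access count wins; on a tie, prefer the lowest node id
        # (an arbitrary choice made for this sketch).
        affinity[page] = min(per_node, key=lambda n: (-per_node[n], n))
    return affinity


def remote_fraction(trace, affinity, page_size=PAGE_SIZE):
    """Fraction of sampled accesses that are remote under `affinity`."""
    records = list(trace)
    if not records:
        return 0.0
    remote = sum(
        1 for node_id, address in records
        if affinity[address // page_size] != node_id
    )
    return remote / len(records)


# Hypothetical sampled trace: (node_id, address)
sample_trace = [
    (0, 0x0000), (0, 0x0100), (1, 0x0200),  # page 0: node 0 twice, node 1 once
    (1, 0x4000), (1, 0x4008), (1, 0x7FF8),  # page 1: node 1 only
    (0, 0x8000), (1, 0x8010),               # page 2: tie between nodes 0 and 1
]
placement = page_affinity(sample_trace)
```

Here `placement` maps pages 0 and 2 to node 0 and page 1 to node 1. Binding a page to the chosen node would then go through a platform-specific mechanism, which this sketch deliberately leaves out.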

Published In

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

March 2006

258 pages

ISBN: 1595931899

DOI: 10.1145/1122971

General Chair: Josep Torrellas (University of Illinois)

Program Chair: Siddhartha Chatterjee (IBM Research)

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PPoPP06: ACM SIGPLAN 2006 Symposium on Principles and Practice of Parallel Programming 2006

March 29 - 31, 2006

New York, New York, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
