Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1122971.1122987acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
Article

Hardware profile-guided automatic page placement for ccNUMA systems

Published: 29 March 2006 Publication History

Abstract

Cache coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing plat-forms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) results in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time.In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates pages near processors that most frequently access that page. The scheme leverages performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is widely automated, and handles not only static but also dynamic memory allocation.We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources with respect to the cost (e.g., time to trace, number of records in profile) vs. the accuracy of the profile and the corresponding savings in wall-clock execution time. We show that long-latency loads provide a better indicator for page placement than TLB misses.Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% over the overall original program wallclock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user mode interrupt-driven profile-guided page placement scheme that requires no special compiler, operating system or network interconnect support.

References

[1]
C versions of nas-2.3 serial programs. http://phase.hpcc.jp/Omni/benchmarks/NPB, 2003.]]
[2]
D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. The International Journal of Supercomputer Applications, 5(3):63--73, Fall 1991.]]
[3]
W. Bolosky, M. Scott, R. Fitzgerald, R. Fowler, and A. Cox. NUMA policies and their relation to memory architecture. In Proceedings of the fourth International Conference on Architecture Support for Programming Languages and Operating Systems, pages 212--221, 1991.]]
[4]
J. Bull and C. Johnson. Data Distribution, Migration and Replication on a ccNUMA Architecture. In Proceedings of the Fourth European Workshop on OpenMP, 2002.]]
[5]
Hewlett-Packard. Perfmon project.]]
[6]
Intel. Intel Itanium2 Processor Reference Manual for Software Development and Optimization, volume 1. Intel, 2004.]]
[7]
J. Marathe, F. Mueller, and B. R. de Supinski. A hybrid hardware/software approach to efficiently determine cache coherence bottlenecks. In International Conference on Supercomputing, June 2005.]]
[8]
D. Nikolopoulos, T. Papatheodorou, C. Polychronopoulos, J. Labarta, and E. Ayguade. User-level dynamic page migration for multipro-grammed shared-memory multiprocessors. In International Conference on Parallel Programming, pages 95--103, Aug. 2000.]]
[9]
D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta, and E. Ayguade. UPMLIB: A runtime system for tuning the memory performance of openmp programs on scalable shared-memory multiprocessors. In Languages, Compilers, and Run-Time Systems for Scalable Computers, pages 85--99, 2000.]]
[10]
L. Noordergraaf and R. Zak. Performance experiences on Sun's Wildfire prototype. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing, 1999.]]
[11]
M. M. Tikir and J. K. Hollingsworth. Using hardware counters to automatically improve memory performance. In SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 46, Washington, DC, USA, 2004. IEEE Computer Society.]]
[12]
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on ccNUMA compute servers. In Proceedings of the seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--289, 1996.]]

Cited By

View all
  • (2024)GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00085(1080-1094)Online publication date: 2-Mar-2024
  • (2023)MC-ELMM: Multi-Chip Endurance-Limited Memory ManagementProceedings of the International Symposium on Memory Systems10.1145/3631882.3631905(1-16)Online publication date: 2-Oct-2023
  • (2023)NUBA: Non-Uniform Bandwidth GPUsProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3575693.3575745(544-559)Online publication date: 27-Jan-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
March 2006
258 pages
ISBN:1595931899
DOI:10.1145/1122971
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 March 2006

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. NUMA
  2. hardware performance monitoring
  3. page placement
  4. profile-guided optimization

Qualifiers

  • Article

Conference

PPoPP06
Sponsor:

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)9
  • Downloads (Last 6 weeks)2
Reflects downloads up to 02 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00085(1080-1094)Online publication date: 2-Mar-2024
  • (2023)MC-ELMM: Multi-Chip Endurance-Limited Memory ManagementProceedings of the International Symposium on Memory Systems10.1145/3631882.3631905(1-16)Online publication date: 2-Oct-2023
  • (2023)NUBA: Non-Uniform Bandwidth GPUsProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3575693.3575745(544-559)Online publication date: 27-Jan-2023
  • (2022)MaPHeA: A Framework for Lightweight Memory Hierarchy-aware Profile-guided Heap AllocationACM Transactions on Embedded Computing Systems10.1145/352785322:1(1-28)Online publication date: 13-Dec-2022
  • (2022)DAOSProceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing10.1145/3502181.3531466(4-15)Online publication date: 27-Jun-2022
  • (2022)MOESI-primeProceedings of the 49th Annual International Symposium on Computer Architecture10.1145/3470496.3527427(670-684)Online publication date: 18-Jun-2022
  • (2021)MaPHeA: a lightweight memory hierarchy-aware profile-guided heap allocation frameworkProceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3461648.3463844(24-36)Online publication date: 22-Jun-2021
  • (2021)Online Thread and Data Mapping Using a Sharing-Aware Memory Management UnitACM Transactions on Modeling and Performance Evaluation of Computing Systems10.1145/34336875:4(1-28)Online publication date: 21-Jan-2021
  • (2021)Dancing in the Dark: Profiling for Tiered Memory2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS49936.2021.00011(13-22)Online publication date: May-2021
  • (2019)Data and Thread Placement in NUMA ArchitecturesProceedings of the 48th International Conference on Parallel Processing10.1145/3337821.3337893(1-10)Online publication date: 5-Aug-2019
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media