Global-State Aware Automatic NUMA Balancing

Authors:

Zhibin Yu

Internetware '24: Proceedings of the 15th Asia-Pacific Symposium on Internetware

Pages 317 - 326

https://doi.org/10.1145/3671016.3671380

Published: 24 July 2024

Abstract

Non-uniform memory access (NUMA) has become a standard architecture for modern servers. However, the NUMA effect (i.e., local memory accesses typically take less time than remote memory accesses) is unavoidable. To address this issue, Automatic NUMA Balancing (Auto-NUMA) was proposed. Nevertheless, Auto-NUMA can either improve or hurt the performance of an application, depending on the application's characteristics, which are difficult for end users to know.

To tackle this problem, we propose Global-State Aware Automatic NUMA Balancing (GSA-Auto-NUMA). It introduces two techniques. First, GSA-Auto-NUMA identifies a set of key metrics to accurately assess the current state of a NUMA system. Second, GSA-Auto-NUMA uses these metrics to decide in real time, through five evaluation steps, whether to enable Auto-NUMA.
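The idea of gating Auto-NUMA on system-wide metrics can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not list the key metrics or the five evaluation steps, so the `NumaState` fields, the thresholds, and the single-rule decision below are hypothetical placeholders. The only real interface used is the Linux sysctl file `/proc/sys/kernel/numa_balancing`, where writing `0` disables and `1` enables Automatic NUMA Balancing.

```python
from dataclasses import dataclass
from pathlib import Path

# Real Linux sysctl file for Automatic NUMA Balancing (0 = off, 1 = on).
NUMA_BALANCING_PATH = Path("/proc/sys/kernel/numa_balancing")


@dataclass
class NumaState:
    """Hypothetical snapshot of system-wide metrics (not the paper's metric set)."""
    remote_access_ratio: float  # fraction of memory accesses served by a remote node
    cpu_utilization: float      # average CPU utilization in [0, 1]


def should_enable_auto_numa(state, remote_threshold=0.3, cpu_threshold=0.9):
    """Illustrative rule with assumed thresholds: enable balancing only when
    remote traffic is high and the CPUs still have headroom for migration work."""
    return (state.remote_access_ratio >= remote_threshold
            and state.cpu_utilization < cpu_threshold)


def apply_decision(enable, path=NUMA_BALANCING_PATH):
    """Write the decision to the sysctl file (needs root on a real system)."""
    path.write_text("1\n" if enable else "0\n")
```

On a real machine, a monitoring loop would refresh `NumaState` periodically (for example from hardware performance counters) and call `apply_decision` only when the decision changes, which avoids rewriting the sysctl on every sample.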

We implemented GSA-Auto-NUMA on both ARM and x86 platforms and validated its performance through experiments. The results show that, unlike Auto-NUMA, GSA-Auto-NUMA at least does not hurt performance, and it improves performance for most applications. Moreover, GSA-Auto-NUMA outperforms Auto-NUMA by up to 0.47× and 1.20× on ARM and x86 NUMA servers, respectively.

Index Terms

Global-State Aware Automatic NUMA Balancing
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Allocation / deallocation strategies

Published In

Internetware '24: Proceedings of the 15th Asia-Pacific Symposium on Internetware

July 2024

518 pages

ISBN: 9798400707056

DOI: 10.1145/3671016

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 July 2024

Qualifiers

Research-article
Research
Refereed limited

Conference

Internetware 2024

Sponsor:

SIGSOFT

Internetware 2024: 15th Asia-Pacific Symposium on Internetware

July 24 - 26, 2024

Macau, China

Acceptance Rates

Overall Acceptance Rate 55 of 111 submissions, 50%
