Poster

Graph partitioning applied to DAG scheduling to reduce NUMA effects

Authors:

Isaac Sánchez Barrera,

Mateo Valero

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 419 - 420

https://doi.org/10.1145/3178487.3178535

Published: 10 February 2018

Abstract

The complexity of shared-memory systems is becoming more relevant as the number of memory domains increases, because access latency and bandwidth depend on the proximity between the cores and the devices that hold the data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist of migrating threads, memory pages, or both, and are typically applied by the system software.

We propose techniques at the runtime-system level to reduce NUMA effects on parallel applications. We leverage runtime-system metadata in the form of a task dependency graph. Our approach, based on graph partitioning methods, achieves an average parallel performance improvement of 1.12x over the state of the art.
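The abstract does not specify which partitioning algorithm the runtime uses. As a rough illustration only, the sketch below maps a task dependency graph onto NUMA nodes with a simple greedy heuristic: tasks are visited in dependency order, and each one is placed on the node that already holds most of its input bytes, provided that node stays under a load cap. The quality metric is the edge cut, i.e. the bytes carried by dependencies that cross nodes. This is not the authors' method; the function names (`partition_tasks`, `cut_bytes`), the per-task work estimates, the byte counts on edges, and the `slack` parameter are all hypothetical. A real runtime might instead hand the graph to a dedicated partitioner such as METIS or Scotch.

```python
from collections import defaultdict, deque


def topological_order(tasks, edges):
    """Return task ids in an order that respects every dependency (Kahn's algorithm)."""
    indegree = {t: 0 for t in tasks}
    succs = defaultdict(list)
    for src, dst, _ in edges:
        succs[src].append(dst)
        indegree[dst] += 1
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in succs[t]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(tasks):
        raise ValueError("task graph contains a cycle")
    return order


def partition_tasks(tasks, edges, num_nodes, slack=1.2):
    """Greedily map each task of a DAG to a NUMA node (hypothetical heuristic).

    tasks: dict mapping task id -> estimated work
    edges: list of (producer, consumer, bytes) dependencies
    A task is placed on the node that already holds most of its input bytes,
    as long as that node stays under a load cap of slack * (total work / nodes).
    """
    preds = defaultdict(list)
    for src, dst, nbytes in edges:
        preds[dst].append((src, nbytes))
    cap = slack * sum(tasks.values()) / num_nodes
    load = [0.0] * num_nodes
    mapping = {}
    for t in topological_order(tasks, edges):
        # Bytes of this task's inputs already resident on each node.
        affinity = [0] * num_nodes
        for src, nbytes in preds[t]:
            affinity[mapping[src]] += nbytes
        # Rank nodes by input affinity first, then by current load.
        ranked = sorted(range(num_nodes), key=lambda n: (-affinity[n], load[n]))
        node = next((n for n in ranked if load[n] + tasks[t] <= cap), None)
        if node is None:
            # Every node would exceed the cap: fall back to the least-loaded one.
            node = min(range(num_nodes), key=lambda n: load[n])
        mapping[t] = node
        load[node] += tasks[t]
    return mapping


def cut_bytes(mapping, edges):
    """Total bytes carried by dependencies whose endpoints sit on different nodes."""
    return sum(nbytes for src, dst, nbytes in edges if mapping[src] != mapping[dst])


if __name__ == "__main__":
    # Two independent producer-consumer chains; every dependency carries 100 bytes.
    tasks = {t: 1 for t in "abcdef"}
    edges = [("a", "b", 100), ("b", "c", 100), ("d", "e", 100), ("e", "f", 100)]
    mapping = partition_tasks(tasks, edges, num_nodes=2)
    print(mapping)                    # each chain stays on a single node
    print(cut_bytes(mapping, edges))  # 0
```

The load cap is what keeps the greedy rule from piling every task onto the node of the first producer. With `slack=1.0` the cap forces a strictly balanced split, at the price of more cross-node traffic.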


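Cited By

Catalán S, Igual F, Herrero J, Rodríguez-Sánchez R, Quintana-Ortí E. 2023. Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures. Journal of Parallel and Distributed Computing 175:C, 51-65. https://dl.acm.org/doi/10.1016/j.jpdc.2023.01.004 (online publication date: 1 May 2023)

Caheny P, Alvarez L, Casas M, Moreto M. 2022. TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. https://doi.org/10.1109/SC41404.2022.00085 (online publication date: Nov 2022)

Antoniadis K, Guerraoui R, Trigonakis V. 2020. Thread-Placement Learning. In 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 877-887. https://doi.org/10.1109/ICDCS47774.2020.00050 (online publication date: Nov 2020)

Liu W, Liu H, Liao X, Jin H, Zhang Y. 2021. HNGraph: Parallel Graph Processing in Hybrid Memory Based NUMA Systems. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), 388-397. https://doi.org/10.1109/Cluster48925.2021.00063 (online publication date: Sep 2021)

Neves F, Vilaça R, Pereira J. 2020. Black-box inter-application traffic monitoring for adaptive container placement. In Proceedings of the 35th Annual ACM Symposium on Applied Computing (eds. Hung C, Cerny T, Shin D, Bechini A), 259-266. https://dl.acm.org/doi/10.1145/3341105.3374007 (online publication date: 30 Mar 2020)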

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Computing methodologies
  1. Parallel computing methodologies


Information

Published In

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2018

442 pages

ISBN:9781450349826

DOI:10.1145/3178487

General Chair:
Andreas Krall
Vienna University of Technology, Austria

Program Chair:
Thomas R. Gross
ETH Zürich, Switzerland

ACM SIGPLAN Notices Volume 53, Issue 1
PPoPP '18
January 2018
426 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3200691
Editor:
Matthew Fluet
Rochester Institute of Technology

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Funding Sources

Ministerio de Economía y Competitividad
Ministerio de Educación, Cultura y Deporte
European Research Council

Conference

PPoPP '18

PPoPP '18: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics

Article metrics (downloads reflect data up to 05 Mar 2025): 5 total citations; 388 total downloads, of which 24 in the last 12 months and 3 in the last 6 weeks.

