
EagerMap: A Task Mapping Algorithm to Improve Communication and Load Balancing in Clusters of Multicore Systems

Published: 08 March 2019

Abstract

Communication between tasks and load imbalance have been identified as major challenges for the performance and energy efficiency of parallel applications. A common way to improve communication is to increase its locality, that is, to reduce the distance of data transfers, prioritizing faster and more efficient local interconnections over remote ones. Regarding load imbalance, all cores should execute a similar amount of work. An important problem in this context is how to determine an optimized mapping of tasks to cluster nodes and cores that increases overall locality and load balance. In this article, we propose the EagerMap algorithm to determine task mappings. It is based on a greedy heuristic that matches application communication patterns to hardware hierarchies, and it can also take the task load into account. Compared to previous algorithms, EagerMap is faster, scales better, and supports more types of computer systems, while maintaining the same or better quality of the determined task mapping. EagerMap is therefore an interesting choice for task mapping on a variety of modern parallel architectures.
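The core idea of a greedy, hierarchy-matching heuristic like the one the abstract describes can be illustrated with a small sketch. This is a simplified illustration, not the published EagerMap implementation: the `greedy_group` function and the example `comm` matrix are hypothetical. At one hierarchy level, tasks that exchange the most data are packed into the same group, so that heavy communication stays on the faster local interconnections.

```python
# Illustrative sketch of one greedy grouping step (hypothetical code,
# not the published EagerMap implementation): tasks that communicate
# heavily are packed into the same group, mirroring one level of the
# hardware hierarchy (e.g., cores sharing a cache).

def greedy_group(comm, group_size):
    """Partition task ids 0..n-1 into groups of `group_size`.

    `comm` is a symmetric matrix of communication volumes between tasks.
    Each group is seeded with the most communication-intensive remaining
    task, then greedily extended with the task that communicates most
    with the group built so far.
    """
    n = len(comm)
    ungrouped = set(range(n))
    groups = []
    while ungrouped:
        # Seed a new group with the ungrouped task that has the highest
        # total communication volume with the other ungrouped tasks.
        seed = max(ungrouped, key=lambda t: sum(comm[t][u] for u in ungrouped))
        group = [seed]
        ungrouped.remove(seed)
        while len(group) < group_size and ungrouped:
            # Add the task that communicates most with the current group.
            best = max(ungrouped, key=lambda t: sum(comm[t][g] for g in group))
            group.append(best)
            ungrouped.remove(best)
        groups.append(group)
    return groups

# Example: 4 tasks where pairs (0,1) and (2,3) communicate heavily;
# grouping in pairs should keep each heavy pair together.
comm = [
    [0, 10, 1, 1],
    [10, 0, 1, 1],
    [1, 1, 0, 10],
    [1, 1, 10, 0],
]
groups = greedy_group(comm, 2)
```

Applying such a grouping step recursively, treating each group as a single task with aggregated communication volumes, would mirror a multi-level hardware hierarchy such as cores, sockets, and cluster nodes.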




Published In

ACM Transactions on Parallel Computing, Volume 5, Issue 4
December 2018
112 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3314574
Editor: David Bader
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 March 2019
Accepted: 01 December 2018
Revised: 01 July 2018
Received: 01 April 2017
Published in TOPC Volume 5, Issue 4


Author Tags

  • Task mapping
  • Clusters
  • Communication
  • Locality

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • HPC4E project
  • EU H2020 Programme
  • MCTI/RNP-Brazil
  • Intel

