
EagerMap: A Task Mapping Algorithm to Improve Communication and Load Balancing in Clusters of Multicore Systems

Published: 08 March 2019

Abstract

Communication between tasks and load imbalance have been identified as major challenges for the performance and energy efficiency of parallel applications. A common way to improve communication is to increase its locality, that is, to reduce the distance of data transfers, prioritizing faster and more efficient local interconnections over remote ones. Regarding load imbalance, all cores should execute a similar amount of work. An important problem in this context is how to determine an optimized mapping of tasks to cluster nodes and cores that increases overall locality and load balance. In this article, we propose the EagerMap algorithm to determine task mappings. It is based on a greedy heuristic that matches application communication patterns to hardware hierarchies, and it can also take the task load into account. Compared to previous algorithms, EagerMap is faster, scales better, and supports more types of computer systems, while maintaining the same or better quality of the determined task mapping. EagerMap is therefore an interesting choice for task mapping on a variety of modern parallel architectures.
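The core idea of a greedy, hierarchy-matching heuristic like the one the abstract describes can be illustrated with a small sketch. This is a simplified illustration, not the published EagerMap implementation: the `greedy_group` function and the example `comm` matrix are hypothetical. At one hierarchy level, tasks that exchange the most data are packed into the same group, so that heavy communication stays on the faster local interconnections.

```python
# Illustrative sketch of one greedy grouping step (hypothetical code,
# not the published EagerMap implementation): tasks that communicate
# heavily are packed into the same group, mirroring one level of the
# hardware hierarchy (e.g., cores sharing a cache).

def greedy_group(comm, group_size):
    """Partition task ids 0..n-1 into groups of `group_size`.

    `comm` is a symmetric matrix of communication volumes between tasks.
    Each group is seeded with the most communication-intensive remaining
    task, then greedily extended with the task that communicates most
    with the group built so far.
    """
    n = len(comm)
    ungrouped = set(range(n))
    groups = []
    while ungrouped:
        # Seed a new group with the ungrouped task that has the highest
        # total communication volume with the other ungrouped tasks.
        seed = max(ungrouped, key=lambda t: sum(comm[t][u] for u in ungrouped))
        group = [seed]
        ungrouped.remove(seed)
        while len(group) < group_size and ungrouped:
            # Add the task that communicates most with the current group.
            best = max(ungrouped, key=lambda t: sum(comm[t][g] for g in group))
            group.append(best)
            ungrouped.remove(best)
        groups.append(group)
    return groups

# Example: 4 tasks where pairs (0,1) and (2,3) communicate heavily;
# grouping in pairs should keep each heavy pair together.
comm = [
    [0, 10, 1, 1],
    [10, 0, 1, 1],
    [1, 1, 0, 10],
    [1, 1, 10, 0],
]
groups = greedy_group(comm, 2)
```

Applying such a grouping step recursively, treating each group as a single task with aggregated communication volumes, would mirror a multi-level hardware hierarchy such as cores, sockets, and cluster nodes.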




Published In

ACM Transactions on Parallel Computing, Volume 5, Issue 4
December 2018
112 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3314574
Editor: David Bader
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 March 2019
Accepted: 01 December 2018
Revised: 01 July 2018
Received: 01 April 2017
Published in TOPC Volume 5, Issue 4


Author Tags

  • Task mapping
  • Clusters
  • Communication
  • Locality

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • HPC4E project
  • EU H2020 Programme
  • MCTI/RNP-Brazil
  • Intel

