Article

Topology-aware tile mapping for clusters of SMPs

Authors:

Daniel Chavarría-Miranda,

Jarek Nieplocha,

Vinod Tipparaju

CF '06: Proceedings of the 3rd conference on Computing frontiers

Pages 383 - 392

https://doi.org/10.1145/1128022.1128073

Published: 03 May 2006

Abstract

We propose a technique to optimize the performance of applications using distributed dense arrays and characterized by a nearest-neighbor communication profile by exploiting the topology of SMP clusters. The topological information is used to map array tiles to processors to reduce network communication and improve utilization of shared memory for inter-process communication. The potential benefits of using the SMP-aware mapping are demonstrated through a simulation, as well as a real application solving a wind-driven ocean circulation model on an IBM SP. On 256 processors, the execution time was reduced by almost 30 percent without any changes to the original application source code. The proposed mapping approach is applicable to multiple programming models and distributed array management systems.
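
The effect described in the abstract can be illustrated with a small counting experiment. The sketch below is not the authors' algorithm; it is a minimal illustration that assumes a 16 x 16 grid of tiles (256 tiles, one per processor), hypothetical 16-way SMP nodes, and a non-periodic nearest-neighbor stencil. The function names `row_major_mapping`, `blocked_mapping`, and `inter_node_edges` are invented for this example. It compares a rank-order mapping with a mapping that gives each node a compact 4 x 4 block of tiles, and counts how many neighbor exchanges must cross the network.

```python
from itertools import product


def row_major_mapping(rows, cols, procs_per_node):
    """Rank-order mapping: tile (r, c) has rank r*cols + c, and
    consecutive ranks fill one node before moving to the next."""
    return {(r, c): (r * cols + c) // procs_per_node
            for r, c in product(range(rows), range(cols))}


def blocked_mapping(rows, cols, block_rows, block_cols):
    """SMP-aware mapping (illustrative): each node owns one compact
    block_rows x block_cols sub-grid of tiles."""
    assert rows % block_rows == 0 and cols % block_cols == 0
    blocks_per_row = cols // block_cols
    return {(r, c): (r // block_rows) * blocks_per_row + (c // block_cols)
            for r, c in product(range(rows), range(cols))}


def inter_node_edges(mapping, rows, cols):
    """Count nearest-neighbor tile pairs whose owners sit on different
    nodes, i.e. exchanges that cannot use shared memory."""
    count = 0
    for r, c in product(range(rows), range(cols)):
        if r + 1 < rows and mapping[(r, c)] != mapping[(r + 1, c)]:
            count += 1
        if c + 1 < cols and mapping[(r, c)] != mapping[(r, c + 1)]:
            count += 1
    return count


# Assumed sizes: 256 tiles and hypothetical 16-way SMP nodes.
ROWS, COLS, PROCS_PER_NODE = 16, 16, 16

naive = row_major_mapping(ROWS, COLS, PROCS_PER_NODE)
aware = blocked_mapping(ROWS, COLS, 4, 4)

naive_cut = inter_node_edges(naive, ROWS, COLS)
aware_cut = inter_node_edges(aware, ROWS, COLS)
print("inter-node neighbor pairs, rank order:", naive_cut)
print("inter-node neighbor pairs, 4x4 blocks:", aware_cut)
```

Under these assumptions, the rank-order mapping places each grid row on one node, so all 240 vertical neighbor pairs cross the network, while the blocked mapping leaves 96 of the 480 neighbor pairs crossing node boundaries. The actual benefit on a real machine depends on node size, tile shape, and the communication library, none of which this toy model captures.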

Published In

CF '06: Proceedings of the 3rd conference on Computing frontiers

May 2006

430 pages

ISBN: 1595933026

DOI: 10.1145/1128022

General Chairs: Monica Alderighi (IASF - INAF), Valentina Salapura (IBM)

Program Chair: Sally A. McKee (Cornell University)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CF06

Sponsor:

CF06: Computing Frontiers Conference

May 3 - 5, 2006

Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Bibliometrics

Total Citations: 10
Total Downloads: 240
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 0

Reflects downloads up to 24 Sep 2024

Citations

Cited By

Moríñigo J, García-Muller P, Rubio-Montero A, Gómez-Iglesias A, Meyer N, Mayo-García R. (2020). Performance drop at executing communication-intensive parallel algorithms. The Journal of Supercomputing. Online publication date: 6-Jan-2020.
https://doi.org/10.1007/s11227-019-03142-8
Rodríguez-Pascual M, Moríñigo J, Mayo-García R. (2019). Effect of MPI tasks location on cluster throughput using NAS. Cluster Computing. Online publication date: 3-Jan-2019.
https://doi.org/10.1007/s10586-018-02898-7
Moríñigo J, García-Muller P, Rubio-Montero A, Gómez-Iglesias A, Meyer N, Mayo-García R. (2019). Benchmarking LAMMPS: Sensitivity to Task Location Under CPU-Based Weak-Scaling. High Performance Computing, pp. 224-238. Online publication date: 31-Mar-2019.
https://doi.org/10.1007/978-3-030-16205-4_17
Rodríguez-Pascual M, Moríñigo J, Mayo-García R. (2017). Benchmarking Performance: Influence of Task Location on Cluster Throughput. High Performance Computing, pp. 125-138. Online publication date: 28-Dec-2017.
https://doi.org/10.1007/978-3-319-73353-1_9
Karlsson C, Davies T, Chen Z. (2012). Optimizing Process-to-Core Mappings for Application Level Multi-dimensional MPI Communications. Proceedings of the 2012 IEEE International Conference on Cluster Computing, pp. 486-494. Online publication date: 24-Sep-2012.
https://dl.acm.org/doi/10.1109/CLUSTER.2012.47
Karlsson C, Davies T, Ding C, Liu H, Chen Z. (2011). Optimizing Process-to-Core Mappings for Two Dimensional Broadcast/Reduce on Multicore Architectures. Proceedings of the 2011 International Conference on Parallel Processing, pp. 404-413. Online publication date: 13-Sep-2011.
https://dl.acm.org/doi/10.1109/ICPP.2011.26
Runger G, Schwind M. (2008). Performance effects of gram-schmidt orthogonalization on multi-core infiniband clusters. 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1-8. Online publication date: Apr-2008.
https://doi.org/10.1109/IPDPS.2008.4536474
Runger G, Schwind M. (2008). Cache optimization for mixed regular and irregular computations. 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1-8. Online publication date: Apr-2008.
https://doi.org/10.1109/IPDPS.2008.4536184
Dümmler J, Rauber T, Rünger G. (2008). Mapping Algorithms for Multiprocessor Tasks on Multi-Core Clusters. Proceedings of the 2008 37th International Conference on Parallel Processing, pp. 141-148. Online publication date: 9-Sep-2008.
https://dl.acm.org/doi/10.1109/ICPP.2008.42
Kayi A, Kornkven E, El-Ghazawi T, Al-Bahra S, Newby G. (2008). Performance Evaluation of Clusters with ccNUMA Nodes - A Case Study. Proceedings of the 2008 10th IEEE International Conference on High Performance Computing and Communications, pp. 320-327. Online publication date: 25-Sep-2008.
https://dl.acm.org/doi/10.1109/HPCC.2008.111
