DOI: 10.1109/SC.2018.00031

Cooperative rendezvous protocols for improved performance and overlap

Published: 26 July 2019

Abstract

With the emergence of larger multi-/many-core clusters and new areas of HPC applications, the performance of large-message communication is becoming more important. MPI libraries use different rendezvous protocols to perform large-message communication. However, existing rendezvous protocols do not take the overall communication pattern into account or make optimal use of the sender and receiver CPUs. In this work, we propose a cooperative rendezvous protocol that can provide up to 2x improvement in intra-node bandwidth and latency for large messages. We also propose designs to dynamically choose the best rendezvous protocol for each message based on the overall communication pattern. Finally, we show how these improvements can increase the overlap of intra-node communication and computation with inter-node communication and lead to application-level benefits at scale. We evaluate the proposed designs on three different architectures (Intel Xeon, Knights Landing, and OpenPOWER) against state-of-the-art MPI libraries including MVAPICH2 and Open MPI. Compared to existing designs, the proposed designs show benefits of up to 19% with Graph500, 16% with CoMD, and 10% with MiniGhost.
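
The kind of overlap the abstract refers to can be pictured with a small MPI program. The sketch below is illustrative only and not taken from the paper; the message size, rank pairing, and compute kernel are arbitrary assumptions. It posts a large nonblocking exchange, performs independent computation, and then waits; how much of the transfer completes during the computation depends on which rendezvous protocol the MPI library chooses for messages of this size, which is the behavior the proposed cooperative designs aim to improve.

/*
 * Illustrative sketch only (not code from the paper): a large, nonblocking
 * message exchange overlapped with independent computation. The message
 * size, rank pairing, and compute kernel are arbitrary assumptions; how
 * much of the transfer progresses during compute() depends on the
 * rendezvous protocol the MPI library selects for messages of this size.
 */
#include <mpi.h>
#include <stdlib.h>

#define MSG_BYTES (8 * 1024 * 1024)  /* 8 MiB: well above typical eager-protocol thresholds */
#define WORK_DOUBLES (1 << 17)       /* 128 K doubles of local work */

static void compute(double *work, int n)
{
    /* Stand-in for application computation that should overlap with the
       in-flight rendezvous transfer. */
    for (int i = 0; i < n; i++)
        work[i] = work[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char   *sendbuf = calloc(MSG_BYTES, 1);
    char   *recvbuf = calloc(MSG_BYTES, 1);
    double *work    = calloc(WORK_DOUBLES, sizeof(double));
    MPI_Request reqs[2];

    int peer = rank ^ 1;  /* pair adjacent ranks: 0<->1, 2<->3, ... */
    if (peer < size) {
        /* Post the large-message exchange; for messages this large the
           library negotiates a rendezvous protocol under the hood. */
        MPI_Irecv(recvbuf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Independent computation; ideally the transfer makes progress here. */
        compute(work, WORK_DOUBLES);

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    free(work);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with two or more ranks (e.g., mpirun -np 2), the time spent in MPI_Waitall gives a rough indication of how much of the transfer actually overlapped with compute().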


Cited By

  • (2024) BoostN: Optimizing Imbalanced Neighborhood Communication on Homogeneous Many-Core System. Proceedings of the 53rd International Conference on Parallel Processing, pp. 262-272. DOI: 10.1145/3673038.3673131
  • (2020) CAB-MPI. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.5555/3433701.3433748

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018
932 pages

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. HPC
  2. MPI
  3. rendezvous protocols

Qualifiers

  • Research-article

Conference

SC18

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
