DOI: 10.1109/SC.2018.00031

Cooperative rendezvous protocols for improved performance and overlap

Published: 26 July 2019

Abstract

With the emergence of larger multi-/many-core clusters and new areas of HPC applications, the performance of large-message communication is becoming more important. MPI libraries use different rendezvous protocols to perform large-message communication. However, existing rendezvous protocols do not take the overall communication pattern into account or make optimal use of the sender and receiver CPUs. In this work, we propose a cooperative rendezvous protocol that can provide up to 2x improvement in intra-node bandwidth and latency for large messages. We also propose designs to dynamically choose the best rendezvous protocol for each message based on the overall communication pattern. Finally, we show how these improvements can increase the overlap of intra-node communication and computation with inter-node communication and lead to application-level benefits at scale. We evaluate the proposed designs on three different architectures (Intel Xeon, Knights Landing, and OpenPOWER) against state-of-the-art MPI libraries including MVAPICH2 and Open MPI. Compared to existing designs, the proposed designs show benefits of up to 19% with Graph500, 16% with CoMD, and 10% with MiniGhost.
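
The kind of overlap the abstract refers to can be pictured with a small MPI program. The sketch below is illustrative only and not taken from the paper; the message size, rank pairing, and compute kernel are arbitrary assumptions. It posts a large nonblocking exchange, performs independent computation, and then waits; how much of the transfer completes during the computation depends on which rendezvous protocol the MPI library chooses for messages of this size, which is the behavior the proposed cooperative designs aim to improve.

/*
 * Illustrative sketch only (not code from the paper): a large, nonblocking
 * message exchange overlapped with independent computation. The message
 * size, rank pairing, and compute kernel are arbitrary assumptions; how
 * much of the transfer progresses during compute() depends on the
 * rendezvous protocol the MPI library selects for messages of this size.
 */
#include <mpi.h>
#include <stdlib.h>

#define MSG_BYTES (8 * 1024 * 1024)  /* 8 MiB: well above typical eager-protocol thresholds */
#define WORK_DOUBLES (1 << 17)       /* 128 K doubles of local work */

static void compute(double *work, int n)
{
    /* Stand-in for application computation that should overlap with the
       in-flight rendezvous transfer. */
    for (int i = 0; i < n; i++)
        work[i] = work[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char   *sendbuf = calloc(MSG_BYTES, 1);
    char   *recvbuf = calloc(MSG_BYTES, 1);
    double *work    = calloc(WORK_DOUBLES, sizeof(double));
    MPI_Request reqs[2];

    int peer = rank ^ 1;  /* pair adjacent ranks: 0<->1, 2<->3, ... */
    if (peer < size) {
        /* Post the large-message exchange; for messages this large the
           library negotiates a rendezvous protocol under the hood. */
        MPI_Irecv(recvbuf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Independent computation; ideally the transfer makes progress here. */
        compute(work, WORK_DOUBLES);

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    free(work);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with two or more ranks (e.g., mpirun -np 2), the time spent in MPI_Waitall gives a rough indication of how much of the transfer actually overlapped with compute().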


Cited By

  • (2024) BoostN: Optimizing Imbalanced Neighborhood Communication on Homogeneous Many-Core System. Proceedings of the 53rd International Conference on Parallel Processing, pp. 262-272. DOI: 10.1145/3673038.3673131
  • (2020) CAB-MPI. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.5555/3433701.3433748

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018
932 pages

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. HPC
  2. MPI
  3. rendezvous protocols

Qualifiers

  • Research-article

Conference

SC18

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
