Implementation and performance analysis of non-blocking collective operations for MPI

Published: 10 November 2007
DOI: 10.1145/1362622.1362692

Abstract

Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces a very low overhead between the application and the underlying MPI and thus, in conjunction with the potential to overlap communication with computation, offers the potential for optimizing real-world applications.
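
To illustrate the usage model the abstract describes, the sketch below overlaps a non-blocking broadcast with independent computation. It is a minimal example, not code from the paper: the NBC_Ibcast/NBC_Wait names and the nbc.h header follow LibNBC's published interface, but exact signatures may vary across releases, and compute_independent_work is a hypothetical stand-in for application work.

#include <mpi.h>
#include <nbc.h>   /* LibNBC header; name assumed from the library's documentation */

#define N (1 << 16)

/* Hypothetical stand-in for application work that does not read buf. */
static void compute_independent_work(double *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] = v[i] * 1.0001 + 1.0;
}

int main(int argc, char **argv) {
    static double buf[N];   /* broadcast payload */
    static double work[N];  /* data for the overlapped computation */
    NBC_Handle handle;

    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();

    /* Post the collective; it can progress while we compute. */
    NBC_Ibcast(buf, N, MPI_DOUBLE, 0, MPI_COMM_WORLD, &handle);

    /* Overlap window: computation that does not touch buf. */
    compute_independent_work(work, N);

    /* Complete the collective before buf may be read. */
    NBC_Wait(&handle);

    double t1 = MPI_Wtime();

    /* Comparing t1 - t0 against the blocking MPI_Bcast time plus the
       pure compute time estimates how much communication was hidden,
       which is the quantity the paper's overlap microbenchmark targets. */

    MPI_Finalize();
    return 0;
}

With a blocking MPI_Bcast in place of the posted collective, communication time simply adds to computation time; with the non-blocking version, an implementation that makes good asynchronous progress can hide most of it.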




Published In

SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing
November 2007
723 pages
ISBN: 9781595937643
DOI: 10.1145/1362622

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. MPI
  2. collective operations
  3. non-blocking collective operations
  4. non-blocking communication
  5. overlap

Qualifiers

  • Research-article


Conference

SC '07

Acceptance Rates

SC '07 Paper Acceptance Rate: 54 of 268 submissions (20%)
Overall Acceptance Rate: 1,516 of 6,373 submissions (24%)
