Abstract
Non-blocking collective communication operations extend the concept of collective operations by allowing communication and computation to be overlapped. They are often considered key building blocks for scaling applications to very large process counts. Yet, using non-blocking collective operations in real-world applications is non-trivial: application codes often have to be restructured significantly in order to maximize the communication–computation overlap. This paper presents an approach to maximize the communication–computation overlap for hybrid OpenMP/MPI applications. The work leverages automatic parallelization by extending the ability of an existing tool to utilize non-blocking collective operations. It further integrates run-time auto-tuning techniques for non-blocking collective operations, optimizing both the algorithms used for the non-blocking collective operations and the location and frequency of the accompanying progress function calls. Four application benchmarks were used to demonstrate the efficiency and versatility of the approach on two different platforms. The results indicate significant performance improvements in virtually all test scenarios. The resulting parallel applications achieved a performance improvement of up to 43% compared to the version using blocking communication operations, and up to 95% of the maximum theoretical communication–computation overlap identified for each scenario.
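To illustrate the overlap pattern the paper targets, the following is a minimal sketch, not the code generated by the authors' tool: an MPI_Iallreduce is started, independent computation proceeds in chunks, and MPI_Test is called between chunks to drive progress. The chunk size CHUNK is a hypothetical parameter standing in for the tile size and progress-call frequency that the run-time auto-tuner would select.

```c
/* Minimal sketch of communication-computation overlap with a
 * non-blocking collective and interleaved progress calls.
 * Requires an MPI-3 library, e.g.: mpicc overlap.c -o overlap */
#include <mpi.h>
#include <stddef.h>

#define N     1000000
#define CHUNK 10000   /* hypothetical; the auto-tuner would pick this */

int main(int argc, char **argv)
{
    static double in[N], out[N], work[N];
    MPI_Request req;
    int done = 0;

    MPI_Init(&argc, &argv);

    for (size_t i = 0; i < N; i++) { in[i] = 1.0; work[i] = 2.0; }

    /* Start the collective without blocking. */
    MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* Independent computation, with a progress call after each chunk;
     * the paper tunes where and how often such calls are placed. */
    for (size_t i = 0; i < N; i += CHUNK) {
        for (size_t j = i; j < i + CHUNK && j < N; j++)
            work[j] = work[j] * 0.5 + (double)j;
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* ensure completion before using out */
    MPI_Finalize();
    return 0;
}
```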
Acknowledgements
Partial support for this work was provided by the National Science Foundation's Computer Systems Research program under Award Nos. CNS-0846002 and CRI-0958464. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors acknowledge the use of the Opuntia Cluster and the support of the Center for Advanced Computing and Data Systems at the University of Houston in carrying out the research presented here.
Cite this article
Barigou, Y., Gabriel, E. Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations. Int J Parallel Prog 45, 1390–1416 (2017). https://doi.org/10.1007/s10766-016-0477-7