Abstract
Non-blocking collective communication operations extend the concept of collective operations by allowing communication and computation to be overlapped. They are often considered key building blocks for scaling applications to very large process counts. Yet, using non-blocking collective operations in real-world applications is non-trivial: application codes often have to be restructured significantly in order to maximize the communication–computation overlap. This paper presents an approach to maximize the communication–computation overlap for hybrid OpenMP/MPI applications. The work leverages automatic parallelization by extending the ability of an existing tool to utilize non-blocking collective operations. It further integrates run-time auto-tuning techniques for non-blocking collective operations, optimizing both the algorithms used for the non-blocking collective operations and the location and frequency of the accompanying progress function calls. Four application benchmarks were used to demonstrate the efficiency and versatility of the approach on two different platforms. The results indicate significant performance improvements in virtually all test scenarios. The resulting parallel applications achieved a performance improvement of up to 43% compared to the version using blocking communication operations, and up to 95% of the maximum theoretical communication–computation overlap identified for each scenario.
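To illustrate the overlap pattern the paper targets, the following is a minimal sketch, not the code generated by the authors' tool: an MPI_Iallreduce is started, independent computation proceeds in chunks, and MPI_Test is called between chunks to drive progress. The chunk size CHUNK is a hypothetical parameter standing in for the tile size and progress-call frequency that the run-time auto-tuner would select.

```c
/* Minimal sketch of communication-computation overlap with a
 * non-blocking collective and interleaved progress calls.
 * Requires an MPI-3 library, e.g.: mpicc overlap.c -o overlap */
#include <mpi.h>
#include <stddef.h>

#define N     1000000
#define CHUNK 10000   /* hypothetical; the auto-tuner would pick this */

int main(int argc, char **argv)
{
    static double in[N], out[N], work[N];
    MPI_Request req;
    int done = 0;

    MPI_Init(&argc, &argv);

    for (size_t i = 0; i < N; i++) { in[i] = 1.0; work[i] = 2.0; }

    /* Start the collective without blocking. */
    MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* Independent computation, with a progress call after each chunk;
     * the paper tunes where and how often such calls are placed. */
    for (size_t i = 0; i < N; i += CHUNK) {
        for (size_t j = i; j < i + CHUNK && j < N; j++)
            work[j] = work[j] * 0.5 + (double)j;
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* ensure completion before using out */
    MPI_Finalize();
    return 0;
}
```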
Acknowledgements
Partial support for this work was provided by the National Science Foundation's Computer Systems Research program under Award Nos. CNS-0846002 and CRI-0958464. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors acknowledge the use of the Opuntia Cluster and the support of the Center for Advanced Computing and Data Systems at the University of Houston in carrying out the research presented here.
Cite this article
Barigou, Y., Gabriel, E. Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations. Int J Parallel Prog 45, 1390–1416 (2017). https://doi.org/10.1007/s10766-016-0477-7