Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations

Published in: International Journal of Parallel Programming (2017)

Abstract

Non-blocking collective communication operations extend the concept of collective operations by offering the additional benefit of being able to overlap communication and computation. They are often considered key building blocks for scaling applications to very large process counts. Yet, using non-blocking collective operations in real-world applications is non-trivial, since application codes often have to be restructured significantly in order to maximize the communication–computation overlap. This paper presents an approach to maximizing the communication–computation overlap for hybrid OpenMP/MPI applications. The work leverages automatic parallelization by extending the ability of an existing tool to utilize non-blocking collective operations. It further integrates run-time auto-tuning of non-blocking collective operations, optimizing both the algorithms used for the non-blocking collective operations and the location and frequency of the accompanying progress function calls. Four application benchmarks were used to demonstrate the efficiency and versatility of the approach on two different platforms. The results indicate significant performance improvements in virtually all test scenarios. The resulting parallel applications achieved a performance improvement of up to 43% compared to the version using blocking communication operations, and up to 95% of the maximum theoretical communication–computation overlap identified for each scenario.
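
To make the overlap pattern concrete, the C sketch below (a minimal illustration, not code from the paper) shows the basic idea for a single MPI_Iallreduce: the non-blocking collective is started, independent OpenMP-parallel computation proceeds in chunks, and MPI_Test is called between chunks as an explicit progress function. The buffer size and chunk size are arbitrary placeholders; the location and frequency of such progress calls are precisely the parameters that the run-time tuning described above selects automatically.

/*
 * Sketch: overlap a non-blocking MPI collective with independent
 * OpenMP-parallel computation, polling MPI_Test between work chunks
 * as an explicit progress call.  CHUNK is a placeholder value.
 * Build with: mpicc -fopenmp overlap.c -o overlap
 */
#include <mpi.h>
#include <stdio.h>

#define N     1048576
#define CHUNK 65536   /* placeholder block size between progress calls */

int main(int argc, char **argv)
{
    static double local[N], global[N], work[N];
    int provided, done = 0;
    MPI_Request req;

    /* FUNNELED: only the main thread calls MPI; OpenMP threads only compute. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    for (int i = 0; i < N; i++) { local[i] = 1.0; work[i] = 2.0; }

    /* Start the non-blocking collective ... */
    MPI_Iallreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... and overlap it with independent computation, testing the request
       after every chunk so the collective can progress inside the library. */
    for (int i = 0; i < N; i += CHUNK) {
        int end = (i + CHUNK < N) ? i + CHUNK : N;
        #pragma omp parallel for
        for (int j = i; j < end; j++)
            work[j] = 0.5 * work[j] + 1.0;
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("global[0] = %f, work[0] = %f\n", global[0], work[0]);
    MPI_Finalize();
    return 0;
}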



Acknowledgements

Partial support for this work was provided by the National Science Foundation’s Computer Systems Research program under Award Nos. CNS-0846002 and CRI-0958464. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors acknowledge the use of the Opuntia Cluster and the support from the Center of Advanced Computing and Data Systems at the University of Houston to carry out the research presented here.

Author information

Corresponding author: Edgar Gabriel.

About this article

Cite this article

Barigou, Y., Gabriel, E. Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations. Int J Parallel Prog 45, 1390–1416 (2017). https://doi.org/10.1007/s10766-016-0477-7

