Tuning collective communication for Partitioned Global Address Space programming models

Published: 01 September 2011

Abstract

Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with the locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language, programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads works together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives raise semantic issues that differ from those in send-receive style message passing programs, and admit implementation approaches that take advantage of the one-sided communication style of these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory, and hybrids of the two. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and show that the GASNet collectives deliver scalable performance on a variety of state-of-the-art parallel machines, including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with an InfiniBand interconnect.





Published In

Parallel Computing  Volume 37, Issue 9
September, 2011
155 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Collective communication
  2. One-sided communication
  3. Partitioned Global Address Space languages



Cited By

  • (2023) Memory Transfer Decomposition: Exploring Smart Data Movement Through Architecture-Aware Strategies. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1958-1967. doi:10.1145/3624062.3624609. Published 12 Nov 2023.
  • (2019) A view of programming scalable data analysis. Journal of Cloud Computing: Advances, Systems and Applications 8(1), pp. 1-16. doi:10.1186/s13677-019-0127-x. Published 1 Dec 2019.
  • (2019) Collective Communication for the RISC-V xBGAS ISA Extension. Workshop Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. doi:10.1145/3339186.3339196. Published 5 Aug 2019.
  • (2016) Parallel Pairwise Epistasis Detection on Heterogeneous Computing Architectures. IEEE Transactions on Parallel and Distributed Systems 27(8), pp. 2329-2340. doi:10.1109/TPDS.2015.2460247. Published 13 Jul 2016.
  • (2014) Region templates. Parallel Computing 40(10), pp. 589-610. doi:10.1016/j.parco.2014.09.003. Published 1 Dec 2014.
  • (2014) Scalable PGAS collective operations in NUMA clusters. Cluster Computing 17(4), pp. 1473-1495. doi:10.1007/s10586-014-0377-9. Published 1 Dec 2014.
  • (2012) Congestion avoidance on manycore high performance computing systems. Proceedings of the 26th ACM International Conference on Supercomputing, pp. 121-132. doi:10.1145/2304576.2304594. Published 25 Jun 2012.
