Tuning collective communication for Partitioned Global Address Space programming models

Published: 01 September 2011

Abstract

Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with the locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language, programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads works together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives raise semantic issues that differ from those in send-receive style message passing programs, and admit implementation approaches that take advantage of the one-sided communication style of these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory, and hybrids of the two. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and show that the GASNet collectives deliver scalable performance on a variety of state-of-the-art parallel machines, including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with an InfiniBand interconnect.





Published In

Parallel Computing  Volume 37, Issue 9
September, 2011
155 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Collective communication
  2. One-sided communication
  3. Partitioned Global Address Space languages



Cited By

  • (2023) Memory Transfer Decomposition: Exploring Smart Data Movement Through Architecture-Aware Strategies. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1958-1967. doi:10.1145/3624062.3624609. Published 12 Nov 2023.
  • (2019) A view of programming scalable data analysis. Journal of Cloud Computing: Advances, Systems and Applications 8(1), pp. 1-16. doi:10.1186/s13677-019-0127-x. Published 1 Dec 2019.
  • (2019) Collective Communication for the RISC-V xBGAS ISA Extension. Workshop Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. doi:10.1145/3339186.3339196. Published 5 Aug 2019.
  • (2016) Parallel Pairwise Epistasis Detection on Heterogeneous Computing Architectures. IEEE Transactions on Parallel and Distributed Systems 27(8), pp. 2329-2340. doi:10.1109/TPDS.2015.2460247. Published 13 Jul 2016.
  • (2014) Region templates. Parallel Computing 40(10), pp. 589-610. doi:10.1016/j.parco.2014.09.003. Published 1 Dec 2014.
  • (2014) Scalable PGAS collective operations in NUMA clusters. Cluster Computing 17(4), pp. 1473-1495. doi:10.1007/s10586-014-0377-9. Published 1 Dec 2014.
  • (2012) Congestion avoidance on manycore high performance computing systems. Proceedings of the 26th ACM International Conference on Supercomputing, pp. 121-132. doi:10.1145/2304576.2304594. Published 25 Jun 2012.
