Performance without pain = productivity: data layout and collective communication in UPC

Published: 20 February 2008
DOI: 10.1145/1345206.1345224

Abstract

The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the number of processors grows, the scalability of applications will be the dominant challenge. This forces us to reexamine some of the fundamental ways in which we approach the design and use of parallel languages and runtime systems.
In this paper we show how the globally shared arrays of a popular Partitioned Global Address Space (PGAS) language, Unified Parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective operation together. Unlike MPI's communicators, our interface allows sets of threads to be placed in teams instantly rather than through explicit communicator construction, thus allowing more dynamic team construction and manipulation.
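For reference, the following minimal sketch (not code from the paper) shows what a collective looks like in standard UPC, where the operations in <upc_collective.h> always span all THREADS; the team interface described above generalizes such calls to subsets of threads, and its exact API is not reproduced here. The array sizes are arbitrary.

    #include <upc.h>
    #include <upc_collective.h>

    #define NELEMS 256

    /* Source block: the [] layout qualifier gives the whole array affinity
     * to thread 0. */
    shared [] double src[NELEMS];
    /* Destination: one contiguous block of NELEMS elements per thread. */
    shared [NELEMS] double dst[NELEMS * THREADS];

    int main(void)
    {
        if (MYTHREAD == 0)
            for (int i = 0; i < NELEMS; i++)
                src[i] = (double)i;

        upc_barrier;

        /* Standard UPC collective over all threads: copy thread 0's block
         * into every thread's block of dst. A team-based collective would
         * run the same kind of operation over a subset of threads. */
        upc_all_broadcast(dst, src, NELEMS * sizeof(double),
                          UPC_IN_NOSYNC | UPC_OUT_NOSYNC);

        upc_barrier;
        return 0;
    }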
We motivate our ideas with three application kernels: dense matrix multiplication, dense Cholesky factorization, and multidimensional Fourier transforms. We describe how these three applications can be written succinctly in UPC, thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors of the Blue Gene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine that achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources, rather than the interfaces themselves, limit the performance.
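As an illustration of how UPC's layout qualifiers and upc_forall express data distribution and owner-computes work partitioning in a few lines, here is a naive shared-array matrix multiply. It is only a sketch, not the paper's algorithm or performance kernel; it assumes a static THREADS compilation environment (thread count fixed at compile time) and an arbitrary matrix size.

    #include <upc.h>

    #define N 512   /* illustrative matrix dimension */

    /* Row-oriented layout: a block size of N puts one whole row in each
     * block, so row i of every matrix has affinity to thread i % THREADS. */
    shared [N] double A[N][N];
    shared [N] double B[N][N];
    shared [N] double C[N][N];

    void matmul(void)
    {
        int i, j, k;

        /* Owner computes: the affinity expression &C[i][0] makes each
         * thread execute only the iterations for the rows of C it owns.
         * Reads of B[k][j] may be remote shared accesses. */
        upc_forall (i = 0; i < N; i++; &C[i][0]) {
            for (j = 0; j < N; j++) {
                double sum = 0.0;
                for (k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
        }
        upc_barrier;
    }

A version along the lines of the paper would instead move whole blocks with (team) collectives rather than reading B element by element through remote accesses.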

Published In

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
February 2008
308 pages
ISBN:9781595937957
DOI:10.1145/1345206
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2008

Author Tags

  1. PGAS
  2. UPC
  3. blue gene
  4. collective communication
  5. parallel programming
  6. programming productivity

Qualifiers

  • Research-article

Conference

PPoPP08

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Cited By

  • (2019) Designing, Implementing, and Evaluating the Upcoming OpenSHMEM Teams API. 2019 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), pp. 37-46. DOI: 10.1109/PAW-ATM49560.2019.00009. Online publication date: Nov-2019.
  • (2016) Towards an object oriented programming framework for parallel matrix algorithms. 2016 International Conference on High Performance Computing & Simulation (HPCS), pp. 776-783. DOI: 10.1109/HPCSim.2016.7568413. Online publication date: Jul-2016.
  • (2014) Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer. The International Journal of High Performance Computing Applications, 28(4):450-464. DOI: 10.1177/1094342014552086. Online publication date: 7-Nov-2014.
  • (2014) Scalable MPI-3.0 RMA on the Blue Gene/Q Supercomputer. Proceedings of the 21st European MPI Users' Group Meeting, pp. 7-12. DOI: 10.1145/2642769.2642778. Online publication date: 9-Sep-2014.
  • (2013) Scalable PGAS metadata management on extreme scale systems. Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 103-111. DOI: 10.1109/CCGrid.2013.83. Online publication date: 13-May-2013.
  • (2013) Design and Implementation of an Extended Collectives Library for Unified Parallel C. Journal of Computer Science and Technology, 28(1):72-89. DOI: 10.1007/s11390-013-1313-9. Online publication date: 1-Feb-2013.
  • (2012) Automatic communication coalescing for irregular computations in UPC language. Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, pp. 220-234. DOI: 10.5555/2399776.2399796. Online publication date: 5-Nov-2012.
  • (2012) Aspen. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1-11. DOI: 10.5555/2388996.2389110. Online publication date: 10-Nov-2012.
  • (2011) Dense Triangular Solvers on Multicore Clusters using UPC. Procedia Computer Science, 4:231-240. DOI: 10.1016/j.procs.2011.04.025. Online publication date: 2011.
  • (2011) Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation. Languages and Compilers for Parallel Computing, pp. 151-165. DOI: 10.1007/978-3-642-19595-2_11. Online publication date: 2011.
