research-article

Combining Static and Dynamic Data Coalescing in Unified Parallel C

Published: 01 February 2016

Abstract

Significant progress has been made in the development of programming languages and tools suitable for hybrid computer architectures that group several shared-memory multicores interconnected through a network. This paper addresses important limitations in code generation for partitioned global address space (PGAS) languages. These languages allow fine-grained communication and lead to programs that perform many fine-grained accesses to data. When the data is distributed to remote computing nodes, code transformations are required to prevent performance degradation. Until now, code transformations for PGAS programs have been restricted to cases where either the physical mapping of the data or the number of processing nodes is known at compilation time. In this paper, a novel application of the inspector-executor model overcomes these limitations and enables profitable code transformations, which result in fewer and larger messages sent through the network, even when neither the data mapping nor the number of processing nodes is known at compilation time. A performance evaluation reports both scaling and absolute performance numbers on up to 32,768 cores of a Power 775 supercomputer. This evaluation indicates that the compiler transformation yields speedups between 1.15× and 21× over a baseline and that these automated transformations achieve up to 63 percent of the performance of the MPI versions.

References

[1]
J. Protic, M. Tomasevic, and V. Milutinovic, “Distributed shared memory: Concepts and systems,” IEEE Parallel Distrib. Technol.: Syst. Appl., vol. 4, no. 2, pp. 63–71, Jun. 1996.
[2]
B. Nitzberg and V. Lo, “Distributed shared memory: A survey of issues and algorithms,” Computer, vol. 24, no. 8, pp. 52–60, 1991.
[3]
C. Amza, A. L. Cox, H. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, “TreadMarks: Shared memory computing on networks of workstations,” IEEE Comput., vol. 29, no. 2, pp. 18–28, Feb. 1996.
[4]
A. Itzkovitz and A. Schuster, “MultiView and Millipage—Fine-grain sharing in page-based DSMs,” in Proc. 3rd Symp. Oper. Syst. Des. Implementation, 1999, pp. 215–228.
[5]
UPC Consortium. (2013). UPC Specifications, V1.3 [Online]. Available: http://upc.gwu.edu/documentation.html
[6]
R. W. Numrich and J. Reid, “Co-array Fortran for parallel programming,” Rutherford Appleton Lab., Chilton, Oxfordshire, England, Tech. Rep. RAL-TR-1998-060, 1998.
[7]
E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. Steele Jr., and S. Tobin-Hochstadt. (2008, Mar.). The Fortress language specification version 1.0 [Online]. Available: http://labs.oracle.com/projects/plrg/Publications/fortress.1.0.pdf
[8]
Cray Inc. (2011, Apr.). Chapel language specification version 0.8 [Online]. Available: http://chapel.cray.com/spec/spec-0.8.pdf
[9]
P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, “X10: An object-oriented approach to non-uniform cluster computing,” ACM SIGPLAN Notices, vol. 40, no. 10, pp. 519–538, Oct. 2005.
[10]
K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, “Titanium: A high-performance Java dialect,” Concurrency-Practice Exp., vol. 10, nos. 11/13, pp. 825–836, 1998.
[11]
MPI Forum. (2014). MPI: A Message-Passing Interface Standard [Online]. Available: http://www.mpi-forum.org
[12]
W.-Y. Chen, C. Iancu, and K. Yelick, “Communication optimizations for fine-grained UPC applications,” in Proc. 14th Int. Conf. Parallel Archit. Compilation Techn., 2005, pp. 267–278.
[13]
D. Chavarria-Miranda and J. Mellor-Crummey, “Effective communication coalescing for data-parallel applications,” in Proc. 10th ACM SIGPLAN Symp. Principles Practice Parallel Programm., 2005, pp. 14–25.
[14]
C. Barton, G. Almasi, M. Farreras, and J. N. Amaral, “A unified parallel C compiler that implements automatic communication coalescing,” presented at the 14th Workshop Compilers Parallel Comput., Zurich, Switzerland, 2009.
[15]
R. Rajamony, L. Arimilli, and K. Gildea, “PERCS: The IBM POWER7-IH high-performance computing system,” IBM J. Res. Develop., vol. 55, no. 3, pp. 3:1–3:12, 2011.
[16]
B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony, “The PERCS high-performance interconnect,” in Proc. 18th Annu. Symp. High-Perform. Interconnects, 2010, pp. 75–82.
[17]
J. H. Saltz, R. Mirchandaney, and K. Crowley, “Run-time parallelization and scheduling of loops,” IEEE Trans. Comput., vol. 40, no. 5, pp. 603–612, May 1991.
[18]
C. Koelbel and P. Mehrotra, “Compiling global name-space parallel loops for distributed execution,” IEEE Trans. Parallel Distrib. Syst., vol. 2, no. 4, pp. 440–451, Oct. 1991.
[19]
P. Brezany, M. Gerndt, and V. Sipkova, “SVM support in the Vienna Fortran compilation system,” Julich Supercomputing Centre, KFA Juelich, Tech. Rep. KFA-ZAM-IB-9401, 1994.
[20]
J. Su and K. Yelick, “Automatic support for irregular computations in a high-level language,” in Proc. 19th IEEE Int. Parallel Distrib. Process. Symp., 2005, p. 56b.
[21]
International Organization for Standardization, “ISO/IEC 9899:TC2 Programming Languages—C,” May 2005.
[22]
M. Gupta, E. Schonberg, and H. Srinivasan, “A unified framework for optimizing communication in data-parallel programs,” IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 7, pp. 689–704, Jul. 1996.
[23]
D. Yokota, S. Chiba, and K. Itano, “A new optimization technique for the inspector-executor method,” in Proc. Int. Conf. Parallel Distrib. Comput. Syst., 2002, pp. 706–711.
[24]
W.-Y. Chen, D. Bonachea, C. Iancu, and K. Yelick, “Automatic nonblocking communication for partitioned global address space programs,” in Proc. 21st Annu. Int. Conf. Supercomput., 2007, pp. 158–167.
[25]
C. M. Barton, “Improving access to shared data in a partitioned global address space programming model,” Ph.D. dissertation, Dept. of Computing Science, University of Alberta, Edmonton, AB, Canada, 2009.
[26]
Y. Dotsenko, C. Coarfa, and J. Mellor-Crummey, “A multi-platform Co-array Fortran compiler,” in Proc. 13th Int. Conf. Parallel Arch. Compilation Techn., 2004, pp. 29–40.
[27]
K. Ebcioglu, V. Saraswat, and V. Sarkar, “X10: Programming for hierarchical parallelism and non-uniform data access,” in Proc. Int. Workshop Lang. Runtimes, 2004, pp. 519–538.
[28]
A. Sanz, R. Asenjo, J. Lopez, R. Larrosa, A. Navarro, V. Litvinov, S.-E. Choi, and B. L. Chamberlain, “Global data re-allocation via communication aggregation in Chapel,” in Proc. 24th Int. Symp. Comput. Archit. High Perform. Comput., 2012, pp. 235–242.
[29]
M. Alvanos, M. Farreras, E. Tiotto, and X. Martorell, “Automatic communication coalescing for irregular computations in UPC language,” in Proc. Conf. Center Adv. Studies Collaborative Res., 2012, pp. 220–234.
[30]
M. Alvanos and E. Tiotto, “Data prefetching and coalescing for partitioned global address space languages,” US Patent App. 13/659,048, Oct. 24, 2012.
[31]
M. Alvanos, M. Farreras, E. Tiotto, J. N. Amaral, and X. Martorell, “Improving communication in PGAS environments: Static and dynamic coalescing in UPC,” in Proc. 27th Annu. Int. Conf. Supercomput., 2013, pp. 129–138.
[32]
G. Tanase, G. Almási, E. Tiotto, M. Alvanos, A. Ly, and B. Dalton, “Performance analysis of the IBM XL UPC on the PERCS architecture,” Tech. Rep. IBM RC25360, 2013.
[33]
G. I. Tanase, G. Almási, H. Xue, and C. Archer, “Composable, non-blocking collective operations on power7 IH,” in Proc. 26th ACM Int. Conf. Supercomput., 2012, pp. 215–224.
[34]
R. Kalla, B. Sinharoy, W. Starke, and M. Floyd, “POWER7: IBM's next-generation server processor,” IEEE Micro, vol. 30, no. 2, pp. 7–15, Mar./Apr. 2010.
[35]
T. El-Ghazawi and F. Cantonnet, “UPC performance and potential: A NPB experimental study,” in Proc. ACM/IEEE Conf. Supercomput., 2002, pp. 1–26.
[36]
S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, ser. Cambridge Monographs on Mathematical Physics. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[37]
A. K. Dewdney, “Computer recreations: Sharks and fish wage an ecological war on the toroidal planet Wa-Tor,” Sci. Amer., vol. 251, pp. 14–22, 1984.
[38]
T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms, 2nd ed. New York, NY, USA: McGraw-Hill, 2001.
[39]
C. Barton, C. Cascaval, G. Almasi, Y. Zheng, M. Farreras, S. Chatterjee, and J. N. Amaral, “Shared memory programming for large scale machines,” in Proc. ACM Conf. Programm. Lang. Des. Implementation, Jun. 2006, pp. 108–117.
[40]
K. J. Barker, A. Hoisie, and D. J. Kerbyson, “An early performance analysis of POWER7-IH HPC systems,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011, pp. 42:1–42:11.
[41]
IBM Redbooks, IBM Power Systems 775 for AIX and Linux HPC Solution, IBM Corp., Armonk, NY, USA, 2012.

Publisher

IEEE Press
