
Automatic parallel code generation for tiled nested loops

Authors: Georgios Goumas, Nikolaos Drosinos, Maria Athanasaki, Nectarios Koziris

SAC '04: Proceedings of the 2004 ACM symposium on Applied computing

Pages 1412 - 1419

https://doi.org/10.1145/967900.968184

Published: 14 March 2004

Abstract

This paper presents an overview of our work on a complete end-to-end framework for automatically generating message-passing parallel code for tiled nested for-loops. The framework handles general parallelepiped tiling transformations and general convex iteration spaces. We address all problems regarding both the generation of sequential tiled code and its parallelization. We have implemented our techniques in a tool that automatically generates MPI parallel code, and we conducted several series of experiments measuring the compilation time of the tool, the efficiency of the generated code, and the speedup attained on a cluster of PCs. Apart from confirming the value of our techniques, our experimental results show the merit of general parallelepiped tiling transformations and verify previous theoretical work on scheduling-optimal tile shapes.
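To make the terminology concrete, here is a minimal sketch of a parallelepiped tiling applied to a small convex iteration space. It is not the code generator described in the paper: the tiling matrix `H`, the triangular iteration space, and the helper names `tile_of`, `iteration_space`, and `tiled_order` are all illustrative assumptions. The sketch only shows the standard idea that an iteration point j is assigned to the tile with coordinates floor(H j); it does not check whether the tiling is legal for a given set of data dependences, and it does not generate MPI code.

```python
from collections import defaultdict
from fractions import Fraction
from math import floor

# Hypothetical tiling matrix H, chosen only for illustration.
# Its second row mixes both loop indices, so the tiles are skewed
# parallelograms rather than axis-aligned rectangles.
H = [
    [Fraction(1, 4), Fraction(0)],
    [Fraction(-1, 4), Fraction(1, 4)],
]


def tile_of(point, h=H):
    """Return the tile coordinates floor(H * j) of iteration point j."""
    return tuple(
        floor(sum(coeff * x for coeff, x in zip(row, point)))
        for row in h
    )


def iteration_space(n):
    """A convex (triangular) iteration space: 0 <= i < n, 0 <= j <= i."""
    return [(i, j) for i in range(n) for j in range(i + 1)]


def tiled_order(points, h=H):
    """Group points by tile and list tiles in lexicographic order.

    Assumption: this ordering ignores data dependences entirely; a real
    tool must verify that the tiling and the tile order are legal.
    """
    tiles = defaultdict(list)
    for p in points:
        tiles[tile_of(p, h)].append(p)
    return [(t, sorted(tiles[t])) for t in sorted(tiles)]
```

For example, `tile_of((5, 2))` returns `(1, -1)`, and concatenating the point lists from `tiled_order(iteration_space(10))` yields every point of the original space exactly once. Exact rational arithmetic (`Fraction`) is used so that the floor is not affected by floating-point rounding at tile boundaries.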

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
2. Theory of computation
  1. Models of computation
    1. Concurrency


Information & Contributors

Information

Published In

SAC '04: Proceedings of the 2004 ACM symposium on Applied computing

March 2004

1733 pages

ISBN:1581138121

DOI:10.1145/967900

Conference Chair: Hisham M. Haddad (Kennesaw State University)

Program Chairs: Andrea Omicini (Università degli Studi di Bologna a Cesena), Roger L. Wainwright (University of Tulsa)

Publications Chair: Lorie M. Liebrock (New Mexico Institute of Mining and Technology)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SAC04: The 2004 ACM Symposium on Applied Computing

Sponsor: SIGAPP

March 14 - 17, 2004

Nicosia, Cyprus

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%


Bibliometrics

Total Citations: 7
Total Downloads: 471
Downloads (last 12 months): 2
Downloads (last 6 weeks): 0

Reflects downloads up to 23 Nov 2024


Cited By

Ozturk O. (2011). Data locality and parallelism optimization using a constraint-based approach. Journal of Parallel and Distributed Computing, 71:2, pages 280-287. Online publication date: 1-Feb-2011.
https://dl.acm.org/doi/10.1016/j.jpdc.2010.08.005

Kumar Anand C., Kahl W. (2010). Synthesizing and Verifying Multicore Parallelism in Categories of Nested Code Graphs. Process Algebra for Parallel and Distributed Processing. Online publication date: 31-Jan-2010.
https://doi.org/10.1201/9781420064872.pt1

Baskaran M., Hartono A., Tavarageri S., Henretty T., Ramanujam J., Sadayappan P., Moshovos A., Steffan G., Hazelwood K., Kaeli D. (2010). Parameterized tiling revisited. Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, pages 200-209. Online publication date: 24-Apr-2010.
https://dl.acm.org/doi/10.1145/1772954.1772983

Kandemir M., Zhang Y., Muralidhara S., Ozturk O., Narayanan S., Henkel J., Parameswaran S. (2009). Slicing based code parallelization for minimizing inter-processor communication. Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, pages 87-96. Online publication date: 11-Oct-2009.
https://dl.acm.org/doi/10.1145/1629395.1629409

Ozturk O., Chen G., Kandemir M., Sentovich E. (2006). Optimizing code parallelization through a constraint network based approach. Proceedings of the 43rd annual Design Automation Conference, pages 863-688. Online publication date: 24-Jul-2006.
https://dl.acm.org/doi/10.1145/1146909.1147083

Bonelli A., Franchetti F., Lorenz J., Püschel M., Ueberhuber C. (2006). Automatic performance optimization of the discrete Fourier transform on distributed memory computers. Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications, pages 818-832. Online publication date: 4-Dec-2006.
https://dl.acm.org/doi/10.1007/11946441_74

Nakazawa M., Lowenthal D., Zhou W., Kramer W. (2005). The MHETA Execution Model for Heterogeneous Clusters. Proceedings of the 2005 ACM/IEEE conference on Supercomputing. Online publication date: 12-Nov-2005.
https://dl.acm.org/doi/10.1109/SC.2005.73
