Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

The Pluto+ Algorithm: A Practical Approach for Parallelization and Locality Optimization of Affine Loop Nests

Published: 08 April 2016 Publication History

Abstract

Affine transformations have proven to be powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multidimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks such as the Pluto algorithm, which include a cost function for modern multicore architectures for which coarse-grained parallelism and locality are crucial, consider only a subspace of transformations to avoid a combinatorial explosion in finding transformations. The ensuing practical trade-offs lead to the exclusion of certain useful transformations: in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In addition, there is currently no proof that the algorithm successfully finds a tree of permutable loop bands for all affine loop nests. In this article, we propose an approach to address these two issues (1) by modeling a much larger space of practically useful affine transformations in conjunction with the existing cost function of Pluto, and (2) by extending the Pluto algorithm in a way that allows a proof for its soundness and completeness for all affine loop nests. We perform an experimental evaluation of both, the effect on compilation time, and performance of generated codes. The evaluation shows that our new framework, Pluto+, provides no degradation in performance for any benchmark from Polybench. For the Lattice Boltzmann Method (LBM) simulations with periodic boundary conditions, it provides a mean speedup of 1.33 × over Pluto. We also show that Pluto+ does not increase compilation time significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time by only 15%. In cases in which it improves execution time significantly, it increased polyhedral optimization time by only 2.04 × .

References

[1]
Aravind Acharya and Uday Bondhugula. 2015. PLUTO+: Near-complete modeling of affine transformations for parallelism and locality. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’15). ACM, New York, NY.
[2]
N. Ahmed, Nikolay Mateev, and Keshav Pingali. 2001. Synthesizing transformations for locality enhancement of imperfectly-nested loops. International Journal of Parallel Programming 29, 5.
[3]
Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, and Monica S. Lam. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Prentice Hall, Upper Saddle River, NJ.
[4]
Corinne Ancourt and Francois Irigoin. 1991. Scanning polyhedra with DO loops. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, NY, 39--50.
[5]
Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In Supercomputing. Article 40, 11 pages.
[6]
Cédric Bastoul. 2004. Code generation in the polyhedral model is easier than you think. In International Conference on Parallel Architectures and Compilation Techniques (PACT). 7--16.
[7]
Uday Bondhugula. 2008. Effective Automatic Parallelization and Locality Optimization Using the Polyhedral Model. Ph.D. Dissertation. The Ohio State University, Columbus, OH.
[8]
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 39--50.
[9]
Uday Bondhugula, M. Baskaran, Sriram Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. 2008a. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In International Conference on Compiler Construction (ETAPS CC’08).
[10]
Uday Bondhugula, Oktay Gunluk, Sanjeeb Dash, and Lakshminarayanan Renganarayanan. 2010. A model for fusion and code motion in an automatic parallelizing compiler. In International Conference on Parallel Architectures and Compilation Techniques. 343--352.
[11]
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008b. A practical automatic polyhedral parallelizer and locality optimizer. In ACM SIGPLAN Symposium on Programming Languages Design and Implementation (PLDI’08). ACM, New York, NY, 101--113.
[12]
Chun Chen. 2012. Polyhedra scanning revisited. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’12). ACM, New York, NY, 499--508.
[13]
Shiyi Chen and Gary D. Doolen. 1998. Lattice Boltzmann method for fluid flows. Annual Review of Fluid Mechanics 30, 1, 329--364.
[14]
C. Choffrut and K. Culik. 1983. Folding of the plane and the design of systolic arrays. Information Processing Letters 17, 3, 149--153.
[15]
Cloog 2004. The CLooG Code Generator in the Polyhedral Model. http://www.cloog.org.
[16]
COIN-OR. 2001. Computational Infrastructure for Operations Research. Retrieved March 2, 2016 from http://www.coin-or.org.
[17]
Dominique d’Humières. 2002. Multiple--relaxation--time Lattice Boltzmann models in three dimensions. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences 360, 1792, 437--451.
[18]
P. Feautrier. 1988. Parametric integer programming. RAIRO Recherche Opérationnelle 22, 3, 243--268.
[19]
P. Feautrier. 1991. Dataflow analysis of scalar and array references. International Journal of Parallel Programming 20, 1, 23--53.
[20]
P. Feautrier. 1992a. Some efficient solutions to the affine scheduling problem: Part I, one-dimensional time. International Journal of Parallel Programming 21, 5, 313--348.
[21]
P. Feautrier. 1992b. Some efficient solutions to the affine scheduling problem: Part II, multidimensional time. International Journal of Parallel Programming 21, 6, 389--420.
[22]
GNU. 2012. GLPK (GNU Linear Programming Kit). Retrieved March 2, 2016 from https://www.gnu.org/software/glpk/.
[23]
Martin Griebl. 2004. Automatic Parallelization of Loop Programs for Distributed Memory Architectures. Habilitation thesis. University of Passau, Passau, Germany.
[24]
Tobias Grosser, Hongbin Zheng, Ragesh Aloor, Andreas Simbrger, Armin Grolinger, and Louis-Noël Pouchet. 2011. Polly polyhedral optimization in LLVM. In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT’11).
[25]
Albert Hartono, Muthu Baskaran, Cédric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, and J. Ramanujam. 2009. A parametric multi-level tiler for imperfect loop nests. In International Conference on Supercomputing (ICS’09).
[26]
Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. 2011. Data layout transformation for stencil computations on short SIMD architectures. In ETAPS International Conference on Compiler Construction (CC’11). 225--245.
[27]
Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A stencil compiler for short-vector SIMD architectures. In ACM International Conference on Supercomputing. ACM, New York, NY.
[28]
F. Irigoin and R. Triolet. 1988. Supernode partitioning. In ACM SIGPLAN Principles of Programming Languages. ACM, New York, NY, 319--329.
[29]
W. Kelly and W. Pugh. 1995. A Unifying Framework for Iteration Reordering Transformations. Technical Report CS-TR-3430. Department of Computer Science, University of Maryland, College Park, MD.
[30]
W. Kelly, W. Pugh, and E. Rosser. 1995. Code generation for multiple mappings. In International Symposium on the Frontiers of Massively Parallel Computation. 332--341.
[31]
Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and P. Sadayappan. 2013. When polyhedral transformations meet SIMD code generation. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). ACM, New York, NY.
[32]
A. Leung, N. T. Vasilache, B. Meister, and R. A. Lethin. 2010. Methods and apparatus for joint parallelism and locality optimization in source code compilation. (June 2010). US Patent 2010/0070956 A1. Filed September 16, 2009, Issued March 18, 2010.
[33]
Wei Li and Keshav Pingali. 1994. A singular loop transformation framework based on non-singular matrices. International Journal of Parallel Programming 22, 2, 183--205.
[34]
A. Lim, Gerald I. Cheong, and Monica S. Lam. 1999. An affine partitioning algorithm to maximize parallelism and minimize communication. In ACM International Conference on Supercomputing (ICS’99). 228--237.
[35]
A. Lim and Monica S. Lam. 1997. Maximizing parallelism and minimizing synchronization with affine transforms. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 201--214.
[36]
A. Lim and Monica S. Lam. 1998. Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Computing 24, 3--4, 445--475.
[37]
Robin Lougee-Heimer. 2003. The common optimization interface for operations research. IBM Journal of Research and Development 47, 1, 57--66.
[38]
Benoit Meister, Allen Leung, Nicolas Vasilache, David Wohlford, Cédric Bastoul, and Richard Lethin. 2009. Productivity via automatic code generation for PGAS platforms with the R-stream compiler. In Workshop on Asynchrony in the PGAS Programming Model.
[39]
Benoît Meister, Nicolas Vasilache, David Wohlford, Muthu Baskaran, Allen Leung, and Richard Lethin. 2011. R-stream compiler. In Encyclopedia of Parallel Computing. 1756--1765.
[40]
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15).
[41]
Nissa Osheim, Michelle Mills Strout, Dave Rostron, and Sanjay Rajopadhye. 2008. Smashing: Folding space to tile through time. In Workshop on Languages and Compilers for Parallel Computing (LCPC’08). Springer, Berlin, 80--93.
[42]
Palabos. 2012. Palabos. Retrieved March 2, 2016 from http://www.palabos.org/.
[43]
Irshad Pananilath, Aravind Acharya, Vinay Vasista, and Uday Bondhugula. 2015. An optimizing code generator for a class of Lattice-Boltzmann computations. ACM Transactions on Architecture and Code Optimization. 12, 2, 14:1--14:23.
[44]
Pluto 2008. PLUTO: An automatic parallelizer and locality optimizer for affine loop nests. http://pluto-compiler.sourceforge.net.
[45]
Polybench 2010. Polybench suite. Retrieved March 2, 2016 from http://polybench.sourceforge.net.
[46]
L.-N. Pouchet, Cédric Bastoul, John Cavazos, and Albert Cohen. 2008. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Symposium on Programming Languages Design and Implementation (PLDI).
[47]
Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, P. Sadayappan, and Nicolas Vasilache. 2011. Loop transformations: Convexity, pruning and optimization. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL’11).
[48]
Fabien Quilleré, Sanjay V. Rajopadhye, and Doran Wilde. 2000. Generation of efficient nested loops from polyhedra. International Journal of Parallel Programming 28, 5, 469--498.
[49]
Gabe Rudy, Malik Murtaza Khan, Mary Hall, Chun Chen, and Jacqueline Chame. 2011. A programming language interface to describe transformations and code generation. In Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing (LCPC’10). Springer, Berlin, 136--150.
[50]
Robert Sadourny. 1975. The dynamics of finite-difference models of the shallow-water equations. Journal of the Atmospheric Sciences 32, 4.
[51]
Alexander Schrijver. 1986. Theory of Linear and Integer Programming. John Wiley & Sons, Hoboken, NJ.
[52]
Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2011. Cache accurate time skewing in iterative stencil computations. In International Conference on Parallel Processing (ICPP’11). 571--581.
[53]
Paul N. Swarztrauber. 2000. 171.swim SPEC CPU2000 Benchmark Description File. Standard Performance Evaluation Corporation. Retrieved March 2, 2016 from http://www.spec.org/cpu2000/CFP2000/171.swim/docs/171.swim.html.
[54]
Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir stencil compiler. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11). New York, NY, 117--128.
[55]
Nicolas Vasilache. 2007. Scalable Program Optimization Techniques in the Polyhedral Model. Ph.D. Dissertation. Université de Paris-Sud, INRIA Futurs, Orsay, France.
[56]
Nicolas Vasilache, Benoit Meister, Muthu Baskaran, and Richard Lethin. 2012. Joint scheduling and layout optimization to enable multi-level vectorization. In International Workshop on Polyhedral Compilation Techniques (IMPACT’12).
[57]
Sven Verdoolaege. 2010. ISL: An integer set library for the polyhedral model. In Mathematical Software—ICMS 2010, Komei Fukuda, Joris Hoeven, Michael Joswig, and Nobuki Takayama (Eds.). Lecture Series in Computer Science, Vol. 6327. Springer, Berlin, 299--302.
[58]
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4, 54:1--54:23.
[59]
M. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In ACM SIGPLAN Symposium on Programming Languages Design and Implementation. ACM, New York, NY, 30--44.
[60]
D. Wonnacott. 2000. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In IPDPS. 171--180.
[61]
Yoav Yaacoby and Peter R. Cappello. 1995. Converting affine recurrence equations to quasi-uniform recurrence equations. VLSI Signal Processing 11, 1--2, 113--131.
[62]
Tomofumi Yuki. 2014. Understanding Polybench/C 3.2 kernels. In International Workshop on Polyhedral Compilation Techniques (IMPACT’14).
[63]
Tomofumi Yuki, Gautam Gupta, DaeGon Kim, Tanveer Pathan, and Sanjay Rajopadhye. 2013. AlphaZ: A system for design space exploration in the polyhedral model. In Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, Vol. 7760. Springer, Berlin, 17--31.
[64]
Yutao Zhong, Xipeng Shen, and Chen Ding. 2009. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems 31, 6, Article 20, 39 pages.
[65]
Qisu Zou and Xiaoyi He. 1997. On pressure and velocity boundary conditions for the Lattice Boltzmann BGK model. Physics of Fluids (1994-present) 9, 6, 1591--1598.

Cited By

View all
  • (2024)Fully Verified Instruction SchedulingProceedings of the ACM on Programming Languages10.1145/36897398:OOPSLA2(791-816)Online publication date: 8-Oct-2024
  • (2024)DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN AcceleratorsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649196(126-137)Online publication date: 7-May-2024
  • (2024)Fast American Option Pricing using Nonlinear StencilsProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638506(316-332)Online publication date: 2-Mar-2024
  • Show More Cited By

Index Terms

  1. The Pluto+ Algorithm: A Practical Approach for Parallelization and Locality Optimization of Affine Loop Nests
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Transactions on Programming Languages and Systems
      ACM Transactions on Programming Languages and Systems  Volume 38, Issue 3
      May 2016
      209 pages
      ISSN:0164-0925
      EISSN:1558-4593
      DOI:10.1145/2914585
      Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 April 2016
      Accepted: 01 December 2015
      Revised: 01 December 2015
      Received: 01 March 2015
      Published in TOPLAS Volume 38, Issue 3

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Automatic parallelization
      2. affine transformations
      3. locality optimization
      4. loop transformations
      5. polyhedral model
      6. tiling

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • INRIA Associate Team PolyFlow

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)246
      • Downloads (Last 6 weeks)42
      Reflects downloads up to 24 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Fully Verified Instruction SchedulingProceedings of the ACM on Programming Languages10.1145/36897398:OOPSLA2(791-816)Online publication date: 8-Oct-2024
      • (2024)DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN AcceleratorsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649196(126-137)Online publication date: 7-May-2024
      • (2024)Fast American Option Pricing using Nonlinear StencilsProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638506(316-332)Online publication date: 2-Mar-2024
      • (2024)Energy-Aware Tile Size Selection for Affine Programs on GPUsProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO57630.2024.10444795(13-27)Online publication date: 2-Mar-2024
      • (2024)PolyTOPS: Reconfigurable and Flexible Polyhedral SchedulerProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO57630.2024.10444791(28-40)Online publication date: 2-Mar-2024
      • (2023)Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured GridProceedings of the ACM on Programming Languages10.1145/36228227:OOPSLA2(686-715)Online publication date: 16-Oct-2023
      • (2023)A Fast Algorithm for Aperiodic Linear Stencil Computation using Fast Fourier TransformsACM Transactions on Parallel Computing10.1145/360633810:4(1-34)Online publication date: 14-Dec-2023
      • (2023)Code Generation for In-Place StencilsProceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3579990.3580006(2-13)Online publication date: 17-Feb-2023
      • (2023)SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)10.1109/AICAS57966.2023.10168625(1-5)Online publication date: 11-Jun-2023
      • (2022)Arbitrarily Parallelizable Code: A Model of Computation Evaluated on a Message-Passing Many-Core SystemComputers10.3390/computers1111016411:11(164)Online publication date: 18-Nov-2022
      • Show More Cited By

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Login options

      Full Access

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media