Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3337821.3337835acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

Tessellating Star Stencils

Published: 05 August 2019 Publication History

Abstract

Stencil computations represent a very common class of nested loops in scientific and engineering applications. The exhaustively studied tiling is one of the most powerful transformation techniques to explore the data locality and parallelism. Existing work often uniformly handles different stencil shapes. This paper first presents a concept called natural block to identify the difference between the star and box stencils. Then we propose anew two-level tessellation scheme for star stencils, where the natural block, as well as its successors can tessellate the spatial space and their extensions along the time dimension are able to form a tessellation of the iteration space. Furthermore, a novel implementation technique called double updating is developed for star stencils specifically to improve the in-core data reuse pattern. Evaluation results are provided that demonstrate the effectiveness of the approach.

References

[1]
V. Bandishti, I. Pananilath, and U. Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC '12. 1--11.
[2]
Protonu Basu, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, and Phillip Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In IPDPS '15.
[3]
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and Optimizing Time-iterated Computations on Periodic Domains. In PACT '14.
[4]
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In PLDI '08.
[5]
Hui-Min Cui, Lei Wang, Dong-Rui Fan, and Xiao-Bing Feng. 2010. Landing Stencil Code on Godson-T. J. Comput. Sci. Technol. 25, 4 (2010).
[6]
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. 2008. Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures. In SC 08.
[7]
Raúl de la Cruz and Mauricio Araya-Polo. 2014. Algorithm 942: Semi-Stencil. ACM Trans. Math. Softw. 40, 3 (April 2014).
[8]
Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. 2001. Eliminating Redundancies in Sum-of-product Array Computations. In ICS '01.
[9]
James Demmel, Mark Frederick Hoemmen, Marghoob Mohiyuddin, and Katherine A. Yelick. 2007. Avoiding Communication in Computing Krylov Subspaces. In Technical Report, EECS UCB.
[10]
Chris Ding and Yun He. 2001. A Ghost Cell Expansion Method for Reducing Communications in Solving PDE Problems. In SC '01.
[11]
Nikolaos Drosinos and Nectarios Koziris. 2006. The Effect of Process Topology and Load Balancing on Parallel Programming Models for SMP Clusters and Iterative Algorithms. The Journal of Supercomputing 35, 1 (2006).
[12]
Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2014. On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution. In SPAA '14.
[13]
Matteo Frigo and Volker Strumpen. 2005. Cache oblivious stencil computations. In ICS '05. ACM, 361--366.
[14]
Matteo Frigo and Volker Strumpen. 2006. The cache complexity of multithreaded cache oblivious algorithms. In SPAA '06.
[15]
Michael A. Frumkin and Rob F. Van der Wijngaart. 2002. Tight Bounds on Cache Use for Stencil Operations on Rectangular Grids. J. ACM 49, 3 (May 2002).
[16]
Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In CGO '14.
[17]
Tobias Grosser, Albert Cohen, Paul H. J. Kelly, J. Ramanujam, P. Sadayappan, and Sven Verdoolaege. 2013. Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles. In GPGPU-6.
[18]
Tobias Grosser, Sven Verdoolaege, Albert Cohen, and P. Sadayappan. 2014. The Relation Between Diamond Tiling and Hexagonal Tiling. Parallel Processing Letters 24, 03 (2014).
[19]
Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures. In ICS '15.
[20]
Ioan Hadade and Luca Mare. 2014. Exploiting SIMD and Thread-Level Parallelism in Multiblock CFD. In ISC 2014.
[21]
Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. 2011. Data Layout Transformation for Stencil Computations on Short-vector SIMD Architectures. In CC'11/ETAPS'11.
[22]
Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. AStencil Compiler for Short-vector SIMD Architectures. In ICS '13.
[23]
Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. High-performance Code Generation for Stencil Computations on GPU Architectures. In ICS '12.
[24]
Philipp Hupp and Riko Jacob. 2013. Tight Bounds for Low Dimensional Star Stencils in the External Memory Model. In WADS '13.
[25]
F. Irigoin and R. Triolet. 1988. Supernode Partitioning. In POPL '88.
[26]
Hong Jia-Wei and H. T. Kung. 1981. I/O complexity: The red-blue pebble game. In STOC '81.
[27]
Guohua Jin, John Mellor-Crummey, and Robert Fowler. Increasing Temporal Locality with Skewing and Recursive Blocking. In SC '01.
[28]
Mengyao Jin, Haohuan Fu, Zihong Lv, and Guangwen Yang. 2016. Libra: An Automated Code Generation and Tuning Framework for Register-limited Stencils on GPUs. In CF '16.
[29]
Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P Sadayappan. 2007. Effective Automatic Parallelization of Stencil Computations. In PLDI '07.
[30]
Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms. In ASPLOS IV.
[31]
Claudia Leopold. On Optimal Temporal Locality of Stencil Codes. In SAC '02.
[32]
Claudia Leopold. 2002. An Analytical Evaluation of Tiling for Stencil Codes with Time Loop. In IPDPS '02.
[33]
Claudia Leopold. 2002. Tight Bounds on Capacity Misses for 3D Stencil Codes (ICCS '02). Springer Berlin Heidelberg, Berlin, Heidelberg, 843--852.
[34]
Yulong Luo, Guangming Tan, Zeyao Mo, and Ninghui Sun. FAST: A Fast Stencil Autotuning Framework Based On An Optimal-solution Space Model. In ICS '15.
[35]
Thibaut Lutz, Christian Fensch, and Murray Cole. 2013. PARTANS: An Autotuning Framework for Stencil Computation on multi-GPU Systems. ACM Trans. Archit. Code Optim. 9, 4 Jan. 2013).
[36]
T. Malas, G. Hager, H. Ltaief, H. Stengel, G. Wellein, and D. Keyes. 2015. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates. SIAM Journal on Scientific Computing (2015).
[37]
Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations. ACM Trans. Parallel Comput. (2017), 12:1--12:32.
[38]
A. C. McKellar and E. G. Coffman, Jr. 1969. Organizing Matrices and Matrix Operations for Paged Memory Systems. Commun. ACM (1969).
[39]
Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In ICS '09.
[40]
Marghoob Mohiyuddin, Mark Hoemmen, James Demmel, and Katherine Yelick. 2009. Minimizing Communication in Sparse Matrix Solvers. In SC '09.
[41]
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs. In SC '10. 1--13.
[42]
Irshad Pananilath, Aravind Acharya, Vinay Vasista, and Uday Bondhugula. 2015. An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations. ACM Trans. Archit. Code Optim. (2015).
[43]
E. H. Phillips and M. Fatica. 2010. Implementing the Himeno benchmark with CUDA on GPU clusters. In IPDPS '10. 1--10.
[44]
Nirmal Prajapati, Waruna Ranasinghe, Sanjay Rajopadhye, Rumen Andonov, Hristo Djidjev, and Tobias Grosser. 2017. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils. In PPoPP '17.
[45]
Shah M. Faizur Rahman, Qing Yi, and Apan Qasem. 2011. Understanding Stencil Code Performance on Multicore Architectures. In CF '11.
[46]
Fabrice Rastello and Thierry Dauxois. 2002. Efficient Tiling for an ODE Discrete Integration Program: Redundant Tasks Instead of Trapezoidal Shaped-Tiles. In IPDPS '02.
[47]
Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2018. Register Optimizations for Stencils on GPUs. In PPoPP '18.
[48]
Gabriel Rivera and Chau-Wen Tseng. 2000. Tiling Optimizations for 3D Scientific Computations. In SC '00.
[49]
N. Sedaghati, R. Thomas, L. N. Pouchet, R. Teodorescu, and P. Sadayappan. 2011. StVEC: A Vector Instruction Extension for High Performance Stencil Computation. In PACT '11. 276--287.
[50]
Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel. 2014. Tradeoffs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations. In SPAA '14.
[51]
Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality. In PLDI '99.
[52]
Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein. 2015. Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model. In ICS '15.
[53]
Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. Cache Accurate Time Skewing in Iterative Stencil Computations. In ICPP '11.
[54]
Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2010. Cache oblivious parallelograms in iterative stencil computations. In ICS '10.
[55]
Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In SPAA '11.
[56]
G. Wellein, G. Hager, T. Zeiser, M. Wittmann, and H. Fehske. 2009. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization. In COMPSAC '09. 579--586.
[57]
Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. In PLDI '91.
[58]
M. Wolfe. 1989. More Iteration Space Tiling. In Supercomputing '89.
[59]
D. Wonnacott. 2000. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In IPDPS '07. 171--180.
[60]
David Wonnacott. 2002. Achieving Scalable Locality with Time Skewing. Int. J. Parallel Program. 30, 3 (June 2002).
[61]
David G Wonnacott and Michelle Mills Strout. 2013. On the scalability ofloop tiling techniques. IMPACT 2013 (2013), 3.
[62]
Liang Yuan, Yunquan Zhang, Peng Guo, and Shan Huang. 2017. Tessellating Stencils (SC '17).
[63]
Yongpeng Zhang and Frank Mueller. 2012. Auto-generation and Auto-tuning of 3D Stencil Codes on GPU Clusters. In CGO '12.
[64]
Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical Overlapped Tiling. In CGO '12.
[65]
Gerhard Zumbusch. 2012. Vectorized Higher Order Finite Difference Kernels. Springer Berlin Heidelberg, Berlin, Heidelberg, 343--357.

Cited By

View all
  • (2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
  • (2023)AGCM-3DLF: Accelerating Atmospheric General Circulation Model via 3-D Parallelization and Leap-FormatIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.323101334:3(766-780)Online publication date: 1-Mar-2023
  • (2022)An Efficient Vectorization Scheme for Stencil Computation2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00069(650-660)Online publication date: May-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
August 2019
1107 pages
ISBN:9781450362955
DOI:10.1145/3337821
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • University of Tsukuba: University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 August 2019

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Box Stencil
  2. Star Stencil
  3. Stencil Computation
  4. Tessellation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2019

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)9
  • Downloads (Last 6 weeks)1
Reflects downloads up to 06 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
  • (2023)AGCM-3DLF: Accelerating Atmospheric General Circulation Model via 3-D Parallelization and Leap-FormatIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.323101334:3(766-780)Online publication date: 1-Mar-2023
  • (2022)An Efficient Vectorization Scheme for Stencil Computation2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00069(650-660)Online publication date: May-2022
  • (2021)Reducing redundancy in data organization and arithmetic calculation for stencil computationsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476154(1-15)Online publication date: 14-Nov-2021
  • (2020)Recursive DiamondCandyProceedings of the 2020 4th International Conference on High Performance Compilation, Computing and Communications10.1145/3407947.3407951(20-24)Online publication date: 27-Jun-2020
  • (2020)Pencil: A Pipelined Algorithm for Distributed StencilsSC20: International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SC41405.2020.00089(1-16)Online publication date: Nov-2020
  • (2020)A Highly Efficient Dynamical Core of Atmospheric General Circulation Model based on Leap-Format2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00020(95-104)Online publication date: May-2020

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media