research-article

Tessellating Star Stencils

Authors:

Hang CaoAuthors Info & Claims

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing

Article No.: 43, Pages 1 - 10

https://doi.org/10.1145/3337821.3337835

Published: 05 August 2019 Publication History

Abstract

Stencil computations represent a very common class of nested loops in scientific and engineering applications. The exhaustively studied tiling is one of the most powerful transformation techniques to explore the data locality and parallelism. Existing work often uniformly handles different stencil shapes. This paper first presents a concept called natural block to identify the difference between the star and box stencils. Then we propose anew two-level tessellation scheme for star stencils, where the natural block, as well as its successors can tessellate the spatial space and their extensions along the time dimension are able to form a tessellation of the iteration space. Furthermore, a novel implementation technique called double updating is developed for star stencils specifically to improve the in-core data reuse pattern. Evaluation results are provided that demonstrate the effectiveness of the approach.

References

[1]

V. Bandishti, I. Pananilath, and U. Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC '12. 1--11.

Digital Library

[2]

Protonu Basu, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, and Phillip Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In IPDPS '15.

Digital Library

[3]

Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and Optimizing Time-iterated Computations on Periodic Domains. In PACT '14.

Digital Library

[4]

Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In PLDI '08.

Digital Library

[5]

Hui-Min Cui, Lei Wang, Dong-Rui Fan, and Xiao-Bing Feng. 2010. Landing Stencil Code on Godson-T. J. Comput. Sci. Technol. 25, 4 (2010).

Digital Library

[6]

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. 2008. Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures. In SC 08.

Digital Library

[7]

Raúl de la Cruz and Mauricio Araya-Polo. 2014. Algorithm 942: Semi-Stencil. ACM Trans. Math. Softw. 40, 3 (April 2014).

Digital Library

[8]

Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. 2001. Eliminating Redundancies in Sum-of-product Array Computations. In ICS '01.

Digital Library

[9]

James Demmel, Mark Frederick Hoemmen, Marghoob Mohiyuddin, and Katherine A. Yelick. 2007. Avoiding Communication in Computing Krylov Subspaces. In Technical Report, EECS UCB.

[10]

Chris Ding and Yun He. 2001. A Ghost Cell Expansion Method for Reducing Communications in Solving PDE Problems. In SC '01.

Digital Library

[11]

Nikolaos Drosinos and Nectarios Koziris. 2006. The Effect of Process Topology and Load Balancing on Parallel Programming Models for SMP Clusters and Iterative Algorithms. The Journal of Supercomputing 35, 1 (2006).

Digital Library

[12]

Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2014. On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution. In SPAA '14.

Digital Library

[13]

Matteo Frigo and Volker Strumpen. 2005. Cache oblivious stencil computations. In ICS '05. ACM, 361--366.

Digital Library

[14]

Matteo Frigo and Volker Strumpen. 2006. The cache complexity of multithreaded cache oblivious algorithms. In SPAA '06.

Digital Library

[15]

Michael A. Frumkin and Rob F. Van der Wijngaart. 2002. Tight Bounds on Cache Use for Stencil Operations on Rectangular Grids. J. ACM 49, 3 (May 2002).

Digital Library

[16]

Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In CGO '14.

Digital Library

[17]

Tobias Grosser, Albert Cohen, Paul H. J. Kelly, J. Ramanujam, P. Sadayappan, and Sven Verdoolaege. 2013. Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles. In GPGPU-6.

Digital Library

[18]

Tobias Grosser, Sven Verdoolaege, Albert Cohen, and P. Sadayappan. 2014. The Relation Between Diamond Tiling and Hexagonal Tiling. Parallel Processing Letters 24, 03 (2014).

[19]

Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures. In ICS '15.

Digital Library

[20]

Ioan Hadade and Luca Mare. 2014. Exploiting SIMD and Thread-Level Parallelism in Multiblock CFD. In ISC 2014.

Digital Library

[21]

Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. 2011. Data Layout Transformation for Stencil Computations on Short-vector SIMD Architectures. In CC'11/ETAPS'11.

Digital Library

[22]

Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. AStencil Compiler for Short-vector SIMD Architectures. In ICS '13.

Digital Library

[23]

Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. High-performance Code Generation for Stencil Computations on GPU Architectures. In ICS '12.

Digital Library

[24]

Philipp Hupp and Riko Jacob. 2013. Tight Bounds for Low Dimensional Star Stencils in the External Memory Model. In WADS '13.

Digital Library

[25]

F. Irigoin and R. Triolet. 1988. Supernode Partitioning. In POPL '88.

Digital Library

[26]

Hong Jia-Wei and H. T. Kung. 1981. I/O complexity: The red-blue pebble game. In STOC '81.

Digital Library

[27]

Guohua Jin, John Mellor-Crummey, and Robert Fowler. Increasing Temporal Locality with Skewing and Recursive Blocking. In SC '01.

Digital Library

[28]

Mengyao Jin, Haohuan Fu, Zihong Lv, and Guangwen Yang. 2016. Libra: An Automated Code Generation and Tuning Framework for Register-limited Stencils on GPUs. In CF '16.

Digital Library

[29]

Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P Sadayappan. 2007. Effective Automatic Parallelization of Stencil Computations. In PLDI '07.

Digital Library

[30]

Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms. In ASPLOS IV.

Digital Library

[31]

Claudia Leopold. On Optimal Temporal Locality of Stencil Codes. In SAC '02.

Digital Library

[32]

Claudia Leopold. 2002. An Analytical Evaluation of Tiling for Stencil Codes with Time Loop. In IPDPS '02.

Digital Library

[33]

Claudia Leopold. 2002. Tight Bounds on Capacity Misses for 3D Stencil Codes (ICCS '02). Springer Berlin Heidelberg, Berlin, Heidelberg, 843--852.

Digital Library

[34]

Yulong Luo, Guangming Tan, Zeyao Mo, and Ninghui Sun. FAST: A Fast Stencil Autotuning Framework Based On An Optimal-solution Space Model. In ICS '15.

Digital Library

[35]

Thibaut Lutz, Christian Fensch, and Murray Cole. 2013. PARTANS: An Autotuning Framework for Stencil Computation on multi-GPU Systems. ACM Trans. Archit. Code Optim. 9, 4 Jan. 2013).

Digital Library

[36]

T. Malas, G. Hager, H. Ltaief, H. Stengel, G. Wellein, and D. Keyes. 2015. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates. SIAM Journal on Scientific Computing (2015).

[37]

Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations. ACM Trans. Parallel Comput. (2017), 12:1--12:32.

Digital Library

[38]

A. C. McKellar and E. G. Coffman, Jr. 1969. Organizing Matrices and Matrix Operations for Paged Memory Systems. Commun. ACM (1969).

Digital Library

[39]

Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In ICS '09.

Digital Library

[40]

Marghoob Mohiyuddin, Mark Hoemmen, James Demmel, and Katherine Yelick. 2009. Minimizing Communication in Sparse Matrix Solvers. In SC '09.

Digital Library

[41]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs. In SC '10. 1--13.

Digital Library

[42]

Irshad Pananilath, Aravind Acharya, Vinay Vasista, and Uday Bondhugula. 2015. An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations. ACM Trans. Archit. Code Optim. (2015).

Digital Library

[43]

E. H. Phillips and M. Fatica. 2010. Implementing the Himeno benchmark with CUDA on GPU clusters. In IPDPS '10. 1--10.

[44]

Nirmal Prajapati, Waruna Ranasinghe, Sanjay Rajopadhye, Rumen Andonov, Hristo Djidjev, and Tobias Grosser. 2017. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils. In PPoPP '17.

Digital Library

[45]

Shah M. Faizur Rahman, Qing Yi, and Apan Qasem. 2011. Understanding Stencil Code Performance on Multicore Architectures. In CF '11.

[46]

Fabrice Rastello and Thierry Dauxois. 2002. Efficient Tiling for an ODE Discrete Integration Program: Redundant Tasks Instead of Trapezoidal Shaped-Tiles. In IPDPS '02.

Digital Library

[47]

Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2018. Register Optimizations for Stencils on GPUs. In PPoPP '18.

[48]

Gabriel Rivera and Chau-Wen Tseng. 2000. Tiling Optimizations for 3D Scientific Computations. In SC '00.

Digital Library

[49]

N. Sedaghati, R. Thomas, L. N. Pouchet, R. Teodorescu, and P. Sadayappan. 2011. StVEC: A Vector Instruction Extension for High Performance Stencil Computation. In PACT '11. 276--287.

Digital Library

[50]

Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel. 2014. Tradeoffs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations. In SPAA '14.

Digital Library

[51]

Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality. In PLDI '99.

Digital Library

[52]

Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein. 2015. Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model. In ICS '15.

Digital Library

[53]

Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. Cache Accurate Time Skewing in Iterative Stencil Computations. In ICPP '11.

Digital Library

[54]

Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2010. Cache oblivious parallelograms in iterative stencil computations. In ICS '10.

Digital Library

[55]

Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In SPAA '11.

Digital Library

[56]

G. Wellein, G. Hager, T. Zeiser, M. Wittmann, and H. Fehske. 2009. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization. In COMPSAC '09. 579--586.

Digital Library

[57]

Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. In PLDI '91.

Digital Library

[58]

M. Wolfe. 1989. More Iteration Space Tiling. In Supercomputing '89.

Digital Library

[59]

D. Wonnacott. 2000. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In IPDPS '07. 171--180.

Digital Library

[60]

David Wonnacott. 2002. Achieving Scalable Locality with Time Skewing. Int. J. Parallel Program. 30, 3 (June 2002).

Digital Library

[61]

David G Wonnacott and Michelle Mills Strout. 2013. On the scalability ofloop tiling techniques. IMPACT 2013 (2013), 3.

[62]

Liang Yuan, Yunquan Zhang, Peng Guo, and Shan Huang. 2017. Tessellating Stencils (SC '17).

Digital Library

[63]

Yongpeng Zhang and Frank Mueller. 2012. Auto-generation and Auto-tuning of 3D Stencil Codes on GPU Clusters. In CGO '12.

Digital Library

[64]

Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical Overlapped Tiling. In CGO '12.

Digital Library

[65]

Gerhard Zumbusch. 2012. Vectorized Higher Order Finite Difference Kernels. Springer Berlin Heidelberg, Berlin, Heidelberg, 343--357.

Digital Library

Cited By

Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Cao HYuan LZhang HZhang YWu BLi KLi SZhang MLu PXiao J(2023)AGCM-3DLF: Accelerating Atmospheric General Circulation Model via 3-D Parallelization and Leap-FormatIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.323101334:3(766-780)Online publication date: 1-Mar-2023
https://doi.org/10.1109/TPDS.2022.3231013
Li KYuan LZhang YYue YCao H(2022)An Efficient Vectorization Scheme for Stencil Computation2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00069(650-660)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00069
Show More Cited By

Recommendations

Temporal vectorization for stencils
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Stencil computations represent a very common class of nested loops in scientific and engineering applications. Exploiting vector units in modern CPUs is crucial to achieving peak performance. Previous vectorization approaches often consider the data ...
Tessellating stencils
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Stencil computations represent a very common class of nested loops in scientific and engineering applications. The exhaustively studied tiling is one of the most powerful transformation techniques to explore the data locality and parallelism. Unlike ...
Generalized Unstructured Decimation

This article presents a general algorithm for decimating unstructured discretized data sets. The discretized space may be a planar triangulation, a general 3D surface triangulation, or a 3D tetrahedrization. Local dynamic vertex removal is performed ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing

August 2019

1107 pages

ISBN:9781450362955

DOI:10.1145/3337821

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

University of Tsukuba: University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 August 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP 2019

ICPP 2019: 48th International Conference on Parallel Processing

August 5 - 8, 2019

Kyoto, Japan

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
128
Total Downloads

Downloads (Last 12 months)9
Downloads (Last 6 weeks)1

Reflects downloads up to 06 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Cao HYuan LZhang HZhang YWu BLi KLi SZhang MLu PXiao J(2023)AGCM-3DLF: Accelerating Atmospheric General Circulation Model via 3-D Parallelization and Leap-FormatIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.323101334:3(766-780)Online publication date: 1-Mar-2023
https://doi.org/10.1109/TPDS.2022.3231013
Li KYuan LZhang YYue YCao H(2022)An Efficient Vectorization Scheme for Stencil Computation2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00069(650-660)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00069
Li KYuan LZhang YYue Yde Supinski BHall MGamblin T(2021)Reducing redundancy in data organization and arithmetic calculation for stencil computationsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476154(1-15)Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476154
Perepelkina ALevchenko V(2020)Recursive DiamondCandyProceedings of the 2020 4th International Conference on High Performance Compilation, Computing and Communications10.1145/3407947.3407951(20-24)Online publication date: 27-Jun-2020
https://dl.acm.org/doi/10.1145/3407947.3407951
Wang HChandramowlishwaran A(2020)Pencil: A Pipelined Algorithm for Distributed StencilsSC20: International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SC41405.2020.00089(1-16)Online publication date: Nov-2020
https://doi.org/10.1109/SC41405.2020.00089
Cao HYuan LZhang HWu BLi SLu PZhang YXu YZhang M(2020)A Highly Efficient Dynamical Core of Atmospheric General Circulation Model based on Leap-Format2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00020(95-104)Online publication date: May-2020
https://doi.org/10.1109/IPDPS47924.2020.00020

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten