research-article

Open access

FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs

Author:

Qianchao ZhuAuthors Info & Claims

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing

Pages 1022 - 1031

https://doi.org/10.1145/3673038.3673076

Published: 12 August 2024 Publication History

All formats PDF

Abstract

Parallel numerical solvers for partial differential equations (PDEs) on structured meshes are critical in various scientific computing applications. However, current PDE solver frameworks suffer from programming or performance issues on modern accelerators such as GPUs. The complexity in programming significantly imposes a heavy burden on the development of new solver algorithms and incurs performance challenges, particularly for operator-based frameworks with redundant implementations or those that require cross-architecture optimization. In this paper, we propose FreeStencil, a linear solver compiler that emphasizes programming and optimization in fine grain. For programming, we utilized modular abstraction to implement matrix-free stencil computations in a fine-grained manner to avoid redundancy. For performance, we enabled graph optimizations towards solver iterations and employed multi-level tiling with fine-grained hardware awareness for high-performance GPU code generation. Experimental results demonstrate that FreeStencil achieves up to a 3.29x speedup (average 2.32x) on multiple GPU platforms for typical applications.

References

[1]

John David Anderson and John Wendt. 1995. Computational fluid dynamics. Vol. 206. Springer.

[2]

Steven F Ashby and Robert D Falgout. 1996. A parallel multigrid preconditioned conjugate gradient algorithm for groundwater flow simulations. Nuclear science and engineering 124, 1 (1996), 145–159.

[3]

Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2009. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. In Euro-Par 2009 Parallel Processing: 15th International Euro-Par Conference, Delft, The Netherlands, August 25-28, 2009. Proceedings 15. Springer, 863–874.

Digital Library

[4]

David Bailey, Tim Harris, William Saphir, Rob Van Der Wijngaart, Alex Woo, and Maurice Yarrow. 1995. The NAS parallel benchmarks 2.0. Technical Report. Technical Report NAS-95-020, NASA Ames Research Center.

[5]

Satish Balay, Shrirang Abhyankar, Mark Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Alp Dener, Victor Eijkhout, W Gropp, 2019. PETSc users manual. (2019).

[6]

Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11.

Digital Library

[7]

Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing locality and independence with logical regions. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11.

Digital Library

[8]

Tal Ben-Nun, Johannes de Fine Licht, Alexandros N Ziogas, Timo Schneider, and Torsten Hoefler. 2019. Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.

Digital Library

[9]

Huanqi Cao, Shizhi Tang, Qianchao Zhu, Bowen Yu, and Wenguang Chen. 2023. Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid. Proceedings of the ACM on Programming Languages 7, OOPSLA2 (2023), 686–715.

Digital Library

[10]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, 2018. TVM: An automated end-to-end optimizing compiler for deep learning. arXiv preprint arXiv:1802.04799 (2018).

[11]

Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. Advances in Neural Information Processing Systems 31 (2018).

[12]

Ying Chen and Chen Shen. 2006. A Jacobian-free Newton-GMRES (m) method with adaptive preconditioner and its application for power flow calculations. IEEE Transactions on Power Systems 21, 3 (2006), 1096–1103.

[13]

Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. Format abstraction for sparse tensor algebra compilers. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1–30.

Digital Library

[14]

Alain Darte. 1999. On the complexity of loop fusion. In 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No. PR00425). IEEE, 149–157.

[15]

Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, and Song Han. 2021. Ios: Inter-operator scheduler for cnn acceleration. Proceedings of Machine Learning and Systems 3 (2021), 167–180.

[16]

Jack Dongarra and Michael A Heroux. 2013. Toward a new metric for ranking high performance computing systems. Sandia Report, SAND2013-4744 312 (2013), 150.

[17]

Alejandro Duran, Eduard Ayguadé, Rosa M Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel processing letters 21, 02 (2011), 173–193.

[18]

H Carter Edwards, Christian R Trott, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of parallel and distributed computing 74, 12 (2014), 3202–3216.

Digital Library

[19]

Robert D Falgout and Ulrike Meier Yang. 2002. hypre: A library of high performance preconditioners. In Computational Science—ICCS 2002: International Conference Amsterdam, The Netherlands, April 21–24, 2002 Proceedings, Part III. Springer, 632–641.

[20]

Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. Systems for Machine Learning 4, 9 (2018).

[21]

Michael A Heroux, Roscoe A Bartlett, Vicki E Howle, Robert J Hoekstra, Jonathan J Hu, Tamara G Kolda, Richard B Lehoucq, Kevin R Long, Roger P Pawlowski, Eric T Phipps, 2005. An overview of the Trilinos project. ACM Transactions on Mathematical Software (TOMS) 31, 3 (2005), 397–423.

Digital Library

[22]

Magnus R Hestenes, Eduard Stiefel, 1952. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards 49, 6 (1952), 409–436.

[23]

Justin Holewinski, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM international conference on Supercomputing. 311–320.

Digital Library

[24]

Dmitrii Kochkov, Jamie A Smith, Ayya Alieva, Qing Wang, Michael P Brenner, and Stephan Hoyer. 2021. Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences 118, 21 (2021), e2101784118.

[25]

Klaus Kofler, Biagio Cosenza, and Thomas Fahringer. 2015. Automatic data layout optimizations for GPUs. In Euro-Par 2015: Parallel Processing: 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings 21. Springer, 263–274.

[26]

Christian Lengauer, Sven Apel, Matthias Bolten, Armin Größlinger, Frank Hannig, Harald Köstler, Ulrich Rüde, Jürgen Teich, Alexander Grebhahn, Stefan Kronawitter, 2014. ExaStencils: Advanced stencil-code engineering. In Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II 20. Springer, 553–564.

[27]

G Makov and MC Payne. 1995. Periodic boundary conditions in ab initio calculations. Physical Review B 51, 7 (1995), 4014.

[28]

Dave A May, Jed Brown, and Laetitia Le Pourhiet. 2015. A scalable, matrix-free multigrid preconditioner for finite element discretizations of heterogeneous Stokes flow. Computer methods in applied mechanics and engineering 290 (2015), 496–523.

[29]

ROYUD Nishino and Shohei Hido Crissman Loomis. 2017. Cupy: A numpy-compatible library for nvidia gpu calculations. 31st confernce on neural information processing systems 151, 7 (2017).

[30]

NVIDIA. 2020. NVIDIA Ampere Architecture Whitepaper. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

[31]

NVIDIA Corporation. 2022. cuBLAS. https://developer.nvidia.com/cublas

[32]

NVIDIA Corporation. 2022. cuSPARSE. https://developer.nvidia.com/cusparse

[33]

James M Ortega and Robert G Voigt. 1985. Solution of partial differential equations on vector and parallel computers. SIAM review 27, 2 (1985), 149–240.

[34]

Christopher C Paige and Michael A Saunders. 1975. Solution of sparse indefinite systems of linear equations. SIAM journal on numerical analysis 12, 4 (1975), 617–629.

[35]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).

[36]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices 48, 6 (2013), 519–530.

Digital Library

[37]

Yousef Saad. 2003. Iterative methods for sparse linear systems. SIAM.

[38]

Youcef Saad and Martin H Schultz. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on scientific and statistical computing 7, 3 (1986), 856–869.

[39]

Mehmet Sahin and Robert G Owens. 2003. A novel fully implicit finite volume method applied to the lid-driven cavity problem—Part I: High Reynolds number flow calculations. International journal for numerical methods in fluids 42, 1 (2003), 57–77.

[40]

Shizhi Tang, Jidong Zhai, Haojie Wang, Lin Jiang, Liyan Zheng, Zhenhao Yuan, and Chen Zhang. 2022. FreeTensor: a free-form DSL with holistic optimizations for irregular tensor programs. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 872–887.

Digital Library

[41]

Henk A Van der Vorst. 1992. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal on scientific and Statistical Computing 13, 2 (1992), 631–644.

[42]

Jinchao Xu. 1989. Theory of multilevel methods. Vol. 8924558. Cornell University Ithaca, NY.

[43]

Chao Yang, Wei Xue, Haohuan Fu, Hongtao You, Xinliang Wang, Yulong Ao, Fangfang Liu, Lin Gan, Ping Xu, Lanning Wang, 2016. 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 57–68.

Digital Library

[44]

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, 2020. Ansor: Generating high-performance tensor programs for deep learning. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 863–879.

[45]

Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, 2022. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373.

Digital Library

[46]

Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, 2022. { ROLLER} : Fast and Efficient Tensor Compilation for Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 233–248.

[47]

Qianchao Zhu, Hao Luo, Chao Yang, Mingshuo Ding, Wanwang Yin, and Xinhui Yuan. 2021. Enabling and scaling the HPCG benchmark on the newest generation sunway supercomputer with 42 million heterogeneous cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13.

Digital Library

Index Terms

FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs
1. Computing methodologies
  1. Parallel computing methodologies
2. Software and its engineering

Recommendations

Performance evaluation of OpenMP's target construct on GPUs-exploring compiler optimisations

OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This ...
Fine-grained parallelization of lattice QCD kernel routine on GPUs

Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by one kernel routine responsible for computing the actions of a Dirac operator. This paper describes an experience in parallelizing this kernel ...
GeST: Generalized Stencil Auto-tuning Framework on GPUs
ACM-TURC '24: Proceedings of the ACM Turing Award Celebration Conference - China 2024

Stencil computations are widely used in high performance computing (HPC) applications. In recent years, stencils have become more diverse in terms of stencil order, memory accesses, and computation patterns. To adapt diverse stencils to GPUs, a variety ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing

August 2024

1279 pages

ISBN:9798400717932

DOI:10.1145/3673038

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 August 2024

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP '24

ICPP '24: the 53rd International Conference on Parallel Processing

August 12 - 15, 2024

Gotland, Sweden

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
181
Total Downloads

Downloads (Last 12 months)181
Downloads (Last 6 weeks)49

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten