Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3673038.3673076acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article
Open access

FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs

Published: 12 August 2024 Publication History

Abstract

Parallel numerical solvers for partial differential equations (PDEs) on structured meshes are critical in various scientific computing applications. However, current PDE solver frameworks suffer from programming or performance issues on modern accelerators such as GPUs. The complexity in programming significantly imposes a heavy burden on the development of new solver algorithms and incurs performance challenges, particularly for operator-based frameworks with redundant implementations or those that require cross-architecture optimization. In this paper, we propose FreeStencil, a linear solver compiler that emphasizes programming and optimization in fine grain. For programming, we utilized modular abstraction to implement matrix-free stencil computations in a fine-grained manner to avoid redundancy. For performance, we enabled graph optimizations towards solver iterations and employed multi-level tiling with fine-grained hardware awareness for high-performance GPU code generation. Experimental results demonstrate that FreeStencil achieves up to a 3.29x speedup (average 2.32x) on multiple GPU platforms for typical applications.

References

[1]
John David Anderson and John Wendt. 1995. Computational fluid dynamics. Vol. 206. Springer.
[2]
Steven F Ashby and Robert D Falgout. 1996. A parallel multigrid preconditioned conjugate gradient algorithm for groundwater flow simulations. Nuclear science and engineering 124, 1 (1996), 145–159.
[3]
Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2009. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. In Euro-Par 2009 Parallel Processing: 15th International Euro-Par Conference, Delft, The Netherlands, August 25-28, 2009. Proceedings 15. Springer, 863–874.
[4]
David Bailey, Tim Harris, William Saphir, Rob Van Der Wijngaart, Alex Woo, and Maurice Yarrow. 1995. The NAS parallel benchmarks 2.0. Technical Report. Technical Report NAS-95-020, NASA Ames Research Center.
[5]
Satish Balay, Shrirang Abhyankar, Mark Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Alp Dener, Victor Eijkhout, W Gropp, 2019. PETSc users manual. (2019).
[6]
Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11.
[7]
Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing locality and independence with logical regions. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11.
[8]
Tal Ben-Nun, Johannes de Fine Licht, Alexandros N Ziogas, Timo Schneider, and Torsten Hoefler. 2019. Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
[9]
Huanqi Cao, Shizhi Tang, Qianchao Zhu, Bowen Yu, and Wenguang Chen. 2023. Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid. Proceedings of the ACM on Programming Languages 7, OOPSLA2 (2023), 686–715.
[10]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, 2018. TVM: An automated end-to-end optimizing compiler for deep learning. arXiv preprint arXiv:1802.04799 (2018).
[11]
Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. Advances in Neural Information Processing Systems 31 (2018).
[12]
Ying Chen and Chen Shen. 2006. A Jacobian-free Newton-GMRES (m) method with adaptive preconditioner and its application for power flow calculations. IEEE Transactions on Power Systems 21, 3 (2006), 1096–1103.
[13]
Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. Format abstraction for sparse tensor algebra compilers. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1–30.
[14]
Alain Darte. 1999. On the complexity of loop fusion. In 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No. PR00425). IEEE, 149–157.
[15]
Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, and Song Han. 2021. Ios: Inter-operator scheduler for cnn acceleration. Proceedings of Machine Learning and Systems 3 (2021), 167–180.
[16]
Jack Dongarra and Michael A Heroux. 2013. Toward a new metric for ranking high performance computing systems. Sandia Report, SAND2013-4744 312 (2013), 150.
[17]
Alejandro Duran, Eduard Ayguadé, Rosa M Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel processing letters 21, 02 (2011), 173–193.
[18]
H Carter Edwards, Christian R Trott, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of parallel and distributed computing 74, 12 (2014), 3202–3216.
[19]
Robert D Falgout and Ulrike Meier Yang. 2002. hypre: A library of high performance preconditioners. In Computational Science—ICCS 2002: International Conference Amsterdam, The Netherlands, April 21–24, 2002 Proceedings, Part III. Springer, 632–641.
[20]
Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. Systems for Machine Learning 4, 9 (2018).
[21]
Michael A Heroux, Roscoe A Bartlett, Vicki E Howle, Robert J Hoekstra, Jonathan J Hu, Tamara G Kolda, Richard B Lehoucq, Kevin R Long, Roger P Pawlowski, Eric T Phipps, 2005. An overview of the Trilinos project. ACM Transactions on Mathematical Software (TOMS) 31, 3 (2005), 397–423.
[22]
Magnus R Hestenes, Eduard Stiefel, 1952. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards 49, 6 (1952), 409–436.
[23]
Justin Holewinski, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM international conference on Supercomputing. 311–320.
[24]
Dmitrii Kochkov, Jamie A Smith, Ayya Alieva, Qing Wang, Michael P Brenner, and Stephan Hoyer. 2021. Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences 118, 21 (2021), e2101784118.
[25]
Klaus Kofler, Biagio Cosenza, and Thomas Fahringer. 2015. Automatic data layout optimizations for GPUs. In Euro-Par 2015: Parallel Processing: 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings 21. Springer, 263–274.
[26]
Christian Lengauer, Sven Apel, Matthias Bolten, Armin Größlinger, Frank Hannig, Harald Köstler, Ulrich Rüde, Jürgen Teich, Alexander Grebhahn, Stefan Kronawitter, 2014. ExaStencils: Advanced stencil-code engineering. In Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II 20. Springer, 553–564.
[27]
G Makov and MC Payne. 1995. Periodic boundary conditions in ab initio calculations. Physical Review B 51, 7 (1995), 4014.
[28]
Dave A May, Jed Brown, and Laetitia Le Pourhiet. 2015. A scalable, matrix-free multigrid preconditioner for finite element discretizations of heterogeneous Stokes flow. Computer methods in applied mechanics and engineering 290 (2015), 496–523.
[29]
ROYUD Nishino and Shohei Hido Crissman Loomis. 2017. Cupy: A numpy-compatible library for nvidia gpu calculations. 31st confernce on neural information processing systems 151, 7 (2017).
[30]
NVIDIA. 2020. NVIDIA Ampere Architecture Whitepaper. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
[31]
NVIDIA Corporation. 2022. cuBLAS. https://developer.nvidia.com/cublas
[32]
NVIDIA Corporation. 2022. cuSPARSE. https://developer.nvidia.com/cusparse
[33]
James M Ortega and Robert G Voigt. 1985. Solution of partial differential equations on vector and parallel computers. SIAM review 27, 2 (1985), 149–240.
[34]
Christopher C Paige and Michael A Saunders. 1975. Solution of sparse indefinite systems of linear equations. SIAM journal on numerical analysis 12, 4 (1975), 617–629.
[35]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
[36]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices 48, 6 (2013), 519–530.
[37]
Yousef Saad. 2003. Iterative methods for sparse linear systems. SIAM.
[38]
Youcef Saad and Martin H Schultz. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on scientific and statistical computing 7, 3 (1986), 856–869.
[39]
Mehmet Sahin and Robert G Owens. 2003. A novel fully implicit finite volume method applied to the lid-driven cavity problem—Part I: High Reynolds number flow calculations. International journal for numerical methods in fluids 42, 1 (2003), 57–77.
[40]
Shizhi Tang, Jidong Zhai, Haojie Wang, Lin Jiang, Liyan Zheng, Zhenhao Yuan, and Chen Zhang. 2022. FreeTensor: a free-form DSL with holistic optimizations for irregular tensor programs. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 872–887.
[41]
Henk A Van der Vorst. 1992. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal on scientific and Statistical Computing 13, 2 (1992), 631–644.
[42]
Jinchao Xu. 1989. Theory of multilevel methods. Vol. 8924558. Cornell University Ithaca, NY.
[43]
Chao Yang, Wei Xue, Haohuan Fu, Hongtao You, Xinliang Wang, Yulong Ao, Fangfang Liu, Lin Gan, Ping Xu, Lanning Wang, 2016. 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 57–68.
[44]
Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, 2020. Ansor: Generating high-performance tensor programs for deep learning. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 863–879.
[45]
Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, 2022. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373.
[46]
Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, 2022. { ROLLER} : Fast and Efficient Tensor Compilation for Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 233–248.
[47]
Qianchao Zhu, Hao Luo, Chao Yang, Mingshuo Ding, Wanwang Yin, and Xinhui Yuan. 2021. Enabling and scaling the HPCG benchmark on the newest generation sunway supercomputer with 42 million heterogeneous cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13.

Index Terms

  1. FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Other conferences
      ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
      August 2024
      1279 pages
      ISBN:9798400717932
      DOI:10.1145/3673038
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 August 2024

      Check for updates

      Author Tags

      1. Compile Optimization
      2. Memory Intensive
      3. PDE Solver
      4. Stencil Computation
      5. Structured Mesh

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICPP '24

      Acceptance Rates

      Overall Acceptance Rate 91 of 313 submissions, 29%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 49
        Total Downloads
      • Downloads (Last 12 months)49
      • Downloads (Last 6 weeks)26
      Reflects downloads up to 16 Nov 2024

      Other Metrics

      Citations

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format.

      HTML Format

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media