research-article

Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures

Authors:

Ichitaro Yamazaki,

Sivasankaran Rajamanickam,

Nathan Ellingwood

ICPP '20: Proceedings of the 49th International Conference on Parallel Processing

Article No.: 70, Pages 1 - 11

https://doi.org/10.1145/3404397.3404428

Published: 17 August 2020

Abstract

The sparse triangular solver is an important kernel in many computational applications. However, a fast, parallel sparse triangular solver for manycore architectures such as GPUs has been an open problem in the field for several years. In this paper, we develop a sparse triangular solver that takes advantage of the supernodal structure of the triangular matrices that come from the direct factorization of a sparse matrix. We implemented our solver using Kokkos and Kokkos Kernels so that it is portable to different manycore architectures. This has the additional benefit of allowing our triangular solver to use team-level kernels and to take advantage of the hierarchical parallelism available on the GPU. We compare the effects of different scheduling schemes on performance and also investigate an algorithmic variant called the partitioned inverse. Our performance results on an NVIDIA V100 or P100 GPU demonstrate that our implementation can be 12.4× or 19.5× faster than the vendor-optimized implementation in NVIDIA's cuSPARSE library.
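To make the dependency structure behind the scheduling discussion concrete, the following is a minimal sketch and not the paper's Kokkos implementation. It shows a sequential forward substitution for a lower-triangular matrix stored in CSR format, together with a level-set computation, which is one common scheduling scheme for sparse triangular solves (the abstract does not state which schemes the paper compares). The function names `lower_solve_csr` and `level_sets` and the example matrix are hypothetical and chosen only for illustration.

```python
def lower_solve_csr(indptr, indices, data, b):
    """Solve L x = b by forward substitution; L is lower triangular in CSR.

    Assumption: every row stores its (nonzero) diagonal entry.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]
            else:
                acc -= data[k] * x[j]  # j < i, so x[j] is already known
        x[i] = acc / diag
    return x


def level_sets(indptr, indices):
    """Group rows into levels; rows in the same level do not depend on each other."""
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j != i:
                # Row i must wait for every row j it references.
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level) + 1)] if n else []
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


# Hypothetical 4x4 lower-triangular example:
# [2 0 0 0]
# [0 4 0 0]
# [1 0 2 0]
# [0 2 1 1]
indptr = [0, 1, 2, 4, 7]
indices = [0, 1, 0, 2, 1, 2, 3]
data = [2.0, 4.0, 1.0, 2.0, 2.0, 1.0, 1.0]
b = [2.0, 4.0, 3.0, 5.0]

x = lower_solve_csr(indptr, indices, data, b)
levels = level_sets(indptr, indices)
print(x)       # [1.0, 1.0, 1.0, 2.0]
print(levels)  # [[0, 1], [2], [3]]
```

In this sketch, rows 0 and 1 share a level and could be solved concurrently, while rows 2 and 3 must wait for earlier levels. A supernode-based solver of the kind the abstract describes would work on blocks of columns rather than individual rows, which this row-wise example does not attempt to show.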


Cited By

Z. Hu, J. Sun, Z. Li, and G. Sun. 2024. AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs. ACM Transactions on Architecture and Code Optimization 21, 4 (2024), 1–25. Online publication date: 25-Jun-2024.
https://dl.acm.org/doi/10.1145/3674911
M. Bernaschi, A. Celestini, F. Vella, and P. D'Ambra. 2023. A Multi-GPU Aggregation-Based AMG Preconditioner for Iterative Linear Solvers. IEEE Transactions on Parallel and Distributed Systems 34, 8 (2023), 2365–2376. Online publication date: Aug-2023.
https://doi.org/10.1109/TPDS.2023.3287238
I. Yamazaki, A. Heinlein, and S. Rajamanickam. 2023. An Experimental Study of Two-level Schwarz Domain-Decomposition Preconditioners on GPUs. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 680–689. Online publication date: May-2023.
https://doi.org/10.1109/IPDPS54959.2023.00073


Published In

ICPP '20: Proceedings of the 49th International Conference on Parallel Processing

August 2020

844 pages

ISBN:9781450388160

DOI:10.1145/3404397

Copyright © 2020 ACM.

© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP '20

ICPP '20: 49th International Conference on Parallel Processing

August 17 - 20, 2020

Edmonton, AB, Canada

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
