Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3581784.3607050acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems

Published: 11 November 2023 Publication History

Abstract

Sparse direct solvers play a vital role in large-scale high performance computing in science and engineering. Existing distributed sparse direct methods employ multifrontal/supernodal patterns to aggregate columns of nearly identical forms and to exploit dense basic linear algebra subprograms (BLAS) for computation. However, such a data layout may bring more unevenness when the structure of the input matrix is not ideal, and using dense BLAS may waste many floating-point operations on zero fill-ins.
In this paper, we propose a new sparse direct solver called PanguLU. Unlike the multifrontal/supernodal layout, our work relies on simpler regular 2D blocking and stores the blocks in their sparse forms to avoid any extra fill-ins. Based on the sparse patterns of the blocks, a variety of block-wise sparse BLAS methods are developed and selected for higher efficiency on local GPUs. To make PanguLU more scalable, we also adjust the mapping of blocks to processes for overall more balanced workload, and propose a synchronisation-free communication strategy considering the dependencies among different sub-tasks to reduce overall latency overhead.
Experiments on two distributed heterogeneous platforms consisting of 128 NVIDIA A100 GPUs and 128 AMD MI50 GPUs demonstrate that PanguLU achieves up to 11.70x and 17.97x speedups over the latest SuperLU_DIST, and scales up to 47.51x and 74.84x on the 128 A100 and MI50 GPUs over a single GPU, respectively.

Supplemental Material

MP4 File - SC23 paper presentation recording for "PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems"
SC23 paper presentation recording for "PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems", by Xu Fu, Bingbin Zhang, Tengcheng Wang, Wenhao Li, Yuechen Lu, Enxin Yi, Jianqi Zhao, Xiaohan Geng, Fangying Li, Jingwen Zhang, Zhou Jin, Weifeng Liu

References

[1]
https://www.top500.org/. 2023.
[2]
E. Agullo, P. R. Amestoy, A. Buttari, A. Guermouche, J.-Y. L'Excellent, and F.-H. Rouet. Robust Memory-Aware Mappings for Parallel Multifrontal Factorizations. SIAM Journal on Scientific Computing, 38(3), 2016.
[3]
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov. LU Factorization for Accelerator-Based Systems. In AICCSA '11. IEEE, 2011.
[4]
P. R. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, and C. Weisbecker. Improving Multifrontal Methods by Means of Block Low-rank Representations. SIAM Journal on Scientific Computing, 37(3), 2015.
[5]
P. R. Amestoy, T. A. Davis, and I. S. Duff. An Approximate Minimum Degree Ordering Algorithm. SIAM Journal on Matrix Analysis and Applications, 17(4), 1996.
[6]
P. R. Amestoy, T. A. Davis, and I. S. Duff. Algorithm 837: AMD, An Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.
[7]
P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster. A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1), 2001.
[8]
P. R. Amestoy, I. S. Duff, J.-Y. L' Excellent, and J. Koster. MUMPS: A General Purpose Distributed Memory Sparse Solver. In International Workshop on Applied Parallel Computing, 2000.
[9]
P. R. Amestoy, I. S. Duff, and C. Vömel. Task Scheduling in an Asynchronous Distributed Memory Multifrontal Solver. SIAM Journal on Matrix Analysis and Applications, 26(2), 2004.
[10]
P. R. Amestoy and C. Puglisi. An Unsymmetrized Multifrontal LU Factorization. SIAM Journal on Matrix Analysis and Applications, 24(2), 2002.
[11]
E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, and A. McKenney. LAPACK Users' Guide. SIAM, 1999.
[12]
S. Cayrols, I. S. Duff, and F. Lopez. Parallelization of the Solve Phase in A Task-based Cholesky Solver Using A Sequential Task Flow Model. The International Journal of High Performance Computing Applications, 34(3), 2020.
[13]
X. Chen, L. Ren, Y. Wang, and H. Yang. GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling. IEEE Transactions on Parallel and Distributed Systems, 26(3), 2015.
[14]
X. Chen, Y. Wang, and H. Yang. An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation. In ASP-DAC '12, 2012.
[15]
X. Chen, Y. Wang, and H. Yang. NICSLU: An Adaptive Sparse Matrix Solver For parallel Circuit Simulation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(2), 2013.
[16]
X. Chen, Y. Wang, and H. Yang. A Fast Parallel Sparse Solver For SPICE-based Circuit Simulators. In DATE '15, 2015.
[17]
T. A. Davis. Algorithm 832: UMFPACK V4.3---an Unsymmetric-Pattern Multifrontal Method. ACM Transactions on Mathematical Software, 30(2), 2004.
[18]
T. A. Davis. Direct Methods for Sparse Linear Systems. SIAM, 2006.
[19]
T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. A Column Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.
[20]
T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. Algorithm 836: COLAMD, A Column Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.
[21]
T. A. Davis and W. W. Hager. Dynamic Supernodes in Sparse Cholesky Update/Downdate and Triangular Solves. ACM Transactions on Mathematical Software, 35(4), 2009.
[22]
T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, 38(1), 2011.
[23]
T. A. Davis and E. Palamadai Natarajan. Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems. ACM Transactions on Mathematical Software, 37(3), 2010.
[24]
J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu. A Supernodal Approach to Sparse Partial Pivoting. SIAM Journal on Matrix Analysis and Applications, 20(3), 1999.
[25]
J. W. Demmel, J. R. Gilbert, and X. S. Li. An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination. SIAM Journal on Matrix Analysis and Applications, 20(4), 1999.
[26]
N. Ding, Y. Liu, S. Williams, and X. S. Li. A Message-driven, Multi-GPU Parallel Sparse Triangular Solver. In ACDA21 '21, 2021.
[27]
I. S. Duff. A Survey of Sparse Matrix Research. Proceedings of the IEEE, 65(4), 1977.
[28]
I. S. Duff. Parallel Implementation of Multifrontal Schemes. Parallel computing, 3(3), 1986.
[29]
I. S. Duff. Parallel Algorithms for Sparse Matrix Solution. Parallel Computing, 2020.
[30]
I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, Inc, 2nd edition, 2017.
[31]
I. S. Duff, J. D. Hogg, and F. Lopez. A New Sparse Symmetric Indefinite Solver Using A Posteriori Threshold Pivoting. NLAFET Working Note, 2018.
[32]
I. S. Duff and J. Koster. The Design and Use of Algorithms for Permuting Large Entries to the Diagonal of Sparse Matrices. SIAM Journal on Matrix Analysis and Applications, 20(4), 1999.
[33]
I. S. Duff and J. Koster. On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix. SIAM Journal on Matrix Analysis and Applications, 22(4), 2001.
[34]
I. S. Duff, P. Leleux, D. Ruiz, and F. S. Torun. Improving the Scalability of the ABCD Solver with a Combination of New Load Balancing and Communication Minimization Techniques. In Parco '19, 2019.
[35]
I. S. Duff and S. Pralet. Strategies for Scaling and Pivoting for Sparse Symmetric Indefinite Problems. SIAM Journal on Matrix Analysis and Applications, 27(2), 2005.
[36]
I. S. Duff and J. K. Reid. The Multifrontal Solution of Indefinite Sparse Symmetric Linear. ACM Transactions on Mathematical Software, 9(3), 1983.
[37]
I. S. Duff and J. K. Reid. MA48, a Fortran code for direct solution of sparse un-symmetric linear systems of equations. Science and Engineering Research Council, Rutherford Appleton Laboratory, 1993.
[38]
I. S. Duff and J. K. Reid. The Design of MA48: A Code for The Direct Solution of Sparse Unsymmetric Linear Systems of Equations. ACM Transactions on Mathematical Software, 22(2), 1996.
[39]
I. S. Duff and J. A. Scott. A Parallel Direct Solver for Large Sparse Highly Unsymmetric Linear Systems. ACM Transactions on Mathematical Software, 30(2), 2004.
[40]
S. C. Eisenstat and J. W. H. Liu. Exploiting Structural Symmetry in Unsymmetric Sparse Symbolic Factorization. SIAM Journal on Matrix Analysis and Applications, 13(1), 1992.
[41]
S. C. Eisenstat and J. W. H. Liu. Exploiting Structural Symmetry in a Sparse Partial Pivoting Code. SIAM Journal on Scientific Computing, 14(1), 1993.
[42]
K. Eswar, P. Sadayappan, C.-H. Huang, and V. Visvanathan. Supernodal Sparse Cholesky Factorization on Distributed-memory Multiprocessors. In ICPP '93, volume 3, 1993.
[43]
K. Eswar, P. Sadayappan, and V. Visvanathan. Multifrontal Factorization of Sparse Matrices on Shared-Memory Multiprocessors. In ICPP '91, 1991.
[44]
A. Gaihre, X. S. Li, and H. Liu. gSoFa: Scalable Sparse Symbolic LU Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(4), 2022.
[45]
L. Grigori, J. W. Demmel, and X. S. Li. Parallel Symbolic Factorization for Sparse LU with Static Pivoting. SIAM Journal on Scientific Computing, 29(3), 2007.
[46]
L. Grigori, J. W. Demmel, and H. Xiang. CALU: A Communication Optimal LU Factorization Algorithm. SIAM Journal on Matrix Analysis and Applications, 32(4), 2011.
[47]
A. Gupta. WSMP: Watson Sparse Matrix Package (Part-I: Direct Solution of Symmetric Sparse Systems). IBM TJ Watson Research Center, Yorktown Heights, NY, 2000.
[48]
A. Gupta. WSMP: Watson Sparse Matrix Package (Part-II: Direct Solution of General Sparse Systems). IBM TJ Watson Research Center, Yorktown Heights, NY, 2000.
[49]
K. He, S. X.-D. Tan, H. Wang, and G. Shi. GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis. IEEE Transactions on Very Large Scale Integration Systems, 24(3), 2015.
[50]
G. Karypis and V. Kumar. METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-reducing Orderings of Sparse Matrices. 1997.
[51]
B. Kumar, P. Sadayappan, and C.-H. Huang. On Sparse Matrix Reordering for Parallel Factorization. In ICS '94, 1994.
[52]
W.-K. Lee, R. Achar, and M. S. Nakhla. Dynamic GPU Parallel Sparse LU Factorization for Fast Circuit Simulation. IEEE Transactions on Very Large Scale Integration Systems, 26(11), 2018.
[53]
X. S. Li. An Overview of SuperLU: Algorithms, Implementation, and User Interface. ACM Transactions on Mathematical Software, 31(3), 2005.
[54]
X. S. Li and J. W. Demmel. Making Sparse Gaussian Elimination Scalable by Static Pivoting. In SC '98, 1998.
[55]
X. S. Li and J. W. Demmel. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Transactions on Mathematical Software, 29(2), 2003.
[56]
X. S. Li, P. Lin, Y. Liu, and P. Sao. Newly Released Capabilities in the Distributed-Memory SuperLU Sparse Direct Solver. ACM Transactions on Mathematical Software, 49(1), 2023.
[57]
J. W. H. Liu. The Role of Elimination Trees in Sparse Factorization. SIAM Journal on Matrix Analysis and Applications, 11(1), 1990.
[58]
W. Liu, A. Li, J. D. Hogg, I. S. Duff, and B. Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. In Euro-Par '16, 2016.
[59]
Y. Liu, N. Ding, P. Sao, S. Williams, and X. S. Li. Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters. In SC '23, 2023.
[60]
Y. Liu, M. Jacquelin, P. Ghysels, and X. S. Li. Highly Scalable Distributed-memory Sparse Triangular Solution Algorithms. In CSC '18, 2018.
[61]
S. Peng and S. X.-D. Tan. GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation. IEEE Design & Test, 37(3), 2020.
[62]
L. Ren, X. Chen, Y. Wang, C. Zhang, and H. Yang. Sparse LU Factorization for Parallel Circuit Simulation on GPU. In DAC '12, 2012.
[63]
P. Sadayappan and S. K. Rao. Communication Reduction for Distributed Sparse Matrix Factorization on a Processor Mesh. In SC '89, 1989.
[64]
P. Sadayappan and V. Visvanathan. Efficient Sparse Matrix Factorization for Circuit Simulation On Vector Supercomputers. In DAC '98, 1989.
[65]
P. Sao, R. Kannan, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D Sparse Triangular Solver. In ICS '19, 2019.
[66]
P. Sao, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices. In IPDPS '18, 2018.
[67]
P. Sao, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D Algorithm for Sparse LU Factorization on Heterogeneous Systems. Journal of Parallel and Distributed Computing, 131, 2019.
[68]
P. Sao, X. Liu, R. Vuduc, and X. S. Li. A Sparse Direct Solver for Distributed Memory Xeon Phi-Accelerated Systems. In IPDPS '15, 2015.
[69]
P. Sao, R. Vuduc, and X. S. Li. A Distributed CPU-GPU Sparse Direct Solver. In Euro-Par '14, 2014.
[70]
O. Schenk and K. Gärtner. Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO. Future Generation Computer Systems, 20(3), 2004.
[71]
O. Schenk and K. Gärtner. Two-level Dynamic Scheduling in PARDISO: Improved scalability on Shared memory Multiprocessing Systems. Parallel Computing, 28(2), 2002.
[72]
O. Schenk, K. Gärtner, and W. Fichtner. Efficient Sparse LU Factorization with Left-right Looking Strategy on Shared Memory Multiprocessors. BIT Numerical Mathematics, 40, 2000.
[73]
O. Schenk, K. Gärtner, W. Fichtner, and A. Stricker. PARDISO: a High-Performance Serial and Parallel Sparse Linear Solver in Semiconductor Device Simulation. Future Generation Computer Systems, 18(1), 2001.
[74]
G. Tan, C. Shui, Y. Wang, X. Yu, and Y. Yan. Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems. IEEE Transactions on Parallel and Distributed Systems, 32(9), 2021.
[75]
M. Tian, J. Wang, Z. Zhang, W. Du, J. Pan, and T. Liu. swSuperLU: A Highly Scalable Sparse Direct Solver on Sunway Manycore Architecture. The Journal of Supercomputing, 78(9), 2022.
[76]
T. Wang, W. Li, H. Pei, Y. Sun, Z. Jin, and W. Liu. Accelerating Sparse LU Factorization with Density-Aware Adaptive Matrix Multiplication for Circuit Simulation. In DAC '23, 2023.
[77]
J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li. Superfast Multifrontal Method for Large Structured Linear Systems of Equations. SIAM Journal on Matrix Analysis and Applications, 31(3), 2010.
[78]
Y. Xia, P. Jiang, G. Agrawal, and R. Ramnath. End-to-End LU Factorization of Large Matrices on GPUs. In PPoPP '23, 2023.
[79]
Z. Yan, B. Xie, X. Li, and Y. Bao. Exploiting Architecture Advances for Sparse Solvers in Circuit Simulation. In DATE '22, 2022.
[80]
C. Yu, W. Wang, and D. Pierce. A CPU-GPU Hybrid Approach for the Unsymmetric Multifrontal Method. Parallel Computing, 37(12), 2011.
[81]
J. Zhao, Y. Wen, Y. Luo, Z. Jin, W. Liu, and Z. Zhou. SFLU: Synchronization-Free Sparse LU Factorization for Fast Circuit Simulation on GPUs. In DAC '21, 2021.

Cited By

View all
  • (2024)Machine Learning and GPU Accelerated Sparse Linear Solvers for Transistor-Level Circuit Simulation: A Perspective Survey (Invited Paper)Proceedings of the 29th Asia and South Pacific Design Automation Conference10.1109/ASP-DAC58780.2024.10473846(96-101)Online publication date: 22-Jan-2024
  • (2024)Accelerating Large-Scale Sparse LU Factorization for RF Circuit SimulationEuro-Par 2024: Parallel Processing10.1007/978-3-031-69583-4_13(182-195)Online publication date: 26-Aug-2024

Index Terms

  1. PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2023
      1428 pages
      ISBN:9798400701092
      DOI:10.1145/3581784
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 November 2023

      Permissions

      Request permissions for this article.

      Check for updates

      Badges

      Author Tags

      1. sparse LU
      2. regular 2D block
      3. distributed heterogeneous systems

      Qualifiers

      • Research-article

      Conference

      SC '23
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

      Upcoming Conference

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)884
      • Downloads (Last 6 weeks)67
      Reflects downloads up to 21 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Machine Learning and GPU Accelerated Sparse Linear Solvers for Transistor-Level Circuit Simulation: A Perspective Survey (Invited Paper)Proceedings of the 29th Asia and South Pacific Design Automation Conference10.1109/ASP-DAC58780.2024.10473846(96-101)Online publication date: 22-Jan-2024
      • (2024)Accelerating Large-Scale Sparse LU Factorization for RF Circuit SimulationEuro-Par 2024: Parallel Processing10.1007/978-3-031-69583-4_13(182-195)Online publication date: 26-Aug-2024

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media