Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Fast tridiagonal solvers on the GPU

Published: 09 January 2010 Publication History

Abstract

We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.

References

[1]
General-purpose computation using graphics hardware. http://www.gpgpu.org/.
[2]
NVIDIA CUDA compute unified device architecture, programming guide, 2009. Version 2.0.
[3]
S. Allmann, T. Rauber, and G. Runger. Cyclic reduction on distributed shared memory machines. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, pages 290--297, 2001.
[4]
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of Supercomputing '90, pages 2--11. IEEE Computer Society Press, 1990.
[5]
G. E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, Nov. 1990.
[6]
B. L. Buzbee, G. H. Golub, and C. W. Nielson. On direct methods for solving Poisson's equations. SIAM Journal on Numerical Analysis, 7(4):627--656, 1970.
[7]
J.-J. Climent, C. Perea, L. Tortosa, and A. Zamora. An overlapped two-way method for solving tridiagonal linear systems in a bsp computer. Applied Mathematics and Computation, 161(2):475--500, 2005.
[8]
Y. Dotsenko, N. K. Govindaraju, P.-P. J. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In Proceedings of the 22nd Annual International Conference on Supercomputing, pages 205--213. ACM, June 2008.
[9]
P. Dubois and G. Rodrigue. An analysis of the recursive doubling algorithm. In D. Kuck, D. Lawrie, and A. Sameh, editors, High Speed Computer and Algorithm Organization, pages 299--305. Academic Press, New York, NY, 1977.
[10]
Ö. Eğecioğlu, C. K. Koc, and A. J. Laub. A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors. Journal of Computational and Applied Mathematics, 27:95--108, 1989.
[11]
D. Göddeke and R. Strzodka. Accurate mixed-precision GPUmultigrid solvers on anisotropic grids. Submitted to IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators.
[12]
A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997.
[13]
G. R. Halliwell. Evaluation of vertical coordinate and vertical mixing algorithms in the HYbrid-Coordinate Ocean Model (HYCOM). Ocean Modelling, 7:285--322, 2004.
[14]
W. D. Hillis and G. L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170--1183, Dec. 1986.
[15]
C. T. Ho and S. L. Johnsson. Optimizing tridiagonal solvers for alternating direction methods on boolean cube multiprocessors. SIAM Journal of Scientific and Statistical Computing, 11(3):563--592, 1990.
[16]
R. W. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. Journal of the ACM, 12(1):95--113, Jan. 1965.
[17]
R. W. Hockney and C. R. Jesshope. Parallel Computers. Adam Hilger, Bristol, 1981.
[18]
Y. Huang and W. F. McColl. Two-way BSP algorithm for tridiagonal systems. Future Generation Computer Systems, 13:337--347, Mar. 1998.
[19]
M. Kass, A. Lefohn, and J. D. Owens. Interactive depth of field using simulated diffusion. Technical Report 06-01, Pixar Animation Studios, Jan. 2006.
[20]
M. Kass and G. Miller. Rapid, stable fluid dynamics for computer graphics. In Computer Graphics (Proceedings of SIGGRAPH 90), pages 49--57, Aug. 1990.
[21]
J. J. Lambiotte and R. G. Voigt. The solution of tridiagonal linear systems on the CDC STAR-100 computer. ACM Trans. Math. Software, 1(4):308--329, 1975.
[22]
S. M. Müller and D. Sheerer. A method to parallelize tridiagonal solvers. Parallel Computing, 17:181--188, 1991.
[23]
J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue: Tomorrow's Computing Today, 6(2):40--53, Mar. 2008.
[24]
M. Prieto, R. Santiago, D. Espadas, I. M. Llorente, and F. Tirado. Parallel multigrid for anisotropic elliptic equations. J. Parallel Distrib. Comput., 61(1):96--114, 2001.
[25]
S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Graphics Hardware 2007, pages 97--106, Aug. 2007.
[26]
S. Sengupta, A. E. Lefohn, and J. D. Owens. A work-efficient step-efficient prefix sum algorithm. In Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures, pages D-26-27, May 2006.
[27]
H. S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27--38, Jan. 1973.
[28]
X.-H. Sun, H. Zhang, and L. M. Ni. Efficient tridiagonal solvers on multicomputers. IEEE Transactions on Computers, C-41(3):286--296, Mar. 1992.
[29]
X.-H. Sun and W. Zhang. A parallel two-level hybrid method for tridiagonal systems and its application to fast Poisson solvers. IEEE Transactions on Parallel and Distributed Systems, PDS-15(2):97--106, Feb. 2004.
[30]
V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, page 31 (11pp), Nov. 2008.
[31]
V. Volkov and J. W. Demmel. Using GPUs to accelerate the bisection algorithm for finding eigenvalues of symmetric tridiagonal matrices. LAPACKWorking Note 197, Department of Computer Science, University of Tennessee, Knoxville, Jan. 2008.
[32]
H. H. Wang. A parallel method for tridiagonal equations. ACM Trans. Math. Software, 7:170--183, 1981.
[33]
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65--76, 2009.

Cited By

View all
  • (2023)A Novel Compute-Efficient Tridiagonal Solver for Many-Core ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.321476234:1(195-206)Online publication date: 1-Jan-2023
  • (2023)Efficient GPU implementation of the multivariate empirical mode decomposition algorithmJournal of Computational Science10.1016/j.jocs.2023.10218074(102180)Online publication date: Dec-2023
  • (2023)The N-shaped partition method: A novel parallel implementation of the Crank Nicolson algorithmComputer Physics Communications10.1016/j.cpc.2023.108713287(108713)Online publication date: Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 45, Issue 5
PPoPP '10
May 2010
346 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1837853
Issue’s Table of Contents
  • cover image ACM Conferences
    PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    January 2010
    372 pages
    ISBN:9781605588773
    DOI:10.1145/1693453
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2010
Published in SIGPLAN Volume 45, Issue 5

Check for updates

Author Tags

  1. gpgpu
  2. performance optimization
  3. tridiagonal linear system

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)60
  • Downloads (Last 6 weeks)2
Reflects downloads up to 23 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2023)A Novel Compute-Efficient Tridiagonal Solver for Many-Core ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.321476234:1(195-206)Online publication date: 1-Jan-2023
  • (2023)Efficient GPU implementation of the multivariate empirical mode decomposition algorithmJournal of Computational Science10.1016/j.jocs.2023.10218074(102180)Online publication date: Dec-2023
  • (2023)The N-shaped partition method: A novel parallel implementation of the Crank Nicolson algorithmComputer Physics Communications10.1016/j.cpc.2023.108713287(108713)Online publication date: Jun-2023
  • (2023)Numerical simulation of acoustic streaming in standing wavesComputers & Mathematics with Applications10.1016/j.camwa.2023.10.027152:C(199-220)Online publication date: 15-Dec-2023
  • (2023)Efficient GPU implementation of a Boltzmann-Schrödinger-Poisson solver for the simulation of nanoscale DG MOSFETsThe Journal of Supercomputing10.1007/s11227-023-05189-079:12(13370-13401)Online publication date: 23-Mar-2023
  • (2022)Sparse Linear AlgebraDeveloping Linear Algebra Codes on Modern Processors10.4018/978-1-7998-7082-1.ch004(94-137)Online publication date: 14-Oct-2022
  • (2022)On the Benefits of Empirical Mode Decomposition in Spatio-temporal EEG Analysis2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO)10.23919/MIPRO55190.2022.9803337(333-338)Online publication date: 23-May-2022
  • (2022) Mixed-Precision Algorithm for Finding Selected Eigenvalues and Eigenvectors of Symmetric and Hermitian Matrices 1 2022 IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH)10.1109/ScalAH56622.2022.00011(43-50)Online publication date: Nov-2022
  • (2022)Influence of thermal effects on the breakup of thin films of nanometric thicknessPhysical Review Fluids10.1103/PhysRevFluids.7.0640017:6Online publication date: 2-Jun-2022
  • (2022)Scalable parallel linear solver for compact banded systems on heterogeneous architecturesJournal of Computational Physics10.1016/j.jcp.2022.111443468(111443)Online publication date: Nov-2022
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media