Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs

  • Conference paper
  • First Online:
High Performance Computing for Computational Science -- VECPAR 2014 (VECPAR 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8969)

Abstract

The Generalized Minimum Residual (GMRES) method is a popular Krylov subspace projection method for solving a nonsymmetric linear system of equations. On modern computers, communication is becoming increasingly expensive compared to arithmetic operations, and a communication-avoiding variant (CA-GMRES) may improve the performance of GMRES. To further enhance the performance of CA-GMRES, in this paper we propose two techniques that target its two main computational kernels: the tall-skinny QR factorization (TSQR) and the matrix powers kernel (MPK). First, to improve the numerical stability of TSQR based on the Cholesky QR (CholQR) factorization, we use higher-precision arithmetic at carefully selected steps of the factorization. In particular, our mixed-precision CholQR takes the input matrix in the standard \(64\)-bit double precision, but accumulates some of its intermediate results in a software-emulated double-double precision. Compared with the standard CholQR, this mixed-precision CholQR requires about \(8.5\times \) more computation but only a much smaller increase in communication. Since computation is becoming less expensive relative to communication on newer computers, the relative overhead of the mixed-precision CholQR is decreasing. Our case studies on a GPU demonstrate that using higher-precision arithmetic for this small but critical segment of the algorithm can improve not only the overall numerical stability of CA-GMRES but also, in some cases, its performance. Second, we study an adaptive scheme that dynamically adjusts the step size of MPK based on static inputs and on performance measurements gathered during the first restart loop of CA-GMRES. Since the optimal step size of MPK is often much smaller than that of the orthogonalization kernel, the overall performance of CA-GMRES can be improved by using different step sizes for these two kernels. Our performance results on multiple GPUs show that our adaptive scheme chooses a near-optimal step size for MPK, reducing the total solution time of CA-GMRES.
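
The paper's implementation of these kernels is GPU-based and is described in the body of the chapter; as a rough illustration of the mixed-precision CholQR idea summarized above, the following NumPy sketch accumulates the Gram matrix \(B = V^{T}V\) and its small Cholesky factorization in a higher precision, then applies the triangular solve in ordinary double precision. Here numpy's longdouble merely stands in for the software-emulated double-double arithmetic mentioned in the abstract, and the function names are our own for illustration; this is not the authors' implementation.

```python
import numpy as np

def cholesky_upper(B):
    """Unblocked Cholesky B = R^T R, kept in B's own dtype (works for longdouble)."""
    n = B.shape[0]
    R = np.zeros_like(B)
    for j in range(n):
        R[j, j] = np.sqrt(B[j, j] - R[:j, j] @ R[:j, j])
        R[j, j + 1:] = (B[j, j + 1:] - R[:j, j] @ R[:j, j + 1:]) / R[j, j]
    return R

def mixed_precision_cholqr(V, hi=np.longdouble):
    """CholQR of a tall-skinny matrix V (m x n, m >> n): V = Q R.

    Sketch only: 'hi' (numpy's longdouble) stands in for the double-double
    arithmetic used in the paper; the actual scheme and its GPU kernels
    differ in which steps are promoted to higher precision.
    """
    Vh = V.astype(hi)
    B = Vh.T @ Vh                 # Gram matrix, accumulated in higher precision
    R = cholesky_upper(B)         # small n x n factorization, also in higher precision
    R64 = R.astype(np.float64)    # round the triangular factor back to double
    Q = np.linalg.solve(R64.T, V.T).T   # Q = V R^{-1} via a triangular solve
    return Q, R64

# Toy comparison on an ill-conditioned tall-skinny matrix: with hi=float64 this
# degenerates to plain CholQR, whose loss of orthogonality grows with cond(V)^2;
# higher-precision accumulation keeps Q noticeably closer to orthonormal.
m, n = 10000, 16
V = np.random.randn(m, n) @ np.diag(np.logspace(0, 6, n))
for hi in (np.float64, np.longdouble):
    Q, R = mixed_precision_cholqr(V, hi)
    print(np.dtype(hi).name, np.linalg.norm(Q.T @ Q - np.eye(n)))
```

The usual motivation, consistent with the abstract's stability claim, is that plain double-precision CholQR loses orthogonality roughly in proportion to \(\kappa(V)^{2}\), while promoting the Gram matrix and its factorization to higher precision brings this closer to a \(\kappa(V)\)-like dependence; the extra work is confined to the \(O(mn^{2})\) Gram-matrix product and the small \(n \times n\) factorization, which is consistent with the abstract's report of about \(8.5\times \) more computation for this kernel but only a much smaller increase in communication.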


Notes

  1. In the current implementation, the numbers of rows and columns in \(X\) and \(Y\) are a multiple of \(h\), and multiples of \(m_b\) and \(n_b\), respectively, where \(n_b\) is a multiple of \(n_r\).

  2. http://mplapack.sourceforge.net.

  3. http://www.cise.ufl.edu/research/sparse/matrices/.

  4. http://www.eecs.berkeley.edu/~hdnguyen/rblas.

  5. http://crd.lbl.gov/groups-depts/ftg/projects/current-projects/corvette/precimonious.



Acknowledgments

This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research (ASCR), and was funded in part by the National Science Foundation under Grant No. ACI-1339822, DOE Grant #DE-SC0010042 ("Extreme-scale Algorithms & Solver Resilience (EASIR)"), the Russian Scientific Fund (Agreement N14-11-00190), and NVIDIA. This research used resources of the Keeneland Computing Facility at the Georgia Institute of Technology, which is supported by the National Science Foundation under Contract OCI-0910735. We thank Maho Nakata, Daichi Mukunoki, and the members of the DOE EASIR project for helpful discussions.

Author information


Corresponding author

Correspondence to Ichitaro Yamazaki.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Yamazaki, I., Tomov, S., Dong, T., Dongarra, J. (2015). Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science, vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_2


  • DOI: https://doi.org/10.1007/978-3-319-17353-5_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17352-8

  • Online ISBN: 978-3-319-17353-5

  • eBook Packages: Computer Science, Computer Science (R0)
