research-article

Reconstructing Householder vectors from Tall-Skinny QR

Authors:

H.D. NguyenAuthors Info & Claims

Journal of Parallel and Distributed Computing, Volume 85, Issue C

Pages 3 - 31

https://doi.org/10.1016/j.jpdc.2015.06.003

Published: 01 November 2015 Publication History

Abstract

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation.We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm. Experiments on supercomputers demonstrate the benefits of the communication cost improvements: in particular, our experiments show substantial improvements over tuned library implementations for tall-and-skinny matrices. We also provide algorithmic improvements to the Householder QR and CAQR algorithms, and we investigate several alternatives to the Householder reconstruction algorithm that sacrifice guarantees on numerical stability in some cases in order to obtain higher performance. We reconstruct Householder vectors representing the Q -factor from Tall-Skinny QR.Our approach has the same asymptotic communication efficiency as TSQR.Additionally, it enables more communication-efficient parallel QR algorithms.We also provide algorithmic improvements to the Householder QR and CAQR algorithms.

References

[1]

E. Agullo, C. Coti, J. Dongarra, T. Herault, J. Langou, QR factorization of tall and skinny matrices in a grid computing environment, in: Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1-11.

[2]

E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J.D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, PA, USA, 1992.

[3]

M. Anderson, G. Ballard, J. Demmel, K. Keutzer, Communication-avoiding QR decomposition for GPUs, in: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium IPDPS'11, IEEE Computer Society, Washington, DC, USA, 2011, pp. 48-58.

Digital Library

[4]

T. Auckenthaler, T. Huckle, R. Wittmann, A blocked QR-decomposition for the parallel symmetric eigenvalue problem, Parallel Comput., 40 (2014) 186-194.

Digital Library

[5]

G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, H.D. Nguyen, E. Solomonik, Reconstructing Householder Vectors from Tall-Skinny QR, in: IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 1159-1170.

[6]

G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, H.D. Nguyen, E. Solomonik, Reconstructing householder vectors from Tall-Skinny QR, Tech. Rep., EECS Department, University of California, Berkeley, 2013.

[7]

G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Minimizing Communication in Linear Algebra, SIAM J. Matrix Anal. Appl. 32 (3).

[8]

C.H. Bischof, X. Sun, On Orthogonal Block Elimination, Tech. Rep. MCS-P450-0794, Argonne National Laboratory, Argonne, IL, 1994.

[9]

L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R.C. Whaley, ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, USA, 1997.

[10]

E. Chan, M. Heimlich, A. Purkayastha, R. van~de Geijn, Collective communication: theory, practice, and experience, Concurrency Comput. Pract. Exp., 19 (2007) 1749-1783.

Digital Library

[11]

J. Poulson, B. Marker, R.A. van~de Geijn, J.R. Hammond, N.A. Romero, Elemental: A new framework for distributed memory dense matrix computations, ACM Trans. Math. Softw., 39 (2013) 13:1-13:24.

Digital Library

[12]

J. Demmel, L. Grigori, M. Gu, H. Xiang, Communication Avoiding Rank Revealing QR Factorization with Column Pivoting, Tech. Rep. UCB/EECS-2013-46, EECS Department, University of California, Berkeley, 2013.

[13]

J. Demmel, L. Grigori, M. Hoemmen, J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SIAM J. Sci. Comput., 34 (2012) A206-A239.

Digital Library

[14]

J. Demmel, L. Grigori, M.F. Hoemmen, J. Langou, Communication-Optimal Parallel and Sequential QR and LU Factorizations, Tech. Rep. UCB/EECS-2008-89, EECS Department, University of California, Berkeley, 2008.

[15]

J. Dongarra, M. Faverge, T. HéRault, M. Jacquelin, J. Langou, Y. Robert, Hierarchical QR factorization algorithms for multi-core clusters, Parallel Comput., 39 (2013) 212-232.

Digital Library

[16]

Edison, NERSC's Cray XC30 System, http://www.nersc.gov/users/computational-systems/edison/, 2014.

[17]

A. Farley, Broadcast time in communication networks, SIAM J. Appl. Math., 39 (1980) 385-390.

Digital Library

[18]

T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, Y. Yamamoto, CholeskyQR2: A simple and communication-avoiding algorithm for computing a Tall-skinny QR factorization on a large-scale parallel system, in: Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems ScalA'14, IEEE Press, Piscataway, NJ, USA, 2014, pp. 31-38.

Digital Library

[19]

The Future of Computing Performance: Game Over or Next Level?, in: The Future of Computing Performance: Game Over or Next Level?, The National Academies Press, Washington, D.C., 2011.

[20]

G.H. Golub, R.J. Plemmons, A. Sameh, Parallel block schemes for large-scale least-squares computations, in: High-speed Computing: Scientific Applications and Algorithm Design, University of Illinois Press, Champaign, IL, USA, 1988, pp. 171-179.

[21]

N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA, 2002.

[22]

M. Hoemmen, A Communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method, in: Parallel Distributed Processing Symposium, IPDPS, 2011 IEEE International, 2011, pp. 966-977.

Digital Library

[23]

Hopper, NERSC's Cray XE6 System, http://www.nersc.gov/users/computational-systems/hopper/, 2014.

[24]

C. Bischof, C. Van¿Loan, The WY representation for products of Householder matrices, SIAM J. Sci. Stat. Comput. 8 (1).

[25]

G. Golub, C. Van~Loan, Matrix Computations, in: Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, 2012.

[26]

R. Schreiber, C. Van~Loan, A storage-efficient WY representation for products of householder transformations, SIAM J. Sci. Stat. Comput., 10 (1989) 53-57.

Digital Library

[27]

M. Mohiyuddin, M. Hoemmen, J. Demmel, K. Yelick, Minimizing communication in sparse matrix solvers, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis SC'09, ACM, New York, NY, USA, 2009, pp. 36:1-36:12.

[28]

D. Mori, Y. Yamamoto, S.-L. Zhang, Backward error analysis of the AllReduce algorithm for Householder QR decomposition, Japan J. Ind. Appl. Math., 29 (2012) 111-130.

[29]

C. Puglisi, Modification of the householder method based on compact WY representation, SIAM J. Sci. Stat. Comput., 13 (1992) 723-726.

Digital Library

[30]

R. Schreiber, B. Parlett, Block reflectors: Theory and computation, SIAM J. Numer. Anal., 25 (1988) 189-205.

Digital Library

[31]

E. Solomonik, J. Demmel, Communication-optimal 2.5D matrix multiplication and LU factorization algorithms, in: Springer Lecture Notes in Computer Science, Proceedings of Euro-Par, Bordeaux, France, 2011, pp. 90-109.

[32]

F. Song, H. Ltaief, B. Hadri, J. Dongarra, Scalable tile communication-avoiding QR factorization on multicore cluster systems, in: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis SC'10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 1-11.

Digital Library

[33]

X. Sun, C. Bischof, A basis-kernel representation of orthogonal matrices, SIAM J. Matrix Anal. Appl., 16 (1995) 1184-1196.

Digital Library

[34]

R. Thakur, R. Rabenseifner, W. Gropp, Optimization of collective communication operations in MPICH, Int. J. High Perform. Comput. Appl., 19 (2005) 49-66.

Digital Library

[35]

A. Tiskin, Communication-efficient parallel generic pairwise elimination, Future Gener. Comput. Syst., 23 (2007) 179-188.

Digital Library

[36]

J.L. Träff, A. Ripke, Optimal broadcast for fully connected networks, in: Lecture Notes in Computer Science, vol. 3726, Springer, Berlin, Heidelberg, 2005, pp. 45-56.

[37]

Y. Yamamoto, Aggregation of the compact WY representations generated by the TSQR algorithm, in: Conference Talk Presented at SIAM Applied Linear Algebra, 2012.

[38]

Y. Yamamoto, Personal communication, 2012.

[39]

Y. Yamamoto, Y. Nakatsukasa, Y. Yanagisawa, T. Fukaya, Roundoff Error Analysis of the CholeskyQR2 Algorithm, Tech. Rep. 43, Department of Mathematical Informatics, University of Tokyo, 2014.

Cited By

Wu XDi Napoli E(2023)Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL libraryProceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624249(1688-1696)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3624062.3624249
Li ZFang QBallard G(2021)Parallel Tucker Decomposition with Numerically Accurate SVDProceedings of the 50th International Conference on Parallel Processing10.1145/3472456.3472472(1-11)Online publication date: 9-Aug-2021
https://dl.acm.org/doi/10.1145/3472456.3472472
Tomás AQuintana-Ortí E(2020)Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processorsThe Journal of Supercomputing10.1007/s11227-020-03176-376:11(8771-8786)Online publication date: 24-Jan-2020
https://dl.acm.org/doi/10.1007/s11227-020-03176-3
Show More Cited By

Reconstructing Householder vectors from Tall-Skinny QR

Recommendations

Reconstructing Householder Vectors from Tall-Skinny QR
IPDPS '14: Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and ...
Analysis of a QR Algorithm for Computing Singular Values

We extend the Golub--Kahan algorithm for computing the singular value decomposition of bidiagonal matrices to triangular matrices $R$. Our algorithm avoids the explicit formation of $R^TR$ or $RR^T$.

We derive a relation between left and right ...
Exploring Dual-Triangular Structure for Efficient R-Initiated Tall-Skinny QR on GPGPU
Advances in Knowledge Discovery and Data Mining
Abstract
The QR decomposition is one of the fundamental matrix decompositions in data mining. A particularly challenging case of QR decomposition is to deal with the tall-and-skinny matrix. Tall-skinny QR has lots of applications such as Krylov subspace ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Parallel and Distributed Computing

Journal of Parallel and Distributed Computing Volume 85, Issue C

November 2015

104 pages

ISSN:0743-7315

Issue’s Table of Contents

Copyright © Elsevier Inc.

Publisher

Academic Press, Inc.

United States

Publication History

Published: 01 November 2015

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wu XDi Napoli E(2023)Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL libraryProceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624249(1688-1696)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3624062.3624249
Li ZFang QBallard G(2021)Parallel Tucker Decomposition with Numerically Accurate SVDProceedings of the 50th International Conference on Parallel Processing10.1145/3472456.3472472(1-11)Online publication date: 9-Aug-2021
https://dl.acm.org/doi/10.1145/3472456.3472472
Tomás AQuintana-Ortí E(2020)Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processorsThe Journal of Supercomputing10.1007/s11227-020-03176-376:11(8771-8786)Online publication date: 24-Jan-2020
https://dl.acm.org/doi/10.1007/s11227-020-03176-3
Rodríguez-Sánchez RCatalán SHerrero JQuintana-Ortí ETomás A(2019)Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVDNumerical Algorithms10.1007/s11075-018-0500-880:2(635-660)Online publication date: 1-Feb-2019
https://dl.acm.org/doi/10.1007/s11075-018-0500-8
Tomás AQuintana-Ortí E(2019)Cholesky and Gram-Schmidt Orthogonalization for Tall-and-Skinny QR Factorizations on Graphics ProcessorsEuro-Par 2019: Parallel Processing10.1007/978-3-030-29400-7_33(469-480)Online publication date: 26-Aug-2019
https://dl.acm.org/doi/10.1007/978-3-030-29400-7_33
Cheng NChen M(2019)Exploring Dual-Triangular Structure for Efficient R-Initiated Tall-Skinny QR on GPGPUAdvances in Knowledge Discovery and Data Mining10.1007/978-3-030-16145-3_45(578-589)Online publication date: 14-Apr-2019
https://dl.acm.org/doi/10.1007/978-3-030-16145-3_45
Ballard GDemmel JGrigori LJacquelin MKnight NScheideler CFineman J(2018)A 3D Parallel Algorithm for QR DecompositionProceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures10.1145/3210377.3210415(55-65)Online publication date: 11-Jul-2018
https://dl.acm.org/doi/10.1145/3210377.3210415
Tomás ARodríguez-Sánchez RCatalán SQuintana-Ortí E(2018)Reduction to Band Form for the Singular Value Decomposition on Graphics AcceleratorsProceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3178442.3178448(51-60)Online publication date: 24-Feb-2018
https://dl.acm.org/doi/10.1145/3178442.3178448

View Options

View options

Figures

Tables

Media

View Issue’s Table of Contents