Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Reconstructing Householder vectors from Tall-Skinny QR

Published: 01 November 2015 Publication History

Abstract

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation.We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm. Experiments on supercomputers demonstrate the benefits of the communication cost improvements: in particular, our experiments show substantial improvements over tuned library implementations for tall-and-skinny matrices. We also provide algorithmic improvements to the Householder QR and CAQR algorithms, and we investigate several alternatives to the Householder reconstruction algorithm that sacrifice guarantees on numerical stability in some cases in order to obtain higher performance. We reconstruct Householder vectors representing the Q -factor from Tall-Skinny QR.Our approach has the same asymptotic communication efficiency as TSQR.Additionally, it enables more communication-efficient parallel QR algorithms.We also provide algorithmic improvements to the Householder QR and CAQR algorithms.

References

[1]
E. Agullo, C. Coti, J. Dongarra, T. Herault, J. Langou, QR factorization of tall and skinny matrices in a grid computing environment, in: Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1-11.
[2]
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J.D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, PA, USA, 1992.
[3]
M. Anderson, G. Ballard, J. Demmel, K. Keutzer, Communication-avoiding QR decomposition for GPUs, in: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium IPDPS'11, IEEE Computer Society, Washington, DC, USA, 2011, pp. 48-58.
[4]
T. Auckenthaler, T. Huckle, R. Wittmann, A blocked QR-decomposition for the parallel symmetric eigenvalue problem, Parallel Comput., 40 (2014) 186-194.
[5]
G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, H.D. Nguyen, E. Solomonik, Reconstructing Householder Vectors from Tall-Skinny QR, in: IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 1159-1170.
[6]
G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, H.D. Nguyen, E. Solomonik, Reconstructing householder vectors from Tall-Skinny QR, Tech. Rep., EECS Department, University of California, Berkeley, 2013.
[7]
G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Minimizing Communication in Linear Algebra, SIAM J. Matrix Anal. Appl. 32 (3).
[8]
C.H. Bischof, X. Sun, On Orthogonal Block Elimination, Tech. Rep. MCS-P450-0794, Argonne National Laboratory, Argonne, IL, 1994.
[9]
L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R.C. Whaley, ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, USA, 1997.
[10]
E. Chan, M. Heimlich, A. Purkayastha, R. van~de Geijn, Collective communication: theory, practice, and experience, Concurrency Comput. Pract. Exp., 19 (2007) 1749-1783.
[11]
J. Poulson, B. Marker, R.A. van~de Geijn, J.R. Hammond, N.A. Romero, Elemental: A new framework for distributed memory dense matrix computations, ACM Trans. Math. Softw., 39 (2013) 13:1-13:24.
[12]
J. Demmel, L. Grigori, M. Gu, H. Xiang, Communication Avoiding Rank Revealing QR Factorization with Column Pivoting, Tech. Rep. UCB/EECS-2013-46, EECS Department, University of California, Berkeley, 2013.
[13]
J. Demmel, L. Grigori, M. Hoemmen, J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SIAM J. Sci. Comput., 34 (2012) A206-A239.
[14]
J. Demmel, L. Grigori, M.F. Hoemmen, J. Langou, Communication-Optimal Parallel and Sequential QR and LU Factorizations, Tech. Rep. UCB/EECS-2008-89, EECS Department, University of California, Berkeley, 2008.
[15]
J. Dongarra, M. Faverge, T. HéRault, M. Jacquelin, J. Langou, Y. Robert, Hierarchical QR factorization algorithms for multi-core clusters, Parallel Comput., 39 (2013) 212-232.
[16]
Edison, NERSC's Cray XC30 System, http://www.nersc.gov/users/computational-systems/edison/, 2014.
[17]
A. Farley, Broadcast time in communication networks, SIAM J. Appl. Math., 39 (1980) 385-390.
[18]
T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, Y. Yamamoto, CholeskyQR2: A simple and communication-avoiding algorithm for computing a Tall-skinny QR factorization on a large-scale parallel system, in: Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems ScalA'14, IEEE Press, Piscataway, NJ, USA, 2014, pp. 31-38.
[19]
The Future of Computing Performance: Game Over or Next Level?, in: The Future of Computing Performance: Game Over or Next Level?, The National Academies Press, Washington, D.C., 2011.
[20]
G.H. Golub, R.J. Plemmons, A. Sameh, Parallel block schemes for large-scale least-squares computations, in: High-speed Computing: Scientific Applications and Algorithm Design, University of Illinois Press, Champaign, IL, USA, 1988, pp. 171-179.
[21]
N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA, 2002.
[22]
M. Hoemmen, A Communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method, in: Parallel Distributed Processing Symposium, IPDPS, 2011 IEEE International, 2011, pp. 966-977.
[23]
Hopper, NERSC's Cray XE6 System, http://www.nersc.gov/users/computational-systems/hopper/, 2014.
[24]
C. Bischof, C. Van¿Loan, The WY representation for products of Householder matrices, SIAM J. Sci. Stat. Comput. 8 (1).
[25]
G. Golub, C. Van~Loan, Matrix Computations, in: Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, 2012.
[26]
R. Schreiber, C. Van~Loan, A storage-efficient WY representation for products of householder transformations, SIAM J. Sci. Stat. Comput., 10 (1989) 53-57.
[27]
M. Mohiyuddin, M. Hoemmen, J. Demmel, K. Yelick, Minimizing communication in sparse matrix solvers, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis SC'09, ACM, New York, NY, USA, 2009, pp. 36:1-36:12.
[28]
D. Mori, Y. Yamamoto, S.-L. Zhang, Backward error analysis of the AllReduce algorithm for Householder QR decomposition, Japan J. Ind. Appl. Math., 29 (2012) 111-130.
[29]
C. Puglisi, Modification of the householder method based on compact WY representation, SIAM J. Sci. Stat. Comput., 13 (1992) 723-726.
[30]
R. Schreiber, B. Parlett, Block reflectors: Theory and computation, SIAM J. Numer. Anal., 25 (1988) 189-205.
[31]
E. Solomonik, J. Demmel, Communication-optimal 2.5D matrix multiplication and LU factorization algorithms, in: Springer Lecture Notes in Computer Science, Proceedings of Euro-Par, Bordeaux, France, 2011, pp. 90-109.
[32]
F. Song, H. Ltaief, B. Hadri, J. Dongarra, Scalable tile communication-avoiding QR factorization on multicore cluster systems, in: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis SC'10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 1-11.
[33]
X. Sun, C. Bischof, A basis-kernel representation of orthogonal matrices, SIAM J. Matrix Anal. Appl., 16 (1995) 1184-1196.
[34]
R. Thakur, R. Rabenseifner, W. Gropp, Optimization of collective communication operations in MPICH, Int. J. High Perform. Comput. Appl., 19 (2005) 49-66.
[35]
A. Tiskin, Communication-efficient parallel generic pairwise elimination, Future Gener. Comput. Syst., 23 (2007) 179-188.
[36]
J.L. Träff, A. Ripke, Optimal broadcast for fully connected networks, in: Lecture Notes in Computer Science, vol. 3726, Springer, Berlin, Heidelberg, 2005, pp. 45-56.
[37]
Y. Yamamoto, Aggregation of the compact WY representations generated by the TSQR algorithm, in: Conference Talk Presented at SIAM Applied Linear Algebra, 2012.
[38]
Y. Yamamoto, Personal communication, 2012.
[39]
Y. Yamamoto, Y. Nakatsukasa, Y. Yanagisawa, T. Fukaya, Roundoff Error Analysis of the CholeskyQR2 Algorithm, Tech. Rep. 43, Department of Mathematical Informatics, University of Tokyo, 2014.

Cited By

View all
  • (2023)Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL libraryProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624249(1688-1696)Online publication date: 12-Nov-2023
  • (2021)Parallel Tucker Decomposition with Numerically Accurate SVDProceedings of the 50th International Conference on Parallel Processing10.1145/3472456.3472472(1-11)Online publication date: 9-Aug-2021
  • (2020)Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processorsThe Journal of Supercomputing10.1007/s11227-020-03176-376:11(8771-8786)Online publication date: 24-Jan-2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Parallel and Distributed Computing
Journal of Parallel and Distributed Computing  Volume 85, Issue C
November 2015
104 pages

Publisher

Academic Press, Inc.

United States

Publication History

Published: 01 November 2015

Author Tags

  1. Communication-avoiding algorithms
  2. Dense linear algebra
  3. QR decomposition

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL libraryProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624249(1688-1696)Online publication date: 12-Nov-2023
  • (2021)Parallel Tucker Decomposition with Numerically Accurate SVDProceedings of the 50th International Conference on Parallel Processing10.1145/3472456.3472472(1-11)Online publication date: 9-Aug-2021
  • (2020)Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processorsThe Journal of Supercomputing10.1007/s11227-020-03176-376:11(8771-8786)Online publication date: 24-Jan-2020
  • (2019)Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVDNumerical Algorithms10.1007/s11075-018-0500-880:2(635-660)Online publication date: 1-Feb-2019
  • (2019)Cholesky and Gram-Schmidt Orthogonalization for Tall-and-Skinny QR Factorizations on Graphics ProcessorsEuro-Par 2019: Parallel Processing10.1007/978-3-030-29400-7_33(469-480)Online publication date: 26-Aug-2019
  • (2019)Exploring Dual-Triangular Structure for Efficient R-Initiated Tall-Skinny QR on GPGPUAdvances in Knowledge Discovery and Data Mining10.1007/978-3-030-16145-3_45(578-589)Online publication date: 14-Apr-2019
  • (2018)A 3D Parallel Algorithm for QR DecompositionProceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures10.1145/3210377.3210415(55-65)Online publication date: 11-Jul-2018
  • (2018)Reduction to Band Form for the Singular Value Decomposition on Graphics AcceleratorsProceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3178442.3178448(51-60)Online publication date: 24-Feb-2018

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media