Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

Published: 01 April 2014 Publication History

Abstract

We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular vector matrix. This algorithm is then implemented with optimizations that: (1) leverage vector instruction units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of memory operations. We demonstrate the merits of these new QR algorithms for computing the Hermitian eigenvalue decomposition (EVD) and singular value decomposition (SVD) of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to traditional QR algorithms for these problems and is competitive with two commonly used alternatives---Cuppen’s Divide-and-Conquer algorithm and the method of Multiple Relatively Robust Representations---while inheriting the more modest O(n) workspace requirements of the original QR algorithms. Since the computations performed by the restructured algorithms remain essentially identical to those performed by the original methods, robust numerical properties are preserved.

Supplementary Material

PDF File (a18-van_zee_appendix.pdf)
The proof is given in an electronic appendix, available online in the ACM Digital Library.

References

[1]
Anderson, E., Bai, Z., Bischof, C., Blackford, L. S., Demmel, J., Dongarra, J. J., Croz, J. D., Hammarling, S., Greenbaum, A., McKenney, A., and Sorensen, D. 1999. LAPACK Users’ Guide 3rd Ed. SIAM, Philadelphia, PA.
[2]
Bientinesi, P., Igual, F. D., Kressner, D., Petschow, M., and Quintana-Ortí, E. S. 2011. Condensed forms for the symmetric eigenvalue problem on multi-threaded architectures. Concurr. Comput. Pract. Exper. 23, 7, 694--707.
[3]
Bischof, C. and Van Loan, C. 1987. The WY representation for products of householder matrices. SIAM J. Sci. Stat. Comput. 8, 1, 2--13.
[4]
Bischof, C., Lang, B., and Sun, X. 1994. Parallel tridiagonalization through two-step band reduction. In Proceedings of the Scalable High-Performance Computing Conference. IEEE Computer Society Press, 23--27.
[5]
Blast. 2002. Basic linear algebra subprograms technical forum standard. Int. J. High Perf. Appl. Supercomput. 16, 1.
[6]
Braman, K., Byers, R., and Mathias, R. 2001a. The multishift QR algorithm. Part I: Maintaining well focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl. 23, 929--947.
[7]
Braman, K., Byers, R., and Mathias, R. 2001b. The multishift QR algorithm. Part II: Aggressive early deflation. SIAM J. Matrix Anal. Appl. 23, 948--973.
[8]
Cline, A. K. and Dhillon, I. S. 2006. Computation of the singular value decomposition. In Handbook of Linear Algebra, CRC Press, Boca Raton, FL, 45--1--45--13.
[9]
Cuppen, J. J. M. 1980. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numerische Mathematik 36, 177--195.
[10]
Demmel, J. and Kahan, W. 1990. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Statist. Comput. 11, 873--912.
[11]
Demmel, J. and Veselić, K. 1992. Jacobi’s method is more accurate than QR. SIAM J. Matrix Anal. Appl. 13, 4, 1204--1245.
[12]
Demmel, J., Dhillon, I., and Ren, H. 1995. On the correctness of some bisection-like parallel eigenvalue algorithms in floating point arithmetic. Electron. Trans. Num. Anal. 3, 116--149.
[13]
Demmel, J., Marques, O., Parlett, B., and Vömel, C. 2008. Performance and accuracy of LAPACK’s symmetric tridiagonal eigensolvers. SIAM J. Sci. Comput. 30, 3, 1508--1526.
[14]
Dhillon, I. S. 1997. A new O(n2) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. Ph.D. thesis, EECS Department, University of California, Berkeley.
[15]
Dhillon, I. S. 1998. Current inverse iteration software can fail. BIT Numer. Math. 38, 685--704.
[16]
Dhillon, I. S. and Parlett, B. N. 2003. Orthogonal eigenvectors and relative gaps. SIAM J. Matrix Anal. Appl. 25, 3, 858--899.
[17]
Dhillon, I. S. and Parlett, B. N. 2004. Multiple representations to compute orthogonal eigenvectors of symmetric tridiagonal matrices. Linear Algor. Appl. 387, 1--28.
[18]
Dhillon, I. S., Parlett, B. N., and Vomel, C. 2006. The design and implementation of the MRRR algorithm. ACM Trans. Math. Softw. 32, 533--560.
[19]
Dongarra, J. J., Du Croz, J., Hammarling, S., and Hanson, R. J. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Softw. 14, 1, 1--17.
[20]
Drmač, Z. 2009. A global convergence proof for cyclic Jacobi methods with block rotations. SIAM J. Matrix Anal. Appl. 31, 3, 1329--1350.
[21]
Drmač, Z. and Veselić, K. 2008a. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29, 4, 1322--1342.
[22]
Drmač, Z. and Veselić, K. 2008b. New fast and accurate Jacobi SVD algorithm. II. SIAM J. Matrix Anal. Appl. 29, 4, 1343--1362.
[23]
Fernando, K. V. and Parlett, B. N. 1994. Accurate singular values and differential qd algorithms. Numer. Math. 67, 191--229.
[24]
Golub, G. H. and Loan, C. F. V. 1996. Matrix Computations 3rd Ed. The Johns Hopkins University Press, Baltimore, MD.
[25]
Goto, K. and van de Geijn, R. A. 2008. Anatomy of a high-performance matrix multiplication. ACM Trans. Math. Soft. 34, 3.
[26]
Gu, M. and Eisenstat, S. C. 1995a. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl. 16, 1, 79--92.
[27]
Gu, M. and Eisenstat, S. C. 1995b. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM J. Matrix Anal. Appl. 16, 1, 172--191.
[28]
Haidar, A., Ltaief, H., and Dongarra, J. 2011. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. LAPACK Working Note 254. http://www.netlib.org/lapack/lawnspdf/lawn254.pdf.
[29]
Howell, G. W., Demmel, J. W., Fulton, C. T., Hammarling, S., and Marmol, K. 2008. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans. Math. Softw. 34, 3, 14:1--14:33.
[30]
Jacobi, C. 1846. Über ein leichtes verfahren, die in der theorie der säkularstörungen vorkommenden gleichungen numerisch aufzulösen. Crelle’s J. 30, 51--94.
[31]
Jiang, E. and Zhang, Z. 1985. A new shift of the QL algorithm for irreducible symmetric tridiagonal matrices. Linear Algebra Appl. 65, 0, 261--272.
[32]
Joffrain, T., Low, T. M., Quintana-Ortí, E. S., van de Geijn, R., and Van Zee, F. 2006. Accumulating householder transformations. ACM Trans. Math. Softw. 32, 2, 169--179.
[33]
Kågström, B., Kressner, D., Quintana-Ortí, E. S., and Quintana-Ortí, G. 2008. Blocked algorithms for the reduction to Hessenberg-triangular form revisited. BIT Numer. Math. 48, 3, 563--584.
[34]
Lang, B. 1998. Using level 3 BLAS in rotation-based algorithms. SIAM J. Sci. Comput. 19, 2, 626--634.
[35]
LAPACK. 2011. http://www.netlib.org/lapack/.
[36]
Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T. 1979. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft. 5, 3, 308--323.
[37]
libflame. 2011. http://www.cs.utexas.edu/users/flame/libflame/.
[38]
Luszczek, P., Ltaief, H., and Dongarra, J. 2011. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. LAPACK Working Note 244. http://www.netlib.org/lapack/lawnspdf/lawn244.pdf.
[39]
Nakatsukasa, Y., Aishima, K., and Yamazaki, I. 2012. dqds with aggressive early deflation. SIAM J. Matrix Anal. Appl. 33, 1, 22--51.
[40]
OpenBLAS. 2011. https://github.com/xianyi/OpenBLAS/.
[41]
Parlett, B. N. 1980. The Symmetric Eigenvalue Problem. Prentice-Hall.
[42]
Parlett, B. N. and Marques, O. A. 1999. An implementation of the dqds algorithm (positive case). Linear Algebra Appl. 309, 217--259.
[43]
Quintana-Ortí, G. and Vidal, A. M. 1992. Parallel algorithms for QR factorization on shared memory multiprocessors. In Proceedings of the European Workshop on Parallel Computing’92, Parallel Computing: From Theory to Sound Practice. IOS Press, 72--75.
[44]
Rajamanickam, S. 2009. Efficient algorithms for sparse singular value decomposition. Ph.D. thesis, University of Florida.
[45]
Rutter, J. D. 1994. A serial implementation of Cuppen’s divide and conquer algorithm for the symmetric eigenvalue problem. Tech. rep. CSD-94-799, Computer Science Division, University of California at Berkeley.
[46]
Schreiber, R. and van Loan, C. 1989. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Statist. Comput. 10, 1, 53--57.
[47]
Smith, B. T., Boyle, J. M., Dongarra, J. J., Garbow, B. S., Ikebe, Y., et al. 1976. Matrix Eigensystem Routines -- EISPACK Guide 2nd Ed. Lecture Notes in Computer Science, vol. 6, Springer.
[48]
Stewart, G. W. 2001. Matrix Algorithms Volume II: Eigensystems. SIAM, Philadelphia, PA.
[49]
Van Zee, F. G. 2011. Libflame: The Complete Reference. http://www.lulu.com/.
[50]
Van Zee, F. G., van de Geijn, R., and Quintana-Ortí, G. 2011. Restructuring the QR algorithm for high-performance application of Givens rotations. Tech. rep. TR-11-36, FLAME Working Note 60. Department of Computer Science, The University of Texas at Austin.
[51]
Van Zee, F. G., van de Geijn, R. A., Quintana-Ortí, G., and Elizondo, G. J. 2012. Families of algorithms for reducing a matrix to condensed form. ACM Trans. Math. Soft. 39, 1.
[52]
Watkins, D. S. 1982. Understanding the QR algorithm. SIAM Rev. 24, 4, 427--440.
[53]
Wilkinson, J. H. 1968. Global convergence of tridiagonal QR algorithm with origin shifts. Linear Algebra Appl. 1, 409--420.
[54]
Willems, P. R. and Lang, B. 2011. A framework for the MR3 algorithm: Theory and implementation. Tech. rep., Bergische Universität Wuppertal. http://www-ai.math.uni-wuppertal.de/SciComp/preprints/SC1102.pdf.
[55]
Willems, P. R., Lang, B., and Vömel, C. 2006. Computing the bidiagonal SVD using multiple relatively robust representations. SIAM J. Matrix Anal. Appl. 28, 4, 907--926.

Cited By

View all
  • (2023)ECBC: Efficient Convolution via Blocked ColumnizingIEEE Transactions on Neural Networks and Learning Systems10.1109/TNNLS.2021.309527634:1(433-445)Online publication date: Jan-2023
  • (2023)FPGA-Based Hardware Accelerator for Matrix InversionSN Computer Science10.1007/s42979-022-01542-x4:2Online publication date: 9-Jan-2023
  • (2023)Efficient algorithms for computing rank‐revealing factorizations on a GPUNumerical Linear Algebra with Applications10.1002/nla.251530:6Online publication date: 2-Jun-2023
  • Show More Cited By

Index Terms

  1. Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Mathematical Software
    ACM Transactions on Mathematical Software  Volume 40, Issue 3
    April 2014
    152 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/2610268
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 April 2014
    Accepted: 01 October 2013
    Revised: 01 March 2013
    Received: 01 October 2011
    Published in TOMS Volume 40, Issue 3

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. EVD
    2. Eigenvalues
    3. Givens rotations
    4. QR algorithm
    5. SVD
    6. bidiagonal
    7. high performance
    8. libraries
    9. linear algebra
    10. singular values
    11. tridiagonal

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)29
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 19 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)ECBC: Efficient Convolution via Blocked ColumnizingIEEE Transactions on Neural Networks and Learning Systems10.1109/TNNLS.2021.309527634:1(433-445)Online publication date: Jan-2023
    • (2023)FPGA-Based Hardware Accelerator for Matrix InversionSN Computer Science10.1007/s42979-022-01542-x4:2Online publication date: 9-Jan-2023
    • (2023)Efficient algorithms for computing rank‐revealing factorizations on a GPUNumerical Linear Algebra with Applications10.1002/nla.251530:6Online publication date: 2-Jun-2023
    • (2022)Optimization of Matrix-Matrix Multiplication Algorithm for Matrix-Panel Multiplication on Intel KNL2022 IEEE/ACS 19th International Conference on Computer Systems and Applications (AICCSA)10.1109/AICCSA56895.2022.10017947(1-7)Online publication date: Dec-2022
    • (2022)Efficient algorithm for simultaneous reduction to the $$m$$-Hessenberg-triangular-triangular formBIT10.1007/s10543-014-0516-y55:3(677-703)Online publication date: 11-Mar-2022
    • (2020)qLD: High-performance Computation of Linkage Disequilibrium on CPU and GPU2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE)10.1109/BIBE50027.2020.00019(65-72)Online publication date: Oct-2020
    • (2019)Evaluation of Open-Source Linear Algebra Libraries in Embedded Applications2019 8th Mediterranean Conference on Embedded Computing (MECO)10.1109/MECO.2019.8760041(1-6)Online publication date: Jun-2019
    • (2018)The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme ScaleSIAM Review10.1137/17M111773260:4(808-865)Online publication date: 8-Nov-2018
    • (2018)Research on distributed heterogeneous data PCA algorithm based on cloud platform10.1063/1.5038988(020016)Online publication date: 2018
    • (2016)Duo: A general program for calculating spectra of diatomic moleculesComputer Physics Communications10.1016/j.cpc.2015.12.021202(262-275)Online publication date: May-2016
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media