research-article

Cache efficient bidiagonalization using BLAS 2.5 operators

Authors:

Gary W. Howell,

James W. Demmel,

Charles T. Fulton,

Sven Hammarling,

Karen MarmolAuthors Info & Claims

ACM Transactions on Mathematical Software (TOMS), Volume 34, Issue 3

Article No.: 14, Pages 1 - 33

https://doi.org/10.1145/1356052.1356055

Published: 16 May 2008 Publication History

Abstract

On cache based computer architectures using current standard algorithms, Householder bidiagonalization requires a significant portion of the execution time for computing matrix singular values and vectors. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix, so that two (_GEMV) vector-matrix multiplications can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS operations approximately cut in half the transfer of data from main memory to cache, reducing execution times by up to 25 per cent. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.

References

[1]

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Greenbaum, A., Hammarling, S., Mckenney, A., and Sorensen, D. 1999. LAPACK User's Guide, 3rd. Ed. SIAM, Philadelphia, PA.

Digital Library

[2]

Barlow, J. L., Bosner, N., and Drmač, Z. 2000. A new stable bidiagonal reduction algorithm. Lin. Alg. Appl. 397, 35--84.

[3]

Berry, M. 1992. Large scale singular value computations. Internat. J. Supercomput. Appl. 6, 13--49.

Digital Library

[4]

Berry, M., Do, T., O'brien, G., Krishna, V., and Varadhan, S. 1993. SVDPACKC: version 1.0 user's guide, Tech. rep. CS-93-194, University of Tennessee, Knoxville, TN.

Digital Library

[5]

Berry, M., Dumais, S., and O'brien, G. 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 4 573--595.

Digital Library

[6]

Bischof, C. H. and Van Loan, C. F. 1987. The WY representaion of products of Householder matrices, SIAM J. Sci. Stat. Comput. 8, s2--s13.

Digital Library

[7]

Blackford, S. and Dongarra, J. 1999. Installation guide for LAPACK, LAPACK Working Note 41.

[8]

Blackford, L. S., Corliss, G., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Hu, C., Kahan, W., Kaufmann, L., Kearfott, B., Frogh, F., Li, X., Maany, Z., Petitet, A., Pozo, R., Remington, K., Walster, W., Whaley, C., Wolff, V., Gudenberg, J., and Lumsdaine, A. 2002. Basic linear algebra subprograms technical (BLAST) forum standard, Int. J. High Perform. Comput. 12, 1--2 (www.netlib.org/blas/blast-forum).

[9]

Blackford, S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufmann, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., and Whaley, C. 2002. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28, 2, 135--151.

Digital Library

[10]

Bosner, N. and Barlow, J. L. 2005. Block and parallel versions of one-sided bidiagonalization. preprint http://www.cse.psu.edu/barlow/block_bidiag.pdf.

[11]

Choi, J., Dongarra, J., and Walker, D. 1995. The design of a parallel dense linear algebra software library: reduction to Hessenberg, tridiagonal, and bidiagonal form. (LAPACK Working Note &num; 92) Num. Alg., 10, 379--399.

[12]

Dhillon, I. S. 1997. A New O(n²) Algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. PhD thesis, University of California, Berkeley, CA.

Digital Library

[13]

Dongarra, J., Hammarling, S., and Sorensen, D. 1989. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. 27, 215--227.

[14]

Dongarra, J., Duff, I., Sorensen, D., and Van Der Vorst, H. 1988. Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia, PA.

Digital Library

[15]

Douglas, C. C., Haase, G., Hu, J., Kowarschik, M., Rüde, U., and Weiss, C. 2000. Portable memory heirarchy techniques for PDE solvers: Part I. SIAM News, 33, 5.

[16]

Fernando, V., Parlett, B., and Dhillon, I. 1995. A way to find the most redundant equation in a tridiagonal system. Mathematics Department, University of California, Berkeley.

[17]

Goedecker, S. and Hoise, A. 2001. Performance Optimization for Numerically Intensive Codes, SIAM, Philadelphia, PA.

Digital Library

[18]

Golub, G. and Kahan, W. 1965. Calculating the singular Values and pseudo-inverse of a matrix, SIAM J. Num. Anal., 2, 205--224.

[19]

Golub, G. and Reinsch, C. 1970. Singular value decomposition and least squares solution. Numer. Math. 14, 403--420.

Digital Library

[20]

Golub, G. and Van Loan, C. F. 1996. Matrix Computations 3rd Ed. Johns Hopkins University Press, Baltimore, MA.

Digital Library

[21]

Grösser, B. and Lang, B. 1988. Efficient parallel reduction to bidiagonal form, Preprint BUGHW-SC 98/2. http://www.math.uni-wuppertal/org/SciComp/Preprint/SC9802ips.gz.

[22]

Howell, G. W. 2001. Sparse Householder bidiagonalization. CERFACS Sparse Days. http://ncsu. edu/itd/hpc/Documents/Publications/gary_howell/cerfacs01.ps

[23]

Lang, B. 1996. Parallel reduction of banded matrices to bidiagonal form. Parall. Comput., 22, 1--18.

Digital Library

[24]

Owens, B. 2003. A Matlab script for 2-blocking to speed Ralha-Barlow one-sided bidiagonalization. Summer intern project at ERDC, MSRC, Vicksburg, MS. http://ncsu.edu/itd/hpc/Documents/Publications/gary_howell/barlow3.m.

[25]

Paige, C. and Saunders, M. 1982. An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw. 8, 1, 43--71.

Digital Library

[26]

Parlett, B. and Dhillon, I. 1997. Fernando's solution to Wilkinson's problem: An application of double factorization. Lin. Alg. Appl. 267, 247--279.

[27]

Ralha, R. M. S. 2003. One-sided reduction to bidiagonal form. Lin. Alg. Appl. 358, 219--238.

[28]

Schreiber, R. and Van Loan, C. F. 1989. A storage-efficient WY representation for products of householder transformations. SIAM Sci. Stat. Comp. 10, 53--57.

Digital Library

[29]

Stanley, K. 1997. Execution time of symmetric eigensolvers. Ph.D. dissertation, University of California, Berkeley, CA.

Digital Library

[30]

Whaley, C. and Dongarra, J. 1999. Automatically tuned linear algebra in software. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing.

Digital Library

Cited By

Kudo SImamura T(2019)Cache-efficient implementation and batching of tridiagonalization on manycore CPUsProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3293320.3293329(71-80)Online publication date: 14-Jan-2019
https://dl.acm.org/doi/10.1145/3293320.3293329
Rodriguez-Gutiez EMoreton-Fernandez AGonzalez-Escribano ALlanos D(2019)Toward a BLAS library truly portable across different accelerator typesThe Journal of Supercomputing10.1007/s11227-019-02925-3Online publication date: 10-Jun-2019
https://doi.org/10.1007/s11227-019-02925-3
Dongarra JGates MHaidar AKurzak JLuszczek PTomov SYamazaki I(2018)The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme ScaleSIAM Review10.1137/17M111773260:4(808-865)Online publication date: 8-Nov-2018
https://doi.org/10.1137/17M1117732
Show More Cited By

Index Terms

Cache efficient bidiagonalization using BLAS 2.5 operators
1. Computing methodologies
  1. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
  2. Mathematical software

Recommendations

A Refined Harmonic Lanczos Bidiagonalization Method and an Implicitly Restarted Algorithm for Computing the Smallest Singular Triplets of Large Matrices

The harmonic Lanczos bidiagonalization method can be used to compute the smallest singular triplets of a large matrix $A$. We prove that for good enough projection subspaces harmonic Ritz values converge if the columns of $A$ are strongly linearly ...
A bidiagonalization-based numerical algorithm for computing the inverses of (p,q)-tridiagonal matrices
Abstract
As a generalization of k-tridiagonal matrices, many variations of (p,q)-tridiagonal matrices have attracted much attention over the years. In this paper, we present an efficient algorithm for numerically computing the inverses of n-square (p,q)-...
Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software

ACM Transactions on Mathematical Software Volume 34, Issue 3

May 2008

130 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/1356052

Issue’s Table of Contents

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 May 2008

Accepted: 01 April 2007

Revised: 01 January 2007

Received: 01 April 2006

Published in TOMS Volume 34, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Pre-selected

Funding Sources

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

20
Total Citations
View Citations
387
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)1

Reflects downloads up to 18 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Kudo SImamura T(2019)Cache-efficient implementation and batching of tridiagonalization on manycore CPUsProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3293320.3293329(71-80)Online publication date: 14-Jan-2019
https://dl.acm.org/doi/10.1145/3293320.3293329
Rodriguez-Gutiez EMoreton-Fernandez AGonzalez-Escribano ALlanos D(2019)Toward a BLAS library truly portable across different accelerator typesThe Journal of Supercomputing10.1007/s11227-019-02925-3Online publication date: 10-Jun-2019
https://doi.org/10.1007/s11227-019-02925-3
Dongarra JGates MHaidar AKurzak JLuszczek PTomov SYamazaki I(2018)The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme ScaleSIAM Review10.1137/17M111773260:4(808-865)Online publication date: 8-Nov-2018
https://doi.org/10.1137/17M1117732
Pichon GDarve EFaverge MRamet PRoman J(2018)Sparse supernodal solver using block low-rank compression: Design, performance and analysisJournal of Computational Science10.1016/j.jocs.2018.06.00727(255-270)Online publication date: Jul-2018
https://doi.org/10.1016/j.jocs.2018.06.007
Van Zee Fvan de Geijn R(2015)BLIS: A Framework for Rapidly Instantiating BLAS FunctionalityACM Transactions on Mathematical Software10.1145/276445441:3(1-33)Online publication date: 1-Jun-2015
https://dl.acm.org/doi/10.1145/2764454
Ballard GDemmel JKnight N(2015)Avoiding Communication in Successive Band ReductionACM Transactions on Parallel Computing10.1145/26868771:2(1-37)Online publication date: 18-Feb-2015
https://dl.acm.org/doi/10.1145/2686877
Nelson TBelter GSiek JJessup ENorris B(2015)Reliable Generation of High-Performance Matrix AlgebraACM Transactions on Mathematical Software10.1145/262969841:3(1-27)Online publication date: 1-Jun-2015
https://dl.acm.org/doi/10.1145/2629698
Röhrig-Zöllner MThies JKreutzer MAlvermann APieper ABasermann AHager GWellein GFehske H(2015)Increasing the Performance of the Jacobi--Davidson Method by BlockingSIAM Journal on Scientific Computing10.1137/14097601737:6(C697-C722)Online publication date: Jan-2015
https://doi.org/10.1137/140976017
Filipovič JMadzin MFousek JMatyska L(2015)Optimizing CUDA code by kernel fusionThe Journal of Supercomputing10.1007/s11227-015-1483-z71:10(3934-3957)Online publication date: 1-Oct-2015
https://dl.acm.org/doi/10.1007/s11227-015-1483-z
Van Zee Fvan de Geijn RQuintana-Ortí G(2014)Restructuring the Tridiagonal and Bidiagonal QR Algorithms for PerformanceACM Transactions on Mathematical Software10.1145/253537140:3(1-34)Online publication date: 1-Apr-2014
https://dl.acm.org/doi/10.1145/2535371
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents