Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Cache efficient bidiagonalization using BLAS 2.5 operators

Published: 16 May 2008 Publication History

Abstract

On cache based computer architectures using current standard algorithms, Householder bidiagonalization requires a significant portion of the execution time for computing matrix singular values and vectors. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix, so that two (_GEMV) vector-matrix multiplications can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS operations approximately cut in half the transfer of data from main memory to cache, reducing execution times by up to 25 per cent. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.

References

[1]
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Greenbaum, A., Hammarling, S., Mckenney, A., and Sorensen, D. 1999. LAPACK User's Guide, 3rd. Ed. SIAM, Philadelphia, PA.
[2]
Barlow, J. L., Bosner, N., and Drmač, Z. 2000. A new stable bidiagonal reduction algorithm. Lin. Alg. Appl. 397, 35--84.
[3]
Berry, M. 1992. Large scale singular value computations. Internat. J. Supercomput. Appl. 6, 13--49.
[4]
Berry, M., Do, T., O'brien, G., Krishna, V., and Varadhan, S. 1993. SVDPACKC: version 1.0 user's guide, Tech. rep. CS-93-194, University of Tennessee, Knoxville, TN.
[5]
Berry, M., Dumais, S., and O'brien, G. 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 4 573--595.
[6]
Bischof, C. H. and Van Loan, C. F. 1987. The WY representaion of products of Householder matrices, SIAM J. Sci. Stat. Comput. 8, s2--s13.
[7]
Blackford, S. and Dongarra, J. 1999. Installation guide for LAPACK, LAPACK Working Note 41.
[8]
Blackford, L. S., Corliss, G., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Hu, C., Kahan, W., Kaufmann, L., Kearfott, B., Frogh, F., Li, X., Maany, Z., Petitet, A., Pozo, R., Remington, K., Walster, W., Whaley, C., Wolff, V., Gudenberg, J., and Lumsdaine, A. 2002. Basic linear algebra subprograms technical (BLAST) forum standard, Int. J. High Perform. Comput. 12, 1--2 (www.netlib.org/blas/blast-forum).
[9]
Blackford, S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufmann, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., and Whaley, C. 2002. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28, 2, 135--151.
[10]
Bosner, N. and Barlow, J. L. 2005. Block and parallel versions of one-sided bidiagonalization. preprint http://www.cse.psu.edu/barlow/block_bidiag.pdf.
[11]
Choi, J., Dongarra, J., and Walker, D. 1995. The design of a parallel dense linear algebra software library: reduction to Hessenberg, tridiagonal, and bidiagonal form. (LAPACK Working Note # 92) Num. Alg., 10, 379--399.
[12]
Dhillon, I. S. 1997. A New O(n2) Algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. PhD thesis, University of California, Berkeley, CA.
[13]
Dongarra, J., Hammarling, S., and Sorensen, D. 1989. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. 27, 215--227.
[14]
Dongarra, J., Duff, I., Sorensen, D., and Van Der Vorst, H. 1988. Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia, PA.
[15]
Douglas, C. C., Haase, G., Hu, J., Kowarschik, M., Rüde, U., and Weiss, C. 2000. Portable memory heirarchy techniques for PDE solvers: Part I. SIAM News, 33, 5.
[16]
Fernando, V., Parlett, B., and Dhillon, I. 1995. A way to find the most redundant equation in a tridiagonal system. Mathematics Department, University of California, Berkeley.
[17]
Goedecker, S. and Hoise, A. 2001. Performance Optimization for Numerically Intensive Codes, SIAM, Philadelphia, PA.
[18]
Golub, G. and Kahan, W. 1965. Calculating the singular Values and pseudo-inverse of a matrix, SIAM J. Num. Anal., 2, 205--224.
[19]
Golub, G. and Reinsch, C. 1970. Singular value decomposition and least squares solution. Numer. Math. 14, 403--420.
[20]
Golub, G. and Van Loan, C. F. 1996. Matrix Computations 3rd Ed. Johns Hopkins University Press, Baltimore, MA.
[21]
Grösser, B. and Lang, B. 1988. Efficient parallel reduction to bidiagonal form, Preprint BUGHW-SC 98/2. http://www.math.uni-wuppertal/org/SciComp/Preprint/SC9802ips.gz.
[22]
Howell, G. W. 2001. Sparse Householder bidiagonalization. CERFACS Sparse Days. http://ncsu. edu/itd/hpc/Documents/Publications/gary_howell/cerfacs01.ps
[23]
Lang, B. 1996. Parallel reduction of banded matrices to bidiagonal form. Parall. Comput., 22, 1--18.
[24]
Owens, B. 2003. A Matlab script for 2-blocking to speed Ralha-Barlow one-sided bidiagonalization. Summer intern project at ERDC, MSRC, Vicksburg, MS. http://ncsu.edu/itd/hpc/Documents/Publications/gary_howell/barlow3.m.
[25]
Paige, C. and Saunders, M. 1982. An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw. 8, 1, 43--71.
[26]
Parlett, B. and Dhillon, I. 1997. Fernando's solution to Wilkinson's problem: An application of double factorization. Lin. Alg. Appl. 267, 247--279.
[27]
Ralha, R. M. S. 2003. One-sided reduction to bidiagonal form. Lin. Alg. Appl. 358, 219--238.
[28]
Schreiber, R. and Van Loan, C. F. 1989. A storage-efficient WY representation for products of householder transformations. SIAM Sci. Stat. Comp. 10, 53--57.
[29]
Stanley, K. 1997. Execution time of symmetric eigensolvers. Ph.D. dissertation, University of California, Berkeley, CA.
[30]
Whaley, C. and Dongarra, J. 1999. Automatically tuned linear algebra in software. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing.

Cited By

View all
  • (2019)Cache-efficient implementation and batching of tridiagonalization on manycore CPUsProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3293320.3293329(71-80)Online publication date: 14-Jan-2019
  • (2019)Toward a BLAS library truly portable across different accelerator typesThe Journal of Supercomputing10.1007/s11227-019-02925-3Online publication date: 10-Jun-2019
  • (2018)The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme ScaleSIAM Review10.1137/17M111773260:4(808-865)Online publication date: 8-Nov-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software
ACM Transactions on Mathematical Software  Volume 34, Issue 3
May 2008
130 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/1356052
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 May 2008
Accepted: 01 April 2007
Revised: 01 January 2007
Received: 01 April 2006
Published in TOMS Volume 34, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. BLAS 2.5
  2. Householder reflections
  3. SVD
  4. bidiagonalization
  5. cache-efficient
  6. matrix factorization
  7. singular values

Qualifiers

  • Research-article
  • Research
  • Pre-selected

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)12
  • Downloads (Last 6 weeks)1
Reflects downloads up to 18 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2019)Cache-efficient implementation and batching of tridiagonalization on manycore CPUsProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3293320.3293329(71-80)Online publication date: 14-Jan-2019
  • (2019)Toward a BLAS library truly portable across different accelerator typesThe Journal of Supercomputing10.1007/s11227-019-02925-3Online publication date: 10-Jun-2019
  • (2018)The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme ScaleSIAM Review10.1137/17M111773260:4(808-865)Online publication date: 8-Nov-2018
  • (2018)Sparse supernodal solver using block low-rank compression: Design, performance and analysisJournal of Computational Science10.1016/j.jocs.2018.06.00727(255-270)Online publication date: Jul-2018
  • (2015)BLIS: A Framework for Rapidly Instantiating BLAS FunctionalityACM Transactions on Mathematical Software10.1145/276445441:3(1-33)Online publication date: 1-Jun-2015
  • (2015)Avoiding Communication in Successive Band ReductionACM Transactions on Parallel Computing10.1145/26868771:2(1-37)Online publication date: 18-Feb-2015
  • (2015)Reliable Generation of High-Performance Matrix AlgebraACM Transactions on Mathematical Software10.1145/262969841:3(1-27)Online publication date: 1-Jun-2015
  • (2015)Increasing the Performance of the Jacobi--Davidson Method by BlockingSIAM Journal on Scientific Computing10.1137/14097601737:6(C697-C722)Online publication date: Jan-2015
  • (2015)Optimizing CUDA code by kernel fusionThe Journal of Supercomputing10.1007/s11227-015-1483-z71:10(3934-3957)Online publication date: 1-Oct-2015
  • (2014)Restructuring the Tridiagonal and Bidiagonal QR Algorithms for PerformanceACM Transactions on Mathematical Software10.1145/253537140:3(1-34)Online publication date: 1-Apr-2014
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media