A recursive formulation of Cholesky factorization of a matrix in packed storage

Published: 01 June 2001

Abstract

A new compact way to store a symmetric or triangular matrix, called RPF for Recursive Packed Format, is fully described. Novel ways to transform RPF to and from standard packed format are included. A new algorithm, called RPC for Recursive Packed Cholesky, that operates on the RPF format is presented. Algorithm RPC is based on level-3 BLAS and requires variants of the algorithms TRSM and SYRK that work on RPF. We call these RP_TRSM and RP_SYRK and find that they do most of their work by calling GEMM. It follows that most of the execution time of RPC lies in GEMM. The advantage of this storage scheme compared to traditional packed and full storage is demonstrated. First, the RPF storage format uses the minimal amount of storage for the symmetric or triangular matrix. Second, RPC gives a level-3 implementation of Cholesky factorization, whereas standard packed implementations are only level-2. Hence, the performance of our RPC implementation is decidedly superior. Third, unlike fixed-block-size algorithms, RPC requires no block-size tuning parameter. We present performance measurements on several current architectures that demonstrate improvements over the traditional packed routines. SMP parallel computations on the IBM SMP computer are also reported. The graphs in Section 7 show that the RPC algorithms are superior to the traditional packed algorithms by a factor between 1.6 and 7.4 for order around 1000, and between 1.9 and 10.3 for order around 3000. For some architectures, the RPC performance results are almost the same as, or even better than, those of the traditional full-storage algorithms.
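At its core, RPC splits the triangle, factors the leading part, and turns the remaining work into TRSM, SYRK, and GEMM calls. The following is a minimal NumPy sketch of that recursion, written on full storage for readability rather than on RPF itself; the function name is illustrative, not the paper's Fortran API, whose RP_TRSM and RP_SYRK operate directly on the recursive packed layout.

```python
import numpy as np

def recursive_cholesky(A):
    # Lower-triangular Cholesky factor of a symmetric positive-definite A.
    # Illustrates the RPC splitting on full storage, not on RPF itself.
    n = A.shape[0]
    if n == 1:
        return np.array([[np.sqrt(A[0, 0])]])
    p = n // 2  # split A into [[A11, *], [A21, A22]] with A11 of order p
    L = np.zeros((n, n))
    L[:p, :p] = recursive_cholesky(A[:p, :p])           # factor leading triangle
    # TRSM step: solve L21 @ L11.T = A21 (the role played by RP_TRSM)
    L[p:, :p] = np.linalg.solve(L[:p, :p], A[p:, :p].T).T
    # SYRK step: rank-p update of the trailing triangle (the role of RP_SYRK);
    # for large p this update is where the GEMM-rich work concentrates
    A22 = A[p:, p:] - L[p:, :p] @ L[p:, :p].T
    L[p:, p:] = recursive_cholesky(A22)                 # factor trailing triangle
    return L
```

For a sanity check, L = recursive_cholesky(M) should satisfy np.allclose(L @ L.T, M) for a symmetric positive-definite M, agreeing with numpy.linalg.cholesky. Because the split point is always n // 2, there is no block-size parameter to tune, which is the property the abstract emphasizes.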




Reviews

Maurice W. Benson

Due to symmetry in Cholesky factorization, only the lower (or upper) triangular part of the matrix need be stored. With standard packed storage, matrix elements in the triangular part are ordered by columns (or rows) and stored in a one-dimensional array. This presents efficiency problems, since the memory hierarchy (CPU, registers, cache, and so on) is better used when the algorithm exhibits locality of reference (blocking). These considerations motivate a new packed storage arrangement produced by a recursive approach: at the coarsest level, the triangular part of the matrix is partitioned into an approximately square part and two smaller triangular parts. This process is continued recursively for the two smaller triangular parts, with the final new packing taking the same total space as standard packing. The recursive Cholesky algorithm using this new packed storage achieves superior performance by utilizing level-3 BLAS (Basic Linear Algebra Subprograms). Using well-constructed figures, the paper provides a clear description of the algorithms. A significant part of the presentation involves tests on seven machines over a range of matrix sizes. Standard packing, the new formulation, and full-matrix (no packing) algorithms are compared. The new approach is shown to be significantly more effective than the one using standard packing. This work should be of interest to researchers in computational linear algebra, as well as to those developing high-performance software libraries.

Online Computing Reviews Service
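To make the layout the review describes concrete: for a lower triangle of order n split at p = n // 2, RPF stores the leading p-by-p triangle (itself recursively packed), then the (n-p)-by-p rectangular block in column-major order, and finally the trailing triangle (again recursively packed). Below is a minimal Python sketch of this packing; it assumes a lower-triangular, column-major convention, and the function name is illustrative rather than the paper's Fortran interface.

```python
import numpy as np

def rpf_from_lower(A):
    # Pack the lower triangle of a square matrix A into Recursive Packed
    # Format: leading triangle T1 (recursive), off-diagonal block S
    # (column by column), trailing triangle T2 (recursive).
    n = A.shape[0]
    if n == 0:
        return []
    if n == 1:
        return [A[0, 0]]
    p = n // 2
    out = rpf_from_lower(A[:p, :p])                          # T1
    out += [A[i, j] for j in range(p) for i in range(p, n)]  # S, by columns
    out += rpf_from_lower(A[p:, p:])                         # T2
    return out

# RPF uses exactly n(n+1)/2 elements, the same total space as standard packing.
A = np.arange(36.0).reshape(6, 6)
assert len(rpf_from_lower(A)) == 6 * 7 // 2
```

Since each rectangular block S is stored contiguously, a level-3 kernel can treat it as an ordinary column-major matrix and hand it straight to GEMM, which is what makes the format friendly to blocked, cache-efficient computation.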



Published In

ACM Transactions on Mathematical Software, Volume 27, Issue 2
June 2001, 154 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/383738
Editor: Ron Boisvert

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2001
Published in TOMS Volume 27, Issue 2


Author Tags

  1. BLAS
  2. Cholesky factorization and solution
  3. complex Hermitian matrices
  4. novel packed matrix data structures
  5. positive definite matrices
  6. real symmetric matrices
  7. recursive algorithms

Qualifiers

  • Article


Cited By

  • (2021) Enhancing parallelism of distributed algorithms with the actor model and a smart data movement technique. International Journal of Parallel, Emergent and Distributed Systems, pp. 1-14. DOI: 10.1080/17445760.2021.1971665. Online publication date: 31-Aug-2021.
  • (2019) Least squares solvers for distributed-memory machines with GPU accelerators. Proceedings of the ACM International Conference on Supercomputing, pp. 117-126. DOI: 10.1145/3330345.3330356. Online publication date: 26-Jun-2019.
  • (2019) An Efficient Solution to Structured Optimization Problems using Recursive Matrices. Computer Graphics Forum 38(8), pp. 33-39. DOI: 10.1111/cgf.13758. Online publication date: 14-Nov-2019.
  • (2019) Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators. Euro-Par 2019: Parallel Processing, pp. 495-506. DOI: 10.1007/978-3-030-29400-7_35. Online publication date: 26-Aug-2019.
  • (2017) Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts. ACM Transactions on Architecture and Code Optimization 14(4), pp. 1-19. DOI: 10.1145/3162016. Online publication date: 18-Dec-2017.
  • (2017) Algorithm 979. ACM Transactions on Mathematical Software 44(2), pp. 1-19. DOI: 10.1145/3061664. Online publication date: 18-Sep-2017.
  • (2016) Auto-tuning TRSM with an asynchronous task assignment model on multicore, multi-GPU and coprocessor systems. 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), pp. 1-8. DOI: 10.1109/AICCSA.2016.7945637. Online publication date: Nov-2016.
  • (2014) Direct Methods for Linear Systems. Numerical Methods in Matrix Computations, pp. 1-209. DOI: 10.1007/978-3-319-05089-8_1. Online publication date: 7-Oct-2014.
  • (2013) Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms. ACM Transactions on Mathematical Software 39(2), pp. 1-10. DOI: 10.1145/2427023.2427026. Online publication date: 1-Feb-2013.
  • (2013) Layout-oblivious compiler optimization for matrix computations. ACM Transactions on Architecture and Code Optimization 9(4), pp. 1-20. DOI: 10.1145/2400682.2400694. Online publication date: 20-Jan-2013.
