Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

A QDWH-based SVD Software Framework on Distributed-memory Manycore Systems

Published: 26 April 2019 Publication History

Abstract

This article presents a high-performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al. (2010) and Nakatsukasa and Higham (2013), the SVD solver relies on the polar decomposition using the QR Dynamically Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying on the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. We report performance results against state-of-the-art existing QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to 3/8-fold speedups on the Haswell/KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well- and ill-conditioned matrices. We finally come up herein with a performance model based on these empirical results. Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1.

References

[1]
E. Anderson, Z. Bai, C. H. Bischof, L. S. Blackford, J. W. Demmel, J. J. Dongarra, J. J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. 1999. LAPACK Users’ Guide (3rd ed.). SIAM, Philadelphia.
[2]
M. Berry and A. Sameh. 1989. An overview of parallel algorithms for the singular value and symmetric eigenvalue problems, Journal of Computational and Applied Mathematics, 27, 1--2 (1989), 191--213.
[3]
L. S. Blackford, J. Choi, A. Cleary, E. F. D’Azevedo, J. W. Demmel, I. S. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. W. Walker, and R. C. Whaley. 1997. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia.
[4]
G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar, T. Hérault, J. Kurzak, J. Langou, P. Lemarinier, H. Ltaief, P. Luszczek, A. YarKhan, and J. J. Dongarra. 2011. Flexible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops. IEEE, 1432--1441. Retrieved from http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6008655.
[5]
Chameleon. 2016. The Chameleon Project. Retrieved from https://project.inria.fr/chameleon/.
[6]
J. A. Goldstein and M. Levy. 1991. Linear algebra and quantum chemistry. Amer. Math. Monthly 98, 10 (Oct. 1991), 710--718.
[7]
N. J. Higham. 1986. Computing the polar decomposition with applications. SIAM J. Sci. Stat. Comput. 7, 4 (1986), 1160--1174.
[8]
N. J. Higham and P. Papadimitriou. 1994. A new parallel algorithm for computing the singular value decomposition. In Proceedings of the 5th SIAM Conference on Applied Linear Algebra, John G. Lewis (Ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, 80--84.
[9]
A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H. J. Bungartz, and H. Lederer. 2014. The ELPA library: Scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys. Cond. Matt. 26, 21 (2014), 213201. Retrieved from http://www.ncbi.nlm.nih.gov/ /24786764.
[10]
J. Meyer and I. Y. Bar-itzhack. 1977. Practical comparison of iterative matrix orthogonalization algorithms. IEEE Trans. Aero. Electron. Syst. AES-13, 3 (May 1977), 230--235.
[11]
MPI Forum. 1993. MPI: A message passing interface. In Proceedings of the ACM/IEEE Conference on Supercomputing. IEEE CS Press, Portland, OR, 878--883.
[12]
Y. Nakatsukasa, Z. Bai, and F. Gygi. 2010. Optimizing Halley’s iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl. 31, 5 (2010), 2700--2720.
[13]
Y. Nakatsukasa and N. J. Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput. 35, 3 (2013), A1325--A1349.
[14]
A. Petitet, R. C. Whaley, J. J. Dongarra, and A. Cleary. 2008. HPL—A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers. http://www.netlib.org/benchmark/hpl.
[15]
J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero. 2013. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw. 39, 2 (2013), 13.
[16]
D. Sukkari, H. Ltaief, M. Faverge, and D. Keyes. 2018. Asynchronous task-based polar decomposition on single node manycore architectures. IEEE Trans. Parallel Dist. Syst. 29, 2 (2018), 312--323.
[17]
D. Sukkari, H. Ltaief, and D. E. Keyes. 2016b. A high-performance QDWH-SVD solver using hardware accelerators. ACM Trans. Math. Softw. 43, 1 (2016), 6:1--6:25.
[18]
D. Sukkari, H. Ltaief, and D. E. Keyes. 2016a. High-performance polar decomposition on distributed memory systems. In Euro-Par 2016: Parallel Processing—22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings (Lecture Notes in Computer Science), Pierre-François Dutot and Denis Trystram (Eds.), Vol. 9833. Springer, 605--616.
[19]
D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericás. 2017. Trends in data locality abstractions for HPC systems. IEEE Trans. Parallel Dist. Syst. 28, 10 (Oct. 2017), 3007--3020.
[20]
Y. Wang and L. Zhu. 2017. Research and implementation of SVD in machine learning. In Proceedings of the16th IEEE/ACIS International Conference on Computer and Information Science, G. Zhu, S. Yao, X. Cui, and S. Xu (Eds.). IEEE Computer Society, Wuhan, China, 471--475. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7951674.

Cited By

View all
  • (2024)svds-C: A multi-thread C code for computing truncated singular value decompositionSoftwareX10.1016/j.softx.2024.10178127(101781)Online publication date: Sep-2024
  • (2023)Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware AcceleratorsProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624248(1680-1687)Online publication date: 12-Nov-2023
  • (2023)High-Performance SVD Partial Spectrum ComputationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607109(1-12)Online publication date: 12-Nov-2023
  • Show More Cited By

Index Terms

  1. A QDWH-based SVD Software Framework on Distributed-memory Manycore Systems

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Transactions on Mathematical Software
      ACM Transactions on Mathematical Software  Volume 45, Issue 2
      June 2019
      255 pages
      ISSN:0098-3500
      EISSN:1557-7295
      DOI:10.1145/3326465
      Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 26 April 2019
      Accepted: 01 January 2019
      Revised: 01 August 2018
      Received: 01 November 2017
      Published in TOMS Volume 45, Issue 2

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Dense SVD solver
      2. QDWH
      3. distributed-memory manycore systems
      4. performance analysis
      5. polar decomposition

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)8
      • Downloads (Last 6 weeks)1
      Reflects downloads up to 25 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)svds-C: A multi-thread C code for computing truncated singular value decompositionSoftwareX10.1016/j.softx.2024.10178127(101781)Online publication date: Sep-2024
      • (2023)Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware AcceleratorsProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624248(1680-1687)Online publication date: 12-Nov-2023
      • (2023)High-Performance SVD Partial Spectrum ComputationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607109(1-12)Online publication date: 12-Nov-2023
      • (2022)Stable and Efficient Computation of Generalized Polar DecompositionsSIAM Journal on Matrix Analysis and Applications10.1137/21M141198643:3(1058-1083)Online publication date: 11-Jul-2022
      • (2019)Massively Parallel Polar Decomposition on Distributed-memory SystemsACM Transactions on Parallel Computing10.1145/33287236:1(1-15)Online publication date: 7-Jun-2019
      • (2019)Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems2019 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER.2019.8891024(1-12)Online publication date: Sep-2019

      View Options

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format.

      HTML Format

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media