- Research article, September 2024
Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems
Journal of Parallel and Distributed Computing (JPDC), Volume 192, Issue C. https://doi.org/10.1016/j.jpdc.2024.104925
Abstract: Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating point operations and data access costs compared to traditional sparse matrix approaches. In this work, we ...
Highlights:
- Matrix-free algorithms for FE matrix-multivector products with thousands of vectors on multi-node CPU and GPU architectures.
- Hardware-tuned batched evaluation strategies for improved data locality.
- Architecture-specific ...
- Research article, May 2023
Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures
Journal of Parallel and Distributed Computing (JPDC), Volume 175, Issue C, Pages 51–65. https://doi.org/10.1016/j.jpdc.2023.01.004
Abstract: We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use ...
Highlights:
- Exposure of the performance penalty introduced by NUMA-oblivious implementations.
- Research article, May 2022
Pipelined Preconditioned Conjugate Gradient Methods for real and complex linear systems for distributed memory architectures
Journal of Parallel and Distributed Computing (JPDC), Volume 163, Issue C, Pages 147–155. https://doi.org/10.1016/j.jpdc.2022.01.008
Abstract: Preconditioned Conjugate Gradient (PCG) is a popular method for solving large, sparse linear systems of equations. The performance of PCG at scale is limited by the costly global synchronization steps that arise in dot products ...
Highlights:
- We introduce PIPECG-OATI-c to solve complex Hermitian and symmetric linear systems.
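As a concrete point of reference for the synchronization cost this entry describes, below is a minimal NumPy sketch of the classical PCG iteration, not the paper's pipelined PIPECG-OATI variant. The function name, the test matrix, and the Jacobi preconditioner are illustrative choices, not taken from the paper; the comments mark the two dot products per iteration that become global reductions in a distributed run.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=1000):
    """Classical preconditioned conjugate gradient (illustrative sketch).

    Each iteration performs two dot products; in a distributed-memory run
    these become global reductions, which is the synchronization cost that
    pipelined variants try to overlap with other work.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv @ r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)          # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv @ r
        rz_new = r @ z                 # dot product -> global reduction
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small SPD test problem with a Jacobi (diagonal) preconditioner.
n = 50
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)            # symmetric positive definite
b = rng.standard_normal(n)
M_inv = np.diag(1.0 / np.diag(A))
x = pcg(A, b, M_inv)
print(np.linalg.norm(A @ x - b) < 1e-6)   # True
```

The recurrence form above (updating `r` rather than recomputing `b - A @ x`) is the standard trick that keeps the cost at one matrix-vector product per iteration.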
- Research article, February 2022
Comparing the performance of general matrix multiplication routine on heterogeneous computing systems
Journal of Parallel and Distributed Computing (JPDC), Volume 160, Issue C, Pages 39–48. https://doi.org/10.1016/j.jpdc.2021.10.002
Abstract: This paper presents the results of research on the performance of the general matrix multiplication routine on modern heterogeneous computing systems. In addition to the single-threaded and multi-threaded performance of the routine for ...
Highlights:
- The performance of GEMM routines on heterogeneous computing systems has been studied.
- Research article, April 2020
sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)
Journal of Parallel and Distributed Computing (JPDC), Volume 138, Issue C, Pages 153–171. https://doi.org/10.1016/j.jpdc.2019.12.002
Abstract: In this work we have implemented a novel linear algebra library on top of the task-based runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features: weak dependencies and regions, together with the final clause for the ...
Highlights:
- Development of a highly optimized auto-tuned library for BLAS-3 and LAPACK operations.
- Research article, March 2020
Batched transpose-free ADI-type preconditioners for a Poisson solver on GPGPUs
Journal of Parallel and Distributed Computing (JPDC), Volume 137, Issue C, Pages 148–159. https://doi.org/10.1016/j.jpdc.2019.11.004
Abstract: We investigate the iterative solution of a symmetric positive definite linear system involving the shifted Laplacian as the system matrix on General Purpose Graphics Processing Units (GPGPUs). We consider in particular the Chebyshev ...
Highlights:
- Solution of batched tridiagonal linear systems on GPUs.
- Problems originating ...
- Research article, August 2019
Node aware sparse matrix–vector multiplication
Journal of Parallel and Distributed Computing (JPDC), Volume 130, Issue C, Pages 166–178. https://doi.org/10.1016/j.jpdc.2019.03.016
Abstract: The sparse matrix–vector multiply (SpMV) operation is a key computational kernel in many simulations and linear solvers. The large communication requirements associated with a reference implementation of a parallel SpMV result in poor ...
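To make the kernel concrete, here is a minimal serial CSR sketch of SpMV in NumPy. The paper's node-aware strategy concerns the distributed exchange of `x` entries owned by other processes, which this purely local kernel omits; the function name and example matrix are illustrative, not from the paper.

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a matrix stored in compressed sparse row (CSR) form.

    In a distributed SpMV, rows are partitioned across processes and the
    entries of x indexed by off-process columns must be communicated before
    this loop can run; node-aware variants aggregate that traffic per node
    rather than per process.
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):                          # one output row at a time
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

# CSR storage of the 3x3 matrix [[4, 1, 0], [0, 3, 2], [1, 0, 5]].
indptr  = np.array([0, 2, 4, 6])
indices = np.array([0, 1, 1, 2, 0, 2])
data    = np.array([4.0, 1.0, 3.0, 2.0, 1.0, 5.0])
x = np.array([1.0, 2.0, 3.0])
y = csr_spmv(indptr, indices, data, x)
print(y)   # y = [6, 12, 16]
```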
- Research article, July 2019
Optimizing sparse tensor times matrix on GPUs
Journal of Parallel and Distributed Computing (JPDC), Volume 129, Issue C, Pages 99–109. https://doi.org/10.1016/j.jpdc.2018.07.018
Abstract: This work optimizes tensor-times-dense matrix multiply (Ttm) for general sparse and semi-sparse tensors on CPU and NVIDIA GPU platforms. Ttm is a computational kernel in tensor-method-based data analytics and data mining applications, ...
Highlights:
- Designed an in-place SpTTM algorithm to avoid tensor-matrix data transformation.
- Research article, June 2019
Modeling the asynchronous Jacobi method without communication delays
Journal of Parallel and Distributed Computing (JPDC), Volume 128, Issue C, Pages 84–98. https://doi.org/10.1016/j.jpdc.2019.02.002
Abstract: Asynchronous iterative methods for solving linear systems are gaining renewed interest due to the high cost of synchronization points in massively parallel codes. Historically, theory on asynchronous iterative methods has focused on ...
Highlights:
- We study the asynchronous Jacobi (AJ) method for solving sparse linear systems.
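For reference, below is a minimal NumPy sketch of the synchronous Jacobi baseline; the asynchronous variant analyzed in the paper lets each process update its components with whatever neighbor values are currently available, dropping the barrier between sweeps, which a serial sketch cannot exhibit. The function name and the example system are illustrative only.

```python
import numpy as np

def jacobi(A, b, max_sweeps=200, tol=1e-10):
    """Synchronous Jacobi iteration x <- D^{-1} (b - (A - D) x).

    Every component update in a sweep uses values from the previous sweep,
    so a parallel implementation needs a barrier between sweeps; the
    asynchronous method removes that barrier at the cost of a harder
    convergence analysis.
    """
    d = np.diag(A)
    R = A - np.diag(d)       # off-diagonal part of A
    x = np.zeros_like(b)
    for _ in range(max_sweeps):
        x_new = (b - R @ x) / d
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Strictly diagonally dominant system, so Jacobi is guaranteed to converge.
A = np.array([[10.0,  1.0,  2.0],
              [ 1.0, 12.0,  3.0],
              [ 2.0,  3.0, 15.0]])
b = np.array([13.0, 16.0, 20.0])
x = jacobi(A, b)
print(np.allclose(A @ x, b, atol=1e-7))   # True
```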
- Research article, September 2018
Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning
Journal of Parallel and Distributed Computing (JPDC), Volume 119, Issue C, Pages 219–230. https://doi.org/10.1016/j.jpdc.2018.04.017
Abstract: When using incomplete factorization preconditioners with an iterative method to solve large sparse linear systems, each application of the preconditioner involves solving two sparse triangular systems. These triangular systems are ...
Highlights:
- Jacobi solvers for sparse incomplete factor preconditioning are extensively tested.
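A minimal NumPy sketch of the idea in this entry, assuming (as the title suggests) that the triangular solve is replaced by Jacobi sweeps: each sweep is a matrix-vector product, which parallelizes far better than sequential forward substitution, and for a triangular matrix the iteration matrix is nilpotent, so n sweeps recover the exact solution. Names and the example matrix are illustrative, not from the paper.

```python
import numpy as np

def jacobi_triangular_solve(T, b, num_sweeps):
    """Approximate solve of a lower triangular system T x = b by Jacobi sweeps.

    Each sweep computes x <- D^{-1} (b - (T - D) x).  Because D^{-1}(T - D)
    is strictly triangular (hence nilpotent), n sweeps give the exact
    solution; in preconditioning, a few sweeps usually suffice.
    """
    d = np.diag(T)
    R = T - np.diag(d)           # strictly triangular off-diagonal part
    x = b / d                    # first Jacobi sweep from x0 = 0
    for _ in range(num_sweeps - 1):
        x = (b - R @ x) / d
    return x

# Small lower-triangular example: exact after n = 3 sweeps.
T = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.5, 1.0, 4.0]])
b = np.array([2.0, 5.0, 9.0])
x = jacobi_triangular_solve(T, b, num_sweeps=3)
print(np.allclose(T @ x, b))   # True
```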
- Article, August 2008
Parallel computation of the eigenvalues of symmetric Toeplitz matrices through iterative methods
Journal of Parallel and Distributed Computing (JPDC), Volume 68, Issue 8, Pages 1113–1121. https://doi.org/10.1016/j.jpdc.2008.03.003
Abstract: This paper presents a new procedure to compute many or all of the eigenvalues and eigenvectors of symmetric Toeplitz matrices. The key to this algorithm is the use of the "Shift-and-Invert" technique applied with iterative methods, which allows the ...
- Article, June 2008
A relaxation scheme for increasing the parallelism in Jacobi-SVD
Journal of Parallel and Distributed Computing (JPDC), Volume 68, Issue 6, Pages 769–777. https://doi.org/10.1016/j.jpdc.2007.12.003
Abstract: The Singular Value Decomposition (SVD) is a vital problem that finds a place in numerous application domains in science and engineering. As an example, SVDs are used in processing voluminous datasets. Many sequential and parallel algorithms have been ...
- Article, May 2008
Parallel block tridiagonalization of real symmetric matrices
Journal of Parallel and Distributed Computing (JPDC), Volume 68, Issue 5, Pages 703–715. https://doi.org/10.1016/j.jpdc.2007.10.001
Abstract: Two parallel block tridiagonalization algorithms and implementations for dense real symmetric matrices are presented. Block tridiagonalization is a critical pre-processing step for the block tridiagonal divide-and-conquer algorithm for computing ...
- Article, February 2008
Load balancing algorithms based on gradient methods and their analysis through algebraic graph theory
Journal of Parallel and Distributed Computing (JPDC), Volume 68, Issue 2, Pages 209–220. https://doi.org/10.1016/j.jpdc.2007.09.001
Abstract: The main results of this paper are based on the idea that most load balancing algorithms can be described in the framework of optimization theory. This makes it possible to apply classical results on convergence, its rate, and related properties. We ...
- Article, May 2007
MPI implementation of parallel subdomain methods for linear and nonlinear convection–diffusion problems
Journal of Parallel and Distributed Computing (JPDC), Volume 67, Issue 5, Pages 581–591. https://doi.org/10.1016/j.jpdc.2007.01.003
Abstract: The solution of linear and nonlinear convection–diffusion problems via parallel subdomain methods is considered. MPI implementation of parallel Schwarz alternating methods on distributed memory multiprocessors is discussed. Parallel synchronous and ...
- Article, November 2006
Parallel sparse LU factorization on different message passing platforms
Journal of Parallel and Distributed Computing (JPDC), Volume 66, Issue 11, Pages 1387–1403. https://doi.org/10.1016/j.jpdc.2006.07.001
Abstract: Several message passing-based parallel solvers have been developed for general (non-symmetric) sparse LU factorization with partial pivoting. Existing solvers were mostly deployed and evaluated on parallel computing platforms with high message passing ...
- Article, March 2006
Complexity of matrix product on modular linear systolic arrays for algorithms with affine schedules
Journal of Parallel and Distributed Computing (JPDC), Volume 66, Issue 3, Pages 323–333. https://doi.org/10.1016/j.jpdc.2005.07.008
Abstract: This paper investigates the computation of matrix product on both partially pipelined and fully pipelined modular linear arrays. These investigations are guided by a constructive and unified approach for both target architectures. First, permissible ...
- Article, March 2005
Distributed block independent set algorithms and parallel multilevel ILU preconditioners
Journal of Parallel and Distributed Computing (JPDC), Volume 65, Issue 3, Pages 331–346. https://doi.org/10.1016/j.jpdc.2004.10.007
Abstract: We present a class of parallel preconditioning strategies utilizing multilevel block incomplete LU (ILU) factorization techniques to solve large sparse linear systems. The preconditioners are constructed by exploiting the concept of block independent ...
- Article, March 2005
Toward an automatic parallelization of sparse matrix computations
Journal of Parallel and Distributed Computing (JPDC), Volume 65, Issue 3, Pages 313–330. https://doi.org/10.1016/j.jpdc.2004.09.017
Abstract: In this paper, we propose a generic method for the automatic parallelization of sparse matrix computations. This method is based on both a refinement of the data-dependence test proposed by Bernstein and an inspector-executor scheme specialized to ...
- Article, September 2003
Communication characteristics of large-scale scientific applications for contemporary cluster architectures
Journal of Parallel and Distributed Computing (JPDC), Volume 63, Issue 9, Pages 853–865. https://doi.org/10.1016/S0743-7315(03)00104-7
Abstract: This paper examines the explicit communication characteristics of several sophisticated scientific applications, which, by themselves, constitute a representative suite of publicly available benchmarks for large cluster architectures. By focusing on the ...