Automatic performance tuning of sparse matrix kernels

January 2003

Author:
Richard Wilson Vuduc
Chair:
James W. Demmel

Publisher:

University of California, Berkeley

Order Number: AAI3121741

Pages:

433

Abstract

This dissertation presents an automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical search, can, by contrast, achieve up to 31% of peak machine speed and can be up to 4× faster.

Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels like sparse triangular solve (SpTS) and computation of A^T A·x (or A A^T·x) and A^ρ·x.
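
The two-step approach can be illustrated with a minimal sketch. It is not SPARSITY's code or its search space: the compressed sparse row (CSR) layout, the two candidate kernels, and the names `spmv_csr_loop`, `spmv_csr_sliced`, `time_kernel`, and `pick_fastest` are all assumptions chosen for illustration. Step (1) is reduced to a hand-written dictionary of candidates, and step (2) to timing each candidate and keeping the fastest, with no heuristic model.

```python
import time


def spmv_csr_loop(row_ptr, col_idx, vals, x, y):
    """y <- y + A*x for A stored in CSR form, using explicit index loops."""
    for i in range(len(row_ptr) - 1):
        acc = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_csr_sliced(row_ptr, col_idx, vals, x, y):
    """Same kernel, but each row is computed with slices and sum()."""
    for i in range(len(row_ptr) - 1):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] += sum(v * x[j] for v, j in zip(vals[lo:hi], col_idx[lo:hi]))
    return y


def time_kernel(kernel, matrix, x, repeats=5):
    """Best-of-`repeats` wall-clock time (seconds) for one kernel call."""
    n_rows = len(matrix[0]) - 1
    best = float("inf")
    for _ in range(repeats):
        y = [0.0] * n_rows  # fresh output vector for every run
        start = time.perf_counter()
        kernel(*matrix, x, y)
        best = min(best, time.perf_counter() - start)
    return best


def pick_fastest(candidates, matrix, x):
    """Time every candidate on this matrix; return the fastest name and all timings."""
    timings = {name: time_kernel(fn, matrix, x) for name, fn in candidates.items()}
    return min(timings, key=timings.get), timings


# Hypothetical 3x3 test matrix A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
ROW_PTR = [0, 2, 3, 5]
COL_IDX = [0, 2, 1, 0, 2]
VALS = [4.0, 1.0, 3.0, 2.0, 5.0]
CANDIDATES = {"loop": spmv_csr_loop, "sliced": spmv_csr_sliced}
```

Calling `pick_fastest(CANDIDATES, (ROW_PTR, COL_IDX, VALS), [1.0, 2.0, 3.0])` returns the name of whichever candidate ran fastest on the current machine, so the answer can differ from one platform or matrix to another. That dependence is the reason for searching empirically rather than fixing one implementation in advance.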

We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate our performance with respect to such bounds, finding that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)
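
To give a feel for what a performance upper bound of this kind looks like, here is a deliberately simplified sketch. It is not the dissertation's model, which covers every level of the memory hierarchy. This version assumes a single memory level with a known sustained bandwidth, CSR storage with 8-byte values and 4-byte indices, and that each vector element moves exactly once per read or write. The function names and parameters are hypothetical.

```python
def spmv_csr_bound_mflops(n_rows, n_cols, nnz, bandwidth_bytes_per_s,
                          value_bytes=8, index_bytes=4):
    """Upper bound (Mflop/s) on CSR SpMV if memory traffic were the only cost."""
    # Matrix data: one value and one column index per non-zero, plus row pointers.
    matrix_bytes = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
    # Vector data: x is read once; y is read once and written once.
    vector_bytes = n_cols * value_bytes + 2 * n_rows * value_bytes
    min_time_s = (matrix_bytes + vector_bytes) / bandwidth_bytes_per_s
    flops = 2 * nnz  # one multiply and one add per stored non-zero
    return flops / min_time_s / 1e6


def fraction_of_bound(measured_mflops, bound_mflops):
    """How close a measured rate comes to the modeled bound (1.0 = at the bound)."""
    return measured_mflops / bound_mflops
```

A measured rate divided by such a bound indicates how much headroom remains for low-level tuning: a ratio near 1.0 means memory traffic, not instruction scheduling, is the limiting factor under the model's assumptions.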

Contributors

Richard W. Vuduc
Georgia Institute of Technology
James Weldon Demmel
University of California, Berkeley

Index Terms

Automatic performance tuning of sparse matrix kernels
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
    2. Extra-functional properties
      1. Software performance

Recommendations

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes ...

Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS Format on GPUs
CSE '12: Proceedings of the 2012 IEEE 15th International Conference on Computational Science and Engineering

Performance of sparse matrix-vector multiplication (SpMV) on GPUs is highly dependent on the structure of the sparse matrix used in the computation, the computing environment, and the selection of certain parameters. In this paper, we show that the ...

Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach

A wide range of applications in engineering and scientific computing are involved in the acceleration of the sparse matrix vector product (SpMV). Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration ...
