Article

Memory hierarchy optimizations and performance bounds for sparse A^TAx

Authors:

Attila Gyulassy,

James W. Demmel,

Katherine A. YelickAuthors Info & Claims

ICCS'03: Proceedings of the 2003 international conference on Computational science: PartIII

Pages 705 - 714

Published: 02 June 2003 Publication History

Abstract

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = A^T Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper-bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects nearoptimal values of the key tuning parameters, the register block sizes.

References

[1]

A.J.C. Bik and H.A.G. Wijsho. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576-1587, 1999.

Digital Library

[2]

J. Bilmes, K. Asanovic, C. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, July 1997.

Digital Library

[3]

S. Blackford et al. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. Chapter 3: http://www.netlib.org/blast.

[4]

S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable crossplatform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

Digital Library

[5]

J.W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.

Digital Library

[6]

B.B. Fraguela, R. Doallo, and E.L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), 1999.

[7]

W.D. Gropp, D.K. Kasushik, D.E. Keyes, and B.F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.

[8]

G. Heber, A.J. Dolgert, M. Alt, K.A. Mazurkiewicz, and L. Stringer. Fracture mechanics on the intel itanium architecture: A case study. In Workshop on EPIC Architectures and Compiler Technology (ACM MICRO 34), Austin, TX, 2001.

[9]

E.-J. Im and K.A. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of ICCS, pages 127-136, May 2001.

Digital Library

[10]

J.M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

Digital Library

[11]

Y. Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. http://www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.

[12]

P. Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

Digital Library

[13]

O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.

Digital Library

[14]

R. Vuduc, J.W. Demmel, K.A. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.

Digital Library

[15]

R. Vuduc, A. Gyulassy, J.W. Demmel, and K.A. Yelick. Memory hierarchy optimizations and performance bounds for sparse ATAx. Technical Report UCB/CS- 03-1232, University of California, Berkeley, February 2003.

[16]

R. Vuduc, S. Kamil, J. Hsu, R. Nishtala, J.W. Demmel, and K.A. Yelick. Automatic performance tuning and analysis of sparse triangular solve. In ICS 2002: POHLL Workshop, New York, USA, June 2002.

[17]

W. Wang and D.P. O'Leary. Adaptive use of iterative methods in interior point methods for linear programming. Technical Report UMIACS-95-111, University of Maryland at College Park, College Park, MD, USA, 1995.

Digital Library

[18]

C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. of Supercomp., Orlando, FL, 1998.

Digital Library

Cited By

Tao YDeng YMu SZhang ZZhu MXiao LRuan L(2015)GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplicationConcurrency and Computation: Practice & Experience10.1002/cpe.341527:14(3771-3789)Online publication date: 25-Sep-2015
https://dl.acm.org/doi/10.1002/cpe.3415
Karlin IJessup EBelter GSiek J(2011)Parallel memory prediction for fused linear algebra kernelsACM SIGMETRICS Performance Evaluation Review10.1145/1964218.196422638:4(43-49)Online publication date: 29-Mar-2011
https://dl.acm.org/doi/10.1145/1964218.1964226
Williams SOliker LVuduc RShalf JYelick KDemmel JVerastegui B(2007)Optimization of sparse matrix-vector multiplication on emerging multicore platformsProceedings of the 2007 ACM/IEEE conference on Supercomputing10.1145/1362622.1362674(1-12)Online publication date: 16-Nov-2007
https://dl.acm.org/doi/10.1145/1362622.1362674
Show More Cited By

Recommendations

Performance optimizations and bounds for sparse matrix-vector multiply
SC '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM × V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what ...
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
ISCA 2004

The performance of memory-bound commercial applicationssuch as databases is limited by increasing memory latencies. Inthis paper, we show that exploiting memory-level parallelism(MLP) is an effective approach for improving the performance ofthese ...
A sparse-sparse iteration for computing a sparse incomplete factorization of the inverse of an SPD matrix

In this paper, a method via sparse-sparse iteration for computing a sparse incomplete factorization of the inverse of a symmetric positive definite matrix is proposed. The resulting factorized sparse approximate inverse is used as a preconditioner for ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

ICCS'03: Proceedings of the 2003 international conference on Computational science: PartIII

June 2003

1165 pages

ISBN:3540401962

Editors:
Peter M. A. Sloot
University of Amsterdam, Informatics Institute, Section of Computational Science, Amsterdam, The Netherlands
,
David Abramson
Monash University, School of Computer Science and Software Engineering, Clayton, VIC, Australia
,
Alexander V. Bogdanov
Institute for High-Performance Computing and Information Systems, Russia
,
Yuriy E. Gorbachev
Institute for High-Performance Computing and Information Systems, Russia
,
Jack J. Dongarra
University of Tennessee and Oak Ridge National Laboratory, Computer Science Dept., Knoxville, TN

Sponsors

ceanet
St. Petersburg State Technical University
Etnus
Pallas GmbH
University of Amsterdam: University of Amsterdam

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 02 June 2003

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Tao YDeng YMu SZhang ZZhu MXiao LRuan L(2015)GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplicationConcurrency and Computation: Practice & Experience10.1002/cpe.341527:14(3771-3789)Online publication date: 25-Sep-2015
https://dl.acm.org/doi/10.1002/cpe.3415
Karlin IJessup EBelter GSiek J(2011)Parallel memory prediction for fused linear algebra kernelsACM SIGMETRICS Performance Evaluation Review10.1145/1964218.196422638:4(43-49)Online publication date: 29-Mar-2011
https://dl.acm.org/doi/10.1145/1964218.1964226
Williams SOliker LVuduc RShalf JYelick KDemmel JVerastegui B(2007)Optimization of sparse matrix-vector multiplication on emerging multicore platformsProceedings of the 2007 ACM/IEEE conference on Supercomputing10.1145/1362622.1362674(1-12)Online publication date: 16-Nov-2007
https://dl.acm.org/doi/10.1145/1362622.1362674
Im EBustany IAshcraft CDemmel JYelick K(2004)Performance tuning of matrix triple products based on matrix structureProceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing10.1007/11558958_89(740-746)Online publication date: 20-Jun-2004
https://dl.acm.org/doi/10.1007/11558958_89

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents