Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/1762418.1762495guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Memory hierarchy optimizations and performance bounds for sparse ATAx

Published: 02 June 2003 Publication History

Abstract

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = AT Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper-bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects nearoptimal values of the key tuning parameters, the register block sizes.

References

[1]
A.J.C. Bik and H.A.G. Wijsho. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576-1587, 1999.
[2]
J. Bilmes, K. Asanovic, C. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, July 1997.
[3]
S. Blackford et al. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. Chapter 3: http://www.netlib.org/blast.
[4]
S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable crossplatform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.
[5]
J.W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[6]
B.B. Fraguela, R. Doallo, and E.L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), 1999.
[7]
W.D. Gropp, D.K. Kasushik, D.E. Keyes, and B.F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.
[8]
G. Heber, A.J. Dolgert, M. Alt, K.A. Mazurkiewicz, and L. Stringer. Fracture mechanics on the intel itanium architecture: A case study. In Workshop on EPIC Architectures and Compiler Technology (ACM MICRO 34), Austin, TX, 2001.
[9]
E.-J. Im and K.A. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of ICCS, pages 127-136, May 2001.
[10]
J.M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[11]
Y. Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. http://www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.
[12]
P. Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.
[13]
O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.
[14]
R. Vuduc, J.W. Demmel, K.A. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
[15]
R. Vuduc, A. Gyulassy, J.W. Demmel, and K.A. Yelick. Memory hierarchy optimizations and performance bounds for sparse ATAx. Technical Report UCB/CS- 03-1232, University of California, Berkeley, February 2003.
[16]
R. Vuduc, S. Kamil, J. Hsu, R. Nishtala, J.W. Demmel, and K.A. Yelick. Automatic performance tuning and analysis of sparse triangular solve. In ICS 2002: POHLL Workshop, New York, USA, June 2002.
[17]
W. Wang and D.P. O'Leary. Adaptive use of iterative methods in interior point methods for linear programming. Technical Report UMIACS-95-111, University of Maryland at College Park, College Park, MD, USA, 1995.
[18]
C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. of Supercomp., Orlando, FL, 1998.

Cited By

View all
  • (2015)GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplicationConcurrency and Computation: Practice & Experience10.1002/cpe.341527:14(3771-3789)Online publication date: 25-Sep-2015
  • (2011)Parallel memory prediction for fused linear algebra kernelsACM SIGMETRICS Performance Evaluation Review10.1145/1964218.196422638:4(43-49)Online publication date: 29-Mar-2011
  • (2007)Optimization of sparse matrix-vector multiplication on emerging multicore platformsProceedings of the 2007 ACM/IEEE conference on Supercomputing10.1145/1362622.1362674(1-12)Online publication date: 16-Nov-2007
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
ICCS'03: Proceedings of the 2003 international conference on Computational science: PartIII
June 2003
1165 pages
ISBN:3540401962
  • Editors:
  • Peter M. A. Sloot,
  • David Abramson,
  • Alexander V. Bogdanov,
  • Yuriy E. Gorbachev,
  • Jack J. Dongarra

Sponsors

  • ceanet
  • St. Petersburg State Technical University
  • Etnus
  • Pallas GmbH
  • University of Amsterdam: University of Amsterdam

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 02 June 2003

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2015)GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplicationConcurrency and Computation: Practice & Experience10.1002/cpe.341527:14(3771-3789)Online publication date: 25-Sep-2015
  • (2011)Parallel memory prediction for fused linear algebra kernelsACM SIGMETRICS Performance Evaluation Review10.1145/1964218.196422638:4(43-49)Online publication date: 29-Mar-2011
  • (2007)Optimization of sparse matrix-vector multiplication on emerging multicore platformsProceedings of the 2007 ACM/IEEE conference on Supercomputing10.1145/1362622.1362674(1-12)Online publication date: 16-Nov-2007
  • (2004)Performance tuning of matrix triple products based on matrix structureProceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing10.1007/11558958_89(740-746)Online publication date: 20-Jun-2004

View Options

View options

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media