Article

Free access

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

Authors:

Jeremy D. Frens,

David S. WiseAuthors Info & Claims

PPOPP '97: Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 206 - 216

https://doi.org/10.1145/263764.263789

Published: 21 June 1997 Publication History

Abstract

An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimrd code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned BLAS3 routines; it can win.The recursive code bifurcates naturally at the top level into independent block-oriented processes, that each writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the super-scalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.

References

[1]

J. M. Anderson, S. P. Amarainghe, & M. S. Lam. Data and computation transformations for multiprocessors. Proc. 5th A CM SIGPLAN Syrup. on Principles and Practice of Parallel Programming, SIGPLAN Notices 30, 8 (August 1995), 166-- 178.

Digital Library

[2]

R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, & M. Paleczny. A model and compilation strategy for outof-core data parallel programs. Proc. 5th A CM SIGPLAN Syrup. on Principles and Practice of Parallel Programming, SIGPLAN Notices 30, 8 (August 1995), 1-17.

Digital Library

[3]

F.W. Burton & J. G. Kollias. Comment on 'The explicit quad tree as a structure for computer graphics.' Comput. J. 26, 2 (May 1983), 188.

[4]

D. Cann. Retire FORTRAN? : a debate rekindled. Comm. ACM 35, 8 (August 1992), 81-89.

Digital Library

[5]

S. Carr & R. B. Lehoucq. Compiler Blockability of dense matrix factorizations. ACM Trans. Math.Software (to appear).

Digital Library

[6]

M. Cierniak & W. Li. Unifying data and control transformations for distributed shared-memory machines. Proc. ACM SIGPLAN '95 Conf. on Programming Lang. Design and Implementation, SIGPLAN Notices 30, 6 (June 1995), 205-217.

Digital Library

[7]

J. Cohen & M. Roth. On the implementation of Strassen's fast multiplication algorithm. Acta Informat. 6, 4 (August 1976), 341-355.

Digital Library

[8]

D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, & T. yon Eicken. LogP: a practical model of parallel computation. Comm. ACM 39, 11 (November 1996), 78-85.

Digital Library

[9]

J. W. Demmel & N, J. Higham. Stability of block algorithms with fast Level 3 BLAS. ACM Trans. Math. Software 18, 3 (September 1992), 274-291.

Digital Library

[10]

J. Dongarra, J. DuCroz, S. Hammarling, & R. Hanson. An extended set of FORTRAN basic linear algebra subprograms. A CM Trans. Math. Sofm" 14 (1988), I- 17.

Digital Library

[11]

R. J. Fateman. Symbolic mathematics system evaluators (extended abstract). In Y. N. Lakshman (ed.), Proc 1996 Intl. Symp. on Symbolic and Algebraic Computation, New York, ACM Press, 86-94.

Digital Library

[12]

P. C. Fischer & R. L. Probert. Storage reorganization techniques for matrix computation in a paging environment. Comm. ACM 22, 7 (July 1979), 405-415.

Digital Library

[13]

J. Frens & D. S. Wise. Matrix inversion using quadtrees implemented in GOFER. Technical Report 433, Computer Science Dept., Indiana University (May 1995).

[14]

K. A. Gallivan, R. J. Plemmons, & A. H. Sameh. Parallel Algorithms for dense linear algebra. SlAM Review 32, 1 (March 1990), 54-135. Reprinted in Gallivan et al. Parallel Algorithms for Matrix Computation, Philadelphia, SIAM (1990), 1-82.

Digital Library

[15]

G. H. Golub & C. F. Van Loan. Matrix Computations 2nd edition. The Johns Hopkins University Press, Baltimore (1989).

Digital Library

[16]

J. Fasel, P. Hudak, S. Peyton Jones, & P. Wadler (eds.) HASKELL special issue. ACM SIGPLAN Notices 27, 5 (May 1992).

Digital Library

[17]

N.J. Higham. Exploiting fast matrix multiplication within the Level 3 BLAS. ACM Trans. Math. Software 16, 4 (December 1990), 352-368.

Digital Library

[18]

D. E. Knuth. The Art of Computer Programming I, Fundamental Algorithms (2nd ed.), Reading, MA, Addison-Wesley, (1973).

Digital Library

[19]

A.C. McKellar & E. G. Coffman, Jr. Organizing matrices and matrix operations for paged-memory systems Comm. A CM 12, 3 (March 1969), 153-165.

Digital Library

[20]

J. Spiess. Untersuchungen des Zeitgewinns durch neue Algorithmen zur Matrix-Multiplikation. Computing 17, 1 (1976), 23-36.

[21]

V. Strassen. Gaussian elimination is not optimal. Numer. Math. 13 (1969), 354-356.

Digital Library

[22]

G. L. Steele, Jr. Debunking the "expensive procedure call" myth, or Procedure call implementations considered harmful, or LAMBDA: the ultimate GOTO.ACM77: Proc. 1977 Annual Conf., New York, ACM (1977), 153-162.

Digital Library

[23]

D. S. Wise. Representing matrices as quadtrees for parallel processors (extended abstract). ACM SIGSAM Bulletin 18, 3 (August 1984), 24-25.

Digital Library

[24]

D. S. Wise. Undulant block elimination and integerpreserving matrix inversion. Sci. Comput. Programming (to appear). Technical Report 418, Computer Science Department, Indiana University (revised, August 1995).

Cited By

Nie QMalik S(2020)MemFlow: Memory-Driven Data Scheduling With Datapath Co-Design in Accelerators for Large-Scale Inference ApplicationsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2019.292537739:9(1875-1888)Online publication date: Sep-2020
https://doi.org/10.1109/TCAD.2019.2925377
Vuduc RDemmel JBilmes J(2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
https://doi.org/10.1177/1094342004041293
Ghosh AMahdian MMcAfee RVassilvitskii S(2015)To Match or Not to MatchACM Transactions on Economics and Computation10.1145/27458013:2(1-18)Online publication date: 20-Apr-2015
https://dl.acm.org/doi/10.1145/2745801
Show More Cited By

Index Terms

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

Recommendations

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimrd code, tracking hand-coded BLAS3 routines. Proof of ...
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism
Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP 2003) and workshop on partial evaluation and semantics-based program manipulation (PEPM 2003)

Quadtree matrices using Morton-order storage provide natural blocking on every level of a memory hierarchy. Writing the natural recursive algorithms to take advantage of this blocking results in code that honors the memory hierarchy without the need for ...
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism
PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming

Quadtree matrices using Morton-order storage provide natural blocking on every level of a memory hierarchy. Writing the natural recursive algorithms to take advantage of this blocking results in code that honors the memory hierarchy without the need for ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PPOPP '97: Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming

June 1997

287 pages

ISBN:0897919068

DOI:10.1145/263764

Chairmen:
Rob Schreiber
Hewlett-Packard Labs, Palo Alto, CA
,
Keshav Pingali
Cornell Univ., Ithaca, NY
,
Editor:
Michael A. Berman

ACM SIGPLAN Notices Volume 32, Issue 7
July 1997
287 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/263767
Chairmen:
Rob Schreiber
Hewlett-Packard Labs, Palo Alto, CA
,
Keshav Pingali
Cornell Univ., Ithaca, NY
,
Editor:
A. Michael Berman
Issue’s Table of Contents

Copyright © 1997 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 June 1997

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

PPoPP97

Sponsor:

SIGPLAN

PPoPP97: Principles & Practices of Parallel Programming

June 18 - 21, 1997

Nevada, Las Vegas, USA

Acceptance Rates

PPOPP '97 Paper Acceptance Rate 26 of 86 submissions, 30%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

109
Total Citations
View Citations
1,051
Total Downloads

Downloads (Last 12 months)109
Downloads (Last 6 weeks)23

Reflects downloads up to 16 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Nie QMalik S(2020)MemFlow: Memory-Driven Data Scheduling With Datapath Co-Design in Accelerators for Large-Scale Inference ApplicationsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2019.292537739:9(1875-1888)Online publication date: Sep-2020
https://doi.org/10.1109/TCAD.2019.2925377
Vuduc RDemmel JBilmes J(2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
https://doi.org/10.1177/1094342004041293
Ghosh AMahdian MMcAfee RVassilvitskii S(2015)To Match or Not to MatchACM Transactions on Economics and Computation10.1145/27458013:2(1-18)Online publication date: 20-Apr-2015
https://dl.acm.org/doi/10.1145/2745801
Wilkens CSivan B(2015)Single-Call MechanismsACM Transactions on Economics and Computation10.1145/27410273:2(1-60)Online publication date: 20-Apr-2015
https://dl.acm.org/doi/10.1145/2741027
Fang SXu WChen YEeckhout LTemam OChen YWu CFeng X(2015)Practical Iterative Optimization for the Data CenterACM Transactions on Architecture and Code Optimization10.1145/273904812:2(1-26)Online publication date: 11-May-2015
https://dl.acm.org/doi/10.1145/2739048
Hoefer MKesselheim T(2015)Secondary Spectrum Auctions for Symmetric and Submodular BiddersACM Transactions on Economics and Computation10.1145/27390413:2(1-25)Online publication date: 20-Apr-2015
https://dl.acm.org/doi/10.1145/2739041
Chakrabarti DVee E(2015)Traffic Shaping to Optimize Ad DeliveryACM Transactions on Economics and Computation10.1145/27390103:2(1-20)Online publication date: 20-Apr-2015
https://dl.acm.org/doi/10.1145/2739010
Jia WGarza EShaw KMartonosi M(2015)GPU Performance and Power Tuning Using Regression TreesACM Transactions on Architecture and Code Optimization10.1145/273628712:2(1-26)Online publication date: 11-May-2015
https://dl.acm.org/doi/10.1145/2736287
Streit KDoerfert JHammacher CZeller AHack S(2015)Generalized Task ParallelismACM Transactions on Architecture and Code Optimization10.1145/272316412:1(1-25)Online publication date: 2-Apr-2015
https://dl.acm.org/doi/10.1145/2723164
Goldstein DMcAfee RSuri S(2015)Improving the Effectiveness of Time-Based Display AdvertisingACM Transactions on Economics and Computation10.1145/27163233:2(1-20)Online publication date: 20-Apr-2015
https://dl.acm.org/doi/10.1145/2716323
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents