Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/263764.263789acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
Article
Free access

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

Published: 21 June 1997 Publication History

Abstract

An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimrd code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned BLAS3 routines; it can win.The recursive code bifurcates naturally at the top level into independent block-oriented processes, that each writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the super-scalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.

References

[1]
J. M. Anderson, S. P. Amarainghe, & M. S. Lam. Data and computation transformations for multiprocessors. Proc. 5th A CM SIGPLAN Syrup. on Principles and Practice of Parallel Programming, SIGPLAN Notices 30, 8 (August 1995), 166-- 178.
[2]
R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, & M. Paleczny. A model and compilation strategy for outof-core data parallel programs. Proc. 5th A CM SIGPLAN Syrup. on Principles and Practice of Parallel Programming, SIGPLAN Notices 30, 8 (August 1995), 1-17.
[3]
F.W. Burton & J. G. Kollias. Comment on 'The explicit quad tree as a structure for computer graphics.' Comput. J. 26, 2 (May 1983), 188.
[4]
D. Cann. Retire FORTRAN? : a debate rekindled. Comm. ACM 35, 8 (August 1992), 81-89.
[5]
S. Carr & R. B. Lehoucq. Compiler Blockability of dense matrix factorizations. ACM Trans. Math.Software (to appear).
[6]
M. Cierniak & W. Li. Unifying data and control transformations for distributed shared-memory machines. Proc. ACM SIGPLAN '95 Conf. on Programming Lang. Design and Implementation, SIGPLAN Notices 30, 6 (June 1995), 205-217.
[7]
J. Cohen & M. Roth. On the implementation of Strassen's fast multiplication algorithm. Acta Informat. 6, 4 (August 1976), 341-355.
[8]
D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, & T. yon Eicken. LogP: a practical model of parallel computation. Comm. ACM 39, 11 (November 1996), 78-85.
[9]
J. W. Demmel & N, J. Higham. Stability of block algorithms with fast Level 3 BLAS. ACM Trans. Math. Software 18, 3 (September 1992), 274-291.
[10]
J. Dongarra, J. DuCroz, S. Hammarling, & R. Hanson. An extended set of FORTRAN basic linear algebra subprograms. A CM Trans. Math. Sofm" 14 (1988), I- 17.
[11]
R. J. Fateman. Symbolic mathematics system evaluators (extended abstract). In Y. N. Lakshman (ed.), Proc 1996 Intl. Symp. on Symbolic and Algebraic Computation, New York, ACM Press, 86-94.
[12]
P. C. Fischer & R. L. Probert. Storage reorganization techniques for matrix computation in a paging environment. Comm. ACM 22, 7 (July 1979), 405-415.
[13]
J. Frens & D. S. Wise. Matrix inversion using quadtrees implemented in GOFER. Technical Report 433, Computer Science Dept., Indiana University (May 1995).
[14]
K. A. Gallivan, R. J. Plemmons, & A. H. Sameh. Parallel Algorithms for dense linear algebra. SlAM Review 32, 1 (March 1990), 54-135. Reprinted in Gallivan et al. Parallel Algorithms for Matrix Computation, Philadelphia, SIAM (1990), 1-82.
[15]
G. H. Golub & C. F. Van Loan. Matrix Computations 2nd edition. The Johns Hopkins University Press, Baltimore (1989).
[16]
J. Fasel, P. Hudak, S. Peyton Jones, & P. Wadler (eds.) HASKELL special issue. ACM SIGPLAN Notices 27, 5 (May 1992).
[17]
N.J. Higham. Exploiting fast matrix multiplication within the Level 3 BLAS. ACM Trans. Math. Software 16, 4 (December 1990), 352-368.
[18]
D. E. Knuth. The Art of Computer Programming I, Fundamental Algorithms (2nd ed.), Reading, MA, Addison-Wesley, (1973).
[19]
A.C. McKellar & E. G. Coffman, Jr. Organizing matrices and matrix operations for paged-memory systems Comm. A CM 12, 3 (March 1969), 153-165.
[20]
J. Spiess. Untersuchungen des Zeitgewinns durch neue Algorithmen zur Matrix-Multiplikation. Computing 17, 1 (1976), 23-36.
[21]
V. Strassen. Gaussian elimination is not optimal. Numer. Math. 13 (1969), 354-356.
[22]
G. L. Steele, Jr. Debunking the "expensive procedure call" myth, or Procedure call implementations considered harmful, or LAMBDA: the ultimate GOTO.ACM77: Proc. 1977 Annual Conf., New York, ACM (1977), 153-162.
[23]
D. S. Wise. Representing matrices as quadtrees for parallel processors (extended abstract). ACM SIGSAM Bulletin 18, 3 (August 1984), 24-25.
[24]
D. S. Wise. Undulant block elimination and integerpreserving matrix inversion. Sci. Comput. Programming (to appear). Technical Report 418, Computer Science Department, Indiana University (revised, August 1995).

Cited By

View all
  • (2020)MemFlow: Memory-Driven Data Scheduling With Datapath Co-Design in Accelerators for Large-Scale Inference ApplicationsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2019.292537739:9(1875-1888)Online publication date: Sep-2020
  • (2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
  • (2015)To Match or Not to MatchACM Transactions on Economics and Computation10.1145/27458013:2(1-18)Online publication date: 20-Apr-2015
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PPOPP '97: Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
June 1997
287 pages
ISBN:0897919068
DOI:10.1145/263764
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 June 1997

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. cache misses
  2. indexing
  3. paging
  4. quadtrees
  5. storage management
  6. swapping

Qualifiers

  • Article

Conference

PPoPP97
Sponsor:
PPoPP97: Principles & Practices of Parallel Programming
June 18 - 21, 1997
Nevada, Las Vegas, USA

Acceptance Rates

PPOPP '97 Paper Acceptance Rate 26 of 86 submissions, 30%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)109
  • Downloads (Last 6 weeks)23
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2020)MemFlow: Memory-Driven Data Scheduling With Datapath Co-Design in Accelerators for Large-Scale Inference ApplicationsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2019.292537739:9(1875-1888)Online publication date: Sep-2020
  • (2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
  • (2015)To Match or Not to MatchACM Transactions on Economics and Computation10.1145/27458013:2(1-18)Online publication date: 20-Apr-2015
  • (2015)Single-Call MechanismsACM Transactions on Economics and Computation10.1145/27410273:2(1-60)Online publication date: 20-Apr-2015
  • (2015)Practical Iterative Optimization for the Data CenterACM Transactions on Architecture and Code Optimization10.1145/273904812:2(1-26)Online publication date: 11-May-2015
  • (2015)Secondary Spectrum Auctions for Symmetric and Submodular BiddersACM Transactions on Economics and Computation10.1145/27390413:2(1-25)Online publication date: 20-Apr-2015
  • (2015)Traffic Shaping to Optimize Ad DeliveryACM Transactions on Economics and Computation10.1145/27390103:2(1-20)Online publication date: 20-Apr-2015
  • (2015)GPU Performance and Power Tuning Using Regression TreesACM Transactions on Architecture and Code Optimization10.1145/273628712:2(1-26)Online publication date: 11-May-2015
  • (2015)Generalized Task ParallelismACM Transactions on Architecture and Code Optimization10.1145/272316412:1(1-25)Online publication date: 2-Apr-2015
  • (2015)Improving the Effectiveness of Time-Based Display AdvertisingACM Transactions on Economics and Computation10.1145/27163233:2(1-20)Online publication date: 20-Apr-2015
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media