research-article

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Authors:

Grzegorz Kwasniewski,

Joost VandeVondele,

Raffaele Solcà,

Torsten HoeflerAuthors Info & Claims

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 24, Pages 1 - 22

https://doi.org/10.1145/3295500.3356181

Published: 17 November 2019 Publication History

Abstract

We propose COSMA: a parallel matrix-matrix multiplication algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes. The key idea behind COSMA is to derive an optimal (up to a factor of 0.03% for 10MB of fast memory) sequential schedule and then parallelize it, preserving I/O optimality. To achieve this, we use the red-blue pebble game to precisely model MMM dependencies and derive a constructive and tight sequential and parallel I/O lower bound proofs. Compared to 2D or 3D algorithms, which fix processor decomposition upfront and then map it to the matrix dimensions, it reduces communication volume by up to √ times. COSMA outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's peak performance. Our work does not require any hand tuning and is maintained as an open source implementation.

References

[1]

R. C. Agarwal et al. 1995. A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. (1995).

[2]

Alok Aggarwal and S. Vitter, Jeffrey. [n. d.]. The Input/Output Complexity of Sorting and Related Problems. CACM (Sept. [n. d.]).

[3]

Lars Arge et al. 2008. In SPAA.

[4]

Ariful Azad et al. 2015. Parallel triangle counting and enumeration using matrix algebra. In IPDPSW.

[5]

Grey Ballard et al. 2012. Graph expansion and communication costs of fast matrix multiplication. JACM (2012).

[6]

Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. CoRR abs/1802.09941 (2018).

[7]

Michael A Bender et al. 2010. Optimal sparse matrix dense vector multiplication in the I/O-model. TOCS (2010).

[8]

Maciej Besta et al. 2017. SlimSell: A Vectorizable Graph Representation for Breadth-First Search. In IPDPS.

[9]

Robert D Blumofe et al. 1996. An analysis of dag-consistent distributed shared-memory algorithms. In SPAA.

[10]

Lynn Elliot Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation.

[11]

Gregory J. Chaitin et al. 1981. Register allocation via coloring. Computer Languages (1981).

[12]

S. M. Chan. 2013. Just a Pebble Game. In CCC.

[13]

Franàoise Chatelin. 2012. Eigenvalues of Matrices: Revised Edition. Siam.

[14]

Jaeyoung Choi et al. 1992. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In FRONTIERS.

[15]

Jaeyoung Choi et al. 1996. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines. Sci. Program. (1996).

[16]

J. Choi et al. 1996. ScaLAPACK: a portable linear algebra library for distributed memory computers --- design issues and performance. Comp. Phys. Comm. (1996).

[17]

Jaeyoung Choi, David W Walker, and Jack J Dongarra. 1994. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency: Practice and Experience 6, 7 (1994), 543--570.

[18]

Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2009. Introduction to algorithms. MIT press.

[19]

Paolo D'Alberto and Alexandru Nicolau. 2008. Using recursion to boost ATLAS's performance. In High-Performance Computing. Springer.

[20]

Alain Darte. 1999. On the complexity of loop fusion. In PACT.

[21]

Mauro Del Ben et al. 2015. Enabling simulation at the fifth rung of DFT: Large scale RPA calculations with excellent time to solution. Comp. Phys. Comm. (2015).

[22]

J. Demmel et al. 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. In IPDPS.

[23]

V. Elango et al. 2013. Data access complexity: The red/blue pebble game revisited. Technical Report.

[24]

Geoffrey C Fox. 1988. Solving problems on concurrent processors. (1988).

[25]

Geoffrey C Fox, Steve W Otto, and Anthony JG Hey. 1987. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel computing 4, 1 (1987), 17--31.

[26]

M. Frigo et al. 1999. Cache-oblivious algorithms. In FOCS.

[27]

R. Gerstenberger et al. 2013. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided. In SC.

[28]

John R. Gilbert et al. 1979. The Pebbling Problem is Complete in Polynomial Space. In STOC.

[29]

Gero Greiner. 2012. Sparse matrix computations and their I/O complexity. Ph.D. Dissertation. Technische Universität München.

[30]

A. Gupta and V. Kumar. 1993. Scalability of Parallel Algorithms for Matrix Multiplication. In ICPP.

[31]

Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speedup mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 47.

[32]

T. Hoefler et al. 2015. Remote Memory Access Programming in MPI-3. TOPC (2015).

[33]

Dror Irony et al. 2004. Communication Lower Bounds for Distributed-memory Matrix Multiplication. JPDC (2004).

[34]

Hong Jia-Wei and Hsiang-Tsung Kung. 1981. I/O complexity: The red-blue pebble game. In STOC.

[35]

Ken Kennedy and Kathryn S McKinley. 1993. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In LCPC.

[36]

Jeremy Kepner et al. 2016. Mathematical foundations of the GraphBLAS. arXiv:1606.05790 (2016).

[37]

Penporn Koanantakool et al. 2016. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In IPDPS.

[38]

Alfio Lazzaro et al. 2017. Increasing the efficiency of sparse matrix-matrix multiplication with a 2.5 D algorithm and one-sided MPI. In PASC.

[39]

Hyuk-Jae Lee et al. 1997. Generalized Cannon's Algorithm for Parallel Matrix Multiplication. In ICS.

[40]

Quanquan Liu. 2018. Red-Blue and Standard Pebble Games : Complexity and Applications in the Sequential and Parallel Models.

[41]

Carl D Meyer. 2000. Matrix analysis and applied linear algebra. SIAM.

[42]

Sraban Kumar Mohanty. 2010. I/O Efficient Algorithms for Matrix Computations. CoRR abs/1006.1307 (2010).

[43]

Andrew Y Ng et al. 2002. On spectral clustering: Analysis and an algorithm. In NIPS.

[44]

Donald W. Rogers. 2003. Computational Chemistry Using the PC. John Wiley & Sons, Inc.

[45]

Jacob Scott et al. 2015. Matrix multiplication I/O-complexity by path routing. In SPAA.

[46]

Ravi Sethi. 1973. Complete Register Allocation Problems. In STOC.

[47]

Tyler Michael Smith and Robert A. van de Geijn. 2017. Pushing the Bounds for Matrix-Matrix Multiplication. CoRR abs/1702.02017 (2017).

[48]

Raffaele Solcà, Anton Kozhevnikov, Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Thomas C. Schulthess. 2015. Efficient Implementation of Quantum Materials Simulations on Distributed CPU-GPU Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 10, 12 pages.

Digital Library

[49]

Edgar Solomonik et al. 2013. Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions. In IPDPS.

[50]

Edgar Solomonik et al. 2016. Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra omputations. TOPC (2016).

[51]

E. Solomonik et al. 2017. Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication. In SC.

[52]

Edgar Solomonik and James Demmel. 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In EuroPar.

[53]

Volker Strassen. 1969. Gaussian Elimination is Not Optimal. Numer. Math. (1969).

[54]

Sivan Toledo. 1999. A survey of out-of-core algorithms in numerical linear algebra. External Memory Algorithms and Visualization (1999).

[55]

Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997), 255--274.

[56]

Jeffrey S Vetter and Michael O McCracken. 2001. Statistical scalability analysis of communication operations in distributed applications. ACM SIGPLAN Notices 36, 7 (2001), 123--132.

Digital Library

[57]

Michael E. Wolfand Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. In PLDI.

[58]

M. Wolfe. 1989. More Iteration Space Tiling. In SC.

[59]

Nan Xiong. 2018. Optimizing Tall-and-Skinny Matrix-Matrix Multiplication on GPUs. Ph.D. Dissertation. UC Riverside.

[60]

Qinqing Zheng and John D. Lafferty. 2016. Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent. CoRR (2016).

Cited By

Haghi PTan CGuo AWu CLiu DLi ASkjellum AGeng THerbordt M(2024)SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC ApplicationsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656616(413-425)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656616
Jangda AYadav MLee IChabbi MSteuwer M(2024)Fast Kronecker Matrix-Matrix Multiplication on GPUsProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638489(390-403)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638489
Böhnlein TPapp PYzelman AAgrawal KPetrank E(2024)Brief Announcement: Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offsProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3660269(285-287)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3626183.3660269
Show More Cited By

Index Terms

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Recommendations

Qr factorization revisited
t-Pebbling and Extensions

Graph pebbling is the study of moving discrete pebbles from certain initial distributions on the vertices of a graph to various target distributions via pebbling moves. A pebbling move removes two pebbles from a vertex and places one pebble on one of ...
Optimal Pebbling on Grids

Given a distribution of pebbles on the vertices of a connected graph G, a pebbling move is defined as the removal of two pebbles from some vertex and the placement of one of these on an adjacent vertex. The pebbling number of a graph G is the smallest ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2019

1921 pages

ISBN:9781450362290

DOI:10.1145/3295500

General Chair:
Michela Taufer,
Program Chairs:
Pavan Balaji,
Antonio J. Peña

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Funding Sources

Platform for Advanced Scientific Computing (PASC)
European Research Council (ERC)

Conference

SC '19

Sponsor:

SIGHPC

SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis

November 17 - 19, 2019

Colorado, Denver

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

57
Total Citations
View Citations
1,663
Total Downloads

Downloads (Last 12 months)279
Downloads (Last 6 weeks)20

Reflects downloads up to 24 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Haghi PTan CGuo AWu CLiu DLi ASkjellum AGeng THerbordt M(2024)SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC ApplicationsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656616(413-425)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656616
Jangda AYadav MLee IChabbi MSteuwer M(2024)Fast Kronecker Matrix-Matrix Multiplication on GPUsProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638489(390-403)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638489
Böhnlein TPapp PYzelman AAgrawal KPetrank E(2024)Brief Announcement: Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offsProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3660269(285-287)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3626183.3660269
Eyraud-Dubois LIooss GLangou JRastello FAgrawal KPetrank E(2024)Tightening I/O Lower Bounds through the Hourglass Dependency PatternProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3659986(183-193)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3626183.3659986
Chen JLi SGuo RYuan JHoefler T(2024)AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth CostIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.339780035:8(1331-1344)Online publication date: Aug-2024
https://doi.org/10.1109/TPDS.2024.3397800
Besta MHoefler T(2024)Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency AnalysisIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.3303431(1-20)Online publication date: 2024
https://doi.org/10.1109/TPAMI.2023.3303431
Shah SZhang BHuang HPask JSuryanarayana PChow E(2024)Many-Body Electronic Correlation Energy using Krylov Subspace Linear SolversProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00066(1-15)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00066
Anastasiadis PPapadopoulou NKoziris NGoumas G(2024)Uncut-GEMMs: Communication-Aware Matrix Multiplication on Multi-GPU Nodes2024 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER59578.2024.00020(143-154)Online publication date: 24-Sep-2024
https://doi.org/10.1109/CLUSTER59578.2024.00020
Besta MGerstenberger RPeter EFischer MPodstawski MBarthels CAlonso GHoefler T(2023)Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph QueriesACM Computing Surveys10.1145/360493256:2(1-40)Online publication date: 15-Sep-2023
https://dl.acm.org/doi/10.1145/3604932
Kong MAbu Yosef RRountev ASadayappan PMohror KArnold DBadia R(2023)Automatic Generation of Distributed-Memory Mappings for Tensor ComputationsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607096(1-13)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607096
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents