Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3295500.3356181acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Published: 17 November 2019 Publication History

Abstract

We propose COSMA: a parallel matrix-matrix multiplication algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes. The key idea behind COSMA is to derive an optimal (up to a factor of 0.03% for 10MB of fast memory) sequential schedule and then parallelize it, preserving I/O optimality. To achieve this, we use the red-blue pebble game to precisely model MMM dependencies and derive a constructive and tight sequential and parallel I/O lower bound proofs. Compared to 2D or 3D algorithms, which fix processor decomposition upfront and then map it to the matrix dimensions, it reduces communication volume by up to √ times. COSMA outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's peak performance. Our work does not require any hand tuning and is maintained as an open source implementation.

References

[1]
R. C. Agarwal et al. 1995. A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. (1995).
[2]
Alok Aggarwal and S. Vitter, Jeffrey. [n. d.]. The Input/Output Complexity of Sorting and Related Problems. CACM (Sept. [n. d.]).
[3]
Lars Arge et al. 2008. In SPAA.
[4]
Ariful Azad et al. 2015. Parallel triangle counting and enumeration using matrix algebra. In IPDPSW.
[5]
Grey Ballard et al. 2012. Graph expansion and communication costs of fast matrix multiplication. JACM (2012).
[6]
Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. CoRR abs/1802.09941 (2018).
[7]
Michael A Bender et al. 2010. Optimal sparse matrix dense vector multiplication in the I/O-model. TOCS (2010).
[8]
Maciej Besta et al. 2017. SlimSell: A Vectorizable Graph Representation for Breadth-First Search. In IPDPS.
[9]
Robert D Blumofe et al. 1996. An analysis of dag-consistent distributed shared-memory algorithms. In SPAA.
[10]
Lynn Elliot Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation.
[11]
Gregory J. Chaitin et al. 1981. Register allocation via coloring. Computer Languages (1981).
[12]
S. M. Chan. 2013. Just a Pebble Game. In CCC.
[13]
Franàoise Chatelin. 2012. Eigenvalues of Matrices: Revised Edition. Siam.
[14]
Jaeyoung Choi et al. 1992. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In FRONTIERS.
[15]
Jaeyoung Choi et al. 1996. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines. Sci. Program. (1996).
[16]
J. Choi et al. 1996. ScaLAPACK: a portable linear algebra library for distributed memory computers --- design issues and performance. Comp. Phys. Comm. (1996).
[17]
Jaeyoung Choi, David W Walker, and Jack J Dongarra. 1994. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency: Practice and Experience 6, 7 (1994), 543--570.
[18]
Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2009. Introduction to algorithms. MIT press.
[19]
Paolo D'Alberto and Alexandru Nicolau. 2008. Using recursion to boost ATLAS's performance. In High-Performance Computing. Springer.
[20]
Alain Darte. 1999. On the complexity of loop fusion. In PACT.
[21]
Mauro Del Ben et al. 2015. Enabling simulation at the fifth rung of DFT: Large scale RPA calculations with excellent time to solution. Comp. Phys. Comm. (2015).
[22]
J. Demmel et al. 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. In IPDPS.
[23]
V. Elango et al. 2013. Data access complexity: The red/blue pebble game revisited. Technical Report.
[24]
Geoffrey C Fox. 1988. Solving problems on concurrent processors. (1988).
[25]
Geoffrey C Fox, Steve W Otto, and Anthony JG Hey. 1987. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel computing 4, 1 (1987), 17--31.
[26]
M. Frigo et al. 1999. Cache-oblivious algorithms. In FOCS.
[27]
R. Gerstenberger et al. 2013. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided. In SC.
[28]
John R. Gilbert et al. 1979. The Pebbling Problem is Complete in Polynomial Space. In STOC.
[29]
Gero Greiner. 2012. Sparse matrix computations and their I/O complexity. Ph.D. Dissertation. Technische Universität München.
[30]
A. Gupta and V. Kumar. 1993. Scalability of Parallel Algorithms for Matrix Multiplication. In ICPP.
[31]
Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speedup mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 47.
[32]
T. Hoefler et al. 2015. Remote Memory Access Programming in MPI-3. TOPC (2015).
[33]
Dror Irony et al. 2004. Communication Lower Bounds for Distributed-memory Matrix Multiplication. JPDC (2004).
[34]
Hong Jia-Wei and Hsiang-Tsung Kung. 1981. I/O complexity: The red-blue pebble game. In STOC.
[35]
Ken Kennedy and Kathryn S McKinley. 1993. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In LCPC.
[36]
Jeremy Kepner et al. 2016. Mathematical foundations of the GraphBLAS. arXiv:1606.05790 (2016).
[37]
Penporn Koanantakool et al. 2016. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In IPDPS.
[38]
Alfio Lazzaro et al. 2017. Increasing the efficiency of sparse matrix-matrix multiplication with a 2.5 D algorithm and one-sided MPI. In PASC.
[39]
Hyuk-Jae Lee et al. 1997. Generalized Cannon's Algorithm for Parallel Matrix Multiplication. In ICS.
[40]
Quanquan Liu. 2018. Red-Blue and Standard Pebble Games : Complexity and Applications in the Sequential and Parallel Models.
[41]
Carl D Meyer. 2000. Matrix analysis and applied linear algebra. SIAM.
[42]
Sraban Kumar Mohanty. 2010. I/O Efficient Algorithms for Matrix Computations. CoRR abs/1006.1307 (2010).
[43]
Andrew Y Ng et al. 2002. On spectral clustering: Analysis and an algorithm. In NIPS.
[44]
Donald W. Rogers. 2003. Computational Chemistry Using the PC. John Wiley & Sons, Inc.
[45]
Jacob Scott et al. 2015. Matrix multiplication I/O-complexity by path routing. In SPAA.
[46]
Ravi Sethi. 1973. Complete Register Allocation Problems. In STOC.
[47]
Tyler Michael Smith and Robert A. van de Geijn. 2017. Pushing the Bounds for Matrix-Matrix Multiplication. CoRR abs/1702.02017 (2017).
[48]
Raffaele Solcà, Anton Kozhevnikov, Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Thomas C. Schulthess. 2015. Efficient Implementation of Quantum Materials Simulations on Distributed CPU-GPU Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 10, 12 pages.
[49]
Edgar Solomonik et al. 2013. Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions. In IPDPS.
[50]
Edgar Solomonik et al. 2016. Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra omputations. TOPC (2016).
[51]
E. Solomonik et al. 2017. Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication. In SC.
[52]
Edgar Solomonik and James Demmel. 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In EuroPar.
[53]
Volker Strassen. 1969. Gaussian Elimination is Not Optimal. Numer. Math. (1969).
[54]
Sivan Toledo. 1999. A survey of out-of-core algorithms in numerical linear algebra. External Memory Algorithms and Visualization (1999).
[55]
Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997), 255--274.
[56]
Jeffrey S Vetter and Michael O McCracken. 2001. Statistical scalability analysis of communication operations in distributed applications. ACM SIGPLAN Notices 36, 7 (2001), 123--132.
[57]
Michael E. Wolfand Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. In PLDI.
[58]
M. Wolfe. 1989. More Iteration Space Tiling. In SC.
[59]
Nan Xiong. 2018. Optimizing Tall-and-Skinny Matrix-Matrix Multiplication on GPUs. Ph.D. Dissertation. UC Riverside.
[60]
Qinqing Zheng and John D. Lafferty. 2016. Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent. CoRR (2016).

Cited By

View all
  • (2024)SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC ApplicationsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656616(413-425)Online publication date: 30-May-2024
  • (2024)Fast Kronecker Matrix-Matrix Multiplication on GPUsProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638489(390-403)Online publication date: 2-Mar-2024
  • (2024)Brief Announcement: Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offsProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3660269(285-287)Online publication date: 17-Jun-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2019

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Funding Sources

  • Platform for Advanced Scientific Computing (PASC)
  • European Research Council (ERC)

Conference

SC '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)279
  • Downloads (Last 6 weeks)20
Reflects downloads up to 24 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC ApplicationsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656616(413-425)Online publication date: 30-May-2024
  • (2024)Fast Kronecker Matrix-Matrix Multiplication on GPUsProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638489(390-403)Online publication date: 2-Mar-2024
  • (2024)Brief Announcement: Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offsProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3660269(285-287)Online publication date: 17-Jun-2024
  • (2024)Tightening I/O Lower Bounds through the Hourglass Dependency PatternProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3659986(183-193)Online publication date: 17-Jun-2024
  • (2024)AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth CostIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.339780035:8(1331-1344)Online publication date: Aug-2024
  • (2024)Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency AnalysisIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.3303431(1-20)Online publication date: 2024
  • (2024)Many-Body Electronic Correlation Energy using Krylov Subspace Linear SolversProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00066(1-15)Online publication date: 17-Nov-2024
  • (2024)Uncut-GEMMs: Communication-Aware Matrix Multiplication on Multi-GPU Nodes2024 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER59578.2024.00020(143-154)Online publication date: 24-Sep-2024
  • (2023)Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph QueriesACM Computing Surveys10.1145/360493256:2(1-40)Online publication date: 15-Sep-2023
  • (2023)Automatic Generation of Distributed-Memory Mappings for Tensor ComputationsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607096(1-13)Online publication date: 12-Nov-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media