In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less workspace. MPI implementations are given, as are performance results on the Intel Paragon system.
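For readers who want the shape of the algorithm, the following C sketch shows the broadcast-and-rank-k-update structure that SUMMA-style matrix multiplication uses: at each step, the processes owning the current panel of A broadcast it along their grid row, the owners of the matching panel of B broadcast it along their grid column, and every process applies a local rank-nb update. This is not the paper's code; the function name `summa`, the panel width `nb`, the square process grid, and the divisibility assumptions are all illustrative.

```c
/* A minimal SUMMA-style sketch for C = C + A * B. Assumptions (ours,
 * not the paper's): a square process grid, row-major local blocks,
 * and a panel width nb that divides the local block width kloc. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* a: mloc x kloc, b: kloc x nloc, c: mloc x nloc (all local, row-major).
 * row_comm / col_comm span this process's grid row / grid column. */
void summa(int mloc, int nloc, int kloc, int nb,
           const double *a, const double *b, double *c,
           MPI_Comm row_comm, MPI_Comm col_comm)
{
    int myrow, mycol, pc;
    MPI_Comm_rank(row_comm, &mycol);   /* my grid column */
    MPI_Comm_rank(col_comm, &myrow);   /* my grid row    */
    MPI_Comm_size(row_comm, &pc);      /* grid width     */

    int kglob = kloc * pc;             /* global inner dimension */
    double *apanel = malloc((size_t)mloc * nb * sizeof *apanel);
    double *bpanel = malloc((size_t)nb * nloc * sizeof *bpanel);

    for (int k = 0; k < kglob; k += nb) {
        /* With a square grid, the same index names the grid column
         * holding this panel of A and the grid row holding it in B. */
        int owner = k / kloc;

        if (mycol == owner)            /* pack my nb columns of A */
            for (int i = 0; i < mloc; i++)
                memcpy(&apanel[(size_t)i * nb],
                       &a[(size_t)i * kloc + (k % kloc)],
                       nb * sizeof *apanel);
        MPI_Bcast(apanel, mloc * nb, MPI_DOUBLE, owner, row_comm);

        if (myrow == owner)            /* my nb rows of B are contiguous */
            memcpy(bpanel, &b[(size_t)(k % kloc) * nloc],
                   (size_t)nb * nloc * sizeof *bpanel);
        MPI_Bcast(bpanel, nb * nloc, MPI_DOUBLE, owner, col_comm);

        /* Local rank-nb update; a tuned dgemm would go here. */
        for (int i = 0; i < mloc; i++)
            for (int kk = 0; kk < nb; kk++)
                for (int j = 0; j < nloc; j++)
                    c[(size_t)i * nloc + j] +=
                        apanel[(size_t)i * nb + kk] * bpanel[(size_t)kk * nloc + j];
    }
    free(apanel);
    free(bpanel);
}
```

Each process holds only two panels of workspace beyond its own blocks, which is the property the abstract's "less workspace" refers to.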
Cited By
- Okanovic P, Kwasniewski G, Labini P, Besta M, Vella F and Hoefler T High Performance Unstructured SpMM Computation Using Tensor Cores Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-14)
- Hong Y and Buluç A A Sparsity-Aware Distributed-Memory Algorithm for Sparse-Sparse Matrix Multiplication Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-14)
- Akbudak K (2024). Hypergraph-based locality-enhancing methods for graph operations in Big Data applications, International Journal of High Performance Computing Applications, 38:3, (210-224), Online publication date: 1-May-2024.
- Munch P, Kormann K and Kronbichler M (2021). hyper.deal: An Efficient, Matrix-free Finite-element Library for High-dimensional Partial Differential Equations, ACM Transactions on Mathematical Software, 47:4, (1-34), Online publication date: 31-Dec-2022.
- Liu Y, Shi L, Zhang J and Robertazzi T Layer based partition for matrix multiplication on heterogeneous mesh networks Proceedings of the High Performance Computing Symposium, (1-12)
- Wang M, Huang C and Li J Supporting Very Large Models using Automatic Dataflow Graph Partitioning Proceedings of the Fourteenth EuroSys Conference 2019, (1-17)
- Huang J, Smith T, Henry G and van de Geijn R Strassen's algorithm reloaded Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-12)
- Rico-Gallego J, Díaz-Martín J and Lastovetsky A (2016). Extending τ-Lop to model concurrent MPI communications in multicore clusters, Future Generation Computer Systems, 61:C, (66-82), Online publication date: 1-Aug-2016.
- Kelefouras V, Kritikakou A, Mporas I and Kolonias V (2016). A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures, The Journal of Supercomputing, 72:3, (804-844), Online publication date: 1-Mar-2016.
- Pellauer M, Parashar A, Adler M, Ahsan B, Allmon R, Crago N, Fleming K, Gambhir M, Jaleel A, Krishna T, Lustig D, Maresh S, Pavlov V, Rayess R, Zhai A and Emer J (2015). Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures, ACM Transactions on Computer Systems, 33:3, (1-32), Online publication date: 11-Sep-2015.
- Bae S, Choi J, Qiu J and Fox G Dimension reduction and visualization of large high-dimensional data via interpolation Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, (203-214)
- Nishtala R, Almasi G and Cascaval C Performance without pain = productivity Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, (99-110)
- Krishnan M and Nieplocha J Memory efficient parallel matrix multiplication operation for irregular problems Proceedings of the 3rd conference on Computing frontiers, (229-240)
- Zekri A and Sedukhin S The general matrix multiply-add operation on 2D torus Proceedings of the 20th international conference on Parallel and distributed processing, (309-309)
- Bikshandi G, Guo J, Hoeflinger D, Almasi G, Fraguela B, Garzarán M, Padua D and von Praun C Programming for parallelism and locality with hierarchically tiled arrays Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, (48-57)
- Zekri A and Sedukhin S Computationally efficient parallel matrix-matrix multiplication on the torus Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems, (219-226)
- El-Qawasmeh E, Al-Ayyoub A and Abu-Ghazaleh N (2004). Quick Matrix Multiplication on Clusters of Workstations, Informatica, 15:2, (203-218), Online publication date: 1-Apr-2004.
- Wang C and Sahni S (2001). Matrix Multiplication on the OTIS-Mesh Optoelectronic Computer, IEEE Transactions on Computers, 50:7, (635-646), Online publication date: 1-Jul-2001.
- Chen Y (2000). Automated Tuning of Parallel I/O Systems, IEEE Transactions on Software Engineering, 26:4, (362-383), Online publication date: 1-Apr-2000.
- Petitet A and Dongarra J (1999). Algorithmic Redistribution Methods for Block-Cyclic Decompositions, IEEE Transactions on Parallel and Distributed Systems, 10:12, (1201-1216), Online publication date: 1-Dec-1999.
- Choi J A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
- Agarwal R, Balle S, Gustavson F, Joshi M and Palkar P (1995). A three-dimensional approach to parallel matrix multiplication, IBM Journal of Research and Development, 39:5, (575-582), Online publication date: 1-Sep-1995.
Recommendations
Scalable Parallel Matrix Multiplication on Distributed Memory Parallel Computers
Consider any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity O(N^α), where 2 < α ≤ 3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using ...
Scalable task-based algorithm for multiplication of block-rank-sparse matrices
IA3 '15: Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms
A task-based formulation of the Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum ...
Scalable Parallel Matrix Multiplication on Distributed Memory Parallel Computers
IPDPS '00: Proceedings of the 14th International Symposium on Parallel and Distributed Processing
... with time complexity O(N^α), where 2 < α ≤ 3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using N^α/log N processors. Such a parallel computation is cost optimal and matches the performance of PRAM. Furthermore, our ...
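The cost-optimality claim in this snippet is plain arithmetic: N^α/log N processors running for O(log N) time perform total work matching the sequential bound. A one-line check in LaTeX (the symbols p and T_p are ours, not the abstract's):

```latex
% Processor count times parallel time equals the sequential work bound.
\[
  p \cdot T_p(N) = \frac{N^{\alpha}}{\log N} \cdot O(\log N) = O(N^{\alpha}),
  \qquad 2 < \alpha \le 3 .
\]
```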