
DOI: 10.5555/1898699.1898840
Article

The general matrix multiply-add operation on 2D torus

Published: 25 April 2006

Abstract

In this paper, the index space of the (n×n)-matrix multiply-add problem C = C + A·B is represented as a 3D n×n×n torus. All possible time-scheduling functions that activate the computation and data rolling inside the 3D torus index space are determined. To maximize efficiency when solving a single problem, we map the computations onto a 2D n×n toroidal array processor. All optimal 2D data allocations that solve the problem in n multiply-add-roll steps are obtained; the well-known Cannon's algorithm is one of the resulting allocations. We use the optimal data allocations to describe all variants of the GEMM operation on the 2D toroidal array processor. By controlling the data movement, the transposition operation is avoided in 75% of the GEMM variants; the remaining 25% require only a single matrix transpose. Finally, we describe four versions of the GEMM operation covering the possible layouts of the data initially loaded into the array processor.
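The n multiply-add-roll steps described in the abstract can be illustrated with Cannon's algorithm, one of the optimal allocations the paper derives. Below is a minimal NumPy simulation, assuming one matrix element per torus cell; the function name `cannon_multiply_add` is illustrative, not from the paper:

```python
import numpy as np

def cannon_multiply_add(C, A, B):
    """Simulate C = C + A @ B on an n x n torus via Cannon's algorithm.

    Cell (i, j) of the torus holds one element of each of A, B, C.
    """
    n = A.shape[0]
    A = A.copy()
    B = B.copy()
    # Initial alignment (skew): roll row i of A left by i steps,
    # and column j of B up by j steps.
    for i in range(n):
        A[i, :] = np.roll(A[i, :], -i)
    for j in range(n):
        B[:, j] = np.roll(B[:, j], -j)
    # n multiply-add-roll steps.
    for _ in range(n):
        C += A * B                  # local multiply-add in every cell
        A = np.roll(A, -1, axis=1)  # roll A one step left along each row
        B = np.roll(B, -1, axis=0)  # roll B one step up along each column
    return C

# Usage: check the torus simulation against a direct matrix product.
rng = np.random.default_rng(0)
A = rng.random((4, 4))
B = rng.random((4, 4))
C0 = rng.random((4, 4))
result = cannon_multiply_add(C0.copy(), A, B)
assert np.allclose(result, C0 + A @ B)
```

After the initial skew, cell (i, j) holds A[i, (i+j+k) mod n] and B[(i+j+k) mod n, j] at step k, so the n local products accumulate exactly the inner product needed for C[i, j].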

References

[1]
R. Agarwal, F. Gustavson, and M. Zubair. A high performance matrix multiplication algorithm on a distributed-memory parallel computer, using overlapped communication. IBM J. of Res. and Develop., 38(6):673-681, 1994.
[2]
L. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
[3]
J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: parallel universal matrix multiplication algorithms. Concurrency: Practice and Experience, 6(7):543-570, Oct. 1994.
[4]
W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Elsevier, 2004.
[5]
A. Darte and Y. Robert. Constructive methods for scheduling uniform loop nests. IEEE Trans. Parallel Distributed Systems, 5(8):814-822, Aug. 1994.
[6]
J. J. Dongarra, J. D. Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Software, 16:1-17, 1990.
[7]
J. J. Dongarra, J. D. Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Software, 14:1-17, 1988.
[8]
G. Fox, S. Otto, and A. Hey. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel Computing, 4:17-31, 1987.
[9]
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, 1989.
[10]
S. Huss-Lederman, E. M. Jacobson, A. Tsao, and G. Zhan. Matrix multiplication on the Intel Touchstone Delta. Concurrency: Practice and Experience, 6(7):571-594, Oct. 1994.
[11]
B. Kågström, P. Ling, and C. V. Loan. High performance GEMM-based level 3 BLAS: Sample routines for double precision real data. In High Performance Computing II, pages 269-281. North-Holland, 1991.
[12]
B. Kågström, P. Ling, and C. V. Loan. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Software, 24(3):268-302, 1998.
[13]
S. Kung. VLSI Array Processors. Prentice Hall, 1988.
[14]
D. Lavenier, P. Quinton, and S. Rajopadhye. Advanced systolic design. In Digital Signal Processing for Multimedia Systems, Signal Processing Series, chapter 23, pages 657-692. Marcel Dekker, 1999.
[15]
C. L. Lawson, R. J. Hanson, R. J. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Software, 5:308-323, 1979.
[16]
H. J. Lee and J. A. Fortes. Modular mappings and data distribution independent computations. Parallel Processing Letters, 7(2):169-180, 1997.
[17]
C. H. Sequin. Doubly twisted torus networks for VLSI processor arrays. In the 8th Annual Symposium on Computer Architecture, Minneapolis, Minnesota, USA, 1981.
[18]
R. van de Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm. Technical Report TR-95-13, The University of Texas, Apr. 1995.
[19]
A. S. Zekri and S. G. Sedukhin. Computationally efficient parallel matrix-matrix multiplication on the torus. In the 6th Int. Symp. on High Performance Computing, Nara City, Japan, Sept. 2005.
[20]
A. S. Zekri and S. G. Sedukhin. Matrix transpose on 2D torus. Technical Report 2005-1-001, The University of Aizu, Aizu Wakamatsu, Japan, Dec. 2005.

Cited By

  • (2017) A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems. The Journal of Supercomputing, 73(6):2705-2729, June 2017. doi:10.1007/s11227-016-1951-0
  • (2012) Bamboo. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1-11, Nov. 2012. doi:10.5555/2388996.2389050
  • (2007) Performance evaluation of basic linear algebra subroutines on a matrix co-processor. Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics, pages 1190-1199, Sept. 2007. doi:10.5555/1786194.1786334


Published In

IPDPS'06: Proceedings of the 20th international conference on Parallel and distributed processing
April 2006
399 pages
ISBN:1424400546

Sponsors

  • IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

IEEE Computer Society, United States


