
DOI: 10.5555/1898699.1898840
Article

The general matrix multiply-add operation on 2D torus

Published: 25 April 2006

Abstract

In this paper, the index space of the (n×n)-matrix multiply-add problem C = C + A·B is represented as a 3D n×n×n torus. All possible time-scheduling functions that activate the computation and data rolling inside the 3D torus index space are determined. To maximize efficiency when solving a single problem, we map the computations onto a 2D n×n toroidal array processor. All optimal 2D data allocations that solve the problem in n multiply-add-roll steps are obtained; the well-known Cannon's algorithm is one of the resulting allocations. We use the optimal data allocations to describe all variants of the GEMM operation on the 2D toroidal array processor. By controlling the data movement, the transposition operation is avoided in 75% of the GEMM variants; the remaining 25% require only a single matrix transpose. Finally, we describe four versions of the GEMM operation covering the possible layouts of the data initially loaded into the array processor.
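The n multiply-add-roll steps described in the abstract can be illustrated with Cannon's algorithm, one of the optimal allocations the paper derives. Below is a minimal NumPy simulation, assuming one matrix element per torus cell; the function name `cannon_multiply_add` is illustrative, not from the paper:

```python
import numpy as np

def cannon_multiply_add(C, A, B):
    """Simulate C = C + A @ B on an n x n torus via Cannon's algorithm.

    Cell (i, j) of the torus holds one element of each of A, B, C.
    """
    n = A.shape[0]
    A = A.copy()
    B = B.copy()
    # Initial alignment (skew): roll row i of A left by i steps,
    # and column j of B up by j steps.
    for i in range(n):
        A[i, :] = np.roll(A[i, :], -i)
    for j in range(n):
        B[:, j] = np.roll(B[:, j], -j)
    # n multiply-add-roll steps.
    for _ in range(n):
        C += A * B                  # local multiply-add in every cell
        A = np.roll(A, -1, axis=1)  # roll A one step left along each row
        B = np.roll(B, -1, axis=0)  # roll B one step up along each column
    return C

# Usage: check the torus simulation against a direct matrix product.
rng = np.random.default_rng(0)
A = rng.random((4, 4))
B = rng.random((4, 4))
C0 = rng.random((4, 4))
result = cannon_multiply_add(C0.copy(), A, B)
assert np.allclose(result, C0 + A @ B)
```

After the initial skew, cell (i, j) holds A[i, (i+j+k) mod n] and B[(i+j+k) mod n, j] at step k, so the n local products accumulate exactly the inner product needed for C[i, j].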

References

[1]
R. Agarwal, F. Gustavson, and M. Zubair. A high performance matrix multiplication algorithm on a distributed-memory parallel computer, using overlapped communication. IBM J. of Res. and Develop., 38(6):673-681, 1994.
[2]
L. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
[3]
J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: parallel universal matrix multiplication algorithms. Concurrency: Practice and Experience, 6(7):543-570, Oct. 1994.
[4]
W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Elsevier, 2004.
[5]
A. Darte and Y. Robert. Constructive methods for scheduling uniform loop nests. IEEE Trans. Parallel Distributed Systems, 5(8):814-822, Aug. 1994.
[6]
J. J. Dongarra, J. D. Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Software, 16:1-17, 1990.
[7]
J. J. Dongarra, J. D. Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Software, 14:1-17, 1988.
[8]
G. Fox, S. Otto, and A. Hey. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel Computing, 4:17-31, 1987.
[9]
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, 1989.
[10]
S. Huss-Lederman, E. M. Jacobson, A. Tsao, and G. Zhan. Matrix multiplication on the Intel Touchstone Delta. Concurrency: Practice and Experience, 6(7):571-594, Oct. 1994.
[11]
B. Kågström, P. Ling, and C. V. Loan. High performance GEMM-based level 3 BLAS: Sample routines for double precision real data. In High Performance Computing II, pages 269-281. North-Holland, 1991.
[12]
B. Kågström, P. Ling, and C. V. Loan. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Software, 24(3):268-302, 1998.
[13]
S. Kung. VLSI Array Processors. Prentice Hall, 1988.
[14]
D. Lavenier, P. Quinton, and S. Rajopadhye. Advanced systolic design. In Digital Signal Processing for Multimedia Systems, Signal Processing Series, chapter 23, pages 657-692. Marcel Dekker, 1999.
[15]
C. L. Lawson, R. J. Hanson, R. J. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Software, 5:308-323, 1979.
[16]
H. J. Lee and J. A. Fortes. Modular mappings and data distribution independent computations. Parallel Processing Letters, 7(2):169-180, 1997.
[17]
C. H. Sequin. Doubly twisted torus networks for VLSI processor arrays. In the 8th Annual Symposium on Computer Architecture, Minneapolis, Minnesota, USA, 1981.
[18]
R. van de Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm. Technical Report TR-95-13, The University of Texas, Apr. 1995.
[19]
A. S. Zekri and S. G. Sedukhin. Computationally efficient parallel matrix-matrix multiplication on the torus. In the 6th Int. Symp. on High Performance Computing, Nara City, Japan, Sept. 2005.
[20]
A. S. Zekri and S. G. Sedukhin. Matrix transpose on 2D torus. Technical Report 2005-1-001, The University of Aizu, Aizu Wakamatsu, Japan, Dec. 2005.

Cited By

  • (2017) A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems. The Journal of Supercomputing, 73(6):2705-2729, June 2017. doi:10.1007/s11227-016-1951-0
  • (2012) Bamboo. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1-11, Nov. 2012. doi:10.5555/2388996.2389050
  • (2007) Performance evaluation of basic linear algebra subroutines on a matrix co-processor. Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics, pages 1190-1199, Sept. 2007. doi:10.5555/1786194.1786334


Published In

IPDPS'06: Proceedings of the 20th international conference on Parallel and distributed processing
April 2006
399 pages
ISBN:1424400546

Sponsors

  • IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

IEEE Computer Society, United States


