DOI: 10.1145/1345206.1345227

SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

Published: 20 February 2008

Abstract

This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.
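The scheduling mechanism the abstract describes — enqueue block operations, track dependencies between them, then execute out of order once inputs are ready — can be sketched in miniature. The following is a hypothetical illustration, not the SuperMatrix API: the `Runtime` class, `enqueue` signature, and scalar "blocks" are invented for exposition, and a real system would also handle write-after-read hazards and dispatch ready tasks to worker threads.

```python
# Toy sketch of superscalar-style scheduling over blocks (hypothetical,
# not the actual SuperMatrix interface). Each enqueued task names the
# blocks it reads and writes; a task depends on the last task that wrote
# any block it touches (read-after-write and write-after-write hazards),
# and any task whose dependencies have completed may run, in any order.

class Runtime:
    def __init__(self):
        self.tasks = []        # list of (func, dependency-index set)
        self.last_writer = {}  # block name -> index of last writing task

    def enqueue(self, func, reads, writes):
        deps = {self.last_writer[b]
                for b in reads + writes if b in self.last_writer}
        idx = len(self.tasks)
        for b in writes:
            self.last_writer[b] = idx
        self.tasks.append((func, deps))

    def run(self):
        done, order = set(), []
        pending = list(range(len(self.tasks)))
        while pending:
            for i in pending:  # pick any ready task; order among ready tasks is free
                func, deps = self.tasks[i]
                if deps <= done:
                    func()
                    done.add(i)
                    order.append(i)
                    pending.remove(i)
                    break
        return order

# Blocks of a matrix, reduced here to scalars for brevity.
blocks = {"A": 2.0, "B": 3.0}
rt = Runtime()
rt.enqueue(lambda: blocks.update(A=blocks["A"] ** 2), reads=["A"], writes=["A"])
rt.enqueue(lambda: blocks.update(B=blocks["B"] + 1.0), reads=["B"], writes=["B"])
# The third task must wait for both writers above; tasks 0 and 1 are independent.
rt.enqueue(lambda: blocks.update(A=blocks["A"] + blocks["B"]),
           reads=["A", "B"], writes=["A"])
order = rt.run()
```

Only the `enqueue` calls appear in library-facing code, so the algorithm is written with no visible parallelism; changing the selection policy inside `run()` swaps the scheduling heuristic without touching the enqueueing code, which mirrors the separation of concerns the paper claims.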

References

[1]
C. Addison, Y. Ren, and M. van Waveren. OpenMP issues arising in the development of parallel BLAS and LAPACK libraries. Scientific Programming, 11(2), 2003.
[2]
R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In SC '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225--233, New York, NY, USA, 1989.
[3]
E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. LAPACK Users' guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999.
[4]
Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta. CellSs: A programming model for the Cell BE architecture. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, pages 5--15, Tampa, FL, USA, November 2006.
[5]
Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, The University of Texas at Austin, 2006.
[6]
Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Orti, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software, 31(1):1--26, March 2005.
[7]
Paolo Bientinesi, Brian Gunter, and Robert van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software. To appear.
[8]
Paolo Bientinesi, Enrique S. Quintana-Orti, and Robert A. van de Geijn. Representing linear algebra algorithms in code: The FLAME application programming interfaces. ACM Transactions on Mathematical Software, 31(1):27--59, March 2005.
[9]
Ernie Chan, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116--125, San Diego, CA, USA, June 2007.
[10]
Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91--99, Austin, TX, USA, September 2007.
[11]
S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast matrix multiplication. IEEE Transactions on Parallel and Distributed Systems, 13(11):1105--1123, 2002.
[12]
J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120--127. IEEE Computer Society Press, 1992.
[13]
Timothy Collins and James C. Browne. Matrix++: An object-oriented environment for parallel high-performance matrix computations. In Proceedings of the Hawaii International Conference on Systems and Software, pages 202--211, 1995.
[14]
Timothy Scott Collins. Efficient Matrix Computations through Hierarchical Type Specifications. PhD thesis, The University of Texas at Austin, 1996.
[15]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1--17, March 1990.
[16]
Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3--45, 2004.
[17]
Kazushige Goto. http://www.tacc.utexas.edu/resources/software.
[18]
Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software, 34(3), 2008.
[19]
John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Transactions on Mathematical Software, 27(4):422--455, December 2001.
[20]
Fred G. Gustavson, Lars Karlsson, and Bo Kagstrom. Three algorithms for Cholesky factorization on distributed memory using packed storage. In Applied Parallel Computing. State of the Art in Scientific Computing, pages 550--559. Springer Berlin / Heidelberg, 2007.
[21]
Greg Henry. BLAS based on block data structures. Theory Center Technical Report CTC92TR89, Cornell University, February 1992.
[22]
Jose Ramon Herrero. A framework for efficient execution of matrix computations. PhD thesis, Polytechnic University of Catalonia, Spain, 2006.
[23]
James Kahle, Michael Day, Peter Hofstee, Charles Johns, Theodore Maeurer, and David Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589--604, September 2005.
[24]
Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the Cell processor using Cholesky factorization. Technical Report UT-CS-07-596, Innovative Computing Laboratory, University of Tennessee, April 2007.
[25]
Jakub Kurzak and Jack Dongarra. Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. LAPACK Working Note 178, Technical Report UT-CS-06-581, University of Tennessee, September 2006.
[26]
Charles Leiserson and Aske Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 31, 1998.
[27]
Tze Meng Low and Robert van de Geijn. An API for manipulating matrices stored by blocks. FLAME Working Note #12, TR-2004-15, Department of Computer Sciences, The University of Texas at Austin, May 2004.
[28]
Bryan Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Orti, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In Euro-Par '07: Proceedings of the Thirteenth International European Conference on Parallel and Distributed Computing, pages 748--757, Rennes, France, August 2007.
[29]
N. Park, B. Hong, and V. K. Prasanna. Tiling, block data layout, and memory hierarchy performance. IEEE Transactions on Parallel and Distributed Systems, 14(7):640--654, 2003.
[30]
Gregorio Quintana-Orti, Enrique S. Quintana-Orti, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In PDP '08: Proceedings of the Sixteenth Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Toulouse, France, February 2008. To appear.
[31]
Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. International Journal of Parallel and Distributed Systems and Networks, 4(1):26--35, June 2001.
[32]
R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1), 1967.
[33]
Vinod Valsalam and Anthony Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 14(10):805--840, 2002.
[34]
Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.
[35]
Field G. Van Zee, Paolo Bientinesi, Tze Meng Low, and Robert A. van de Geijn. Scalable parallelization of FLAME code via the workqueuing model. ACM Transactions on Mathematical Software, 34(2), 2008.


    Published In

    PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
    February 2008
    308 pages
    ISBN:9781595937957
    DOI:10.1145/1345206

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. algorithms-by-blocks
    2. dependency analysis
    3. dynamic scheduling
    4. out-of-order execution

    Qualifiers

    • Research-article

    Conference

    PPoPP08

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%


    Cited By

    • (2023) Characterize and Optimize Dense Linear Solver on Multi-core CPUs. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), pages 1833--1842. DOI: 10.1109/ICPADS60453.2023.00253. Online publication date: 17-Dec-2023.
    • (2023) DeltaSPARSE: High-Performance Sparse General Matrix-Matrix Multiplication on Multi-GPU Systems. In 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 194--202. DOI: 10.1109/HiPC58850.2023.00037. Online publication date: 18-Dec-2023.
    • (2023) Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks. In Euro-Par 2023: Parallel Processing, pages 725--739. DOI: 10.1007/978-3-031-39698-4_49. Online publication date: 24-Aug-2023.
    • (2022) Algorithm 1022: Efficient Algorithms for Computing a Rank-Revealing UTV Factorization on Parallel Computing Architectures. ACM Transactions on Mathematical Software, 48(2):1--42. DOI: 10.1145/3507466. Online publication date: 26-May-2022.
    • (2022) Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe. In Mathematical Modeling and Supercomputer Technologies, pages 281--292. DOI: 10.1007/978-3-031-24145-1_23. Online publication date: 24-Dec-2022.
    • (2022) QR Factorization Using Malleable BLAS on Multicore Processors. In High Performance Computing. ISC High Performance 2022 International Workshops, pages 176--189. DOI: 10.1007/978-3-031-23220-6_12. Online publication date: 29-May-2022.
    • (2021) Evaluation of two topology-aware heuristics on level-3 BLAS library for multi-GPU platforms. In 2021 SC Workshops Supplementary Proceedings (SCWS), pages 12--22. DOI: 10.1109/SCWS55283.2021.00013. Online publication date: Nov-2021.
    • (2020) XKBlas: a High Performance Implementation of BLAS-3 Kernels on Multi-GPU Server. In 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 1--8. DOI: 10.1109/PDP50117.2020.00008. Online publication date: Mar-2020.
    • (2020) 2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints. In 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 181--191. DOI: 10.1109/HiPC50609.2020.00032. Online publication date: Dec-2020.
    • (2020) Approximation Algorithms for Multitasking Scheduling Problems. IEEE Access, 8:127530--127534. DOI: 10.1109/ACCESS.2020.3007755. Online publication date: 2020.
