DOI: 10.1145/1345206.1345227

SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

Published: 20 February 2008

Abstract

This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.
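The scheduling mechanism the abstract describes — enqueue block operations, track dependencies between them, then execute out of order once inputs are ready — can be sketched in miniature. The following is a hypothetical illustration, not the SuperMatrix API: the `Runtime` class, `enqueue` signature, and scalar "blocks" are invented for exposition, and a real system would also handle write-after-read hazards and dispatch ready tasks to worker threads.

```python
# Toy sketch of superscalar-style scheduling over blocks (hypothetical,
# not the actual SuperMatrix interface). Each enqueued task names the
# blocks it reads and writes; a task depends on the last task that wrote
# any block it touches (read-after-write and write-after-write hazards),
# and any task whose dependencies have completed may run, in any order.

class Runtime:
    def __init__(self):
        self.tasks = []        # list of (func, dependency-index set)
        self.last_writer = {}  # block name -> index of last writing task

    def enqueue(self, func, reads, writes):
        deps = {self.last_writer[b]
                for b in reads + writes if b in self.last_writer}
        idx = len(self.tasks)
        for b in writes:
            self.last_writer[b] = idx
        self.tasks.append((func, deps))

    def run(self):
        done, order = set(), []
        pending = list(range(len(self.tasks)))
        while pending:
            for i in pending:  # pick any ready task; order among ready tasks is free
                func, deps = self.tasks[i]
                if deps <= done:
                    func()
                    done.add(i)
                    order.append(i)
                    pending.remove(i)
                    break
        return order

# Blocks of a matrix, reduced here to scalars for brevity.
blocks = {"A": 2.0, "B": 3.0}
rt = Runtime()
rt.enqueue(lambda: blocks.update(A=blocks["A"] ** 2), reads=["A"], writes=["A"])
rt.enqueue(lambda: blocks.update(B=blocks["B"] + 1.0), reads=["B"], writes=["B"])
# The third task must wait for both writers above; tasks 0 and 1 are independent.
rt.enqueue(lambda: blocks.update(A=blocks["A"] + blocks["B"]),
           reads=["A", "B"], writes=["A"])
order = rt.run()
```

Only the `enqueue` calls appear in library-facing code, so the algorithm is written with no visible parallelism; changing the selection policy inside `run()` swaps the scheduling heuristic without touching the enqueueing code, which mirrors the separation of concerns the paper claims.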

References

[1]
C. Addison, Y. Ren, and M. van Waveren. OpenMP issues arising in the development of parallel BLAS and LAPACK libraries. Scientific Programming, 11(2), 2003.
[2]
R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In SC '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225--233, New York, NY, USA, 1989.
[3]
E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. LAPACK Users' guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999.
[4]
Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta. CellSs: A programming model for the Cell BE architecture. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, pages 5--15, Tampa, FL, USA, November 2006.
[5]
Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, The University of Texas at Austin, 2006.
[6]
Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Orti, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software, 31(1):1--26, March 2005.
[7]
Paolo Bientinesi, Brian Gunter, and Robert van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software. To appear.
[8]
Paolo Bientinesi, Enrique S. Quintana-Orti, and Robert A. van de Geijn. Representing linear algebra algorithms in code: The FLAME application programming interfaces. ACM Transactions on Mathematical Software, 31(1):27--59, March 2005.
[9]
Ernie Chan, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116--125, San Diego, CA, USA, June 2007.
[10]
Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91--99, Austin, TX, USA, September 2007.
[11]
S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast matrix multiplication. IEEE Transactions on Parallel and Distributed Systems, 13(11):1105--1123, 2002.
[12]
J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120--127. IEEE Computer Society Press, 1992.
[13]
Timothy Collins and James C. Browne. Matrix++: An object-oriented environment for parallel high-performance matrix computations. In Proceedings of the Hawaii International Conference on Systems and Software, pages 202--211, 1995.
[14]
Timothy Scott Collins. Efficient Matrix Computations through Hierarchical Type Specifications. PhD thesis, The University of Texas at Austin, 1996.
[15]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1--17, March 1990.
[16]
Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3--45, 2004.
[17]
Kazushige Goto. http://www.tacc.utexas.edu/resources/software.
[18]
Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software, 34(3), 2008.
[19]
John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Transactions on Mathematical Software, 27(4):422--455, December 2001.
[20]
Fred G. Gustavson, Lars Karlsson, and Bo Kagstrom. Three algorithms for Cholesky factorization on distributed memory using packed storage. In Applied Parallel Computing. State of the Art in Scientific Computing, pages 550--559. Springer Berlin / Heidelberg, 2007.
[21]
Greg Henry. BLAS based on block data structures. Theory Center Technical Report CTC92TR89, Cornell University, February 1992.
[22]
Jose Ramon Herrero. A framework for efficient execution of matrix computations. PhD thesis, Polytechnic University of Catalonia, Spain, 2006.
[23]
James Kahle, Michael Day, Peter Hofstee, Charles Johns, Theodore Maeurer, and David Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589--604, September 2005.
[24]
Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the Cell processor using Cholesky factorization. Technical Report UT-CS-07-596, Innovative Computing Laboratory, University of Tennessee, April 2007.
[25]
Jakub Kurzak and Jack Dongarra. Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. LAPACK Working Note 178, Technical Report UT-CS-06-581, University of Tennessee, September 2006.
[26]
Charles Leiserson and Aske Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 31, 1998.
[27]
Tze Meng Low and Robert van de Geijn. An API for manipulating matrices stored by blocks. FLAME Working Note #12, TR-2004-15, Department of Computer Sciences, The University of Texas at Austin, May 2004.
[28]
Bryan Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Orti, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In Euro-Par '07: Proceedings of the Thirteenth International European Conference on Parallel and Distributed Computing, pages 748--757, Rennes, France, August 2007.
[29]
N. Park, B. Hong, and V. K. Prasanna. Tiling, block data layout, and memory hierarchy performance. IEEE Transactions on Parallel and Distributed Systems, 14(7):640--654, 2003.
[30]
Gregorio Quintana-Orti, Enrique S. Quintana-Orti, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In PDP '08: Proceedings of the Sixteenth Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Toulouse, France, February 2008. To appear.
[31]
Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. International Journal of Parallel and Distributed Systems and Networks, 4(1):26--35, June 2001.
[32]
R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1), 1967.
[33]
Vinod Valsalam and Anthony Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 14(10):805--840, 2002.
[34]
Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.
[35]
Field G. Van Zee, Paolo Bientinesi, Tze Meng Low, and Robert A. van de Geijn. Scalable parallelization of FLAME code via the workqueuing model. ACM Transactions on Mathematical Software, 34(2), 2008.


    Published In

    PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
    February 2008
    308 pages
    ISBN:9781595937957
    DOI:10.1145/1345206

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. algorithms-by-blocks
    2. dependency analysis
    3. dynamic scheduling
    4. out-of-order execution

    Qualifiers

    • Research-article

    Conference

    PPoPP08

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%


    Cited By

    • (2023) Characterize and Optimize Dense Linear Solver on Multi-core CPUs. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), pages 1833--1842. DOI: 10.1109/ICPADS60453.2023.00253. Online publication date: 17-Dec-2023.
    • (2023) DeltaSPARSE: High-Performance Sparse General Matrix-Matrix Multiplication on Multi-GPU Systems. In 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 194--202. DOI: 10.1109/HiPC58850.2023.00037. Online publication date: 18-Dec-2023.
    • (2023) Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks. In Euro-Par 2023: Parallel Processing, pages 725--739. DOI: 10.1007/978-3-031-39698-4_49. Online publication date: 24-Aug-2023.
    • (2022) Algorithm 1022: Efficient Algorithms for Computing a Rank-Revealing UTV Factorization on Parallel Computing Architectures. ACM Transactions on Mathematical Software, 48(2):1--42. DOI: 10.1145/3507466. Online publication date: 26-May-2022.
    • (2022) Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe. In Mathematical Modeling and Supercomputer Technologies, pages 281--292. DOI: 10.1007/978-3-031-24145-1_23. Online publication date: 24-Dec-2022.
    • (2022) QR Factorization Using Malleable BLAS on Multicore Processors. In High Performance Computing. ISC High Performance 2022 International Workshops, pages 176--189. DOI: 10.1007/978-3-031-23220-6_12. Online publication date: 29-May-2022.
    • (2021) Evaluation of two topology-aware heuristics on level-3 BLAS library for multi-GPU platforms. In 2021 SC Workshops Supplementary Proceedings (SCWS), pages 12--22. DOI: 10.1109/SCWS55283.2021.00013. Online publication date: Nov-2021.
    • (2020) XKBlas: a High Performance Implementation of BLAS-3 Kernels on Multi-GPU Server. In 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 1--8. DOI: 10.1109/PDP50117.2020.00008. Online publication date: Mar-2020.
    • (2020) 2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints. In 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 181--191. DOI: 10.1109/HiPC50609.2020.00032. Online publication date: Dec-2020.
    • (2020) Approximation Algorithms for Multitasking Scheduling Problems. IEEE Access, 8:127530--127534. DOI: 10.1109/ACCESS.2020.3007755. Online publication date: 2020.
