High-performance implementation of the level-3 BLAS

Authors:

Kazushige Goto,

Robert van de Geijn

ACM Transactions on Mathematical Software (TOMS), Volume 35, Issue 1

Article No.: 4, Pages 1 - 14

https://doi.org/10.1145/1377603.1377607

Published: 25 July 2008

Abstract

A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. Exceptional performance is demonstrated on various architectures.

Index Terms

1. Mathematics of computing
  1. Mathematical software

Published In

ACM Transactions on Mathematical Software, Volume 35, Issue 1

July 2008

136 pages

ISSN: 0098-3500

EISSN: 1557-7295

DOI: 10.1145/1377603

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 July 2008

Accepted: 01 October 2007

Revised: 01 April 2007

Received: 01 May 2006

Qualifiers

Research-article
Research
Refereed

Funding Sources

Division of Computing and Communication Foundations
