Article

Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Authors:

K. Fatahalian,

J. Sugerman,

P. HanrahanAuthors Info & Claims

HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware

Pages 133 - 137

https://doi.org/10.1145/1058129.1058148

Published: 29 August 2004 Publication History

Get Access

Abstract

Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.

References

[1]

{ABB*99} Anderson E., Bai Z., Bischof C., Blackford S., Demmel J., Dongarra J., Du Croz J., Greenbaum A., Hammarling S., McKenney A., Sorensen D.: LAPACK Users' Guide, third ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999.

Digital Library

Google Scholar

[2]

{BFGS03} Bolz J., Farmer I., Grinspun E., Schroder P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph, 22, 3 (2003). 917--924.

Digital Library

Google Scholar

[3]

{BFH*04} Buck I., Foley T., Horn D., Sugerman J., Fatahalian K., Houston M., Hanrahan P.: Brook for GPUs: Stream computing on graphics hardware. In Proceedings of ACM SIGGRAPH 2004 (to appear) (2004).

Digital Library

Google Scholar

[4]

{HCH03} Hall J. D., Carr N. A., Hart J. C.: Cache and bandwidth aware matrix multiplication on the GPU. UIUC Technical Report UIUCDCSR-2003-2328 (2003).

Google Scholar

[5]

{HCSL02} Harris M. J., Coombe G., Scheuermann T., Lastra A.: Physically-based visual simulation on graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (2002), Eurographics Association, pp. 109--118.

Digital Library

Google Scholar

[6]

{HSU*01} Hinton G., Sager D., Upton M., BOGGS D., CARMEAN D., KYKER A., ROUSSEL P.: The microarchitecture of the pentium 4 processor. Intel Technology Journal 22, 1 (2001).

Google Scholar

[7]

{KW03} Kruger J., Westermann R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graph. 22, 3 (2003), 908--916.

Digital Library

Google Scholar

[8]

{LM01} Larsen E. S., McAllister D.: Fast matrix multiplies using graphics hardware. In Proc. Supercomputing 2001 (2001).

Digital Library

Google Scholar

[9]

{MA03} Moreland K., Angel E.: The FFT on a GPU. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (2003), Eurographics Association, pp. 112--119.

Digital Library

Google Scholar

[10]

{Mor03} Moravanszky A.: Dense matrix algebra on the GPU, 2003. http://www.shaderx2.com/shaderx.PDF.

Google Scholar

[11]

{TCL98} Thottethodi M., Chatterjee S., Lebeck A. R.: Tuning strassen's matrix multiplication for memory efficiency. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM) (1998), IEEE Computer Society, pp. 1--14.

Digital Library

Google Scholar

[12]

{WPD01} Whaley R. C., Petitet A., Dongarra J. J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 1-2 (2001), 3--35.

Crossref

Google Scholar

Cited By

View all

Drainakis GPantazopoulos PKatsaros KSourlas VAmditis AKaklamani D(2024)Distributed Predictive QoS in Automotive Environments Under Concept Drift2024 IFIP Networking Conference (IFIP Networking)10.23919/IFIPNetworking62109.2024.10619805(549-554)Online publication date: 3-Jun-2024
https://doi.org/10.23919/IFIPNetworking62109.2024.10619805
Jin SLee HRo WDev KYoo JMeinerzhagen P(2024)GUMSO: Gating Unnecessary On-Chip Memory Slices for Power Optimization on GPUsProceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design10.1145/3665314.3670800(1-6)Online publication date: 5-Aug-2024
https://dl.acm.org/doi/10.1145/3665314.3670800
Hernandez-Tello JMartinez-del-Amor MOrellana-Martin DCabarle F(2024)Sparse Spiking Neural-like Membrane Systems on Graphics Processing UnitsInternational Journal of Neural Systems10.1142/S0129065724500382Online publication date: 5-Apr-2024
https://doi.org/10.1142/S0129065724500382
Show More Cited By

Index Terms

Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces

Recommendations

Matrix multiplication beyond auto-tuning: rewrite-based GPU code generation
CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Graphics Processing Units (GPUs) are used as general purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL ...
GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Many high performance computing applications require computing both sparse matrix-vector product SMVP and sparse matrix-transpose vector product SMTVP for better overall performance. Under such a circumstance, it is critical to maintain a similarly high ...
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware

August 2004

142 pages

ISBN:3905673150

DOI:10.1145/1058129

Conference Chairs:
Michael Doggett
ATI, USA
,
Anselmo Lastra
University of North Carolina, USA
,
Program Chairs:
Mark Harris
NVIDIA, USA
,
David Luebke
University of Virginia, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 August 2004

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Article

Conference

GH04

Sponsor:

SIGGRAPH
EUROGRAPHICS

GH04: Graphics Hardware 2004

August 29 - 30, 2004

Grenoble, France

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

197
Total Citations
View Citations
3,157
Total Downloads

Downloads (Last 12 months)246
Downloads (Last 6 weeks)42

Reflects downloads up to 13 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Drainakis GPantazopoulos PKatsaros KSourlas VAmditis AKaklamani D(2024)Distributed Predictive QoS in Automotive Environments Under Concept Drift2024 IFIP Networking Conference (IFIP Networking)10.23919/IFIPNetworking62109.2024.10619805(549-554)Online publication date: 3-Jun-2024
https://doi.org/10.23919/IFIPNetworking62109.2024.10619805
Jin SLee HRo WDev KYoo JMeinerzhagen P(2024)GUMSO: Gating Unnecessary On-Chip Memory Slices for Power Optimization on GPUsProceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design10.1145/3665314.3670800(1-6)Online publication date: 5-Aug-2024
https://dl.acm.org/doi/10.1145/3665314.3670800
Hernandez-Tello JMartinez-del-Amor MOrellana-Martin DCabarle F(2024)Sparse Spiking Neural-like Membrane Systems on Graphics Processing UnitsInternational Journal of Neural Systems10.1142/S0129065724500382Online publication date: 5-Apr-2024
https://doi.org/10.1142/S0129065724500382
Rastgar FMasnavi HSharma BAabloo ASwevers JSingh A(2024)PRIEST: Projection Guided Sampling-Based Optimization for Autonomous NavigationIEEE Robotics and Automation Letters10.1109/LRA.2024.33573119:3(2630-2637)Online publication date: Mar-2024
https://doi.org/10.1109/LRA.2024.3357311
Zheng XMak SXie LXie Y(2024)eRPCAStatistical Analysis and Data Mining10.1002/sam.1167017:2Online publication date: 27-Mar-2024
https://dl.acm.org/doi/10.1002/sam.11670
Ryabinin MDettmers TDiskin MBorzunov AKrause ABrunskill ECho KEngelhardt BSabato SScarlett J(2023)SWARM parallelismProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3619631(29416-29440)Online publication date: 23-Jul-2023
https://dl.acm.org/doi/10.5555/3618408.3619631
Xu WSun YFan SYu HFu X(2023)Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUsACM Transactions on Architecture and Code Optimization10.1145/360009220:3(1-26)Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3600092
Pfandzelter TDhakal AFrachtenberg EChalamalasetti SEmmot DHogade NEnriquez RRattihalli GBermbach DMilojicic D(2023)Kernel-as-a-ServiceProceedings of the 24th International Middleware Conference10.1145/3590140.3629115(192-206)Online publication date: 27-Nov-2023
https://dl.acm.org/doi/10.1145/3590140.3629115
Sun WKatsifodimos AHai R(2023)An Empirical Performance Comparison between Matrix Multiplication Join and Hash Join on GPUs2023 IEEE 39th International Conference on Data Engineering Workshops (ICDEW)10.1109/ICDEW58674.2023.00034(184-190)Online publication date: Apr-2023
https://doi.org/10.1109/ICDEW58674.2023.00034
Kuzma BKorostelev Ide Carvalho JMoreira JBarton CAraujo GAmaral J(2023)Fast matrix multiplication via compiler‐only layered data reorganization and intrinsic loweringSoftware: Practice and Experience10.1002/spe.321453:9(1793-1814)Online publication date: 14-May-2023
https://doi.org/10.1002/spe.3214
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Cited By

Index Terms

Recommendations

Matrix multiplication beyond auto-tuning: rewrite-based GPU code generation

GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs