This dissertation presents an automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical-search, can by contrast achieve up to 31% of peak machine speed, and can be up to 4× faster.
Given a matrix, kernel, and machine; our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) search this space for the fastest one using a combination of heuristic models and actual experiments ( i.e ., running and timing the code). We build on the S PARSITY system for generating highly-tuned implementations of the SpMV kernel y ý y + Ax , where A is a sparse matrix and x, y are dense vectors. We extend S PARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels like sparse triangular solve (SpTS) and computation of A T A·x (or AA T ·x ) and A ý · x .
We develop new models to compute, for particular data structures and kernels, the best absolute performance ( e.g ., Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate our performance with respect to such bounds, finding that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning ( e.g ., better instruction selection and scheduling). (Abstract shortened by UMI.)
Cited By
- Wang H, Yang W, Hu R, Ouyang R, Li K and Li K (2023). IAP-SpTV, Journal of Parallel and Distributed Computing, 181:C, Online publication date: 1-Nov-2023.
- Atoofian E (2023). PTTS, Journal of Parallel and Distributed Computing, 173:C, (70-82), Online publication date: 1-Mar-2023.
- Lane P and Booth J (2023). Heterogeneous sparse matrix–vector multiplication via compressed sparse row format, Parallel Computing, 115:C, Online publication date: 1-Feb-2023.
- Cheshmi K, Cetinic Z and Dehnavi M Vectorizing sparse matrix computations with partially-strided codelets Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-15)
- Chen P, Wahib M, Wang X, Hirofuchi T, Ogawa H, Biguri A, Boardman R, Blumensath T and Matsuoka S Scalable FBP decomposition for cone-beam CT reconstruction Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-16)
- XIE C, Chen J, Firoz J, Li J, Song S, Barker K, Raugas M and Li A Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures Proceedings of the 50th International Conference on Parallel Processing, (1-11)
- Liu J, Ren J, Gioiosa R, Li D and Li J Sparta Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (318-333)
- Qian X, Wang Y and Karanth A (2020). Guest Editors’ Introduction to the Special Issue on Machine Learning Architectures and Accelerators, IEEE Transactions on Computers, 69:7, (929-930), Online publication date: 1-Jul-2020.
- Guan Y, Sun G, Yuan Z, Li X, Xu N, Chen S, Cong J and Xie Y (2020). Crane: Mitigating Accelerator Under-utilization Caused by Sparsity Irregularities in CNNs, IEEE Transactions on Computers, 69:7, (931-943), Online publication date: 1-Jul-2020.
- Ma Y, Li J, Wu X, Yan C, Sun J and Vuduc R (2019). Optimizing sparse tensor times matrix on GPUs, Journal of Parallel and Distributed Computing, 129:C, (99-109), Online publication date: 1-Jul-2019.
- Nie J, Zhang C, Zou D, Xia F, Lu L, Wang X and Zhao F Adaptive Sparse Matrix-Vector Multiplication on CPU-GPU Heterogeneous Architecture Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, (6-10)
- Augustine T, Sarma J, Pouchet L and Rodríguez G Generating piecewise-regular code from irregular structures Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, (625-639)
- Montagne E and Surós R Systolic sparse matrix vector multiply in the age of TPUs and accelerators Proceedings of the High Performance Computing Symposium, (1-10)
- Li J, Sun J and Vuduc R HiCOO Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-15)
- Li J, Sun J and Vuduc R HiCOO Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-15)
- Yang C, Buluç A and Owens J Implementing Push-Pull Efficiently in GraphBLAS Proceedings of the 47th International Conference on Parallel Processing, (1-11)
- Feinberg B, Vengalam U, Whitehair N, Wang S and Ipek E Enabling scientific computing on memristive accelerators Proceedings of the 45th Annual International Symposium on Computer Architecture, (367-382)
- Aymerich E, Duchateau A, Montagne E and Plochan F On tuning the symmetric sparse matrix vector multiplication with CSR and TJDS Proceedings of the High Performance Computing Symposium, (1-12)
- Wang X, Liu W, Xue W and Wu L (2018). swSpTRSV, ACM SIGPLAN Notices, 53:1, (338-353), Online publication date: 23-Mar-2018.
- Xie B, Zhan J, Liu X, Gao W, Jia Z, He X and Zhang L CVR: efficient vectorization of SpMV on x86 processors Proceedings of the 2018 International Symposium on Code Generation and Optimization, (149-162)
- Wang X, Liu W, Xue W and Wu L swSpTRSV Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (338-353)
- Parashar A, Rhu M, Mukkara A, Puglielli A, Venkatesan R, Khailany B, Emer J, Keckler S and Dally W (2017). SCNN, ACM SIGARCH Computer Architecture News, 45:2, (27-40), Online publication date: 14-Sep-2017.
- Parashar A, Rhu M, Mukkara A, Puglielli A, Venkatesan R, Khailany B, Emer J, Keckler S and Dally W SCNN Proceedings of the 44th Annual International Symposium on Computer Architecture, (27-40)
- Golnari P and Malik S Evaluating matrix representations for error-tolerant computing Proceedings of the Conference on Design, Automation & Test in Europe, (1663-1666)
- Merrill D and Garland M Merge-based parallel sparse matrix-vector multiplication Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-12)
- Merrill D and Garland M (2016). Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format, ACM SIGPLAN Notices, 51:8, (1-2), Online publication date: 9-Nov-2016.
- Han S, Liu X, Mao H, Pu J, Pedram A, Horowitz M and Dally W (2016). EIE, ACM SIGARCH Computer Architecture News, 44:3, (243-254), Online publication date: 12-Oct-2016.
- Catalán S, Malossi A, Bekas C and Quintana-Ortí E The Impact of Voltage-Frequency Scaling for the Matrix-Vector Product on the IBM POWER8 Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833, (103-116)
- Micolet P, Smith A and Dubach C (2016). A machine learning approach to mapping streaming workloads to dynamic multicore processors, ACM SIGPLAN Notices, 51:5, (113-122), Online publication date: 1-Aug-2016.
- Han S, Liu X, Mao H, Pu J, Pedram A, Horowitz M and Dally W EIE Proceedings of the 43rd International Symposium on Computer Architecture, (243-254)
- Micolet P, Smith A and Dubach C A machine learning approach to mapping streaming workloads to dynamic multicore processors Proceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems, (113-122)
- Buono D, Petrini F, Checconi F, Liu X, Que X, Long C and Tuan T Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics Proceedings of the 2016 International Conference on Supercomputing, (1-12)
- Merrill D and Garland M Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (1-2)
- Liu W and Vinter B (2015). Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors, Parallel Computing, 49:C, (179-193), Online publication date: 1-Nov-2015.
- Venkat A, Hall M and Strout M (2015). Loop and data transformations for sparse matrix code, ACM SIGPLAN Notices, 50:6, (521-532), Online publication date: 7-Aug-2015.
- Sedaghati N, Mu T, Pouchet L, Parthasarathy S and Sadayappan P Automatic Selection of Sparse Matrix Representation on GPUs Proceedings of the 29th ACM on International Conference on Supercomputing, (99-108)
- Liu W and Vinter B CSR5 Proceedings of the 29th ACM on International Conference on Supercomputing, (339-350)
- Venkat A, Hall M and Strout M Loop and data transformations for sparse matrix code Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, (521-532)
- Aliaga J, Anzt H, Castillo M, Fernández J, León G, Pérez J and Quintana-Ortí E (2015). Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors, Concurrency and Computation: Practice & Experience, 27:4, (885-904), Online publication date: 25-Mar-2015.
- Sedaghati N, Ashari A, Pouchet L, Parthasarathy S and Sadayappan P Characterizing dataset dependence for sparse matrix-vector multiplication on GPUs Proceedings of the 2nd Workshop on Parallel Programming for Analytics Applications, (17-24)
- Ashari A, Sedaghati N, Eisenlohr J, Parthasarathy S and Sadayappan P Fast sparse matrix-vector multiplication on GPUs for graph applications Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (781-792)
- Greathouse J and Daga M Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (769-780)
- Ashari A, Sedaghati N, Eisenlohr J and Sadayappan P An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs Proceedings of the 28th ACM international conference on Supercomputing, (273-282)
- Wijs A and Bošnački D Improving GPU sparse matrix-vector multiplication for probabilistic model checking Proceedings of the 19th international conference on Model Checking Software, (98-116)
- Su B and Keutzer K clSpMV Proceedings of the 26th ACM international conference on Supercomputing, (353-364)
- Georgescu S and Chow P (2011). GPU accelerated CAE using open solvers and the cloud, ACM SIGARCH Computer Architecture News, 39:4, (14-19), Online publication date: 19-Dec-2011.
- Sun X, Zhang Y, Wang T, Long G, Zhang X and Li Y CRSD Proceedings of the 17th international conference on Parallel processing - Volume Part II, (316-327)
- Grewe D and Lokhmotov A Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, (1-8)
- Yang X, Parthasarathy S and Sadayappan P (2011). Fast sparse matrix-vector multiplication on GPUs, Proceedings of the VLDB Endowment, 4:4, (231-242), Online publication date: 1-Jan-2011.
- Arnold G, Hölzl J, Köksal A, Bodík R and Sagiv M (2010). Specifying and verifying sparse matrix codes, ACM SIGPLAN Notices, 45:9, (249-260), Online publication date: 27-Sep-2010.
- Arnold G, Hölzl J, Köksal A, Bodík R and Sagiv M Specifying and verifying sparse matrix codes Proceedings of the 15th ACM SIGPLAN international conference on Functional programming, (249-260)
- Greiner G and Jacob R Evaluating non-square sparse bilinear forms on multiple vector pairs in the I/O-model Proceedings of the 35th international conference on Mathematical foundations of computer science, (393-404)
- Haque S, Hossain S and Moreno Maza M Cache friendly sparse matrix-vector multiplication Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, (175-176)
- Choi J, Singh A and Vuduc R (2010). Model-driven autotuning of sparse matrix-vector multiply on GPUs, ACM SIGPLAN Notices, 45:5, (115-126), Online publication date: 1-May-2010.
- Ji X, Cheng T and Wang Q A simulation of large-scale groundwater flow on CUDA-enabled GPUs Proceedings of the 2010 ACM Symposium on Applied Computing, (2402-2403)
- Monakov A, Lokhmotov A and Avetisyan A Automatically tuning sparse matrix-vector multiplication for GPU architectures Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers, (111-125)
- Choi J, Singh A and Vuduc R Model-driven autotuning of sparse matrix-vector multiply on GPUs Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (115-126)
- Mohiyuddin M, Hoemmen M, Demmel J and Yelick K Minimizing communication in sparse matrix solvers Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, (1-12)
- Bell N and Garland M Implementing sparse matrix-vector multiplication on throughput-oriented processors Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, (1-11)
- Mohiyuddin M, Murphy M, Oliker L, Shalf J, Wawrzynek J and Williams S A design methodology for domain-optimized power-efficient supercomputing Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, (1-12)
- Wozniak M, Olas T and Wyrzykowski R Parallel implementation of conjugate gradient method on graphics processors Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I, (125-135)
- Belgin M, Back G and Ribbens C Pattern-based sparse matrix representation for memory-efficient SMVM kernels Proceedings of the 23rd international conference on Supercomputing, (100-109)
- Williams S, Oliker L, Vuduc R, Shalf J, Yelick K and Demmel J Optimization of sparse matrix-vector multiplication on emerging multicore platforms Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-12)
- Buttari A, Eijkhout V, Langou J and Filippone S (2007). Performance Optimization and Modeling of Blocked Sparse Kernels, International Journal of High Performance Computing Applications, 21:4, (467-484), Online publication date: 1-Nov-2007.
- Bender M, Brodal G, Fagerberg R, Jacob R and Vicari E Optimal sparse matrix dense vector multiplication in the I/O-model Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, (61-70)
- Williams S, Shalf J, Oliker L, Kamil S, Husbands P and Yelick K The potential of the cell processor for scientific computing Proceedings of the 3rd conference on Computing frontiers, (9-20)
- deLorimier M and DeHon A Floating-point sparse matrix-vector multiply for FPGAs Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, (75-85)
Index Terms
- Automatic performance tuning of sparse matrix kernels
Recommendations
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and AnalysisOpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes ...
Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS Format on GPUs
CSE '12: Proceedings of the 2012 IEEE 15th International Conference on Computational Science and EngineeringPerformance of sparse matrix-vector multiplication (SpMV) on GPUs is highly dependent on the structure of the sparse matrix used in the computation, the computing environment, and the selection of certain parameters. In this paper, we show that the ...
Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach
A wide range of applications in engineering and scientific computing are involved in the acceleration of the sparse matrix vector product (SpMV). Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration ...