Automatic performance tuning of sparse matrix kernels
Publisher:
  • University of California, Berkeley
Order Number: AAI3121741
Pages: 433
Abstract

This dissertation presents an automated system for generating highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using an empirical-search methodology, can by contrast achieve up to 31% of peak machine speed, and can be up to 4× faster than conventional implementations.
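As a concrete baseline, the conventional compressed sparse row (CSR) implementation of SpMV can be sketched as below. This is plain Python rather than the compiled kernels the dissertation studies, and the array names (`row_ptr`, `col_idx`, `vals`) are the standard CSR conventions, not identifiers from the dissertation:

```python
# Baseline CSR sparse matrix-vector multiply: y <- y + A*x.
# A minimal sketch of the kernel the dissertation tunes.
def spmv_csr(row_ptr, col_idx, vals, x, y):
    n = len(row_ptr) - 1
    for i in range(n):
        acc = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # The indirect access x[col_idx[k]] is one reason SpMV is
            # memory-bound: an extra index load per non-zero, and
            # little reuse of x within a row.
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 2x2 example: A = [[2, 0], [1, 3]], x = [1, 1], y = [0, 0]
print(spmv_csr([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0], [0.0, 0.0]))
# -> [2.0, 4.0]
```

Tuned variants (e.g., register blocking) restructure exactly this loop nest to amortize the index overhead, which is why the implementation space in the next paragraph exists at all.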

Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly-tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels like sparse triangular solve (SpTS) and computation of A^T A·x (or AA^T·x) and A^ρ·x.
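The two-step approach can be sketched as a generic run-and-time search loop. The candidate variants below are toy stand-ins invented for illustration; SPARSITY's real search space consists of register-blocked SpMV code variants, and its real search also prunes candidates with heuristic models before timing:

```python
import time

def time_once(fn, args):
    """Time a single run of one candidate implementation."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def search_fastest(candidates, run_args, trials=3):
    """Step 2 of the approach: run and time each candidate on the
    actual inputs, keeping the fastest. Taking the minimum of a few
    trials reduces timer noise."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        t = min(time_once(fn, run_args) for _ in range(trials))
        if t < best_time:
            best_name, best_time = name, t
    return best_name

# Step 1 of the approach: a (toy) space of candidate implementations.
candidates = {
    "scalar": lambda xs: sum(xs),                          # unblocked baseline
    "unrolled": lambda xs: sum(xs[0::2]) + sum(xs[1::2]),  # toy "tuned" variant
}
print(search_fastest(candidates, ([1.0] * 10000,)))
```

Which variant wins depends on the machine and input, which is precisely why the selection is empirical rather than fixed at library-design time.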

We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate our performance with respect to such bounds, finding that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)
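The flavor of such an upper bound can be sketched as follows: charge every load to its memory-hierarchy level and assume computation is free (ideal scheduling), so a time lower bound comes from memory cost alone. The level names, latencies, and load counts below are invented illustrative numbers, and the model is a simplification of the dissertation's actual bounds:

```python
def spmv_upper_bound_mflops(nnz, loads, latency_cycles, clock_ghz):
    """Simplified performance upper bound for SpMV.

    SpMV does 2 flops per stored non-zero. A time lower bound charges
    each load to its memory level; dividing flops by that time gives
    the best Mflop/s one could hope for with ideal scheduling.
    """
    flops = 2 * nnz
    cycles = sum(loads[lvl] * latency_cycles[lvl] for lvl in loads)
    seconds = cycles / (clock_ghz * 1e9)
    return flops / seconds / 1e6

# Illustrative (made-up) machine and matrix: 1M non-zeros, a hypothetical
# 1 GHz machine with 1-cycle L1 hits, 10-cycle L2 hits, 100-cycle memory.
bound = spmv_upper_bound_mflops(
    nnz=1_000_000,
    loads={"L1": 2_500_000, "L2": 300_000, "mem": 200_000},
    latency_cycles={"L1": 1, "L2": 10, "mem": 100},
    clock_ghz=1.0,
)
print(f"{bound:.1f} Mflop/s")  # -> 78.4 Mflop/s
```

Comparing a measured Mflop/s rate against such a bound is what lets the dissertation conclude that its tuned kernels reach up to 75% of the achievable performance, leaving limited room for further low-level tuning.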

Cited By

  1. Wang H, Yang W, Hu R, Ouyang R, Li K and Li K (2023). IAP-SpTV, Journal of Parallel and Distributed Computing, 181:C, Online publication date: 1-Nov-2023.
  2. Atoofian E (2023). PTTS, Journal of Parallel and Distributed Computing, 173:C, (70-82), Online publication date: 1-Mar-2023.
  3. Lane P and Booth J (2023). Heterogeneous sparse matrix–vector multiplication via compressed sparse row format, Parallel Computing, 115:C, Online publication date: 1-Feb-2023.
  4. Cheshmi K, Cetinic Z and Dehnavi M Vectorizing sparse matrix computations with partially-strided codelets Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-15)
  5. Chen P, Wahib M, Wang X, Hirofuchi T, Ogawa H, Biguri A, Boardman R, Blumensath T and Matsuoka S Scalable FBP decomposition for cone-beam CT reconstruction Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-16)
  6. Xie C, Chen J, Firoz J, Li J, Song S, Barker K, Raugas M and Li A Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures Proceedings of the 50th International Conference on Parallel Processing, (1-11)
  7. Liu J, Ren J, Gioiosa R, Li D and Li J Sparta Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (318-333)
  8. Qian X, Wang Y and Karanth A (2020). Guest Editors' Introduction to the Special Issue on Machine Learning Architectures and Accelerators, IEEE Transactions on Computers, 69:7, (929-930), Online publication date: 1-Jul-2020.
  9. Guan Y, Sun G, Yuan Z, Li X, Xu N, Chen S, Cong J and Xie Y (2020). Crane: Mitigating Accelerator Under-utilization Caused by Sparsity Irregularities in CNNs, IEEE Transactions on Computers, 69:7, (931-943), Online publication date: 1-Jul-2020.
  10. Ma Y, Li J, Wu X, Yan C, Sun J and Vuduc R (2019). Optimizing sparse tensor times matrix on GPUs, Journal of Parallel and Distributed Computing, 129:C, (99-109), Online publication date: 1-Jul-2019.
  11. Nie J, Zhang C, Zou D, Xia F, Lu L, Wang X and Zhao F Adaptive Sparse Matrix-Vector Multiplication on CPU-GPU Heterogeneous Architecture Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, (6-10)
  12. Augustine T, Sarma J, Pouchet L and Rodríguez G Generating piecewise-regular code from irregular structures Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, (625-639)
  13. Montagne E and Surós R Systolic sparse matrix vector multiply in the age of TPUs and accelerators Proceedings of the High Performance Computing Symposium, (1-10)
  14. Li J, Sun J and Vuduc R HiCOO Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-15)
  15. Yang C, Buluç A and Owens J Implementing Push-Pull Efficiently in GraphBLAS Proceedings of the 47th International Conference on Parallel Processing, (1-11)
  16. Feinberg B, Vengalam U, Whitehair N, Wang S and Ipek E Enabling scientific computing on memristive accelerators Proceedings of the 45th Annual International Symposium on Computer Architecture, (367-382)
  17. Aymerich E, Duchateau A, Montagne E and Plochan F On tuning the symmetric sparse matrix vector multiplication with CSR and TJDS Proceedings of the High Performance Computing Symposium, (1-12)
  18. Wang X, Liu W, Xue W and Wu L (2018). swSpTRSV, ACM SIGPLAN Notices, 53:1, (338-353), Online publication date: 23-Mar-2018.
  19. Xie B, Zhan J, Liu X, Gao W, Jia Z, He X and Zhang L CVR: efficient vectorization of SpMV on x86 processors Proceedings of the 2018 International Symposium on Code Generation and Optimization, (149-162)
  20. Wang X, Liu W, Xue W and Wu L swSpTRSV Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (338-353)
  21. Parashar A, Rhu M, Mukkara A, Puglielli A, Venkatesan R, Khailany B, Emer J, Keckler S and Dally W (2017). SCNN, ACM SIGARCH Computer Architecture News, 45:2, (27-40), Online publication date: 14-Sep-2017.
  22. Parashar A, Rhu M, Mukkara A, Puglielli A, Venkatesan R, Khailany B, Emer J, Keckler S and Dally W SCNN Proceedings of the 44th Annual International Symposium on Computer Architecture, (27-40)
  23. Golnari P and Malik S Evaluating matrix representations for error-tolerant computing Proceedings of the Conference on Design, Automation & Test in Europe, (1663-1666)
  24. Merrill D and Garland M Merge-based parallel sparse matrix-vector multiplication Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-12)
  25. Merrill D and Garland M (2016). Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format, ACM SIGPLAN Notices, 51:8, (1-2), Online publication date: 9-Nov-2016.
  26. Han S, Liu X, Mao H, Pu J, Pedram A, Horowitz M and Dally W (2016). EIE, ACM SIGARCH Computer Architecture News, 44:3, (243-254), Online publication date: 12-Oct-2016.
  27. Catalán S, Malossi A, Bekas C and Quintana-Ortí E The Impact of Voltage-Frequency Scaling for the Matrix-Vector Product on the IBM POWER8 Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833, (103-116)
  28. Micolet P, Smith A and Dubach C (2016). A machine learning approach to mapping streaming workloads to dynamic multicore processors, ACM SIGPLAN Notices, 51:5, (113-122), Online publication date: 1-Aug-2016.
  29. Han S, Liu X, Mao H, Pu J, Pedram A, Horowitz M and Dally W EIE Proceedings of the 43rd International Symposium on Computer Architecture, (243-254)
  30. Micolet P, Smith A and Dubach C A machine learning approach to mapping streaming workloads to dynamic multicore processors Proceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems, (113-122)
  31. Buono D, Petrini F, Checconi F, Liu X, Que X, Long C and Tuan T Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics Proceedings of the 2016 International Conference on Supercomputing, (1-12)
  32. Merrill D and Garland M Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (1-2)
  33. Liu W and Vinter B (2015). Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors, Parallel Computing, 49:C, (179-193), Online publication date: 1-Nov-2015.
  34. Venkat A, Hall M and Strout M (2015). Loop and data transformations for sparse matrix code, ACM SIGPLAN Notices, 50:6, (521-532), Online publication date: 7-Aug-2015.
  35. Sedaghati N, Mu T, Pouchet L, Parthasarathy S and Sadayappan P Automatic Selection of Sparse Matrix Representation on GPUs Proceedings of the 29th ACM on International Conference on Supercomputing, (99-108)
  36. Liu W and Vinter B CSR5 Proceedings of the 29th ACM on International Conference on Supercomputing, (339-350)
  37. Venkat A, Hall M and Strout M Loop and data transformations for sparse matrix code Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, (521-532)
  38. Aliaga J, Anzt H, Castillo M, Fernández J, León G, Pérez J and Quintana-Ortí E (2015). Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors, Concurrency and Computation: Practice & Experience, 27:4, (885-904), Online publication date: 25-Mar-2015.
  39. Sedaghati N, Ashari A, Pouchet L, Parthasarathy S and Sadayappan P Characterizing dataset dependence for sparse matrix-vector multiplication on GPUs Proceedings of the 2nd Workshop on Parallel Programming for Analytics Applications, (17-24)
  40. Ashari A, Sedaghati N, Eisenlohr J, Parthasarathy S and Sadayappan P Fast sparse matrix-vector multiplication on GPUs for graph applications Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (781-792)
  41. Greathouse J and Daga M Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (769-780)
  42. Ashari A, Sedaghati N, Eisenlohr J and Sadayappan P An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs Proceedings of the 28th ACM international conference on Supercomputing, (273-282)
  43. Wijs A and Bošnački D Improving GPU sparse matrix-vector multiplication for probabilistic model checking Proceedings of the 19th international conference on Model Checking Software, (98-116)
  44. Su B and Keutzer K clSpMV Proceedings of the 26th ACM international conference on Supercomputing, (353-364)
  45. Georgescu S and Chow P (2011). GPU accelerated CAE using open solvers and the cloud, ACM SIGARCH Computer Architecture News, 39:4, (14-19), Online publication date: 19-Dec-2011.
  46. Sun X, Zhang Y, Wang T, Long G, Zhang X and Li Y CRSD Proceedings of the 17th international conference on Parallel processing - Volume Part II, (316-327)
  47. Grewe D and Lokhmotov A Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, (1-8)
  48. Yang X, Parthasarathy S and Sadayappan P (2011). Fast sparse matrix-vector multiplication on GPUs, Proceedings of the VLDB Endowment, 4:4, (231-242), Online publication date: 1-Jan-2011.
  49. Arnold G, Hölzl J, Köksal A, Bodík R and Sagiv M (2010). Specifying and verifying sparse matrix codes, ACM SIGPLAN Notices, 45:9, (249-260), Online publication date: 27-Sep-2010.
  50. Arnold G, Hölzl J, Köksal A, Bodík R and Sagiv M Specifying and verifying sparse matrix codes Proceedings of the 15th ACM SIGPLAN international conference on Functional programming, (249-260)
  51. Greiner G and Jacob R Evaluating non-square sparse bilinear forms on multiple vector pairs in the I/O-model Proceedings of the 35th international conference on Mathematical foundations of computer science, (393-404)
  52. Haque S, Hossain S and Moreno Maza M Cache friendly sparse matrix-vector multiplication Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, (175-176)
  53. Choi J, Singh A and Vuduc R (2010). Model-driven autotuning of sparse matrix-vector multiply on GPUs, ACM SIGPLAN Notices, 45:5, (115-126), Online publication date: 1-May-2010.
  54. Ji X, Cheng T and Wang Q A simulation of large-scale groundwater flow on CUDA-enabled GPUs Proceedings of the 2010 ACM Symposium on Applied Computing, (2402-2403)
  55. Monakov A, Lokhmotov A and Avetisyan A Automatically tuning sparse matrix-vector multiplication for GPU architectures Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers, (111-125)
  56. Choi J, Singh A and Vuduc R Model-driven autotuning of sparse matrix-vector multiply on GPUs Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (115-126)
  57. Mohiyuddin M, Hoemmen M, Demmel J and Yelick K Minimizing communication in sparse matrix solvers Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, (1-12)
  58. Bell N and Garland M Implementing sparse matrix-vector multiplication on throughput-oriented processors Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, (1-11)
  59. Mohiyuddin M, Murphy M, Oliker L, Shalf J, Wawrzynek J and Williams S A design methodology for domain-optimized power-efficient supercomputing Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, (1-12)
  60. Wozniak M, Olas T and Wyrzykowski R Parallel implementation of conjugate gradient method on graphics processors Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I, (125-135)
  61. Belgin M, Back G and Ribbens C Pattern-based sparse matrix representation for memory-efficient SMVM kernels Proceedings of the 23rd international conference on Supercomputing, (100-109)
  62. Williams S, Oliker L, Vuduc R, Shalf J, Yelick K and Demmel J Optimization of sparse matrix-vector multiplication on emerging multicore platforms Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-12)
  63. Buttari A, Eijkhout V, Langou J and Filippone S (2007). Performance Optimization and Modeling of Blocked Sparse Kernels, International Journal of High Performance Computing Applications, 21:4, (467-484), Online publication date: 1-Nov-2007.
  64. Bender M, Brodal G, Fagerberg R, Jacob R and Vicari E Optimal sparse matrix dense vector multiplication in the I/O-model Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, (61-70)
  65. Williams S, Shalf J, Oliker L, Kamil S, Husbands P and Yelick K The potential of the cell processor for scientific computing Proceedings of the 3rd conference on Computing frontiers, (9-20)
  66. deLorimier M and DeHon A Floating-point sparse matrix-vector multiply for FPGAs Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, (75-85)
Contributors
  • Georgia Institute of Technology
  • University of California, Berkeley