DOI: 10.1109/CGO57630.2024.10444827
Research article

JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication

Published: 27 September 2024

Abstract

Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data sizes in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory accesses, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JitSpMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JitSpMM integrates the JIT assembly code generation technique into three widely used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JitSpMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JitSpMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JitSpMM and compare it with two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly optimized SpMM routine provided by Intel MKL. Our results show that JitSpMM provides average speedups of 3.8× and 1.4× over these baselines, respectively.
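To make the AOT-vs-JIT motivation concrete, the following is a minimal reference sketch (not from the paper) of CSR-based SpMM. The trip count of the inner loop is the number of nonzeros in each row, which is only known at runtime — exactly the information an AOT compiler lacks when deciding how to unroll and vectorize, and which a JIT code generator like JitSpMM can exploit. All names here are illustrative.

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Reference SpMM: C = A @ B, with A sparse in CSR form and B dense.

    The inner loop's trip count (nonzeros in row i) is indptr[i+1] - indptr[i],
    a runtime-only quantity. An AOT compiler must emit a generic loop with a
    per-iteration branch; a JIT compiler can specialize (e.g., fully unroll)
    per observed row length instead.
    """
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):
        # Accumulate scaled rows of B for each nonzero A[i, indices[k]].
        for k in range(indptr[i], indptr[i + 1]):
            C[i] += data[k] * B[indices[k]]
    return C
```

This scalar loop is the performance-critical kernel the paper targets; techniques such as coarse-grain column merging and register caching of C[i] address the memory accesses and branch overhead visible in it.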



Published In

CGO '24: Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization
March 2024, 486 pages
ISBN: 9798350395099
In-Cooperation: IEEE CS
Publisher: IEEE Press

Author Tags

1. SpMM
2. just-in-time instruction generation
3. performance profiling
4. performance optimization

Conference: CGO '24
Overall Acceptance Rate: 312 of 1,061 submissions, 29%

Article Metrics

• Total Citations: 0
• Total Downloads: 5
• Downloads (last 12 months): 5
• Downloads (last 6 weeks): 4

Reflects downloads up to 14 Nov 2024.