CVR: efficient vectorization of SpMV on x86 processors

Authors:

Lixin Zhang

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

Pages 149 - 162

https://doi.org/10.1145/3168818

Published: 24 February 2018

Abstract

Sparse Matrix-vector Multiplication (SpMV) is an important computation kernel widely used in HPC and data centers. The irregularity of SpMV is a well-known challenge that limits SpMV's parallelism with vectorization operations. Existing work achieves limited locality and vectorization efficiency and incurs large preprocessing overheads. To address this issue, we present the Compressed Vectorization-oriented sparse Row (CVR), a novel SpMV representation targeting efficient vectorization. CVR simultaneously processes multiple rows of the input matrix to increase cache efficiency and distributes them across multiple SIMD lanes so as to take advantage of the vector processing units in modern processors. Our method is insensitive to the sparsity and irregularity of SpMV, and is thus able to deal with various scale-free and HPC matrices. We implement and evaluate CVR on an Intel Knights Landing processor and compare it with five state-of-the-art approaches using 58 scale-free and HPC sparse matrices. Experimental results show that CVR achieves a speedup of up to 1.70× (1.33× on average) and up to 1.57× (1.10× on average) over the best existing approaches for scale-free and HPC sparse matrices, respectively. Moreover, CVR typically incurs the lowest preprocessing overhead compared with state-of-the-art approaches.
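
To make the idea in the abstract concrete, the following is a minimal Python sketch, not the CVR format or the authors' implementation. It contrasts a baseline SpMV over the standard CSR layout with a scalar emulation of "one row per SIMD lane" processing, in which a lane that finishes its row immediately picks up the next unprocessed row. The function names (`spmv_csr`, `spmv_lanes`), the `lanes` parameter, and the refill policy are assumptions made for illustration only; the abstract does not describe how CVR lays out data or balances lanes.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Baseline CSR SpMV: y[i] = sum of vals[k] * x[col_idx[k]] over row i."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmv_lanes(row_ptr, col_idx, vals, x, lanes=4):
    """Scalar emulation of lane-per-row processing (hypothetical sketch).

    Each of `lanes` slots holds one row in flight. At every step each
    busy lane consumes one nonzero; a lane that finishes its row stores
    the result and claims the next unprocessed row.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    next_row = 0
    lane_row = [-1] * lanes    # row owned by each lane, -1 means idle
    lane_pos = [0] * lanes     # index of the lane's next nonzero
    lane_acc = [0.0] * lanes   # running partial sum of the lane's row

    def refill(lane):
        nonlocal next_row
        # Empty rows contribute 0 and y is already zero-initialised.
        while next_row < n_rows and row_ptr[next_row] == row_ptr[next_row + 1]:
            next_row += 1
        if next_row < n_rows:
            lane_row[lane] = next_row
            lane_pos[lane] = row_ptr[next_row]
            lane_acc[lane] = 0.0
            next_row += 1
        else:
            lane_row[lane] = -1

    for lane in range(lanes):
        refill(lane)

    while any(r >= 0 for r in lane_row):
        # On real hardware this inner loop would be one vector operation.
        for lane in range(lanes):
            r = lane_row[lane]
            if r < 0:
                continue
            k = lane_pos[lane]
            lane_acc[lane] += vals[k] * x[col_idx[k]]
            lane_pos[lane] = k + 1
            if lane_pos[lane] == row_ptr[r + 1]:
                y[r] = lane_acc[lane]
                refill(lane)
    return y


# 4x3 example matrix with one empty row:
# [[1, 0, 2], [0, 0, 0], [0, 3, 0], [4, 5, 6]]
ROW_PTR = [0, 2, 2, 3, 6]
COL_IDX = [0, 2, 1, 0, 1, 2]
VALS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X = [1.0, 2.0, 3.0]
```

Calling `spmv_lanes(ROW_PTR, COL_IDX, VALS, X, lanes=2)` should produce the same vector as `spmv_csr(ROW_PTR, COL_IDX, VALS, X)`. The point of the sketch is only that keeping every lane busy with its own row avoids the idle lanes that short or uneven rows cause in a straightforward row-at-a-time vectorization.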

Index Terms

CVR: efficient vectorization of SpMV on x86 processors
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
  2. Mathematical software
    1. Mathematical software performance

Published In

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

February 2018

377 pages

ISBN:9781450356176

DOI:10.1145/3179541

General Chairs: Jens Knoop (Vienna University of Technology, Austria), Markus Schordan (Lawrence Livermore National Laboratory, USA)

Program Chairs: Teresa Johnson (Google, USA), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Qualifiers

Research-article

Funding Sources

NSF

Conference

CGO '18

CGO '18: 16th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics

Total Citations: 46
Total Downloads: 1,402
Downloads (Last 12 months): 297
Downloads (Last 6 weeks): 50

Reflects downloads up to 10 Feb 2025

Cited By

Chen Y, Yu J (2024). Bitmap-Based Sparse Matrix-Vector Multiplication with Tensor Cores. Proceedings of the 53rd International Conference on Parallel Processing, 1135-1144. Online publication date: 12-Aug-2024.
https://dl.acm.org/doi/10.1145/3673038.3673055
Hong C, Wang Q, Mao R, Liang Y, Xia R, Liu J (2024). SaSpGEMM: Sorting-Avoiding Sparse General Matrix-Matrix Multiplication on Multi-Core Processors. Proceedings of the 53rd International Conference on Parallel Processing, 1166-1175. Online publication date: 12-Aug-2024.
https://dl.acm.org/doi/10.1145/3673038.3673054
Guo J, Xia R, Liu J, Zhu X, Zhang X (2024). CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. Proceedings of the 53rd International Conference on Parallel Processing, 640-649. Online publication date: 12-Aug-2024.
https://dl.acm.org/doi/10.1145/3673038.3673042
Xu L, Jia H, Zhang Y, Wang L, Jiang X, Mencagli G, Dazzi P, Lowenthal D, Badia R (2024). HAM-SpMSpV: an Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on multi-core CPUs. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 160-173. Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3625549.3658680
Shi Z, Zou Y, Song X, Li S, Liu F, Xue Q (2024). DyLaClass: Dynamic Labeling Based Classification for Optimal Sparse Matrix Format Selection in Accelerating SpMV. IEEE Transactions on Parallel and Distributed Systems 35:12, 2624-2639. Online publication date: Dec-2024.
https://doi.org/10.1109/TPDS.2024.3488053
Chen Y, Yu J (2024). Accelerating SpMV for Scale-Free Graphs with Optimized Bins. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2407-2420. Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00190
Jiang J, Huang J, Bian H (2023). GTLB: A Load-Balanced SpMV Computation Method on GPU. Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, 101-107. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3606043.3606057
Chen Y, Chung Y (2023). Connectivity-Aware Link Analysis for Skewed Graphs. Proceedings of the 52nd International Conference on Parallel Processing, 482-491. Online publication date: 7-Aug-2023.
https://dl.acm.org/doi/10.1145/3605573.3605579
Xiao G, Yin C, Zhou T, Li X, Chen Y, Li K (2023). A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56:1, 1-38. Online publication date: 28-Aug-2023.
https://dl.acm.org/doi/10.1145/3604606
Lu Y, Liu W, Mohror K, Arnold D, Badia R (2023). DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607051