research-article

Vectorized Parallel Sparse Matrix-Vector Multiplication in PETSc Using AVX-512

Authors:

Richard T. Mills,

Barry F. SmithAuthors Info & Claims

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

Article No.: 55, Pages 1 - 10

https://doi.org/10.1145/3225058.3225100

Published: 13 August 2018 Publication History

Abstract

Emerging many-core CPU architectures with high degrees of single-instruction, multiple data (SIMD) parallelism promise to enable increasingly ambitious simulations based on partial differential equations (PDEs) via extreme-scale computing. However, such architectures present several challenges to their efficient use. Here, we explore the efficient implementation of sparse matrix-vector (SpMV) multiplications---a critical kernel for the workhorse iterative linear solvers used in most PDE-based simulations---on recent CPU architectures from Intel as well as the second-generation Knights Landing Intel Xeon Phi, which features many CPU cores, wide SIMD lanes, and on-package high-bandwidth memory. Traditional SpMV algorithms use compressed sparse row storage format, which is a hindrance to exploiting wide SIMD lanes. We study alternative matrix formats and present an efficient optimized SpMV kernel, based on a sliced ELLPACK representation, which we have implemented in the PETSc library. In addition, we demonstrate the benefit of using this representation to accelerate preconditioned iterative solvers in realistic PDE-based simulations in parallel.

References

[1]

Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Dave A. May, Lois Curfman McInnes, Karl Rupp, Patrick Sanan, Barry F. Smith, Stefano Zampini, Hong Zhang, and Hong Zhang. 2017. PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.8. Argonne National Laboratory. http://www.mcs.anl.gov/petsc

[2]

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, and John Dennis. 2017. Evaluating and optimizing the NERSC workload on Knights Landing. In Proceedings of PMBS 2016: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, St. 43--53.

Digital Library

[3]

Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09. Bell2009, 1.

Digital Library

[4]

Wei Cao, Yao Lu, Zongzhe Li, Yongxian Wang, and Zhenghua Wang. 2010. Implementing Sparse Matrix-Vector multiplication using CUDA based on a hybrid sparse matrix format. In ICCASM 2010 - 2010 International Conference on Computer Application and System Modeling, Proceedings, Vol. 11. IEEE, V11-161--V11-165.

[5]

Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs. ACM SIGPLAN Notices 45, 5 (may 2010), 115.

Digital Library

[6]

Edmond Chow. 2001. Parallel Implementation and Practical Use of Sparse Approximate Inverse Preconditioners with a Priori Sparsity Patterns. The International Journal of High Performance Computing Applications 15, 1 (feb 2001), 56--74.

Digital Library

[7]

Eduardo F. D'Azevedo, Mark R. Fahe, and Richard T. Mills. 2005. Vectorized sparse matrix multiply for compressed row storage format. {ICCS'05} Computational ScienceâĂŞICCS 2005 3514/2005 (2005), 99--106.

Digital Library

[8]

Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean Luc Vay, and Henri Vincenti. 2016. Applying the roofline performance model to the Intel Xeon Phi Knights Landing processor. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 9945 LNCS. 339--353.

[9]

Georgios Goumas, Kornilios Kourtis, Nikos Anastopoulos, Vasileios Karakasis, and Nectarios Koziris. 2009. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. Journal of Supercomputing 50, 1 (2009), 36--77.

Digital Library

[10]

Willem Hundsdorfer and Jan G. Verwer. 2003. Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations. Number 33 in Springer series in computational mathematics. Springer.

[11]

Eun-Jin Im and Katherine Yelick. 1999. Optimizing Sparse Matrix Vector Multiplication on SMPs. In In Ninth SIAM Conference on Parallel Processing for Scientific Computing.

[12]

Eun-Jin Im, Katherine Yelick, and Richard Vuduc. 2004. Sparsity: Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications 18, 1 (2004), 135--158.

Digital Library

[13]

Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, Achim Basermann, and Alan R. Bishop. 2012. Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012. 1696--1702.

Digital Library

[14]

Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SIAM Journal on Scientific Computing 36, 5 (2014), C401-C423. arXiv:1307.6209

Digital Library

[15]

Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13 (2013), 273.

Digital Library

[16]

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Nicholas J. Wright, Mary W. Hall, and Leonid Oliker. 2015. Roofline model toolkit: A practical tool for architectural and program analysis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8966. 129--148.

[17]

Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. 2010. Automatically tuning sparse matrix-vector multiplication for GPU architectures. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 5952 LNCS. 111--125.

Digital Library

[18]

Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick. 2007. When cache blocking of sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communications and Computing 18, 3 (2007), 297--311.

[19]

Scott Parker, Vitali Morozov, Sudheer Chunduri, Kevin Harms, Chris Knight, and Kalyan Kumaran. 2017. Early Evaluation of the Cray XC40 Xeon Phi System 'Theta' at Argonne. Technical Report. Argonne National Laboratory.

[20]

John E. Pearson. 1993. Complex Patterns in a Simple System. Science 261, 5118 (1993), 189--192.

[21]

Ali Pinar and Michael T. Heath. 1999. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing 99. ACM Press, New York, New York, USA, 30-es.

Digital Library

[22]

Erik Saule, Kamer Kaya, and Umit V. Catalyurek. 2013. Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8384 LNCS. 559--570. arXiv:arXiv:1302.1078v1

[23]

Francisco Vázquez, José-Jesús Fernández, and Ester M. Garzón. 2011. A new approach for sparse matrix vector product on NVIDIA GPUs. Concurrency Computation Practice and Experience 23, 8 (2011), 815--826. arXiv:arXiv:1302.5679v1

Digital Library

[24]

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput. 35, 3 (2009), 178--194.

Digital Library

[25]

Jianfei Zhang and Lei Zhang. 2013. Efficient CUDA polynomial preconditioned conjugate gradient solver for finite element computation of elasticity problems. Mathematical Problems in Engineering 2013 (2013).

[26]

Cong Zheng, Shuo Gu, Tong Xiang Gu, Bing Yang, and Xing Ping Liu. 2014. BiELL: A bisection ELLPACK-based storage format for optimizing SpMV on GPUs. J. Parallel and Distrib. Comput. 74, 7 (jul 2014), 2639--2647.

Cited By

Marques DSousa LIlic A(2024)SpEpistasis: a sparse approach for three-way epistasis detectionJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104989(104989)Online publication date: Sep-2024
https://doi.org/10.1016/j.jpdc.2024.104989
Xiao GYin CZhou TLi XChen YLi K(2023)A Survey of Accelerating Parallel Sparse Linear AlgebraACM Computing Surveys10.1145/360460656:1(1-38)Online publication date: 28-Aug-2023
https://dl.acm.org/doi/10.1145/3604606
Scheffler PZaruba FSchuiki FHoefler TBenini L(2023)Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear AlgebraIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332202934:12(3147-3161)Online publication date: Dec-2023
https://doi.org/10.1109/TPDS.2023.3322029
Show More Cited By

Recommendations

An implementation of matrix---matrix multiplication on the Intel KNL processor with AVX-512

The second generation Intel Xeon Phi processor codenamed Knights Landing (KNL) have recently emerged with 2D tile mesh architecture and the Intel AVX-512 instructions. However, it is very difficult for general users to get the maximum performance from ...
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores

Achieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores

Achieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

August 2018

945 pages

ISBN:9781450365109

DOI:10.1145/3225058

Copyright © 2018 ACM.

© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

University of Oregon: University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and and the Exascale Computing Project

Conference

ICPP 2018

ICPP 2018: 47th International Conference on Parallel Processing

August 13 - 16, 2018

OR, Eugene, USA

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;

Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
511
Total Downloads

Downloads (Last 12 months)42
Downloads (Last 6 weeks)2

Reflects downloads up to 13 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Marques DSousa LIlic A(2024)SpEpistasis: a sparse approach for three-way epistasis detectionJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104989(104989)Online publication date: Sep-2024
https://doi.org/10.1016/j.jpdc.2024.104989
Xiao GYin CZhou TLi XChen YLi K(2023)A Survey of Accelerating Parallel Sparse Linear AlgebraACM Computing Surveys10.1145/360460656:1(1-38)Online publication date: 28-Aug-2023
https://dl.acm.org/doi/10.1145/3604606
Scheffler PZaruba FSchuiki FHoefler TBenini L(2023)Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear AlgebraIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332202934:12(3147-3161)Online publication date: Dec-2023
https://doi.org/10.1109/TPDS.2023.3322029
Flynn PYi XYan Y(2022)Exploring source-to-source compiler transformation of OpenMP SIMD constructs for Intel AVX and Arm SVE vector architecturesProceedings of the Thirteenth International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3528425.3529100(11-20)Online publication date: 2-Apr-2022
https://dl.acm.org/doi/10.1145/3528425.3529100
Xiao GLong ZChen YQin YWu F(2022)Row-Segmented Sparse-Dense Matrix Matrix Multiplication on GPUs2022 IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta)10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00074(376-383)Online publication date: Dec-2022
https://doi.org/10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00074
Pirahandeh MKim D(2022)MS schedulerFuture Generation Computer Systems10.1016/j.future.2021.08.002126:C(123-135)Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1016/j.future.2021.08.002
Gao JJi WLiu JShao SWang YShi F(2021)AMF-CSR: Adaptive Multi-Row Folding of CSR for SpMV on GPU2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS53394.2021.00058(418-425)Online publication date: Dec-2021
https://doi.org/10.1109/ICPADS53394.2021.00058
Bian HHuang JDong RGuo YLiu LHuang DWang X(2021)A simple and efficient storage format for SIMD-accelerated SpMVCluster Computing10.1007/s10586-021-03340-124:4(3431-3448)Online publication date: 1-Dec-2021
https://dl.acm.org/doi/10.1007/s10586-021-03340-1
Xiao GChen YLiu CZhou X(2020)ahSpMV: An Autotuning Hybrid Computing Scheme for SpMV on the Sunway ArchitectureIEEE Internet of Things Journal10.1109/JIOT.2019.29472577:3(1736-1744)Online publication date: Mar-2020
https://doi.org/10.1109/JIOT.2019.2947257
Kataoka HYamashita KIto YNakano KKasagi ATabaru T(2020)An Efficient Multicore CPU Implementation for Convolution-Pooling Computation in CNNs2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW50202.2020.00097(548-556)Online publication date: May-2020
https://doi.org/10.1109/IPDPSW50202.2020.00097
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten