Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3225058.3225100acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

Vectorized Parallel Sparse Matrix-Vector Multiplication in PETSc Using AVX-512

Published: 13 August 2018 Publication History

Abstract

Emerging many-core CPU architectures with high degrees of single-instruction, multiple data (SIMD) parallelism promise to enable increasingly ambitious simulations based on partial differential equations (PDEs) via extreme-scale computing. However, such architectures present several challenges to their efficient use. Here, we explore the efficient implementation of sparse matrix-vector (SpMV) multiplications---a critical kernel for the workhorse iterative linear solvers used in most PDE-based simulations---on recent CPU architectures from Intel as well as the second-generation Knights Landing Intel Xeon Phi, which features many CPU cores, wide SIMD lanes, and on-package high-bandwidth memory. Traditional SpMV algorithms use compressed sparse row storage format, which is a hindrance to exploiting wide SIMD lanes. We study alternative matrix formats and present an efficient optimized SpMV kernel, based on a sliced ELLPACK representation, which we have implemented in the PETSc library. In addition, we demonstrate the benefit of using this representation to accelerate preconditioned iterative solvers in realistic PDE-based simulations in parallel.

References

[1]
Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Dave A. May, Lois Curfman McInnes, Karl Rupp, Patrick Sanan, Barry F. Smith, Stefano Zampini, Hong Zhang, and Hong Zhang. 2017. PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.8. Argonne National Laboratory. http://www.mcs.anl.gov/petsc
[2]
Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, and John Dennis. 2017. Evaluating and optimizing the NERSC workload on Knights Landing. In Proceedings of PMBS 2016: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, St. 43--53.
[3]
Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09. Bell2009, 1.
[4]
Wei Cao, Yao Lu, Zongzhe Li, Yongxian Wang, and Zhenghua Wang. 2010. Implementing Sparse Matrix-Vector multiplication using CUDA based on a hybrid sparse matrix format. In ICCASM 2010 - 2010 International Conference on Computer Application and System Modeling, Proceedings, Vol. 11. IEEE, V11-161--V11-165.
[5]
Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs. ACM SIGPLAN Notices 45, 5 (may 2010), 115.
[6]
Edmond Chow. 2001. Parallel Implementation and Practical Use of Sparse Approximate Inverse Preconditioners with a Priori Sparsity Patterns. The International Journal of High Performance Computing Applications 15, 1 (feb 2001), 56--74.
[7]
Eduardo F. D'Azevedo, Mark R. Fahe, and Richard T. Mills. 2005. Vectorized sparse matrix multiply for compressed row storage format. {ICCS'05} Computational ScienceâĂŞICCS 2005 3514/2005 (2005), 99--106.
[8]
Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean Luc Vay, and Henri Vincenti. 2016. Applying the roofline performance model to the Intel Xeon Phi Knights Landing processor. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 9945 LNCS. 339--353.
[9]
Georgios Goumas, Kornilios Kourtis, Nikos Anastopoulos, Vasileios Karakasis, and Nectarios Koziris. 2009. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. Journal of Supercomputing 50, 1 (2009), 36--77.
[10]
Willem Hundsdorfer and Jan G. Verwer. 2003. Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations. Number 33 in Springer series in computational mathematics. Springer.
[11]
Eun-Jin Im and Katherine Yelick. 1999. Optimizing Sparse Matrix Vector Multiplication on SMPs. In In Ninth SIAM Conference on Parallel Processing for Scientific Computing.
[12]
Eun-Jin Im, Katherine Yelick, and Richard Vuduc. 2004. Sparsity: Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications 18, 1 (2004), 135--158.
[13]
Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, Achim Basermann, and Alan R. Bishop. 2012. Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012. 1696--1702.
[14]
Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SIAM Journal on Scientific Computing 36, 5 (2014), C401-C423. arXiv:1307.6209
[15]
Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13 (2013), 273.
[16]
Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Nicholas J. Wright, Mary W. Hall, and Leonid Oliker. 2015. Roofline model toolkit: A practical tool for architectural and program analysis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8966. 129--148.
[17]
Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. 2010. Automatically tuning sparse matrix-vector multiplication for GPU architectures. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 5952 LNCS. 111--125.
[18]
Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick. 2007. When cache blocking of sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communications and Computing 18, 3 (2007), 297--311.
[19]
Scott Parker, Vitali Morozov, Sudheer Chunduri, Kevin Harms, Chris Knight, and Kalyan Kumaran. 2017. Early Evaluation of the Cray XC40 Xeon Phi System 'Theta' at Argonne. Technical Report. Argonne National Laboratory.
[20]
John E. Pearson. 1993. Complex Patterns in a Simple System. Science 261, 5118 (1993), 189--192.
[21]
Ali Pinar and Michael T. Heath. 1999. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing 99. ACM Press, New York, New York, USA, 30-es.
[22]
Erik Saule, Kamer Kaya, and Umit V. Catalyurek. 2013. Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8384 LNCS. 559--570. arXiv:arXiv:1302.1078v1
[23]
Francisco Vázquez, José-Jesús Fernández, and Ester M. Garzón. 2011. A new approach for sparse matrix vector product on NVIDIA GPUs. Concurrency Computation Practice and Experience 23, 8 (2011), 815--826. arXiv:arXiv:1302.5679v1
[24]
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput. 35, 3 (2009), 178--194.
[25]
Jianfei Zhang and Lei Zhang. 2013. Efficient CUDA polynomial preconditioned conjugate gradient solver for finite element computation of elasticity problems. Mathematical Problems in Engineering 2013 (2013).
[26]
Cong Zheng, Shuo Gu, Tong Xiang Gu, Bing Yang, and Xing Ping Liu. 2014. BiELL: A bisection ELLPACK-based storage format for optimizing SpMV on GPUs. J. Parallel and Distrib. Comput. 74, 7 (jul 2014), 2639--2647.

Cited By

View all
  • (2024)SpEpistasis: a sparse approach for three-way epistasis detectionJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104989(104989)Online publication date: Sep-2024
  • (2023)A Survey of Accelerating Parallel Sparse Linear AlgebraACM Computing Surveys10.1145/360460656:1(1-38)Online publication date: 28-Aug-2023
  • (2023)Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear AlgebraIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332202934:12(3147-3161)Online publication date: Dec-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • University of Oregon: University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. PETSc
  2. Xeon Phi
  3. many-core
  4. parallel SpMV
  5. vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and and the Exascale Computing Project

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;
Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)42
  • Downloads (Last 6 weeks)2
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)SpEpistasis: a sparse approach for three-way epistasis detectionJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104989(104989)Online publication date: Sep-2024
  • (2023)A Survey of Accelerating Parallel Sparse Linear AlgebraACM Computing Surveys10.1145/360460656:1(1-38)Online publication date: 28-Aug-2023
  • (2023)Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear AlgebraIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332202934:12(3147-3161)Online publication date: Dec-2023
  • (2022)Exploring source-to-source compiler transformation of OpenMP SIMD constructs for Intel AVX and Arm SVE vector architecturesProceedings of the Thirteenth International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3528425.3529100(11-20)Online publication date: 2-Apr-2022
  • (2022)Row-Segmented Sparse-Dense Matrix Matrix Multiplication on GPUs2022 IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta)10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00074(376-383)Online publication date: Dec-2022
  • (2022)MS schedulerFuture Generation Computer Systems10.1016/j.future.2021.08.002126:C(123-135)Online publication date: 1-Jan-2022
  • (2021)AMF-CSR: Adaptive Multi-Row Folding of CSR for SpMV on GPU2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS53394.2021.00058(418-425)Online publication date: Dec-2021
  • (2021)A simple and efficient storage format for SIMD-accelerated SpMVCluster Computing10.1007/s10586-021-03340-124:4(3431-3448)Online publication date: 1-Dec-2021
  • (2020)ahSpMV: An Autotuning Hybrid Computing Scheme for SpMV on the Sunway ArchitectureIEEE Internet of Things Journal10.1109/JIOT.2019.29472577:3(1736-1744)Online publication date: Mar-2020
  • (2020)An Efficient Multicore CPU Implementation for Convolution-Pooling Computation in CNNs2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW50202.2020.00097(548-556)Online publication date: May-2020
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media