DOI: 10.1145/2688500.2688534
Public Access

Towards batched linear solvers on accelerated hardware platforms

Published: 24 January 2015

Abstract

As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially on GPUs, which are currently about four to five times more energy efficient than multicore CPUs per floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) needed to process a set of small dense matrices in parallel. We refer to such algorithms as batched factorizations. Our approach represents the algorithms as a sequence of batched BLAS routines executed entirely on the GPU. This is similar in functionality to the LAPACK and hybrid MAGMA algorithms for large-matrix factorizations, but differs from the straightforward approach in which each of the GPU's streaming multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development of batched factorizations that achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library, we achieve up to a 2.5-fold speedup on the K40 GPU.
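For context, the batched-LU interface that the abstract's CUBLAS comparison refers to is cuBLAS's getrfBatched. The following is a minimal, illustrative sketch of invoking it on a batch of small matrices; the sizes (n = 32, batch = 1000) and the contiguous-slab memory layout are assumptions chosen for illustration, not the paper's configuration, and error handling is trimmed for brevity.

/* Sketch (not from the paper): LU-factorize a batch of small n x n
 * matrices in a single GPU-resident call via cublasDgetrfBatched. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 32;        /* each matrix is n x n -- small by design */
    const int batch = 1000;  /* number of independent problems          */

    /* One contiguous slab holding all matrices, plus a device-resident
     * array of pointers to each matrix: the layout getrfBatched expects. */
    double *slab;
    cudaMalloc((void **)&slab, sizeof(double) * n * n * batch);

    double **hA = malloc(sizeof(double *) * batch);
    for (int i = 0; i < batch; ++i)
        hA[i] = slab + (size_t)i * n * n;

    double **dA;
    cudaMalloc((void **)&dA, sizeof(double *) * batch);
    cudaMemcpy(dA, hA, sizeof(double *) * batch, cudaMemcpyHostToDevice);

    int *dPiv, *dInfo;                      /* pivots: n ints per matrix */
    cudaMalloc((void **)&dPiv,  sizeof(int) * n * batch);
    cudaMalloc((void **)&dInfo, sizeof(int) * batch);

    /* ... fill `slab` with the matrices to be factored (omitted) ... */

    cublasHandle_t h;
    cublasCreate(&h);

    /* Factor all `batch` matrices with one call; the GPU schedules the
     * independent factorizations across its streaming multiprocessors. */
    cublasStatus_t st = cublasDgetrfBatched(h, n, dA, n, dPiv, dInfo, batch);
    printf("getrfBatched: %s\n",
           st == CUBLAS_STATUS_SUCCESS ? "ok" : "failed");

    cublasDestroy(h);
    cudaFree(dInfo); cudaFree(dPiv); cudaFree(dA); cudaFree(slab);
    free(hA);
    return 0;
}

The paper's contribution is, in effect, composing whole LU, QR, and Cholesky factorizations out of batched BLAS calls of this kind, rather than assigning one complete factorization per multiprocessor.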





Published In

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015, 290 pages
ISBN: 9781450332057
DOI: 10.1145/2688500

Also published in ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15)
August 2015, 290 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2858788
Editor: Andy Gill
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 January 2015

Author Tags

  1. batched factorization
  2. hardware accelerators
  3. numerical linear algebra
  4. numerical software libraries
  5. one-sided factorization algorithms

Qualifiers

  • Abstract

Conference

PPoPP '15

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions (23%)


Article Metrics

  • Downloads (last 12 months): 96
  • Downloads (last 6 weeks): 9

Reflects downloads up to 05 Jan 2025

Cited By

  • (2021) A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines. ACM Transactions on Mathematical Software, 47(3), 1–23. DOI: 10.1145/3431921. Online publication date: 26-Jun-2021.
  • (2019) Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Transactions on Mathematical Software, 45(2), 1–28. DOI: 10.1145/3267101. Online publication date: 3-May-2019.
  • (2018) A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations. IEEE Transactions on Parallel and Distributed Systems, 29(5), 973–984. DOI: 10.1109/TPDS.2017.2783929. Online publication date: 1-May-2018.
  • (2018) Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators. Proceedings of the IEEE, 106(11), 2040–2055. DOI: 10.1109/JPROC.2018.2868961. Online publication date: Nov-2018.
  • (2018) Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization. 2018 IEEE High Performance extreme Computing Conference (HPEC), 1–7. DOI: 10.1109/HPEC.2018.8547576. Online publication date: Sep-2018.
  • (2017) Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. Proceedings of the International Conference on Supercomputing, 1–10. DOI: 10.1145/3079079.3079103. Online publication date: 14-Jun-2017.
  • (2017) Autotuning batch Cholesky factorization in CUDA with interleaved layout of matrices. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1408–1417. DOI: 10.1109/IPDPSW.2017.18. Online publication date: May-2017.
  • (2016) Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs. IEEE Transactions on Parallel and Distributed Systems, 27(7), 2036–2048. DOI: 10.1109/TPDS.2015.2481890. Online publication date: 1-Jul-2016.
  • (2016) Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs. Procedia Computer Science, 80(C), 119–130. DOI: 10.1016/j.procs.2016.05.303. Online publication date: 1-Jun-2016.
  • (2015) On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors. High Performance Computing, 58–73. DOI: 10.1007/978-3-319-20119-1_5. Online publication date: 20-Jun-2015.
