Abstract
Designing a high-performance LU factorization for modern hybrid multi-/many-core systems requires highly tuned BLAS subroutines, hidden communication latency, and balanced load across devices of varying processing capability. In this paper we show how single-precision LU factorization is accelerated on the Intel® MIC (Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor plus Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on Intel's first Intel MIC silicon platform, codenamed Knights Ferry (KNF). The implementation takes full advantage of the multiple levels of the memory hierarchy on MIC and sustains up to 80% of its peak compute capability. Our LU factorization exceeds 570 Gflop/s, including matrix-transfer overhead, when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of the LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementation is dynamic resource partitioning to improve load balance across the entire system.
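To make the structure concrete, the right-looking blocked LU factorization underlying this kind of implementation can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions: pivoting is omitted for clarity, the block size `nb` and function name are hypothetical, and the trailing-matrix update marked below is the SGEMM-dominated step that the paper's tuned kernels (and the host/coprocessor work split) would accelerate.

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Right-looking blocked LU without pivoting (illustrative sketch only).

    Returns a matrix holding the unit-lower factor L (strictly below the
    diagonal) and the upper factor U (on and above the diagonal).
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Panel factorization: unblocked LU on the column panel A[k:n, k:k+b].
        for j in range(k, k + b):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + b] -= np.outer(A[j + 1:, j], A[j, j + 1:k + b])
        if k + b < n:
            # Triangular solve: U12 = L11^{-1} * A12.
            L11 = np.tril(A[k:k + b, k:k + b], -1) + np.eye(b)
            A[k:k + b, k + b:] = np.linalg.solve(L11, A[k:k + b, k + b:])
            # Trailing update: A22 -= L21 @ U12 -- the SGEMM-dominated step
            # that dominates the flop count and would be split dynamically
            # between host and coprocessor in a hybrid implementation.
            A[k + b:, k + b:] -= A[k + b:, k:k + b] @ A[k:k + b, k + b:]
    return A
```

Because almost all of the arithmetic lands in the trailing GEMM update, the achievable LU performance tracks SGEMM performance, which is why the abstract's near-1-Tflop/s SGEMM translates into the reported factorization rates.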
Deisher, M., Smelyanskiy, M., Nickerson, B. et al. Designing and dynamically load balancing hybrid LU for multi/many-core. Comput Sci Res Dev 26, 211–220 (2011). https://doi.org/10.1007/s00450-011-0169-x