Designing and dynamically load balancing hybrid LU for multi/many-core

Authors:

Michael Deisher,

Mikhail Smelyanskiy,

Brian Nickerson,

Michael Chuvelev,

Pradeep Dubey

Computer Science - Research and Development, Volume 26, Issue 3-4

Pages 211 - 220

https://doi.org/10.1007/s00450-011-0169-x

Published: 01 June 2011

Abstract

Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly tuned BLAS subroutines, hiding communication latency, and balancing the load across devices of varying processing capability. In this paper we show how single-precision LU factorization is accelerated on the Intel® MIC (Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor and Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on the first silicon implementation of the Intel MIC architecture, codenamed Knights Ferry (KNF). Our implementation takes full advantage of the multiple levels of the memory hierarchy on MIC and utilizes up to 80% of its peak compute capability. Our LU factorization performance exceeds 570 Gflop/s, including matrix transfer overhead, when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of the LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementations is dynamic resource partitioning to improve load balance across the entire system.
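The abstract names dynamic resource partitioning without spelling out the scheme. As a rough illustration only, and not the authors' implementation, the following Python sketch shows one way the idea could look: a blocked right-looking LU factorization with partial pivoting in which the trailing-matrix update is split by columns between two workers, and the split is re-estimated after every step from measured time. Everything here is an assumption made for illustration: the function names (`lu_blocked_hybrid`, `update_columns`, `rebalance`, `max_residual`), the halfway smoothing rule, and the fact that both "workers" run one after the other on the same CPU rather than on a host and a coprocessor.

```python
import time


def update_columns(a, k0, k1, lo, hi):
    """Apply the update for panel k0..k1-1 to trailing columns lo..hi-1 (one worker's share)."""
    n = len(a)
    for j in range(lo, hi):
        # Forward substitution with the unit-lower panel block gives U[k0:k1, j].
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                a[i][j] -= a[i][k] * a[k][j]
        # Schur-complement update of the rows below the panel.
        for i in range(k1, n):
            a[i][j] -= sum(a[i][k] * a[k][j] for k in range(k0, k1))


def rebalance(share, cols0, cols1, t0, t1):
    """Move the share halfway toward the split suggested by measured columns per second."""
    if min(cols0, cols1) == 0 or min(t0, t1) <= 0.0:
        return share  # not enough information to re-estimate
    rate0, rate1 = cols0 / t0, cols1 / t1
    return 0.5 * share + 0.5 * rate0 / (rate0 + rate1)


def lu_blocked_hybrid(a, nb=2, share=0.5):
    """Factor the list-of-lists matrix a in place; return (perm, shares).

    perm[i] is the original index of the row now at position i, and shares
    records the fraction of trailing columns given to worker 0 at each step.
    """
    n = len(a)
    perm = list(range(n))
    shares = []
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Panel factorization with partial pivoting; swaps act on whole rows.
        for k in range(k0, k1):
            p = max(range(k, n), key=lambda i: abs(a[i][k]))
            if a[p][k] == 0.0:
                raise ZeroDivisionError("matrix is singular")
            if p != k:
                a[k], a[p] = a[p], a[k]
                perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        if k1 == n:
            break
        # Split the trailing columns between the two workers and time each part.
        cols = n - k1
        cut = k1 + min(cols, max(0, round(share * cols)))
        shares.append(share)
        elapsed = []
        for lo, hi in ((k1, cut), (cut, n)):
            start = time.perf_counter()
            update_columns(a, k0, k1, lo, hi)
            elapsed.append(time.perf_counter() - start)
        share = rebalance(share, cut - k1, n - cut, elapsed[0], elapsed[1])
    return perm, shares


def max_residual(a0, lu, perm):
    """Largest |(L U)[i][j] - a0[perm[i]][j]| for the packed factors stored in lu."""
    n = len(lu)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = sum((1.0 if k == i else lu[i][k]) * lu[k][j]
                    for k in range(min(i, j) + 1))
            worst = max(worst, abs(s - a0[perm[i]][j]))
    return worst
```

To try it, copy a nonsingular matrix, call `lu_blocked_hybrid` on the copy, and check that `max_residual(original, copy, perm)` is close to zero. In a real hybrid system the two column ranges would run concurrently on different devices, and the measured times would also include data-transfer cost; this sketch models neither.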


Published In

Computer Science - Research and Development Volume 26, Issue 3-4

June 2011

186 pages

ISSN: 1865-2034

Copyright © 2011 Springer-Verlag.

Publisher

Springer-Verlag

Berlin, Heidelberg
