Designing and dynamically load balancing hybrid LU for multi/many-core

Published: 01 June 2011

Abstract

Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly tuned BLAS subroutines, hiding communication latency, and balancing the load across devices of varying processing capability. In this paper we show how single-precision LU factorization is accelerated on the Intel® MIC (Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor and Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on Knights Ferry (KNF), Intel's first silicon implementation of the Intel MIC architecture. It takes full advantage of the multiple levels of the memory hierarchy on MIC and utilizes up to 80% of its peak compute capability. Our LU factorization exceeds 570 Gflop/s, including matrix transfer overhead, when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of the LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementations is dynamic resource partitioning, which improves load balance across the entire system.
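The abstract and the author tags name a right-looking LU factorization with partial pivoting, built from a panel factorization and an SGEMM-dominated trailing update. The sketch below illustrates that textbook structure in plain Python; it is not the authors' implementation. The function names `blocked_lu` and `lu_gflops`, the block size `nb`, and the use of the conventional leading-order flop count of 2n³/3 for LU are assumptions made for illustration only; the abstract does not state how its Gflop/s figures were counted.

```python
def blocked_lu(a, nb=2):
    """Blocked right-looking LU with partial pivoting (illustrative sketch).

    `a` is a square matrix as a list of row lists and is overwritten with
    the unit-lower factor L (below the diagonal) and U (on and above it).
    Returns `perm`, where row i of the factored matrix corresponds to
    row perm[i] of the original matrix.
    """
    n = len(a)
    perm = list(range(n))
    for k in range(0, n, nb):
        end = min(k + nb, n)
        # Step 1: panel factorization (unblocked LU on columns k..end-1).
        for j in range(k, end):
            p = max(range(j, n), key=lambda i: abs(a[i][j]))
            if a[p][j] == 0.0:
                raise ValueError("matrix is singular")
            if p != j:
                # Swapping whole rows applies the pivot to all columns.
                a[j], a[p] = a[p], a[j]
                perm[j], perm[p] = perm[p], perm[j]
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, end):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: triangular solve, U12 = inverse(L11) * A12.
        for j in range(k, end):
            for i in range(j + 1, end):
                for c in range(end, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 3: trailing update, A22 -= L21 * U12 (the GEMM-shaped step
        # that dominates the run time for large matrices).
        for i in range(end, n):
            for c in range(end, n):
                s = 0.0
                for j in range(k, end):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
    return perm


def lu_gflops(n, seconds):
    """Gflop/s for an n-by-n LU, assuming the usual 2*n^3/3 flop count."""
    return (2.0 * n ** 3 / 3.0) / seconds / 1e9
```

In a hybrid setting the three steps have different characters: the panel factorization is latency-bound and largely sequential, while the trailing update is a large matrix multiply that parallelizes well. That contrast is what makes dividing the work between a host and a coprocessor, and rebalancing that division as the trailing matrix shrinks, a natural design question.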

Published In

Computer Science - Research and Development, Volume 26, Issue 3-4, June 2011, 186 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. Dense linear algebra
2. High performance computing
3. Hybrid architecture
4. Intel MIC architecture
5. LU factorization
6. Many-core architecture
7. Panel factorization
8. Partial pivoting
9. Right looking
10. SGEMM
