Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

Designing and dynamically load balancing hybrid LU for multi/many-core

  • Special Issue Paper
  • Published:
Computer Science - Research and Development

Abstract

Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly-tuned BLAS subroutines, hiding communication latency and balancing the load across devices of variable processing capabilities. In this paper we show how single-precision LU factorization is accelerated on Intel® MIC(Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor and Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on Intel’s first implementation of Intel MIC architecture [codenamed Knight’s Ferry (KNF)] silicon platform. Our implementation takes full advantage of multiple levels of memory hierarchy on MIC, and successfully utilizes up to 80% of its peak compute capability. Our LU factorization performance exceeds 570 Gflop/s including matrix transfer overhead when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementations is dynamic resource partitioning to improve load balance across the entire system.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Agullo E, Demmel J, Dongarra J, Hadri B, Kurzak J, Langou J, Ltaief H, Luszczek P, Tomov S (2009) Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J Phys 180(1)

  2. Barrachina S, Castillo M, Igual FD, Mayo R, Quintana-Ort ES (2008) Solving dense linear systems on graphics processors. In: Proc Euro-par conference on parallel processing, pp 739–748

    Google Scholar 

  3. Buttari A, Langou J, Kurzak J, Dongarra J (2007) A class of parallel tiled linear algebra algorithms for multicore architectures. In: LAPACK working note 191, pp 1–19

    Google Scholar 

  4. Demmel J, Grigori L, Xiang H (2010) CALU: a communication optimal lu factorization algorithm

  5. Dongarra JJ, Duff IS, Sorensen DC, van der Vorst HA (1987) Numerical linear algebra for high-performance computers. Society for Industrial Mathematics, Philadelphia

    Google Scholar 

  6. Golub GH, Loan CFV (1996) Matrix computations. The Johns Hopkins University Press, Baltimore

    MATH  Google Scholar 

  7. Gunnels JA, Gustavson FG, Henry GM, van de Geijn RA (2001) FLAME: formal linear algebra methods environment. ACM Trans Math Softw 27(4):422–455

    Article  MATH  Google Scholar 

  8. Humphrey JR, Price DK, Spagnoli KE, Paolini AL, Kelmelis EJ (2010) CULA: hybrid GPU accelerated linear algebra routines. In: Society of photo-optical instrumentation engineers (SPIE) conference series, vol 7705

    Google Scholar 

  9. Intel (2009) Intel(R) Math kernel library reference manual. Intel Corporation

  10. McIntosh-Smith S, Irwin J (2007) The best of both worlds: delivering aggregated performance for high-performance math libraries in accelerated systems. In: Proc 2007 international supercomputing conference

    Google Scholar 

  11. Tomov S (2011) MAGMA 1.0—LAPACK for GPUs. ICL Lunch Talk. http://tinyurl.com/68rz3qk

  12. Tomov S, Dongarra J, Baboulin M (2008) Towards dense linear algebra for hybrid GPU accelerated manycore systems. http://www.netlib.org/lapack/lawnspdf/lawn210.pdf

  13. Tomov S, Nath R, Du P, Dongarra J (2010) MAGMA version 1.0rc2. http://icl.cs.utk.edu/magma

  14. Viterbi A (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13(2):260–269

    Article  MATH  Google Scholar 

  15. Volkov V, Demmel JW (2008) Benchmarking GPUs to tune dense linear algebra. In: Proc ACM/IEEE conf supercomputing, pp 1–11

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Michael Deisher.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Deisher, M., Smelyanskiy, M., Nickerson, B. et al. Designing and dynamically load balancing hybrid LU for multi/many-core. Comput Sci Res Dev 26, 211–220 (2011). https://doi.org/10.1007/s00450-011-0169-x

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00450-011-0169-x

Keywords

Navigation