Abstract
Dense lower–upper (LU) factorization (hereafter, LU) is a critical kernel widely used to solve dense linear algebra problems. Hybrid LU algorithms have been carefully designed to exploit the full capacity of heterogeneous systems. However, existing heterogeneous implementations are typically CPU-centric: they rely heavily on CPU cores and incur a large volume of data transfers over the PCIe bus, thereby reducing the energy efficiency of the entire system. In this paper, we present a coprocessor-resident implementation of LU for a heterogeneous platform that improves energy efficiency by relieving the CPUs of compute-heavy work and avoiding excessive data transfers over PCIe. To maintain performance, we pipeline the CPU computation, coprocessor computation, MPI communication, and PCIe transfers between the CPUs and coprocessors. Experiments on the Tianhe-2 supercomputer show that our LU implementation matches the performance of the highly optimized Intel MKL implementation while overcoming its limitations in energy efficiency.
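The core kernel the abstract refers to is blocked right-looking LU with partial pivoting: factor a narrow panel, apply a triangular solve to the block row to its right, then update the trailing matrix with a matrix multiply (the GEMM-dominated step that is offloaded to the coprocessor). As a minimal illustration only — not the authors' Tianhe-2 implementation, and with an arbitrary block size `nb` — the blocked structure can be sketched as:

```python
import numpy as np

def blocked_lu(A, nb=32):
    """Blocked right-looking LU with partial pivoting (illustrative sketch).

    Returns (LU, piv): LU packs the unit-lower factor L (below the
    diagonal) and the upper factor U; piv maps factored rows back to
    rows of the original matrix, i.e. L @ U == A[piv].
    """
    A = A.copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Panel factorization: unblocked LU on the tall panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + np.argmax(np.abs(A[j:, j]))   # partial pivot search
            if p != j:
                A[[j, p], :] = A[[p, j], :]       # swap full rows
                piv[[j, p]] = piv[[p, j]]
            A[j+1:, j] /= A[j, j]                 # scale L column
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        # 2. Triangular solve: U12 = L11^{-1} @ A12
        L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
        A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
        # 3. Trailing matrix update (GEMM): A22 -= L21 @ U12
        A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A, piv
```

In the coprocessor-resident scheme described above, step 3 is where nearly all the floating-point work lives, which is why pipelining it against the panel factorization, MPI communication, and PCIe transfers of the next iteration is what preserves performance.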
Acknowledgements
This work was supported by the National High Technology R&D Program of China (863 Program) under Grant 2015AA01A301, and by the National Natural Science Foundation of China (NSFC) under Grants 61402488 and 61602501.
Cite this article
Chen, C., Fang, J., Tang, T. et al. LU factorization on heterogeneous systems: an energy-efficient approach towards high performance. Computing 99, 791–811 (2017). https://doi.org/10.1007/s00607-016-0537-2