Abstract
Dense lower–upper (LU) factorization (hereafter, LU) is a critical kernel widely used to solve dense linear algebra problems. Hybrid LU algorithms have been carefully designed to exploit the full capacity of heterogeneous systems. However, existing heterogeneous implementations are typically CPU-centric: they rely heavily on CPU cores and incur a large volume of data transfers over the PCIe bus, thereby reducing the energy efficiency of the entire system. In this paper, we present a coprocessor-resident implementation of LU for a heterogeneous platform that improves energy efficiency by relieving the CPUs of compute-heavy work and avoiding excessive data transfers over PCIe. To maintain performance, we pipeline the CPU computation, coprocessor computation, MPI communication, and PCIe transfers between the CPUs and coprocessors. Experiments on the Tianhe-2 supercomputer show that our LU implementation matches the performance of the highly optimized Intel MKL implementation while overcoming its limitations in energy efficiency.
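The core kernel the abstract refers to is blocked right-looking LU with partial pivoting: factor a narrow panel, apply a triangular solve to the block row to its right, then update the trailing matrix with a matrix multiply (the GEMM-dominated step that is offloaded to the coprocessor). As a minimal illustration only — not the authors' Tianhe-2 implementation, and with an arbitrary block size `nb` — the blocked structure can be sketched as:

```python
import numpy as np

def blocked_lu(A, nb=32):
    """Blocked right-looking LU with partial pivoting (illustrative sketch).

    Returns (LU, piv): LU packs the unit-lower factor L (below the
    diagonal) and the upper factor U; piv maps factored rows back to
    rows of the original matrix, i.e. L @ U == A[piv].
    """
    A = A.copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Panel factorization: unblocked LU on the tall panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + np.argmax(np.abs(A[j:, j]))   # partial pivot search
            if p != j:
                A[[j, p], :] = A[[p, j], :]       # swap full rows
                piv[[j, p]] = piv[[p, j]]
            A[j+1:, j] /= A[j, j]                 # scale L column
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        # 2. Triangular solve: U12 = L11^{-1} @ A12
        L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
        A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
        # 3. Trailing matrix update (GEMM): A22 -= L21 @ U12
        A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A, piv
```

In the coprocessor-resident scheme described above, step 3 is where nearly all the floating-point work lives, which is why pipelining it against the panel factorization, MPI communication, and PCIe transfers of the next iteration is what preserves performance.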
Acknowledgements
This work was supported by the National High Technology R&D Program of China (863 Program) under Grant 2015AA01A301, and by the National Natural Science Foundation of China (NSFC) under Grants 61402488 and 61602501.
Cite this article
Chen, C., Fang, J., Tang, T. et al. LU factorization on heterogeneous systems: an energy-efficient approach towards high performance. Computing 99, 791–811 (2017). https://doi.org/10.1007/s00607-016-0537-2