GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems

Published: 20 June 2023

Abstract

The current trend of performance growth in HPC systems is accompanied by a massive increase in energy consumption. In this article, we introduce GreenMD, an energy-efficient framework for LU factorization on heterogeneous multi-GPU systems. LU factorization is a crucial kernel from the highly optimized MAGMA library. Our aim is to apply DVFS to this application by leveraging slacks intelligently on both CPUs and multiple GPUs. To predict the slack times, accurate performance models are developed separately for the CPUs and GPUs based on algorithmic knowledge and the manufacturer’s specifications. Since DVFS does not reduce static energy consumption, we also develop undervolting techniques for both CPUs and GPUs. Reducing voltage below threshold values may give rise to errors; hence, we extract the minimum safe voltages (\(V_{safeMin}\)) for the CPUs and GPUs utilizing a low-overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPU, and total energy by about 59%, 21%, and 31%, respectively, while delivering similar performance to the state-of-the-art linear algebra MAGMA library.

1 Introduction

HPC systems increasingly combine multicore processors with accelerators such as FPGAs and GPUs to boost the performance of scientific applications. GPUs, in particular, have been widely employed for HPC due to their extraordinarily high compute capability and easy programmability. High performance demands on the GPUs, on the other hand, have driven their designs toward large power consumption [23]. Because existing libraries are mainly concerned with performance, they do not make efficient use of heterogeneous computing systems, resulting in energy inefficiency. Hence, improving the energy efficiency of critical applications running on HPC systems is necessary to deliver better performance at a given power budget.
LU factorization is an algorithm used for solving dense linear algebra problems and is widely used in many scientific and engineering applications [21]. It is included in several popular linear algebra libraries, such as the Linear Algebra Package (LAPACK) [4] and the Linpack [26] benchmarks. Existing LU factorization implementations are concerned primarily with performance, ignoring the potential for energy savings that do not have a negative impact on performance. When LU factorization runs on a heterogeneous system equipped with GPUs, it divides the workload between the CPUs and GPUs. The CPU handles the panel factorization, which is serial in nature. The GPUs update the trailing matrix because that step involves a large amount of computation that can be highly parallelized. During the execution, either the CPU or the GPUs could be on non-critical paths and experience idle time, or slack. These slacks can be exploited for energy savings by exploring power-aware techniques such as Dynamic Voltage and Frequency Scaling (DVFS).
Over the last few years, significant efforts have been made to apply various techniques, such as DVFS [12, 22] and undervolting [34]. DVFS approaches have been employed to save energy during underutilized phases of the execution, called slack, on CPUs [2, 20, 28, 29] and GPUs [7, 13, 33, 34]. The amount of slack on different components can be estimated using algorithmic knowledge to determine the appropriate level of DVFS so that all the components finish execution at the same time. Hence, the main task is to accurately predict the slack during the execution so that the exact frequency can be computed and DVFS can be enabled. Prior research on LU factorization utilized the slack using a simple performance model in a single-GPU environment [7]. It profiled the application to determine the execution time of the first iteration and then estimated the slack for the next iterations based on the prior iteration. In contrast, we develop a more accurate performance model for heterogeneous systems with multiple GPUs, based on the amount of computation extracted from algorithmic knowledge, which needs neither a profiling phase nor its performance overhead. We also extend the DVFS technique to multiple GPUs and present both performance and energy saving results with two GPUs. We derive the execution times of a multicore CPU and multiple GPUs separately and verify the results through experiment. Then the amount of slack is determined at every iteration of the LU factorization, the frequency is calculated, and DVFS is enabled at runtime. Our measurement shows that DVFS reduces the energy consumption by 15%.
DVFS techniques do not target the static power consumption, which is becoming predominant in today’s technology. The static power can be reduced only through the voltage reduction, called undervolting. However, undervolting below the threshold value will introduce errors. Manufacturers specify large safety margins in the nominal frequency-voltage operating points of CPUs, up to 30% [9, 11]. As a result, they are inherently conservative and not energy efficient [5]. Leng et al. investigate the voltage guardband of the GPUs and observe that there is about a 20% voltage guardband on different GPU architectures [16]. There are many efforts to operate hardware at sub-nominal voltage levels on the CPUs [9, 11, 25, 37] and GPUs [16, 18, 32, 34]. However, most of this work focuses on applying undervolting in a conservative way, based on the worst voltage constraint across all workloads and running frequencies. This limits the potential gains since some applications can operate at a higher degree of undervolting. Hence, we empirically extract the minimum safe voltage (\(V_{safeMin}\)) of the underlying CPU/GPUs for various running frequencies while executing the LU factorization. Using the extracted \(V_{safeMin}\) for both CPUs and GPUs, we apply the undervolting during the execution of the LU factorization. It is shown that undervolting reduces energy consumption by 16% beyond DVFS.
This article presents GreenMD, a framework that improves the energy efficiency of heterogeneous multi-GPU systems while maintaining the reliability and performance of LU factorization. GreenMD is built on top of the LU factorization from the highly optimized linear algebra library MAGMA. First, we develop accurate performance models for CPU, GPU, and PCIe bus based on the algorithmic knowledge and underlying hardware details and verify them through rigorous experiments. Then we predict the slack and employ the DVFS on both the CPU and GPUs during the slack periods to save energy. GreenMD also extracts and utilizes the maximum level of undervolting at a fixed frequency to improve the energy efficiency of the system without sacrificing performance. GreenMD is portable, which means it can be used with any GPU and CPU architecture by simply adjusting a few architecture-specific parameters. In summary, this article makes the following contributions:
Using algorithmic knowledge and hardware configuration, we develop a more accurate performance model for CPU, GPUs, and the PCIe bus to estimate the execution time of the LU factorization during the different iterations of the execution.
We implement DVFS on the CPU and on single and multiple GPUs based on the accurate performance models.
We implement undervolting on the CPU and on single and multiple GPUs based on empirical observations.
We evaluate the performance, energy savings, and reliability through real implementation and achieve 31% total energy savings for double-precision LU factorization with a matrix of size 18K*18K.
The rest of the article is organized as follows. The background and motivation with profiling results are discussed in Section 2. A slack predictor is developed in Section 3 based on the CPU/GPU performance models. The proposed methodology, including DVFS-based slack reclamation and undervolting, is presented in Sections 4 and 5, respectively. Experimental evaluations and results are provided in Section 6. Finally, related works and the conclusion are discussed in Sections 7 and 8.

2 Background and Motivation

2.1 LU Factorization Overview

Figure 1 demonstrates an overview of the LU factorization algorithm. The left side shows the LU factorization at iteration k, while the right side shows it at iteration k+1. In a matrix of size n*n, during iteration k, a set of \(k * n/b\) columns (the panel shown in yellow) is factored on the CPU, where b is the block size and n is the row width. The remaining part of the matrix is then subjected to the elementary transformations that result from the panel factorization. This phase is called the trailing matrix update. Updating the trailing matrix requires the kernels “row swap” (DLASWP), “triangular solve” (DTRSM), and “matrix multiplication” (DGEMM). In other words, updating the trailing sub-matrix involves swapping up \(k * (n/b)\) rows of the trailing sub-matrix (DLASWP), applying a triangular solver to the top \(k * n/b\) rows of the trailing sub-matrix (DTRSM), and finally invoking the matrix multiplication for the part shown in blue (DGEMM). On the right side, at iteration k+1, one column of tiles is transferred from the GPU to the CPU to be factorized and made ready for the next iteration of LU factorization. This column of blocks, called the look-ahead panel, is transferred before the GPU finishes the trailing matrix update at iteration k. Look-ahead is a communication and computation overlapping technique that reduces the GPU idle time while waiting for the CPU cores to deliver the panel factorization results. This procedure is repeated several times; in each iteration, the CPUs factor a panel (shown in yellow), while the GPUs update their portions of the trailing matrix (shown in blue).
Fig. 1.
Fig. 1. Overview of the blocked LU factorization.
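To make the blocked structure concrete, the following sketch factors a matrix with the same panel/DTRSM/DGEMM pattern described above. It is a plain NumPy illustration of the blocking only: partial pivoting, the look-ahead panel, and the CPU/GPU split used by MAGMA are deliberately omitted, and the function name is ours.

```python
import numpy as np

def blocked_lu_nopiv(a, b):
    """Illustrative in-place blocked LU without pivoting (NOT the MAGMA routine,
    which uses partial pivoting, look-ahead, and GPU offload of the update)."""
    n = a.shape[0]
    for k in range(0, n, b):
        kb = min(b, n - k)
        # Panel factorization on columns k..k+kb (done on the CPU in the hybrid scheme).
        for j in range(k, k + kb):
            a[j + 1:, j] /= a[j, j]
            a[j + 1:, j + 1:k + kb] -= np.outer(a[j + 1:, j], a[j, j + 1:k + kb])
        if k + kb < n:
            # DTRSM: triangular solve on the top block row of the trailing sub-matrix.
            l11 = np.tril(a[k:k + kb, k:k + kb], -1) + np.eye(kb)
            a[k:k + kb, k + kb:] = np.linalg.solve(l11, a[k:k + kb, k + kb:])
            # DGEMM: trailing matrix update (offloaded to the GPUs in MAGMA).
            a[k + kb:, k + kb:] -= a[k + kb:, k:k + kb] @ a[k:k + kb, k + kb:]
    return a
```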

2.2 Profiling Observations

Figure 2 shows a partial timing profile of a double-precision LU factorization for a matrix of size 18K on the heterogeneous system with two GPUs. Since the profiling tools cannot observe the GPU and CPU timelines at the same time, we profile the LU factorization on the CPUs and GPUs separately. For the GPU, we used the NVIDIA Visual Profiler, a cross-platform performance profiling tool that gives developers vital feedback for optimizing CUDA C/C++ applications. In a heterogeneous multi-GPU system, matrix decomposition algorithms, including LU factorization, distribute the workload between the CPU and GPUs at each iteration. We enlarged the beginning of the timing profile to show the details, at the cost of not showing the entire LU factorization. The trace shows that there is no slack on the GPUs at the beginning of the LU factorization. This is because, in earlier iterations, the panel factorization performed on the CPU takes less time than the trailing matrix update performed on the GPUs. However, the amount of slack on the GPUs increases as LU factorization approaches the end of execution, because the trailing matrix update on the two GPUs keeps shrinking and takes less and less time compared to the panel factorization. In the following, we explain the different components (1 to 7) in the figure so that we can develop an accurate performance model.
Fig. 2.
Fig. 2. Trace of the execution of multi-GPU version of double-precision LU factorization for a matrix of size 18K.
(1) At the beginning, the whole matrix is divided and copied to the GPUs. The block columns are distributed across the GPUs using a 1-D block-column cyclic distribution; in the case of two GPUs, the even and odd block columns are transferred to GPU 0 and GPU 1, respectively (see the sketch after this list). The input matrix size is 18K*18K, and the block size is 512*512. As a result, each GPU holds 18 block columns, with each block column containing 36 blocks of size 512*512. Since the GPUs share the PCIe bus with the CPU, these block columns are transferred in a round-robin fashion.
(2) In each iteration, one GPU is responsible for updating the look-ahead panel and sending it to the CPU, while the other GPUs update the rest of the trailing matrix. After the look-ahead panel update, that GPU continues to update the rest of the trailing matrix along with the other GPUs.
(3) Panel factorization is done on CPU.
(4) The factorized panel is sent to all GPUs for the update phase for the next iteration.
(5) In the next iteration, the look-ahead panel update is done on a single GPU holding the next panel.
(6) The updated look-ahead panel is transferred to the CPU. This phase is done to overlap the communication and computation.
(7) Then, the rest of the trailing matrix, which is distributed across multiple GPUs, is updated on the corresponding GPUs.
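A minimal sketch of the 1-D block-column cyclic distribution mentioned in step (1) follows; the helper name is illustrative, and 18K is assumed here to denote 18,432 = 36 × 512 so the counts match the text.

```python
def block_column_owner(j, num_gpus):
    """1-D block-column cyclic distribution: block column j is owned by GPU j mod num_gpus."""
    return j % num_gpus

# With an 18,432 x 18,432 matrix and block size 512 there are 36 block columns;
# for two GPUs the even columns land on GPU 0 and the odd ones on GPU 1 (18 each).
n, b, num_gpus = 18432, 512, 2
owners = [block_column_owner(j, num_gpus) for j in range(n // b)]
assert owners.count(0) == owners.count(1) == 18
```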

3 Slack Predictor

In each iteration, either the CPU or the GPU is on the critical path. The critical path consists of the group of tasks that takes the maximum time among the different paths. Identifying the critical path allows us to extract the slack duration on a non-critical path. We precisely estimate the slack length at a given iteration and then reclaim the slack by utilizing DVFS. Using algorithmic knowledge, we estimate the execution time of each iteration on the CPU/GPUs based on the amount of operations and their compute capacities. Recall that the earlier paper on DVFS for LU factorization used profiling with a simple performance model [7]. Our technique, in contrast, does not involve any benchmarking; the performance can be computed directly.

3.1 Performance Model of LU Factorization on the CPU

By dividing the number of operations (workload) by the compute capability of the underlying architecture, the execution time on the CPU can be estimated. The compute capability is defined in Equation (1), where \(p_{cpu}\) is the maximum peak performance, defined as the maximum number of floating-point operations per second for the underlying architecture; freq and \(Max_{frequency}\) are the current running frequency and the maximum frequency of the CPU, respectively. The MAGMA library is highly optimized and gets close to the maximum peak performance.
\begin{equation} \displaystyle Compute_{capability} = p_{cpu}\times \frac{freq}{Max_{frequency}} \end{equation}
(1)
The number of operations (workload) at the kth iteration, \(W_{panel}\), required to perform panel factorization on a single block column of size \(M_{k} \times b\) can be calculated as [10]
\begin{equation} \displaystyle W_{panel} = \left(M_{k} -\frac{b}{3}\right)\times b^2. \end{equation}
(2)
In LU factorization with a matrix of size \(m \times n\) and block size b, \(M_{k}\) varies with the iteration k and can be calculated as shown in Equation (3):
\begin{equation} \displaystyle M_{k} = \left(\frac{m}{b}-k \right) \times b. \end{equation}
(3)
Hence, substituting Equation (3) into Equation (2), the amount of operations at iteration k, \(W_{panel_{k}}\), is estimated as
\begin{equation} \displaystyle W_{panel_{k}} = \left(\left(\frac{m}{b}-k\right) \times b -\frac{b}{3} \right)\times b^2. \end{equation}
(4)
Panel factorization is naturally a sequential program, so the estimated panel factorization time, \(T_{CPU_{k}}\), for a single CPU thread at a given iteration k is determined by Equation (5):
\begin{equation} \displaystyle T_{CPU_{k}} = \frac{ \left(\left(\frac{m}{b} - k\right) \times b -\frac{b}{3}\right)\times b^2 }{p_{cpu}\times \frac{freq}{Max_{frequency}}}. \end{equation}
(5)
But with the huge number of computing cores provided by multicore architectures, MAGMA uses multi-threading even for panel factorization. Since using multiple threads brings extra overhead, the number of flops required for the panel factorization increases by about \(b^3 \times \log (N_{cpu-threads})\), where \(N_{cpu-threads}\) is the number of threads participating in the panel factorization [10]. As a result, we can estimate the panel factorization execution time of the multi-threaded CPU at the kth iteration as shown in Equation (6):
\begin{equation} \displaystyle T^{\prime }_{CPU_{k}} = \frac{ \left(\left(\frac{m}{b} - k\right) \times b -\frac{b}{3}\right)\times b^2 + b^3 \times \log (N_{cpu-threads}) }{p_{cpu}\times \frac{freq}{Max_{frequency}} \times N_{cpu-threads} }. \end{equation}
(6)
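The CPU-side model can be written directly as code. The sketch below evaluates Equations (3) through (6); the function name and the choice of log base 2 for the threading overhead term are our assumptions, and p_cpu is the CPU peak FLOP rate at the maximum frequency.

```python
from math import log2

def t_cpu_panel(k, m, b, p_cpu, freq, max_freq, n_threads=1):
    """Estimated panel factorization time at iteration k (Equations (5) and (6))."""
    m_k = (m // b - k) * b                      # remaining panel height, Equation (3)
    work = (m_k - b / 3) * b * b                # panel flops, Equation (4)
    capability = p_cpu * (freq / max_freq)      # Equation (1)
    if n_threads == 1:
        return work / capability                # single-thread estimate, Equation (5)
    work += b ** 3 * log2(n_threads)            # multi-threading overhead term (log base assumed)
    return work / (capability * n_threads)      # multi-threaded estimate, Equation (6)
```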
Since we run the CPU at different frequencies for DVFS, we compared the values estimated through Equation (5) against the measured values for various frequencies. The empirical results are extracted for double-precision LU factorization with a matrix of size 18K. The panel factorization runs on the Intel(R) Core(TM) i7-6700K CPU while employing only one thread. Figure 3 presents the analytical results for the execution time of the panel factorization during different iterations. The execution time is inversely proportional to the frequency. We also measured the error rate of the execution time for different frequencies. Figure 4 shows the error rates for different frequencies w.r.t different iterations. According to the results, the average error rate across iterations is less than 4% for the different frequencies. The empirical results shown in Figure 4 demonstrate a different amount of error for different frequencies. This is because the time needed to charge and discharge the transistors differs with frequency. At higher frequencies, there is less time to charge and discharge the transistors, so timing errors are one of the main reasons for faults/errors. At lower frequencies, however, the execution time increases, and as a result the probability of observing errors increases as well. So the probability of a fault depends on both the frequency and the execution time [30]. Besides, different frequencies can change the temperature of the processor, which could lead to different error rates. In other words, the error rate depends on frequency, execution time, and temperature.
Fig. 3.
Fig. 3. The estimated time of CPU w.r.t various frequencies.
Fig. 4.
Fig. 4. Error rate in performance model on the underlying CPU.

3.2 Performance Model of the GPU Kernel

Similar to the CPU performance model provided in Equation (5), we develop the GPU performance model for one hardware thread and then extend it to all the GPU computing cores (threads). Unlike the CPU, however, the GPU is a SIMT architecture and there is no resource conflict; the GPU throughput is obtained simply by multiplying the per-thread throughput by the number of cores in the GPU. The GPU execution time is estimated by dividing the number of floating-point operations by the GPU computation rate.
In each iteration of the LU factorization, the size of the trailing sub-matrix is reduced by the block size in both the row and column dimensions. Given the matrix size \(m \times n\) and block size b, the sub-matrix row and column width at iteration k can be calculated as
\begin{equation} \displaystyle Row_{width_{k}} = \left(\frac{m}{b}-k \right) \times b \end{equation}
(7)
\begin{equation} \displaystyle Column_{width_{k}} = \left(\frac{n}{b}-k \right) \times b. \end{equation}
(8)
Therefore, the total number of operations (workload) required for a trailing matrix update at a given iteration k can be written as [10]
\begin{equation} \displaystyle Workload_{GPU_{k}} = \left(\left(\frac{m}{b}-k \right) \times b \right) \times \left(\left(\frac{n}{b}-k \right) \times b \right)^2. \end{equation}
(9)
Therefore, extending Equation (5) to the GPU, Equation (10) can be derived to estimate the execution time of each GPU at a given iteration k, where
\(N_{c}\) is the number of cores per GPU.
freq is the running frequency of the GPU.
\(T_{GPU_{i@k}}\) is the time required for the kernel (trailing matrix update) running on the ith GPU at a given iteration k.
\(Workload_{GPU_{i@k}}\) is the total workload of the ith GPU at a given iteration k.
At iteration k, Equation (9) is used to compute the total number of floating-point operations required, where n is the matrix size. If multiple GPUs are employed, the computational load is distributed evenly among them, and any extra panel assigned to a particular GPU due to an uneven number of panels contributes a workload calculated using Equation (9):
\begin{equation} \displaystyle T_{GPU_{i@k}} = \frac{ Workload_{GPU_{i@k}} }{P_{gpu} \times \frac{freq}{Max_{frequency}} \times N_{c} }. \end{equation}
(10)
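A direct transcription of Equations (9) and (10) is shown below; p_gpu is the per-core peak FLOP rate at the GPU's maximum frequency, and the even split across GPUs is a simplification of the block-column cyclic distribution described in Section 2.

```python
def workload_trailing(k, m, n, b):
    """Trailing-matrix-update workload at iteration k (Equation (9))."""
    rows = (m // b - k) * b
    cols = (n // b - k) * b
    return rows * cols ** 2

def t_gpu_update(k, m, n, b, p_gpu, freq, max_freq, n_cores, num_gpus=1):
    """Estimated trailing-matrix-update time per GPU at iteration k (Equation (10)),
    assuming the workload is split evenly across num_gpus."""
    workload_per_gpu = workload_trailing(k, m, n, b) / num_gpus
    return workload_per_gpu / (p_gpu * (freq / max_freq) * n_cores)
```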
Figure 5 shows the estimated execution time of double-precision LU factorization with an input matrix of size 18K on a single GPU card for different running frequencies. The GTX 1660 SUPER GPU, which has a total of 1,408 processing cores, is used to update the trailing matrix. To validate our performance model, we also extracted the empirical results and calculated the execution time error rate for the same configurations. The kernel execution time is measured from the host, which takes the kernel launch time and so forth into account. Figure 6 shows the error rate of the execution time on a single GPU in the presence of various frequencies. The results show that the average error rate is about 5%. However, the error rate increases toward the last iterations. This is because, as the update matrix gets smaller and smaller, the GPU is not fully utilized; the execution time shrinks, which results in a larger error rate. Using Equation (10), we also estimate the execution time of two GPUs. Figure 7 compares the estimated values for single and double GPUs while running the trailing matrix update. The running frequency is fixed at the default value of 1,980 MHz. It shows that the execution time of the GPU is inversely proportional to the number of GPUs at different iterations. In the case of two GPUs, the execution time of the trailing matrix update is almost halved at each iteration. We also extracted the experimental results for two GPUs, and the average error rate of the estimated execution time was about 4.6%.
Fig. 5.
Fig. 5. The estimated time of single GPU w.r.t various frequencies using Equation (10).
Fig. 6.
Fig. 6. Error rate of GPU performance model provided in Equation (10).
Fig. 7.
Fig. 7. Estimated GPU time with single and two GPUs at default frequency of 1,980 MHz.

3.3 Performance Model of Data Transfer

We must first model the PCIe bus latency before we can model the transfer time between the CPU and GPU. We take the LogGP model [8] as a starting point and expand it to multiple streams. According to the model, transferring a single chunk of k bytes takes L + O + k/B. L denotes the time it takes for a single byte to travel from the source to the destination endpoint. O specifies the amount of time the processor is working on the transmission; in other words, this is the time the CPU spends registering the DMA request with the controller. B represents the PCIe bus's bandwidth. We have no overhead on the device side because we use DMA transfers, which write data directly into memory. Even with DMA transfers, the data is split into chunks and transmitted across PCIe in a continuous stream. We define g as the gap between these chunks, that is, the time it takes to start a new transfer API call. As a result, transferring n chunks of sizes \(k_{1}, \ldots , k_{n}\) requires \(L + O + (k_{1} + \cdots + k_{n})/B + (n-1)g\). When multiple streams are present, each stream is transferred one at a time, and g is the overhead to initiate a new transfer API call on a different stream. As a result, the time required to transfer K bytes utilizing n streams is determined by Equation (11). Hence, in the case of double-precision LU factorization with a matrix of size \(m \times n\), the copy time for each iteration can be extracted using Equation (12). Here, m is the row width of the matrix, and b is the block size or panel width, which remains the same over the iterations.
\begin{equation} \displaystyle T_{copy_{k}} = L+ O + K / B + (n-1)\times g \end{equation}
(11)
\begin{equation} \displaystyle T_{copy_{k}} = L+ O + ((m- k\times b)\times b\times 8)/ B + (n-1)\times g \end{equation}
(12)
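A sketch of Equation (12) for double-precision panels (8 bytes per element) follows; L, O, and g are the profiled latency, host overhead, and inter-chunk gap, and B is the PCIe bandwidth in bytes per second. The function name is ours.

```python
def t_copy(k, m, b, L, O, B, g, n_streams=1):
    """Estimated panel transfer time at iteration k (Equation (12))."""
    nbytes = (m - k * b) * b * 8            # double-precision panel of (m - k*b) x b elements
    return L + O + nbytes / B + (n_streams - 1) * g
```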
Even though newer GPUs are equipped with dual copy engines that might share resources and affect the transfer time, we do not consider the dual copy engine because, in the case of LU factorization, the copy transfers do not occur at the same time. Figure 8 shows the estimated transfer time between the CPU and GPU for different iterations of double-precision LU factorization with a matrix size of 18K. The estimated results illustrate that the copy time is proportional to the data size and inversely proportional to the bandwidth. L, O, g, and B are extracted using a simple profiling phase. We also compared the estimated values with the experimental copy times for a matrix of size 18K in the case of a single GPU (NVIDIA GeForce GTX 1660 SUPER) and CPU (Intel(R) Core(TM) i7-6700K). Figure 9 shows that the error rate is negligible in the iterations when the PCIe bus is fully utilized. However, the error rate increases due to under-utilization of the PCIe bus during the last iterations of LU factorization. For example, the amounts of data transferred between the CPU and GPU in the first and last iterations of the LU factorization are 73 MB and 2.42 MB, respectively. On average, the error rate of the estimation model is about 1.3% across iterations, which is very small. In any case, compared to the hundreds of milliseconds of CPU and thousands of milliseconds of GPU execution time, the data transfer time is negligible.
Fig. 8.
Fig. 8. The estimated copy time of double-precision LU factorization for a matrix of size 18K.
Fig. 9.
Fig. 9. The error rate of the PCIe model for a matrix of size 18K.

4 DVFS-based Slack Reclamation

Slack is a period of time during which one compute component waits for another. Load imbalance, inter-task or inter-process communication, and memory access stalls are all common causes of slack. A critical path is a sequence of tasks that spans from the beginning to the end of the execution and has zero slack. Slack is only observed on the non-critical paths of the application. While slack on non-critical paths is commonly utilized for energy savings, fully reclaiming it without affecting application performance is difficult. During the LU factorization, slack occurs on the CPU or GPU at each iteration, as shown in Figure 10. Slack occurs only when the CPU waits for the GPU to update the next panel or when the GPU waits for the CPU to factorize the panel. If the CPU finishes earlier in an iteration, we can slow it down to finish at the same time as the GPU, and vice versa. As the panel and matrix update sizes change across iterations, the amount of slack changes. We then use DVFS on the CPU and GPU to take full advantage of the slack on non-critical paths: by properly reducing the frequency of the processing units to only eliminate the slack, we can reduce power consumption without affecting performance for the next iteration.
Fig. 10.
Fig. 10. An overview of slack reclamation.
Algorithm 1 provides the overview of the DVFS controller that estimates the panel factorization time on the CPU, the matrix update time on the GPUs, and the data copy time between the CPU and GPU using the equations provided in Section 3. Let \(T_{k}^{CPU}\), \(T_{GPU_{i@k}}\), and \(T_{k}^{Copy}\) represent the CPU time, the ith GPU execution time at iteration k, and the transfer time, respectively. The slack lengths of LU factorization at each iteration for the CPU and GPU are given by Equations (13) and (14). These values are extracted according to Equations (6), (10), and (12), respectively. In each iteration, the copy time is the sum of both H2D and D2H transfers that happen between the CPU and GPUs.
\begin{equation} slack_{CPU_{k}} = \max _{i} (T_{GPU_{i@k}}) - T_{k}^{CPU} - T_{k}^{Copy}, \qquad 0 \le i \lt \frac{N}{nb} \end{equation}
(13)
\begin{equation} slack_{GPU_{i@k}} = T_{k}^{CPU} + T_{k}^{Copy} - T_{GPU_{i@k}}, \qquad 0 \le i \lt \frac{N}{nb} \end{equation}
(14)
If Equation (13) has a positive value, the CPU has slack time and should wait for the GPU to finish its execution. To find the slack on different GPUs, we use Equation (14) for each individual GPU; similarly, if it has a positive value, that GPU has slack. There is no slack on the CPU or GPU when the corresponding value is zero or negative.
As shown in the earlier sections, the execution time of the LU factorization is inversely proportional to the frequency of the computing units, so lowering the frequency just enough to eliminate the slack reduces power consumption without affecting the performance of the next iteration.
We use the Advanced Configuration and Power Interface (ACPI) to change the CPU core’s frequency at runtime to reduce the voltage/frequency transition time. Our frequency enforcement is performed by manipulating each core’s “scaling_setspeed” file. When this device file is modified, Linux triggers a group of system calls that adjust the CPU core’s frequency in about 40 microseconds. Compared to the hundreds of msecs of CPU and GPU execution times per iteration (Figures 3 and 7), this overhead is negligible.
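A minimal sketch of this frequency-enforcement mechanism is shown below, assuming root privileges and the Linux userspace cpufreq governor (the only governor for which scaling_setspeed is writable); frequencies are given in kHz, and the function name is ours.

```python
import glob

def set_cpu_frequency_khz(freq_khz):
    """Write the target frequency to every core's scaling_setspeed file."""
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_setspeed"):
        with open(path, "w") as f:
            f.write(str(freq_khz))

# Example: drop all cores to 800 MHz for an iteration with predicted CPU slack.
# set_cpu_frequency_khz(800_000)
```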
We derive the equation to find the optimum frequency for the units on the non-critical path based on the current and target execution times. Equation (15) below calculates the optimum frequency, where \(f_{opt}\) and \(f_{default}\) are the optimum and default frequencies of the component with slack at a given iteration:
\begin{equation} \displaystyle f_{opt} = f_{default} \times \frac{\max (T_{CPU},T_{GPU})-slack}{\max (T_{CPU},T_{GPU})}. \end{equation}
(15)
The frequency of the CPU/GPUs is adjusted using the pseudo code provided in Algorithm 2. We choose the minimum frequency if the adjusted frequency is less than the minimum defined frequency. When an adjusted frequency is not supported by the hardware, two consecutive available frequencies are used to eliminate the slack. This is because only available discrete frequencies offered by CPU and GPU DVFS are taken into account in GreenMD.
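The per-iteration controller logic reduces to a few lines. The sketch below computes the slacks of Equations (13) and (14), the target frequency of Equation (15), and one possible realization of the two-consecutive-frequencies rule; the time split between the two neighboring DVFS states is our choice, since Algorithm 2 does not prescribe it, and the function names are illustrative.

```python
import bisect

def slack_cpu(t_cpu, t_copy, t_gpus):
    """Equation (13): CPU slack at this iteration (clamped at zero)."""
    return max(max(t_gpus) - t_cpu - t_copy, 0.0)

def slack_gpu(t_cpu, t_copy, t_gpu_i):
    """Equation (14): slack of GPU i at this iteration (clamped at zero)."""
    return max(t_cpu + t_copy - t_gpu_i, 0.0)

def f_opt(f_default, t_cpu, t_gpu, slack):
    """Equation (15): target frequency for the component that owns the slack."""
    t_max = max(t_cpu, t_gpu)
    return f_default * (t_max - slack) / t_max

def realize_frequency(f_target, supported):
    """Map f_target onto the discrete DVFS states: return (f_low, f_high, frac_high),
    where frac_high is the fraction of the iteration spent at f_high so that the
    average effective rate matches f_target."""
    supported = sorted(supported)
    if f_target <= supported[0]:
        return supported[0], supported[0], 1.0      # clamp to the minimum frequency
    if f_target >= supported[-1]:
        return supported[-1], supported[-1], 1.0    # clamp to the maximum frequency
    i = bisect.bisect_left(supported, f_target)
    f_low, f_high = supported[i - 1], supported[i]
    return f_low, f_high, (f_target - f_low) / (f_high - f_low)
```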

4.1 Scalability of the Proposed Method

As the number of GPUs increases, the amount of operations on each GPU is still known from the algorithmic knowledge, so we can estimate the execution time of each GPU during each iteration of the LU factorization and, as a result, find the slack. The amount of slack on the CPU and GPUs varies with the size of the input matrix and the number of GPUs. In this article, the input matrix size was limited by the memory size of the GPUs. For the same input matrix size, if we increase the number of GPUs, there will be even more slack on the GPUs and we can save even more energy; this matters because the GPUs are the main source of power consumption in a heterogeneous system with GPUs. According to the empirical results provided in the article, in earlier iterations, since the panels are larger, the CPU takes more time than the GPUs to factorize the panel, and we still experience slack on the GPUs unless more CPUs, and potentially GPUs, are employed to help factorize the panel. If the number of GPUs increases, then in later iterations, when the trailing matrix gets smaller and smaller, there will still be slack on the GPUs that can be utilized for energy saving.

5 Undervolting

Since we do not compromise the LU factorization's performance, DVFS can only be utilized to reclaim slack on non-critical-path components; using DVFS on critical paths degrades the application's overall performance for compute-intensive workloads [3]. DVFS is mainly concerned with dynamic power consumption. However, static power is becoming predominant in today's technology. Hence, we also employ undervolting (at a fixed frequency) to reduce both static and dynamic power while maintaining performance.
To ensure that the microprocessor functions reliably under varying load and environmental conditions, microprocessor manufacturers usually add an operational guard-band (a static voltage margin) of up to 20% of the nominal voltage [36]. The guard-bands additionally account for errors caused by the load line, aging effects, noise, and calibration error [27]. Because these errors do not occur frequently, significant energy savings can be achieved by lowering the guard-band voltage to a much lower supply voltage [19]. In our work, we aim to save energy by utilizing the voltage guard-band between the nominal voltage and the actual safe voltage while maintaining performance. To extract \(V_{safeMin}\) during LU factorization, we take a similar approach as described in [16]. We undervolt to the safe minimum voltage \(V_{safeMin}\) without introducing any fault into the system. The system may experience soft errors if the voltage is reduced below \(V_{safeMin}\).
To determine the \(V_{safeMin}\) for LU factorization, we profile the LU factorization for small matrices. Following [34], the sensitivity analysis is performed by lowering the voltage below the nominal voltage (1.075 V). First, we run the program at the nominal voltage and record the result as the “golden output.” Starting from the nominal voltage on the CPU, we reduce the voltage in 10 mV steps at a given fixed frequency. We run the application 100 times and record the number of faulty runs; if the output does not match the golden output, the run has failed. The frequency is then changed, and the undervolting sweep is repeated at the new fixed frequency. We extract the \(V_{safeMin}\) for both CPU and GPU at different running frequencies, because the maximum level of undervolting can differ across frequencies, as observed in Figures 11 and 12. The probability of failure for the CPU is shown in Figure 11. There is no error until the undervolting level reaches 150 mV. At lower running frequencies, we can decrease the voltage by an additional 10 mV without observing errors.
Fig. 11.
Fig. 11. CPU’s probability of failure w.r.t undervolting.
Fig. 12.
Fig. 12. GPU’s probability of failure w.r.t undervolting.
Similar to the CPU, we extract the \(V_{safeMin}\) for the underlying GPU as well. For the maximum frequency of 1,980 MHz, the GPU default voltage is 1.045 V. The underlying GPU (GTX 1660 SUPER) is likewise undervolted in 10 mV steps. For each level of undervolting, the application is executed 100 times and the corresponding output is compared to the golden output to verify correctness. As for the CPU, the probability of GPU failure as well as \(V_{safeMin}\) is extracted for different frequencies, as depicted in Figure 12. According to the empirical results, we are able to undervolt the GPU by about 140 to 150 mV for the different frequencies. Hence, in our method, we undervolt the CPU and GPUs by 150 mV and 140 mV, respectively. The GPUs are undervolted prior to the application's execution because there is no runtime API for undervolting the GPU during execution. Undervolting the GPU is done in a similar way as in [34], as explained in Section 6.
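The profiling phase amounts to a simple sweep. In the sketch below, apply_undervolt_mv and run_app are hypothetical hooks standing in for the platform-specific undervolting tool (e.g., linux-intel-undervolt or the NVML power-limit method) and for a short LU run whose output is compared against the golden output.

```python
def find_v_safe_min(run_app, apply_undervolt_mv, golden_output,
                    step_mv=10, max_mv=200, trials=100):
    """Return the largest undervolt (in mV below nominal) with zero failures in `trials` runs."""
    v_safe = 0
    for offset in range(step_mv, max_mv + step_mv, step_mv):
        apply_undervolt_mv(offset)                                    # lower Vdd by `offset` mV
        failures = sum(run_app() != golden_output for _ in range(trials))
        if failures > 0:                                              # first unsafe level found
            break
        v_safe = offset                                               # still error-free
    apply_undervolt_mv(0)                                             # restore the nominal voltage
    return v_safe
```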

6 Evaluation

6.1 Implementation Details

All experiments are performed on a heterogeneous system with an 8-core Intel CPU and two homogeneous NVIDIA GeForce GTX 1660 SUPER GPUs, whose architectural specifications are listed in Table 1. We were able to evaluate the results for up to an 18K matrix size due to the GPUs' limited device memory. The proposed power management algorithm is embedded inside the application code and called right before the next iteration. Considering the performance overhead of DVFS, the \(DVFS_{CPU}()\) and \(DVFS_{GPU}()\) functions are called only if there is enough slack available in the next iteration.
Component            CPU                                   GPU
Architecture         Intel(R) Core(TM) i7-6700K            NVIDIA GeForce GTX 1660 SUPER
Minimum Frequency    800 MHz                               300 MHz
Maximum Frequency    4,200 MHz                             1,980 MHz
Memory               16 GB                                 6 GB
Cache                L1 (128 KB), L2 (1 MB), L3 (8 MB)     L1 (64 KB per SM), L2 (1,536 KB)
OS                   Ubuntu 20.04
Table 1. Experimental Setup Configuration
We use the “linux-intel-undervolt” and “cpupower frequency-set” tools to undervolt and scale the CPU frequency. These tools can be used on Intel CPUs with a fully integrated voltage regulator (FIVR). On the CPU, we change only the core frequency and do not touch the memory frequency. However, in the case of the GPU, since the GPU core and memory frequencies are coupled, changing the core frequency might change the memory frequency as well.
For the GPU profiling phase and extracting the safe minimum voltage of the GPU, MSI Afterburner [1] was used. However, MSI Afterburner is not supported on the Linux operating system, so to reduce the voltage of the GPU we used a similar approach to that employed in [34]. Since there is no direct API to reduce the voltage, we reduce the voltage of the GPU by lowering the GPU's target power limit at a fixed frequency. To undervolt the GPU, several APIs from the NVIDIA Management Library (NVML) are used, as listed in Table 2.
API                                    Description
nvmlDeviceSetPersistenceMode           Enables persistence mode to prevent the driver from unloading
nvmlDeviceSetPowerManagementLimit      Sets a new power limit for the GPU device
nvmlDeviceGetClock                     Retrieves the clock speed for the clock specified by the clock type and clock ID
nvmlDeviceSetGpuLockedClocks           Sets clocks that the device will lock to
nvmlDeviceGetTotalEnergyConsumption    Reads the GPU’s total energy consumption
nvmlDeviceResetApplicationsClocks      Resets the application clocks to the default values
linux-intel-undervolt                  Undervolts Intel CPUs
cpupower frequency-set                 Sets the CPU’s frequency
Table 2. Power Management and Undervolting APIs
It is not possible to truly disable GPU Boost in modern NVIDIA architectures without resorting to very risky procedures involving flashing custom firmware. However, it is still possible to lock the graphics frequency in recent GPUs. We use the NVML library’s “nvmlDeviceSetGpuLockedClocks” API to fix the frequency and, as a result, the voltage. This API effectively locks the graphics frequency, ensuring that it remains constant at the desired frequency with tiny variations, which could be due to the auto-boosting option.
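A sketch of the GPU-side setup using the Python NVML bindings (pynvml / nvidia-ml-py) follows, assuming they expose the NVML entry points listed in Table 2; the routine must run with root privileges, and the power limit is the indirect undervolting knob described above. The function names and parameter choices are ours.

```python
import pynvml

def prepare_gpu(index, locked_clock_mhz, power_limit_mw):
    """Lock the graphics clock and lower the power limit on one GPU (indirect undervolt)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    pynvml.nvmlDeviceSetPersistenceMode(handle, 1)                      # keep the driver loaded
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, locked_clock_mhz,
                                        locked_clock_mhz)               # pin the graphics clock
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, power_limit_mw)    # reduce the power cap
    return handle

def total_energy_mj(handle):
    """Energy counter (millijoules) used to measure per-iteration GPU energy."""
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
```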

6.2 Results

We executed the LU factorization with one and two GPUs. In the case of a single GPU, the amount of slack on the CPU and GPU is shown in Figure 13 for a matrix size of 18K x 18K. The slacks for the CPU and the single GPU are observed in iterations 0 to 21 and 22 to 34, respectively. This is because, even though the GPU is equipped with a huge number of computing cores, it has a larger ratio of \(workload/(compute-capability)\) than the CPU until iteration 21.
Fig. 13.
Fig. 13. The amount of CPU and GPU slack for double-precision LU factorization with one GPU and matrix of size 18K.
Since there is no slack on the GPU before iteration 21, we do not change the GPU frequency until then. We only adjust the CPU frequency to reclaim the slack, allowing both the CPU and GPU to complete their tasks at the same time at a given iteration. If the adjusted frequency is less than the minimum frequency of the underlying architecture, we set the frequency to the minimum frequency. Similarly, if the adjusted frequency is greater than the maximum frequency, we set the frequency to the maximum frequency.
Along with slack reclamation, we also extracted the maximum level of undervolting that introduces no fault into the system. At the frequency corresponding to each iteration, we apply the maximum level of safe undervolting. The amount of energy consumed in the default scenario (baseline) and with the proposed approach w.r.t iteration is shown in Figure 14. The x-axis represents the iteration number, while the y-axis represents the amount of energy consumed during each iteration. Using DVFS along with undervolting, for a matrix of size 18K, we were able to reduce the CPU energy consumption by up to 51%.
Fig. 14.
Fig. 14. CPU energy improvement of double-precision LU factorization with single GPU for a matrix of size 18K.
We also measured the energy consumption of the single GPU. Figure 15 shows the GPU's energy consumption for the default configuration as well as the proposed method. Because there is no slack until iteration 21, the energy improvement up to that point comes only from undervolting. After that, both DVFS and undervolting lead to more energy reduction. Figure 15 shows that, on average, we save about 18% of the energy on the single GPU.
Fig. 15.
Fig. 15. GPU energy improvement of double-precision LU factorization in the presence of single GPUs for a matrix of size 18K.
We have also extracted the results for a heterogeneous system with two GPUs. In this case, we observed less slack on the CPU at earlier iterations and more slack on the GPUs at later iterations, compared to the single-GPU case. This is because the trailing matrix update is done in parallel on both GPUs, reducing the update time and the CPU slack. Figure 16 shows the amount of slack at different iterations for both the CPU and GPUs. Compared with the single-GPU case in Figure 13, the slack on the CPU is reduced by almost half. We fully reclaim the CPU slack and adjust the CPU's frequency to the desired frequency, allowing both the CPU and GPU iterations to complete at the same time. In the second half of the iterations, we also change the frequency of the GPUs to reclaim the slack on both GPUs. The frequencies are set automatically and independently through the API during the execution.
Fig. 16.
Fig. 16. The amount of slack for double-precision LU factorization in the presence of two GPUs for a matrix of size 18K.
Figure 17 shows the amount of CPU energy consumption in the presence of DVFS and undervolting. Compared to the single-GPU results illustrated in Figure 14, we observe less energy improvement on the CPU in the earlier iterations and more energy improvement during the later iterations. This is because, with two GPUs, the CPU experiences less slack during the earlier iterations and more slack during the later iterations, which leads to less and more energy improvement during these periods, respectively.
Fig. 17.
Fig. 17. The CPU energy improvement of double-precision LU factorization with two GPUs for a matrix of size 18K.
Similar to the CPU, we also extracted the energy improvement for the GPUs using the proposed method. The energy consumption of the GPUs for the default configuration and the proposed method is shown in Figure 18. Since there is no slack in the first half of the iterations, the energy improvement there comes only from undervolting the GPUs; the energy improvement in the second half of the iterations comes from both DVFS and undervolting. According to Figure 18, on average, we reduce the total energy consumption of the GPUs by about 21%. In comparison to the results illustrated in Figure 15 for a single GPU, the small improvement in the energy savings comes mainly from the undervolting part. This is because, even though the trailing matrix execution time is reduced by half, the total power consumption doubles due to the use of two GPUs, keeping the energy consumption almost the same. Figures 19 and 20 show the total energy consumption of LU factorization for a matrix of size 18K with one and two GPUs, respectively. As shown in Figure 20, in the first half of the iterations, when only the CPU experiences slack, we save 26.2%, while in the second half, when only the GPUs have slack, we save 41.8%. However, the energy consumption in the second half is much lower than in the first half because the trailing matrix gets smaller and the execution time decreases. Overall, there is a 31% improvement in total energy for the entire LU factorization with two GPUs.
Fig. 18.
Fig. 18. GPU energy improvement of double-precision LU factorization in the presence of two GPUs for a matrix of size 18K.
Fig. 19.
Fig. 19. Total CPU and GPU energy improvement of double-precision LU factorization with single GPUs for a matrix of size 18K.
Fig. 20.
Fig. 20. Total CPU and GPU energy improvement of double-precision LU factorization with two GPUs for a matrix of size 18K.

7 Related Works

In recent years, a lot of work has been done to investigate and improve the energy efficiency of kernels that are frequently used in scientific applications, such as matrix multiplication, LU factorization, Cholesky, and QR decomposition. The numerous methods that have been suggested fall into two categories: (1) studies on the effects of DVFS on the execution of applications and (2) works that introduce runtime models to predict GPU application performance and/or power consumption.
Jiao et al. [15] investigated the impact of core and memory frequency scaling on applications with various characteristics. The authors noted that, since some applications were more sensitive than others to scaling of each frequency domain, the effect of frequency scaling on performance and power consumption depends on the characteristics of the application. An alternate strategy for DVFS requires the development of accurate performance and/or power models that enable GPU behavior prediction under various voltage and frequency conditions. GPU performance models are generally developed based on GPU pipeline analysis [14, 24, 31], seeking to properly reflect the execution characteristics of GPGPU applications. In some research, the performance models are based on profiling as well as the algorithmic knowledge of the application. For instance, [7] introduces a performance model based on profiling the first iteration and then using algorithmic knowledge to predict the subsequent iterations. Since the estimate depends only on the profiling results of the first iteration, the profiling error accumulates over later iterations and grows to around 11.4%. Other work [6] uses a similar approach but adds an online calibration that keeps the error from accumulating and reduces the error rate; however, this adds the calibration overhead on top of the performance model and requires application changes ahead of time.
Some of the GPU DVFS runtime power modeling techniques are based on empirical approaches, which require dissecting the GPU micro-architecture and analyzing the kernel binary code [17]. In [35], the authors use a micro-benchmark-based methodology to create throughput models for the instruction pipeline, shared memory access, and global memory access, the three main components of GPU execution time, and they are able to predict the performance of the GPU with a 5% to 15% error rate. However, these methods are frequently product specific and difficult to port to other architectures.
In order to leverage the Dynamic Voltage and Frequency Scaling (DVFS) technique, we present a performance model that incurs a maximum performance overhead of 5%. Our performance model is derived from the general GPU performance model and can be adapted to suit diverse compute-intensive applications with minor modifications to the introduced performance models.

8 Conclusion

In this article, we present GreenMD, an energy-efficient framework that reduces the energy consumption of LU factorization on a heterogeneous multi-GPU system. We profiled the execution trace of a system with two GPUs and explained the various kernel executions and data transfers. This gave rise to detailed analytical models that predict the CPU/GPU execution times and the PCIe bus transfer time. We then applied DVFS to the CPU and GPUs based on the amount of slack predicted through our analysis. We designed appropriate APIs and inserted them into the kernel to independently control the DVFS at each iteration. We further improved the energy by reducing the voltage of the CPU and GPUs independently, based on the minimum threshold voltages, to avoid errors.
Since GreenMD employs the same programming interface as MAGMA's LU factorization, it is transparent to programs that use the LU factorization and does not require users to change the application's source code. GreenMD is portable and can be utilized with any GPU architecture; this requires knowing how the workload is distributed across the underlying architectures and an understanding of the algorithm. In practice, GreenMD can be used with any GPU architecture by changing only a few model parameters that are particular to that architecture. The approach we propose can also be used to improve a broad range of kernels: the performance model is adjusted and the appropriate level of undervolting is determined using the offline profiling phase and algorithmic knowledge.
GreenMD’s energy consumption was evaluated for two heterogeneous systems, one with a single GPU and the other with two GPUs, for a matrix size of 18K*18K. Our results, shown in Table 3, indicate that CPU and GPU energy consumption improved by around 51% and 18%, respectively, in the heterogeneous system with a single GPU. In the heterogeneous system with two GPUs, GreenMD reduces the energy consumption of the CPU and GPUs by 59% and 21%, respectively. The total energy saved during the entire execution is 31%.
                                       Energy Improvement (%)
                                       CPU      GPU/GPUs      Total
Heterogeneous system with one GPU      51%      18%           32.4%
Heterogeneous system with two GPUs     59%      21%           31%
Table 3. Energy Improvement of CPU and GPU/GPUs in Heterogeneous Systems with One GPU and Two GPUs

Acknowledgments

The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions.

References

[1]
MSI Afterburner. [n. d.]. http://goo.gl/fs2pti.
[2]
R. Begum, M. Hempstead, G. P. Srinivasa, and G. Challen. 2016. Algorithms for CPU and DRAM DVFS under inefficiency constraints. IEEE 34th International Conference on Computer Design (ICCD’16), 161–168.
[3]
Enrico Calore, Alessandro Gabbana, Sebastiano Fabio Schifano, and Raffaele Tripiccione. 2017. Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications. Concurrency and Computation: Practice and Experience 29, 12 (2017), e4143.
[4]
Anthony M. Castaldo and R. Clint Whaley. 2010. Scaling LAPACK panel operations using parallel cache assignment. ACM Sigplan Notices 45, 5 (2010), 223–232.
[5]
Saumya Chandra, Kanishka Lahiri, Anand Raghunathan, and Sujit Dey. 2009. Variation-tolerant dynamic power management at the system-level. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17, 9 (2009), 1220–1232.
[6]
Jieyang Chen, Hongbo Li, Sihuan Li, Xin Liang, Panruo Wu, Dingwen Tao, Kaiming Ouyang, Yuanlai Liu, Kai Zhao, Qiang Guan, and Zizhong Chen. 2018. Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’18). 854–865. DOI:
[7]
Jieyang Chen, Li Tan, Panruo Wu, Dingwen Tao, Hongbo Li, Xin Liang, Sihuan Li, Rong Ge, Laxmi Bhuyan, and Zizhong Chen. 2016. GreenLA: Green linear algebra software for GPU-accelerated heterogeneous computing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’16). IEEE, 667–677.
[8]
David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten Von Eicken. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1–12.
[9]
Shidhartha Das, David Roberts, Seokwoo Lee, Sanjay Pant, David Blaauw, Todd Austin, Krisztián Flautner, and Trevor Mudge. 2006. A self-tuning DVS processor using delay-error detection and correction. IEEE Journal of Solid-State Circuits 41, 4 (2006), 792–804.
[10]
S. Donfack, S. Tomov, and J. Dongarra. 2014. Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs. In 2014 IEEE International Parallel Distributed Processing Symposium Workshops. 958–965.
[11]
Dimitris Gizopoulos, George Papadimitriou, Athanasios Chatzidimitriou, Vijay Janapa Reddi, Behzad Salami, Osman S. Unsal, Adrian Cristal Kestelman, and Jingwen Leng. 2019. Modern hardware margins: CPUs, GPUs, FPGAs: Recent system-level studies. In IEEE 25th International Symposium on On-Line Testing and Robust System Design (IOLTS’19). 129–134.
[12]
João Guerreiro, Aleksandar Ilic, Nuno Roma, and Pedro Tomás. 2019. DVFS-aware application classification to improve GPGPUs energy efficiency. Parallel Comput. 83 (2019), 93–117.
[13]
João Guerreiro, Aleksandar Ilic, Nuno Roma, and Pedro Tomás. 2019. Modeling and decoupling the GPU power consumption for cross-domain DVFS. IEEE Trans. Parallel Distrib. Syst. 30, 11 (Nov.2019), 2494–2506.
[14]
Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture. 280–289.
[15]
Y. Jiao, H. Lin, P. Balaji, and W. Feng. 2010. Power and performance characterization of computational kernels on the GPU. In 2010 IEEE/ACM Int’l Conference on Green Computing and Communications & Int’l Conference on Cyber, Physical and Social Computing. 221–228. DOI:
[16]
Jingwen Leng, Alper Buyuktosunoglu, Ramon Bertran, Pradip Bose, and Vijay Janapa Reddi. 2015. Safe limits on voltage reduction efficiency in GPUs: A direct measurement approach. 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’15).294–307. DOI:
[17]
Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. ACM SIGARCH Computer Architecture News 41, 3 (2013), 487–498.
[18]
Jingwen Leng, Yazhou Zu, and Vijay Janapa Reddi. 2014. Energy efficiency benefits of reducing the voltage guardband on the Kepler GPU architecture. In Workshop on Silicon Errors in Logic-System Effects (SELSE’14).
[19]
J. Leng, Y. Zu, M. Rhu, M. S. Gupta, and V. J. Reddi. 2014. GPUVolt: Modeling and characterizing voltage noise in GPU architectures. In 2014 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED’14). 141–146. DOI:
[20]
Ching-Chi Lin, You-Cheng Syu, Chao-Jui Chang, Jan-Jan Wu, Pangfeng Liu, Po-Wen Cheng, and Wei-Te Hsu. 2015. Energy-efficient task scheduling for multi-core platforms with per-core DVFS. J. Parallel Distrib. Comput. 86 (Dec.2015), 71–81.
[21]
Xavier Luciani and Laurent Albera. 2015. Joint eigenvalue decomposition of non-defective matrices based on the LU factorization with application to ICA. IEEE Transactions on Signal Processing 63, 17 (2015), 4594–4608.
[22]
Xinxin Mei, Ling Sing Yung, Kaiyong Zhao, and Xiaowen Chu. 2013. A measurement study of GPU DVFS on energy conservation. In Proceedings of the Workshop on Power-aware Computing and Systems. 1–5.
[24]
Rajib Nath and Dean Tullsen. 2015. The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU. In Proceedings of the 48th International Symposium on Microarchitecture. 281–293.
[25]
George Papadimitriou, Manolis Kaliorakis, Athanasios Chatzidimitriou, Dimitris Gizopoulos, Peter Lawthers, and Shidhartha Das. 2017. Harnessing voltage margins for energy efficiency in multicore CPUs. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 503–516.
[26]
Antoine Petitet. 2004. HPL-a portable implementation of the high-performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl/.
[27]
N. Rohbani, M. Ebrahimi, S. Miremadi, and M. B. Tahoori. 2017. Bias temperature instability mitigation via adaptive cache size management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 3 (March2017), 1012–1022. DOI:
[28]
Barry Rountree, David K. Lowenthal, Bronis R. De Supinski, Martin Schulz, Vincent W. Freeh, and Tyler Bletsch. 2009. Adagio: Making DVS practical for complex HPC applications. In Proceedings of the 23rd International Conference on Supercomputing. 460–469.
[29]
Barry Rountree, David K. Lowenthal, Shelby Funk, Vincent W. Freeh, Bronis R. De Supinski, and Martin Schulz. 2007. Bounding energy consumption in large-scale MPI programs. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). IEEE, 1–9.
[30]
Smruti R. Sarangi, Brian Greskamp, Radu Teodorescu, Jun Nakano, Abhishek Tiwari, and Josep Torrellas. 2008. VARIUS: A model of process variation and resulting timing errors for microarchitects. IEEE Transactions on Semiconductor Manufacturing 21, 1 (2008), 3–13. DOI:
[31]
Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. IEEE, 673–686.
[32]
Li Tan, Shuaiwen Leon Song, Panruo Wu, Zizhong Chen, Rong Ge, and Darren J. Kerbyson. 2015. Investigating the interplay between energy efficiency and resilience in high performance computing. In 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, 786–796.
[33]
Zhenheng Tang, Yuxin Wang, Qiang Wang, and Xiaowen Chu. 2019. The impact of GPU DVFS on the energy and performance of deep learning: An empirical study. In Proceedings of the 10th ACM International Conference on Future Energy Systems (e-Energy’19). Association for Computing Machinery, New York, NY, 315–325.
[34]
Hadi Zamani, Yuanlai Liu, Devashree Tripathy, Laxmi Bhuyan, and Zizhong Chen. 2019. GreenMM: Energy efficient GPU matrix multiplication through undervolting. In Proceedings of the ACM International Conference on Supercomputing. 308–318.
[35]
Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture. 382–393. DOI:
[36]
Yazhou Zu, Charles R. Lefurgy, Jingwen Leng, Matthew Halpern, Michael S. Floyd, and Vijay Janapa Reddi. 2015. Adaptive guardband scheduling to improve system-level efficiency of the POWER7+. 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’15). 308–321. DOI:
[37]
Yazhou Zu, Charles R. Lefurgy, Jingwen Leng, Matthew Halpern, Michael S. Floyd, and Vijay Janapa Reddi. 2015. Adaptive guardband scheduling to improve system-level efficiency of the POWER7+. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’15). IEEE, 308–321.
