Abstract
Graphics Processing Units (GPUs) maintain a large register file to increase the thread level parallelism (TLP). To increase the TLP further, recent GPUs have increased the number of on-chip registers in every generation. However, with the increase in the register file size, the leakage power increases. Also, with the technology advances, the leakage power component has increased and has become an important consideration for the manufacturing process. The leakage power of a register file can be reduced by turning infrequently used registers into low power (drowsy or off) state after accessing them. A major challenge in doing so is the lack of runtime register access information.
To address this, we propose a system called GREENER. It employs a compiler analysis that determines the power state of the registers, i.e., which registers can be switched off or placed in drowsy state at each program point, and encodes this information in the program instructions. Further, it uses a runtime optimization that improves the accuracy of the power states of registers. We implemented the proposed ideas using the GPGPU-Sim simulator and evaluated them on 21 kernels from several benchmark suites. We observe that, when compared to a baseline without any power optimizations, GREENER shows an average reduction in register leakage energy of 69.04% with a negligible simulation-cycle overhead (0.53%).
Vishwesh Jatala is supported by Tata Consultancy Services (TCS) Research Scholarship Program. Amey Karkare acknowledges the travel fund received from TCS.
1 Introduction
Graphics Processing Units (GPUs) achieve high throughput by utilizing thread level parallelism (TLP). Typically, GPUs maintain a large register file in each streaming multiprocessor (SM) to improve the TLP. GPUs allow a large number of resident threads [2] in each SM, and the resident threads store their contexts in the register file, which facilitates faster context switching of the threads. The threads launched in each SM are grouped into sets of 32 threads (called warps), which execute instructions in a single instruction, multiple thread (SIMT) manner. To keep improving the TLP of GPUs, architects increase the maximum number of resident threads and the register file size in every generation.
Earlier studies [12, 17] show that register files in GPUs consume around 15% of the total power. With the technology advances, the leakage power component has increased and has become an important consideration for the manufacturing process [16]. Moreover, registers in a GPU continue to dissipate leakage power throughout the entire execution of its warp even when they are not accessed by the warp.
1.1 Motivation
To understand the severity of leakage power dissipation by register file, consider Fig. 1 which shows the access patterns of some registers of warp 0 during the execution of MUM application (The experimental methodology has been discussed in Sect. 4). We use the access patterns of the registers of a single warp as a representative since all the warps of a kernel typically show similar behavior during execution [4]. We make the following observations:
- Register 10 is accessed very infrequently: it is accessed for only 7 cycles during the complete execution (lifetime) of the warp (29614 cycles).
- Register 1 is the most frequently accessed register during the warp execution. However, it is accessed for only 330 cycles (\(\sim \)1.11%) during the lifetime of the warp.
This shows that registers are accessed for a very short duration during the warp lifetime. However, they continue to dissipate leakage power for the entire lifetime of the warp. Figure 2 shows that this behavior is not specific to MUM, but is seen across a wide range of applications. The figure shows the percentage of simulation cycles spent in register accesses (averaged over all the registers in all the warps) for several applications. We observe that registers on average spend \(<\!\!2\%\) of the simulation cycles in accesses during the warp execution while leaking power during the entire execution. This behavior is expected since GPUs allow a large number of resident warps in each SM, and these warps are executed according to a pre-defined scheduling policy. If a warp is scheduled less frequently, its registers leak power for a longer duration.
One solution [3] to reduce the leakage power of the registers is to put the registers into drowsy or SLEEP (see Footnote 1) state immediately after the registers of an instruction are accessed. However, this can have a run-time overhead whenever there are frequent wake-up signals to the sleeping register. Consider Fig. 1 again:
- Putting register 10 to SLEEP state immediately after its accesses saves significant power due to the gaps of several thousands of cycles between consecutive accesses.
- In contrast, register 1 is accessed very frequently. If it is put to SLEEP after every access, it will incur a high overhead of wake-up signals.
- The access pattern of register 7 changes during the warp execution. It is accessed frequently during some periods (for example, between cycles 10500–11250), and infrequently during others (between cycles 3000–7500). To optimize energy as well as run-time, the register needs to be kept ON whenever it is frequently accessed, and put to SLEEP otherwise.
- The last access to register 8 is at cycle 1602. The register can be turned OFF after its last access to save even more power.
In summary, the knowledge of registers’ access patterns helps improve energy efficiency without impacting the run-time adversely. Our proposed solution statically estimates the run-time usage patterns of registers to reduce GPU register file leakage power.
1.2 Contributions
GREENER uses a compile-time analysis to determine the power state of the registers (OFF, SLEEP, or ON) for each instruction by estimating register usage information. Further, it transforms the input assembly program by encoding the power state information in each instruction to make it energy efficient. The static analysis makes safe approximations while computing the power states of the registers; therefore, the choice of state can be suboptimal at run-time. Hence, to improve accuracy and energy efficiency, GREENER provides a run-time optimization that dynamically corrects the power states of the registers of each instruction. We make the following contributions:
1. We introduce a new instruction format that supports the power states for the instruction registers (Sect. 3.2). We propose a compile-time analysis that determines the power state of the registers at each program point and transforms an input assembly program into a power optimized assembly program (Sects. 3.1 and 3.2).
2. We propose a run-time optimization to reduce the penalty of suboptimal (but safe) choices made by the static analysis (Sect. 3.3).
3. We implemented the proposed compile-time and run-time optimizations using the GPGPU-Sim simulator [10]. We integrated GPUWattch [17] with the CACTI-P [18] version to enable the power saving mechanism (Sect. 4).
4. We evaluated our implementation on a wide range of kernels from different benchmark suites: CUDA-SDK [8], GPGPU-SIM [5], Parboil [1], and Rodinia [7]. We observe a reduction in register leakage energy by an average of 69.04% and a maximum of 87.95% (Sect. 4) when compared to the baseline approach, which does not have any power optimizations.
In the paper, Sect. 2 provides the background required for GREENER, while the system itself is described in Sect. 3. Section 4 gives the experimental evaluation. Section 5 describes related work, and Sect. 6 concludes the paper.
2 Background
GPUs consist of a set of streaming multiprocessors (SMs). Each SM contains a large number of execution units such as ALUs, SPs, SFUs, and Load/Store units. GPUs achieve high throughput because they can hide long memory execution latencies with massive thread level parallelism. Each SM has a large register file, which allows the resident threads to maintain their contexts, and hence can have faster context switching.
NVIDIA provides a programming language CUDA [8] to parallelize applications on GPU. A program written in CUDA is translated to an intermediate representation (PTX), which is finally translated to an executable code. NVIDIA provides tools such as cuobjdump to disassemble the executable into SASS assembly language. GPGPU-Sim converts SASS code to PTXPlus code for simulation.
The GPUWattch [17] framework uses the simulation statistics of GPGPU-Sim to measure the power of each component in the GPU. The framework is built on McPAT [19], which internally uses CACTI [6]. McPAT models the register files as memory arrays to measure the register power. GREENER inserts power state information of registers in the PTXPlus code to enable reduction in the leakage power of the register files.
3 GREENER
To understand the working of GREENER, we need to understand the different access patterns of a register and their effect on the wake-up penalty incurred. Let W (threshold) denote the minimum number of program instructions required to offset the wake-up penalty incurred when a register is switched from OFF or SLEEP state to ON state. Consider a program that accesses some register R in a statement S during execution. The future accesses of R in this execution govern its power state. The following scenarios exist:
1. The next access (either read or write) to R is by an instruction \(S'\) and there are no more than W instructions between S and \(S'\). In this case, since the two accesses to R are very close, R should be kept ON to avoid any wake-up penalty associated with SLEEP or OFF state.
2. The next access to R is a read access by an instruction \(S'\) and there are more than W instructions between S and \(S'\). In this case, since the value stored in R is used by \(S'\), we cannot switch R to OFF state as that would lose its value. However, we can put R in SLEEP state.
3. The next access to R is a write access by an instruction \(S'\) and there are more than W instructions between S and \(S'\). In this case, since the value stored in R is being overwritten by \(S'\), we can put R in OFF state.
4. There is no further access to R in the program. In this case also, R can be safely turned OFF.
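The four scenarios above can be sketched as a small decision function. This is our own illustration, not the paper's implementation; the function name and the `None` encoding of "no further access" are assumptions.

```python
# Hypothetical sketch of the four scenarios: given the distance to the next
# access of register R after statement S, and whether that access is a read,
# choose a power state for R.
W = 3  # wake-up threshold (the value used in the paper's experiments)

def power_state(next_access_distance, next_access_is_read):
    """Return 'ON', 'SLEEP', or 'OFF' for a register after an access.

    next_access_distance: instructions until R is accessed again
                          (None if R is never accessed again).
    next_access_is_read:  True if that access reads R's current value.
    """
    if next_access_distance is None:
        return "OFF"    # scenario 4: no further access, value not needed
    if next_access_distance <= W:
        return "ON"     # scenario 1: too close to pay the wake-up penalty
    if next_access_is_read:
        return "SLEEP"  # scenario 2: value still needed, must be preserved
    return "OFF"        # scenario 3: value is overwritten anyway

assert power_state(2, True) == "ON"
assert power_state(10, True) == "SLEEP"
assert power_state(10, False) == "OFF"
assert power_state(None, False) == "OFF"
```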
We now describe the compiler analysis used by GREENER to capture these scenarios.
3.1 Compiler Analysis
To compute the power state of registers at each instruction, we perform the compiler analysis at the instruction level. Determining the power state of each register requires knowing the lifetimes of the registers as well as the distance between consecutive accesses to the registers. We use the following notations.
- \(\mathsf{IN}(S)\) denotes the program point before the instruction S. \(\mathsf{OUT}(S)\) denotes the program point after the instruction S.
- \(\mathsf{SUCC}(S)\) denotes the set of successors of the instruction S. An instruction I is said to be a successor of S if control may transfer to I after executing the instruction S.
- \(\mathsf{isLive}(\pi , R)\) is true if there is some path from program point \(\pi \) to Exit that contains a use of R not preceded by its definition.
- \(\mathsf{Dist}(\pi , R)\) denotes the distance, in terms of the number of instructions, from program point \(\pi \) to the next access to R. \(\mathsf{Dist}(\pi , R)\) is set to \(\infty \) when it exceeds the threshold W.
- \(\mathsf{SleepOff}(\pi , R)\) is true if the register R can be put into SLEEP or OFF state at \(\pi \).
- \(\mathsf{Power}(\pi , R)\) denotes the power state of the register R at program point \(\pi \).
The liveness information of each register, \(\mathsf{isLive}(\pi , R)\), can be computed using traditional liveness analysis [15]. The data flow equations to compute the \(\mathsf{Dist}(\mathsf{IN}(S), R)\) and \(\mathsf{Dist}(\mathsf{OUT}(S), R)\) are given in Fig. 3. Since our analysis aims to reduce the power consumption, we compute \(\mathsf{Dist}(\mathsf{OUT}(S), R)\) as the maximum value of \(\mathsf{Dist}(\mathsf{IN}(SS), R)\) over the successors SS of S. A register R can potentially be put into SLEEP or OFF state at a program point \(\pi \) if it is not accessed within the distance window W on some path, i.e., \( \mathsf{SleepOff}(\pi , R) = (\mathsf{Dist}(\pi , R)== \infty ) \).
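The Dist equations can be sketched as a backward fixed-point iteration. The program representation below (a dict mapping each instruction label to its accessed registers and successor labels) is our own simplification for illustration, not the paper's representation.

```python
# Minimal sketch of the Dist computation (Fig. 3) as a backward data-flow
# fixed point. INF stands for the paper's "infinity" (distance beyond W).
INF = float("inf")
W = 3

def compute_dist(instrs, reg):
    """instrs: dict label -> (set of accessed registers, successor labels)."""
    dist_in = {s: INF for s in instrs}
    dist_out = {s: INF for s in instrs}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for s, (regs, succs) in instrs.items():
            # OUT(S) takes the MAXIMUM over successors (favors power saving)
            out = max((dist_in[t] for t in succs), default=INF)
            inn = 0 if reg in regs else out + 1
            if inn > W:
                inn = INF               # saturate: treated as "far away"
            if (out, inn) != (dist_out[s], dist_in[s]):
                dist_out[s], dist_in[s] = out, inn
                changed = True
    return dist_in, dist_out

# Straight-line example: S0 accesses r0, S1 and S2 do not, S3 accesses r0.
prog = {
    "S0": ({"r0"}, ["S1"]),
    "S1": (set(), ["S2"]),
    "S2": (set(), ["S3"]),
    "S3": ({"r0"}, []),
}
din, dout = compute_dist(prog, "r0")
assert dout["S0"] == 2   # next access within W = 3, so r0 stays ON after S0
assert dout["S3"] == INF # no further access: r0 can be turned OFF
```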
The power state of each register at each program point can be computed according to Table 1. Note that in GPUs, all 32 threads of a warp execute the same instruction in SIMT fashion; hence the power state computed by the analysis applies to the 32 registers corresponding to the 32 threads of a warp.
3.2 Encoding Power States
The power state (Power_State) of a register can be one of the three states: OFF, SLEEP, or ON. Thus, it requires two bits to represent Power_State of one register. Since the power state can change after every instruction at run-time, we need to encode the Power_State of the operand registers of an instruction in the instruction itself.
PTXPlus instructions [10] can support up to 4 source and 4 destination registers. Encoding the Power_State of all the registers would require 16 bits. We observed that in our benchmarks, most instructions use at most 2 source registers and 1 destination register. Therefore, to reduce the number of bits required to encode Power_State in each instruction, we encode the information only for 2 source registers and 1 destination register. For instructions having more registers, the Power_State of the remaining registers is assumed to be SLEEP to enable power saving. The modified instruction format is:
where \(\mathsf{Power}(\mathsf{OUT}(S), R)\) is Power_State encoded for a register R for an instruction S.
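As an illustration of this 6-bit budget (2 bits for each of one destination and two source registers), one possible packing is sketched below. The field order and bit values are our assumptions; the paper only fixes the 2-bits-per-register budget, not the layout.

```python
# Hypothetical 6-bit packing of Power_State for (dst, src1, src2),
# 2 bits per register. Field order is an assumption for illustration.
STATES = {"ON": 0b00, "SLEEP": 0b01, "OFF": 0b10}
NAMES = {v: k for k, v in STATES.items()}

def encode_power_bits(dst, src1, src2):
    # dst occupies the high 2 bits, then src1, then src2
    return (STATES[dst] << 4) | (STATES[src1] << 2) | STATES[src2]

def decode_power_bits(bits):
    return (NAMES[(bits >> 4) & 0b11],
            NAMES[(bits >> 2) & 0b11],
            NAMES[bits & 0b11])

bits = encode_power_bits("OFF", "SLEEP", "ON")
assert bits == 0b100100
assert decode_power_bits(bits) == ("OFF", "SLEEP", "ON")
```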
Example 1
Figure 4(a) shows a snippet of power optimized PTXPlus code, generated for the SP benchmark using a threshold value (W) of 7. The control flow graph (CFG) corresponding to the snippet is shown in Fig. 4(b). Note that the CFG is shown at the traditional basic block level to keep it compact. In Fig. 4(a), explicit branch addresses have been replaced by block labels for ease of understanding. \(\square \)
At run-time, the power states of the source registers are set after the register contents have been read, i.e., in the read operands phase of the GPU pipeline, and the power states of the destination registers are set after the register contents have been written, i.e., in the write back stage of the pipeline.
3.3 Run-Time Optimization
Recall that the compiler analysis described in Sect. 3.1 computes \(\mathsf{Dist}(\mathsf{OUT}(S), R)\) as the maximum distance value over all successors when \(\mathsf{OUT}(S)\) is a branch point. This decision increases the chances of power savings, but it can be suboptimal at run-time as shown by the following example.
Example 2
Consider the CFG in Fig. 5(a) for a hypothetical benchmark. Assume a threshold value (W) of 7. Instruction S0 defines a register r0. The next access to r0 occurs along two paths: the path along S10 has a use at a distance of 2, and the other (along S1) has a use in S9 at a distance of \(\infty \) (>7). GREENER computes \(\mathsf{Dist}(\mathsf{OUT}(S0), r0)\) as \(\infty \), the maximum of the distances along the successors. Further, the state \(\mathsf{Power}(\mathsf{OUT}(S0), r0)\) is computed as SLEEP. When the program executes the path along S1, power is saved. However, if the program executes the path along S10, then the register needs an immediate wake up, causing an overhead. \(\square \)
GREENER's compile-time decision can be corrected at run-time by looking at near-future accesses of a register in the pipeline. The hardware is modified to check in the pipeline whether any decoded instruction from the same warp accesses a register whose power state is being changed to SLEEP or OFF. If so, the register is kept ON. This avoids the wake-up latencies for instructions that access the same register within a short duration, thereby avoiding the performance penalty.
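This run-time check can be sketched as follows. The pipeline model (a list of register sets for the warp's decoded, in-flight instructions) and the function name are our own assumptions; the actual mechanism is a hardware check.

```python
# Sketch of the run-time correction: before demoting a register to SLEEP or
# OFF, scan the same warp's decoded, not-yet-retired instructions for a use
# of that register. If found, keep the register ON to avoid a wake-up stall.
def corrected_state(compile_time_state, reg, decoded_window):
    """decoded_window: register sets of decoded in-flight instructions
    from the same warp."""
    if compile_time_state in ("SLEEP", "OFF"):
        if any(reg in regs for regs in decoded_window):
            return "ON"  # near-future access detected: skip the demotion
    return compile_time_state

# r0 appears in a decoded instruction -> keep it ON despite the SLEEP hint
assert corrected_state("SLEEP", "r0", [{"r1"}, {"r0", "r2"}]) == "ON"
# no pending access -> honor the compile-time SLEEP hint
assert corrected_state("SLEEP", "r0", [{"r1"}]) == "SLEEP"
```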
Example 3
Figure 5(b) shows a possible execution sequence of a program whose CFG is shown in Fig. 5(a). The instruction S0 writes to register r0. After writing the register value in write back stage (WB), the register needs to be put into SLEEP state. Assume that the program takes the path along S10 and decodes the instruction S11 before the write back stage of S0. Our run-time optimization detects the future access to r0 by S11, and keeps the register in ON state instead of putting it into SLEEP state to avoid additional wake up latencies. On the other hand, if the program takes the path along S1, then the instruction present in the S9 would appear much later in the pipeline (after WB stage of S0). The register r0 will be set to SLEEP state. \(\square \)
4 Experimental Analysis
Implementing GREENER requires modifying the GPU pipeline. We implemented the proposed hardware changes and compiler optimizations in GPGPU-Sim V3.x [10]. The details of the modified GPU architecture and the corresponding (negligible) overheads are discussed in [11] and omitted here for brevity. The GPGPU-Sim configuration used for the experiments is shown in Table 2. We also evaluated GREENER on various other GPU configurations, whose results are reported in our technical report [11]. We measured the power consumption of the register file using GPUWattch [17].
Note that GPUWattch internally uses CACTI [6], which does not support a leakage power saving mechanism. Therefore, we modified GPUWattch to use the CACTI-P version [18], which supports it. CACTI-P uses a minimum data retention voltage to enable the SRAM cells to enter SLEEP state without losing their data. We chose SRAM\(_{vccmin}\) to be the default value (provided by CACTI-P depending on the technology node, 22 nm in this case). To put SRAM cells in OFF state, we configured SRAM\(_{vccmin}\) to 0 V. After running several experiments, we chose the threshold value (W) as 3, which achieves the lowest energy for the largest number of kernels. We used a latency of 1 cycle to change a register state from SLEEP to ON, and 2 cycles to change a register state from OFF to ON. These latency and energy overheads are included throughout our results. We evaluated GREENER on 21 kernels from the benchmark suites CUDA-SDK [8], GPGPU-SIM [5], Parboil [1], and Rodinia [7], as shown in Table 3.
We use Baseline to denote the default GPGPU-Sim implementation that does not use any leakage power saving mechanisms. Sleep-Reg denotes the approach that optimizes the baseline approach by (1) turning OFF the unallocated registers and (2) turning the allocated registers into SLEEP state immediately after the registers are accessed [3].
Comparing Register Leakage Power: Figure 6 shows the effectiveness of GREENER and Sleep-Reg by measuring the reduction in leakage power with respect to Baseline. From the figure, we observe that GREENER shows an average (geometric mean, denoted G.Mean) reduction in leakage power of 69.21% when compared to the Baseline. This shows that GREENER is effective in turning the instruction registers into a lower power state, SLEEP or OFF, depending on the behavior of the registers. The Baseline does not provide any mechanism to save leakage power; as a result, the registers of a warp continue to consume leakage power throughout the warp execution. Figure 6 also shows that the Sleep-Reg approach reduces the register leakage power by 60.23% when compared to Baseline; however, GREENER is more power efficient than Sleep-Reg. This is because Sleep-Reg reduces the leakage power by turning the instruction registers into SLEEP state immediately after the instruction operands are accessed, without considering the access patterns of the registers. If a register needs an immediate access, then keeping the register in SLEEP instead of ON state requires additional latency cycles to wake up the register, and during these additional cycles the registers consume power.
Performance Overhead Using Simulation Cycles: Figure 7 shows the performance overheads of the GREENER and Sleep-Reg approaches in terms of the number of simulation cycles with respect to Baseline. On average, the applications show a negligible performance overhead of 0.53% with respect to Baseline. A slowdown is expected because GREENER turns the registers into SLEEP or OFF states to enable power savings, and these registers are turned back to ON state (woken up) when they need to be accessed. This wake-up process takes a few additional latency cycles, which increases the number of simulation cycles. Interestingly, some applications (BP, LPS, MC2, MR1, NN2, SP, and VA) show an improvement in their performance. This occurs due to a change in the issue order of the instructions: warps that require their registers to be woken up cannot be issued in the current cycle, so other resident warps that are ready are issued instead. This change in the issue order alters the memory access patterns, which in turn changes the L1 and L2 cache misses. For instance, in the BP, LPS, MC2, and NN1 applications, we observe an improvement in performance due to fewer pipeline stall cycles with GREENER when compared to Baseline. Figure 7 also shows that Sleep-Reg has an average performance degradation of 1.48% when compared to the Baseline approach. This degradation is higher than GREENER's because Sleep-Reg turns all the instruction registers into SLEEP state after the instruction operands are accessed, irrespective of their usage pattern.
Comparing Register Leakage Energy: Figure 8 compares the total energy savings of GREENER and Sleep-Reg w.r.t. Baseline. The results show that GREENER achieves an average reduction in register leakage energy of 69.04% and 23.29% when compared to Baseline and Sleep-Reg respectively. From Figs. 6 and 7, we see that GREENER shows more leakage power saving and has negligible performance overhead with respect to the Baseline; hence we achieve a significant reduction in leakage energy.
Effectiveness of Optimizations: We show the effectiveness of the proposed optimizations in Fig. 9. We observe that the compiler optimization (discussed in Sect. 3.1, and denoted as Comp-OPT) saves more energy (average 69.09%) when compared to Sleep-Reg (59.65%). This shows that turning the registers into low power states (SLEEP or OFF state) with the knowledge of register access pattern is more effective than turning the registers into SLEEP state after accessing them.
The run-time optimization (discussed in Sect. 3.3) is evaluated by combining it with Comp-OPT; we denote the combination as GREENER in the figure. From the results, we observe that, for most of the applications, GREENER shows minor improvements when compared to Comp-OPT. This is because the run-time optimization only corrects the power state of a register, turning it to ON state when it detects a future access to the register at run-time. If the register is not found to be accessed in the near future at run-time, it retains the power state as directed by Comp-OPT. For some applications (e.g., NN3), GREENER is less efficient when compared to Comp-OPT. This occurs when a register that is determined to be accessed in the near future does not get accessed due to reasons such as scheduling order, scoreboard stalls, or the unavailability of the corresponding execution unit. Note that the effectiveness of the run-time optimization depends on the application behavior at the branch divergence points.
Analyzing Hardware Overheads: To support leakage power saving, CACTI-P [18] introduces additional sleep transistors into the SRAM structures. These transistors enable us to put the registers into low power states (SLEEP or OFF) after accessing the operands. For the configuration used in our experiments, Table 4 shows the additional area, latency, and energy associated with the sleep transistor circuitry. Note that in our experiments, we conservatively consider the latency overhead to change the power state from OFF to ON to be 2 cycles. We also evaluated GREENER by varying the wake-up latency overhead (the results are reported in [11]). We observed that even with varying wake-up latencies, the applications show a significant reduction in leakage energy when compared to Baseline.
5 Related Work
Leakage power has become a major source of power dissipation in CMOS technology. Reducing leakage power has been better studied in the context of CPUs than of GPUs. Though GREENER only targets the leakage power consumption of GPU register files, we briefly describe techniques to save leakage power in the context of both CPUs and GPUs. A comprehensive list of architectural techniques to reduce the leakage power of CPUs is described in [14]. A survey of methods to reduce GPU power is presented in [20].
CPU Leakage Power Saving Techniques: Powell et al. [22] proposed a state destroying technique, Gated-\(V_{dd}\), to minimize the leakage power of SRAM cells by gating supply voltage. Several methods [13, 23] leverage Gated-\(V_{dd}\) technique to reduce the leakage power of cache memory by turning off the inactive cache lines. However, these techniques cannot preserve the state of the cache lines. To maintain the state, Flautner et al. [9] proposed an architectural technique that reduces the leakage power by putting the cache lines into a drowsy state. Other approaches [21] exploit this by using cache access patterns to put cache lines in the drowsy state. As expected, the leakage power savings in this (drowsy) approach are less when compared to Gated-\(V_{dd}\) approach.
GPU Leakage Power Saving Techniques: Warped register file [3] leverages the drowsy approach to reduce the leakage power of register files by putting the registers into the drowsy state immediately after accessing them. However, it does not take the register access pattern into account while turning the registers into low power states. This approach is closest to GREENER and has been quantitatively compared in our results. Register file virtualization [12] reduces the register leakage power by reallocating unused registers to another warp. Pilot register file [4] partitions the register file into fast and slow register files, and allocates registers to these parts depending on the frequency of register usage. The partitioning of the registers is done statically. Therefore, if a register is accessed more frequently for some duration, and less frequently for another, then allocating the register to either of the partitions can make it less energy efficient. GREENER changes power states during execution, so it does not have this drawback.
6 Conclusions and Future Work
This paper focuses on reducing the leakage power of the register file in GPUs. We discuss various opportunities to save leakage power of the registers by analyzing their access patterns. We propose a system called GREENER that employs compiler analysis to determine the power state of each register at each program point. To improve the effectiveness further, we introduce a run-time optimization that dynamically corrects the power states determined by the static analysis. On evaluating GREENER using several applications, we observed that the knowledge of register access patterns and the compiler optimizations help improve the energy efficiency of the register file with a negligible simulation-cycle overhead.
In the future, we plan to explore several hardware and software strategies to reduce the register leakage energy further. For instance, we can study the effect of various register allocation mechanisms and scheduling policies, and propose algorithms that minimize leakage energy by leveraging GREENER.
The register leakage power constitutes only a part of the total leakage power. Other resources in the GPU, such as shared memory, caches, and DRAM, also dissipate leakage power during a kernel execution. In the future, we plan to work on reducing the power consumption of these other GPU resources by analyzing the application behavior and the resource access patterns.
Notes
1. Drowsy [3, 9] and SLEEP [14, 18, 19] states refer to the same low power, data preserving states. In this paper, we use the term SLEEP. Also, the techniques [14, 18] to reduce leakage power using low power states address the subthreshold leakage power. Hence, in this paper the savings in leakage energy refer to savings in subthreshold leakage energy.
References
Parboil Benchmarks. http://impact.crhc.illinois.edu/Parboil/parboil.aspx
Kepler Architecture (2014). http://www.nvidia.com/object/nvidia-kepler.html
Abdel-Majeed, M., Annavaram, M.: Warped register file: a power efficient register file for GPGPUs. In: HPCA (2013). https://doi.org/10.1109/HPCA.2013.6522337
Abdel-Majeed, M., Shafaei, A., Jeon, H., Pedram, M., Annavaram, M.: Pilot register file: energy efficient partitioned register file for GPUs. In: HPCA (2017). https://doi.org/10.1109/HPCA.2017.47
Bakhoda, A., Yuan, G., Fung, W., Wong, H., Aamodt, T.: Analyzing CUDA workloads using a detailed GPU simulator. In: ISPASS (2009). https://doi.org/10.1109/ISPASS.2009.4919648
Che, S., et al.: Rodinia: a benchmark suite for heterogeneous computing. In: IISWC (2009). https://doi.org/10.1109/IISWC.2009.5306797
CUDA-SDK (2014). http://docs.nvidia.com/cuda/cuda-samples
Flautner, K., Kim, N.S., Martin, S., Blaauw, D., Mudge, T.: Drowsy caches: simple techniques for reducing leakage power. SIGARCH Comput. Archit. News 30(2) (2002). https://doi.org/10.1145/545214.545232
GPGPU-Sim Simulator (2014). http://www.gpgpu-sim.org
Jatala, V., Anantpur, J., Karkare, A.: GREENER: a tool for improving energy efficiency of register files. CoRR abs/1709.04697 (2017)
Jeon, H., Ravi, G.S., Kim, N.S., Annavaram, M.: GPU register file virtualization. In: MICRO (2015). https://doi.org/10.1145/2830772.2830784
Kaxiras, S., Hu, Z., Martonosi, M.: Cache decay: exploiting generational behavior to reduce cache leakage power. In: ISCA (2001). https://doi.org/10.1145/379240.379268
Kaxiras, S., Martonosi, M.: Computer Architecture Techniques for Power-Efficiency, 1st edn. Morgan and Claypool Publishers (2008)
Khedker, U., Sanyal, A., Karkare, B.: Data Flow Analysis: Theory and Practice, 1st edn. CRC Press Inc., Boca Raton (2009)
Kim, N.S., et al.: Leakage current: Moore’s law meets static power. Computer 36(12) (2003). https://doi.org/10.1109/MC.2003.1250885
Leng, J., et al.: GPUWattch: enabling energy optimizations in GPGPUs. In: ISCA (2013). https://doi.org/10.1145/2485922.2485964
Li, S., Chen, K., Ahn, J.H., Brockman, J.B., Jouppi, N.P.: CACTI-P: architecture-level modeling for sram-based structures with advanced leakage reduction techniques. In: ICCAD (2011). https://doi.org/10.1109/ICCAD.2011.6105405
Li, S., Ahn, J.H., Strong, R.D., Brockman, J.B., Tullsen, D.M., Jouppi, N.P.: The McPAT Framework for multicore and manycore architectures: simultaneously modeling power, area, and timing. TACO 10(1) (2013). https://doi.org/10.1145/2445572.2445577
Mittal, S., Vetter, J.S.: A survey of methods for analyzing and improving GPU energy efficiency. ACM Comput. Surv. 47(2) (2014). https://doi.org/10.1145/2636342
Petit, S., Sahuquillo, J., Such, J.M., Kaeli, D.: Exploiting temporal locality in drowsy cache policies. In: CF (2005). https://doi.org/10.1145/1062261.1062321
Powell, M., Yang, S.H., Falsafi, B., Roy, K., Vijaykumar, T.N.: Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories. In: ISLPED (2000). https://doi.org/10.1145/344166.344526
Zhang, M., Asanović, K.: Fine-grain CAM-tag cache resizing using miss tags. In: ISLPED (2002). https://doi.org/10.1145/566408.566444
© 2018 Springer International Publishing AG, part of Springer Nature
Jatala, V., Anantpur, J., Karkare, A. (2018). Reducing GPU Register File Energy. In: Aldinucci, M., Padovani, L., Torquati, M. (eds) Euro-Par 2018: Parallel Processing. Euro-Par 2018. Lecture Notes in Computer Science(), vol 11014. Springer, Cham. https://doi.org/10.1007/978-3-319-96983-1_6
Print ISBN: 978-3-319-96982-4
Online ISBN: 978-3-319-96983-1