What Computer Architects Need To Know About Memory Throttling

Heather Hanson, Karthick Rajamani

To cite this version:

Heather Hanson, Karthick Rajamani. What Computer Architects Need to Know About Memory Throttling. John Carter and Karthick Rajamani. WEED 2010 - Workshop on Energy-Efficient Design, Jun 2010, Saint Malo, France. 2010. <inria-00492851>

HAL Id: inria-00492851


https://hal.inria.fr/inria-00492851
Submitted on 17 Jun 2010

What Computer Architects Need to Know About Memory Throttling
Heather Hanson (hlhanson@us.ibm.com)
Karthick Rajamani (karthick@us.ibm.com)
IBM Research

Abstract

Memory throttling is one technique for power and energy management that is currently available in commercial systems, yet it has received little attention in the architecture community. This paper provides an overview of memory throttling: how it works, how it affects performance, and how it controls power. We provide measured power and performance data with memory throttling on a commercial blade system, and discuss key issues for power management with memory throttling mechanisms.

1 Memory throttling

1.1 Overview

Memory throttling is a power management technique that is currently available in commercial systems and incorporated in several proposed power and thermal control schemes, yet the underlying mechanisms and quantitative effects on power and performance are not widely known.

In a nutshell, memory throttling restricts read and write traffic to main memory as a means of controlling power consumption. A significant fraction of memory power is proportional to read and write bandwidth, and restricting bandwidth creates an upper bound for memory power.

Computer architects should be aware of memory throttling because it manipulates the instruction and data streams that feed the processor cores. It can create performance bottlenecks due to bandwidth over-constriction, as well as opportunities for improving performance through judicious power budgeting.

Simple control with a fixed memory throttle setting allows the full capacity of the memory to be available, with a regulated rate of access to limit power consumption, in enterprise systems with large memory configurations that, if left unchecked, would exceed the system's power budget.

Altering memory throttling dynamically tailors the access rate to workload demands and variable power allocation. This is useful for regulating DIMM temperature [1] [2] [3], providing memory power control beyond power-down modes [4], and optimizing system-wide performance or power with techniques such as power shifting [5].

Enforcing a power cap on the memory subsystem will become increasingly important for servers with large memory configurations to support data-intensive applications, and to support virtual systems that pack multiple software stacks, and their bandwidth requirements, into a compact number of processing elements.

Even in systems where memory power is a small fraction of the total, memory throttling provides an essential function: in power-limited situations, every Watt trimmed from the memory budget is a Watt gained for use in processor cores and other components.

1.2 Comparison to CPU clock throttling

At a high level, memory throttling is analogous to the more familiar processor clock throttling. Clock throttling creates a duty cycle for the processor core's clock signal, where a time period is divided into distinct run and hold intervals. During the run interval, the clock signal runs freely. During the hold interval, the clock signal is gated to remain still, and computation halts.

Figure 1 illustrates run-hold duty cycles for clock throttling, not to scale. The period of one run-hold cycle is typically on the order of thousands of cycles. With long (relative to processor pipelines) duty-cycle periods, clock throttling behavior is a repeating sequence of a burst of activity followed by an idle period. During the burst of activity, processor cores operate normally; during the idle period, computation and memory requests halt and power consumption drops accordingly. In this way, the clock throttling mechanism acts as a governor for the power and performance of the system, with direct control of the processor and indirect control of the memory system.

Similar to clock throttling's limits on processor cycles within a time interval, memory throttling regulates read and write accesses within a time interval. There are several approaches to implementing memory throttling features. One technique is similar to clock throttling with run-hold duty cycles, where memory accesses pass through at the requested rate untouched during the run portion of a time interval, then are halted during the hold portion [4] [5]. Another memory throttling mechanism is to periodically insert one or more idle clock cycles between every Nth and (N+1)th memory access [1], spreading out the accesses and lowering peak bandwidth. Yet another mechanism uses a bit mask to allow or block memory accesses, optimized to minimize interruptions to
[Figure 1 signal traces: free-running clock; throttled CPU clock; throttled memory accesses]

Figure 1: Clock throttling maintains the original frequency during a portion of the time interval; memory throttling
allows memory accesses within the time interval up to a quota. In this example, six memory accesses are allowed in
each interval, intermittent in the first interval and sequential until the quota in the second and third intervals.
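
The quota behavior described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative model, not the memory controller's actual logic: the function name `simulate_quota_throttle` and the per-window request counts are hypothetical, and each window is treated as an indivisible unit rather than a sequence of bus frames.

```python
def simulate_quota_throttle(requests_per_window, quota):
    """Serve at most `quota` accesses per window; defer the excess.

    requests_per_window: new memory accesses arriving in each window
    (hypothetical demand trace). Returns (served, backlog): accesses
    served in each window and accesses still waiting at its end.
    """
    served, backlog = [], []
    waiting = 0
    for arriving in requests_per_window:
        waiting += arriving
        granted = min(waiting, quota)  # the quota caps accesses per window
        waiting -= granted             # over-quota requests wait for a later window
        served.append(granted)
        backlog.append(waiting)
    return served, backlog


if __name__ == "__main__":
    quota = 6  # six accesses per interval, as in the Figure 1 example
    # Demand that fits within the quota passes through unchanged.
    print(simulate_quota_throttle([3, 4, 2], quota))
    # Bursty demand is spread over later windows.
    print(simulate_quota_throttle([10, 2, 0, 9, 3], quota))
```

In the second trace all 24 requested accesses are eventually served, but no window carries more than six of them, which is the sense in which a quota bounds peak bandwidth.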

real-time streaming data [6].

In this paper, we focus on an implementation within the memory controller that allows up to N transactions within a time period. The memory throttling time period in this implementation corresponds to a window of 32 frames, approximately 50 ns at a rate of one frame per 667 MHz bus cycle. Once the quota of N reads and writes is reached in a time period, any additional requests must wait for a future time period to proceed. Instructions waiting for the over-quota memory transactions must stall until they are processed. The POWER6 system used in these experiments uses a single memory throttle value universally applied to all DIMMs. Other implementations operate on finer granularities of unique throttle values, such as one memory throttling option on commercially available POWER7 systems that supports unique memory throttle values for each of two channel pairs within a memory controller.

1.3 Power and Performance

There are a few important distinctions to consider for memory throttling with respect to power and performance. First, memory throttling does not alter the DRAM clock in the manner that clock throttling alters the CPU clock signal. Memory throttling is usually implemented within the memory controller (processor chip or chipset) to restrict the memory request rate, and the DRAM itself is unchanged, unlike memory low-power states and power-down modes.

With quota-style memory throttling, the amount of additional latency due to throttling restrictions is a function of requested bandwidth rather than a fixed amount of busy time per interval. A workload whose bandwidth fits within the quota will proceed unchanged by the memory throttle mechanism. A steady workload with bandwidth needs beyond the quota will slow down as it waits for memory transactions to complete at a slower pace. A bursty workload with bandwidth spikes above the quota will effectively be smoothed out as excess transactions are postponed and fulfilled during a later interval; latency for some individual transactions will increase, though the average throughput may or may not be affected. Thus, the performance impact of memory throttling depends upon both the bandwidth and the time-varying behavior of workloads.

To the first order, memory power is linearly proportional to the sum of read and write bandwidth. In situations where memory bandwidth requirements fall below the throttle-enforced quota, bandwidth, and thus power consumption, is unaffected by memory throttling. Rather, the memory throttling mechanism enforces an upper bound on memory power consumption, creating an effective tool for power budgeting.

2 Infrastructure

2.1 System

In this study, we characterized memory throttling with an IBM JS12 blade system [7] with one dual-core POWER6 processor with a 3.8 GHz clock rate, hosting a SLES10 Linux operating system. A single memory controller on the POWER6 processor orchestrates reads and writes between both cores and main memory. The memory subsystem has a 16 GB capacity, configured as eight DDR2 667 MHz 2 GB (1Rx4) DIMMs.

2.2 Workloads

We characterized the system response to memory throttling with a set of micro-benchmarks with specific characteristics. Each benchmark maintains steady operation during execution, providing a single point for comparison, without complicating the analysis with phases or other time-varying behavior. This small suite of micro-benchmarks covers a wide range of memory characteristics.

Two micro-benchmarks use the same floating-point workload, DAXPY, with distinct memory footprints. The small data set for DAXPY-L1 fits within the level-1 data cache, while the DAXPY-DIMM footprint is 8 MB,
forcing off-chip memory accesses. By performing the same computation with different data set sizes, we are able to isolate effects due to the memory subsystem behavior.

Computation within the FPMAC kernel is similar in nature to DAXPY; the primary difference is that the floating-point multiply and accumulate algorithm in FPMAC computes and stores with a single array while the DAXPY implementation uses two arrays, providing a different flavor of memory access pattern with the same data set size (8 MB) and similar compute load.

The RandomMemory-DIMM micro-benchmark generates random address locations for reads and writes within an 8 MB memory footprint. The memory access patterns defeat prefetching that would benefit the regular access patterns of the FPMAC and DAXPY kernels, exposing the full effects of memory latency at each throttle point. The FPMAC, DAXPY, and RandomMemory kernels are short C programs with heavy computational load or intensive memory accesses and very little overhead.

We also use a single calibration phase of the Java benchmark SPECPower_ssj2008 that continuously injects transactions as fast as the system can process them, executing for a fixed period of time to provide insight into the effects of memory throttling on transactional workloads.

2.3 Measurements

The POWER6 blade system used for the experiments is instrumented with on-board power sensors, including memory power, and event counters for throughput in units of instructions per second (IPS), memory reads, and memory writes (among others).

The blade power management controller obtains event counter data via a dedicated I2C port on the processor chip [8], averages the data over 256 ms, and sends measurements via an Ethernet connection to a separate workstation, where a monitoring program dumps a trace file for our analysis, without interfering with the memory characterization experiments.

2.4 Throttling Characterization

We characterized the blade system's response to memory throttling by executing four copies of a micro-benchmark (two cores, two threads each) with a fixed quota-style throttle while we recorded power and performance data, then changed the throttle setting and re-ran the same workload. Throttle values range from 1 to 32, out of a window of 32 accesses. A 100% throttle setting of 32/32 is unthrottled, 16/32 is 50% throttled, and so on. As memory throttle settings are successively lower, memory bandwidth is more extensively throttled.

The amount of work performed by DAXPY-L1, DAXPY-DIMM, FPMAC-DIMM, and RandomMemory-DIMM at each throttle setting remained constant, and execution time varied. The single calibration phase of the SPECPower_ssj2008 benchmark executed for a fixed time duration, and the amount of work completed varied with memory throttle setting.

We summarized memory characterization data by calculating the median value for memory bandwidth, throughput (IPS), and memory power over the observed intervals for each permutation of workload and memory throttle setting.

3 Bandwidth

Figure 2 charts memory traffic normalized to the peak traffic observed over this suite, in DAXPY-DIMM. The curves show three distinct regions of operation: a bandwidth-limited region where the bandwidth is a linear function of memory throttle, a bandwidth-saturated region, and a transitional portion between the limited and saturated regions.

3.1 Bandwidth-limited

In the bandwidth-limited region, increasing the available bandwidth by increasing the throttle value translates directly to higher memory traffic. Changing the throttle value within this region directly affects bandwidth, and thus has a direct effect on both performance and power. The bandwidth-limited region may include only very-throttled operation, as in the case of the SPECPower_ssj2008 phase, or a wider range of throttle values, as in the case of the DAXPY-DIMM and FPMAC-DIMM benchmarks with large memory footprints.

3.2 Transition

The transition region is critical for power and performance management with memory throttling. In this region, changing the memory throttle setting does affect memory traffic, but in a more complex manner than in the bandwidth-limited region. The uncertainty in the relation between memory throttle and bandwidth within this region, and in the extent of the region itself, creates a challenge for managing power and performance.

Memory throttle values that bound transition regions vary by benchmark, with transitions at lower throttle values for less-demanding workloads and higher throttle values for memory-intensive workloads.

Each workload has a gap between the maximum available and consumed bandwidth in the transition region, and the extent of the gap varies. For example, at a 30% throttle, RandomMemory-DIMM has not reached its saturation level, yet it consumes less bandwidth than other workloads at the same throttle setting. The knee of the curve is sharper for some workloads than others: the FPMAC-DIMM micro-benchmark has a particularly sharp transition, while the SPECPower_ssj2008 phase has a much more gradual transition. Workloads with a sharper transition are able to use more of the available bandwidth at a given throttle setting, up to the point of saturation. Workloads with more gradual bandwidth
[Figure 2 plot: normalized bandwidth (%) versus memory throttle (%), both axes 0 to 100, with curves for DAXPY-DIMM, FPMAC-DIMM, RandomMemory-DIMM, SPECPower_ssj2008, and DAXPY-L1]

Figure 2: Memory traffic: total read and write accesses normalized to peak observed traffic, unthrottled DAXPY-
DIMM. Each workload has unique characteristics for the three throttle regions: bandwidth-limited, transition, and
bandwidth-saturated.
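
As an illustration of the three regions named in the Figure 2 caption, the following Python sketch labels each point of a bandwidth-versus-throttle curve after the fact. It is a simplified assumption-laden model, not the authors' analysis method: the function names, the tolerance `tol`, the linear `available_bandwidth` model (built around the 75% memory bus limit reported in Section 3.3), and the sample curves are all hypothetical. It also relies on having the whole curve, including the unthrottled point, which is exactly the information a run-time controller lacks.

```python
def available_bandwidth(throttle_pct, bus_limit_pct=75.0):
    """Hypothetical model: available bandwidth (% of peak) grows linearly
    with the throttle setting until the memory bus limit is reached."""
    return min(throttle_pct, bus_limit_pct) * 100.0 / bus_limit_pct


def classify_regions(throttle_pct, consumed_pct, tol=2.0):
    """Label each throttle point 'limited', 'transition', or 'saturated'.

    Assumes points are sorted by increasing throttle setting and that the
    last point is unthrottled, so its bandwidth is the saturation level.
    """
    saturation = consumed_pct[-1]
    labels = []
    for throttle, consumed in zip(throttle_pct, consumed_pct):
        if abs(consumed - saturation) <= tol:
            labels.append("saturated")   # flat part of the curve
        elif abs(available_bandwidth(throttle) - consumed) <= tol:
            labels.append("limited")     # workload uses all it is allowed
        else:
            labels.append("transition")  # gap between available and consumed
    return labels


if __name__ == "__main__":
    throttles = list(range(10, 101, 10))
    gradual = [13, 26, 36, 42, 45, 46, 46, 46, 46, 46]  # made-up curve
    cache_resident = [1] * 10                            # made-up curve
    print(classify_regions(throttles, gradual))
    print(classify_regions(throttles, cache_resident))
```

With these made-up curves, the gradual workload passes through all three regions, while the cache-resident workload is labeled saturated at every setting, mirroring the qualitative behavior the paper reports for DAXPY-L1.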

roll-off have other bottlenecks that also factor into limiting the rate of memory requests.

While the bandwidth regions are clearly visible in off-line analysis of the full throttle range, it is difficult at run time to discern whether the current workload(s) are in the transitional area. For example, at the 50% throttle level in Figure 2, the observed bandwidths for the DAXPY-DIMM and FPMAC-DIMM workloads are nearly identical, yet DAXPY-DIMM is in the linear bandwidth-limited region and FPMAC-DIMM is at a very sharp transition point. Without knowing the bandwidth trends from neighboring throttle points, a controller would not know whether to expect a linear change, a non-linear change, or no change in bandwidth for an incremental change in throttle value.

3.3 Bandwidth-saturated

In the bandwidth-saturated region, the flat portions of each curve in the graph, memory traffic does not change with memory throttle settings. Each workload settles to a unique saturation level. Other bottlenecks and the workload's data footprint limit the memory request rate.

On the blade system used to collect the data shown in Figure 2, the memory bus limits the available bandwidth for memory throttle settings 75% and above, meaning that throttle settings between 75% and 100% provide essentially the same amount of available bandwidth.

The DAXPY benchmarks illustrate two ends of the saturation spectrum. Bandwidth consumed by DAXPY-DIMM approaches the architectural limit of the memory bus; the other benchmarks in this collection are limited by other factors. The cache-resident dataset of DAXPY-L1 naturally limits its memory request rate, and the consumed bandwidth is so low that it is independent of the memory throttle setting, essentially in the bandwidth-saturated region throughout the entire range.

Increasing memory throttle settings beyond the saturation level has a negligible effect on bandwidth. It follows that increasing memory throttle settings beyond the saturation level will neither improve performance nor draw more power. For example, no memory throttle setting would have any bearing on DAXPY-L1, nor would modulating memory throttle settings between 40% and 100% for the SPECPower_ssj2008 calibration phase recorded in Figure 2.

4 Performance

Figure 3 plots performance (IPS) as a function of the memory throttle setting. Data are normalized to the peak throughput of individual benchmarks to factor out the effect of disparate throughput levels among benchmarks in the suite.

[Figure 3 plot: normalized IPS (%) versus memory throttle (%), with curves for DAXPY-L1, SPECPower_ssj2008, RandomMemory-DIMM, FPMAC-DIMM, and DAXPY-DIMM]

Figure 3: Throughput (instructions per second), each benchmark normalized to its own peak.

One advantage of power-cap control rather than continuous control is that when memory traffic is less than the limit imposed by memory throttling, performance is unchanged. Cache-resident DAXPY-L1 throughput is unaffected by memory throttling.

At the opposite end of the spectrum, DAXPY-DIMM shows noticeable throughput loss for throttling up to 65%. Remember that at 75%, the memory bus becomes the dominant bandwidth-limiting factor. DAXPY-DIMM (and other workloads with similar characteristics) are sensitive to the majority of memory throttling values in the useful range up to 75%, and almost any power control via memory throttling would directly degrade performance.

SPECPower_ssj2008 tolerates memory throttling without serious performance loss down to about 40% throttled, below which it has non-linear performance loss with memory throttling throughout its wide transition region. Kernels like FPMAC-DIMM with very short transition regions are dominated by linear performance loss in the bandwidth-limited region.

Workloads with a sharp-knee characteristic would show no response to changes in memory throttling in their bandwidth-saturation regions, until an incremental step down in memory throttling level tipped them into a bandwidth-limited region and performance suddenly dropped.

5 Power

Figure 4 confirms that memory power consumption is linearly proportional to memory bandwidth on our system. Data are normalized to the maximum observed memory power measurement, in DAXPY-DIMM. The near-zero bandwidth requirements of the cache-resident DAXPY-L1 show that about 40% of memory power is not under throttle control in this system.

[Figure 4 plot: memory power normalized to peak observed (%) versus consumed memory bandwidth (%), with points for DAXPY-DIMM, FPMAC-DIMM, RandomMemory-DIMM, SPECPower_ssj2008, and DAXPY-L1]

Figure 4: Memory power is a linear function of bandwidth. In this system, about 60% of memory power is controlled by memory throttling.

Measured memory power data points normalized individually per benchmark, shown in Figure 5, demonstrate where opportunity for power control lies. Memory throttling offers essentially no power control for core-bound workloads such as DAXPY-L1, a small range of control for moderate-intensity workloads such as SPECPower_ssj2008, and a large swing for memory-intensive workloads such as DAXPY-DIMM and FPMAC-DIMM that have larger unthrottled memory power consumption.

[Figure 5 plot: memory power normalized to unthrottled, per benchmark (%), versus memory throttle (%), with curves for DAXPY-L1, SPECPower_ssj2008, RandomMemory-DIMM, FPMAC-DIMM, and DAXPY-DIMM]

Figure 5: Relationship between memory power and throttle varies by workload.

Since quota-style memory throttling enforces an upper bound on power consumption, actual memory power consumption will be in the range between the static power levels and the memory power cap, depending upon run-time bandwidth demands.

6 Conclusion

Memory throttling exists in various forms in commercial systems, yet it has garnered little attention in architecture studies to date. Memory throttling can be used to enforce memory power budgets, enabling large memory configurations that would violate power constraints if left unthrottled, and also supporting dynamic techniques such as power shifting.

As with nearly all power management options, memory throttling comes with the price of performance penalties in some situations. We point out the regimes of power control with no performance loss, and where more extensive power reduction does degrade performance.

This paper characterized the effects of memory throttling on throughput performance and memory power on a commercial blade server system. We demonstrated the three regions of bandwidth response: bandwidth-limited, transition, and bandwidth-saturation. Understanding these regions and the workload characteristics that determine the interaction between throttle settings and bandwidth restriction enables wise choices in memory power management design.

7 Acknowledgments

Thank you to our colleagues who contributed technical expertise and system administration support, especially Joab Henderson, Guillermo Silva, Kenneth Wright, and the power-aware systems department of the IBM Austin Research Laboratory.

References

[1] C.-H. R. Wu, “U.S. patent 7352641: Dynamic memory throttling for power and thermal limitations.” Sun Microsystems, Inc., issued 2008.

[2] J. Iyer, C. L. Hall, J. Shi, and Y. Huang, “System memory power and thermal management in platforms built on Intel Centrino Duo Mobile Technology,” Intel Technology Journal, vol. 10, May 2006.

[3] E. C. Sampson, A. Navale, and D. M. Puffer, “U.S. patent 6871119: Filter based throttling.” Intel Corporation, issued 2005.

[4] I. Hur and C. Lin, “A comprehensive approach to DRAM power management,” in Proceedings of the 14th International Symposium on High Performance Computer Architecture (HPCA ’08), August 2008.

[5] W. Felter, K. Rajamani, C. Rusu, and T. Keller, “A Performance-Conserving Approach for Reducing Peak Power Consumption in Server Systems,” in Proceedings of the 19th ACM International Conference on Supercomputing, June 2005.

[6] O. Kahn and E. Birenzwig, “U.S. patent 6662278: Adaptive throttling of memory accesses, such as throttling RDRAM accesses in a real-time system.” Intel Corporation, issued 2003.

[7] International Business Machines, Inc., “IBM BladeCenter JS12 Express blade product description.” Available at ftp://public.dhe.ibm.comcommon/ssi/pm/sp/n/bld03013usen/-BLD03013USEN.PDF, 2008.

[8] M. S. Floyd, S. Ghiasi, T. W. Keller, K. Rajamani, F. L. Rawson, J. C. Rubio, and M. S. Ware, “System power management support in the IBM POWER6 microprocessor,” IBM Journal of Research and Development, vol. 51, pp. 733–746, November 2007.
