What Computer Architects Need To Know About Memory Throttling
Heather Hanson, Karthick Rajamani
Figure 1: Clock throttling maintains the original frequency during only a portion of each time interval; memory throttling allows memory accesses within the time interval up to a quota. In this example, six memory accesses are allowed in each interval: they arrive intermittently in the first interval and proceed back-to-back until the quota is reached in the second and third intervals.
real-time streaming data [6].

In this paper, we focus on an implementation within the memory controller that allows up to N transactions within a time period. The memory throttling time period in this implementation corresponds to a window of 32 frames, approximately 50 ns at a rate of one frame per 667 MHz bus cycle. Once the quota of N reads and writes is reached in a time period, any additional requests must wait for a future time period to proceed. Instructions waiting for the over-quota memory transactions must stall until they are processed. The POWER6 system used in these experiments uses a single memory throttle value universally applied to all DIMMs. Other implementations apply throttle values at finer granularity; for example, one memory throttling option on commercially available POWER7 systems supports a unique memory throttle value for each of two channel pairs within a memory controller.
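Conceptually, the quota mechanism amounts to a per-window counter. The following C sketch is purely illustrative (the names and structure are not those of the POWER6 memory controller); it allows at most quota accesses per 32-frame window and defers the rest:

```c
/* Minimal sketch of quota-style memory throttling: up to "quota"
 * transactions may issue per 32-frame window; the rest wait for a
 * later window. Names and structure are illustrative, not the
 * actual POWER6 memory-controller implementation. */
#include <stdio.h>

#define WINDOW_FRAMES 32

/* Returns 1 if a request may issue in this frame, 0 if it must wait. */
static int may_issue(unsigned frame, unsigned quota, unsigned *issued_in_window)
{
    if (frame % WINDOW_FRAMES == 0)      /* new window: reset the counter */
        *issued_in_window = 0;
    if (*issued_in_window < quota) {     /* still under quota */
        (*issued_in_window)++;
        return 1;
    }
    return 0;                            /* over quota: defer to a later window */
}

int main(void)
{
    unsigned quota = 6;                  /* e.g., 6 of 32 frames, as in Figure 1 */
    unsigned issued = 0;
    for (unsigned frame = 0; frame < 96; frame++) {
        int wants_access = 1;            /* pretend a request is always pending */
        if (wants_access && may_issue(frame, quota, &issued))
            printf("frame %u: issue\n", frame);
    }
    return 0;
}
```

With a request pending every frame, this reduces to the back-to-back pattern of the second and third intervals in Figure 1; with intermittent requests, accesses spread across the window as in the first interval.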
1.3 Power and Performance

There are a few important distinctions to consider for memory throttling with respect to power and performance. First, memory throttling does not alter the DRAM clock in the manner that clock throttling alters the CPU clock signal. Memory throttling is usually implemented within the memory controller (processor chip or chipset) to restrict the memory request rate, and the DRAM itself is unchanged, unlike memory low-power states and power-down modes.

With quota-style memory throttling, the amount of additional latency due to throttling restrictions is a function of requested bandwidth rather than a fixed amount of busy time per interval. A workload whose bandwidth fits within the quota will proceed unchanged by the memory throttle mechanism. A steady workload with bandwidth needs beyond the quota will slow down as it waits for memory transactions to complete at a slower pace. A bursty workload with bandwidth spikes above the quota will effectively be smoothed out as excess transactions are postponed and fulfilled during a later interval; latency for some individual transactions will increase, though the average throughput may or may not be affected. Thus, the performance impact of memory throttling depends upon both the bandwidth and the time-varying behavior of workloads.

To the first order, memory power is linearly proportional to the sum of read and write bandwidth. In situations where memory bandwidth requirements fall below the throttle-enforced quota, bandwidth, and thus power consumption, are unaffected by memory throttling. Rather, the memory throttling mechanism enforces an upper bound on memory power consumption, creating an effective tool for power budgeting.

2 Infrastructure

2.1 System

In this study, we characterized memory throttling using an IBM JS12 blade system [7] with one dual-core 3.8 GHz POWER6 processor, hosting a SLES10 Linux operating system. A single memory controller on the POWER6 processor orchestrates reads and writes between both cores and main memory. The memory subsystem has a 16 GB capacity, configured as eight DDR2 667 MHz 2 GB (1Rx4) DIMMs.

2.2 Workloads

We characterized the system response to memory throttling with a set of micro-benchmarks with specific characteristics. Each benchmark maintains steady operation during execution, providing a single point for comparison, without complicating the analysis with phases or other time-varying behavior. This small suite of micro-benchmarks covers a wide range of memory characteristics.
Two micro-benchmarks use the same floating-point workload, DAXPY, with distinct memory footprints. The small data set for DAXPY-L1 fits within the level-1 data cache, while the DAXPY-DIMM footprint is 8 MB, forcing off-chip memory accesses. By performing the same computation with different data set sizes, we are able to isolate effects due to the memory subsystem behavior.

Computation within the FPMAC kernel is similar in nature to DAXPY; the primary difference is that the floating-point multiply-and-accumulate algorithm in FPMAC computes and stores with a single array while the DAXPY implementation uses two arrays, providing a different flavor of memory access pattern with the same data set size (8 MB) and similar compute load.

The RandomMemory-DIMM micro-benchmark generates random address locations for reads and writes within an 8 MB memory footprint. The memory access patterns defeat the prefetching that benefits the FPMAC and DAXPY kernels' regular access patterns, exposing the full effects of memory latency at each throttle point. The FPMAC, DAXPY, and RandomMemory kernels are short C programs with heavy computational load or intensive memory accesses and very little overhead.
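For reference, the core of a DAXPY-style kernel takes only a few lines of C. The following sketch illustrates the general pattern only; it is not the exact micro-benchmark source, and the split of the 8 MB footprint across the two arrays is assumed:

```c
/* Simplified DAXPY-style kernel: y[i] += a * x[i], repeated over a
 * working set sized either to fit in the L1 data cache (DAXPY-L1)
 * or to spill to DRAM (DAXPY-DIMM, 8 MB footprint). Illustrative
 * only; not the actual micro-benchmark source. */
#include <stdlib.h>

static void daxpy(double a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void)
{
    /* 8 MB total footprint, assumed split evenly across the two arrays. */
    size_t n = (8u << 20) / (2 * sizeof(double));
    double *x = calloc(n, sizeof *x);
    double *y = calloc(n, sizeof *y);
    if (!x || !y)
        return 1;

    for (int iter = 0; iter < 1000; iter++)   /* steady, repeatable traffic */
        daxpy(3.0, x, y, n);

    free(x);
    free(y);
    return 0;
}
```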
We also use a single calibration phase of the Java benchmark SPECPower_ssj2008 that continuously injects transactions as fast as the system can process them, executing for a fixed period of time, to provide insight into the effects of memory throttling on transactional workloads.

2.3 Measurements

The POWER6 blade system used for the experiments is instrumented with on-board power sensors, including memory power, and event counters for throughput in units of instructions per second (IPS), memory reads, and memory writes (among others).

The blade power management controller obtains event counter data via a dedicated I2C port on the processor chip [8], averages the data over 256 ms, and sends measurements via an Ethernet connection to a separate workstation, where a monitoring program dumps a trace file for our analysis, without interfering with the memory characterization experiments.

2.4 Throttling Characterization

We characterized the blade system's response to memory throttling by executing four copies of a micro-benchmark (two cores, two threads each) with a fixed quota-style throttle while we recorded power and performance data, then changed the throttle setting and re-ran the same workload. Throttle values range from 1 to 32, out of a window of 32 accesses: a setting of 32/32 (100%) is unthrottled, 16/32 corresponds to a 50% throttle setting, and so on. As memory throttle settings are successively lowered, memory bandwidth is more extensively throttled.
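Under this convention, the throttle percentage is simply the fraction of the window that the quota makes available, which also bounds the achievable bandwidth to first order. The following C helper is illustrative only; the peak-bandwidth value is a placeholder, not a measurement from this system:

```c
/* Convert a quota-style throttle setting (N accesses allowed per
 * 32-access window) into the throttle percentage used in the figures
 * and into a first-order upper bound on memory bandwidth.
 * Illustrative helper; peak_bw_gbs is an assumed placeholder. */
#include <stdio.h>

#define WINDOW 32

static double throttle_percent(int quota)
{
    return 100.0 * quota / WINDOW;          /* 32/32 -> 100% (unthrottled) */
}

static double bandwidth_cap_gbs(int quota, double peak_bw_gbs)
{
    /* Linear cap on available bandwidth. On the measured system the
     * memory bus itself becomes the limit above roughly a 75% setting,
     * so this bound is loose at high throttle values. */
    return peak_bw_gbs * quota / WINDOW;
}

int main(void)
{
    double peak_bw_gbs = 8.0;               /* placeholder; substitute the platform's peak */
    for (int quota = 4; quota <= WINDOW; quota += 4)
        printf("quota %2d/32 = %5.1f%% throttle, <= %.1f GB/s\n",
               quota, throttle_percent(quota), bandwidth_cap_gbs(quota, peak_bw_gbs));
    return 0;
}
```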
The amount of work performed by DAXPY-L1, DAXPY-DIMM, FPMAC-DIMM, and RandomMemory-DIMM at each throttle setting remained constant, and execution time varied. The single calibration phase of SPECPower_ssj2008 executed for a fixed time duration, and the amount of work completed varied with the memory throttle setting.

We summarized the memory characterization data by calculating the median value of memory bandwidth, throughput (IPS), and memory power over the observed intervals for each permutation of workload and memory throttle setting.

3 Bandwidth

Figure 2 charts memory traffic normalized to the peak traffic observed over this suite, in DAXPY-DIMM. The curves show three distinct regions of operation: a bandwidth-limited region where bandwidth is a linear function of the memory throttle, a bandwidth-saturated region, and a transitional portion between the limited and saturated regions.

[Figure 2 plot: Normalized BW (%) versus Memory Throttle (%), one curve per workload: DAXPY-DIMM, FPMAC-DIMM, RandomMemory-DIMM, SPECPower_ssj2008, DAXPY-L1.]

Figure 2: Memory traffic: total read and write accesses normalized to peak observed traffic, unthrottled DAXPY-DIMM. Each workload has unique characteristics for the three throttle regions: bandwidth-limited, transition, and bandwidth-saturated.

3.1 Bandwidth-limited

In the bandwidth-limited region, increasing the available bandwidth by increasing the throttle value translates directly to higher memory traffic. Changing the throttle value within this region directly affects bandwidth, and thus has a direct effect on both performance and power. The bandwidth-limited region may include only very-throttled operation, as in the case of the SPECPower_ssj2008 phase, or a wider range of throttle values, as in the case of the DAXPY-DIMM and FPMAC-DIMM benchmarks with large memory footprints.

3.2 Transition

The transition region is critical for power and performance management with memory throttling. In this region, changing the memory throttle setting does affect memory traffic, but in a more complex manner than in the bandwidth-limited region. The uncertainty in the relation between memory throttle and bandwidth within this region, and in the extent of the region itself, creates a challenge for managing power and performance.

Memory throttle values that bound transition regions vary by benchmark, with transitions at lower throttle values for less-demanding workloads and higher throttle values for memory-intensive workloads.

Each workload has a gap between the maximum available and consumed bandwidth in the transition region, and the extent of the gap varies. For example, at a 30% throttle, RandomMemory-DIMM has not reached its saturation level, yet it consumes less bandwidth than other workloads at the same throttle setting. The knee of the curve is sharper for some workloads than others: the FPMAC-DIMM micro-benchmark has a particularly sharp transition, while the SPECPower_ssj2008 phase has a much more gradual transition. Workloads with sharper transitions are able to use more of the available bandwidth at a given throttle setting, up to the point of saturation. Workloads with more gradual bandwidth
roll-off have other bottlenecks that also factor into limiting the rate of memory requests.

While the bandwidth regions are clearly visible in off-line analysis of the full throttle range, it is difficult at run time to discern whether the current workload(s) are in the transitional area. For example, at the 50% throttle level in Figure 2, the observed bandwidths for the DAXPY-DIMM and FPMAC-DIMM workloads are nearly identical, yet DAXPY-DIMM is in the linear bandwidth-limited region and FPMAC-DIMM is at a very sharp transition point. Without knowing the bandwidth trends from neighboring throttle points, a controller would not know whether to expect a linear, non-linear, or no change in bandwidth for an incremental change in throttle value.
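For illustration only, the three regions can be labeled off-line from a characterization curve by comparing the bandwidth gained per throttle step against the initial linear slope; the data and thresholds in the following C sketch are arbitrary and not drawn from our measurements:

```c
/* Label each throttle point of a characterization curve as
 * bandwidth-limited, transition, or saturated by comparing the
 * bandwidth gained per throttle step against the initial (linear)
 * slope. Rough heuristic for illustration; thresholds are arbitrary. */
#include <stdio.h>

enum region { LIMITED, TRANSITION, SATURATED };

static enum region classify(const double bw[], int i, int n)
{
    if (i + 1 >= n)
        return SATURATED;
    double slope0 = bw[1] - bw[0];            /* slope in the linear region */
    double slope  = bw[i + 1] - bw[i];        /* local slope at point i */
    if (slope >= 0.9 * slope0)
        return LIMITED;                        /* still tracking the linear trend */
    if (slope <= 0.1 * slope0)
        return SATURATED;                      /* essentially flat */
    return TRANSITION;
}

int main(void)
{
    /* Hypothetical normalized-bandwidth samples at 10%..100% throttle. */
    double bw[] = { 10, 20, 30, 40, 50, 58, 62, 63, 63, 63 };
    int n = sizeof bw / sizeof bw[0];
    const char *name[] = { "bandwidth-limited", "transition", "saturated" };
    for (int i = 0; i < n; i++)
        printf("throttle %3d%%: %s\n", (i + 1) * 10, name[classify(bw, i, n)]);
    return 0;
}
```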
3.3 Bandwidth-saturated

In the bandwidth-saturated region, the flat portion of each curve in the graph, memory traffic does not change with memory throttle settings. Each workload settles to a unique saturation level. Other bottlenecks and the workload's data footprint limit the memory request rate. On the blade system used to collect the data shown in Figure 2, the memory bus limits the available bandwidth for memory throttle settings of 75% and above, meaning that throttle settings between 75% and 100% provide essentially the same amount of available bandwidth.

The DAXPY benchmarks illustrate two ends of the saturation spectrum. Bandwidth consumed by DAXPY-DIMM approaches the architectural limit of the memory bus; the other benchmarks in this collection are limited by other factors. The cache-resident data set of DAXPY-L1 naturally limits its memory request rate, and the consumed bandwidth is so low that it is independent of the memory throttle setting, essentially in the bandwidth-saturated region throughout the entire range.

Increasing memory throttle settings beyond the saturation level has a negligible effect on bandwidth. It follows that increasing memory throttle settings beyond the saturation level will not improve performance or draw more power. For example, no memory throttle setting would have any bearing on DAXPY-L1, nor would modulating memory throttle settings between 40% and 100% for the SPECPower_ssj2008 calibration phase recorded in Figure 2.

4 Performance

Figure 3 plots performance (IPS) as a function of the memory throttle setting. Data are normalized to the peak throughput of each individual benchmark to factor out the effect of disparate throughput levels among benchmarks in the suite.

[Figure 3 plot: normalized throughput (%) versus Memory Throttle (%), one curve per workload: DAXPY-L1, SPECPower_ssj2008, RandomMemory-DIMM, FPMAC-DIMM, DAXPY-DIMM.]

Figure 3: Throughput (instructions per second), each benchmark normalized to its own peak.

One advantage of power-cap control rather than continuous control is that when memory traffic is less than the limit imposed by memory throttling, performance is unchanged. Cache-resident DAXPY-L1 throughput is unaffected by memory throttling.

At the opposite end of the spectrum, DAXPY-DIMM shows noticeable throughput loss for throttling up to 65%. Recall that at 75%, the memory bus becomes the dominant bandwidth-limiting factor. DAXPY-DIMM (and other workloads with similar
characteristics) are sensitive to the majority of memory throttling values in the useful range up to 75%, and almost any power control via memory throttling would directly degrade performance.

SPECPower_ssj2008 tolerates memory throttling without serious performance loss down to about 40% throttled, below which it has non-linear performance loss with memory throttling throughout its wide transition region. Kernels like FPMAC-DIMM with very short transition regions are dominated by linear performance loss in the bandwidth-limited region.

Workloads with a sharp-knee characteristic would show no response to changes in memory throttling in their bandwidth-saturation regions, until an incremental step down in memory throttling level tipped them into a bandwidth-limited region, at which point performance suddenly drops.

5 Power

Figure 4 confirms that memory power consumption is linearly proportional to memory bandwidth on our system. Data are normalized to the maximum observed memory power measurement, in DAXPY-DIMM. The near-zero bandwidth requirements of the cache-resident DAXPY-L1 show that about 40% of memory power is not under throttle control in this system.

[Figure 4 plot: Memory Power normalized to peak observed (%) versus Consumed Memory Bandwidth (%), one point set per workload: DAXPY-DIMM, FPMAC-DIMM, RandomMemory-DIMM, SPECPower_ssj2008, DAXPY-L1.]

Figure 4: Memory power is a linear function of bandwidth. In this system, about 60% of memory power is controlled by memory throttling.

Measured memory power data points normalized individually per benchmark, shown in Figure 5, demonstrate where the opportunity for power control lies. Memory throttling offers essentially no power control for core-bound workloads such as DAXPY-L1, a small range of control for moderate-intensity workloads such as SPECPower_ssj2008, and a large swing for memory-intensive workloads such as DAXPY-DIMM and FPMAC-DIMM that have larger unthrottled memory power consumption.

[Figure 5 plot: Memory Power normalized to each benchmark's unthrottled power (%) versus Memory Throttle (%), one curve per workload: DAXPY-L1, SPECPower_ssj2008, RandomMemory-DIMM, FPMAC-DIMM, DAXPY-DIMM.]

Since quota-style memory throttling enforces an upper bound on power consumption, actual memory power consumption will fall in the range between the static power level and the memory power cap, depending upon run-time bandwidth demands.
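Combining the first-order linear power model with the quota mechanism suggests a simple way to choose a throttle setting for a given memory power budget. The following C sketch is a back-of-the-envelope illustration with assumed power values, not the blade firmware's algorithm:

```c
/* Pick the largest quota (out of a 32-access window) whose worst-case
 * memory power stays within a budget, using the first-order model
 * P = P_static + (P_peak - P_static) * bandwidth_fraction, where the
 * quota caps bandwidth_fraction at quota/32. Back-of-the-envelope
 * sketch with assumed parameters; not the actual firmware policy. */
#include <stdio.h>

#define WINDOW 32

static double worst_case_power(int quota, double p_static, double p_peak)
{
    double bw_fraction = (double)quota / WINDOW;     /* throttle-enforced cap */
    return p_static + (p_peak - p_static) * bw_fraction;
}

static int quota_for_budget(double budget, double p_static, double p_peak)
{
    for (int quota = WINDOW; quota >= 1; quota--)
        if (worst_case_power(quota, p_static, p_peak) <= budget)
            return quota;
    return 0;                                        /* budget below static power */
}

int main(void)
{
    /* Placeholder numbers: roughly 40% of peak memory power was not under
     * throttle control on the measured system; the absolute watts are assumed. */
    double p_peak = 50.0, p_static = 0.4 * p_peak;
    double budget = 40.0;
    int quota = quota_for_budget(budget, p_static, p_peak);
    printf("budget %.1f W -> quota %d/32 (%.0f%% throttle), worst case %.1f W\n",
           budget, quota, 100.0 * quota / WINDOW,
           worst_case_power(quota, p_static, p_peak));
    return 0;
}
```

With these placeholder numbers, a 40 W budget maps to a 21/32 quota; a real controller would substitute the measured static and peak memory power for the platform.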
6 Conclusion

In this paper, we characterized memory throttling on a commercial blade server system. We demonstrated the three regions of bandwidth response: bandwidth-limited, transition, and bandwidth-saturated. Understanding these regions and the workload characteristics that determine the interaction between throttle settings and bandwidth restriction enables wise choices in memory power management design.

7 Acknowledgments

Thank you to our colleagues who contributed technical expertise and system administration support, especially Joab Henderson, Guillermo Silva, Kenneth Wright, and the power-aware systems department of the IBM Austin Research Laboratory.

References

[1] C.-H. R. Wu, "U.S. patent 7352641: Dynamic memory throttling for power and thermal limitations." Sun Microsystems, Inc., issued 2008.

[2] J. Iyer, C. L. Hall, J. Shi, and Y. Huang, "System memory power and thermal management in platforms built on Intel Centrino Duo Mobile Technology," Intel Technology Journal, vol. 10, May 2006.

[3] E. C. Sampson, A. Navale, and D. M. Puffer, "U.S. patent 6871119: Filter based throttling." Intel Corporation, issued 2005.

[4] I. Hur and C. Lin, "A comprehensive approach to DRAM power management," in Proceedings of the 14th International Symposium on High Performance Computer Architecture (HPCA '08), August 2008.

[5] W. Felter, K. Rajamani, C. Rusu, and T. Keller, "A Performance-Conserving Approach for Reducing Peak Power Consumption in Server Systems," in Proceedings of the 19th ACM International Conference on Supercomputing, June 2005.

[6] O. Kahn and E. Birenzwig, "U.S. patent 6662278: Adaptive throttling of memory accesses, such as throttling RDRAM accesses in a real-time system." Intel Corporation, issued 2003.

[7] International Business Machines, Inc., "IBM BladeCenter JS12 Express blade product description." Available at ftp://public.dhe.ibm.com/common/ssi/pm/sp/n/bld03013usen/BLD03013USEN.PDF, 2008.

[8] M. S. Floyd, S. Ghiasi, T. W. Keller, K. Rajamani, F. L. Rawson, J. C. Rubio, and M. S. Ware, "System power management support in the IBM POWER6 microprocessor," IBM Journal of Research and Development, vol. 51, pp. 733–746, November 2007.