1 Introduction
DRAM [23] is hierarchically organized to improve scaling in density and performance. At the highest level of the hierarchy, a DRAM chip is partitioned into banks that can be accessed simultaneously [58, 59, 60, 64, 87]. At the lowest level, a collection of DRAM rows (DRAM cells that are activated together) is typically divided into multiple DRAM mats that can operate individually [43, 53, 59, 125]. Even though DRAM chips are hierarchically organized, standard DRAM interfaces (e.g., DDRx [44, 45, 46]) do not expose DRAM mats to the memory controller. To access even a single DRAM cell, the memory controller needs to activate a large number of DRAM cells (e.g., 65,536 DRAM cells in a DRAM row in DDR4 [81]) and transfer many bits (e.g., a cache block, typically 512 bits [33]) over the memory channel. Thus, in current systems, both DRAM data transfer and activation are coarse-grained. Coarse-grained data transfer and activation cause significant energy inefficiency in systems that use DRAM as main memory, for two major reasons.
First, coarse-grained DRAM data transfer causes unnecessary data movement. Standard DRAM interfaces transfer data at cache block granularity over fixed-size data transfer bursts (e.g., eight-cycle bursts in DDR4 [45, 81]), but a large fraction of data (e.g., more than 75% [98]) in a cache block is not used (i.e., referenced by CPU load/store instructions) during the cache block's residency in the cache hierarchy (i.e., from the moment the cache block is brought to the on-chip caches until it gets evicted) [62, 63, 97, 98, 131, 132]. Thus, transferring unused words of a cache block over the power-hungry memory channel wastes energy [3, 16, 31, 69, 70, 92, 96, 116, 124, 127, 135].
Second, coarse-grained DRAM activation causes an unnecessarily large number of DRAM cells in a DRAM row to be activated. Subsequent DRAM accesses to the activated row can be served faster. However, many modern memory-intensive workloads with irregular access patterns cannot benefit from these faster row accesses, as the spatial locality in these workloads is lower than the DRAM row size [30, 69, 83, 85, 86, 88, 120, 121, 133]. Thus, the energy cost of activating all cells in a DRAM row is not amortized over many accesses to the same row, leading to energy waste from activating a disproportionately large number of cells.
Prior works [3, 16, 19, 31, 69, 92, 116, 124, 135, 136] develop DRAM substrates that enable fine-grained DRAM data transfer and activation, allowing words of a cache block to be individually retrieved from DRAM and a small number of DRAM cells to be activated with each DRAM access. However, these prior works (i) cannot provide high DRAM throughput [19, 124], (ii) incur high DRAM area overheads [3, 16, 31, 92, 116, 135, 136], and (iii) do not fully enable fine-grained DRAM [19, 31, 69, 124, 136] (Section 3.1).
Our goal is to develop a new, low-cost, and high-throughput DRAM substrate that can mitigate the excessive energy consumption from both (i) transmitting unused data on the memory channel and (ii) activating a disproportionately large number of DRAM cells. To this end, we develop Sectored DRAM. Sectored DRAM leverages two key ideas to enable fine-grained data transfer and row activation at low chip area cost. First, a cache block transfer between main memory and the memory controller happens in a fixed number of DRAM interface clock cycles, where only one word of the cache block is transferred in each cycle. Sectored DRAM augments the memory controller and the DRAM chip to perform cache block transfers in a variable number of clock cycles based on the workload access pattern. Second, a large DRAM row, by design, is already partitioned into smaller, independent, physically isolated regions. Sectored DRAM provides the memory controller with the ability to activate each such region based on the workload access pattern.
Sectored DRAM implements (i) Variable Burst Length (VBL) to enable fine-grained DRAM data transfer and (ii) Sectored Activation (SA) to enable fine-grained DRAM activation. \(VBL\) dynamically adjusts the number of cycles in a burst to transfer a different word of a cache block with each DRAM interface cycle, thus enabling fine-grained DRAM data transfer. To do so at low cost, \(VBL\) builds on existing DRAM I/O circuitry that already selects one word of a cache block to transfer in one cycle of a burst.
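To make \(VBL\)'s interface-level behavior concrete, the following minimal sketch (our own illustrative names and structure, not the paper's implementation) shows how a memory controller model could derive a read's burst length from the request's sector bits:

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical sketch: a VBL-style memory controller derives the burst
// length of a cache-block transfer from the request's 8 sector bits
// (one bit per 64-bit word of a 64-byte cache block).
struct MemRequest {
    uint64_t cache_block_addr;
    std::bitset<8> sector_bits;  // words demanded by the processor
};

// In baseline DDR4, a read always occupies a fixed 8-beat burst (BL8).
// With VBL, only the requested words occupy beats on the data bus.
int burst_length(const MemRequest& req) {
    int beats = static_cast<int>(req.sector_bits.count());
    return beats == 0 ? 8 : beats;  // fall back to a full burst
}
```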
To enable \(SA\) with low hardware cost, we leverage the fact that DRAM rows are already partitioned into independent, physically isolated regions (mats) that can be individually activated with small modifications to the DRAM chip. We refer to a mat that incorporates these modifications as a sector. Activating a sector consumes considerably less energy than activating a full DRAM row, as a sector typically contains almost an order of magnitude fewer cells (e.g., 1,024 in a DDR4 chip) than a DRAM row (e.g., typically 8,192 in a DDR4 chip). \(SA\) (i) implements sector transistors that are each turned on to activate one of the independent mats and (ii) sector latches that control the sector transistors. \(SA\) exposes the sector latches to the memory controller by using an existing DRAM command (Section 4.1); therefore, \(SA\) can be implemented without any changes to the physical DRAM interface. As the power required to activate a mat in a DRAM row is only a fraction of the power required to activate the whole row, Sectored DRAM also relaxes the power delivery constraints in DRAM chips [69, 92, 136]. Doing so allows for the activation of DRAM rows at a higher rate, increasing memory-level parallelism for memory-intensive workloads.
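As a first-order illustration of why relaxed power delivery permits a higher activation rate (our simplification, not the paper's exact derivation in Section 4.1): if each partial activation draws power roughly proportional to the fraction of sectors it activates, the four-activate window can shrink proportionally, down to the floor imposed by other timing constraints (e.g., \(t_{RRD}\)):

\[
t_{FAW}^{\,sectored} \approx \max\!\left(t_{FAW}^{\,full} \cdot \frac{s}{S},\; 4 \cdot t_{RRD}\right),
\]

where \(s\) is the number of sectors activated per \(ACT\) and \(S = 8\) is the total number of sectors in a row.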
\(VBL\) and \(SA\) provide two key primitives for power-efficient, sub-cache-block-sized (e.g., 8-byte or one-word) data transfers between main memory and the rest of the system. However, because modern systems are typically designed to have cache-block-sized (e.g., 64-byte) data transfers between system components, making performance- and energy-efficient use of the two Sectored DRAM primitives (\(VBL\) and \(SA\)) requires system-wide modifications in hardware. We develop two hardware techniques (Section 5.2), (i) Load/Store Queue (LSQ) Lookahead and (ii) Sector Predictor (SP), to effectively integrate Sectored DRAM into a system. At a high level, LSQ Lookahead and SP determine and predict, respectively, which words of a cache block should be retrieved from a lower-level component of the memory hierarchy. Accurately determining the words of a cache block that are used during the cache block's residency in system caches enables high system performance and low system energy consumption by improving data reuse in system caches, as opposed to repeating a high-latency main memory access for each used word of a cache block.
LSQ Lookahead accumulates, in an older load/store instruction's memory request, the individual words of a cache block that are accessed by younger load/store instructions. Thus, the execution of a load/store instruction prefetches the portions of cache blocks that will be accessed by in-flight (i.e., not yet executed) load/store instructions. SP predicts which portions of a cache block will be accessed by a load/store instruction based on that instruction's past cache block usage patterns. This allows SP to accurately predict the portions of a cache block that will be used by the processor during the cache block's residency in the cache hierarchy.
We evaluate the performance and energy of Sectored DRAM using 41 workloads from the SPEC2006 [118], SPEC2017 [119], and DAMOV [91, 106] benchmark suites, using Ramulator [60, 75, 104, 105], DRAMPower [13], and the Rambus Power Model [99]. Sectored DRAM significantly reduces system energy consumption and improves system performance for memory-intensive workloads with irregular access patterns (10 of our workloads). For such workloads, compared to a system with conventional coarse-grained DRAM, Sectored DRAM reduces DRAM energy consumption by 20%, improves system performance by 17%, and reduces system energy consumption by 14%, on average. Sectored DRAM does so as it (i) improves workload execution time by issuing ACTIVATE (ACT) commands at a higher rate, thereby reducing average memory latency, and (ii) activates fewer DRAM cells and retrieves fewer sectors from DRAM at lower power. We estimate the DRAM area overheads of Sectored DRAM using CACTI [8] and find that it can be implemented with low hardware cost. Sectored DRAM incurs 0.39 mm² area overhead (1.7% of a DRAM chip) and does not require modifications to the physical DRAM interface. Compared to the evaluated state-of-the-art fine-grained DRAM architectures [19, 69, 132, 136], Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture [136]. Sectored DRAM provides 10% higher performance and 13% larger DRAM energy benefits compared to a low-cost state-of-the-art fine-grained DRAM architecture [69]. We open source our simulation infrastructure and all datasets to enable reproducibility and help future research [107].
We make the following contributions:
— We introduce Sectored DRAM and its two key mechanisms: VBL and SA. Sectored DRAM improves system performance and reduces system energy consumption by enabling fine-grained DRAM data transfer and activation.
— We develop two techniques (LSQ Lookahead and SP) to effectively integrate Sectored DRAM into a system. Our techniques reduce the number of high-latency memory accesses by accurately identifying the words of a cache block that will be used by the processor.
— We evaluate Sectored DRAM with a wide range of workloads and observe that it provides higher system performance and energy efficiency than coarse-grained DRAM as well as multiple prior fine-grained DRAM proposals.
3 Motivation
We study the impact of coarse-grained DRAM data transfer (Coarse-DRAM-Transfer) and activation (Coarse-DRAM-Act) in 41 single-core workloads from a variety of domains (see Section 6 for our methodology). We compare their energy consumption to that of a system that performs (i) fine-grained DRAM data transfer (Fine-DRAM-Transfer) at word granularity and (ii) fine-grained DRAM activation (Fine-DRAM-Act) at mat granularity.
We make two observations from our study. First, the DRAM data transfer energy of the Coarse-DRAM-Transfer system is 1.27\(\times\) that of the Fine-DRAM-Transfer system. The large increase in energy consumption in the Coarse-DRAM-Transfer system is caused by retrieving words in a cache block that the processor does not entirely use. This leads to a 45% increase in data movement between DRAM and the CPU in the Coarse-DRAM-Transfer system, on average. Second, the DRAM activation energy of the Coarse-DRAM-Act system is 1.04\(\times\) that of the Fine-DRAM-Act system. Like the system that performs coarse-grained DRAM data transfers, the increase in energy consumption in the coarse-grained DRAM activation system is caused by activating a large, fixed-size DRAM row that the processor does not entirely use. As prior works [30, 69, 83, 85, 86, 88, 120, 121, 133] show, such an increase in energy consumption with coarse-grained DRAM activation occurs because modern memory-intensive workloads with irregular access patterns suffer from low spatial locality, which reduces the benefit of a large DRAM row buffer.
3.1 Enabling Fine-Grained DRAM: Challenges and Limitations
Efficiently enabling fine-grained DRAM data transfer and activation can significantly reduce system energy consumption. However, to do so, we must overcome three main challenges:
(1) Maintaining high DRAM throughput: Current DRAM systems leverage coarse-grained data transfers to maximize DRAM throughput. Enabling fine-grained DRAM in a straightforward way, such as by placing the piece of a cache block stored by a DRAM chip in a single mat instead of distributing the piece across multiple mats, reduces DRAM throughput, as one mat contributes only a fraction of the total DRAM internal throughput (Section 2.3). This issue can be alleviated by increasing the number of helper flip-flops (HFFs). However, this approach is costly, since it severely complicates DRAM array routing [69, 125, 136].
(2) Incurring low DRAM area overhead: DRAM manufacturing is highly optimized for density and cost [72, 77, 85]. While enabling fine-grained DRAM, one must avoid applying intrusive modifications to the DRAM array, since such modifications are difficult to integrate into real designs.
(3) Fully exploiting fine-grained DRAM: The energy waste of coarse-grained DRAM systems stems from rigid DRAM data transfer and activation granularities. Thus, a fine-grained DRAM system must enable flexible DRAM data transfer and activation granularities for both read and write operations to eliminate such energy waste. However, integrating fine-grained DRAM into current systems is challenging, as systems are typically designed to access DRAM at cache block granularity.
Prior works [3, 16, 19, 31, 69, 92, 116, 124, 135, 136] propose different mechanisms to enable fine-grained DRAM substrates, aiming to alleviate the energy waste caused by coarse-grained DRAM. Such works can be divided into two broader groups: (1) works that propose intrusive modifications to the DRAM array circuitry and organization (e.g., new DRAM interconnects, considerably more HFFs) [3, 16, 92, 116, 135] and (2) works that aim to enable fine-grained DRAM without intrusive modifications to DRAM [19, 31, 69, 124, 136]. The intrusive DRAM modifications proposed by the first group lead to significant DRAM area overheads, which makes it difficult to integrate the first group of works into real DRAM designs.
Table 1 qualitatively compares how prior works from the second group address the three challenges of enabling fine-grained DRAM. We observe that no prior work can simultaneously provide (i) high DRAM throughput (FGA [19] and SBA [124] change the cache block mapping such that DRAM transfers can be served from only one mat, but reduce the throughput of data transfers by doing so), (ii) low area overhead (HalfDRAM [136] and HalfPage [31] require changes to the number and organization of DRAM's HFFs, leading to non-negligible area overheads), and (iii) mechanisms that fully exploit fine-grained DRAM (PRA [69] enables fine-grained DRAM data transfer and activation only for write operations; HalfDRAM, HalfPage, FGA, and SBA still impose a rigid DRAM data transfer granularity). We conclude that no prior work efficiently enables fine-grained DRAM access (i.e., both data transfer and activation).
Our goal is to address prior works’ limitations while efficiently mitigating the energy consumed by transferring unused data on the memory channel and activating unused DRAM cells. To this end, we develop Sectored DRAM, a new, practical, and high-performance fine-grained DRAM substrate.
5 System Integration
We describe the challenges in integrating Sectored DRAM into a typical system and propose solutions. To explain the challenges and our solutions clearly, we assume that the system uses a DDR4 module with eight chips as main memory and that each chip has eight sectors. Since there are eight sectors in every chip, one sector from each DRAM chip collectively stores one word (64 bits) of the cache block.
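The sketch below illustrates this layout, assuming the natural alignment in which word i of the cache block maps to sector i in every chip (the exact mapping is a design detail of the paper; names are ours):

```cpp
#include <bitset>

// Illustrative sketch of the mapping described above, for a DDR4 rank
// of eight x8 chips, each with eight sectors per row.
constexpr int kChips = 8;
constexpr int kSectorsPerChip = 8;

// Each 64-bit word is striped across all eight chips (8 bits per chip),
// so fetching word i requires enabling sector i in *every* chip. The
// sector-select vector is therefore identical for all chips and equals
// the set of requested word offsets.
std::bitset<kSectorsPerChip> sector_select(std::bitset<8> requested_words) {
    return requested_words;  // same 8-bit vector broadcast to all chips
}
```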
Integration Challenges. We identify two challenges in integrating Sectored DRAM into a system. First (Section 5.1), to benefit from Sectored DRAM's potential energy savings, the system and main memory (DRAM) must conduct data transfers at sub-cache-block granularity (e.g., transfer one or multiple words). Therefore, a cache block may have both valid (up-to-date) and invalid (stale or evicted) words present in system caches. However, caches keep track of valid on-chip data at cache block granularity. This granularity is too coarse to keep track of valid words in a cache block. Second (Section 5.2), because some words in a cache block can be invalid, references to these words (e.g., made by load/store instructions) result in cache misses, which can induce performance overheads.
We propose the following minor system modifications to overcome Sectored DRAM's integration challenges. First, to track which words in a cache block are valid, we extend cache blocks with additional bits, each of which indicates whether a 64-bit word in the cache block is valid. Second, to accurately retrieve all useful words in a cache block (i.e., words that will be used until the cache block is evicted), we develop two techniques: (i) LSQ Lookahead and (ii) SP.
5.1 Tracking Valid Words in the Processor
Since a Sectored DRAM-based system can retrieve individual words of a cache block from DRAM, system caches must store data at a granularity finer than the typical 512-bit granularity. One straightforward approach to allow finer-granularity storage in caches is to reduce the cache block size from 512 bits to the size of a word (e.g., 64 bits). However, for the same cache size, doing so requires implementing \(8\times\) as much storage for cache block tags, which introduces significant area overhead. Instead, we extend cache blocks with just 8 additional bits, each of which indicates whether a word in the cache block is valid or invalid, as in sectored caches [4, 7, 37, 51, 73, 74, 101, 103, 112].
Sector Cache Operation. We describe the three-step process performed by a memory request to access a word in the highest-level sector cache (i.e., the L1 cache). First, the processor sends a memory request with a memory address and a vector of eight sector bits to the highest-level cache. The sector bits identify the words in the cache block that the processor core demands. Second, the L1 cache uses the memory address to identify the addressed cache set. Third, the L1 cache uses the cache block tag component of the memory address and the sector bits to access the words requested by the processor. The third step can result in three different scenarios: (i) if both the tag and the sector bits match one of the cache blocks in the cache set (i.e., there is both a tag and a sector bit match), the cache has the word that the processor core demands and this is a sector hit; (ii) if there is a tag match but no sector bit match, the cache has to request the missing sectors from a lower-level cache or main memory and this is a sector miss; and (iii) if there is no tag match, this is a cache miss.
Sector Misses. On a sector miss, the cache controller creates a memory request to retrieve the missing sector(s) from a lower-level cache or main memory. The cache controller determines the missing sector(s) by bitwise AND’ing the memory request’s (the request that triggers the sector miss) sector bits and the sector bits that are not set in the cache block. When the created memory request returns from a lower-level cache, the cache controller sets the cache block’s missing sector bits.
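A minimal sketch of this lookup and miss path is given below (illustrative structures and names, not the paper's RTL); it follows the three scenarios and the bitwise-AND computation of missing sectors described above:

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Sketch of a sector-cache lookup in one cache set.
struct CacheBlock {
    bool     valid = false;
    uint64_t tag = 0;
    std::bitset<8> sector_bits;  // which 64-bit words are present/valid
};

enum class LookupResult { SectorHit, SectorMiss, CacheMiss };

// 'req_sectors' are the words the memory request demands. On a sector
// miss, 'missing' receives the sectors that must be fetched from a
// lower-level cache or main memory.
LookupResult lookup(std::vector<CacheBlock>& set, uint64_t tag,
                    std::bitset<8> req_sectors, std::bitset<8>& missing) {
    for (auto& blk : set) {
        if (!blk.valid || blk.tag != tag)
            continue;                          // keep searching the set
        if ((req_sectors & ~blk.sector_bits).none())
            return LookupResult::SectorHit;    // tag and sector bits match
        // Tag matches but some requested words are absent: request only
        // the missing sectors (request's sector bits AND'ed with the
        // complement of the block's sector bits); the corresponding
        // sector bits are set when the fill returns.
        missing = req_sectors & ~blk.sector_bits;
        return LookupResult::SectorMiss;
    }
    return LookupResult::CacheMiss;            // no tag match in the set
}
```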
Sector Cache Compatibility. Sector caches do not require any modifications to existing cache coherence protocols (we explain how in the next paragraph). Sector caches are also compatible with existing SRAM error correcting code (ECC) schemes [78, 82, 122], as the invalid words (i.e., missing sectors) in a cache block can still be used to correctly produce a codeword.
Cache Coherence. Sectored DRAM requires no modifications to existing cache coherence protocols that operate at the granularity of a cache block since cache coherence in Sectored DRAM is still maintained at cache block granularity. A processor core can only modify a sector in a cache block if the core owns the entire cache block (e.g., the cache block is in the M state in a MESI protocol). A cache block shared across multiple cores may have different valid sectors among its copies in different private caches. However, this does not violate cache coherence protocols.
Other Cache Architectures. There are numerous other multi-granularity cache architectures [39, 63, 97, 98, 102, 111] that could be used instead of sectored caches in Sectored DRAM to improve cache utilization (e.g., by reducing the number of invalid words stored in a cache block) at the cost of increased tag storage and hardware complexity [63]. We use sectored caches to minimize the storage and hardware complexity overheads of Sectored DRAM and leave the exploration of other cache architectures in Sectored DRAM to future work.
5.2 Accurate Word Retrieval from Main Memory
With sector caches, a Sectored DRAM-based system can transfer data at word granularity between components in the memory hierarchy (e.g., between the L1 and the L2 cache) instead of at cache block granularity. However, retrieving cache blocks word by word from DRAM can reduce system performance compared to fetching whole cache blocks, because the processor needs to complete multiple high-latency DRAM accesses (one per sector miss) as opposed to a single memory access that retrieves the whole cache block. To minimize the performance overheads induced by the additional DRAM accesses and to better benefit from the energy savings provided by Sectored DRAM, we propose two mechanisms that greatly reduce the number of sector misses.
LSQ Lookahead. The key idea behind LSQ Lookahead is to exploit the spatial locality among subsequent load/store instructions that target the same cache block. A load or store instruction typically references one word in main memory. LSQ Lookahead, at a high level, looks ahead in the processor's LSQs and finds load and store instructions that reference different words/sectors in the same cache block. LSQ Lookahead then collects the word/sector references made by younger load/store instructions to the same cache block as the oldest load/store instruction, and stores the collected sector references in the oldest load/store instruction's sector bits. This way, a load/store instruction, when executed, retrieves into the L1 cache, with only one cache access, all words in a cache block that will be referenced in the near future (by younger load/store instructions).
Figure 5(a) depicts how LSQ Lookahead is implemented, using an example with load instructions. We extend each load address queue (LAQ) entry (which stores metadata for load instructions) with sector bits (SB). LSQ Lookahead works in two steps. First, when a new entry is allocated at the LAQ's tail (❶), the LSU compares the new entry's cache block address (CB address) with each of the existing entries' cache block addresses (❷). Second, when it finds a matching cache block address, it updates the existing entry's sector bits by setting the bit that corresponds to the word referenced by the new entry (❸).
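The two-step update can be modeled compactly as follows (a sketch with our own names; the real design operates on LSQ hardware, not a software queue):

```cpp
#include <bitset>
#include <cstdint>
#include <deque>

// Illustrative model of the LSQ Lookahead update described above.
struct LAQEntry {
    uint64_t cb_address;          // cache block address of the load
    std::bitset<8> sector_bits;   // words this load will fetch
};

// Step 1: a new entry is allocated at the LAQ's tail.
// Step 2: its cache block address is compared against every existing
// (older) entry; on a match, the older, not-yet-executed entry
// accumulates the word referenced by the new entry, so that when the
// older load executes it also prefetches the younger load's word.
void laq_allocate(std::deque<LAQEntry>& laq, uint64_t cb_address,
                  unsigned word_offset /* 0..7 within the block */) {
    LAQEntry entry{cb_address, {}};
    entry.sector_bits.set(word_offset);
    for (auto& older : laq)
        if (older.cb_address == cb_address)
            older.sector_bits.set(word_offset);
    laq.push_back(entry);
}
```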
Sector Predictor. Although LSQ Lookahead prevents some of the sector misses, it alone cannot significantly reduce the number of sector misses. This is because LSQs are typically not large enough to store many load/store instructions, and dependencies (e.g., data dependencies) prevent the processor core from computing the memory addresses of future load/store instructions. Thus, we require a more powerful mechanism to complement LSQ Lookahead and minimize sector misses. To this end, we develop SP.
SP, at a high level, records which words are used while a cache block is resident in the cache. The next time the same cache block misses, SP uses the recorded signature to predict that the load will need the same words. SP leverages two key observations to accurately predict which words a load needs to access. First, the processor will "touch" (access or update) one or multiple words in a cache block from the moment the cache block is fetched into system caches (from main memory) until it is evicted to main memory. The touched words in a cache block will likely be touched again when the cache block is next fetched from main memory. Second, dynamic instances of the same static load/store instruction likely touch the same words in different cache blocks. For example, a static load/store instruction in a loop may perform strided accesses to the same word offset in different cache blocks. SP builds on a class of predictors referred to as spatial pattern predictors (e.g., [17, 62]). We tailor SP to predict a cache block's useful words (those that are referenced by the processor during the cache block's residency in system caches), similar to what is done by Yoon et al. [132].
Figure 5(b) depicts the organization of the SP. The Sector History Table (SHT) stores the previously used sectors, which identify the sectors (words) that were touched by the processor in a now-evicted cache block in the L1 cache (❶). The SHT is accessed with a table index that is computed, upon an L1 cache miss, by XOR-ing parts of the load/store instruction's address with the word offset of the load/store instruction's memory address (❷). We extend the L1 cache to store the table index and the currently used sectors (❸). The currently used sectors track which sectors are used during a cache block's residency in the cache. Upon the cache block's eviction, the table index is used to update the previously used sectors in the corresponding SHT entry with the currently used sectors stored in the cache block (❹).
We describe how SP operates (not shown in the figure) in five steps based on an example where a memory request accesses the L1 cache. First, when the memory request causes a cache miss or a sector miss, the SHT is queried with the table index to retrieve the previously used sectors. Second, the previously used sectors are added to the sector bits of the memory request and forwarded to the next level in the memory hierarchy. Third, the L1 miss allocates a new cache block in the L1 cache. Fourth, the table index of the newly allocated cache block is updated with the table index used to access the SHT, and the cache block’s currently used sectors are set to logic-0. Fifth, once the missing cache block is placed in the L1 cache, the cache block’s currently used sectors start tracking the words that are touched by future load/store instructions. When the same cache block is evicted from the L1 cache, the SHT entry corresponding to the cache block’s table index is updated with the currently used sectors.
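The prediction and training paths described above can be sketched as follows, under assumed sizes (a 512-entry SHT, eight sectors per block) and an assumed hash (the paper only states that parts of the instruction address are XOR-ed with the word offset):

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// Sketch of the SP structures and the miss/eviction paths.
constexpr size_t kSHTEntries = 512;

struct SectorPredictor {
    // Sector History Table: previously used sectors per index.
    std::array<std::bitset<8>, kSHTEntries> sht{};

    // Table index: XOR part of the load/store PC with the word offset
    // of its memory address (the exact hash here is an assumption).
    static size_t index(uint64_t pc, uint64_t addr) {
        uint64_t word_offset = (addr >> 3) & 0x7;   // which 64-bit word
        return (pc ^ word_offset) % kSHTEntries;
    }

    // On an L1 cache/sector miss: predict the words the block will need
    // and OR them into the outgoing memory request's sector bits.
    std::bitset<8> predict(uint64_t pc, uint64_t addr) const {
        return sht[index(pc, addr)];
    }

    // On eviction: fold the block's currently used sectors (tracked in
    // the cache during the block's residency) back into the SHT entry
    // recorded via the table index stored at allocation time.
    void train(size_t stored_index, std::bitset<8> currently_used) {
        sht[stored_index] = currently_used;
    }
};
```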
7 Evaluation Results
We evaluate Sectored DRAM’s impact on DRAM power, LLC MPKI, performance, energy, and DRAM area.
7.1 Impact on DRAM Power
Figure 6 shows Sectored DRAM's impact on DRAM power consumption. We analyze the DRAM array power, DRAM peripheral circuitry power, and DRAM energy consumed by Sectored DRAM to perform \(ACT\), \(READ\), and \(WRITE\) DRAM operations for 8, 4, 2, and 1 sectors. Our results show that \(READ\) and \(WRITE\) power and energy greatly decrease as fewer sectors are read or written.
We make three observations from Figure 6. First, \(SA\) and \(VBL\) significantly reduce \(READ\) and \(WRITE\) power consumption. We find that the power consumed by DRAM while reading from and writing to a single sector is 70.0% and 70.6% smaller, respectively, than reading from and writing to all sectors. This improvement is due to (i) reduced sense amplifier activity in the DRAM array, (ii) reduced switching in the DRAM peripheral circuitry that transfers data between the DRAM array and the DRAM I/O, and (iii) the smaller number of beats in a burst to transfer data between the DRAM module and the memory controller.
Second, activating only one sector greatly reduces the power consumed by the DRAM array compared to activating all eight sectors. Because \(SA\) enables activating a small set of DRAM sense amplifiers in a DRAM row, activating a single sector consumes 66.5% less DRAM array power than activating eight sectors. However, we find that activating one sector reduces the overall power consumption of an \(ACT\) operation by only 12.7% compared to the baseline DDR4 module. This effect is small because the power consumed by the peripheral circuitry makes up a large proportion of the activation power and is not affected by the number of sectors activated. Third, the circuitry required to implement \(SA\) incurs little activation power overhead. Compared to the baseline DDR4 module, \(SA\) increases activation power by only 0.26% due to additional switching activity in the MWL (main wordline) drivers (Section 4.1).
Effects of DRAM Bus Frequency. We investigate how DRAM bus frequency affects the read power (IDD4R) relative to the activate power (IDD0). We repeat our experiments using 2\(\times\) the baseline bus frequency (i.e., 3,200 MHz, or 6,400 MT/s). The read power (IDD4R) is \(12.39\times\) and \(12.42\times\) higher than the activate power (IDD0) at the baseline bus frequency (3,200 MT/s) and at twice the baseline bus frequency (6,400 MT/s), respectively (this observation is in line with that of a prior work [22]).
We conclude that (i) Sectored DRAM’s fine-grained DRAM data transfer and activation provide a significant reduction in DRAM read and write power and energy and (ii) the bus frequency does not significantly affect DRAM read power relative to DRAM activate power.
7.2 Number of Sector Misses
To quantify the number of sector misses (see Section 5.1), we examine the LLC MPKI of workloads run with different Sectored DRAM configurations. Figure 7 plots the LLC MPKI for different LSQ Lookahead (LA<number>, where number is the number of entries looked ahead in the LSQ) and SP (SP<number>, where number is the number of entries in the SHT) configurations, along with the Basic Sectored DRAM configuration, which uses neither LSQ Lookahead nor SP. Each bar shows the average LLC MPKI across all evaluated workloads in each benchmark suite (x-axis; see Table 3 for a list of all workloads classified according to their LLC MPKI) for a given Sectored DRAM configuration.
We make three major observations. First, Sectored DRAM without LSQ Lookahead and SP (Basic) greatly increases the LLC MPKI of a workload, by 3.1\(\times\) on average compared to the baseline, due to sector misses. Second, LSQ Lookahead reduces the number of LLC misses of Basic by 25%, 41%, and 51% by looking ahead 16, 128, and a currently very costly 2,048 younger entries in the LSQ, respectively. This is because LSQ Lookahead can identify the words that will be used in a cache block and retrieve them from DRAM with a single memory request. Third, LSQ Lookahead together with SP (LA128-SP512) reduces the number of LLC misses of Basic by 52% and of LA128 by 18%. LA128-SP512 performs as well as the currently very costly implementation of LSQ Lookahead that looks ahead 2,048 entries in the LSQ. LA128-SP512 does so because SP greatly reduces the number of additional LLC misses by recognizing intra-cache-block access patterns from previously performed memory requests and correctly predicting the words that will be used in a cache block.
We conclude that LSQ Lookahead with a lookahead depth of 128, together with SP, minimizes the LLC misses caused by sector misses. We use the LA128-SP512 configuration in the remainder of our evaluation.
7.3 Single-Workload Performance and Energy
We evaluate Sectored DRAM’s performance and energy using (i) single-core workloads and (ii) 2-, 4-, 8-, and 16-core multi-programmed workloads made up of identical single-core workloads. We compare Sectored DRAM’s performance and system energy to a baseline coarse-grained DRAM system.
Microbenchmark Performance. Figure 8 shows the normalized parallel speedup of a random access (Random, left) and a strided streaming access (Stride, right) workload for the baseline system and Sectored DRAM (LA128-SP512). The Random workload (i) accesses one randomly determined word (8 bytes) in main memory by executing a load instruction every five instructions and (ii) has a very high LLC MPKI of 178.29. These two properties make Random a good fit for Sectored DRAM, as Random accesses only one sector in every cache line. The Stride workload (i) accesses every word address in a contiguous, 16-MiB memory address range with a stride of 64 bytes (i.e., Stride accesses the addresses [0, 64, 128, ..., 8, 72, 136, ..., 16, ...]) and (ii) has a very high LLC MPKI of 78.57. Stride is a poor fit for Sectored DRAM because every access to a word in a cache line results in a sector miss (none of the accessed 8-byte words are cached, and these words have to be fetched from main memory), and the large cache block reuse distance prevents LSQ Lookahead from prefetching all useful words in a cache block.
We make two major observations from Figure 8. First, Sectored DRAM provides significant performance benefits for workloads that randomly access words (e.g., Random). Sectored DRAM's performance benefits increase with the number of cores (i.e., with increasing LLC MPKI) for Random because a larger fraction of all memory requests (random word accesses) benefit from Sectored DRAM's reduction in \(t_{FAW}\) (Section 4.1). Sectored DRAM provides 1.11\(\times\), 1.69\(\times\), 1.87\(\times\), 1.87\(\times\), and 1.87\(\times\) normalized parallel speedup for 1, 2, 4, 8, and 16 cores, respectively, for Random. Second, Sectored DRAM reduces system performance for workloads that frequently cause sector misses (e.g., Stride). Sectored DRAM provides 0.67\(\times\), 0.95\(\times\), 1.00\(\times\), 1.00\(\times\), and 1.00\(\times\) the normalized parallel speedup of Baseline for 1, 2, 4, 8, and 16 cores, respectively, for Stride. Sectored DRAM's performance approaches Baseline's as the number of cores increases. This is because the LLC is not large enough to store all cache lines accessed by 4 or more cores for Stride (i.e., Baseline accesses main memory to retrieve each word, similarly to Sectored DRAM).
Performance. The two lines in Figure 9(a) show the normalized parallel speedup (on the primary/left y-axis) of three representative high MPKI (top row), medium MPKI (middle row), and low MPKI (bottom row) workloads for the baseline system (solid lines) and Sectored DRAM (dashed lines). Figure 9(b) (top row) shows the distribution of normalized parallel speedups of all high, medium, and low MPKI workloads. We make three observations from the two figures. First, Sectored DRAM provides higher parallel speedup than the baseline for high MPKI workloads when the number of cores is larger than 2. For example, Sectored DRAM provides 26% higher parallel speedup than the baseline for all 16-core high MPKI workloads, on average. As the average row buffer hit rate for 16-core high MPKI workloads is only 18%, the memory controller needs to issue many \(ACT\) commands to serve the memory requests. Sectored DRAM's \(t_{FAW}\) reduction (Section 4.1) allows the memory controller to issue the large number of \(ACT\) commands required by these workloads (i.e., 82% of all main memory requests) at a higher rate, reducing the average memory access latency for these workloads (by 25%, on average, for 16-core high MPKI workloads). Second, Sectored DRAM, on average, provides a smaller parallel speedup than the baseline for low and medium MPKI workloads. Although Sectored DRAM's \(t_{FAW}\) reduction decreases the proportion of processor cycles during which the memory controller has to stall to satisfy the \(t_{FAW}\) timing parameter from 14.4% in the baseline to 6.5% in Sectored DRAM for 16-core low and medium MPKI workloads, the average memory latency for these workloads increases by 0.5% in Sectored DRAM compared to the baseline. Moreover, for these workloads, sector misses increase the number of memory requests by 69%, on average. Because a larger number of memory requests experience higher latencies in Sectored DRAM compared to the baseline, Sectored DRAM provides a smaller parallel speedup for these workloads. Third, Sectored DRAM incurs a 5.41% performance overhead, on average, across all single-core workloads. We attribute this to sector misses that increase the number of memory requests and the average memory latency.
System Energy Consumption. The bars in Figure 9(b) show the system energy consumption (on the secondary/right y-axis) of Sectored DRAM normalized to that of the baseline. Figure 9(b) (bottom) shows the distribution of normalized system energy consumption for workloads from the three categories (low, medium, and high MPKI). We make two observations. First, Sectored DRAM reduces system energy consumption for high MPKI workloads when the number of cores is larger than 2. On average, at 16 cores, high MPKI workloads' system energy consumption decreases by 20%. Sectored DRAM achieves this through a combination of (i) reduced DRAM energy consumption due to \(SA\) and \(VBL\) (we present a detailed breakdown of \(SA\) and \(VBL\)'s effects on DRAM energy consumption in Section 7.4) and (ii) reduced background power consumption of the computing system as workloads execute faster. Second, for medium and low MPKI workloads, Sectored DRAM increases system energy consumption. We observe that Sectored DRAM increases the average DRAM energy consumption by 12% for 16-core medium/low MPKI workloads. The increase in DRAM energy consumption, together with the increase in background power consumption of the computing system (as workloads execute slower; see Figure 9(a)), increases the system energy consumption for these workloads.
We conclude that Sectored DRAM improves system performance and reduces system energy consumption in high MPKI workloads where (i) a high number of ACTs targets different DRAM banks (i.e., the workload is bound by \(t_{FAW}\) ) and (ii) the SP can accurately predict the used words.
7.4 Multi-Programmed Workload Performance and Energy
We evaluate Sectored DRAM’s performance and energy using multi-programmed workload mixes. To stress DRAM and cache hierarchy, we use high MPKI workload mixes. We compare Sectored DRAM’s performance and main memory access energy to a baseline coarse-grained DRAM system and three state-of-the-art fine-grained DRAM mechanisms: (i)
Fine-Grained Activation (FGA) [
19,
124], (ii)
Partial Row Activation (PRA) [
69], and (iii) HalfDRAM [
136] and HalfPageDRAM [
31].
13 Unless otherwise stated, HalfDRAM depicts the evaluated performance and energy of both HalfDRAM and HalfPageDRAM in the rest of the article.
Performance. Figure 10 (top) shows the weighted speedup [21, 27, 57] of 16 workload mixes for Sectored DRAM and the three state-of-the-art fine-grained DRAM mechanisms, normalized to the baseline system. We make four observations. First, Sectored DRAM's weighted speedup is 1.17\(\times\) (1.36\(\times\)) that of the baseline, on average (at maximum), across all workload mixes. This is due to Sectored DRAM's \(t_{FAW}\) reduction: Sectored DRAM serves \(READ\) requests faster (its average DRAM read latency is approximately 25% smaller than the baseline's) and thus improves the performance of the memory-intensive workloads. Second, Sectored DRAM greatly outperforms naive fine-grained DRAM mechanisms (i.e., FGA [19, 124]). We observe that Sectored DRAM's weighted speedup is 2.05\(\times\) that of FGA, on average, across all workloads. FGA mechanisms greatly reduce the throughput of DRAM data transfers, as they are limited to fetching a cache block from a single mat (Section 3.1). Third, Sectored DRAM's weighted speedup is 1.10\(\times\) that of PRA, on average, across all workloads. Sectored DRAM outperforms PRA by enabling fine-grained DRAM access and activation for both \(READ\) and \(WRITE\) operations, whereas PRA is limited to \(WRITE\) operations.
Fourth, Sectored DRAM’s weighted speedup is 0.89
\(\times\) that of HalfDRAM, on average, across all workloads. Sectored DRAM cannot improve performance as much as HalfDRAM, because in Sectored DRAM, the memory controller needs to service additional memory requests caused by sector misses. However, as we show next, HalfDRAM’s higher performance benefits come at the cost of
higher area overheads (Section
7.5) and
lower energy savings than Sectored DRAM.
DRAM Energy Consumption. Figure 10 (bottom) shows the DRAM energy consumption of each workload mix for Sectored DRAM and the state-of-the-art mechanisms. Values are normalized to the DRAM energy consumption of the baseline system. We observe that (i) Sectored DRAM significantly reduces DRAM energy consumption compared to the baseline, leading to up to 33% (on average, 20%) lower DRAM energy consumption, and (ii) Sectored DRAM enables larger DRAM energy savings than prior works. On average, across all workload mixes, Sectored DRAM reduces DRAM energy consumption by 84%, 13%, and 12% compared to FGA, PRA, and HalfDRAM, respectively.
We analyze the impact of Sectored DRAM on the energy consumed by DRAM operations. Figure 11 (left) shows the DRAM energy broken down into \(ACT\), background, and \(RD/WR\) energy consumption, normalized to the baseline system's DRAM energy consumption and averaged across all workload mixes. We make two observations.
First, \(VBL\) greatly reduces the \(RD/WR\) energy by 51%, on average. Using \(VBL\) , the system retrieves only the required (and predicted to be required) words of a cache block from the DRAM module. On average, the number of bytes transferred between the memory controller and the DRAM module is reduced by 55% (not shown) with Sectored DRAM compared to the baseline. In this way, the system uses the power-hungry memory channel more energy-efficiently, eliminating unnecessary data movement. Second, \(SA\) can reduce the energy spent on activating DRAM rows by 6% on average. The reduction in \(ACT\) energy is relatively small. This is because the memory controller issues more \(ACT\) commands compared to the baseline in Sectored DRAM. The new \(ACT\) commands (i) respond to the additional memory requests caused by sector misses and (ii) resolve the row conflicts that occur due to interference created by the sector misses.
System Energy Consumption. Figure 11 (right) shows the energy consumed by the Sectored DRAM system (processor and DRAM) normalized to the energy consumed by the baseline system for all workloads. We observe that Sectored DRAM reduces system energy consumption, on average (at maximum), by 14% (23%). Sectored DRAM does so by (i) reducing DRAM energy consumption and (ii) reducing background power consumption by the processor as workloads execute faster.
7.5 Area Overhead
Modeled DRAM Chip. We use CACTI [8] to model the area of a DRAM chip (see Table 2) using 22-nm technology. Our model is open source [107]. Our modeled DRAM chip takes up, in each bank: (i) 8.3 mm² for DRAM cells, (ii) 3.2 mm² for wordline drivers, (iii) 4.6 mm² for sense amplifiers, (iv) 0.1 mm² for the row decoder, (v) <0.1 mm² for the column decoder, and (vi) <0.4 mm² for the data and address buses.
Sectored DRAM. We model the overhead of (i) eight additional LWD (local wordline driver) stripes, (ii) sector transistors, (iii) sector latches, and (iv) wires that propagate sector bits from the sector latches to the sector transistors to implement SA (Section 4.1). Sectored DRAM introduces 2.26% area overhead (0.39 mm²) over the baseline DRAM bank. Overall, Sectored DRAM increases the area of the chip (16 banks and I/O circuitry) by only 1.72%.
FGA [19, 124] and PRA [69]. We estimate the area overhead of these architectures to be the same as Sectored DRAM because they require the same set of modifications to the DRAM array to enable Fine-DRAM-Act.
HalfDRAM [136] and HalfPage [31]. We estimate the chip area overheads of HalfDRAM and HalfPage to be 2.6% and 5.2%, respectively. Both HalfDRAM and HalfPage require eight additional LWD stripes, as Sectored DRAM does. HalfDRAM further requires doubling the number of CSL (column select line) signals [136] to enable its mirrored connection, and HalfPage requires doubling the number of HFFs per mat [31].
Processor. We use CACTI to model the storage overhead of sector bits in caches (1 byte/cache block) and the SP (1,088 bytes/core). The sector bits (200 KiB additional storage for a system with 12.5 MiB cumulative L1, L2, and L3 cache capacity) and the predictor storage increase the area of the eight-core processor by \(1.22\%\) .
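As a quick sanity check of the sector-bit storage figure, with 64-byte cache blocks and 1 byte of sector bits per block:

\[
\frac{12.5\ \text{MiB}}{64\ \text{B/block}} = 204{,}800\ \text{blocks} \times 1\ \text{B/block} = 200\ \text{KiB}.
\]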
9 Related Work
Sectored DRAM is the first low-cost and high-performance DRAM substrate that alleviates the energy waste on the DRAM bus by enabling (i) Fine-DRAM-Transfer and (ii) Fine-DRAM-Act. We extensively compare Sectored DRAM, qualitatively and quantitatively, with the most relevant low-cost, state-of-the-art fine-grained DRAM architectures [19, 31, 69, 124, 136] in Sections 3.1 and 7.4. In this section, we discuss other related works.
Other FGA Mechanisms. Prior works [3, 16, 92, 116, 135] propose other fine-grained DRAM architectures. These architectures require intrusive reorganization of the DRAM array and/or modifications to the DRAM on-chip interconnect. Some works [16, 92, 116] target higher-bandwidth DRAM standards and offer significant performance and activation energy improvements. Zhang and Guo [135] develop a new interconnect that serializes data from multiple partially activated banks. Alawneh et al. [3] divide DRAM mats into submats by adding more HFFs to mitigate the throughput loss of FGA. These works do not reduce the energy wasted on the memory channel by avoiding the transfer of unused words, which Sectored DRAM does at low chip area cost.
DRAM-Module-Level Fine-DRAM-Transfer. A class of prior works [2, 12, 126, 131, 132, 137] proposes new DRAM module designs (e.g., subranked DIMMs [2, 126, 131, 132]) that allow independent operation of each DRAM chip in a DRAM module (Section 2.1) to implement Fine-DRAM-Transfer. From a system standpoint, Sectored DRAM is more practical to implement than module-level mechanisms because Sectored DRAM requires no modifications to the physical DRAM interface. Similar to Sectored DRAM, these mechanisms (e.g., DGMS [132]) require modifications to DRAM (i.e., to the DRAM module in DGMS and to the DRAM chip in Sectored DRAM) and to the processor. However, on top of these modifications, module-level mechanisms also require modifications to the physical DRAM interface (e.g., three additional pins to select one of the eight chips in a rank [2]), making them incompatible with current industry standards (i.e., JEDEC specifications [45]). In contrast, Sectored DRAM does not require modifications to the physical DRAM interface, and thus Sectored DRAM chips comply with existing DRAM industry standards and specifications.
We quantitatively evaluate a subranked DIMM design (DGMS [132]) that can be implemented with minimal modifications to the physical DRAM interface. This design can operate subranks independently, and each subrank can receive one DRAM command per DRAM command bus cycle (i.e., the 1\(\times\) ABUS scheme [131]). We find that this design reduces system performance for the high MPKI workload mixes, causing a 23% reduction in weighted speedup, on average. Even though the subranked DIMM allows requests to be served from different subranks in parallel, the DRAM command bus bandwidth is insufficient to allow timely scheduling of these requests to different subranks [132] (i.e., the command bus becomes the bottleneck). The DRAM command bus bandwidth can be increased to enable higher-performance subranked DIMM designs; however, this comes at additional hardware cost and requires modifications to the physical DRAM interface [132]. In contrast, Sectored DRAM improves the weighted speedup for the same set of workloads by 17%, on average, and requires no modifications to the physical DRAM interface.