
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture

Published: 14 September 2024

Abstract

Modern computing systems access data in main memory at coarse granularity (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does not use all individually accessed small portions (e.g., words, each of which typically is 64 bits) of a cache block. In modern DRAM-based computing systems, two key coarse-grained access mechanisms lead to wasted energy: large and fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations.
We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfer and DRAM row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block’s residency in cache and (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words by carefully operating physically isolated portions of DRAM rows (i.e., mats). Activating a smaller set of cells on each access relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster.
We evaluate Sectored DRAM using 41 workloads from widely used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly memory intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM’s DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM’s DRAM chip area overhead is 1.7% of the area of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture (Half-DRAM). It is our hope and belief that Sectored DRAM’s ideas and results will help to enable more efficient and high-performance memory systems. To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.

1 Introduction

DRAM [23] is hierarchically organized to improve scaling in density and performance. At the highest level of the hierarchy, a DRAM chip is partitioned into banks that can be accessed simultaneously [58, 59, 60, 64, 87]. At the lowest level, a collection of DRAM rows (each a set of DRAM cells that are activated together) is typically divided into multiple DRAM mats that can operate individually [43, 53, 59, 125]. Even though DRAM chips are hierarchically organized, standard DRAM interfaces (e.g., DDRx [44, 45, 46]) do not expose DRAM mats to the memory controller. To access even a single DRAM cell, the memory controller needs to activate a large number of DRAM cells (e.g., 65,536 DRAM cells in a DRAM row in DDR4 [81]) and transfer many bits (e.g., a cache block, typically 512 bits [33]) over the memory channel. Thus, in current systems, both DRAM data transfer and activation are coarse-grained. Coarse-grained data transfer and activation cause significant energy inefficiency in systems that use DRAM as main memory for two major reasons.
First, coarse-grained DRAM data transfer causes unnecessary data movement. Standard DRAM interfaces transfer data at cache block granularity over fixed-size data transfer bursts (e.g., eight-cycle bursts in DDR4 [45, 81]), but a large fraction of data (e.g., more than 75% [98]) in a cache block is not used (i.e., referenced by CPU load/store instructions) during the cache block’s residency in the cache hierarchy (i.e., from the moment the cache block is brought to the on-chip caches until it gets evicted) [62, 63, 97, 98, 131, 132]. Thus, transferring unused words of a cache block over the power-hungry memory channel wastes energy [3, 16, 31, 69, 70, 92, 96, 116, 124, 127, 135].
Second, coarse-grained DRAM activation causes an unnecessarily large number of DRAM cells in a DRAM row to be activated. Subsequent DRAM accesses to the activated row can be served faster. However, many modern memory-intensive workloads with irregular access patterns cannot benefit from these faster row accesses, as the spatial locality in these workloads is lower than the DRAM row size [30, 69, 83, 85, 86, 88, 120, 121, 133]. Thus, the energy cost of activating all cells in a DRAM row is not amortized over many accesses to the same row, leading to energy waste from activating a disproportionately large number of cells.
Prior works [3, 16, 19, 31, 69, 92, 116, 124, 135, 136] develop DRAM substrates that enable fine-grained DRAM data transfer and activation, allowing words of a cache block to be individually retrieved from DRAM and a small number of DRAM cells to be activated with each DRAM access. However, these prior works (i) cannot provide high DRAM throughput [19, 124], (ii) incur high DRAM area overheads [3, 16, 31, 92, 116, 135, 136], and (iii) do not fully enable fine-grained DRAM [19, 31, 69, 124, 136] (Section 3.1).
Our goal is to develop a new, low-cost, and high-throughput DRAM substrate that can mitigate the excessive energy consumption from both (i) transmitting unused data on the memory channel and (ii) activating a disproportionately large number of DRAM cells. To this end, we develop Sectored DRAM. Sectored DRAM leverages two key ideas to enable fine-grained data transfer and row activation at low chip area cost. First, a cache block transfer between main memory and the memory controller happens in a fixed number of DRAM interface clock cycles where only a word of the cache line is transferred in each cycle. Sectored DRAM augments the memory controller and the DRAM chip to perform cache block transfers in a variable number of clock cycles based on the workload access pattern. Second, a large DRAM row, by design, is already partitioned into smaller independent physically isolated regions. Sectored DRAM provides the memory controller with the ability to activate each such region based on the workload access pattern.
Sectored DRAM implements (i) Variable Burst Length (VBL) to enable fine-grained DRAM data transfer and (ii) Sectored Activation (SA) to enable fine-grained DRAM activation. \(VBL\) dynamically adjusts the number of cycles in a burst to transfer a different word of a cache block with each DRAM interface cycle, thus enabling fine-grained DRAM data transfer. To do so at low cost, \(VBL\) builds on existing DRAM I/O circuitry that already selects one word of a cache block to transfer in one cycle of a burst.
To enable \(SA\) with low hardware cost, we leverage the fact that DRAM rows are already partitioned into independent, physically isolated regions (mats) that can be individually activated with small modifications to the DRAM chip. We refer to a mat that incorporates these modifications as a sector. Activating a sector consumes considerably less energy than activating a DRAM row, as a sector typically contains almost an order of magnitude fewer cells (e.g., 1,024 in a DDR4 chip) than a DRAM row (e.g., typically 8,192 in a DDR4 chip). \(SA\) implements (i) sector transistors, each of which is turned on to activate one of the independent mats, and (ii) sector latches that control the sector transistors. \(SA\) exposes the sector latches to the memory controller by using an existing DRAM command (Section 4.1); therefore, \(SA\) can be implemented without any changes to the physical DRAM interface. As the power required to activate a mat in a DRAM row is only a fraction of the power required to activate the whole row, Sectored DRAM also relaxes the power delivery constraints in DRAM chips [69, 92, 136]. Doing so allows for the activation of DRAM rows at a higher rate, increasing memory-level parallelism for memory-intensive workloads.
\(VBL\) and \(SA\) provide two key primitives for power-efficient, sub-cache-block-sized (e.g., 8-byte or one-word) data transfers between main memory and the rest of the system. However, because modern systems are typically designed to have cache-block-sized (e.g., 64-byte) data transfers between system components, making performance- and energy-efficient use of the two Sectored DRAM primitives (\(VBL\) and \(SA\)) requires system-wide modifications in hardware. We develop two hardware techniques (Section 5.2), (i) Load/Store Queue (LSQ) Lookahead and (ii) Sector Predictor (SP), to effectively integrate Sectored DRAM into a system. At a high level, LSQ Lookahead and SP determine and predict, respectively, which words of a cache block should be retrieved from a lower-level component of the memory hierarchy. Accurately determining the words of a cache block that are used during the cache block’s residency in system caches enables high system performance and low system energy consumption by improving data reuse in system caches instead of repeating a high-latency main memory access for each used word of a cache block.
LSQ Lookahead accumulates, in an older load/store instruction’s memory request, the individual words of the same cache block that are accessed by younger load/store instructions. Thus, the execution of a load/store instruction prefetches the portions of cache blocks that will be accessed by the in-flight (i.e., not yet executed) load/store instructions. SP predicts which portions of a cache block will be accessed by a load/store instruction based on that instruction’s past cache block usage patterns. This allows SP to accurately predict the portions of a cache block that will be used by the processor during the cache block’s residency in the cache hierarchy.
We evaluate the performance and energy of Sectored DRAM using 41 workloads from the SPEC2006 [118], SPEC2017 [119], and DAMOV [91, 106] benchmark suites using Ramulator [60, 75, 104, 105], DRAMPower [13], and the Rambus Power Model [99]. Sectored DRAM significantly reduces system energy consumption and improves system performance for memory-intensive workloads with irregular access patterns (which amounts to 10 of our workloads). For such workloads, compared to a system with conventional coarse-grained DRAM, Sectored DRAM reduces DRAM energy consumption by 20%, improves system performance by 17%, and reduces system energy consumption by 14%, on average. Sectored DRAM does so as it (i) improves workload execution time by issuing ACTIVATE (ACT) commands at a higher rate and thereby reducing average memory latency, and (ii) activates fewer DRAM cells and retrieves fewer sectors from DRAM at lower power. We estimate the DRAM area overheads of Sectored DRAM using CACTI [8] and find that it can be implemented with low hardware cost. Sectored DRAM incurs 0.39 mm2 area overhead (1.7% of a DRAM chip) and does not require modifications to the physical DRAM interface. Compared to the evaluated state-of-the-art fine-grained DRAM architectures [19, 69, 132, 136], Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture [136]. Sectored DRAM provides 10% higher performance benefits and 13% higher DRAM energy benefits than a low-cost state-of-the-art fine-grained DRAM architecture [69]. We open source our simulation infrastructure and all datasets to enable reproducibility and help future research [107].
We make the following contributions:
We introduce Sectored DRAM and its two key mechanisms: VBL and SA. Sectored DRAM improves system performance and reduces system energy consumption by enabling fine-grained DRAM data transfer and activation.
We develop two techniques (LSQ Lookahead and SP) to effectively integrate Sectored DRAM into a system. Our techniques reduce the number of high-latency memory accesses by accurately identifying the words of a cache block that will be used by the processor.
We evaluate Sectored DRAM with a wide range of workloads and observe that it provides higher system performance and energy efficiency than coarse-grained DRAM as well as multiple prior fine-grained DRAM proposals.
We open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.

2 DRAM Background

We provide the background on DRAM organization that is most relevant to our work.

2.1 DRAM Organization

A typical computing system implements a memory controller in the processor chip. The memory controller connects to multiple DRAM modules over multiple memory channels. Figure 1 illustrates the hierarchical DRAM organization inside a DRAM module (see Figure 1(a)). Multiple DRAM chips constitute a DRAM module (see Figure 1(b)). All DRAM chips in a module operate in lockstep: they receive the same DRAM commands from the memory controller at the same time and respond to commands in unison [15, 55]. A DRAM chip has multiple DRAM banks that can be accessed in parallel (see Figure 1(c)). All banks in a chip share input/output (I/O) logic.
Fig. 1.
Fig. 1. DRAM module, chip, and bank organization, as depicted in the work of Oliveira et al. [90].
A DRAM bank consists of a global row buffer (or prefetch buffer), a global row decoder (or a global wordline driver), and multiple subarrays. The global row decoder drives a global wordline signal in every subarray. Global bitlines connect the DRAM cells in each subarray to the global row buffer via column select logic (CSL). Each subarray contains a set of local row decoders (or local wordline drivers (LWDs)), a local row buffer, which is an array of sense amplifiers (depicted as SAs in Figure 1), helper flip-flops (HFFs), and mats (see Figure 1(d)). Inside every mat, DRAM cells are placed in a two-dimensional array of local wordlines (LWLs) and local bitlines. A DRAM cell is composed of an access transistor and a cell capacitor (see Figure 1(e)). DRAM cells that lie on the same LWL across different mats form a DRAM row (not shown in the figure).

2.2 Accessing DRAM

The memory controller accesses DRAM in two steps. First, the memory controller sends an ACT command with a row address to open (i.e., make accessible) a DRAM row. The global row decoder drives the global wordline (master wordline (MWL)) corresponding to the higher-order bits of the row address. The MWL enables a single LWL, addressed by the lower-order bits of the row address, in every mat in a subarray. A driven LWL enables the access transistors of all cells in the DRAM row, causing the cells to share their charge with their bitlines and the sense amplifiers to read the values in the cells. Second, the memory controller sends a \(READ\) command with a column address to retrieve multiple bytes of data (e.g., 8 B for an x8 DDR4 chip [45]) from the open DRAM row. The \(READ\) command moves the data in the local row buffer, over the HFFs, to the global row buffer (the prefetch buffer). Thus, the throughput of internal DRAM data transfers (i.e., between the global and local row buffers) is constrained by the number of HFFs per mat.
Row Buffer. Once a row is opened (i.e., the row is buffered in the local row buffer), subsequent \(READ\) and \(WRITE\) commands targeting the row can be served at a fast rate. An access that targets an open row is a row buffer hit, and an access that targets a row other than the open row in a bank is a row buffer conflict.
Accessing Another Row. When a bank already has an open row, and the memory controller wants to access another row, the memory controller first issues a PRECHARGE (PRE) command to close the open DRAM row.

2.3 Data Transfer Bursts

DRAM modules transfer data on the memory channel over multiple interface clock cycles. For example, a \(READ\) command transfers a cache block (e.g., 512 bits) over eight interface cycles in DDR4 [81]. Each such transfer is referred to as a burst, and the burst length defines the number of double-data-rate (DDR) interface cycles it takes to transfer the data. A cache block is divided into equally sized pieces and placed in different chips (e.g., if there are eight chips, each chip receives \(\frac{1}{8}\) of the cache block). These equally sized pieces are further distributed across multiple mats inside a bank of the chip. We depict how a cache block is scattered across multiple chips and mats for a DDR4 module in Figure 2 (left). Figure 2 (right) shows the timing diagram of the command and data buses during a \(WRITE\) transfer.
Fig. 2.
Fig. 2. Example cache block placement in DRAM mats (left) and diagram depicting an eight-cycle data transfer burst to Chip 0 (right). “B” means “byte.”
A DRAM data transfer happens in three steps (we use a \(WRITE\) transfer as an example; a \(READ\) data transfer happens analogously). First, the memory controller drives 64 DQ (data) signals to transfer a 64-bit portion of the cache block in each beat (i.e., data transmitted in one DDR interface cycle) of the data transfer burst ❶. Second, each chip receives 8 bits in a beat ❷. A chip accumulates 64 bits during the burst in its prefetch buffer. Third, these 64 bits are copied into the mats inside the chip ❸. In our example, only 8 bits are transferred in a burst to (from) a mat with a \(WRITE\) ( \(READ\) ) command. Thus, the maximum DRAM throughput can only be obtained if every mat contributes 8 bits to the data transfer burst.
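To make the burst organization concrete, the sketch below enumerates one plausible mapping of cache block bytes to burst beats and chips for the module described above. The exact byte ordering is our assumption for illustration and is not taken from the DDR4 standard or the text above.

```cpp
// Illustrative only: one plausible byte-to-beat mapping for a 64-byte cache
// block on an 8-chip (x8) DDR4 module with an 8-beat burst. Each beat moves
// 64 bits (8 bits per chip); over the burst, each chip accumulates 64 bits,
// which its 8 mats supply at 8 bits per mat.
#include <cstdio>

int main() {
    const int kChips = 8;
    const int kBeats = 8;
    for (int byte = 0; byte < kChips * kBeats; ++byte) {
        int beat = byte / kChips;   // which of the 8 DDR beats carries this byte
        int chip = byte % kChips;   // which chip's 8 DQ pins carry this byte
        std::printf("cache block byte %2d -> beat %d, chip %d\n", byte, beat, chip);
    }
    return 0;
}
```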

2.4 The tFAW Timing Parameter

DDRx specifications (e.g., DDR4 [45] and DDR5 [46]) define the \(t_{FAW}\) timing parameter, which specifies a time window in which no more than four \(ACT\) commands are allowed (i.e., the memory controller can schedule at most four \(ACT\) commands in any \(t_{FAW}\)-wide time window). \(t_{FAW}\) allows a DRAM chip to correctly provide the chip’s various components with the power required to activate large DRAM rows. \(t_{FAW}\) typically limits row activation frequency and diminishes memory-level parallelism, degrading the performance of memory-intensive workloads [52].
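For intuition, with typical DDR4 timing values (the ones we use in our evaluation: \(t_{FAW} = 25\) ns, \(t_{RRD\_S} = 2.5\) ns), the activation-to-activation spacing alone would permit ten \(ACT\) commands per window, but \(t_{FAW}\) caps the count at four:
\[
\min\left(4,\ \left\lfloor \frac{t_{FAW}}{t_{RRD\_S}} \right\rfloor\right) = \min\left(4,\ \left\lfloor \frac{25\,\mathrm{ns}}{2.5\,\mathrm{ns}} \right\rfloor\right) = \min(4,\ 10) = 4 \ \text{ACTs per } t_{FAW} \text{ window.}
\]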

3 Motivation

We study the impact of coarse-grained DRAM data transfer (Coarse-DRAM-Transfer) and activation (Coarse-DRAM-Act) in 41 single-core workloads from a variety of domains (see Section 6 for our methodology). We compare the energy consumption of these coarse-grained systems to that of a system that performs (i) fine-grained DRAM data transfer (Fine-DRAM-Transfer) at word granularity and (ii) fine-grained DRAM activation (Fine-DRAM-Act) at mat granularity.
We make two observations from our study. First, the DRAM data transfer energy of the Coarse-DRAM-Transfer system is 1.27\(\times\) that of the Fine-DRAM-Transfer system. The increase in energy consumption in the Coarse-DRAM-Transfer system is caused by retrieving words of a cache block that the processor does not use. This leads to a 45% increase in data movement between DRAM and the CPU in the Coarse-DRAM-Transfer system, on average. Second, the DRAM activation energy of the Coarse-DRAM-Act system is 1.04\(\times\) that of the Fine-DRAM-Act system. Like the increase for coarse-grained DRAM data transfers, the increase in energy consumption with coarse-grained DRAM activation is caused by activating a large, fixed-size DRAM row that the processor does not entirely use. As prior works [30, 69, 83, 85, 86, 88, 120, 121, 133] show, such an increase in energy consumption with coarse-grained DRAM activation occurs because modern memory-intensive workloads with irregular access patterns suffer from low spatial locality, which reduces the benefit of a large DRAM row buffer.

3.1 Enabling Fine-Grained DRAM: Challenges and Limitations

Efficiently enabling fine-grained DRAM data transfer and activation can significantly reduce system energy consumption. However, to do so, we must overcome three main challenges:
(1)
Maintaining high DRAM throughput: Current DRAM systems leverage coarse-grained data transfers to maximize DRAM’s throughput. Enabling fine-grained DRAM in a straightforward way, such as by placing a piece of the cache block stored by a DRAM chip in the same mat instead of distributing the piece across multiple mats, reduces DRAM throughput as one mat contributes only a fraction of the total DRAM internal throughput (Section 2.3). This issue can be alleviated by increasing the number of the HFFs. However, this approach is costly since it severely complicates DRAM array routing [69, 125, 136].
(2)
Incurring low DRAM area overhead: DRAM manufacturing is highly optimized for density and cost [72, 77, 85]. While enabling fine-grained DRAM, one must avoid applying intrusive modifications to the DRAM array since such modifications are difficult to integrate into real designs.
(3)
Fully exploiting fine-grained DRAM: The energy waste of coarse-grained DRAM systems stems from rigid DRAM data transfer and activation granularities. Thus, a fine-grained DRAM system must enable flexible DRAM data transfer and activation granularities for both read and write operations to eliminate such energy waste. However, integrating fine-grained DRAM into current systems is challenging, as systems are typically designed to access DRAM at cache block granularity.
Prior works [3, 16, 19, 31, 69, 92, 116, 124, 135, 136] propose different mechanisms to enable fine-grained DRAM substrates, aiming to alleviate the energy waste caused by coarse-grained DRAM. Such works can be divided into two broader groups: (1) works that propose intrusive modifications to the DRAM array circuit and organization (e.g., new DRAM interconnects, considerably more HFFs) [3, 16, 92, 116, 135] and (2) works that aim to enable fine-grained DRAM without intrusive modifications to DRAM [19, 31, 69, 124, 136]. The intrusive DRAM modifications proposed by the first group lead to significant DRAM area overheads, which makes it difficult to integrate the first group of works into real DRAM designs.
Table 1 qualitatively compares how prior works from the second group address the three challenges of enabling fine-grained DRAM. We observe that no prior work can simultaneously provide (i) high DRAM throughput (FGA [19] and SBA [124] change the cache block mapping such that DRAM transfers can be served from only one mat but reduce the throughput of data transfers by doing so), (ii) low area overhead (HalfDRAM [136] and HalfPage [31] require changes to the number and organization of DRAM’s HFFs, leading to non-negligible area overheads), and (iii) mechanisms that fully exploit fine-grained DRAM (PRA [69] only enables fine-grained DRAM data transfer and activation for write operations; HalfDRAM, HalfPage, FGA, and SBA still impose a rigid DRAM data transfer granularity). We conclude that no prior work efficiently enables fine-grained DRAM access (i.e., both data transfer and activation).
Table 1.
Work | High (100%) Throughput | Low (<2%) Area Overhead | Fully Exploit Fine-Grained DRAM (Activation / Data Transfer)
FGA [19]
SBA [124]
HalfDRAM [136]
HalfPage [31]
PRA [69]
This Work
Table 1. Sectored DRAM vs. Prior Works
Our goal is to address prior works’ limitations while efficiently mitigating the energy consumed by transferring unused data on the memory channel and activating unused DRAM cells. To this end, we develop Sectored DRAM, a new, practical, and high-performance fine-grained DRAM substrate.

4 Sectored DRAM

We leverage two key observations regarding DRAM chip design to implement Sectored DRAM at low cost. First, we observe that DRAM mats naturally split DRAM rows into fixed-size portions. Second, the DRAM I/O circuitry already implements a mechanism to select one portion of a cache block to transfer it in one beat (i.e., data transmitted in one DDR interface cycle) of a burst.
Sectored DRAM consists of two new mechanisms implemented in a DRAM chip: \(SA\) and \(VBL\) . \(SA\) enables fine-grained control over the activation of sectors in DRAM by making minimal modifications to how LWLs are driven. \(VBL\) enables fine-grained control over data transfer bursts by transferring only the portions of a cache block that correspond to the activated sectors in the DRAM chip.
We expose the two mechanisms to the memory controller such that the system can benefit from Sectored DRAM, with no changes to the physical DRAM interface and only small changes to DRAM interface specification.

4.1 Sectored Activation

Figure 3(a) depicts the architecture of a DRAM array with eight mats in one subarray [31, 53, 61, 92, 136]. We make a key observation: DRAM mats split DRAM rows into fixed-size portions. \(SA\) augments these portions by allowing them to be activated individually. We refer to these augmented portions as sectors.
Fig. 3.
Fig. 3. Wordline organization in a conventional DRAM subarray (left) and in a Sectored DRAM subarray (right).
\(\textbf {SA}\) Design. To implement \(SA\), we propose minor modifications to the existing architecture. Figure 3(b) depicts our modifications to the DRAM subarray with 8 mats (the modifications are highlighted in blue). First, we insert new LWDs (❶) such that each LWD drives only one LWL. Thus, when a single LWD is enabled, only the cells in a single sector are activated, as opposed to cells from all mats in a subarray getting activated in the existing architecture (e.g., LWDs between mat 0 and mat 1 in Figure 3(a) drive two LWLs that extend onto both of the mats). Second, to select which sectors are opened with an \(ACT\) command, we place one sector latch (❷) for every sector in the horizontal direction. Third, we isolate the MWL from the LWDs using sector transistors (❸) such that a driven MWL does not enable all LWDs in all sectors but only enables LWDs in the open sector(s). With these three modifications, a sector with a set (i.e., logic-1) sector latch is activated when the MWL is driven (with an \(ACT\) command) because two sector transistors connect the MWL to the LWDs.
Exposing \(\textbf {SA}\) to the Memory Controller. To make use of \(SA\), the memory controller needs to control the sector latches. To implement \(SA\) with no modifications to the DRAM interface signals, we use the unused bits in the \(PRE\) command’s encoding [45] to encode the sector bits. Each sector bit encodes whether a sector latch is set or reset. The memory controller sends a bitvector of sector bits with every \(PRE\) command to the Sectored DRAM chip. These sector bits are used for the \(ACT\) command that follows the \(PRE\) command. When the bank is closed (i.e., there are no open rows in the bank), the memory controller schedules a \(PRE\) command before the first \(ACT\) command to convey the sector bits to the DRAM chip. The minimum timing delay between successive \(PRE\) and \(ACT\) commands targeting the same bank (dictated by \(t_{RP}\), \(\sim\)13 ns [45, 81]) is sufficiently long that the sector bits can propagate from the DRAM chip’s inputs to the sector latches before the \(ACT\) command following a \(PRE\) command is issued to the same DRAM bank. To issue regular, row-level DRAM commands (e.g., a periodic refresh or an ACT command), the memory controller simply sets all sector bits (to enable all sectors) before issuing the row-level command.
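The following is a minimal sketch (our illustration, not the paper's controller implementation) of how a memory controller could convey sector bits through the unused bits of the \(PRE\) command encoding, as described above. The command-issue helpers are hypothetical stand-ins for the controller's PHY interface.

```cpp
// Sketch only: sector bits ride on the PRE that precedes each ACT.
#include <cstdint>
#include <cstdio>

struct Bank {
    bool     row_open    = false;
    uint32_t open_row    = 0;
    uint8_t  sector_bits = 0xFF;   // one bit per sector; 0xFF enables all 8 sectors
};

void issue_PRE(int bank_id, uint8_t sector_bits) {   // hypothetical PHY hook
    std::printf("PRE  bank %d, sector bits 0x%02x\n", bank_id, sector_bits);
}
void issue_ACT(int bank_id, uint32_t row) {          // hypothetical PHY hook
    std::printf("ACT  bank %d, row %u\n", bank_id, row);
}

// Open `row` in `bank` with only the sectors marked in `sectors` enabled.
void open_row_with_sectors(int bank_id, Bank& bank, uint32_t row, uint8_t sectors) {
    // The PRE closes the bank (a PRE is sent even if the bank is already closed)
    // and latches the sector bits that the *next* ACT to this bank will use;
    // tRP (~13 ns) gives the bits time to reach the sector latches.
    issue_PRE(bank_id, sectors);
    bank.sector_bits = sectors;
    issue_ACT(bank_id, row);       // activates only the sectors whose latch is set
    bank.row_open = true;
    bank.open_row = row;
}

int main() {
    Bank bank0;
    open_row_with_sectors(/*bank_id=*/0, bank0, /*row=*/42, /*sectors=*/0b00000101);
    open_row_with_sectors(/*bank_id=*/0, bank0, /*row=*/7,  /*sectors=*/0xFF); // full-row ACT
    return 0;
}
```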
Because activating one sector requires considerably less power consumption (Section 7.1) than activating all sectors, the \(t_{FAW}\) timing constraint can be relaxed to allow for, within a \(t_{FAW}\) , a larger number of \(ACT\) commands that activate fewer than all eight sectors in a DRAM row [69, 79, 80, 92, 136]. Section 6.3 describes how exactly we relax \(t_{FAW}\) based on how many sectors an \(ACT\) command activates.

4.2 Variable Burst Length

\(VBL\)’s goal is to allow a DRAM chip to transmit (i.e., \(READ\)) and receive (i.e., \(WRITE\)) data in variable-length bursts such that each beat of the burst transfers only the data corresponding to one of the enabled sectors.
\(\textbf {VBL}\) Design. Figure 4 depicts the I/O read/write circuitry of a modern DRAM chip [81]. In such a chip, data is first moved from the DRAM array to the Read FIFO (❶) with every \(READ\) command. The Read FIFO comprises eight entries, and each entry stores the data that will be transmitted over the DQ pins in one beat of the data transfer burst. The Read MUX (❷) selects one entry in the Read FIFO based on the value of the burst counter (not shown in the figure), which counts the number of beats in the transfer.
Fig. 4.
Fig. 4. I/O circuitry of a DRAM chip and VBL.
By studying the I/O read/write circuitry of modern DRAM chips, we observe that a DRAM chip selects (using the burst counter) individual entries in the Read FIFO to drive the DQ pins within a beat [81]. Based on this observation, \(VBL\)’s key idea is to slightly modify the DRAM chip’s Read FIFO entry selection criteria. We replace the burst counter with an 8\(\times\)3 encoder (❸) that takes sector bits as input and outputs only the indices of the Read FIFO entries that contain data from the open sectors. Using the encoder, the Read MUX skips the entries in the Read FIFO that correspond to the closed sectors, driving the DQ pins only with the data that comes from the open sectors (❹ in Figure 4, right). Since the Write FIFO is organized in the same way as the Read FIFO, \(VBL\) reuses the same encoder for \(WRITE\) transfers to correctly fill the entries in the Write FIFO.
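The sketch below is our software rendering of that selection logic, not the actual circuit: only the Read FIFO entries of open sectors are transmitted, so the burst length equals the number of set sector bits.

```cpp
// Illustrative sketch of VBL's Read FIFO entry selection: instead of a burst
// counter stepping through entries 0..7, only the entries of open sectors
// drive the DQ pins.
#include <cstdint>
#include <vector>

std::vector<int> vbl_beat_indices(uint8_t sector_bits) {
    std::vector<int> beats;
    for (int i = 0; i < 8; ++i)
        if (sector_bits & (1u << i))   // sector i is open
            beats.push_back(i);        // transmit its Read FIFO entry in the next beat
    return beats;                      // burst length == popcount(sector_bits)
}
// Example: sector_bits = 0b00010100 yields beats {2, 4}, i.e., a two-beat burst.
```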
Exposing \(\textbf {VBL}\) to the Memory Controller. To use \(VBL\) , the memory controller and the DRAM chip need to agree on the burst length before the data transfer starts. This is important for both parties to calibrate their I/O drivers correctly and capture the signals on the high-frequency DRAM interface [44, 45, 46]. We use the sector bits that are already communicated to the DRAM chip by fine-grained DRAM activation operations to determine the burst length (Section 4.1) of data transfers, before the transfers start.
We make two modifications. First, we implement a low overhead 8-bit popcount circuit [20, 28] in both the DRAM chip and the memory controller to count the number of set sector bits in the DRAM bank targeted by a \(READ\) or a \(WRITE\) . The popcount circuitry requires only 34 logic gates to be implemented [20], introducing almost negligible area overhead. Second, we extend the bank state table of the memory controller with sector bits. The bank state table already contains metadata, such as the address of the enabled row, for every bank. The additional storage requirement for sector bits in the bank state table is small, only 128 bits (8 bits for each of the 16 banks [45, 81]).

5 System Integration

We describe the challenges in integrating Sectored DRAM into a typical system and propose solutions. To explain the challenges and our solutions clearly, we assume that the system uses a DDR4 module with eight chips as main memory and that each chip has eight sectors. Since there are eight sectors in every chip, one sector from each DRAM chip collectively stores one word (64 bits) of the cache block.
Integration Challenges. We identify two challenges in integrating Sectored DRAM into a system. First (Section 5.1), to benefit from Sectored DRAM’s potential energy savings, the system and main memory (DRAM) must conduct data transfers at sub-cache-block granularity (e.g., transfer one or multiple words). Therefore, a cache block may have both valid (up-to-date) and invalid (stale or evicted) words present in system caches. However, caches keep track of the valid on-chip data at cache block granularity. This granularity is too coarse to keep track of valid words in a cache block. Second (Section 5.2), because some words in a cache block can be invalid, references to these words (e.g., made by load/store instructions) would result in a cache miss. This can induce performance overheads.
We propose the following minor system modifications to overcome Sectored DRAM’s integration challenges. First, to track which words in a cache block are valid, we extend cache blocks with additional bits each of which indicates if a 64-bit word in the cache block is valid. Second, to accurately retrieve all useful words in a cache block (i.e., words that will be used until the cache block is evicted), we develop two techniques: (i) LSQ Lookahead and (ii) SP.

5.1 Tracking Valid Words in the Processor

Since a Sectored DRAM-based system can retrieve individual words of a cache block from DRAM, system caches must store data at a granularity that is finer than the typical 512-bit granularity. One straightforward approach to allow finer-granularity storage in caches is to reduce the cache block size from 512 bits to the size of a word (e.g., 64 bits). However, for the same cache size, doing so requires implementing \(8\times\) as much storage for cache block tags, which introduces significant area overhead. Instead, we extend cache blocks with just 8 additional bits, each of which indicates whether a word in the cache block is valid or invalid, using sectored caches [4, 7, 37, 51, 73, 74, 101, 103, 112].
Sector Cache Operation. We describe the three-step process performed by a memory request to access a word in the highest-level sector cache (i.e., the L1 cache). First, the processor sends a memory request with a memory address and a vector of eight sector bits to the highest-level cache. The sector bits identify the words in the cache block that the processor core demands. Second, the L1 cache uses the memory address to identify the addressed cache set. Third, the L1 cache uses the cache block tag component of the memory address and the sector bits to access the words requested by the processor. The third step can result in three different scenarios: (i) if both the tag and the sector bits match one of the cache blocks in the cache set (i.e., there is both a tag and a sector bit match), the cache has the word that the processor core demands and this is a sector hit; (ii) if there is a tag match but no sector bit match, the cache has to request the missing sectors from a lower-level cache or main memory and this is a sector miss; and (iii) if there is no tag match, this is a cache miss.
Sector Misses. On a sector miss, the cache controller creates a memory request to retrieve the missing sector(s) from a lower-level cache or main memory. The cache controller determines the missing sector(s) by bitwise AND’ing the memory request’s (the request that triggers the sector miss) sector bits and the sector bits that are not set in the cache block. When the created memory request returns from a lower-level cache, the cache controller sets the cache block’s missing sector bits.
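A minimal sketch (our illustration; structure names are hypothetical) of the three lookup outcomes and the sector-miss computation described above:

```cpp
// Sketch of a sector cache lookup against one candidate block of a set.
#include <cstdint>

struct CacheBlock {
    bool     valid = false;
    uint64_t tag   = 0;
    uint8_t  valid_sectors = 0;    // one bit per 64-bit word of the block
};

enum class Outcome { SectorHit, SectorMiss, CacheMiss };

// Returns the outcome; on a sector miss, `missing` holds the sectors to fetch.
Outcome lookup(const CacheBlock& blk, uint64_t req_tag, uint8_t req_sectors,
               uint8_t& missing) {
    missing = 0;
    if (!blk.valid || blk.tag != req_tag)
        return Outcome::CacheMiss;                      // no tag match
    if ((req_sectors & blk.valid_sectors) == req_sectors)
        return Outcome::SectorHit;                      // tag + sector bits match
    // Tag matches, but some requested words are not present in the block:
    missing = req_sectors & static_cast<uint8_t>(~blk.valid_sectors);
    return Outcome::SectorMiss;                         // fetch only `missing`
}
// Example: req_sectors = 0b00000110, valid_sectors = 0b00000010
//          -> missing = 0b00000100, so only word 2 is fetched from the lower level.
```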
Sector Cache Compatibility. Sector caches do not require any modifications to existing cache coherence protocols (we explain how in the next paragraph). Sector caches are compatible with existing SRAM error correcting code (ECC) schemes [78, 82, 122], as the invalid words (i.e., missing sectors) in a cache block can still be used to correctly produce a codeword.
Cache Coherence. Sectored DRAM requires no modifications to existing cache coherence protocols that operate at the granularity of a cache block since cache coherence in Sectored DRAM is still maintained at cache block granularity. A processor core can only modify a sector in a cache block if the core owns the entire cache block (e.g., the cache block is in the M state in a MESI protocol). A cache block shared across multiple cores may have different valid sectors among its copies in different private caches. However, this does not violate cache coherence protocols.
Other Cache Architectures. There are numerous other multi-granularity cache architectures [39, 63, 97, 98, 102, 111] that could be used instead of sectored caches in Sectored DRAM to improve cache utilization (e.g., by reducing the number of invalid words stored in a cache block) at the cost of increased storage for tags and hardware complexity [63]. We use sectored caches to minimize the storage and hardware complexity overheads in Sectored DRAM and leave the exploration of other cache architectures in Sectored DRAM to future work.

5.2 Accurate Word Retrieval from Main Memory

With sector caches, a Sectored DRAM-based system can transfer data at word granularity between components in the memory hierarchy (e.g., between the L1 and the L2 cache) instead of transferring data at cache block granularity. However, retrieving cache blocks word-by-word from DRAM can reduce system performance compared to bringing in cache blocks as a whole because the processor needs to complete multiple high-latency DRAM accesses to retrieve a word (on a sector miss) as opposed to completing a single memory access to retrieve the whole cache block. To minimize the performance overheads induced by the additional DRAM accesses and to better benefit from the energy savings provided by Sectored DRAM, we propose two mechanisms that greatly reduce the number of sector misses.
LSQ Lookahead. The key idea behind LSQ Lookahead is to exploit the spatial locality in subsequent load/store instructions that target the same cache block. A load or a store instruction typically references one word in main memory. LSQ Lookahead, at a high level, looks ahead in the processor’s LSQs and finds load and store instructions that reference different words/sectors in the same cache block. LSQ Lookahead then collects the word/sector references, made by younger load/store instructions to the same cache block as the oldest load/store instruction, and stores the collected sector references in the oldest load/store instruction’s sector bits. This way, a load/store instruction, when executed, retrieves all words in a cache block that will be referenced in the near future (by younger load/store instructions) to the L1 cache with only one cache access.
Figure 5(a) depicts how LSQ Lookahead is implemented over an example using load instructions. We extend each load address queue (LAQ) (which stores metadata for load instructions) entry with sector bits (SB). LSQ Lookahead works in two steps. First, when a new entry is allocated at the LAQ’s tail (❶), the load/store unit (LSU) compares the new entry’s cache block address (CB address) with each of the existing entries’ cache block addresses (❷). Second, when it finds a matching cache block address, it updates the existing entry’s sector bits by setting the bit that corresponds to the word referenced by the new entry (❸).
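The following is an illustrative simplification (not the actual LSU logic) of that two-step allocation, assuming 64-byte cache blocks and 8-byte words:

```cpp
// Sketch of LSQ Lookahead on the load address queue (LAQ): a new load merges
// its word reference into the sector bits of the oldest in-flight entry that
// targets the same cache block, so that entry's request fetches both words.
#include <cstdint>
#include <vector>

struct LAQEntry {
    uint64_t cache_block_addr;   // load address with the low 6 bits dropped (64-B blocks)
    uint8_t  sector_bits;        // words of the block this entry's request will fetch
};

void allocate_load(std::vector<LAQEntry>& laq, uint64_t load_addr) {
    uint64_t block = load_addr >> 6;                                // cache block address
    uint8_t  word  = uint8_t(1u << ((load_addr >> 3) & 0x7));       // referenced 8-byte word
    for (LAQEntry& older : laq) {                                   // scan oldest to youngest
        if (older.cache_block_addr == block) {
            older.sector_bits |= word;     // older request also prefetches this word
            break;
        }
    }
    laq.push_back({block, word});          // new entry allocates at the LAQ's tail
}
```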
Fig. 5.
Fig. 5. Two system components of Sectored DRAM: LSQ Lookahead (a) and SP (b).
Sector Predictor. Although LSQ Lookahead prevents some of the sector misses, it alone cannot significantly reduce the number of sector misses. This is because LSQs are typically not large enough to store many load/store instructions, and dependencies (e.g., data dependencies) prevent the processor core from computing the memory addresses of future load/store instructions. Thus, we require a more powerful mechanism to complement LSQ Lookahead and minimize sector misses. To this end, we develop SP.
SP, at a high level, records which words are used while a cache block is in the cache. The next time the same cache block misses, SP uses that signature to predict that the load would need the same words. SP leverages two key observations to accurately predict which words a load needs to access. First, the processor will “touch” (access or update) one or multiple words in a cache block from the moment a cache block is fetched to system caches (from main memory) until it is evicted to main memory. The touched words in a cache block will likely be touched again when the cache block is next fetched from main memory. Second, dynamic instances of the same static load/store instruction likely touch the same words in different cache blocks. For example, a static load/store instruction in a loop may perform strided accesses to the same word offset in different cache blocks. SP builds on a class of predictors referred to as spatial pattern predictors (e.g., [17, 62]). We tailor SP for predicting a cache block’s useful words (those that are referenced by the processor during the cache block’s residency in system caches), similar to what is done by Yoon et al. [132].
Figure 5(b) depicts the organization of the SP. The Sector History Table (SHT) stores the previously used sectors that identify the sectors (words) that were touched by the processor in a now evicted cache block in the L1 cache (❶). SHT is accessed with a table index that is computed by XOR-ing parts of the load/store instruction’s address with the word offset of the load/store instruction’s memory address upon an L1 cache miss (❷). We extend the L1 cache to store the table index and the currently used sectors (❸). The currently used sectors in the cache track which sectors are used during a cache block’s residency. The table index is used to update the previously used sectors in an entry in the SHT with the currently used sectors stored in the cache block upon the cache block’s eviction (❹).
We describe how SP operates (not shown in the figure) in five steps based on an example where a memory request accesses the L1 cache. First, when the memory request causes a cache miss or a sector miss, the SHT is queried with the table index to retrieve the previously used sectors. Second, the previously used sectors are added to the sector bits of the memory request and forwarded to the next level in the memory hierarchy. Third, the L1 miss allocates a new cache block in the L1 cache. Fourth, the table index of the newly allocated cache block is updated with the table index used to access the SHT, and the cache block’s currently used sectors are set to logic-0. Fifth, once the missing cache block is placed in the L1 cache, the cache block’s currently used sectors start tracking the words that are touched by future load/store instructions. When the same cache block is evicted from the L1 cache, the SHT entry corresponding to the cache block’s table index is updated with the currently used sectors.
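The sketch below is our simplified rendering of the SHT indexing, prediction, and training steps described above; the SHT size follows our default 512-entry configuration, while the exact hash is an assumption.

```cpp
// Sketch of the Sector Predictor's (SP) history table bookkeeping.
#include <array>
#include <cstdint>

constexpr int kSHTEntries = 512;                       // 512-entry SP (default)

struct SectorPredictor {
    std::array<uint8_t, kSHTEntries> sht{};            // previously used sectors

    // Table index: XOR parts of the instruction address (PC) with the word
    // offset of the missing access's memory address.
    static uint16_t index(uint64_t pc, uint64_t mem_addr) {
        uint64_t word_offset = (mem_addr >> 3) & 0x7;
        return static_cast<uint16_t>((pc ^ word_offset) % kSHTEntries);
    }

    // On an L1 cache/sector miss: predict which words the block will need.
    uint8_t predict(uint16_t idx, uint8_t demanded_sectors) const {
        return static_cast<uint8_t>(sht[idx] | demanded_sectors);
    }

    // On eviction of an L1 block: remember which words were actually touched.
    void train(uint16_t idx, uint8_t currently_used_sectors) {
        sht[idx] = currently_used_sectors;
    }
};
```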

6 Evaluation Methodology

In this section, we describe the workloads (Section 6.1), power model (Section 6.2), and simulation infrastructure used to evaluate Sectored DRAM (Section 6.3). Table 2 shows our system configuration that we simulate using Ramulator [60, 75, 104, 105]. Ramulator implements all standard DDR4 timing parameters. Our simulation infrastructure is open source [107].
Table 2.
Processor: 1–16 cores, 3.6-GHz clock frequency, 4-wide issue; 8 MSHRs per core, 128-entry issue window; 32-KiB L1, 256-KiB L2, 8-MiB L3 caches; Dynamic Power: 101.7 W [71], Static Power: 32.0 W [71]
Mem. Ctrl.: 64-entry read/write request queue, FR-FCFS-Cap [86] scheduling policy, Open-page row buffer policy, Auto-precharge on last read/write to a row, Row-Bank-Rank-Column-Channel address mapping [60]
DRAM: DDR4 [45], 3,200-MT/s data transfer rate, 1, 2, and 4 channels; 4 ranks, 16 banks/rank, 32K rows/bank; 64 subarrays/bank, 8 sectors/subarray; \(tRCD\)/\(tRAS\)/\(tRC\)/\(tFAW\): 13.75/35.00/48.75/25 ns; \(tRRD\_L\)/\(tRRD\_S\): 5.00/2.50 ns
Sectored DRAM: 128-entry LSQ Lookahead (default), 512-entry SP (default)
Table 2. System Configuration

6.1 Workloads

We use 41 workloads from SPEC2006 [118] (23 workloads), SPEC2017 [119] (12 workloads), and DAMOV [91] (six DRAM-bandwidth-bottlenecked representative application functions) to evaluate Sectored DRAM. For every workload, we generate memory traces corresponding to 100 million instructions from representative regions in each workload using SimPoint [32]. We classify the workloads into three memory-intensity categories (as also done by prior work [34, 91]), which Table 3 describes, using their observed last-level-cache (LLC) misses-per-kilo-instruction (MPKI).
Table 3.
LLC MPKI: Workloads
\(\ge\) 10 (High): ligraPageRank, mcf-2006, libquantum-2006, gobmk-2006, ligraMIS, GemsFDTD-2006, bwaves-2006, lbm-2006, lbm-2017, hashjoinPR
1–10 (Medium): omnetpp-2006, gcc-2017, mcf-2017, cactusADM-2006, zeusmp-2006, xalancbmk-2006, ligraKCore, astar-2006, cactus-2017, parest-2017, ligraComponents
\(\le\) 1 (Low): splash2Ocean, tonto-2006, xz-2017, wrf-2006, bzip2-2006, xalancbmk-2017, h264ref-2006, hmmer-2006, namd-2017, blender-2017, sjeng-2006, perlbench-2006, x264-2017, deepsjeng-2017, gromacs-2006, gcc-2006, imagick-2017, leela-2017, povray-2006, calculix-2006
Table 3. Evaluated Workloads
Note: “-2006/-2017” indicates SPEC.
Multi-Core Workloads. We create 2-, 4-, 8-, and 16-core workloads by replicating the same single-core workload over multiple cores. We also create 16 eight-core workload mixes for each memory-intensity category by randomly picking 8 single-core workloads from that category.

6.2 Power Model

DRAM Power Model. We use the Rambus Power Model [99, 125] to model a DDR4 chip (see Table 2) that supports Sectored DRAM. We modify the model to (i) activate a smaller number of sectors ( \(SA\) ) and (ii) reduce the burst size of data transfers for partially activated DRAM rows ( \(VBL\) ). Our model considers the power overheads introduced by the sector transistors and latches. Rambus Power Model computes and reports the current consumed by a sequence of DRAM commands (e.g., \(ACT\) , \(RD\) , \(WR\) , and \(NOP\) ). We use three command sequences described in one major DRAM manufacturer’s power calculation guide [100] to calculate three important current values: IDD0 ( \(ACT\) ), IDD4R ( \(READ\) ), and IDD4W ( \(WRITE\) ).
Processor Power Model. We use an IPC-based model [1, 132] to estimate the power consumed by the entire processor. The total power our eight-core processor consumes is equal to \(\frac{IPC}{4} \times Dynamic\ Power + Static \ Power\) . We comprehensively account for all Sectored DRAM power overheads. Our model includes the dynamic and static power consumed by the SP and the additional cache storage (modeled by CACTI [8]). Our system energy results represent the energy consumed by main memory and the entire processor during the execution of an evaluated workload.
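As an illustrative calculation (our example, not a reported result) with the dynamic and static power values above, a workload sustaining an IPC of 2 on our 4-wide cores would be estimated to consume
\[
\frac{IPC}{4} \times Dynamic\ Power + Static\ Power = \frac{2}{4} \times 101.7\,\mathrm{W} + 32.0\,\mathrm{W} \approx 82.9\,\mathrm{W}.
\]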

6.3 Performance and Energy

We evaluate Sectored DRAM’s performance and energy using a modified version of Ramulator [60, 75], a cycle-accurate, trace-based DRAM simulator, and a modified version of DRAMPower [13], a DRAM power and energy estimation tool. We extend Ramulator by implementing Sectored DRAM’s LSQ Lookahead and SP as described in Section 5. We modify how Ramulator enforces the \(t_{FAW}\) timing constraint. Our modification allows for 32 sectors (i.e., the number of sectors in four rows) to be activated within a \(t_{FAW}\) (e.g., the memory controller can activate 4 sectors from eight different rows and 8 sectors from four different rows). The rate of \(ACT\) commands is still constrained by the \(tRRD\_{L}\) and \(tRRD\_{S}\) parameters in our modified Ramulator model. Therefore, our memory controller issues up to 10 ACT commands (calculated as \(25.0 \,\mathrm{n}\mathrm{s}/2.5 \,\mathrm{n}\mathrm{s}\) ) in a \(tFAW\) window. To verify that the peak power draw imposed by the higher rate of finer-granularity \(ACT\) commands (e.g., an \(ACT\) command that activates only 1 sector from a row) does not increase the power requirements of a DRAM chip, we compare the power draw of (i) 4 ACT commands, each of which activates all 8 sectors in a row (the rate of ACT commands are constrained by \(tFAW\) ), to (ii) 10 ACT commands, each of which activates 1 sector in a row (the rate of ACT commands are constrained by \(tRRD\_S\) ), in a default \(tFAW\) long time window ( \(25 \,\mathrm{n}\mathrm{s}\) in our configuration) using Rambus Power Model [99]. We find that 10 ACT commands, each of which activates 1 sector from a row, consume 20.34% less power than 4 ACT commands, each of which activates all 8 sectors in a row (and as such, Sectored DRAM operates within the region that conventional DRAM is designed to operate). We modify DRAMPower by integrating Sectored DRAM’s current values (e.g., IDD0, IDD4R, and IDD4W) that we obtain from the modified Rambus Power Model.
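The following is a simplified sketch (our illustration, not Ramulator's actual code) of the relaxed \(t_{FAW}\) bookkeeping described above, in which the rolling window is limited by a 32-sector activation budget rather than by four \(ACT\) commands; \(tRRD\) enforcement is assumed to happen elsewhere.

```cpp
// Sketch: per-rank tFAW tracking by activated-sector budget instead of ACT count.
#include <cstdint>
#include <deque>

struct ActRecord { uint64_t cycle; int sectors; };

class SectoredFawTracker {
  public:
    SectoredFawTracker(uint64_t tfaw_cycles, int sector_budget = 32)
        : tfaw_(tfaw_cycles), budget_(sector_budget) {}

    // Can an ACT that opens `sectors` sectors be issued at cycle `now`?
    bool can_issue(uint64_t now, int sectors) {
        retire(now);
        return used_ + sectors <= budget_;   // tRRD_S/tRRD_L are checked elsewhere
    }

    void issue(uint64_t now, int sectors) {
        retire(now);
        history_.push_back({now, sectors});
        used_ += sectors;
    }

  private:
    void retire(uint64_t now) {              // drop ACTs older than one tFAW window
        while (!history_.empty() && now - history_.front().cycle >= tfaw_) {
            used_ -= history_.front().sectors;
            history_.pop_front();
        }
    }
    uint64_t tfaw_;
    int budget_, used_ = 0;
    std::deque<ActRecord> history_;
};
// Example: 10 single-sector ACTs fit in one window (10 <= 32), whereas only
// 4 full-row ACTs do (4 x 8 = 32 sectors), matching the description above.
```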
Performance Metrics. We measure single-workload performance using parallel speedup (i.e., the baseline single-core execution time divided by the multi-core Sectored DRAM execution time), which allows us to evaluate Sectored DRAM’s scalability for a single workload. We measure workload mix performance using weighted speedup [115], which allows us to evaluate Sectored DRAM’s system throughput [27] in a heterogeneous computing environment.

7 Evaluation Results

We evaluate Sectored DRAM’s impact on DRAM power, LLC MPKI, performance, energy, and DRAM area.

7.1 Impact on DRAM Power

Figure 6 shows Sectored DRAM’s impact on DRAM power consumption. We analyze the DRAM array power, DRAM peripheral circuitry power, and DRAM energy consumed by Sectored DRAM to perform \(ACT\), \(READ\), and \(WRITE\) DRAM operations for 8, 4, 2, and 1 sectors. Our results show that \(READ\) and \(WRITE\) power and energy greatly decrease as fewer sectors are read from or written to.
Fig. 6.
Fig. 6. DRAM command power and energy for varying number of sectors.
We make three observations from Figure 6. First, \(SA\) and \(VBL\) significantly improve \(READ\) and \(WRITE\) power consumption. We find that the power consumed by DRAM while reading from and writing to a sector is 70.0% and 70.6% smaller than reading from and writing to all sectors, respectively. This improvement is due to the (i) reduced sense amplifier activity in the DRAM array, (ii) reduced switching on the DRAM peripheral circuitry that transfers data between the DRAM array and the DRAM I/O, and (iii) smaller number of beats in a burst to transfer data between the DRAM module and the memory controller.
Second, activating only one sector greatly reduces the power consumed by the DRAM array compared to activating all eight sectors. Because \(SA\) enables activating a small set of DRAM sense amplifiers in a DRAM row, activating a single sector consumes 66.5% less DRAM array power compared to activating eight sectors. However, we find that activating one sector reduces the overall power consumption of an \(ACT\) operation by only 12.7% compared to the baseline DDR4 module. This effect is small since the power consumed by the peripheral circuitry makes up a large proportion of the activation power and is not affected by the number of sectors activated. Third, the circuitry required to implement \(SA\) incurs little activation power overhead. Compared to the baseline DDR4 module, \(SA\) increases activation power by only 0.26% due to additional switching activity in MWL drivers (Section 4.1).
Effects of DRAM Bus Frequency. We investigate how DRAM bus frequency affects the read power (IDD4R) relative to the activate power (IDD0). We repeat our experiments using 2\(\times\) the baseline bus frequency (i.e., a 3,200-MHz bus clock, or 6,400 MT/s). The read power (IDD4R) is \(12.39\times\) and \(12.42\times\) higher than the activate power (IDD0) at the baseline bus frequency (3,200 MT/s) and twice the baseline bus frequency (6,400 MT/s), respectively (this observation is in line with that of a prior work [22]).
We conclude that (i) Sectored DRAM’s fine-grained DRAM data transfer and activation provide a significant reduction in DRAM read and write power and energy and (ii) the bus frequency does not significantly affect DRAM read power relative to DRAM activate power.

7.2 Number of Sector Misses

To quantify the number of sector misses (see Section 5.1), we look at the LLC MPKI of workloads run with different Sectored DRAM configurations. Figure 7 plots the LLC MPKI for different LSQ Lookahead (LA<number>, where number is the number of entries looked ahead in the LSQ) and SP (SP<number>, where number is the number of entries in the SHT) configurations, along with the Basic Sectored DRAM configuration, which uses neither LSQ Lookahead nor SP. Each bar shows the average LLC MPKI across all evaluated workloads in each benchmark suite (x-axis; see Table 3 for a list of all workloads classified according to their LLC MPKI) for a Sectored DRAM configuration.
Fig. 7.
Fig. 7. LLC MPKI for different Sectored DRAM configurations.
We make three major observations. First, Sectored DRAM without LSQ Lookahead and SP (Basic) greatly increases the LLC MPKI of a workload, on average, by 3.1\(\times\), compared to the baseline, due to sector misses. Second, LSQ Lookahead reduces the number of LLC misses of Basic by 25%, 41%, and 51% by looking ahead 16, 128, and a currently very costly 2,048 younger entries in the LSQ, respectively. This is because LSQ Lookahead can identify the words that will be used in a cache block and retrieve these from DRAM with a single memory request. Third, LSQ Lookahead together with SP (LA128-SP512) reduces the number of LLC misses of Basic by 52% and of LA128 by 18%. LA128-SP512 performs as well as the currently very costly implementation of LSQ Lookahead that looks ahead 2,048 entries in the LSQ. LA128-SP512 does so as SP greatly reduces the number of additional LLC misses by recognizing intra-cache-block access patterns from previously performed memory requests and correctly predicting the words that will be used in a cache block.
We conclude that LSQ Lookahead with a 128 lookahead size together with SP minimizes the LLC misses caused by sector misses. We use the LA128-SP512 configuration in the remainder of our evaluation.

7.3 Single-Workload Performance and Energy

We evaluate Sectored DRAM’s performance and energy using (i) single-core workloads and (ii) 2-, 4-, 8-, and 16-core multi-programmed workloads made up of identical single-core workloads. We compare Sectored DRAM’s performance and system energy to a baseline coarse-grained DRAM system.
Microbenchmark Performance. Figure 8 shows the normalized parallel speedup of a random access (Random, left) and a strided streaming access (Stride, right) workload for the baseline system and Sectored DRAM (LA128-SP512). The Random workload (i) accesses one randomly determined word (8 bytes) in main memory by executing a load instruction every five instructions and (ii) has a very high LLC MPKI of 178.29. These two properties of Random make it a good fit for Sectored DRAM, as Random accesses only one sector in every cache line. The Stride workload (i) accesses every word address in a contiguous, 16-MiB memory address range with a stride of 64 bytes (i.e., Stride accesses the addresses [0, 64, 128, ..., 8, 72, 136, ..., 16, ...]) and (ii) has a very high LLC MPKI of 78.57. Stride is a poor fit for Sectored DRAM because every access to a word in a cache line results in a sector miss (none of the accessed 8-byte words are cached, and these words have to be fetched from main memory), where the large cache block reuse distance prevents LSQ Lookahead from prefetching all useful words in a cache block.
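The sketch below is an illustrative reconstruction of the two access patterns based on our reading of the description above; the actual microbenchmark code may differ.

```cpp
// Sketch: Random touches one random word per load; Stride sweeps a 16-MiB
// region at a 64-byte stride, touching one word per cache block per pass.
#include <cstddef>
#include <cstdint>
#include <cstdlib>

uint64_t random_pattern(const uint64_t* mem, size_t words, size_t accesses) {
    uint64_t sum = 0;
    for (size_t i = 0; i < accesses; ++i)
        sum += mem[std::rand() % words];          // one load to a random 8-byte word
    return sum;
}

uint64_t stride_pattern(const uint64_t* region) { // region is 16 MiB
    const size_t kBytes = 16u << 20;
    uint64_t sum = 0;
    for (size_t word_off = 0; word_off < 64; word_off += 8)      // offsets 0, 8, ..., 56
        for (size_t base = 0; base < kBytes; base += 64)         // 64-byte stride
            sum += region[(base + word_off) / 8]; // one word per cache block per pass
    return sum;
}
```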
Fig. 8.
Fig. 8. Normalized parallel speedup of random and strided streaming microbenchmarks for varying numbers of cores.
We make two major observations from Figure 8. First, Sectored DRAM provides significant performance benefits for workloads that randomly access words (e.g., Random). Sectored DRAM’s performance benefits increase with the number of cores (i.e., with increasing LLC MPKI) for Random because a larger fraction of all memory requests (random word accesses) benefit from Sectored DRAM’s reduction in \(tFAW\) (Section 4.1). Sectored DRAM provides 1.11 \(\times {}\) , 1.69 \(\times {}\) , 1.87 \(\times {}\) , 1.87 \(\times {}\) , and 1.87 \(\times {}\) normalized parallel speedup for 1, 2, 4, 8, and 16 cores, respectively, for Random. Second, Sectored DRAM reduces system performance for workloads that frequently cause sector misses (e.g., Stride). Sectored DRAM provides 0.67 \(\times {}\) , 0.95 \(\times {}\) , 1.00 \(\times {}\) , 1.00 \(\times {}\) , and 1.00 \(\times {}\) the normalized parallel speedup of Baseline for 1, 2, 4, 8, and 16 cores, respectively, for Stride. Sectored DRAM’s performance becomes closer to Baseline as the number of cores increases. This is because the LLC is not large enough to store all cache lines accessed by 4 or more cores for Stride (i.e., Baseline accesses main memory to retrieve each word, similarly to Sectored DRAM).
Performance. The two lines in Figure 9(a) show the normalized parallel speedup (on the primary/left y-axis) of three representative high MPKI (top row), medium MPKI (middle row), and low MPKI (bottom row) workloads for the baseline system (solid lines) and Sectored DRAM (dashed lines). Figure 9(b) (top row) shows the distribution of normalized parallel speedups of all high, medium, and low MPKI workloads.
Fig. 9. Sectored DRAM system performance and energy.
We make three observations from Figure 9. First, Sectored DRAM provides higher parallel speedup than the baseline for high MPKI workloads when the number of cores is larger than 2. For example, Sectored DRAM provides 26% higher parallel speedup than the baseline for all 16-core high MPKI workloads on average.12 As the average row buffer hit rate for 16-core high MPKI workloads is only 18%, the memory controller needs to issue many \(ACT\) commands to serve the memory requests. Sectored DRAM's \(t_{FAW}\) reduction (Section 4.1) allows the memory controller to issue the large number of \(ACT\) commands required by these workloads (i.e., \(ACT\)s for 82% of all main memory requests) at a higher rate, reducing the average memory access latency (by 25% on average for 16-core high MPKI workloads). Second, Sectored DRAM, on average, provides a smaller parallel speedup than the baseline for low and medium MPKI workloads. Although Sectored DRAM's \(t_{FAW}\) reduction lowers the proportion of processor cycles during which the memory controller stalls to satisfy the \(t_{FAW}\) timing parameter from 14.4% in the baseline to 6.5% in Sectored DRAM for 16-core low and medium MPKI workloads, the average memory latency for these workloads increases by 0.5% compared to the baseline. Moreover, for these workloads, sector misses increase the number of memory requests on average by 69%. Because a larger number of memory requests experience higher latencies in Sectored DRAM than in the baseline, Sectored DRAM provides a smaller parallel speedup for these workloads. Third, Sectored DRAM incurs a 5.41% performance overhead on average across all single-core workloads. We attribute this to sector misses, which increase the number of memory requests and the average memory latency.
System Energy Consumption. The bars in Figure 9(a) show the system energy consumption (on the secondary/right y-axis) of Sectored DRAM normalized to the system energy consumption of the baseline. Figure 9(b) (bottom row) shows the distribution of normalized system energy consumption for workloads from the three categories (low, medium, and high MPKI). We make two observations. First, Sectored DRAM reduces system energy consumption for high MPKI workloads when the number of cores is larger than 2. On average, at 16 cores, high MPKI workloads' system energy consumption is reduced by 20%. Sectored DRAM achieves this by a combination of (i) reduced DRAM energy consumption due to \(SA\) and \(VBL\) (we present a detailed breakdown of \(SA\) and \(VBL\)'s effects on DRAM energy consumption in Section 7.4) and (ii) reduced background energy consumption of the computing system as workloads execute faster. Second, for medium and low MPKI workloads, Sectored DRAM increases system energy consumption. We observe that Sectored DRAM increases the average DRAM energy consumption by 12% for 16-core medium/low MPKI workloads. The increase in DRAM energy consumption, together with the increase in background energy consumption of the computing system (as workloads execute slower, see Figure 9(a)), increases the system energy consumption for these workloads.
We conclude that Sectored DRAM improves system performance and reduces system energy consumption in high MPKI workloads where (i) a high number of \(ACT\)s targets different DRAM banks (i.e., the workload is bound by \(t_{FAW}\)) and (ii) the SP can accurately predict the used words.
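For illustration, the \(t_{FAW}\)-related scheduling benefit can be sketched as a sliding-window check in the memory controller. The C++ sketch below (with hypothetical names) charges each \(ACT\) a cost proportional to the fraction of sectors it activates; this proportional accounting is only an illustrative assumption, and the actual relaxed \(t_{FAW}\) behavior of Sectored DRAM is derived in Section 4.1.

    #include <cstdint>
    #include <deque>
    #include <utility>

    // Conventional DDR4 rule: at most four full-row ACTs to a rank within tFAW.
    // Illustrative assumption: an ACT that opens fewer sectors draws less
    // activation current and is charged a proportionally smaller share of the
    // four-activation budget (actual relaxed timings follow Section 4.1).
    struct FawWindow {
        uint64_t tFawCycles;  // tFAW expressed in memory controller cycles
        std::deque<std::pair<uint64_t, double>> recentActs;  // <issue cycle, cost>

        // cost: fraction of a row's sectors the ACT opens
        // (1.0 for a full-row ACT, 0.125 for a single-sector ACT with 8 sectors).
        bool canIssueAct(uint64_t now, double cost) {
            while (!recentActs.empty() && now - recentActs.front().first >= tFawCycles)
                recentActs.pop_front();  // drop ACTs that left the tFAW window
            double budgetUsed = cost;
            for (const auto& a : recentActs)
                budgetUsed += a.second;
            return budgetUsed <= 4.0;    // four "full-row equivalents" per window
        }

        void recordAct(uint64_t now, double cost) { recentActs.emplace_back(now, cost); }
    };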

7.4 Multi-Programmed Workload Performance and Energy

We evaluate Sectored DRAM’s performance and energy using multi-programmed workload mixes. To stress DRAM and the cache hierarchy, we use high MPKI workload mixes. We compare Sectored DRAM’s performance and main memory access energy to a baseline coarse-grained DRAM system and three state-of-the-art fine-grained DRAM mechanisms: (i) Fine-Grained Activation (FGA) [19, 124], (ii) Partial Row Activation (PRA) [69], and (iii) HalfDRAM [136] and HalfPageDRAM [31].13 Unless otherwise stated, HalfDRAM refers to the evaluated performance and energy of both HalfDRAM and HalfPageDRAM in the rest of the article.
Performance. Figure 10 (top) shows the weighted speedup [21, 27, 57] of 16 workload mixes for Sectored DRAM and the three state-of-the-art fine-grained DRAM mechanisms, normalized to the baseline system. We make four observations. First, Sectored DRAM’s weighted speedup is 1.17 \(\times\) (1.36 \(\times\)) that of the baseline, on average (at maximum), across all workload mixes. This is due to Sectored DRAM’s \(t_{FAW}\) reduction: Sectored DRAM serves \(READ\) requests faster (its average DRAM read latency is approximately 25% lower than the baseline’s) and thus improves the performance of these memory-intensive workloads. Second, Sectored DRAM greatly outperforms naive fine-grained DRAM mechanisms (i.e., FGA [19, 124]). We observe that Sectored DRAM’s weighted speedup is 2.05 \(\times\) that of FGA, on average, across all workloads. FGA mechanisms greatly reduce the throughput of DRAM data transfers, as they are limited to fetching a cache block from a single mat (Section 3.1). Third, Sectored DRAM’s weighted speedup is 1.10 \(\times\) that of PRA, on average, across all workloads. Sectored DRAM outperforms PRA by enabling fine-grained DRAM access and activation for both \(READ\) and \(WRITE\) operations, whereas PRA is limited to \(WRITE\) operations.
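For clarity, weighted speedup is computed in the standard way [21, 27, 57]: \(\mathrm{WS} = \sum_{i=1}^{N} \mathrm{IPC}_{i}^{shared} / \mathrm{IPC}_{i}^{alone}\), where \(\mathrm{IPC}_{i}^{shared}\) is workload \(i\)’s IPC when all \(N\) workloads in the mix run together and \(\mathrm{IPC}_{i}^{alone}\) is its IPC when it runs alone on the same system configuration.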
Fig. 10. Weighted speedup improvement over the baseline (higher is better) on top. DRAM energy normalized to the baseline (lower is better) on the bottom.
Fourth, Sectored DRAM’s weighted speedup is 0.89 \(\times\) that of HalfDRAM, on average, across all workloads. Sectored DRAM cannot improve performance as much as HalfDRAM, because in Sectored DRAM, the memory controller needs to service additional memory requests caused by sector misses. However, as we show next, HalfDRAM’s higher performance benefits come at the cost of higher area overheads (Section 7.5) and lower energy savings than Sectored DRAM.
DRAM Energy Consumption. Figure 10 (bottom) shows the DRAM energy consumption of each workload mix for Sectored DRAM and the state-of-the-art mechanisms, normalized to the DRAM energy consumption of the baseline system. We observe that (i) Sectored DRAM significantly reduces DRAM energy consumption compared to the baseline, by up to 33% (20% on average), and (ii) Sectored DRAM enables larger DRAM energy savings than prior works. On average, across all workload mixes, Sectored DRAM reduces DRAM energy consumption by 84%, 13%, and 12% compared to FGA, PRA, and HalfDRAM, respectively.
We analyze the impact of Sectored DRAM on the energy consumed by DRAM operations. Figure 11 (left) shows the DRAM energy broken down into \(ACT\), background, and \(RD/WR\) consumption, normalized to the baseline system’s DRAM energy consumption and averaged across all workload mixes. We make two observations.
Fig. 11. DRAM energy breakdown normalized to total DRAM energy consumed by the baseline (left). System energy normalized to the baseline (right) for different workload mixes.
First, \(VBL\) greatly reduces \(RD/WR\) energy, by 51% on average. Using \(VBL\), the system retrieves only the required (and predicted-to-be-required) words of a cache block from the DRAM module. On average, Sectored DRAM reduces the number of bytes transferred between the memory controller and the DRAM module by 55% (not shown) compared to the baseline. In this way, the system uses the power-hungry memory channel more energy-efficiently, eliminating unnecessary data movement. Second, \(SA\) reduces the energy spent on activating DRAM rows by 6% on average. The reduction in \(ACT\) energy is relatively small because the memory controller issues more \(ACT\) commands in Sectored DRAM than in the baseline. The additional \(ACT\) commands (i) serve the additional memory requests caused by sector misses and (ii) resolve the row conflicts that occur due to interference created by the sector misses.
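To first order (a simplified approximation in the style of IDD-based DRAM power estimation [100], not our exact power model), a DRAM chip’s read/write burst energy scales with the number of burst beats transferred: \(E_{RD/WR} \approx N_{beats} \times (I_{DD4} - I_{DD3N}) \times V_{DD} \times t_{beat}\). Because \(VBL\) shortens the burst to only the requested sectors, \(N_{beats}\) (and thus \(E_{RD/WR}\)) shrinks roughly in proportion to the 55% reduction in transferred bytes, consistent with the 51% average reduction in \(RD/WR\) energy.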
System Energy Consumption. Figure 11 (right) shows the energy consumed by the Sectored DRAM system (processor and DRAM) normalized to the energy consumed by the baseline system for all workloads. We observe that Sectored DRAM reduces system energy consumption by 14% on average (23% at maximum). Sectored DRAM does so by (i) reducing DRAM energy consumption and (ii) reducing the background energy consumed by the processor as workloads execute faster.

7.5 Area Overhead

Modeled DRAM Chip. We use CACTI [8] to model the area of a DRAM chip (see Table 2) using 22-nm technology. Our model is open source [107]. In our modeled DRAM chip, each bank takes up (i) \(8.3 \,\mathrm{m}\mathrm{m}^{2}\,\) for DRAM cells, (ii) \(3.2 \,\mathrm{m}\mathrm{m}^{2}\,\) for wordline drivers, (iii) \(4.6 \,\mathrm{m}\mathrm{m}^{2}\,\) for sense amplifiers, (iv) \(0.1 \,\mathrm{m}\mathrm{m}^{2}\,\) for the row decoder, (v) \(\lt\) \(0.1 \,\mathrm{m}\mathrm{m}^{2}\,\) for the column decoder, and (vi) \(\lt\) \(0.4 \,\mathrm{m}\mathrm{m}^{2}\,\) for the data and address buses.
Sectored DRAM. We model the overhead of (i) eight additional LWD stripes, (ii) sector transistors, (iii) sector latches, and (iv) wires that propagate sector bits from sector latches to sector transistors to implement SA (Section 4.1). Sectored DRAM introduces \(2.26\%\) area overhead (\(0.39 \,\mathrm{m}\mathrm{m}^{2}\,\)) over the baseline DRAM bank. Overall, Sectored DRAM increases the area of the chip (16 banks and I/O circuitry) by only \(1.72\%\).
FGA [19, 124] and PRA [69]. We estimate the area overhead of these architectures to be the same as Sectored DRAM because they require the same set of modifications to the DRAM array to enable Fine-DRAM-Act.
HalfDRAM [136] and HalfPageDRAM [31]. We estimate the chip area overheads of HalfDRAM and HalfPageDRAM as 2.6% and 5.2%, respectively. Both HalfDRAM and HalfPageDRAM require eight additional LWD stripes, as Sectored DRAM does. HalfDRAM further requires double the number of CSL signals [136] to enable its mirrored connection, and HalfPageDRAM requires doubling the number of HFFs per mat [31].
Processor. We use CACTI to model the storage overhead of sector bits in caches (1 byte/cache block) and the SP (1,088 bytes/core). The sector bits (200 KiB additional storage for a system with 12.5 MiB cumulative L1, L2, and L3 cache capacity) and the predictor storage increase the area of the eight-core processor by \(1.22\%\) .
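The sector bit storage follows directly from the cache capacity: \(12.5\,\mathrm{MiB} / 64\,\mathrm{B}\) per cache block \(= 204{,}800\) cache blocks, and 1 byte of sector bits per block yields \(204{,}800\,\mathrm{B} = 200\,\mathrm{KiB}\) of additional storage.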

8 Discussion

8.1 Sectored DRAM with More Memory Channels

We evaluate two- and four-channel systems to study Sectored DRAM’s impact on memory performance in systems equipped with more than one memory channel. Figure 12 shows the normalized parallel speedup of all high, medium, and low LLC MPKI homogeneous workloads for varying numbers of cores on the x-axis. Different boxes show Baseline and Sectored DRAM configurations with one, two, and four memory channels.
Fig. 12. Normalized parallel speedup of all low, medium, and high LLC MPKI homogeneous workloads for varying numbers of cores.\(^{11}\)
We observe that more memory channels increase system performance, on average, across all workloads for both Baseline and Sectored DRAM. For example, for high MPKI 16-core workloads, Sectored DRAM provides 2.74 \(\times {}\), 5.09 \(\times {}\), and 8.94 \(\times {}\) average normalized parallel speedup for one-channel, two-channel, and four-channel systems, respectively.

8.2 Non-Memory-Intensive Workloads

Sectored DRAM’s current system integration leads to varying performance benefits depending on the workload’s memory intensity and can degrade performance for non-memory-intensive workloads. We propose two approaches to overcome this performance degradation.
Dynamically Turning Sectored DRAM Off. Sectored DRAM can be turned off while the system executes non-memory-intensive (i.e., low and medium MPKI) workloads. To do so, we leverage the performance counters already present in modern processors [6, 40, 41]: we (i) periodically (every 1,000 cycles) compute the average occupancy of the memory controller’s read request queue (i.e., the average number of requests in the read request queue) and (ii) turn Sectored DRAM on for the next 1,000 cycles if the average occupancy exceeds an empirically determined threshold of 30, and turn Sectored DRAM off otherwise. Figure 13 shows the weighted speedup for 16 workload mixes from each memory intensity category, normalized to the coarse-grained DRAM baseline. We show results for two configurations: Always ON never turns Sectored DRAM off, and Dynamic turns Sectored DRAM on and off as described above.
Fig. 13. Weighted speedup for Always ON and Dynamic.
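A minimal C++ sketch of the Dynamic policy described above (hypothetical names; the 1,000-cycle window and the occupancy threshold of 30 are the empirically determined values stated earlier):

    #include <cstdint>

    // Tracks read request queue occupancy over a fixed window and decides
    // whether Sectored DRAM is enabled for the next window.
    struct DynamicSectoredDramPolicy {
        static constexpr uint64_t kWindowCycles       = 1000;
        static constexpr double   kOccupancyThreshold = 30.0;

        uint64_t cyclesInWindow  = 0;
        uint64_t occupancySum    = 0;   // sum of per-cycle read queue occupancies
        bool     sectoredEnabled = false;

        // Called once per memory controller cycle with the current occupancy.
        void tick(unsigned readQueueOccupancy) {
            occupancySum += readQueueOccupancy;
            if (++cyclesInWindow == kWindowCycles) {
                double avgOccupancy = double(occupancySum) / double(kWindowCycles);
                sectoredEnabled = (avgOccupancy > kOccupancyThreshold);
                cyclesInWindow = 0;
                occupancySum   = 0;
            }
        }

        // When disabled, memory requests fall back to full cache block transfers.
        bool useSectoredAccess() const { return sectoredEnabled; }
    };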
We observe that the Dynamic configuration allows Sectored DRAM to perform as well as the baseline for non-memory-intensive workloads while maintaining better performance than the baseline for memory-intensive workloads. Dynamic provides higher speedups than Always ON, on average, across all workload mixes and classes, even though Dynamic provides slightly lower speedups than Always ON for highly memory-intensive workloads.
Better Sector Prediction. Improving the SP’s accuracy would reduce the additional LLC misses. A more sophisticated SP could be developed by tracking the access patterns of instructions with deeper history, or other techniques (e.g., reinforcement learning [10, 42], perceptron-based prediction [9, 29, 47, 48, 49, 50, 123]) could be used to predict the useful words.

8.3 Sectored DRAM with Prefetching

We implement Sectored DRAM support in a simple region-based single-stride prefetcher (based on other works [38, 117]) to demonstrate Sectored DRAM’s performance in a system with prefetching enabled. We model two new system configurations: Baseline-Prefetch and Sectored DRAM-Prefetch. Baseline-Prefetch incorporates the simple prefetcher, which partitions the physical memory address space into 4-KiB (i.e., page-sized) regions and assigns a stream to each region. A region is trained after four consecutive memory accesses with the same stride, after which the prefetcher starts issuing prefetch requests for subsequent LLC accesses (hits and misses) that target the memory region. We configure the prefetcher to have a degree of 4, and the first prefetch request targets four cache blocks ahead of the memory request that accesses the LLC. Sectored DRAM-Prefetch augments this prefetcher with support for sector bits: it sets the sector bits of a prefetch request based on the sector bits of the memory request that triggered the prefetch (i.e., the request that hit or missed in the LLC). For example, if a memory request asks for the first three sectors of a cache line, the prefetch requests also ask for the first three sectors of their respective cache lines.
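The sector bit propagation in Sectored DRAM-Prefetch can be sketched as follows (C++, hypothetical names; the degree and distance parameters match the description above, and the exact prefetcher implementation may differ).

    #include <cstdint>
    #include <vector>

    struct PrefetchRequest {
        uint64_t blockAddr;   // cache-block-aligned target address
        uint8_t  sectorBits;  // which 8-byte words to fetch from DRAM
    };

    // Issued on an LLC access (hit or miss) to a trained 4-KiB region.
    // Each of the 'degree' prefetches inherits the sector bits of the
    // triggering demand request and targets blocks starting 'distance'
    // strides ahead of it.
    std::vector<PrefetchRequest> issuePrefetches(uint64_t demandBlockAddr,
                                                 uint8_t  demandSectorBits,
                                                 int64_t  strideBlocks,
                                                 unsigned degree   = 4,
                                                 unsigned distance = 4) {
        std::vector<PrefetchRequest> requests;
        for (unsigned d = 0; d < degree; ++d) {
            int64_t offset = strideBlocks * int64_t(distance + d);
            uint64_t target = uint64_t(int64_t(demandBlockAddr) + offset);
            requests.push_back({target, demandSectorBits});  // same words as the demand
        }
        return requests;
    }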
Figure 14 shows the normalized speedup (x-axis) for Baseline and Sectored DRAM (with and without prefetching) across all evaluated single-core workloads.
Fig. 14. Normalized speedup (x-axis) of all single-core workloads for Baseline and Sectored DRAM.11
We observe that Sectored DRAM-Prefetch improves system performance by 1.07% (37.56%) on average (at maximum) compared to Sectored DRAM without prefetching.14 We conclude that the simple stride prefetcher design improves Sectored DRAM’s performance. We leave the design and analysis of a more sophisticated Sectored DRAM prefetcher for future work.

8.4 Finer-Granularity Sector Support

We design and evaluate Sectored DRAM with eight sectors. Extending Sectored DRAM to support more sectors could enable higher energy and performance benefits but would require (i) transferring additional sector bits to DRAM from the memory controller and (ii) more DRAM circuit area to place additional sector latches.
Our implementation allows us to transfer up to 14 sector bits with each \(PRE\) command to DRAM (Section 4.1). To transfer more than 14 sector bits, (i) DRAM command encoding could be extended with new signals to carry the sector bits (e.g., another signal/pin for every additional sector bit), (ii) a new DRAM command with enough space allocated in its encoding for sector bits could be implemented, or (iii) sector bits for a single \(SA\) operation could be transferred to DRAM over multiple \(PRE\) commands.
We evaluate the area required by additional sector latches and find it to be very small. Implementing eight more sector latches brings Sectored DRAM’s DRAM chip area overhead from 1.72% to 1.78%.

8.5 DRAM ECCs

SECDED ECC [84], used in today’s systems, is naturally compatible with Sectored DRAM. Some systems make use of more specialized, Chipkill-like ECC (e.g., single symbol error correction (SSC) [5, 18, 130]). Sectored DRAM can easily support the SSC scheme. In this scheme, the ECC codeword consists of 32 4-bit data symbols and four 4-bit ECC symbols, for a total size of 144 bits; 128 data bits are encoded to form the ECC symbols. The DRAM module consists of sixteen \(\times\)4 (4-bit-wide) chips that store data symbols and two \(\times\)4 chips that store ECC symbols. The module transmits 72 bits with each beat of the data burst, so an ECC codeword is transmitted over two beats of a data burst. To support the SSC scheme, Sectored DRAM can use burst lengths that are multiples of 2, which allows the DRAM module to transmit whole ECC codewords with every DRAM access.
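The burst length constraint follows from the codeword geometry: \(32 \times 4 + 4 \times 4 = 144\) bits per codeword, and the 18-chip module (sixteen data chips and two ECC chips, each 4 bits wide) delivers \(18 \times 4 = 72\) bits per beat, so exactly two beats carry one whole codeword. Hence, any burst length that is a multiple of 2 transfers only whole ECC codewords.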
Recent DRAM chips implement on-die ECC, which allows the DRAM chip to correct errors transparently to the memory controller [93, 94, 95]. Sectored DRAM is compatible with on-die ECC schemes that operate at the granularity of a single sector (e.g., 8 bits per chip). To develop an on-die ECC scheme for Sectored DRAM, existing on-die ECC schemes could be modified to operate at the granularity of a single sector, or new ECC schemes could be developed. We leave such exploration for future work.

8.6 Sector Cache Benefits

We use sector caches to integrate Sectored DRAM into a full system. A comprehensive design space exploration for sector caches is outside the scope of this work (we refer the reader to other works [4, 7, 37, 51, 63, 73, 74, 97, 101, 103, 112]). Sector caches incur chip area costs but offer power and performance benefits beyond those that we demonstrate in our work (using Sectored DRAM). For example, powering off SRAM subarrays that contain invalid words in a cache block could save power (e.g., [97]), or filling up these invalid words with valid words from other cache blocks could increase effective cache capacity and improve system performance (e.g., [63]).

9 Related Work

Sectored DRAM is the first low-cost and high-performance DRAM substrate that alleviates wasted energy in main memory by enabling (i) Fine-DRAM-Transfer and (ii) Fine-DRAM-Act. We extensively compare Sectored DRAM qualitatively and quantitatively with the most relevant, low-cost, state-of-the-art fine-grained DRAM architectures [19, 31, 69, 124, 136] in Section 3.1 and Section 7.4. In this section, we discuss other related works.
Other FGA Mechanisms. Prior works [3, 16, 92, 116, 135] propose other fine-grained DRAM architectures. These architectures require intrusive reorganization of the DRAM array and/or modifications to the DRAM on-chip interconnect. Some works [16, 92, 116] target higher-bandwidth DRAM standards and offer significant performance and activation energy improvements. Zhang and Guo [135] develop a new interconnect that serializes data from multiple partially activated banks. Alawneh et al. [3] divide DRAM mats into submats by adding more HFFs to mitigate the throughput loss of FGA. These works do not reduce the energy wasted on the memory channel by avoiding the transfer of unused words, which Sectored DRAM does at low chip area cost.
DRAM-Module-Level Fine-DRAM-Transfer. A class of prior work [2, 12, 126, 131, 132, 137] proposes new DRAM module designs (e.g., subranked DIMMs [2, 126, 131, 132]) that allow independent operation of each DRAM chip in a DRAM module (Section 2.1) to implement Fine-DRAM-Transfer. From a system standpoint, Sectored DRAM is a more practical mechanism to implement compared to module-level mechanisms because Sectored DRAM requires no modifications to the physical DRAM interface. Similar to Sectored DRAM, these mechanisms (e.g., DGMS [132]) require modifications to DRAM (i.e., modifications to the DRAM module in DGMS and to the DRAM chip in Sectored DRAM) and the processor. However, on top of these modifications, module-level mechanisms also require modifications to the physical DRAM interface (e.g., three additional pins to select one of the eight chips in a rank [2]), thus making module-level mechanisms incompatible with current industry standards (i.e., JEDEC specifications [45]). In contrast, Sectored DRAM does not require modifications to the physical DRAM interface, and thus Sectored DRAM chips comply with existing DRAM industry standards and specifications.
We quantitatively evaluate a subranked DIMM design (DGMS [132]) that can be implemented with minimal modifications to the physical DRAM interface. This design can operate subranks independently, and each subrank can receive one DRAM command per DRAM command bus cycle (i.e., 1 \(\times\) ABUS scheme [131]). We find that this design reduces system performance for the high MPKI workload mixes, causing a 23% reduction in weighted speedup on average. Even though the subranked DIMM allows requests to be served from different subranks in parallel, the DRAM command bus bandwidth is insufficient to allow timely scheduling of these requests to different subranks [132] (i.e., the command bus becomes the bottleneck). The DRAM command bus bandwidth can be increased to enable higher-performance subranked DIMM designs. However, this comes at additional hardware cost and modifications to the physical DRAM interface [132]. In contrast, Sectored DRAM improves the weighted speedup for the same set of workloads by 17% on average and requires no modifications to the physical DRAM interface.

10 Conclusion

We designed Sectored DRAM, a new, high-throughput, energy-efficient, and practical fine-grained DRAM architecture. Compared to prior fine-grained DRAM architectures, our design significantly improves both system energy efficiency and performance. It does so by eliminating (i) the energy waste caused by transferring unused words between the processor and DRAM and (ii) the energy spent on activating DRAM cells that are not accessed by memory requests. Activating a smaller number of cells also allows the memory controller to serve memory requests with lower memory access latency. While effective at improving both system energy efficiency and performance, Sectored DRAM is practical and can be implemented at low hardware cost.

Acknowledgments

We thank the anonymous reviewers of MICRO 2022, HPCA 2023, and TACO for their feedback. We thank the SAFARI Research Group members for providing a stimulating intellectual environment. We acknowledge the generous gifts from our industrial partners, including Google, Huawei, Intel, and Microsoft.

Footnotes

1
This class of prior works either do not enable fine-grained data transfer (i.e., they perform data transfers at cache block granularity) [19, 31, 124, 136] or do not enable fine-grained data transfer for both read and write operations [69].
2
We use the word sector to distinguish between what exists today in DRAM chips (mats) and what we propose in Sectored DRAM (sectors).
3
We refer the reader to various prior works [11, 14, 15, 23, 30, 35, 36, 53, 54, 55, 56, 59, 65, 66, 67, 68, 76, 89, 92, 108, 109, 110, 128, 134, 136] for a more detailed description of the DRAM architecture.
4
We consider DRAM subarrays with eight sectors. See Section 8.4 for a discussion of a finer-granularity activation mechanism.
5
We determine \(t_{RP}\) to be sufficiently long based on the overall latency of a READ command in a conventional DRAM chip. A READ command (i) propagates from the memory controller to the DRAM chip, (ii) accesses data in a portion of the row buffer in the corresponding bank, and (iii) sends the data back to the memory controller. In the DDR4 standard [45], the latency between issuing a READ and the first data beat appearing on the data bus is defined as \(tAA\) ( \(12.5 \,\mathrm{n}\mathrm{s}\) ). Because sector bits need only to propagate from the memory controller to a DRAM bank (and not back to the memory controller from a bank), the \(t_{RP}\) timing parameter is likely longer than what is needed (which we estimate as half \(tAA\) or \(6.25 \,\mathrm{n}\mathrm{s}\) ) for sector bits to propagate from the memory controller to a DRAM bank.
6
Commodity DDR4 chips already implement a relatively constrained version of \(VBL\) called burst-chop. Burst-chop enables 256-bit (in four-beat burst) data transfers [45].
7
In Section 8, we discuss how Sectored DRAM can be integrated into systems with different parameters (e.g., more sectors per chip).
8
The LSQ stores the necessary metadata (e.g., register destination identifiers, operands, and virtual and physical addresses of memory operands) to correctly execute and commit load and store instructions. We refer the readers to other works [113, 114, 129] for implementation details of the LSQ in modern microprocessors.
9
Due to aliasing in SHT index computation, multiple cache blocks can point to the same SHT entry. SHT receives conflicting updates if two or more such cache blocks are evicted from L1 in the same clock cycle. SHT selects and applies only one update among conflicting updates.
10
Due to our trace generation infrastructure limitations, we evaluate only those workloads for which we could successfully generate traces.
11
Each box is lower-bounded by the first quartile and upper-bounded by the third quartile. The median falls within the box. The inter-quartile range is the distance between the first and third quartiles (i.e., box size). Whiskers extend to the minimum and maximum data point values on either side of the box, whereas a bubble depicts average values.
12
We observe a decrease in normalized parallel speedup for hmmer-2006 and mcf-2017 as the number of cores increases from 8 to 16. This is because the two workloads, when run on 16 cores, contend for memory to a higher degree than when run on 8 cores in the simulated system. For example, 16-core hmmer-2006 has 24.41% higher average memory request latency than 8-core hmmer-2006.
13
We estimate HalfPageDRAM’s [31] performance and energy benefits using HalfDRAM [136], as both techniques enable half-size DRAM row activations at high DRAM throughput.
14
Performance drop induced by the prefetcher can be curbed using prefetch throttling techniques [24, 25, 26] (e.g., Feedback Directed Prefetching [117]).

References

[1]
Jung Ho Ahn et al. 2009. Future scaling of processor-memory interfaces. In SC.
[2]
Jung Ho Ahn et al. 2009. Multicore DIMM: An energy efficient memory module with independently controlled DRAMs. IEEE Computer Architecture Letters 8, 1 (2009), 5–8.
[3]
Tareq Alawneh, Raimund Kirner, and Catherine Menon. 2021. Dynamic row activation mechanism for multi-core systems. In CF.
[4]
D. B. Alpert et al. 1988. Performance trade-offs for microprocessor cache memories. IEEE Micro 8, 4 (1988), 44–54.
[5]
AMD. 2013. BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors. AMD.
[7]
C. Anderson et al. 1995. Two techniques for improving performance on bus-based multiprocessors. In HPCA.
[8]
Rajeev Balasubramonian et al. 2017. CACTI 7: New tools for interconnect exploration in innovative off-chip memories. ACM TACO 14, 2 (2017), Article 14, 25 pages.
[9]
Rahul Bera et al. 2022. Hermes: Accelerating long-latency load requests via perceptron-based off-chip load prediction. In MICRO.
[10]
Rahul Bera et al. 2021. Pythia: A customizable hardware prefetching framework using online reinforcement learning. In MICRO.
[11]
Ishwar Bhati et al. 2015. Flexible auto-refresh: Enabling scalable and energy-efficient DRAM refresh reductions. In ISCA.
[12]
Tony M. Brewer. 2010. Instruction set innovations for the Convey HC-1 computer. IEEE Micro 30, 2 (2010), 70–79.
[13]
Karthik Chandrasekar et al. 2012. DRAMPower: Open-Source DRAM Power & Energy Estimation Tool. Retrieved June 17, 2024 from http://www.drampower.info
[14]
Kevin K. Chang et al. 2016. Understanding latency variation in modern DRAM chips: Experimental characterization, analysis, and optimization. In SIGMETRICS.
[15]
Kevin K. Chang et al. 2017. Understanding reduced-voltage operation in modern DRAM devices: Experimental characterization, analysis, and mechanisms. In SIGMETRICS.
[16]
Niladrish Chatterjee et al. 2017. Architecting an energy-efficient DRAM system for GPUs. In HPCA.
[17]
C. F. Chen et al. 2004. Accurate and complexity-effective spatial pattern prediction. In HPCA.
[18]
C. L. Chen. 1996. Symbol error correcting codes for memory applications. In FTCS.
[19]
Elliott Cooper-Balis et al. 2010. Fine-grained activation for power reduction in DRAM. IEEE Micro 30, 3 (2010), 34–47.
[20]
Ahmed Dalalah et al. 2006. New hardware architecture for bit-counting. In WSEAS.
[21]
Reetuparna Das et al. 2009. Application-aware prioritization mechanisms for on-chip networks. In MICRO.
[22]
Howard David et al. 2011. Memory power management via dynamic voltage/frequency scaling. In ICAC.
[23]
Robert H. Dennard. 1968. Field-effect transistor memory. U.S. Patent No. 3,387,286. Filed July 14, 1967; issued June 4, 1968.
[24]
Eiman Ebrahimi et al. 2011. Prefetch-aware shared resource management for multi-core systems. In ISCA.
[25]
Eiman Ebrahimi et al. 2009. Coordinated control of multiple prefetchers in multi-core systems. In MICRO.
[26]
Eiman Ebrahimi et al. 2009. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In HPCA.
[27]
Stijn Eyerman et al. 2008. System-level performance metrics for multiprogram workloads. IEEE Micro 28, 3 (2008), 42–53.
[28]
Luca Frontini et al. 2018. A very compact population count circuit for associative memories. In MOCAST.
[29]
Elba Garza et al. 2019. Bit-level perceptron prediction for indirect branches. In ISCA.
[30]
Saugata Ghose et al. 2019. Demystifying complex workload-DRAM interactions: An experimental study. In SIGMETRICS.
[31]
Heonjae Ha et al. 2016. Improving energy efficiency of DRAM by exploiting half page row access. In MICRO.
[32]
Greg Hamerly et al. 2005. SimPoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Parallelism 7 (2005), 1–28.
[33]
Per Hammarlund et al. 2014. Haswell: The fourth-generation Intel core processor. IEEE Micro 34, 2 (2014), 6–20.
[34]
Milad Hashemi et al. 2016. Accelerating dependent cache misses with an enhanced memory controller. In ISCA.
[35]
Hasan Hassan et al. 2019. CROW: A low-cost substrate for improving DRAM performance, energy efficiency, and reliability. In ISCA.
[36]
Hasan Hassan et al. 2016. ChargeCache: Reducing DRAM latency by exploiting row access locality. In HPCA.
[37]
Mark D. Hill et al. 1984. Experimental evaluation of on-chip microprocessor cache memories. In ISCA.
[38]
Sorin Iacobovici et al. 2004. Effective stream-based and execution-based data prefetching. In ICS.
[39]
K. Inoue et al. 1999. Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs. In HPCA.
[40]
Intel. 2022. Intel Alder Lake Events. Retrieved June 17, 2024 from https://perfmon-events.intel.com/
[41]
Intel. 2022. Intel Performance Counter Monitor—A Better Way to Measure CPU Utilization. Retrieved June 17, 2024 from https://intel.ly/3xLo80Y
[42]
E. Ipek et al. 2008. Self-optimizing memory controllers: A reinforcement learning approach. In ISCA.
[43]
Kiyoo Itoh. 2013. VLSI Memory Chip Design. Springer Science & Business Media.
[44]
JEDEC. 2007. JESD79-3: DDR3 SDRAM Standard. JEDEC.
[45]
JEDEC. 2020. JESD79-4C: DDR4 SDRAM Standard. JEDEC.
[46]
JEDEC. 2020. JESD79-5: DDR5 SDRAM Standard. JEDEC.
[47]
Daniel A. Jiménez. 2003. Fast path-based neural branch prediction. In MICRO.
[48]
Daniel A. Jiménez et al. 2001. Dynamic branch prediction with perceptrons. In HPCA.
[49]
Daniel A. Jiménez et al. 2002. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems 20, 4 (2002), 369–397.
[50]
Daniel A. Jiménez et al. 2017. Multiperspective reuse prediction. In MICRO.
[51]
M. Kadiyala et al. 1995. A dynamic cache sub-block design to reduce false sharing. In ICCD.
[52]
Dimitris Kaseridis et al. 2011. Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era. In MICRO.
[53]
B. Keeth et al. 2008. DRAM Circuit Design: Fundamental and High-Speed Topics (2nd ed.). Wiley-IEEE Press.
[54]
Jeremie S. Kim et al. 2018. The DRAM latency PUF: Quickly evaluating physical unclonable functions by exploiting the latency-reliability tradeoff in modern commodity DRAM devices. In HPCA.
[55]
Jeremie S. Kim et al. 2019. D-RaNGe: Using commodity DRAM devices to generate true random numbers with low latency and high throughput. In HPCA.
[56]
Jeremie S. Kim et al. 2020. Revisiting RowHammer: An experimental analysis of modern DRAM devices and mitigation techniques. In ISCA.
[57]
Yoongu Kim et al. 2010. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. In HPCA.
[58]
Yoongu Kim et al. 2010. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In MICRO.
[59]
Yoongu Kim et al. 2012. A case for exploiting subarray-level parallelism (SALP) in DRAM. In ISCA.
[60]
Yoongu Kim et al. 2016. Ramulator: A fast and extensible DRAM simulator. IEEE Computer Architecture Letters 15, 1 (2016), 45–49.
[61]
Kibong Koo et al. 2012. A 1.2V 38nm 2.4Gb/s/pin 2Gb DDR4 SDRAM with bank group and ×4 half-page architecture. In ISSCC.
[62]
Sanjeev Kumar et al. 1998. Exploiting spatial locality in data caches using spatial footprints. In ISCA.
[63]
Snehasish Kumar et al. 2012. Amoeba-Cache: Adaptive blocks for eliminating waste in the memory hierarchy. In MICRO.
[64]
Chang Joo Lee et al. 2009. Improving memory bank-level parallelism in the presence of prefetching. In MICRO.
[65]
Donghyuk Lee et al. 2017. Design-induced latency variation in modern DRAM chips: Characterization, analysis, and latency reduction mechanisms. In SIGMETRICS.
[66]
Donghyuk Lee et al. 2015. Adaptive-latency DRAM: Optimizing DRAM timing for the common-case. In HPCA.
[67]
Donghyuk Lee et al. 2013. Tiered-latency DRAM: A low latency and low cost DRAM architecture. In HPCA.
[68]
Donghyuk Lee et al. 2015. Decoupled direct memory access: Isolating CPU and IO traffic by leveraging a dual-data-port DRAM. In PACT.
[69]
Yebin Lee et al. 2017. Partial row activation for low-power DRAM system. In HPCA.
[70]
C. Lefurgy et al. 2003. Energy management for commercial servers. Computer 36, 12 (2003), 39–48.
[71]
Sheng Li et al. 2013. The McPAT framework for multicore and manycore architectures simultaneously modeling power, area, and timing. ACM Transactions on Architecture and Code Optimization 10, 1 (2013), Article 5, 29 pages.
[72]
S. Li et al. 2017. DRISA: A DRAM-based reconfigurable in-situ accelerator. In MICRO.
[73]
J. S. Liptay. 1968. Structural aspects of the System/360 Model 85, II: The cache. IBM Systems Journal 7, 1 (1968), 15–21.
[74]
Kuang-Chih Liu et al. 1997. On the effectiveness of sectored caches in reducing false sharing misses. In ICPADS.
[75]
Haocong Luo et al. 2023. Ramulator 2.0: A modern, modular, and extensible DRAM simulator. arXiv:2308.11030 [cs.AR] (2023).
[76]
Haocong Luo et al. 2023. RowPress: Amplifying read disturbance in modern DRAM chips. In ISCA.
[77]
Jack A. Mandelman et al. 2002. Challenges and future directions for the scaling of dynamic random-access memory (DRAM). IBM Journal of Research and Development 46, 2.3 (2002), 187–212.
[78]
T. Mano et al. 1983. Submicron VLSI memory circuits. In ISSCC.
[79]
Deepak Molly Mathew et al. 2020. Using runtime reverse engineering to optimize DRAM refresh. U.S. Patent No. 10,622,054 B2. Filed September 5, 2018; issued April 14, 2020.
[80]
Deepak M. Mathew et al. 2017. Using run-time reverse-engineering to optimize DRAM refresh. In MEMSYS.
[81]
Micron Technology. 2014. SDRAM, 4Gb: x4, x8, x16 DDR4 SDRAM Features. Micron Technology.
[82]
W. R. Moore. 1986. A review of fault-tolerant techniques for the enhancement of integrated circuit yield. Proceedings of the IEEE 74, 5 (1986), 684–698.
[83]
Thomas Moscibroda et al. 2007. Memory performance attacks: Denial of memory service in multi-core systems. In USENIX Security.
[84]
Shubu Mukherjee. 2008. Architecture Design for Soft Errors. Morgan Kaufmann Publishers.
[85]
Onur Mutlu. 2013. Memory scaling: A systems architecture perspective. In IMW.
[86]
Onur Mutlu et al. 2007. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO.
[87]
Onur Mutlu et al. 2008. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In ISCA.
[88]
Kyle J. Nesbit et al. 2006. Fair queuing memory systems. In MICRO.
[89]
Ataberk Olgun et al. 2021. QUAC-TRNG: High-throughput true random number generation using quadruple row activation in commodity DRAMs. In ISCA.
[90]
Geraldo F. Oliveira et al. 2024. MIMDRAM: An end-to-end processing-using-DRAM system for high-throughput, energy-efficient and programmer-transparent multiple-instruction multiple-data computing. In HPCA.
[91]
Geraldo F. Oliveira et al. 2021. DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks. IEEE Access 9 (2021), 1–46.
[92]
Mike O’Connor et al. 2017. Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems. In MICRO.
[93]
Minesh Patel. 2022. Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes. Ph. D. Dissertation. ETH Zürich.
[94]
Minesh Patel et al. 2019. Understanding and modeling on-die error correction in modern DRAM: An experimental study using real devices. In DSN.
[95]
Minesh Patel et al. 2020. Bit-exact ECC recovery (BEER): Determining DRAM on-die ECC functions by exploiting DRAM data retention characteristics. In MICRO.
[96]
Indrani Paul et al. 2015. Harmonia: Balancing compute and memory power in high-performance GPUs. In ISCA.
[97]
Prateek Pujara et al. 2006. Increasing the cache efficiency by eliminating noise. In HPCA.
[98]
Moinuddin K. Qureshi et al. 2007. Line distillation: Increasing cache capacity by filtering unused words in cache lines. In HPCA.
[99]
Rambus. 2014. DRAM Power Model. Retrieved June 17, 2024 from https://www.rambus.com/energy/
[100]
Micron Technology. 2017. TN-40-07: Calculating Memory Power for DDR4 SDRAM. Retrieved June 17, 2024 from https://www.micron.com/-/media/client/global/documents/products/technical-note/dram/tn4007_ddr4_power_calculation.pdf
[101]
Jeffrey B. Rothman et al. 2000. Sector cache design and performance. In MASCOTS.
[102]
Jeffrey B. Rothman et al. 1999. The pool of subsectors cache design. In ICS.
[103]
Jeffrey B. Rothman et al. 2002. Minerva: An adaptive subblock coherence protocol for improved SMP performance. In ISHPC.
[104]
SAFARI Research Group. 2023. Ramulator—GitHub Page. Retrieved June 17, 2024 from https://github.com/CMU-SAFARI/ramulator
[105]
SAFARI Research Group. 2023. Ramulator 2.0—GitHub Repository. Retrieved June 17, 2024 from https://github.com/CMU-SAFARI/ramulator2
[106]
SAFARI Research Group. 2024. DAMOV Benchmark Suite and Simulation Framework. Retrieved June 17, 2024 from https://github.com/CMU-SAFARI/DAMOV
[107]
SAFARI Research Group. 2024. Sectored DRAM—GitHub Page. Retrieved June 17, 2024 from https://github.com/CMU-SAFARI/Sectored-DRAM
[108]
Vivek Seshadri et al. 2013. RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization. In MICRO.
[109]
Vivek Seshadri et al. 2017. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology. In MICRO.
[110]
Vivek Seshadri et al. 2019. In-DRAM bulk bitwise execution engine. arXiv:1905.09822 [cs.AR] (2019).
[111]
A. Seznec. 1994. Decoupled sectored caches: Conciliating low tag implementation cost. In ISCA.
[112]
Alan Jay Smith. 1987. Line (block) size choice for CPU cache memories. IEEE Transactions on Computers C-36, 9 (1987), 1063–1075.
[113]
J. E. Smith et al. 1987. The ZS-1 central processor. In ASPLOS.
[114]
J. E. Smith et al. 1995. The microarchitecture of superscalar processors. Proceedings of the IEEE 83, 12 (1995), 1609–1624.
[115]
Allan Snavely et al. 2000. Symbiotic jobscheduling for a simultaneous multithreaded processor. In ASPLOS.
[116]
Young Hoon Son et al. 2014. Microbank: Architecting through-silicon interposer-based main memory systems. In SC.
[117]
Santhosh Srinath et al. 2007. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In HPCA.
[118]
Standard Performance Evaluation Corp. 2006. SPEC CPU® 2006. Retrieved June 17, 2024 from http://www.spec.org/cpu2006
[119]
Standard Performance Evaluation Corp. 2017. SPEC CPU® 2017. Retrieved June 17, 2024 from http://www.spec.org/cpu2017
[120]
Lavanya Subramanian et al. 2016. BLISS: Balancing performance, fairness and complexity in memory access scheduling. IEEE Transactions on Parallel and Distributed Systems 27, 10 (2016), 3071–3087.
[121]
Kshitij Sudan et al. 2010. Micro-pages: Increasing DRAM efficiency with locality-aware data placement. In ISCA.
[122]
J. M. Tendler et al. 2002. POWER4 system microarchitecture. IBM Journal of Research and Development 46, 1 (2002), 5–25.
[123]
Elvira Teran et al. 2016. Perceptron learning for reuse prediction. In MICRO.
[124]
Aniruddha N. Udipi et al. 2010. Rethinking DRAM design and organization for energy-constrained multi-cores. In ISCA.
[125]
Thomas Vogelsang. 2010. Understanding the energy consumption of dynamic random access memories. In ISCA.
[126]
Frederick A. Ware et al. 2006. Improving power and data efficiency with threaded memory modules. In ICCD.
[127]
Malcolm Ware et al. 2010. Architecting for power management: The IBM® POWER7™ approach. In HPCA.
[128]
A. Giray Yaglikci et al. 2022. HiRA: Hidden row activation for reducing refresh latency of off-the-shelf DRAM chips. In MICRO.
[129]
K. C. Yeager. 1996. The Mips R10000 superscalar microprocessor. IEEE Micro 16, 2 (1996), 28–41.
[130]
Ravikiran Yeleswarapu et al. 2020. Addressing multiple bit/symbol errors in DRAM subsystem. arXiv:1908.01806 (2020).
[131]
Doe Hyun Yoon et al. 2011. Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput. In ISCA.
[132]
Doe Hyun Yoon et al. 2012. The dynamic granularity memory system. In ISCA.
[133]
George L. Yuan et al. 2009. Complexity effective memory access scheduling for many-core accelerator architectures. In MICRO.
[134]
Ismail Emir Yuksel et al. 2024. Functionally-complete Boolean logic in real DRAM chips: Experimental characterization and analysis. In HPCA.
[135]
Chao Zhang and Xiaochen Guo. 2017. Enabling efficient fine-grained DRAM activations with interleaved I/O. In ISLPED.
[136]
Tao Zhang et al. 2014. Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation. In ISCA.
[137]
Hongzhong Zheng et al. 2008. Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. In MICRO.
