1 Introduction
Main memory is a major performance and energy bottleneck in computing systems [48, 120]. One way of overcoming the main memory bottleneck is to move computation into/near memory, a paradigm known as processing-in-memory (PiM) [120]. PiM reduces memory latency between the memory units and the compute units, enables the compute units to exploit the large internal bandwidth within memory devices, and reduces the overall power consumption of the system by eliminating the need for transferring data over power-hungry off-chip interfaces [48, 120].
Recent works propose a variety of PiM techniques to alleviate the data movement problem. One set of techniques proposes to place compute logic near memory arrays (e.g., processing capability in the memory controller, in the logic layer of three-dimensional (3D)-stacked memory, or near the memory array within the memory chip) [2, 3, 4, 12, 20, 21, 22, 25, 31, 34, 37, 39, 45, 46, 47, 51, 53, 57, 58, 65, 67, 79, 80, 86, 111, 118, 121, 131, 133, 152, 161, 172, 175, 176, 177]. These techniques are called processing-near-memory (PnM) techniques [120]. Another set of techniques proposes to leverage analog properties of memory (e.g., Static Random-Access Memory, Dynamic Random-Access Memory (DRAM), Non-Volatile Memory) operation to perform computation in different ways (e.g., leveraging non-deterministic behavior in memory array operation to generate random numbers, or performing bitwise operations within the memory array by exploiting analog charge sharing properties of DRAM operation) [1, 5, 6, 7, 8, 9, 17, 19, 24, 28, 32, 36, 42, 43, 44, 54, 55, 56, 69, 73, 82, 83, 91, 92, 93, 102, 103, 104, 114, 134, 136, 145, 147, 151, 155, 159, 162, 167, 170, 171]. These techniques are known as processing-using-memory (PuM) techniques [120].
A subset of PuM proposals devise mechanisms that enable computation using DRAM arrays [5, 6, 28, 32, 44, 54, 82, 83, 103, 134, 145, 147, 159, 167]. These mechanisms provide significant performance benefits and energy savings by exploiting the high internal bit-level parallelism of DRAM for (1) bulk data copy and initialization operations at row granularity [1, 28, 134, 145, 159], (2) bitwise operations [7, 8, 9, 103, 104, 114, 142, 144, 146, 147, 148, 167], (3) arithmetic operations [1, 6, 17, 32, 36, 42, 43, 55, 56, 73, 91, 92, 93, 102, 103, 151, 162, 170], and (4) security primitives (e.g., true random number generation [83] and physical unclonable functions [82, 126]). Recent works [44, 82, 83] show that some of these PuM mechanisms can already be reliably supported in contemporary, off-the-shelf DRAM chips. Given that DRAM is the dominant main memory technology, these commodity DRAM-based PuM techniques provide a promising way to improve the performance and energy efficiency of existing and future systems at no additional DRAM hardware cost.
Integration of these PuM mechanisms in a real system imposes non-trivial challenges that require further research to find appropriate solutions. For example, in-DRAM bulk data copy and initialization techniques [28, 147] require modifications to memory management that affect different parts of the system. First, these techniques have specific memory allocation and alignment requirements (e.g., page-granularity source and destination operand arrays should be allocated and aligned in the same DRAM subarray) that are not satisfied by existing memory allocation primitives (e.g., malloc [106] and posix_memalign [108]). Second, in-DRAM copy requires efficient handling of memory coherence, such that the contents of the source operand in DRAM are up-to-date.
None of these system integration challenges of PuM mechanisms can be efficiently studied in existing general-purpose computing systems (e.g., personal computers, cloud computers, and embedded systems), special-purpose testing platforms (e.g., SoftMC [60]), or system simulators (e.g., gem5 [18, 132], Ramulator [90, 137], Ramulator-PIM [139], zsim [140], DAMOVSim [125, 138], and other simulators [35, 168, 169, 174]). Existing general-purpose computing systems do not permit dynamically changing DDRx timing parameters, which is required to integrate many PuM mechanisms into real systems. Although special-purpose testing platforms can be used to dynamically change DDRx timing parameters, these platforms do not model an end-to-end computing system where system integration of PuM mechanisms can be studied. System simulators do not model DRAM operation that violates manufacturer-recommended timing parameters and do not have a way of interacting with real DRAM chips that embody undisclosed and unique characteristics that have implications on how PuM techniques are integrated into real systems.
Our goal is to design and implement a flexible real-system platform that can be used to solve system integration challenges and analyze tradeoffs of end-to-end implementations of commodity DRAM-based PuM mechanisms. To this end, we develop the Processing-in-DRAM (PiDRAM) framework, the first flexible, end-to-end, and open source framework that enables system integration studies and evaluation of real PuM techniques using real unmodified DRAM devices.
PiDRAM facilitates system integration studies of new commodity DRAM-based PuM mechanisms by providing four customizable hardware and software components that can be used as a common basis to enable system support. PiDRAM contains two main hardware components. First, a custom, easy-to-extend memory controller allows for implementing new DRAM command sequences that perform PuM operations. For example, the memory controller can be extended with a single state machine in its hardware description that issues a new DDRx command sequence with user-defined timing parameters, thereby implementing a new PuM technique (i.e., performing a new PuM operation). Second, an ISA-transparent controller, the PuM Operations Controller (POC), supervises PuM execution. POC exposes the PuM operations to the software components of PiDRAM over a memory-mapped interface to the processor, allowing the programmer to perform PuM operations using the PiDRAM framework by executing conventional LOAD/STORE instructions. The memory-mapped interface allows PiDRAM to be easily ported to systems that implement different instruction set architectures. PiDRAM contains two main software components. First, an extensible library allows system designers to implement software support for PuM mechanisms. This library contains customizable functions that communicate with POC to perform PuM operations. Second, a custom supervisor software contains the necessary OS primitives (e.g., memory management) to enable end-to-end implementations of commodity DRAM-based PuM techniques.
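To illustrate how the memory-mapped interface is used, the sketch below shows how a user program could trigger a PuM operation through POC with ordinary LOAD/STORE instructions. The base address, register layout, and pidram_op name are hypothetical placeholders, not PiDRAM's actual register map.

#include <stdint.h>

/* Hypothetical register map for POC; the base address and word offsets
 * are placeholders, not PiDRAM's actual memory map. */
#define POC_BASE  0x80001000UL
#define POC       ((volatile uint64_t *)POC_BASE)
#define POC_CMD   0   /* word index of the command/opcode register (assumed) */
#define POC_ARG0  1   /* word index of the first operand register (assumed)  */
#define POC_ARG1  2   /* word index of the second operand register (assumed) */
#define POC_STAT  3   /* word index of the status register (assumed)         */

/* Issue one PuM operation: ordinary STOREs carry the operands and the
 * opcode to POC, and ordinary LOADs poll for completion. */
static inline void pidram_op(uint64_t opcode, uint64_t arg0, uint64_t arg1)
{
    POC[POC_ARG0] = arg0;
    POC[POC_ARG1] = arg1;
    POC[POC_CMD]  = opcode;       /* writing the opcode starts the operation */
    while (POC[POC_STAT] != 0)    /* wait until POC reports completion       */
        ;
}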
We demonstrate a prototype of PiDRAM on an FPGA-based RISC-V system [11]. To demonstrate the flexibility and ease of use of PiDRAM, we implement two prominent PuM techniques: (1) RowClone [145], an in-DRAM data copy and initialization technique, and (2) an in-DRAM true random number generation technique (D-RaNGe) [83] based on activation-latency failures. To support RowClone (Section 5), (i) we customize the PiDRAM memory controller to issue carefully engineered sequences of DRAM commands that perform data copy (and initialization) operations in DRAM, and (ii) we extend the custom supervisor software to implement a new memory management mechanism that satisfies the memory allocation and alignment requirements of RowClone. For D-RaNGe (Section 6), we extend (i) the PiDRAM memory controller with a new state machine that periodically performs DRAM accesses with reduced activation latencies to generate random numbers [83] and a new hardware random number buffer that stores the generated random numbers, and (ii) the custom supervisor software with a function that retrieves the random numbers from the hardware buffer to the user program. Our end-to-end evaluation of (i) RowClone demonstrates up to 14.6\(\times {}\) speedup for bulk copy and 12.6\(\times {}\) speedup for bulk initialization operations over CPU copy (i.e., conventional memcpy), even when coherence is satisfied using inefficient cache flush operations, and (ii) D-RaNGe demonstrates that an end-to-end integration of D-RaNGe can provide true random numbers at high throughput (8.30 Mb/s) and low latency (a 4-bit random number in 220 ns), even without any hardware or software optimizations. Implementing both PuM techniques over the Verilog and C++ codebase provided by PiDRAM requires only 388 lines of Verilog code and 643 lines of C++ code.
Our contributions are as follows:
• We develop PiDRAM, the first flexible framework that enables end-to-end integration and evaluation of PuM mechanisms using real unmodified DRAM chips.
• We develop a prototype of PiDRAM on an FPGA-based platform. To demonstrate the ease of use and evaluation benefits of PiDRAM, we implement two state-of-the-art DRAM-based PuM mechanisms, RowClone and D-RaNGe, and evaluate them on PiDRAM's prototype using unmodified DDR3 chips.
• We devise a new memory management mechanism that satisfies the memory allocation and alignment requirements of RowClone. We demonstrate that our mechanism enables RowClone end-to-end in the full system and provides significant performance improvements over traditional CPU-based copy and initialization operations (memcpy [107] and calloc [105]), as demonstrated on our PiDRAM prototype.
• We implement and evaluate D-RaNGe, a state-of-the-art in-DRAM true random number generator. Our implementation provides a solid foundation for future work on system integration of DRAM-based PuM security primitives (e.g., PUFs [13, 82] and TRNGs [13, 123, 124]) implemented using real unmodified DRAM chips.
5 Case Study #1: End-to-end RowClone
We implement support for ComputeDRAM-based RowClone (in-DRAM copy/initialization) operations on PiDRAM to conduct a detailed study of the challenges associated with implementing RowClone end-to-end on a real system. None of the relevant prior works [44, 54, 142, 145, 147, 148, 150, 159] provide a clear description or a real system demonstration of a working memory allocation mechanism that can be implemented in a real operating system to expose RowClone capability to the programmer.
5.1 Implementation Challenges
Data Mapping. RowClone has four data mapping and alignment requirements that cannot be satisfied by current memory allocation mechanisms (e.g., malloc [106]). First, the source and destination operands (i.e., page (4-KiB)-sized arrays) of the copy operation must reside in the same DRAM subarray. We refer to this as the mapping requirement. Second, the source and destination operands must be aligned to DRAM rows. We refer to this as the alignment requirement. Third, the size of the copied data must be a multiple of the DRAM row size. The size constraint defines the granularity at which we can perform bulk-copy operations using RowClone. We refer to this as the granularity requirement. Fourth, RowClone must operate on up-to-date data that resides in main memory. Modern systems employ caches to exploit locality in memory accesses and reduce memory latency. Thus, pieces (cache blocks, typically 64 B) of either the source or the destination operands of the RowClone operation may have copies present in the cache hierarchy. Before performing RowClone, the cached copies of pieces of both source and destination operands must be invalidated and written back to main memory. We refer to this as the memory coherence requirement.
We explain the data mapping and alignment requirements of RowClone using Figure 5(a). The operand Source 1 cannot be copied to the operand Target 1 as the operands do not satisfy the granularity requirement (➊). Performing such a copy operation would overwrite the remaining (i.e., non-Target 1) data in Target 1's DRAM row with the remaining (i.e., non-Source 1) data in Source 1's DRAM row. Source 2 cannot be copied to Target 2 as Target 2 is not aligned to its DRAM row (➋). Source 3 cannot be copied to Target 3, as these operands are not mapped to the same DRAM subarray (➌). In contrast, Source 4 can be copied to Target 4 using in-DRAM copy, because these operands (i) are mapped to the same DRAM subarray, (ii) are aligned to their DRAM rows, and (iii) occupy their rows completely (i.e., the operands have sizes equal to the DRAM row size) (➍).
5.2 Memory Allocation Mechanism
Computing systems employ various layers of address mappings that obfuscate the DRAM row-bank-column address mapping from the programmer [30, 61], which makes allocating source and target operands as depicted in Figure 5(a) (➍) difficult. Only the virtual addresses are exposed to the programmer. Without control over the virtual address to DRAM address mapping, the programmer cannot easily place data in a way that satisfies the mapping and alignment requirements of an in-DRAM copy operation.
We implement a new memory allocation mechanism that can perform memory allocation for RowClone (in-DRAM copy/initialization) operations. This mechanism enables page-granularity RowClone operations (i.e., a virtual page can be copied to another virtual page using RowClone) without introducing any changes to the programming model. Figure 5(b) depicts an overview of our memory allocation mechanism. At a high level, our memory allocation mechanism (i) splits the source and destination operands into page-sized virtually addressed memory blocks, (ii) allocates two physical pages in different DRAM rows in the same DRAM subarray, and (iii) assigns these physical pages to the virtual pages that correspond to the source and destination memory blocks at the same index, such that the source block can be copied to the destination block using RowClone. We repeat this process until we exhaust the page-sized memory blocks. As the mechanism processes subsequent page-sized memory blocks of the two operands, it allocates physical pages from a different DRAM bank to maximize bank-level parallelism in streaming accesses to these operands.
To overcome the mapping, alignment, and granularity problems, we implement our memory management mechanism in the custom supervisor software of PiDRAM. We expose the allocation mechanism using the alloc_align(N, ID) system call. The system call returns a pointer to a contiguous array of N bytes in the virtual address space (i.e., one operand). Multiple calls with the same ID to alloc_align(N, ID) place the allocated arrays in the same subarray in DRAM, such that they can be copied from one to another using RowClone. If N is too large such that it exceeds the size of available physical memory, then alloc_align fails and causes an exception. Our implementation of RowClone requires application developers to directly use alloc_align to allocate data instead of malloc and similar function calls.
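As a usage sketch (assuming the alloc_align and rcc system calls are exposed to user programs as described in this section, through hypothetical user-level wrappers), two RowClone-copyable arrays could be allocated and copied as follows:

#include <stddef.h>

/* Hypothetical user-level wrappers around PiDRAM's system calls. */
void *alloc_align(size_t n, int id);          /* subarray-aware allocation   */
int   rcc(void *dest, void *src, int size);   /* RowClone-Copy (Section 5.4) */

#define ARRAY_BYTES (128 * 1024)   /* a multiple of the 8-KiB DRAM row size */

void copy_with_rowclone(void)
{
    /* Same ID -> both arrays are placed in the same DRAM subarray and are
     * row-aligned, so they satisfy RowClone's mapping, alignment, and
     * granularity requirements. */
    char *src = alloc_align(ARRAY_BYTES, /*ID=*/0);
    char *dst = alloc_align(ARRAY_BYTES, /*ID=*/0);

    /* ... fill src ... */

    rcc(dst, src, ARRAY_BYTES);    /* in-DRAM bulk copy instead of memcpy */
}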
The custom supervisor software maintains three key structures to make alloc_align() work: (i) Subarray Mapping Table (SAMT), (ii) Allocation ID Table (AIT), and (iii) Initializer Rows Table (IRT).
(1) Subarray Mapping Table. We use the SAMT to maintain a list of physical page addresses that point to DRAM rows that are in the same DRAM subarray. alloc_align() queries SAMT to find physical addresses that map to rows in one subarray.
SAMT contains the physical pages that point to DRAM rows in each subarray. SAMT is indexed using subarray identifiers in the range [0, number of subarrays). SAMT contains an entry for every subarray. An entry consists of two elements: (i) the number of free physical address tuples and (ii) a list of physical address tuples. Each tuple in the list contains two physical addresses that respectively point to the first and second halves of the same DRAM row. The list of tuples contains all the physical addresses that point to DRAM rows in the DRAM subarray indexed by the SAMT entry. We allocate free physical pages listed in an entry and assign them to the virtual pages (i.e., memory blocks) that make up the row-copy operands (i.e., arrays) allocated by alloc_align(). We slightly modify our high-level memory allocation mechanism to allow for two memory blocks (4 KiB virtually addressed pages) of an array to be placed in the same DRAM row, as the page size in our system is 4 KiB and the size of a DRAM row is 8 KiB. We call two memory blocks in the same operand that are placed in the same DRAM row sibling memory blocks (also called sibling pages). The parameter N of the alloc_align() call defines this relationship: We designate memory blocks that are precisely N/2 bytes apart as sibling memory blocks.
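A minimal sketch of the SAMT layout implied by this description is shown below; the field and type names are ours, not PiDRAM's.

#include <stdint.h>

/* Two physical page addresses that point to the first and second halves
 * of the same 8-KiB DRAM row (the page size is 4 KiB). */
struct samt_tuple {
    uint64_t first_half_ppa;
    uint64_t second_half_ppa;
};

/* One SAMT entry per subarray, indexed by subarray ID. */
struct samt_entry {
    uint32_t          num_free_tuples;  /* (i) number of free physical address tuples */
    struct samt_tuple *tuples;          /* (ii) all DRAM rows in this subarray        */
};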
Finding the DRAM Rows in a Subarray. Finding the DRAM row addresses that belong to the same subarray is not straightforward due to DRAM-internal mapping schemes employed by DRAM manufacturers (Section 2.1). It is extremely difficult to learn which DRAM address (i.e., bank-row-column) is actually mapped to a physical location (e.g., a subarray) in the DRAM device, as these mappings are not exposed through publicly accessible datasheets or standard definitions [71, 116, 130]. We make the key observation that the entire mapping scheme need not be available to successfully perform RowClone operations.
We observe that, for a set of {source, destination} DRAM row address pairs, RowClone operations repeatedly succeed with a 100% probability. We hypothesize that these pairs of DRAM row addresses are mapped to the same DRAM subarray. We identify these row address pairs by conducting a RowClone success rate experiment, where we repeatedly perform RowClone operations between every {source, destination} row address pair in a DRAM bank. Our experiment works in three steps: We (i) initialize both the source and the destination row with random data, (ii) perform a RowClone operation from the source to the destination row, and (iii) compare the data in the destination row with the data in the source row. The RowClone success rate is calculated as the number of bits that are identical between the source and destination rows' data divided by the number of bits stored in a row (8 KiB in our prototype). If there is no difference between the source and the destination rows' data (i.e., the RowClone success rate for the source and the destination row is 100%), then we infer that the RowClone operation was successful. We repeat the experiment for 1,000 iterations for each row address pair and, if every iteration is successful, we store the address pair in the SAMT, indicating that the two row addresses are mapped to different rows in the same DRAM subarray.
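The experiment can be sketched as the following loop; write_row(), read_row(), copy_row(), and fill_random() stand in for the prototype's actual test routines, and the exact helper names are assumptions.

#include <stdbool.h>
#include <stdint.h>

#define ROW_BYTES  (8 * 1024)   /* DRAM row size in our prototype */
#define ITERATIONS 1000

/* Hypothetical helpers: write_row()/read_row() access a full DRAM row at a
 * given physical row address, copy_row() issues the in-DRAM copy command
 * sequence, and fill_random() generates random test data. */
void write_row(uint64_t row_pa, const uint8_t *data);
void read_row(uint64_t row_pa, uint8_t *data);
void copy_row(uint64_t src_row_pa, uint64_t dst_row_pa);
void fill_random(uint8_t *data, int n);

/* Returns true only if RowClone from src to dst succeeds in every iteration
 * (i.e., 100% success rate), indicating that the two rows reside in the same
 * DRAM subarray. */
bool same_subarray(uint64_t src_row_pa, uint64_t dst_row_pa)
{
    static uint8_t src[ROW_BYTES], dst[ROW_BYTES];
    for (int i = 0; i < ITERATIONS; i++) {
        fill_random(src, ROW_BYTES);            /* (i) random data in both rows */
        fill_random(dst, ROW_BYTES);
        write_row(src_row_pa, src);
        write_row(dst_row_pa, dst);
        copy_row(src_row_pa, dst_row_pa);       /* (ii) in-DRAM copy            */
        read_row(dst_row_pa, dst);
        for (int b = 0; b < ROW_BYTES; b++)     /* (iii) compare the two rows   */
            if (dst[b] != src[b])
                return false;
    }
    return true;
}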
(2) Allocation ID Table. To keep track of different operands that are allocated by alloc_align using the same ID (used to place different arrays in the same subarray), we use the AIT. AIT entries are indexed by allocation IDs (the parameter ID of the alloc_align call). Each AIT entry stores a pointer to an SAMT entry. The SAMT entry pointed to by the AIT entry contains the set of physical addresses that were allocated using the same allocation ID. AIT entries are used by the alloc_align function to find which DRAM subarray to allocate DRAM rows from, such that a newly allocated array can be copied to other arrays allocated using the same ID.
(3) Initializer Rows Table. To find which row in a DRAM subarray can be used as the source operand in RowClone-Initialize operations, we maintain the IRT. The IRT is indexed using physical page numbers. RowClone-Initialize operations query the IRT to obtain the physical address of a DRAM row that is initialized with zeros and belongs to the same subarray as the destination operand (i.e., the DRAM row to be initialized with zeros).
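Under the same naming assumptions, and reusing struct samt_entry from the SAMT sketch above, the AIT and IRT can be sketched as simple lookup tables:

#include <stdint.h>

/* AIT: maps an allocation ID to the SAMT entry (i.e., the subarray) that
 * holds all arrays allocated with that ID. */
struct ait_entry {
    struct samt_entry *subarray;   /* NULL if the allocation ID is unused */
};

/* IRT: maps a physical page number to the physical address of the
 * all-zeros initializer row in the same subarray. */
struct irt_entry {
    uint64_t initializer_row_ppa;
};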
Figure 6 describes how alloc_align() works over an end-to-end example. Using the RowClone success rate experiment (described above), the custom supervisor software (CSS for short) finds the DRAM rows that are in the same subarray (➊) and initializes the SAMT. The programmer allocates two 128-KiB arrays, A and B, via alloc_align() using the same allocation ID (0), with the intent to copy from A to B (➋). CSS allocates contiguous ranges of virtual addresses to A and B and then splits the virtual address ranges into page-sized memory blocks (➌). CSS assigns consecutive memory blocks to consecutive DRAM banks and accesses the AIT with the allocation ID (➍) for each memory block. By accessing the AIT, CSS retrieves the subarray ID that points to a SAMT entry. The SAMT entry corresponds to the subarray that contains the arrays that are allocated using the allocation ID (➎). CSS accesses the SAMT entry to retrieve two physical addresses that point to the same DRAM row. CSS maps a memory block and its sibling memory block (i.e., the memory block that is N/2 bytes away from this memory block, where N is the size argument of the alloc_align() call) to these two physical addresses, such that they are mapped to the first and the second halves of the same DRAM row (➏). Once allocated, these physical addresses are pinned to main memory and cannot be swapped out to storage. Finally, CSS updates the page table with the physical addresses to map the memory blocks to the same DRAM row (➐).
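A condensed sketch of this allocation loop is shown below, reusing the SAMT/AIT struct sketches above; reserve_virtual_range(), take_free_tuple(), map_page(), and the ait table are hypothetical helpers, and bank interleaving across consecutive blocks is omitted for brevity.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers (see the SAMT/AIT sketches above). */
char *reserve_virtual_range(size_t n);                     /* contiguous VA range */
struct samt_tuple take_free_tuple(struct samt_entry *sa);  /* one free DRAM row   */
void map_page(void *va, uint64_t ppa);                     /* install + pin a PTE */
extern struct ait_entry ait[];

/* Core of alloc_align(): map each 4-KiB memory block and its sibling block
 * (N/2 bytes away) to the two halves of one DRAM row in the subarray
 * selected by the allocation ID. */
void *alloc_align(size_t n, int id)
{
    char *va = reserve_virtual_range(n);
    struct samt_entry *sa = ait[id].subarray;          /* steps (4)-(5): AIT -> SAMT   */

    for (size_t off = 0; off < n / 2; off += 4096) {
        struct samt_tuple t = take_free_tuple(sa);     /* step (6): pick one DRAM row  */
        map_page(va + off,         t.first_half_ppa);  /* first half  <- memory block  */
        map_page(va + off + n / 2, t.second_half_ppa); /* second half <- sibling block */
    }
    return va;                                         /* step (7): page table updated */
}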
5.3 Maintaining Memory Coherence
Since memory instructions update the cached copies of data (Section 5.1), a naive implementation of RowClone can potentially operate on stale data, because cached copies of RowClone operands can be modified by CPU store instructions. Thus, we need to ensure memory coherence to prevent RowClone from operating on stale data.
We implement a new custom RISC-V instruction, called CLFLUSH, to flush dirty cache blocks to DRAM (RISC-V does not implement any cache management operations [160]) so as to ensure that RowClone operates on up-to-date data. A CLFLUSH instruction flushes (invalidates) a physically addressed dirty (clean) cache block. CLFLUSH or other cache management operations with similar semantics are supported in x86 [68] and ARM architectures [10]. Thus, the CLFLUSH instruction that we implement provides a minimally invasive solution (i.e., it requires no changes to the specification of commercial ISAs) to the memory coherence problem. Before executing a RowClone Copy or Initialization operation (see Section 5.4), the custom supervisor software flushes (invalidates) the cache blocks of the source (destination) row of the RowClone operation using CLFLUSH.
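A sketch of how the supervisor software could flush an operand before a RowClone operation is shown below, assuming a hypothetical clflush() wrapper around the custom instruction (the actual instruction encoding is not shown).

#include <stddef.h>
#include <stdint.h>

#define CACHE_BLOCK_SIZE 64   /* bytes */

/* Hypothetical wrapper around the custom RISC-V CLFLUSH instruction:
 * flushes a dirty cache block (or invalidates a clean one) identified
 * by its physical address. */
void clflush(uint64_t paddr);

/* Flush/invalidate every cache block of a physically contiguous operand
 * before issuing the in-DRAM copy. */
void flush_operand(uint64_t paddr, size_t size)
{
    for (uint64_t a = paddr; a < paddr + size; a += CACHE_BLOCK_SIZE)
        clflush(a);
}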
5.4 RowClone-Copy and RowClone-Initialize
We support the RowClone-Copy and RowClone-Initialize operations in our custom supervisor software via two functions: (i) RowClone-Copy, rcc(void *dest, void *src, int size), and (ii) RowClone-Initialize, rci(void *dest, int size). rcc copies size bytes in the virtual address space starting from the src memory address to the dest memory address. rci zero-initializes size bytes in the virtual address space starting from the dest memory address. We expose rcc and rci to user-level programs using system calls defined in the custom supervisor software.
rcc (i) splits the source and destination operands into page-aligned, page-sized blocks, (ii) traverses the page table (Figure 6 ➑) to find the physical address of each block (i.e., the address of a DRAM row), (iii) flushes all cache blocks corresponding to the source operand and invalidates all cache blocks corresponding to the destination operand, and (iv) performs a RowClone operation from the source row to the destination row using pumolib's copy_row() function.
rci (i) splits the destination operand into page-aligned, page-sized blocks, (ii) traverses the page table to find the physical address of the destination operand, (iii) queries the IRT (see Section 5.2) to obtain the physical address of the initializer row (i.e., the source operand), (iv) invalidates the cache blocks corresponding to the destination operand, and (v) performs a RowClone operation from the initializer row to the destination row using pumolib's copy_row() function.
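Putting the pieces together, rcc can be sketched as follows; va_to_pa(), flush_operand(), invalidate_operand(), and the exact copy_row() signature are assumptions standing in for the page-table walk, the CLFLUSH loops sketched in Section 5.3, and pumolib's row-copy primitive.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hypothetical helpers used by the sketch. */
uint64_t va_to_pa(void *va);                          /* page-table walk            */
void     flush_operand(uint64_t pa, size_t size);     /* CLFLUSH loop (Section 5.3) */
void     invalidate_operand(uint64_t pa, size_t size);
void     copy_row(uint64_t src_pa, uint64_t dst_pa);  /* pumolib primitive          */

/* RowClone-Copy: copy `size` bytes, one page-sized block at a time. */
int rcc(void *dest, void *src, int size)
{
    for (int off = 0; off < size; off += PAGE_SIZE) {
        uint64_t src_pa = va_to_pa((char *)src  + off);  /* (ii) translate          */
        uint64_t dst_pa = va_to_pa((char *)dest + off);
        flush_operand(src_pa, PAGE_SIZE);                /* (iii) flush source      */
        invalidate_operand(dst_pa, PAGE_SIZE);           /*       invalidate dest   */
        copy_row(src_pa, dst_pa);                        /* (iv) in-DRAM copy       */
    }
    return 0;
}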
5.5 Evaluation
We evaluate our solutions for the challenges in implementing RowClone end-to-end on a real system using PiDRAM. We modify the custom memory controller to implement DRAM command sequences (\(ACT\rightarrow {}PRE\rightarrow {}ACT\)) to trigger RowClone operations. We set the \(tRAS\) and \(tRP\) parameters to 10 ns (below the manufacturer-recommended 37.5 ns for tRAS and 13.5 ns for tRP [117]).
5.5.1 Experimental Methodology.
Table 3 (left) describes the configuration of the components in our system. We use the pipelined, in-order Rocket core with a 16-KiB L1 data cache and a 4-entry TLB as the main processor of our system. We use the 1-GiB DDR3 module available on the ZC706 board as the main memory where we conduct PuM operations. Implementing RowClone requires an additional 198 lines of Verilog code over PiDRAM's existing Verilog design. We add 43 and 522 lines of C code to pumolib and to our custom supervisor software, respectively, to implement RowClone.
Table 3 (right) describes the mapping scheme we use in our custom memory controller to translate from physical addresses to DRAM row-bank-column addresses. We map physical addresses to DRAM columns, banks, and rows from lower-order bits to higher-order bits to exploit bank-level parallelism in memory accesses to consecutive physical pages. We note that our memory management mechanism is compatible with other physical address \(\rightarrow {}\) DRAM address mappings [62]. For example, for a mapping scheme where the page offset bits (physical address (PA) [11:0]) include all or a subset of the bank address bits, a single RowClone operand (i.e., a 4-KiB page) would be split across multiple DRAM banks. This only coarsens the granularity of RowClone operations, as the number of sibling pages that must be copied in unison (to satisfy the granularity constraint) increases. We expect that for other complex or unknown physical address \(\rightarrow {}\) DRAM address mapping schemes, the characterization of the DRAM device for RowClone success rate would take longer. In the worst case, DRAM row addresses that belong to the same DRAM subarray can be found by testing all combinations of physical addresses for their RowClone success rate.
We evaluate rcc and rci operations to understand (i) the copy/initialization throughput improvements provided by rcc and rci over CPU-copy operations performed by the Rocket core and (ii) the overheads introduced by end-to-end support for commodity DRAM-based PuM operations. We test two configurations:
(1) Bare-Metal. We assume that RowClone operations always target data that are allocated correctly in DRAM (i.e., there is no overhead introduced by address translation, IRT accesses, and CLFLUSH operations). We directly issue RowClone operations via pumolib using physical addresses. CPU-copy operations also use physical addresses.
(2) No Flush. We assume that the programmer uses the alloc_align function to allocate the operands of RowClone operations. We use versions of the rcc and rci system calls that do not use CLFLUSH to flush the cache blocks of the source and destination operands of RowClone operations. We run the No Flush configuration on our custom supervisor software; rcc, rci, and traditional CPU-copy operations use virtual addresses.
5.5.2 Workloads.
For the two configurations, we run a microbenchmark that consists of two programs, copy and init, on our prototype. Both programs take the argument \(N\), where copy copies an \(N\)-byte array to another \(N\)-byte array and init initializes an \(N\)-byte array to all zeros. Both programs have two versions: (i) CPU-copy, which copies/initializes data using memory loads and stores, and (ii) RowClone, which uses RowClone operations to perform the copy/initialization. All programs use alloc_align to allocate data. The performance results we present in this section are the average of 1,000 runs. To maintain the same initial system state for both CPU-copy and RowClone, we flush all cache blocks before each one of the 1,000 runs. We run each program for array sizes (\(N\)) that are powers of two, from 8 KiB to 8 MiB, and find the average copy/initialization throughput across all 1,000 runs (by measuring the number of elapsed CPU cycles to execute the copy/initialization operations) for CPU-copy, RowClone-Copy (rcc), and RowClone-Initialize (rci).
We analyze the overheads of CLFLUSH operations on the copy/initialization throughput that rcc and rci can provide. We measure the execution time of CLFLUSH operations in our prototype to find how many CPU cycles it takes to flush a (i) dirty and (ii) clean cache block, on average across 1,000 measurements. We simulate various scenarios (described in Section 5.5.5) where we assume that a certain fraction of the operands of RowClone operations is cached and dirty.
5.5.3 Bare-Metal RowClone.
Figure 7(a) shows the throughput improvement provided by rcc and rci for copy and initialize over CPU-copy and CPU-initialization for increasing array sizes.
We make two major observations. First, we observe that rcc and rci provide significant throughput improvements over traditional CPU-copy and CPU-initialization. The throughput improvement provided by rcc ranges from 317.5\(\times {}\) (for 8-KiB arrays) to 364.8\(\times {}\) (for 8-MiB arrays). The throughput improvement provided by rci ranges from 172.4\(\times {}\) to 182.4\(\times {}\). Second, the throughput improvement provided by rcc and rci increases as the array size increases. This increase saturates when the array size reaches 1 MiB. The load/store instructions used by CPU-copy and CPU-initialization access the operands in a streaming manner. The eviction of dirty cache blocks (i.e., the destination operands of copy and initialization operations) interferes with other memory requests on the memory bus. We attribute the observed saturation at the 1-MiB array size to this interference on the memory bus.
5.5.4 No Flush RowClone.
We analyze the overhead in copy/initialization throughput introduced by system support. Figure 7(b) shows the throughput improvement of copy and initialization provided by rcc and rci operations.
We make two major observations. First, rcc improves the copy throughput by 58.3\(\times {}\) for 8-KiB and by 118.5\(\times {}\) for 8-MiB arrays, whereas rci improves the initialization throughput by 31.4\(\times {}\) for 8-KiB and by 88.7\(\times {}\) for 8-MiB arrays.
Second, we observe that the throughput improvement provided by rcc and rci improves non-linearly as the array size increases. The execution time (in Rocket core clock cycles) of rcc and rci operations (not shown in Figure 7(b)) does not increase linearly with the array size. For example, the execution time of rcc is 397 and 584 cycles at 8-KiB and 16-KiB array sizes, respectively, resulting in a 1.47\(\times {}\) increase in execution time between 8-KiB and 16-KiB array sizes. However, the execution time of rcc is 92,656 and 187,335 cycles at 4-MiB and 8-MiB array sizes, respectively, resulting in a 2.02\(\times {}\) increase in execution time between 4-MiB and 8-MiB array sizes. We make similar observations on the execution time of rci. For every RowClone operation, rcc and rci walk the page table to find the physical addresses corresponding to the source (rcc) and the destination (rcc and rci) operands. We attribute the non-linear increase in rcc and rci's execution time to (i) the locality exploited by the Rocket core in accesses to the page table and (ii) the diminishing constant cost in the execution time of both rcc and rci due to common instructions executed to perform a system call.
5.5.5 CLFLUSH Overhead.
We find that our implementation of CLFLUSH takes 45 Rocket core clock cycles to flush a dirty cache block and 6 Rocket core cycles to invalidate a clean cache block. We estimate the throughput improvement of rcc and rci including the CLFLUSH overhead. We assume that all cache blocks of the source and destination operands are cached and that a fraction of the cached blocks is dirty (quantified on the \(x\) axis). We do not include the overhead of accessing the data (e.g., by using load instructions) after the data gets copied in DRAM. Figure 8(a) shows the estimated improvement in copy and initialization throughput that rcc and rci provide for 8-MiB arrays.
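The estimate can be expressed with a simple first-order model (our formulation of the assumptions above): for an operand that spans \(B\) cache blocks, of which a fraction \(d\) is dirty, flushing the operand costs approximately \(B \cdot (45d + 6(1-d))\) Rocket core cycles; rcc pays this cost for both operands, whereas rci pays it only for the destination operand, and this cost is added to the copy/initialization time before recomputing throughput.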
We make three major observations. First, even with inefficient cache flush operations, rcc and rci provide 3.2\(\times {}\) and 3.9\(\times {}\) higher throughput than the CPU-copy and CPU-initialization operations, respectively, assuming 50% of the cache blocks of the 8-MiB source operand are dirty. Second, as the fraction of dirty cache blocks increases, the throughput improvement provided by both rcc and rci decreases (down to 1.9\(\times {}\) for rcc and 2.3\(\times {}\) for rci at a 100% dirty cache block fraction). Third, we observe that rci can provide a better throughput improvement than rcc when we include the CLFLUSH overhead. This is because rci flushes the cache blocks of only one operand (the destination), whereas rcc flushes the cache blocks of both operands (the source and the destination).
We do not study the distribution of dirty cache block fractions in real applications as that is not the goal of our CLFLUSH overhead analysis. However, if a large dirty cache block fraction causes severe overhead in a real application, then the system designer or the user of the system would likely decide not to offload the operation to PuM (i.e., performing rcc operations instead of CPU-Copy). PiDRAM’s prototype can be useful for studies on different PuM system integration aspects, including such offloading decisions.
We observe that the CLFLUSH operations are inefficient in supporting coherence for RowClone operations. Even so, we see that RowClone-Copy and RowClone-Initialize provide throughput improvements ranging from 1.9\(\times {}\) to 14.6\(\times {}\). We expect the throughput improvement benefits to increase as coherence between the CPU caches and PIM accelerators becomes more efficient with new techniques [21, 22, 143].
5.5.6 Real Workload Study.
The benefit of rcc and rci on a full application depends on what fraction of execution time is spent on bulk data copy and initialization. We demonstrate the benefit of rcc and rci on the forkbench [145] and compile [145] workloads with varying fractions of time spent on bulk data copy and initialization to show that our infrastructure can enable end-to-end execution and estimation of benefits on real workloads. We study forkbench in detail to demonstrate how the benefits vary with the time spent on data copying in the baseline for this workload.
Forkbench first allocates N memory pages and copies data to these pages from a buffer in the process’s memory and then accesses 32K random cache blocks within the newly allocated pages to emulate a workload that frequently spawns new processes. We evaluate forkbench under varying bulk data copy sizes where we sweep N from 8 to 2,048.
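A reconstruction of the forkbench microbenchmark from this description is sketched below (our sketch, not the original source); the RowClone version would replace malloc and memcpy with alloc_align and rcc.

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE        4096
#define CACHE_BLOCK_SIZE 64
#define NUM_ACCESSES     (32 * 1024)

/* Sketch of forkbench: copy N pages from the parent buffer, then touch
 * 32K random cache blocks within the newly allocated pages. */
void forkbench(size_t n_pages, const char *parent_buf)
{
    size_t bytes = n_pages * PAGE_SIZE;
    char *child = malloc(bytes);            /* alloc_align() in the RowClone version */
    memcpy(child, parent_buf, bytes);       /* rcc() in the RowClone version         */

    for (int i = 0; i < NUM_ACCESSES; i++) {
        size_t block = (size_t)rand() % (bytes / CACHE_BLOCK_SIZE);
        child[block * CACHE_BLOCK_SIZE]++;  /* random cache-block access             */
    }
    free(child);
}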
Compile first zero-allocates (calloc or rci) two pages (8 KiB) and then executes a number of arithmetic and memory instructions to operate on the zero-allocated data. We carefully develop the compile microbenchmark to maintain a realistic ratio between the number of arithmetic and memory instructions executed and the number of zero-allocation function calls made, which we obtain by profiling gcc [109]. We use the No-Flush configuration of our RowClone implementation for both forkbench and compile.
Figure 8(b) plots the speedup provided by rcc over the CPU-copy baseline, and the proportion of time spent on memcpy functions by the CPU-copy baseline, for various configurations of forkbench on the \(x\) axis.
Forkbench. We observe that RowClone-Copy can significantly improve the performance of forkbench, by up to 42.9%. RowClone-Copy's performance improvement increases as the number of pages copied increases, because the copy operations accelerated by rcc contribute a larger fraction of the workload's total execution time. The memcpy function calls take 86% of the CPU-copy baseline's time during forkbench execution for N = 2048.
Compile. RowClone-Initialize improves the performance of compile by 9%. Only an estimated 17% of the execution time of compile is spent on zero-allocation by the CPU-initialization baseline; rci reduces the overhead of zero-allocation by (i) performing in-DRAM bulk initialization and (ii) executing a smaller number of instructions.
Libquantum. To demonstrate that PiDRAM can run real workloads, we run a SPEC2006 [153] workload (libquantum). We modify the calloc (allocates and zero-initializes memory) function call to allocate data using alloc_align and to initialize data using rci for allocations that are larger than 8 KiB.
Using rci to bulk-initialize data in libquantum improves end-to-end application performance by 1.3% (compared to the baseline that uses CPU initialization). This improvement is brought by rci, which initializes a total of 512 KiB of memory using RowClone operations. We note that the proportion of store instructions executed by libquantum to initialize arrays in the CPU-initialization baseline is only 0.2% of all dynamic instructions in the libquantum workload, which amounts to an estimated 2.3% of the total runtime of libquantum. Thus, the 1.3% end-to-end performance improvement provided by rci is reasonable, and we expect it to increase with the initialization intensity of workloads.
Summary. We conclude from our evaluation that end-to-end implementations of RowClone (i) can be efficiently supported in real systems by employing memory allocation mechanisms that satisfy the memory alignment, mapping, and granularity requirements (Section 5.1) of RowClone operations, (ii) can greatly improve copy/initialization throughput in real systems, and (iii) require cache coherence mechanisms (e.g., PIM-optimized coherence management [21, 22, 143]) that can flush dirty cache blocks of RowClone operands efficiently to achieve optimal copy/initialization throughput improvement. PiDRAM can be used to estimate the end-to-end workload execution benefits provided by RowClone operations. Our experiments using libquantum, forkbench, and compile show that (i) PiDRAM can run real workloads, (ii) our end-to-end implementation of RowClone operates correctly, and (iii) RowClone can improve the performance of real workloads in a real system, even when inefficient CLFLUSH operations are used to maintain memory coherence.
7 Extending PiDRAM
We briefly describe the modifications required to extend PiDRAM (i) with new DRAM commands and DRAM timing parameters, (ii) with new case studies, and (iii) to support new FPGA boards.
New DRAM Commands and Timing Parameters. Implementing new DRAM commands or modifying DRAM timing parameters requires modifications to PiDRAM's memory controller. This is straightforward because PiDRAM's memory controller has a modular Verilog design with well-defined interfaces: It is composed of multiple modules that perform separate tasks. For example, the memory request scheduler comprises two main components: (1) the command timer and (2) the command scheduler. To serve LOAD and STORE memory requests, the command scheduler maintains state (e.g., which row is active) for every bank. The command scheduler selects the next DRAM command needed to satisfy the LOAD or STORE memory request and queries the command timer with the selected DRAM command. The command timer checks all applicable standard DRAM timing constraints and outputs a valid bit if the selected command can be issued in that FPGA clock cycle. To extend the memory controller with a new standard DRAM command (e.g., to implement a newer standard such as DDR4 or DDR5), a PiDRAM developer simply needs to (i) add a new timing constraint by replicating the logic in the command timer and (ii) extend the command scheduler to correctly maintain the bank state.
New Case Studies. Implementing new techniques (e.g., those listed in Table 2) to perform new case studies requires modifications to PiDRAM's hardware and software components. We describe the required modifications over an example ComputeDRAM-based in-DRAM bitwise operations case study. To implement ComputeDRAM-based in-DRAM bitwise operations, the developers need to (i) extend the custom command scheduler in PiDRAM's memory controller with a new state machine that schedules new DRAM command sequences (ACT-PRE-ACT) with an appropriate set of violated timing parameters (our ComputeDRAM-based in-DRAM copy implementation provides a solid basis for this), (ii) expose the functionality to the processor by implementing new PiDRAM instructions in the PuM controller (e.g., by replicating and customizing the existing logic for decoding and executing RowClone operations), and (iii) make modifications to the software library to expose the new instruction to the programmer (e.g., by replicating the copy_row function's behavior, described in Table 1).
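For step (iii), the new library function can mirror pumolib's copy_row(); a minimal sketch under assumed (hypothetical) POC register offsets, opcode encoding, and function name is shown below.

#include <stdint.h>

/* Hypothetical POC register indices and opcode; the real values depend on
 * how the new PiDRAM instruction is encoded in the PuM controller. */
#define POC            ((volatile uint64_t *)0x80001000UL)
#define POC_CMD        0
#define POC_SRC0       1
#define POC_SRC1       2
#define OP_INDRAM_AND  0x4

/* Analogous to copy_row(): trigger an in-DRAM bitwise AND of two DRAM rows.
 * The memory controller's new state machine issues the ACT-PRE-ACT sequence
 * with violated timing parameters when it receives this command. */
void and_rows(uint64_t src_row0_pa, uint64_t src_row1_pa)
{
    POC[POC_SRC0] = src_row0_pa;
    POC[POC_SRC1] = src_row1_pa;
    POC[POC_CMD]  = OP_INDRAM_AND;
}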
Porting to New FPGA Boards. Developing new PiDRAM prototypes on different FPGA boards could require modifications to design constraints (e.g., mapping top-level inputs/outputs to physical FPGA pins) and to the DDRx PHY IP, depending on the FPGA board. Modifying design constraints is a straightforward task involving looking up the FPGA manufacturer datasheets and modifying design constraint files [164]. Manufacturers may provide different DDRx PHY IPs for different FPGAs. Fortunately, these IPs typically expose similar interfaces (based on the DFI standard [33]) to user hardware (in our case, to PiDRAM's memory controller). Thus, other PiDRAM prototypes on different FPGA boards can be developed with small yet careful modifications to the ZC706 prototype design we provide.
8 Related Work
To our knowledge, this is the first work to develop a flexible, open source framework that enables integration and evaluation of commodity DRAM-based PuM techniques on real DRAM chips by providing the necessary hardware and software components. We demonstrate the first end-to-end implementation of RowClone and D-RaNGe using real DRAM chips. We compare the features of PiDRAM with other state-of-the-art prototyping and evaluation platforms in Table 4 and discuss them below. The four features we use for comparison are as follows: (1) Interface with real DRAM chips: The platform allows running experiments using real DRAM chips. (2) Flexible memory controller for PuM: The platform provides a flexible memory controller that can easily be extended to perform (e.g., as in PiDRAM) or emulate (e.g., as in PiMulator [119]) new PuM operations. (3) System software support: The platform provides support for running system software such as operating systems or supervisor software (e.g., RISC-V PK [135]). (4) Open source: The platform is available as open source software.
Silent-PIM [78]. Silent-PIM proposes a new DRAM design that incorporates processing units capable of vector arithmetic computation. Silent-PIM's goal is to evaluate PIM techniques on a new, PIM-capable DRAM device using standard DRAM commands (e.g., as defined in DDR4 [71]); it does not provide an evaluation platform or prototype. In contrast, PiDRAM is designed for rapid integration and evaluation of PuM techniques that use real DRAM devices. PiDRAM provides key hardware and software components that facilitate end-to-end implementations of PuM techniques.
SoftMC [52, 60]. SoftMC is an FPGA-based DRAM testing infrastructure. SoftMC can issue arbitrary sequences of DDR3 commands to real DRAM devices. SoftMC is widely used in prior work that studies the performance, reliability, and security of real DRAM chips [13, 14, 28, 38, 41, 50, 59, 77, 83, 85, 96, 127, 154]. SoftMC is built to test DRAM devices, not to study end-to-end implementations of PuM techniques. Thus, SoftMC (i) does not support application execution on a real system and (ii) cannot use DRAM modules as main memory. While SoftMC is useful in studies that perform an exhaustive search over all possible sequences of DRAM commands to potentially uncover undocumented DRAM behavior (e.g., ComputeDRAM [44], QUAC-TRNG [123]), PiDRAM is developed to study end-to-end implementations of PuM techniques. PiDRAM provides an FPGA-based prototype that comprises a RISC-V system and supports using DRAM modules both for storing data (i.e., as main memory) and for performing PuM computation.
ComputeDRAM [44]. ComputeDRAM partially demonstrates that two state-of-the-art DRAM-based PuM techniques, RowClone [145] and Ambit [147], are already possible on real off-the-shelf DDR3 chips. ComputeDRAM uses SoftMC to demonstrate in-DRAM copy and bitwise AND/OR operations on real DDR3 chips. ComputeDRAM's goal is not to develop a framework that facilitates end-to-end implementations of PuM techniques. Therefore, it does not provide (i) a flexible memory controller for PuM or (ii) support for system software. PiDRAM provides the necessary software and hardware components to facilitate end-to-end implementations of PuM techniques.
MEG [173]. MEG is an open source system emulation platform that enables FPGA-based operation interfacing with High-Bandwidth Memory (HBM). MEG aims to efficiently retrieve data from HBM and perform the computation in the host processor, which is implemented as a soft core on the FPGA. Unlike PiDRAM, MEG does not implement a flexible memory controller that is capable of performing PuM operations. We demonstrate the flexibility of PiDRAM by implementing two state-of-the-art PuM techniques [83, 145]. We believe MEG and PiDRAM can be combined to get the functionality and prototyping power of both works.
PiMulator [119]. PiMulator is an open source PiM emulation platform. PiMulator implements a main memory and a PiM model using SystemVerilog, allowing FPGA emulation of PiM architectures. PiMulator enables easy emulation of new PiM techniques. However, it does not allow end-to-end execution of workloads that use PiM techniques, and it does not provide the user with full control over the DRAM interface.
Commercial Platforms (e.g., ZYNQ [165]). Some commercial platforms implement CPU-FPGA heterogeneous computing systems. A memory controller is provided to access DRAM as the main memory in such systems. However, in such systems, (i) there is no support for PuM mechanisms and (ii) the entire hardware-software stack is closed source. PiDRAM can be integrated into these systems, using the closed source computing system as the main processor. Our prototype utilizes an open source system-on-chip (Rocket Chip [11]) as the main processor, which enables developers to study architectural and microarchitectural aspects of PuM techniques (e.g., data allocation and coherence mechanisms). Such studies cannot be conducted using closed source computing systems.
Simulators. Many prior works propose full-system (e.g., References [18, 132]), trace-based (e.g., References [64, 90, 139, 168, 169, 174]), and instrumentation-based (e.g., References [35, 64, 168]) simulators that can be used to evaluate PuM techniques. Although useful, these simulators do not model the behavior of real DRAM chips and cannot integrate proprietary device characteristics (e.g., DRAM internal address mapping) into their simulations without conducting a rigorous characterization study. Moreover, the effects of environmental conditions (e.g., temperature and voltage) on DRAM chips are unlikely to be modeled in accurate, full-system simulators, as doing so would require excessive computation, which would negatively impact the already poor performance (200K instructions per second) of full-system simulators [140].
In contrast, PiDRAM interfaces with real DRAM devices, and its prototype achieves a 50-MHz clock speed (which can be improved further), which lets PiDRAM execute >10M instructions per second (assuming <5 cycles per instruction). PiDRAM can be used to study end-to-end implementations of PuM techniques and to explore solutions that take into account the effects related to the environmental conditions of real DRAM devices. Future versions of PiDRAM could be easily extended (e.g., with real hardware that allows controlling DRAM temperature and voltage [115, 156]) to experiment with different DRAM temperature and voltage levels to better understand the effects of these environmental conditions on the reliability of PuM operations. Using PiDRAM, experiments that require executing real workloads can take an order of magnitude shorter wall clock time compared to using full-system simulators.
Other Related Work. Prior works (see Section 2.2) (i) propose, or (ii) demonstrate using real DRAM chips, several DRAM-based PuM techniques that can perform computation [6, 28, 40, 54, 144, 146, 147, 149, 150], move data [145, 159], or implement security primitives [13, 14, 82, 83, 124, 126] in memory. SIMDRAM [54] develops a framework that provides a programming interface to perform in-DRAM computation using the majority operation. DR-STRANGE [23] proposes an end-to-end system design for DRAM-based true random number generators. None of these works provide an end-to-end in-DRAM computation framework that is integrated into a real system using real DRAM chips. We conclude that existing platforms cannot substitute for PiDRAM in studying commodity DRAM-based PuM techniques.