4.1 Synthetic Workloads
Hardware—To assess the benefits of CMRI, we choose the NVIDIA Jetson TX2 as a reference HeSoC. The block diagram of the architecture is shown in Figure 4. The host processor is organized in two compute clusters: a quad-core ARMv8 Cortex-A57, and a dual-core ARMv8-compliant DENVER processor (an NVIDIA proprietary design). In the following, we refer to the former as the ARM cluster, and to the latter as the DENVER cluster. Each of the A57 cores features a 32 KiB L1 data cache and a 48 KiB L1 instruction cache, while each DENVER core features a 64 KiB L1 data cache and a 128 KiB L1 instruction cache. The cores in each cluster locally share a 2 MiB L2 cache.
The main system memory is an 8 GiB LPDDR4 128-bit DRAM with a total bandwidth of 59.7 GiB/s, needed to sustain requests coming from the two clusters and the GPU.
System Software—The Tegra X2 (TX2) is flashed with NVIDIA JetPack 4.3, featuring Linux for Tegra with kernel version 4.9.140. The System on a Chip (SoC) power management and frequency regulators, acting on the CPU and the memory controller, are set for maximum performance with the nvpmodel and jetson_clocks tools.
Jailhouse [17, 26] was chosen as a free and open-source (GPL-licensed) bare-metal hypervisor, as its design is focused on safety-critical and real-time environments. Jailhouse is built around the principle of static partitioning of hardware resources (i.e., it does not provide any support for resource overprovisioning). The main benefit is that the hypervisor is essentially absent after hardware initialization: VMs are allowed direct, unmediated, and exclusive access to the assigned hardware resources. We configured Jailhouse to host a benchmark bare-metal application on each core. Jailhouse offers unofficial support for several real-time-oriented experimental mechanisms for memory management, arbitration, and isolation. Amongst these is an implementation of MemGuard-core, which we adapt for the implementation of our bandwidth regulator.
Experiments—In the following, we describe two types of experiments:
(A) An evaluation of the overall effect of CMRI at the intra-cluster (both ARM and DENVER clusters) and inter-cluster levels.
(B) A comparison of VOLT and BWR in terms of degree of control of the injection and of their implementation overheads.
Concerning the second point, as we have explained in Section 3.2, an implementation of MemGuard can rely on refill and throttling routines that are executed in hypervisor space (i.e., in Arm Exception Level 2) and triggered by a hardware event: a timer for the refill, an IRQ raised by the Arm Performance Monitoring Unit (PMU) for the throttling. The NVIDIA proprietary DENVER cores on the TX2 have a custom PMU that does not fully conform to ARMv8. In particular, the cache miss counter used by MemGuard is not implemented, hence MemGuard cannot be readily implemented on the DENVER cores. For this reason, the first experiment only relies on VOLT as a CMRI technique. The comparison between VOLT and BWR is conducted within the ARM cluster.
Benchmarking Methodology—The focus of our experiments is on assessing the benefits of using CMRI on top of PREM workloads, under the most pessimistic conditions.
The main metrics of interest are: (i) the latency of the task currently allowed by PREM to access memory, i.e., the task under test (more precisely, the lengthening of its memory phase while other tasks generate controlled interference); (ii) the bandwidth utilization reached during the execution of these concurrent memory phases.
Note that both metrics are unaffected by whether compute phases are also executed in parallel. These metrics are also independent of the exact synchronization and scheduling mechanisms used to enforce mutually exclusive DRAM access. We analyze in depth what happens during a single memory phase, for all combinations of sequential and random memory access patterns. These are representative corner cases for what could happen in any real workload. As such, we explore both patterns on both the task under test and the other tasks injecting controlled amounts of memory requests.
The NVIDIA TX2, like most modern HeSoCs, features a number of hardware blocks aimed at improving the average-case memory transaction performance (cache prefetchers, DRAM row buffers, etc.). To control whether such mechanisms are exploited or bypassed, our micro-benchmark implements a simple pointer walk over a portion of a statically pre-allocated, large array (64 MiB) of data structures, each containing: (i) a pointer to the next address to read/write; (ii) padding, to fill the remainder of a whole L2 cache line. This is needed to effectively model sequential and random memory access patterns.
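The sketch below illustrates one way such a pointer walk can be laid out in C, assuming a 64-byte L2 cache line; names and sizes are illustrative and do not correspond to the actual mem_bench sources.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE  64                      /* L2 cache-line size on the TX2 */
#define ARRAY_SIZE (64u * 1024 * 1024)     /* 64 MiB, statically pre-allocated */
#define NUM_NODES  (ARRAY_SIZE / LINE_SIZE)

/* One node per cache line: a pointer to the next node plus padding that
 * fills the remainder of the line, so every hop touches a new line. */
typedef struct node {
    struct node *next;
    uint8_t pad[LINE_SIZE - sizeof(struct node *)];
} node_t;

static node_t pool[NUM_NODES] __attribute__((aligned(LINE_SIZE)));

/* Link the nodes in address order (SEQ pattern); linking them according to
 * a random permutation instead yields the RAN pattern, which defeats
 * prefetchers and DRAM row-buffer locality. Returns the head of the chain. */
node_t *init_chain_seq(void)
{
    for (size_t i = 0; i < NUM_NODES; i++)
        pool[i].next = &pool[(i + 1) % NUM_NODES];
    return &pool[0];
}

/* Chase 'count' pointers starting from 'head': one memory request per hop. */
node_t *walk(node_t *head, size_t count)
{
    while (count--)
        head = head->next;
    return head;    /* returned so the compiler cannot optimize the walk away */
}
```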
For each configuration of interfering-task workloads, we repeat our experiment twice: (i) once for measuring the bandwidth requested by the tasks, (ii) once for measuring the memory access latency experienced by the task under test. To obtain a meaningful bandwidth measurement, it is necessary to observe a relatively long time window; latency measurements, on the other hand, target individual PREM memory phases, which are relatively short.
We built a micro-benchmark, which we called mem_bench, able to handle both of the above types of measurement. This benchmark is an extension of the lat_mem_rd test from the LMBench suite [21]. Mem_bench can be configured to operate in two modes:
(1) mem_bench_LAT: We use this mode to measure the latency of the PREM task under test. In this case, the benchmark reads or writes the data specified as the PREM task memory phase only once. We run each experiment 100 times, and we take the worst-case value (95th percentile, to filter out outliers). This is the most accurate mode for latency measurements.
(2) mem_bench_BW: We use this mode to measure the average bandwidth request generated by the interference tasks and by the PREM task under test. In this case, the benchmark continuously reads or writes the data multiple times, to ensure the duration of the transfer is long enough (60 seconds) to allow for a stable and meaningful measurement. This is the most accurate mode for bandwidth measurements.
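Sketched below is how the two modes could wrap the pointer walk from the earlier sketch; the timing source and function names are assumptions for illustration (a bare-metal build would read the Arm generic timer rather than a POSIX clock).

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef struct node node_t;                  /* node type from the previous sketch */
node_t *walk(node_t *head, size_t count);    /* pointer chase from the previous sketch */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);     /* stand-in for the Arm generic timer */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* mem_bench_LAT-like mode: traverse the PREM memory phase exactly once and
 * report its latency; the caller repeats this 100 times and keeps the
 * 95th-percentile value as the reported worst case. */
uint64_t measure_latency_ns(node_t *head, size_t phase_lines)
{
    uint64_t t0 = now_ns();
    walk(head, phase_lines);
    return now_ns() - t0;
}

/* mem_bench_BW-like mode: traverse the data over and over for a long, fixed
 * interval (60 s in our setup) and report the average requested bandwidth. */
double measure_bandwidth(node_t *head, size_t phase_lines, uint64_t duration_ns)
{
    uint64_t t0 = now_ns(), bytes = 0;
    while (now_ns() - t0 < duration_ns) {
        walk(head, phase_lines);
        bytes += (uint64_t)phase_lines * 64;  /* one 64-byte line per hop */
    }
    return (double)bytes * 1e9 / (double)(now_ns() - t0);
}
```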
To model the worst-case interference that the interference tasks can generate, we execute them for the whole time that the task under test is running. In a real system, the interference suffered by the task under test would in general be smaller, and the benefits of CMRI higher.
4.1.1 Experiment A: Assessing the Benefits of CMRI.
Considering the cluster-based nature of the target hardware, we conduct our experiments in three settings:
(1) inside the ARM cluster;
(2) inside the DENVER cluster;
(3) across the two compute clusters.
For experiments 1 and 2, we consider a single task under test (UT), which represents the task that a regular PREM scheme would grant exclusive access to memory. On top of that, we explore the effect of allowing CMRI from a number of interference tasks (IF), mapped on the remaining cores of the same compute cluster (each task is pinned to a different core). For these, we vary (with exponential spacing) the load intensity from near-zero to 100%.
The size of the ARM and DENVER cores’ memory phases is set to 512 KiB and 1024 KiB, respectively (the size of the L2 cache divided by the number of cores in each cluster, which is a fine-grained choice for the memory phase size, not manageable by tools such as MemGuard).
The third set of experiments studies the effect of interference between the two compute clusters. One ARM core and one DENVER core each run a UT task. We run four IF tasks on the remaining cores (three Cortex-A57, one DENVER) from both clusters. The GPU is active and generates as many sequential memory requests as possible, since we want to reproduce the worst possible situation for the UT task, with all the available engines requesting the maximum bandwidth at the same moment.
To stress all the possible corner cases, we configure mem_bench to generate eight different combinations for UT and IF tasks, considering both read-only and read+write workloads, and both sequential (SEQ) and random (RAN) access patterns. The workloads of the IF tasks are implemented by means of mem_bench_BW, while the UT task is run twice per experiment with different configurations: mem_bench_LAT when measuring latency and mem_bench_BW when measuring bandwidth. From our experiments, we observed that, independent of whether we used read-only or read+write workloads, the range of applicability and the magnitude of the effects of CMRI are essentially the same. Consequently, due to lack of space (and for the sake of presentation), in the following we only present and discuss results for reads.
In detail, we consider the following four combinations of access patterns:
(1) UT = SEQ, IF = SEQ;
(2) UT = SEQ, IF = RAN;
(3) UT = RAN, IF = SEQ;
(4) UT = RAN, IF = RAN.
For each of these combinations, our plots show a breakdown of the bandwidth usage among the different cores (stacked areas, to be read on the left Y-axis) and the latency increase experienced by the UT task (lines with markers, to be read on the right Y-axis) as we increase the amount of injected IF traffic by controlling the load intensity through the values of A and C. Specifically, we select \(A=1\) (a single cache line is read) to allow for the finest granularity, whereas C (the number of NOPs between one cache line read/write and the next) is varied from 0 to \(4,\!500\) in discrete steps at exponentially increasing distances, for a total of around 100 points.
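For illustration, the core of a VOLT interference loop could look like the sketch below (reusing the walk routine and node type from the mem_bench sketch; this is not the actual benchmark source): each iteration reads A cache lines and then backs off for C NOPs, so C directly sets the load intensity.

```c
#include <stddef.h>

typedef struct node node_t;                  /* node type from the mem_bench sketch */
node_t *walk(node_t *head, size_t count);

/* VOLT-style interference loop: each iteration reads A cache lines from the
 * pointer chain and then executes C NOPs. C = 0 corresponds to 100% load
 * intensity; larger C dilutes the request stream and lowers the injection
 * rate. */
void volt_inject(node_t *head, unsigned a, unsigned c)
{
    node_t *p = head;
    for (;;) {
        p = walk(p, a);                      /* A memory requests (A = 1 in Experiment A) */
        for (unsigned j = 0; j < c; j++)
            __asm__ volatile("nop");         /* voluntary back-off: C NOPs */
    }
}
```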
As the main metric to measure CMRI we introduce the concept of injection rate (shown on the top X-axis of every plot), defined as the ratio between the bandwidth requested by an IF task for a given load intensity and the bandwidth obtainable at full throttle (i.e., load intensity = 100%) by the same task. We sweep the injection rate in 10% intervals (20%–100%). Below 20%, we increase the sampling resolution (at 4%, 8%, 12%, and 16%) to observe the effects in an area where the granularity of the technique becomes really fine (and thus we expect to have less control/more overhead). The load intensity corresponding to each injection rate value is also reported in the plots for reference (bottom X-axis). Note that we report two load intensity values at 100% injection rate: the first is the minimum load intensity that achieves full injection, the second is 100% load intensity. The closer these two values, the less margin CMRI has to saturate the bandwidth.
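In formula form (our notation, matching the definition above), the injection rate of an IF task at load intensity \(l\) can be expressed as:
\[
\mathit{injection\ rate}(l) = \frac{BW_{\mathrm{IF}}(l)}{BW_{\mathrm{IF}}(l = 100\%)}.
\]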
The plots also show a red line, representing an upper bound to the tolerated increase in the UT task latency, which we set to 10%. This makes it easy to spot how much CMRI can be used in each configuration before this point is reached.
Intra-cluster: ARM Cluster.
Figure 5 shows the plots for CMRI usage inside the ARM cluster under all the considered combinations of traffic types. Focusing on bandwidth usage (stacked areas), in all the plots it is evident that a significant portion of the available bandwidth, which is not exploited by the PREM task alone (as shown when the injection rate is at its minimum), can be utilized by CMRI (as shown as we move to the right). By combining the information from the bandwidth areas and the latency curve, it is possible to identify the sweet spot where, for each traffic pattern combination, we can gain the most.
When both the task under test and the interference traffic feature a sequential access pattern, which is the case most sensitive to interference, the cumulative bandwidth request from all the ARM cores (100% injection rate) is limited to only around 6.6 GB/s. The plot reveals that this is due to system-level bandwidth capping, as the latency increase at this point is at \(2.5\times\) (a single core in isolation utilizes 66% of that maximum bandwidth, as shown for 0% injection). Increasing CMRI up to 30% keeps the latency increase within the 10% threshold. Note that a 30% injection rate, corresponding to \(\approx\)0.8% load intensity, allows reaching 83% of the maximum bandwidth allowed within the ARM cluster (6.6 GB/s) with little impact on the execution time of the UT task.
When the IF workload is of type RAN, the SEQ UT task is never perturbed, even with 100% CMRI. This brings a 26% increase in the overall ARM cluster bandwidth usage, equivalent to 80% of the maximum. When the UT task features RAN traffic, it is mostly insensitive to the IF tasks activity. When IF tasks employ SEQ traffic, CMRI enables a tremendous \(9.4\times\) improvement on the cluster memory bandwidth usage, reaching 94% of the maximum. Also when the IF tasks employ RAN traffic the bandwidth requests can be fully summed up, increasing bandwidth usage by \(3.22\times\) without impacting the execution time of the UT task.
Intra-cluster: DENVER Cluster.
Figure 6 shows the results for CMRI within the DENVER cluster. The benefits of CMRI here are even more pronounced, as a single DENVER core in isolation uses only slightly more than half the available bandwidth for the cluster (as opposed to 66% for a single ARM core in the quad-core ARM cluster). Focusing on the worst-case SEQ–SEQ experiment, it can be seen that the increase in the latency of the UT task stays below 10% for CMRI as high as 80%. Bandwidth usage nearly doubles, reaching close to full efficiency.
Similar to the ARM cluster, there is not a lot to be gained when the IF tasks are of type RAN and the UT task is of type SEQ, as the bandwidth generated by the random access pattern is an order of magnitude smaller than that of the sequential pattern on the DENVER core. Intuitively, the latency increase should stay well below 10% and nearly unmodified in this case, even for 100% CMRI, as shown by the modest increase in bandwidth. However, it must be noted that the latency measurements for this experiment (and only for this experiment) suffered from very high variance, which is probably to be attributed to a peculiar yet systematic effect of a noisy setup.
Similar results to the ARM cluster apply when the UT task is of type RAN, where CMRI enables huge improvements in intra-cluster bandwidth usage. If the IF task is of type SEQ, we increase bandwidth usage by \(12\times\), and if the IF task is of type RAN, we can double the bandwidth usage, in both cases without impacting the UT task at all.
Inter-cluster: SoC-level Effect of CMRI. After studying the benefits of CMRI within each compute cluster in isolation, we experiment with inter-cluster interference. We first elect a single ARM or DENVER core, in turn, to host the UT task, and we place the IF tasks only on the cores belonging to the other cluster. Thus, when one DENVER core hosts the UT task, the IF tasks run on the ARM cluster (the second DENVER core is inactive), and vice-versa. We do not show plots for this experiment (which is preliminary to the following insights), but we discuss here the most important findings.
When a DENVER core hosts the UT task, its latency is virtually unmodified (\(\lt\)5%) independent of the injection rate of the IF tasks running on the ARM cores. When an ARM core hosts the UT task, its latency is more susceptible to the activity of the IF tasks (running on the DENVER cores), but the variation always stays below 10% if the load intensity stays within 33%.
These findings suggest that the best way to support PREM on a platform of this type is to always allow one core from each cluster to access memory. This still leaves plenty of room for better bandwidth exploitation, which CMRI can achieve.
Figure 7 shows the results for an experiment where an ARM core and a DENVER core both run a UT task, while the remaining cores from both clusters run IF tasks. At the system level, the benefits of CMRI already observed within each cluster are confirmed. When both the UT and IF tasks run SEQ traffic, we can tune the IF tasks to inject at a rate of up to 16% to avoid perturbing the ARM-UT task beyond the 10% latency increase threshold. This still brings a 13% increase in bandwidth usage compared to only allowing a single Cortex-A57 and a single DENVER core to access memory (a basic PREM scheme). The DENVER-UT task can tolerate up to a 40% injection rate, bringing the bandwidth usage improvement to 31%. If the IF tasks are of type RAN, no significant interference is generated on the UT tasks of either cluster. The benefits are more modest (as already observed within the individual clusters), with an increase in bandwidth usage of around 11%. Note that the values in the latency curves (in particular for the DENVER) might sometimes drop below one. This is due to the fact that, as a baseline value for normalization, we consider the latency measured when a single core from a given cluster executes (SEQ or RAN) while all cores from the other cluster (and the GPU) execute SEQ IF tasks, as explained at the beginning of this section. In this particular experiment, we have less interference coming from the other cluster, as one of its cores hosts the UT task instead of an IF one. As a consequence, particularly when the UT task is of type RAN, the amount of interference generated for the other cluster is smaller.
When the UT tasks are of type RAN, we again see the highest potential for making better use of the available bandwidth. Independent of whether the IF tasks are of type SEQ or RAN, they can inject at full throttle, achieving an improvement in system-level bandwidth usage of \(9.5\times\) and \(2.65\times\), respectively.
4.1.2 Experiment B: A Comparison of VOLT and BWR Implementations.
The second experiment set is aimed at comparing the two CMRI techniques that we have described, to assess their effectiveness at controlling the injection rate and their sensitivity to implementation overheads. For this reason, differently from the setup of the previous experiment, we consider several values for a number of system parameters. The sizes of the memory phases of the PREM tasks are considered in three variants, to cover several possible use cases, in particular:
(1) largest at 512 KiB: This equals ¼ of the L2 cache size on the ARM cluster, and represents the same maximum-size heuristic used in the previous experiment;
(2) medium at 32 KiB: This equals the size of the L1 cache for the ARM cores. It is representative of a PREM scheme implemented at the L1 rather than at the L2;
(3) smallest at 8 KiB: This represents the smallest reasonable size, below which the measured overheads to implement memory arbitration would dominate (see timing measurements later).
Concerning VOLT, similar to what we have already done in the previous experiment, we vary C to produce a varying load intensity as a means of controlling the injection rate. In addition, in this experiment, we also consider different values for A, as we will explain later on.
Concerning BWR, for any bandwidth limitation target there exist several \(\langle budget,period \rangle =\langle B,T \rangle\) configuration pairs that express it: a correlated increase of budget and period leaves the bandwidth unchanged. Intuitively, the shorter the period, the finer-grained the control, since the budget is replenished more frequently. On the other hand, shortening the period makes the technique more sensitive to overheads, because the execution latency of the refill routine, say \(L_\text{refill}\), increasingly eats into the time available for memory usage, i.e., \(T-L_\text{refill}\).
The budget may also become meaninglessly large for the given period. For any \(T\), there exists an upper-bound budget \(B^{T}\) such that, for any \(B \ge B^{T}\), the bandwidth limitation obtained with \(\langle B,T \rangle\) is approximately equal to that of \(\langle B^{T},T \rangle\). This upper-bound budget is essentially the maximum number of memory requests that the platform is able to serve within a time \(T-L_\text{refill}\). The budget rate is thus defined as \(B/B^{T}\).
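As a rough formalization (our own, under the simplifying assumption that each budget unit accounts for one L2 cache-line transaction of size \(S\)), the bandwidth allowed by a \(\langle B,T \rangle\) pair and the budget rate can be written as:
\[
BW(B,T) \approx \frac{\min(B, B^{T}) \cdot S}{T}, \qquad \mathit{budget\ rate} = \frac{B}{B^{T}}.
\]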
To tune MemGuard as an injection mechanism, we benchmarked some key figures of the Jailhouse implementation on the ARM cores. We started by profiling \(L_\text{refill}\) over 10 K test runs. The test routine has been implemented as a bare-metal application, which measures the following steps: (i) interruption by the MemGuard server in EL2; (ii) budget refill; and (iii) return to EL1 execution. We obtained \(0.593\) µs in the average case, and \(1.75\) µs in the worst one.
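A minimal sketch of one way such a measurement could be taken from EL1 is shown below, using the ARMv8 generic timer; the actual Jailhouse test routine may differ, and the function names are illustrative.

```c
#include <stdint.h>

/* Read the Armv8 virtual counter and its frequency (EL0-accessible). */
static inline uint64_t read_cntvct(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
    return v;
}

static inline uint64_t read_cntfrq(void)
{
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
}

/* Spin on the counter and record the largest gap between two consecutive
 * reads: in a bare-metal cell the only event that steals the core is the
 * MemGuard refill timer, so the largest gap approximates the EL2 trap +
 * budget refill + return-to-EL1 latency (L_refill). */
uint64_t max_refill_gap_ns(uint64_t iterations)
{
    uint64_t prev = read_cntvct(), max_gap = 0;

    for (uint64_t i = 0; i < iterations; i++) {
        uint64_t now = read_cntvct();
        if (now - prev > max_gap)
            max_gap = now - prev;
        prev = now;
    }
    return (max_gap * 1000000000ull) / read_cntfrq();  /* ticks -> ns */
}
```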
These timing figures allow us to choose three testing values for T.
— Short. \(T = 2\) µs: the minimum value capable of accommodating at least the refill routine.
— Medium. \(T = 8\) µs: approximately 0.5 µs below the isolated latency of the medium PREM phase (32 KiB).
— Long. \(T = 32\) µs: close to \(0.25\times\) the isolated latency of the large PREM phase (512 KiB).
For each value \(T^{\prime }\) of T, we determined \(B^{T^{\prime }}\) and chose \(n\) values for \(B \in (1,B^{T^{\prime }})\) such that \(n \gtrsim 100\) and the distribution within this interval is exponentially decreasing with the budget (i.e., sampling is denser at smaller budgets).
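One possible way to generate such a point set, purely as an illustration of the spacing (not the scripts actually used), is sketched below; geometric spacing makes the samples denser toward small budgets.

```c
#include <math.h>
#include <stdio.h>

/* Emit ~n candidate budgets in (1, b_max], geometrically spaced so that the
 * sampling density decreases as the budget grows. Duplicates appearing at
 * the low end after rounding can simply be dropped. */
static void budget_points(unsigned b_max, unsigned n)
{
    for (unsigned i = 1; i <= n; i++) {
        double x = (double)i / (double)n;              /* uniform in (0, 1] */
        unsigned b = (unsigned)ceil(pow((double)b_max, x));
        printf("%u\n", b);
    }
}

int main(void)
{
    budget_points(100, 100);   /* e.g., B in (1, B^T] with B^T = 100 for T = 2 us */
    return 0;
}
```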
Table 2 summarizes the configurations considered for this experiment. Due to space limits, and to simplify the discussion, we only show results for IF traffic of type SEQ, where CMRI is most effective, as we have learned in the previous section. The VMs running the IF tasks are started first by Linux 4 Tegra (L4T), which is allocated on a DENVER core; then the VM hosting the UT task is spawned. This guarantees the cleanest setup for the measurement. To minimize noise, L4T runs in a headless configuration, with only Ethernet and UART available as I/O channels. Task output logs are emitted only after the end of the test. L4T also handles the submission of a GPU task that performs continuous SEQ requests on the main memory, as previously described.
Discussion. Figures 8 and 9 show the comparison plots between VOLT and BWR. The first refers to the configuration with UT = SEQ and IF = SEQ traffic, the second to UT = RAN and IF = SEQ. The plots are organized in a \(3 \times 3\) grid: the rows refer to the three memory phase sizes, the leftmost column refers to VOLT with \(A=1\), and the remaining two columns refer to BWR with different T or to VOLT with different A.
The first thing that we can observe, focusing on a row and looking at the plots from left to right, is that both the cumulative bandwidth and the UT latency slightly decrease. This is an indication that BWR becomes more sensitive to the overheads as the period T is shortened, which confirms the initial intuition. When the UT task generates SEQ traffic (Figure 8), this is not particularly relevant, since injection rates higher than 20% produce unacceptable latency degradation. On the other hand, when the UT task generates RAN traffic (Figure 9), BWR uses CMRI slightly less effectively than VOLT if the period is too small. The other intuition, that a shorter period should provide finer-grained injection control, is instead completely disproved. The obtained injection rate is, on the contrary, coarser grained, because the shorter period imposes a smaller upper-bound budget, and this in turn limits the achievable control (as a reference, \(B^{2 \text{µs}} = 100\)). This can also be noted visually by looking at the “holes” in the lower budget % region of the plots, which are control points that the technique could not instantiate. Note that, particularly when the UT task uses SEQ traffic, the lack of control in the lower range of the injection rate is indeed a limitation, as that is the only region where CMRI can be applied without significantly impacting the latency of the UT task. The lesson learned here is that T should be chosen as the largest value smaller than a reasonably safe upper bound to the memory phase duration.
Focusing on VOLT plots (leftmost column), as the memory phase size gets smaller, we observe a step when crossing \(30\%\) injection rate. This is due to a systematic latency increase of around 4 µs caused by a penalizing interplay between the particular temporal injection pattern created by VOLT with \(A=1\) and some architectural features (burst reads). Using higher values for A enables the bursty patterns and removes this effect, as shown in the central plot of the bottom row, where \(A=64\). Note that the higher value for A slightly alters (reduces) the obtained load intensity (C values are unmodified), which motivates the minor reductions in cumulative bandwidth and latency.
4.2 Real Workloads
As a final set of experiments, we consider several of the Polybench benchmarks [28] as workloads under test and study the effect of applying CMRI on their latency at the inter-cluster level. A brief description of the benchmarks is provided in Table 3. To make the results as general and comprehensive as possible, these experiments have been carried out on two different types of HeSoCs: one based on a GPU accelerator (the NVIDIA Tegra TX2 previously described) and one based on an FPGA accelerator (the Xilinx UltraScale+ MPSoC). We describe in the following the setup and the results for this experiment.
Hardware—For the hardware and software setup of the NVIDIA platform, we refer the reader to the previous section. In addition to that platform, we also consider the ZU9EG, the flagship system-on-chip of the Xilinx Zynq UltraScale+ family. As shown in Figure 10, the platform is composed of two different CPU clusters, namely the Real-Time Processing Unit (RPU) and the Application Processing Unit (APU). The RPU cluster, composed of two ARM Cortex-R5F cores, has better predictability than the APU but significantly lower performance, thus we do not consider it in our experiments. The APU is composed of four ARM Cortex-A53 CPU cores. Each A53 core features 32 KB L1 instruction and data caches, and the four cores share a 1 MB L2 cache in the CCI coherency domain.
Coupled to the host CPU clusters, an FPGA fabric can be exploited to accommodate custom logic.
The main system memory is a 4 GB DDR4 64-bit DRAM with a total bandwidth of 17 GB/s. As reported in the block diagram, the main memory subsystem features six DRAM ports: the first three (s0 to s2) are reserved for the host CPUs, and the remaining three (s3 to s5) can be exploited by the FPGA fabric.
To correctly model the behavior of a system design that maximally exploits the DRAM bandwidth, we deploy three configurable memory traffic generators, i.e., three configurable DMA engines (representative of three custom accelerators), one for each DRAM port available.
System Software—To conduct the experiments on the ZU9EG, we rely on a Linux environment based on the Xilinx PetaLinux 2020.2 toolchain. The Linux distribution that we used is composed of a 5.4.0-xilinx-v2020.2 Linux kernel and an Ubuntu 20.04 root filesystem. The FPGA fabric is configured with a custom-made hardware design, which couples the DMA engine to a configurable bandwidth regulator. This regulator implements the CMRI functionality for each of the traffic generators, which can be independently controlled and configured via a dedicated Linux application.
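As a rough illustration of how such a regulator can be driven from Linux user space, the sketch below maps a register block through /dev/mem and programs a per-generator bandwidth limit. The base address, register offsets, and function name are placeholders, not the actual register map of our design; only the general memory-mapped configuration approach is shown.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REG_BASE   0xA0000000UL   /* placeholder base address of the regulator block */
#define REG_SPAN   0x1000
#define REG_ENABLE 0x00           /* placeholder register offsets */
#define REG_LIMIT  0x04

/* Program one traffic generator's bandwidth limit (in regulator-specific
 * units). Illustrative only: the real design exposes its own registers. */
int set_injection_limit(uint32_t limit)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return -1; }

    volatile uint32_t *regs = mmap(NULL, REG_SPAN, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, REG_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    regs[REG_LIMIT  / 4] = limit;   /* bandwidth budget for this generator */
    regs[REG_ENABLE / 4] = 1;       /* enable regulation */

    munmap((void *)regs, REG_SPAN);
    close(fd);
    return 0;
}
```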
Experiments and Benchmarking Methodology—The peculiarity of HeSoCs, in terms of memory interference, is the possible presence of interference between heterogeneous clusters. For this reason, and after having studied intra-cluster interference in the previous section, here we focus only on inter-cluster interference. We describe in the following two experiments aimed at evaluating the CMRI technique applied to real benchmarks on the two aforementioned platforms.
In detail, the two experiments measure the effect of interference between the ARM cluster and (i) the DENVER cluster plus the GPU accelerator on the NVIDIA platform; (ii) the FPGA accelerators on the Xilinx platform. The plots for these experiments show the injection rate for the interference workload on the X-axis, and the latency increase for the Polybench benchmarks on the Y-axis.
Discussion—Figures 11 and 12 show the results for the considered experiments. The various curves represent the latency increase for each of the Polybench benchmarks. For reference, we also include the curve for the synthetic workload from the previous section (SYNTH) as well as the 10% latency increase threshold. For the NVIDIA platform, we refer to the worst case obtained for the SoC-level experiment in Section 4.1.1 (Figure 7). For the Xilinx platform, a similar worst-case experiment has been conducted to obtain the reference curve.
By comparing these figures, we observe two main things. First, most of the Polybench benchmarks seem to suffer very little from the SoC-level interference, which suggests that these workloads have a very low L2 cache miss rate. Second, the LU decomposition shows a much sharper latency increase on the Xilinx platform as the injection rate from the accelerators increases. This suggests that the workloads deployed on the GPU on the NVIDIA platform do not saturate the available bandwidth, unlike the FPGA accelerators on the Xilinx platform. On the latter, CMRI is visibly required to satisfy the maximum latency increase constraints.
Comparing the behavior of the Polybench benchmarks to the worst-case curves confirms that: (i) applying PREM-like schemes that imply mutually exclusive CPU/accelerator access to DRAM heavily under-utilizes the system bandwidth (because the benchmarks suffer very little interference); in this case, most of the time the accelerator could operate at full throttle without harming the CPU tasks’ latency; (ii) there is a very large region where more DRAM-intensive benchmarks (i.e., those with a higher L2 cache miss rate) would exceed the maximum latency increase threshold in the absence of CMRI. Note that this could also happen to the same benchmarks when much larger datasets are used and/or when the inter-cluster interference is coupled with intra-cluster interference (which is what happens with the worst-case synthetic curves).