4.1 Experimental Setup
Implementation. The cache size is 8 KB with 128 cache blocks. The baseline is automatic caching, for which we use the
Pseudo Least Recently Used (PLRU) eviction policy, a commonly used approximation of LRU that is more time- and space-efficient to implement (using a single status bit in each cache line) [
31]. We compare four cache programming techniques: CLAM, which does not consider phase variation; PRL, which divides a program execution into a fixed number of phases (Section
2.8); SHEL, which uses scope local leases (Section
2.3), and C-SHEL, which assigns inter-scope leases (Section
2.4). For PRL, we divide executions into five equal-length phases, as was done in Reference [
29]. In addition, we have extended the FPGA implementation to output the aggregate vacancy and the remaining lease values for visualization (Section
4.2).
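The single-status-bit PLRU baseline can be sketched as follows. This is an illustrative reading of a bit-based PLRU policy, not the exact hardware implementation; the class name, set size, and tie-breaking rule are ours:

```python
# Illustrative sketch of bit-PLRU: one status bit per cache line.
# On access, the line's bit is set; when all bits become set, the
# others are cleared. The victim is the first line whose bit is 0.
class BitPLRUSet:
    def __init__(self, num_lines=4):
        self.tags = [None] * num_lines      # cached block tags
        self.mru = [0] * num_lines          # one status bit per line

    def _touch(self, i):
        self.mru[i] = 1
        if all(self.mru):                   # all bits set: reset the others
            self.mru = [0] * len(self.mru)
            self.mru[i] = 1

    def access(self, tag):
        """Return True on a hit, False on a miss (evicting if needed)."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        # an empty line also has bit 0, so it is chosen before eviction
        victim = self.mru.index(0) if 0 in self.mru else 0
        self.tags[victim] = tag
        self._touch(victim)
        return False
```

The single status bit is what makes the policy cheap in both time and space relative to true LRU, which needs an ordering over all lines in a set.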
Benchmarks. We use PolyBench/C 4.2.1, which contains 30 numerical kernels [
26]. We use PolyBench for several reasons. First, the benchmark suite is relatively easy to port through the FPGA tool chain to allow testing on a real system. Second, the current emulation system is not yet equipped to execute other benchmark suites due to limited RISC-V instruction set coverage. Despite these limitations, PolyBench kernels are extracted from linear algebra, image processing, physics simulation, dynamic programming, and statistics, which are all common workloads in scientific computing and have been used extensively in performance analysis studies [
1,
25] and optimizations [
2,
19]. We compile each program with the
GCC -O3 optimization level without vectorization, which our CPU does not currently support, and report the results for small, medium, and large dataset sizes. The amount of program data for these sizes is approximately 128 KB, 1 MB, and 25 MB, respectively.
PLUTO Compiler. Version 0.11.4 of the PLUTO optimizing compiler was additionally used to optimize the PolyBench benchmarks [
9]. The compiler was invoked using only the
--tile option to enable cache tiling and polyhedral optimizations. The compiler was configured to produce 16
\(\times\) 16
\(\times\) 16 tiles so that each tile fits in our cache. For code run with the CLAM and PRL leasing policies, the PLUTO output was unchanged. For code run with the SHEL and C-SHEL policies, the PLUTO output had to be manually annotated with scope markings to work with the current lease generator. For benchmarks that produced large optimized output, we divided the non-optimized source into several subregions and invoked the PLUTO compiler on each subregion separately. The code, after the PLUTO optimization, was then compiled and run exactly as described previously. All benchmarks could be optimized with the PLUTO compiler except
heat-3d, which we were not able to run and collect data for after optimization. We present the results of lease cache on programs compiled without PLUTO optimization in Section
4.2, and we explore the combined effects of lease cache and PLUTO optimization in Section
4.3.
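As an illustration of the kind of tiling PLUTO applies, the following hand-written sketch tiles a matmul-like kernel by 16 in each of its three loop dimensions. It is our own example, not actual PLUTO output, and the kernel choice is illustrative:

```python
# Hand-written sketch of 16x16x16 loop tiling (the transformation PLUTO
# applies automatically). The intra-tile working set is ~3 * T*T elements,
# which is the quantity that must fit in the cache.
T = 16  # tile size along each loop dimension

def tiled_matmul(A, B, C, n):
    """C += A * B with all three loops tiled by T for cache reuse."""
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                # intra-tile loops; min() handles the ragged last tiles
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + T, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
```

Tiling shortens reuse intervals by revisiting a small block of data before moving on, which is why it interacts with lease assignment: the optimized trace has different reuse statistics than the original.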
4.2 Performance of Cache Programming in Unoptimized Loops
We divide the benchmarks into two groups based on the number of scopes in them (shown in Figure
5). The first group consists of 13 benchmarks that have two or more scopes, and the second group has the remaining 17. Figure
5 shows the two groups side-by-side in three rows, each showing a different data size from small (top row) to large (bottom). The
x-axis shows program names, and the
y-axis shows the miss ratio of the cache. We discuss the multi-scope group in this section and the single-scope group later in Section
B. For the multi-scope group, the red vertical line in each graph separates those with non-cyclic phases (Left) and those with cyclic phases (Right). Their phase counts are shown in Table
1.
Overall Comparison. Overall, cache programming improves significantly over automatic caching. The four techniques are used to provide leases for 13 multi-scope tests on each input, for a total of 156 (
\(4\times 13\times 3\)) lease solutions. When compared with PLRU as shown in Figure
5, in 153 out of 156 cases, lease cache matches or performs better than PLRU, reducing the number of misses by over 15% in 75 (about half) solutions, by over 75% in 16 (over 10%) solutions, and by as much as 87% in the best case. Hence, we have the first finding:
Finding.(1)
Cache programming using leases is overwhelmingly better than automatic caching (using PLRU), reducing the miss count by over 15% in one of every two cases and by over 75% in one of every ten.
This shows the benefit of using program information when managing the cache. Note that, in this study, we use profiling, and the test run is the same as the training run. Hence, this improvement is an ideal case.
For the multi-scope group on the small dataset, the baseline technique CLAM reduces the miss count by 55% on average, and PRL by 58%. As a reminder, PRL considers phases but not the program structure. SHEL increases the average reduction to 60%, and C-SHEL to 61%, by considering the program structure and, in the case of C-SHEL, cross-scope reuses.
The data sizes of the medium and large inputs are 1 MB and 25 MB, much greater than the 8 KB cache size, so the improvement from caching is smaller. For medium, the average reduction is 12%, 15%, 26%, and 21% for CLAM, PRL, SHEL, and C-SHEL, respectively. For large, these are 8%, 9%, 11%, and 11%.
Effect on Multi-scope Programs. Lease optimization is most important for multi-scope programs. We focus on the three methods that treat scopes differently, while using CLAM, which does not distinguish scopes, as a baseline. Indeed, except in a few cases, they all outperform CLAM by considering each scope and assigning scope-local leases. SHEL, for example, outperforms CLAM by 6%, 15%, and 4% for the three inputs, respectively. Within each scope, they use the same algorithm and differ only in how they treat the boundary effect. Among the three, no single method dominates in all cases. On average, the lowest miss count is obtained by C-SHEL on the small dataset, SHEL on the medium, and both SHEL and C-SHEL on the large. SHEL performs well except on the small input, where it performs the worst. While all three methods assign different leases for different program phases, SHEL ignores cross-loop reuses when assigning leases. On the small input, however, phases are the shortest, and cross-loop reuses are important to cache management.
By considering these reuses, both C-SHEL and PRL outperform SHEL on small inputs. Between the two, C-SHEL performs better than PRL on average on all three input sizes, mainly for two reasons. First, C-SHEL phases are marked according to program structure, while PRL intervals are even divisions; PRL optimization is less effective in boundary intervals where the reuse behavior is mixed. Second, as discussed in Section
2.8, PRL considers cross-loop reuses only in a boundary interval but uses the same leases on all interior intervals. The lease assigned based on a boundary interval may be sub-optimal for interior intervals. The first weakness can be ameliorated by using more intervals, but the second cannot. In comparison, C-SHEL considers the boundary effect proportionally. For our tests, however, the results show that it is better to ignore these effects on medium and large inputs. C-SHEL gives no significant advantage over SHEL on any of these tests but is significantly worse on some of them. In summary, we find the following:
Finding. On multi-scope tests, the three lease-optimization methods show that:(2)
It is best for lease optimization to ignore cross-loop reuses, that is, SHEL is the best or close to the best in all cases.
(3)
C-SHEL, by considering boundary reuses, has significant benefits on small inputs, is counterproductive on medium inputs, and has no effect on large inputs.
(4)
PRL, by constraining CARL using intervals, is always beneficial compared to CLAM, but less effective than SHEL on medium and large inputs.
It is worth noting that the weakness of PRL, i.e., that leases are constrained by the behavior of all intervals, is also a strength in robustness: PRL never increases cache contention compared to CLAM. This is shown in the comparison, where in all but one test PRL performs the same as or better than CLAM.
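The contrast between PRL's fixed intervals and SHEL's scope-based phases can be sketched as follows. This is a simplified model of our reading of the two schemes; the function names and the representation of phases as index ranges are illustrative:

```python
# Simplified sketch: PRL cuts the execution into k equal-length intervals,
# while SHEL's phase boundaries follow scope markers in the program.
def prl_intervals(trace_len, k=5):
    """PRL: divide the execution into k equal-length intervals."""
    step = trace_len // k
    bounds = [i * step for i in range(k)] + [trace_len]
    return list(zip(bounds[:-1], bounds[1:]))

def shel_phases(scope_marks, trace_len):
    """SHEL: one phase per region delimited by scope markers."""
    bounds = sorted(set(scope_marks) | {0, trace_len})
    return list(zip(bounds[:-1], bounds[1:]))
```

The difference matters at loop boundaries: an even division can place a cut in the middle of a loop, mixing the reuse behavior of two scopes within one interval, whereas scope markers align phases with the program structure.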
Limit Analysis. CARL computes the miss ratios for an ideal cache, which, as proved by Ding et al. [
17], is the best for the same average virtual cache size among all lease solutions; no lease optimization can perform better than CARL. Note that the CARL bound may not be tight, i.e., what is included in the potential may not be reachable. What is valuable, however, is that it shows what is excluded, which is definitely not realizable.
Compared to PLRU, the potential for cache programming depends on the input size. On our system, the cache size is 8 KB, and the amount of program data is approximately 128 KB, 1 MB, and 25 MB for the three input sizes, respectively. The average potential is 62%, 36%, and 11% reduction over PLRU for the three input sizes. Given these potentials, the average realized by our three techniques is 84% for small by SHEL, 46% for medium by C-SHEL, and 82% for large by C-SHEL.
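The limit-analysis arithmetic above can be written out as two small helpers; the helper names are ours, and the values in the example are for illustration only, not measured data:

```python
# Sketch of the limit-analysis arithmetic: the "potential" is the miss
# reduction of the CARL bound over PLRU, and the realized fraction is how
# much of the PLRU-to-CARL gap an actual lease policy closes.
def reduction(baseline_misses, misses):
    """Fractional miss reduction relative to a baseline."""
    return 1.0 - misses / baseline_misses

def realized_fraction(plru_misses, carl_misses, actual_misses):
    """Fraction of the PLRU-to-CARL gap closed by an actual policy."""
    return (plru_misses - actual_misses) / (plru_misses - carl_misses)
```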
The CARL bound is not tight, because it requires a variable cache size. The realizable bound lies in the gap between the best actual result and the CARL bound. Among the three input sizes, this gap is narrowest for the large input, so the CARL bound is closest to the actual potential there, and the realized portion of the CARL bound is closest to the realized potential. Hence, we have two additional findings based on the limit analysis:
Finding. The limit analysis shows that:(5)
Lease-based cache programming may improve cache management over PLRU by over 60% when the data size is 16 times the cache size and may still improve it by over 10% when the data size is 3,000 times larger.
(6)
The techniques in this article have realized over 80% of the potential of lease cache programming.
The limit of cache programming is also bounded by optimal fixed-size caching, i.e., the OPT or MIN method [
13,
23]. Optimal caching is automatic but unrealizable, because it requires precise future knowledge. Two earlier studies have shown that ideal lease cache performs as well as or better than OPT for both storage traces [
22] and PolyBench programs [
17].
Lease Visualization. Prechtl et al. [
29] showed that leases give a user the ability to visualize the state of the cache at each moment and, by taking samples and showing them in sequence, the cache dynamics over an execution. We use visualization in Figure
6 to show the effect of lease optimization on one of our test programs. The graphs plot the
individual cache line status, which shows the current lease for each cache block. At each moment, the entire cache space of 128 blocks is shown as a column of 128 cells, colored individually to show the current remaining lease in that block. Previous work [
29] calls this a cache tenancy spectrum. From the execution start time at the left boundary to the execution end time at the right boundary, we plot the execution as a matrix of colored cells. Yellow means no lease (empty block) and blue means occupied. The darker the blue is, the longer the lease. Thus, over-allocation is visible as large chunks of blue and under-allocation as chunks of yellow.
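A minimal sketch of how such a tenancy spectrum could be built from a simulated lease cache follows. This is a simplification of the actual hardware (expired blocks are treated as vacant, and a full cache is handled naively by overwriting block 0); names and defaults are illustrative:

```python
# Hypothetical sketch of building a cache tenancy spectrum: one column per
# time step, one row per cache block, cell value = remaining lease
# (0 = vacant, i.e., the "yellow" cells; larger = longer lease, "darker blue").
def tenancy_spectrum(trace, leases, num_blocks=4):
    remaining = [0] * num_blocks          # remaining lease per block
    owner = [None] * num_blocks           # datum occupying each block
    spectrum = []
    for ref in trace:
        # one time step passes: tick down all active leases
        remaining = [max(0, r - 1) for r in remaining]
        if ref in owner and remaining[owner.index(ref)] > 0:
            remaining[owner.index(ref)] = leases[ref]   # hit: renew lease
        else:
            # miss: place in a vacant block; overwrite block 0 if none
            slot = remaining.index(0) if 0 in remaining else 0
            owner[slot] = ref
            remaining[slot] = leases[ref]
        spectrum.append(list(remaining))
    return spectrum  # columns from execution start (left) to end (right)
```

Plotting this matrix with a sequential colormap reproduces the kind of picture described above: over-allocation shows up as columns dominated by large values, under-allocation as columns dominated by zeros.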
Figure
6 clearly shows the benefit of phase-aware lease assignment. This program is composed of two top-level nested loops, whose effect can be clearly seen in the cache tenancy spectra. CLAM allocation is blind to this structure; it simply assigns the most profitable leases until the budget is met. Unfortunately, a disproportionate number of the most profitable lease assignments lie in the first loop, and fewer lie in the second. The result is over-allocation during the first loop and under-allocation during the second. SHEL, however, optimizes each loop individually, resulting in balanced cache use, with fewer contention misses in the first phase and more hits in the second phase.
Effect on Single-scope Programs. Figure
5 shows the normalized miss count for the 17 single-scope tests. Because these programs have only a single scope, SHEL and C-SHEL lease assignments are identical to those of CLAM. On average for the small dataset, PRL's reduction, 54%, is slightly better than CLAM's 52%. For medium and large, the two results are effectively the same, with reductions of 27% and 26%, respectively. This can be explained by behavior variation within single-scope tests: PRL considers the variation separately using intervals, but CLAM does not. The effect, however, is significant only on small inputs. For the other two inputs, PRL sometimes performs worse than CLAM, likely because leases that are optimal for some intervals are sub-optimal overall.
4.3 Performance of Cache Programming in Optimized Loops
Data locality can be greatly improved by the compiler using the polyhedral abstraction [
7,
20]. We consider compiler transformations and lease caching to be complementary solutions for improving cache performance. Lease caching seeks to improve the hit ratio via optimization of the cache replacement policy given a memory access stream. Compiler transformations seek to improve the hit ratio by reordering the memory access stream itself, such that locality is improved. We examine the combined effect of polyhedral loop optimization done by the PLUTO compiler [
9] and our method of lease caching. We evaluate all four caching techniques (CLAM, PRL, SHEL, and C-SHEL) on 12 multi-scope benchmarks and CLAM and PRL on 17 single-scope benchmarks, each run with three different data sizes.
Overall Comparison. Comparing the PLRU results (black bar) in Figure
5 and Figure
7, the cache performance can be greatly improved by the PLUTO compiler. Cache programming still improves performance over automatic caching for locality-optimized code. As shown in Figure
7, in 168 out of 246 cases, the lease cache matches or performs better than PLRU. In 146 of the 168 cases, the programmable cache improves the cache performance by 25%. Another 14 solutions reduce the cache misses by over 50%, and the best reduction reaches as high as 82%. It is worth noting that PLUTO triples the cache misses in
ludcmp on PLRU, but such a negative impact does not happen in a programmable cache: Both PLUTO and non-PLUTO codes have the same cache performance in
ludcmp. For the remaining 78 cases, where the lease cache does not perform as well, more than 90% add fewer than 50% additional misses, and none makes the cache perform worse than the non-PLUTO performance. In addition, their cache miss ratios are relatively low, and the degradation is smaller for larger data inputs. The geometric mean miss ratios in the programmable cache are 1.00% (0.75% in PLRU) for the small data size, 0.84% (0.63% in PLRU) for medium, and 0.34% (0.26% in PLRU) for large.
Table
2 summarizes the average (geo-mean) cache miss reduction by the lease cache for the two groups of benchmarks with three data inputs. A negative number means the lease cache performs worse than automatic caching (PLRU). For multi-scope tests, except for one anomaly,
lu, the same finding discussed in the previous section still applies to PLUTO-optimized code: C-SHEL performs best on the small dataset, SHEL on the medium, and both SHEL and C-SHEL on the large. The same observation also applies to the single-scope group. Excluding one anomaly,
gramschmidt, the best performance is obtained by PRL on the small dataset and by CLAM on the medium and large datasets. These two anomalies significantly impact lease cache performance. For the PLUTO results, Table
2 shows the effect without these anomalies in parentheses. Next, we discuss why such anomalies happen in lease cache.
Lease Cache Anomalies. There are two anomalies, lu from the multi-scope programs and gramschmidt from the single-scope programs. The lease cache has 2.3\(\times\) and 35\(\times\) more misses on the small and medium datasets of lu and 2.5\(\times\) more on the small dataset of gramschmidt. PLUTO transforms a loop and increases the number of references in it. This poses two problems for the lease cache. The first is information loss. The effect of sampling is “diluted” in that the RI distribution for each reference may have fewer RIs. This is likely the problem in gramschmidt. The anomaly happens only for PRL and not for other methods, because PRL divides a program into five phases. The second problem is lease-table truncation. Our current hardware has 128 entries. If a program has more than 128 references, then the 128 most profitable leases are loaded, and the remaining references are assigned the default lease (which is 1). This is likely the case for the two lu anomalies.
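The lease-table truncation just described can be sketched as follows. This is our illustrative model, not the actual lease-loading code; the profit metric is a stand-in for whatever the optimizer computes:

```python
# Sketch of lease-table truncation: only the table_entries most profitable
# references keep their assigned lease; the rest fall back to the default
# lease of 1, as in the 128-entry hardware table described above.
def load_lease_table(assignments, table_entries=128, default_lease=1):
    """assignments: {reference_id: (lease, profit)} from the optimizer."""
    by_profit = sorted(assignments,
                       key=lambda r: assignments[r][1], reverse=True)
    kept = set(by_profit[:table_entries])
    return {r: (assignments[r][0] if r in kept else default_lease)
            for r in assignments}
```

With more than 128 references, the references that lose their leases get only a one-access tenancy, which is how truncation can sharply inflate the miss count.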
In summary, we have the following discovery on the effect of programmable cache on polyhedral-optimized code:
Finding. The three programmable caching schemes on PLUTO-optimized code show that:(7)
Programs after polyhedral optimization can still benefit from the lease cache optimization, and their best performance can be achieved by considering the scope.
Interaction between Cache Programming and Compiler Optimization. The PLUTO compiler transforms the original loop structures, removing dependencies to improve parallelization and locality. These transformations include loop distribution (fission) and loop fusion. Such transformations have two impacts on the lease cache: (1) The program scopes can be altered by these transformations, making the scope markers we set on the original program less optimal. (2) More array references are inserted to handle boundary conditions when performing loop blocking or tiling. As the number of references increases, lease caching is negatively affected by two problems: information loss and lease-table truncation. The two problems occur in 3 of the 246 cases, 2 of them for the small input size, only with SHEL and PRL (not CLAM or C-SHEL), and only for PLUTO-optimized loops. Other tests are not significantly affected by these problems.
Finally, we observe that lease cache is beneficial for compiler optimized code. Polyhedral optimizations may not always provide perfect locality. In some cases, a compiler cannot eliminate all cache misses because either a program cannot be further optimized, the compiler fails to apply the full optimization possible, or a user fails to configure the compiler properly. Providing an additional layer of optimization can be beneficial for cases where there is still room for improvement in the cache replacement policy when applied to a transformed program.
Our results show improvement in geomean performance in all cases with SHEL and C-SHEL after PLUTO optimization, excluding the anomalous results of a single outlier program (lu). For deriche and ludcmp, PLUTO makes no locality improvement at any of the three input sizes. For ludcmp on the small size, PLUTO optimization is counterproductive: it more than triples the miss ratio in the PLRU cache. The lease cache has no such problem; with PLUTO it performs the same on the small and large sizes and only slightly worse on the medium size. This shows that compiler optimization may falter in rare cases, and the programmable cache provides another line of defense, in this case largely removing the degradation.
In this work, we show the results of lease cache on benchmarks whose loops are
Static Control Parts (SCoP) [
9], because the reuse interval information required by the lease assignment can be gathered statically. For parts of programs that are not SCoPs, a dynamic lease cache policy, which gathers statistics and assigns leases at runtime, could extend lease caching beyond the parts amenable to such compiler analysis. However, the development of such a system is beyond the scope of this article, in which we seek to evaluate the performance of our design, which uses static leases.