There are multiple approaches in the state-of-the-art on how to defend against TEAs. Mainly, the literature focuses on the second and third steps of a TEA. The taxonomy adopted herein splits defenses into two categories: defenses that limit or prohibit speculation (the second step in setting up a TEA), and defenses that impede the formation of the communication channel (the final step in setting up a TEA).
6.1 Limited-Speculation Defenses
The key concept behind Limited-Speculation Defenses is that speculation is fundamentally insecure. Therefore, the defenses focus on controlling the micro-architectural state resulting from speculation. Table
3 shows seven techniques and is organized into six categories: whether the technique is a hardware and/or software implementation (HW/SW); the defense method (Method); which micro-architectural components the technique protects (Protected Components); the technique's drawbacks (Drawbacks); the maximum performance penalty (Max. Performance Penalty); and whether the technique is backward compatible (BC). Backward compatibility is defined in two variants: one for software and one for hardware. Software BC (SW BC) is defined as the ability to take a binary previously compiled for the same architecture and have it execute with the security guarantees provided by the new micro-architecture. Hardware BC (HW BC) is defined as the ability to backport the modifications introduced by the technique to older micro-architectures (e.g., a microcode update that changes the behavior of certain instructions [
62]). The maximum performance penalty metric is the highest penalty reported by any of the cited papers in each category and is only valid for the experimental methodology used. For ease of reference, the performance penalty is provided with the accompanying citation.
Partition Speculation Components. This method is already employed in some current micro-architectures. Intel and ARM use some form of partitioning to prevent a process from influencing the speculation results of another process [
13,
15]. Each entry in the speculative component has a unique application-specific ID. If there is no full ID match in the entry, the core does not speculate [
13]. Its limiting factor is resource contention among processes running on the same core. The partitioning scheme is secure if and only if all shared speculative components are partitioned. It has been shown that one of the latest Intel micro-architectures (Ice Lake), despite having in-silicon defenses against TEAs, is vulnerable to Spectre-type attacks because one shared speculation component, the PHT, was not partitioned [
15]. Intel, ARM, and AMD do not provide any benchmarking for this type of defense; the performance penalty is therefore unknown. This technique is backward compatible with regard to software, as existing software immediately benefits from the modifications. However, these modifications cannot be applied directly to older micro-architectures, so the technique fails hardware backward compatibility.
Clear Micro-Architectural State. On a context switch, the core will flush all shared micro-architectural buffers. The extent of the flushing depends on the security requirements of the system and/or software. There are proposals to handle the flushing in hardware [
171], while others use software [
150,
151]. Intel has modified the
VERW instruction to overwrite specific micro-architectural components [
62,
151]. Hardware solutions are always advantageous to the programmer, who does not have to reason about the micro-architectural state. Software solutions rely on the programmer to execute the correct set of instructions to clear the necessary state. However, hardware solutions conservatively clear the micro-architectural state regardless of the security requirements of the running software. Therefore, a software solution can yield better performance when the software's security model is known. In both cases, flushing part or all of the micro-architectural state adds to the performance penalty of context switches. Similarly to the previous category, and for the same reasons, SW BC is maintained and HW BC is voided.
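The trade-off above can be illustrated with a small behavioral model. This is a sketch only, not a description of any real micro-architecture: the `MicroArchBuffers` class and its methods are invented for illustration, modeling shared buffers that retain residue across a context switch unless a VERW-style overwrite is performed.

```python
# Toy model of the "Clear Micro-Architectural State" defense.
# All names (MicroArchBuffers, leak_probe, ...) are invented for this sketch.

class MicroArchBuffers:
    """Models shared buffers (e.g., fill/store buffers) that may retain
    stale data from the previously running process."""

    def __init__(self):
        self.entries = {}            # buffer slot -> leftover data

    def fill(self, slot, data):
        self.entries[slot] = data    # normal operation populates buffers

    def leak_probe(self, slot):
        # An attacker scheduled after a context switch observes whatever
        # residue the buffer still holds (None means nothing observable).
        return self.entries.get(slot)

    def flush(self):
        # The VERW-style overwrite: clear every entry before the next
        # process runs, at the cost of extra context-switch latency.
        self.entries.clear()

def context_switch(buffers, flush_state):
    # A hardware solution would always flush; a software solution flushes
    # only when the software's security model requires it.
    if flush_state:
        buffers.flush()

buffers = MicroArchBuffers()
buffers.fill(0, "victim-secret")

# Without flushing, residue survives the switch; with flushing it does not.
context_switch(buffers, flush_state=False)
residue_unprotected = buffers.leak_probe(0)   # "victim-secret"
context_switch(buffers, flush_state=True)
residue_protected = buffers.leak_probe(0)     # None
```

The `flush_state` flag captures the hardware/software split discussed above: a conservative (hardware) policy pays the flush cost on every switch, while a software policy can skip it when no secret is at risk.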
Trap Speculation. The trap speculation technique limits or blocks the flow of speculative instructions' results to a communication channel. Most proposals focus on blocking the cache hierarchy communication channel. As such, they add a per-thread private L0 cache to capture the effects of speculative instructions which operate over the cache hierarchy [
3,
177]. If the speculation is correct, the effects of the speculative instructions are applied to the cache hierarchy. Otherwise, they are ignored. One proposal stalls or predicts speculative memory accesses until they are verified [
133]. Other proposals allow speculation to alter the cache hierarchy but will “undo” the state on a mis-speculation [
77,
132]. Another proposal generalized the problem of transmitting data of speculative instructions to a communication channel in any micro-architecture [
182]. Recent research showed that these methods can still be attacked [
16,
86]. For methods which trap speculation in an L0 cache, the order in which speculative memory accesses and subsequent memory accesses are performed causes enough of a timing difference to build a communication channel [
16]. For the methods which roll back the mis-speculated cache state, the communication channel is built from the timing difference associated with the size of the rollback state [
86]. The hardware modifications required by this technique imply that HW BC is voided. However, SW BC is maintained.
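The commit-or-discard behavior of an L0-style trap can be sketched as follows. This is an abstract model under simplifying assumptions (caches reduced to sets of line addresses); the `SpeculativeL0` class is invented for illustration and does not correspond to any cited design.

```python
# Sketch of trapping speculative effects in a per-thread private L0
# buffer: speculative fills land in the L0, and only verified
# speculation is applied to the shared cache hierarchy.

class SpeculativeL0:
    def __init__(self, shared_cache):
        self.shared = shared_cache   # set of line addresses in the hierarchy
        self.pending = set()         # lines touched only speculatively

    def speculative_load(self, line):
        if line not in self.shared:
            self.pending.add(line)   # trap the fill in the private L0

    def resolve(self, correct):
        if correct:
            self.shared |= self.pending   # commit effects to the hierarchy
        self.pending.clear()              # mis-speculation: drop them

shared = set()
l0 = SpeculativeL0(shared)

l0.speculative_load(0x1000)
l0.resolve(correct=False)        # squashed: shared cache is untouched
assert 0x1000 not in shared

l0.speculative_load(0x2000)
l0.resolve(correct=True)         # verified: effects become visible
assert 0x2000 in shared
```

Note that the attacks cited above exploit exactly what this model abstracts away: the timing of the commit/discard step itself is observable, even though the final cache contents are clean.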
Speculative State defined in the ISA. The micro-architectural state is partially defined in the architectural state. This approach has been adopted by Intel [
59,
61], AMD [
5,
7], and ARM [
11,
12]. New instructions are added to the ISA such that the order of operations in relation to a speculative instruction is always guaranteed. Similarly to memory ordering instructions (fences), speculation ordering instructions guarantee that instructions which follow a speculative instruction are not allowed to execute until the speculation has been verified. Intel and AMD provide instructions to limit jump and memory dependence speculation [
5,
7,
59,
61]. ARM goes beyond and provides instructions, not only to limit jump and memory dependence speculation, but also to limit any speculation [
11,
12]. Old programs will have to be recompiled to take advantage of these new instructions. This voids SW BC. HW BC can be maintained if the older platforms allow microcode updates which introduce new instructions or add side-effects to existing instructions [
62]. Another approach uses dedicated hardware within the backend of the core that tracks and stops data forwarding to instructions with measurable side-effects [
92,
167]. This approach guarantees security while preserving SW BC through extensive hardware modifications. As a result, it cannot be backported to older micro-architectures and is not HW BC. Although merging the micro-architectural and the architectural state guarantees speculation behavior to the programmer, it limits the design freedom afforded to micro-architecture implementations. Furthermore, using these instructions requires a deep understanding of the micro-architecture not only to guarantee security but also to maintain high performance. Much like memory ordering instructions, a conservative use of speculation-limiting instructions leads to performance loss [
11].
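The ordering guarantee of a speculation barrier can be captured with a toy model. This mini-pipeline is invented for illustration and only mimics the spirit of fence-like barriers (e.g., LFENCE- or CSDB-style semantics): no instruction after the barrier may leave a micro-architectural footprint while the guarding speculation is unresolved.

```python
# Behavioral sketch of an ISA speculation barrier. The "program" is an
# invented list of (op, arg) pairs; only the ordering idea is real.

def transient_footprint(program):
    """Return the set of cache lines touched while the branch guarding
    this program is still unresolved (i.e., touched transiently)."""
    touched = set()
    for op, arg in program:
        if op == "barrier":
            break                    # nothing past the barrier runs early
        if op == "load":
            touched.add(arg)         # a transient load alters cache state
    return touched

unprotected = [("load", 0xA0), ("load", 0xB0)]
protected = [("barrier", None), ("load", 0xA0), ("load", 0xB0)]

assert transient_footprint(unprotected) == {0xA0, 0xB0}
assert transient_footprint(protected) == set()
```

The performance cost discussed above is visible in the model: behind the barrier, useful independent work (the loads) is also delayed until resolution, which is why conservative barrier placement mirrors the cost of conservative memory fences.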
Retpoline. Retpoline is a technique which sets up the RAS with a prediction loop. The prediction loop will lock the speculation state into fetching and executing the same safe instruction sequence until the speculation is verified [
5,
63,
156]. As this technique relies on precise code in certain function calls, the backward compatibility requirement is not met. Despite limiting the RAS, there is still a vulnerable timing window to perform a Spectre attack while setting up the required prediction loop [
97]. Moreover, recent micro-architectures that have in-silicon defenses against RAS TEAs have been shown to still be vulnerable against Spectre [
170]. Since Retpoline is a software-only technique, only SW BC is considered; as noted, the reliance on precise code sequences means it is not maintained.
Runtime Code Injection. The decode unit inspects the emitted micro-code sequence. If the generated micro-code matches a TEA pattern, the decoder injects a specific micro-code sequence that nullifies the effects of possibly mis-speculated instructions on a communication channel. These code sequences can be drawn from the previously described techniques: instructions to clear the micro-architectural state, instructions to limit speculation, and/or retpoline. When the cache hierarchy is the communication channel, the micro-code sequencer injects fence instructions so that some memory accesses are strictly ordered behind a speculative instruction [
148]. Regardless of the detection system used, this technique will always be susceptible to new exploits that remain undetected. Furthermore, the detection mechanism and code injection may lead to performance loss in certain workloads [
148]. To avoid incurring the performance penalty for every binary, the software environment could mark binaries as safe or unsafe depending on the presence of TEA gadgets or communication channels [
56,
74,
108,
120]. Moreover, to further improve performance, the software environment can tag specific regions of interest to defend against TEAs. As for BC, since this code injection method must reside in the decode stage of the micro-architecture's pipeline, SW BC is maintained and HW BC is not.
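The pattern-match-and-inject step can be sketched as a pass over a micro-op stream. The micro-op encoding and gadget shape below (a branch immediately feeding a load) are invented simplifications for illustration; real detectors match far richer patterns.

```python
# Sketch of decoder-level code injection: scan the decoded micro-op
# stream and insert a fence between a speculation source and a
# dependent load, so the load is strictly ordered behind the branch.

GADGET_HEAD = "branch"       # speculation source (invented encoding)
GADGET_BODY = "load"         # transmits via the cache hierarchy

def inject_fences(uops):
    out = []
    for i, uop in enumerate(uops):
        out.append(uop)
        # Detection heuristic: a branch directly followed by a load is
        # treated as a potential TEA gadget and gets a fence injected.
        if uop == GADGET_HEAD and i + 1 < len(uops) and uops[i + 1] == GADGET_BODY:
            out.append("fence")
    return out

uops = ["branch", "load", "add", "load"]
hardened = inject_fences(uops)
# hardened == ["branch", "fence", "load", "add", "load"]
```

The model also makes the stated drawback concrete: any gadget shape the detector does not match (here, anything other than branch-then-load) passes through unprotected, and every matched sequence pays the fence cost at runtime.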
Recompilation. This technique is the compile-time analogue of Runtime Code Injection: the compiler detects vulnerable code sequences and replaces them with safe variants [
56,
74,
108,
120]. Comparing the recompilation technique with the runtime code injection technique, there is no extra performance penalty for detection and mitigation during runtime. The performance cost of recompilation is paid at compile-time. The disadvantage, in comparison to the runtime alternative, is that the source code needs to be available to perform the recompilation, while the runtime alternative can execute any binary. Hence, recompilation is not SW BC. Both alternatives suffer from the same drawbacks.
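One widely used safe variant emitted by such compiler passes is branchless index masking (in the spirit of speculative load hardening), which clamps an index so that even a transiently wrong bounds check cannot select attacker-chosen memory. The sketch below assumes a power-of-two table size for simplicity; the function name is invented.

```python
# Compile-time replacement sketch: a bounds-checked access is rewritten
# to use a branchless mask, so a mis-speculated out-of-bounds index is
# forced back into the table's range before the load happens.

def masked_index(index, length):
    # Power-of-two table assumed for this sketch; real compiler passes
    # handle arbitrary sizes with conditional-move-based clamping.
    assert length and (length & (length - 1)) == 0
    return index & (length - 1)

table = list(range(8))

# Even with a (transiently ignored) bounds check, the masked access
# stays inside the table: 100 & 7 == 4.
safe = table[masked_index(100, len(table))]
```

Because the mask is computed with data flow rather than control flow, it constrains the transient execution path as well as the architectural one, which is exactly what a recompilation pass needs.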
A common theme in all techniques that seek to limit speculation is that they all impact performance to some degree. Moreover, some techniques have outstanding security vulnerabilities. All techniques try to define what the insecure speculative state is and how it should be limited. Except for trap speculation, they consider the speculative state to be any computation which stems from any speculation. As a result, they limit the global speculative state. The Trap Speculation technique, however, reduces the insecure speculative state to only those speculative instructions which lead to communication channels. Recall that a TEA is only successful if the attacker is able to recover data from the victim, not merely if some speculative state has been manipulated.
Discussion. Providing security by limiting speculation is a paradigm shift in how micro-architectures are designed and implemented. Under this methodology, the speculative state needs to be precisely defined for all micro-architectural states. This analysis is akin to defining the allowed memory consistency model [
106]. Using formal analysis of speculative states, one can design tools which detect possible insecure states [
24,
47,
98,
102,
108,
117,
154]. Little research has been done on how a micro-architectural implementation, through a
Hardware Description Language (HDL), can be fixed if an illegal speculative state is found [
40,
57,
105,
157]. This is a difficult problem to solve correctly, as the setup and clearing of a speculative state is particular to each micro-architecture implementation, not to the architectural model (even if the latter partially defines the speculative state). An unexplored avenue for micro-architectural programming is the use of hint instructions in attacks, whether or not they execute behind speculative instructions. Hint instructions are architectural no-operations; however, they directly program the micro-architectural state. Most ISAs provide instructions to hint some prediction mechanism into a known state [
10,
65,
126]. These hints commonly affect jump prediction and memory dependence prediction units. Although attacks have been found against particular speculation mechanisms, this does not mean novel speculative components are immune. A survey of proposed speculative mechanisms showed that attacks can still be mounted and may provide unlimited access to certain resources in the system [
162]. The proposed speculation mechanisms range from value prediction (predict results of operations) to data compression inside and outside the core. It has already been shown that data compression in the cache can be exploited to infer what data is stored in a cache line depending on the level of compression [
155]. Value prediction allows an attacker to inject data into the victim’s operations if the predictor is not protected [
141].
Summary. Limiting the speculative state is always bound to be a complex task, as the current paradigm for designing and implementing micro-architectures does not define speculative state. Partially defining the micro-architectural state in the ISA is a solution that moves the responsibility for the problem to the programmer. Historically, shifting responsibilities from the micro-architecture to the programmer has not been advantageous. Micro-architecture design usually revolves around facilitating the programmer's work. Micro-architectures employ OoO execution because an OoO execution engine is able to extract good performance from non-performant code. Cache hierarchies are employed because programs implicitly exhibit spatial and temporal locality in their memory accesses. An example where micro-architectures were designed around a programmer's ability to write correct and performant code is memory consistency models [
106]. Weaker memory models can provide better performance than strong memory models. However, they rely on the programmer having the knowledge to correctly insert memory ordering instructions to get the expected results without sacrificing performance. A recent trend in modern computer architectures sees a preference toward stronger memory models due to the ease of programming. A recent industry example is the stronger ARMv8 memory model. Up until ARMv7, ARM employed a weak memory model which was hard to formally define due to the numerous possible outcomes in many litmus tests [
118]. The recent RISC-V memory model, which is still being defined, also shows features that would previously be only in strong memory models [
126]. Similarly to memory consistency models, the micro-architectural states allowed after a speculative event can also be defined using strong and weak qualifiers. A weak speculation model allows any state to result from any speculation. A strong speculation model allows a finite number of states to result from a set of known speculative events.
6.2 Limited-Communication-Channel Defenses
Unlike limited-speculation defenses, limited-communication-channel defenses allow cores to speculate. The observation is that speculation is not inherently insecure; the insecurity comes instead from the attacker being able to build a communication channel with the victim. Table
4 shows four techniques for preventing the attacker from communicating with the victim using the cache hierarchy as a communication channel. Table
4 uses the same categories as the previous section: Method, Protected Components, Drawbacks, Maximum Performance Penalty, and backward compatibility (software and hardware).
Cache Partitioning. Cache levels, across the hierarchy, are split among all running processes in the system. The size of each partition can be statically defined [
6,
32,
35,
53,
88], wherein a fixed number of sets, ways, or both is always reserved for a given process, or dynamically defined [
33,
50,
164,
185], depending on certain heuristics. Regardless of whether partitions are dynamic or static, there is always a limit to the number of partitions a cache can hold. Partitioning uses unique process identifiers to gate access to a partition's state. No process other than the owner can change the state of a partition, namely its size or content. Statically partitioned caches are immune to the construction of communication channels because no process other than the owner of a partition can manipulate its state. However, due to the strictness of the partition-state guarantees, these designs lose performance when more demanding processes are given smaller partitions than less demanding processes. Dynamically defined partitions circumvent the performance issues but may be vulnerable to a new kind of communication channel. An attacker can deploy multiple processes which attempt to shrink the victim's partition and occupy all but one partition in the cache. When the victim requires a larger partition, the controller has to reduce an attacker-controlled partition. The attacker can inspect all of its partitions and infer some data transmission from how a partition was reduced. It is important to note that, once again, if the attacker does not use a communication channel in the cache hierarchy, then the defenses provided by Cache Partitioning are ineffective.
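The isolation property of static partitioning can be shown with a minimal model. The `PartitionedCache` class and its allocation policy are invented for this sketch (a 4-way cache with way-granularity ownership); it only demonstrates why an attacker confined to its own ways cannot evict the victim's lines.

```python
# Toy statically way-partitioned cache: each process owns a fixed slice
# of ways, so eviction pressure never crosses partition boundaries.

class PartitionedCache:
    def __init__(self, ways, owners):
        # owners maps a process id to the ways it exclusively owns.
        self.ways = [None] * ways
        self.owners = owners

    def insert(self, pid, line):
        for w in self.owners[pid]:       # allocate only in owned ways
            if self.ways[w] is None:
                self.ways[w] = (pid, line)
                return
        w = self.owners[pid][0]          # evict within own partition only
        self.ways[w] = (pid, line)

    def present(self, line):
        return any(e is not None and e[1] == line for e in self.ways)

cache = PartitionedCache(4, {"victim": [0, 1], "attacker": [2, 3]})
cache.insert("victim", 0x100)

# However much the attacker thrashes its own partition, the victim's
# line survives -- no eviction-based channel can be built.
for line in range(16):
    cache.insert("attacker", line)

assert cache.present(0x100)
```

The performance drawback noted above also falls out of the model: the attacker's working set of 16 lines is squeezed into 2 ways regardless of demand, while the victim holds 2 ways it may not need.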
Randomized Caches. In contrast to partitioned caches, instead of splitting the cache and sacrificing performance, randomized caches leverage the observation that all communication channels can be reduced to eviction-based channels. Replacement-policy-based channels, similarly to eviction-based channels, require finding multiple addresses that occupy the same set. Coherence-based channels require shared memory not to be deduplicated in the cache hierarchy. Therefore, replacement-policy-based channels can be considered a subset of eviction-based channels, while coherence-based channels can be reduced to eviction-based channels by duplicating memory in the cache hierarchy. The latter reduction has a side-effect wherein cacheable shared memory cannot be writable, as that would require stores to shared memory to modify multiple cache lines in the hierarchy. By reducing all communication channels to the same type, randomized caches seek to make the problem of “finding a group of addresses which occupy the same set” hard. This is achieved by using a different hash function per way. The traditional set-associative cache is split into
w direct-mapped caches (
\(\text{ways} = 1\)), wherein each direct-mapped cache will use a different hash function. The set is the group of all cache lines returned from each direct-mapped cache. There are proposals which use cryptographic hashes [
131,
137,
169], while others will change the hash dynamically [
121,
122] or use a single-layer of pointer redirection [
89]. Other works do not rely on these techniques and seek to define security by intrinsically tying multiple states together and allowing displacements within the cache [
41]. Similarly to cache partitioning, if the attacker does not use a communication channel in the cache hierarchy, then the defenses provided by Randomized Caches are ineffective.
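The per-way hashing idea can be sketched in a few lines. The keyed hash below is purely illustrative (no real randomized-cache design uses this exact construction): each way indexes with its own key, so two addresses that collide in one direct-mapped way almost never collide in all of them, which is what makes eviction-set construction hard.

```python
# Sketch of a skewed (randomized) cache index: each way derives its set
# index from a different keyed hash of the address.

import hashlib

def way_index(addr, way, sets=64, key=b"per-boot-key"):
    # Keyed hash per way; the key would be secret and refreshed (e.g.,
    # per boot) in a real design. blake2b here is only for illustration.
    h = hashlib.blake2b(addr.to_bytes(8, "little"),
                        key=key + bytes([way]), digest_size=2)
    return int.from_bytes(h.digest(), "little") % sets

# In a conventional cache, two addresses with equal set bits collide in
# every way; here a full collision across all 8 ways is astronomically
# unlikely, so the pair cannot form a reliable eviction set.
a, b = 0x1000, 0x2000
collisions = sum(way_index(a, w) == way_index(b, w) for w in range(8))
```

The dynamic-rehashing proposals cited above correspond, in this sketch, to periodically replacing `key`, which invalidates any eviction set the attacker has managed to accumulate.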
Low Resolution Timers. The communication channels described in Section
4 rely on measuring the latency of memory operations to infer what data was transmitted. A possible defense is to decrease the granularity of the timer read by the attacker, such that no difference can be detected between an access serviced by the cache hierarchy and the external memory, or between two different events in the cache hierarchy. Certain execution environments, e.g., browsers, forbid software from accessing the high-performance counters and the timers available in the systems [
103,
144,
149,
152]. However, it has been shown that high-resolution timers can be built using other methods [
104,
136,
170,
173,
186]. Generally, these timers are constructed by executing constant-time code and deriving a clock from its execution time. Another option is to amplify the latency of transient instructions, e.g., by relying on multiple high-latency micro-architectural events, such that the low-resolution timer cannot hide the sequence of events being measured.
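Both the defense and the amplification counterattack reduce to simple arithmetic, modeled below. The latencies and timer quantum are illustrative numbers, not measurements from any real system.

```python
# Model of why coarse timers blunt cache channels, and how amplification
# defeats them. All numbers are invented, in arbitrary time units.

QUANTUM = 100        # coarse timer resolution
HIT, MISS = 10, 40   # single-access latencies: both below one quantum

def coarse_time(t):
    # A low-resolution timer only reports multiples of its quantum.
    return (t // QUANTUM) * QUANTUM

# One access is invisible: hit and miss land in the same timer bucket.
single_hidden = coarse_time(HIT) == coarse_time(MISS)

# Amplify by chaining many dependent accesses: the accumulated latency
# difference now spans more than one quantum and becomes measurable.
N = 8
amplified_visible = coarse_time(N * HIT) != coarse_time(N * MISS)
```

This is why lowering timer resolution alone is a weak defense: the attacker chooses `N`, so any fixed quantum can be out-amplified given enough repetitions.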
Coherence Protocol. The cache hierarchy is manipulated by memory accesses performed by all cores in the system, speculative or not. The coherence protocol is tasked with maintaining a shared global state between all caches in a system. Any memory access creates coherence traffic in the network. Therefore, even if the cache hierarchy could be efficiently cleared on a mis-speculation, the attacker could still build a communication channel through the latency of the coherence network. Instead of defending particular cache levels, coherence protocol defenses aim to defend the whole cache hierarchy by controlling which cache lines are available in the hierarchy. Similarly to the trap speculation technique in Section
6.1, the coherence protocol reverses the effects of speculation when a misprediction occurs [
172].
Discussion. A significant advantage of using limited-communication-channel defenses is that the performance penalty should be lower while the core remains unchanged. The cited works herein show a maximum performance penalty of 13% [
184] for limited-communication-channel defenses whereas the limited-speculation defenses show a 125% [
167] maximum performance penalty. Focusing on communication-channel defenses has three advantages: the ISA does not need to define the global micro-architectural/speculative state; only a small portion of micro-architectural states lead to a communication channel; and blocking data transfers in the communication channel can be designed and deployed independently of the micro-architectural state which connects to the channel. Since any TEA requires a communication channel, a limited-communication-channel defense provides a clear separation of concerns between the speculative state and security. Because limited-communication-channel defenses focus on the cache hierarchy, which has no defined architectural state, most defenses should maintain SW BC. The same is not true for limited-speculation defenses. However, HW BC is not maintained due to the required changes to the cache hierarchy. No industry vendor has yet deployed any defense of this type. One might think that Intel's
Cache Allocation Technology (CAT) could be considered a currently employed defense. However, CAT is insecure because it still allows page sharing between victim and attacker which permits setting up FLUSH+RELOAD [
179] or FLUSH+FLUSH [
46] communication channels [
79]. The reluctance to deploy limited-communication-channel defenses may stem from the complexity of the cache hierarchy and from the fact that other communication channels may be used [
18,
28,
116,
124].
Summary. If the goal of delivering performance improvements year-on-year is to be maintained, the current solution closest to that goal is limited-communication-channel defenses. There is significant interest in the community in designing secure caches, not only to deter TEAs but also to improve the security of trusted execution environments [
29]. Communication channels need to keep being cataloged. This process involves understanding how the communication channel is built, how data is transferred, and how the channel construction and/or transfer can be blocked.