ZombieLoad: Cross-Privilege-Boundary Data Sampling

Michael Schwarz
Graz University of Technology
michael.schwarz@iaik.tugraz.at

Moritz Lipp
Graz University of Technology
moritz.lipp@iaik.tugraz.at

Daniel Moghimi
Worcester Polytechnic Institute
amoghimi@wpi.edu

Jo Van Bulck
imec-DistriNet, KU Leuven
jo.vanbulck@cs.kuleuven.be

Julian Stecklina
Cyberus Technology
julian.stecklina@cyberus-technology.de

Thomas Prescher
Cyberus Technology
thomas.prescher@cyberus-technology.de

Daniel Gruss
Graz University of Technology
daniel.gruss@iaik.tugraz.at

ABSTRACT
In early 2018, Meltdown first showed how to read arbitrary kernel memory from user space by exploiting side effects from transient instructions. While this attack has been mitigated through stronger isolation boundaries between user and kernel space, Meltdown inspired an entirely new class of fault-driven transient execution attacks. Particularly, over the past year, Meltdown-type attacks have been extended to leak data not only from the L1 cache but also from various other microarchitectural structures, including the FPU register file and store buffer.

In this paper, we present the ZombieLoad attack which uncovers a novel Meltdown-type effect in the processor’s previously unexplored fill-buffer logic. Our analysis shows that faulting load instructions (i.e., loads that have to be re-issued for either architectural or microarchitectural reasons) may transiently dereference unauthorized destinations previously brought into the fill buffer by the current or a sibling logical CPU. Hence, we report data leakage of recently loaded stale values across logical cores. We demonstrate ZombieLoad’s effectiveness in a multitude of practical attack scenarios across CPU privilege rings, OS processes, virtual machines, and SGX enclaves. We discuss both short and long-term mitigation approaches and arrive at the conclusion that disabling hyperthreading is the only possible workaround to prevent this extremely powerful attack on current processors.

CCS CONCEPTS
• Security and privacy → Side-channel analysis and countermeasures; Systems security; Operating systems security.

KEYWORDS
side-channel attack, transient execution, fill buffer, Meltdown

1 INTRODUCTION
In 2018, Meltdown [45] was the first microarchitectural attack completely breaching the security boundary between the user and kernel space and, thus, allowed arbitrary data to be leaked. While Meltdown was fixed using a stronger isolation between user and kernel space, the underlying principle turned out to be an entire class of transient-execution attacks [9]. Over the past year, researchers have demonstrated that Meltdown-type attacks can not only leak kernel data to user space, but also leak data across user processes, virtual machines, and SGX enclaves [68, 75]. Furthermore, data can be leaked not only from the L1 cache but also from other microarchitectural structures, including the register file [67], the line-fill buffer [45, 72], and, as shown in concurrent work, the store buffer [53].

Instead of executing the instruction stream in order, most modern processors can re-order instructions while maintaining architectural equivalence, creating the illusion of an in-order machine. Instructions then may already have been executed when the CPU detects that a previous instruction raises an exception. Hence, such instructions following the faulting instruction (i.e., transient instructions) are rolled back. While the rollback ensures that there are no architectural effects, side effects might remain in the microarchitectural state. Most Meltdown-type data leaks exploit overly aggressive performance optimizations around out-of-order execution.

For many years, the microarchitectural state was considered invisible to applications, and hence security considerations were often limited to the architectural state. Specifically, microarchitectural elements often do not distinguish between different applications or privilege levels [9, 14, 37, 45, 58, 61, 63].

In this paper, we show that, first, there still are unexplored microarchitectural buffers, and second, both architectural and microarchitectural faults can be exploited. With our notion of “microarchitectural faults”, i.e., faults that cause a memory request to be re-issued internally without ever becoming architecturally visible, we demonstrate that Meltdown-type attacks can also be triggered without raising an architectural exception such as a page fault. Based on this, we demonstrate ZombieLoad, a novel, extremely powerful Meltdown-type attack targeting the fill buffer logic.

ZombieLoad exploits the fact that load instructions which have to be re-issued internally may first transiently compute on stale values belonging to previous memory operations from either the current or a sibling hyperthread. Using established transient execution attack techniques, adversaries can recover the values of such “zombie load” operations. Importantly, in contrast to all previously known transient execution attacks [9], ZombieLoad reveals recent data values without adhering to any explicit address-based selectors. Hence, we consider ZombieLoad an instance of a novel type of microarchitectural data sampling attacks. We present microarchitectural data sampling as the missing link between traditional memory-based
Schwarz et al.

side-channels which correlate data addresses within a victim execution, and existing Meltdown-type transient execution attacks that can directly recover data values belonging to an explicit address. In this paper, we combine primitives from traditional side-channel attacks with incidental data sampling in the time domain to construct extremely powerful attacks with targeted leakage in the address domain. This not only opens up new attack avenues, but also re-enables attacks that were previously assumed to be mitigated.

We demonstrate ZombieLoad’s real-world implications in a multitude of practical attack scenarios that leak across processes, privilege boundaries, and even across logical CPU cores. Furthermore, we show that we can leak Intel SGX enclave secrets loaded from a sibling logical core. We demonstrate that ZombieLoad attackers may extract sealing keys from Intel’s architectural quoting enclave, ultimately breaking SGX’s confidentiality and remote attestation guarantees. ZombieLoad is furthermore not limited to native code execution, but also works across virtualization boundaries. Hence, virtual machines can attack not only the hypervisor but also different virtual machines running on a sibling logical core. We conclude that disabling hyperthreading, in addition to flushing several microarchitectural states during context switches, is the only possible workaround to prevent this extremely powerful attack.

Contributions. The main contributions of this work are:

(1) We present ZombieLoad, a powerful data sampling attack leaking data accessed on the same or sibling hyperthread.
(2) We combine incidental data sampling in the time domain with traditional side-channel primitives to construct a targeted information flow similar to regular Meltdown attacks.
(3) We demonstrate ZombieLoad in several real-world scenarios: cross-process, cross-VM, user-to-kernel, and SGX.
(4) We show that ZombieLoad breaks the security guarantees provided by Intel SGX.
(5) We are the first to do post-processing of the leaked data within the transient domain to eliminate noise.

Outline. Section 2 provides background. Section 3 provides an overview of ZombieLoad, and introduces a novel classification scheme for memory-based side-channel attacks. Section 4 describes attack scenarios and the respective attacker models. Section 5 introduces and evaluates the basic primitives required for mounting ZombieLoad. Section 6 demonstrates ZombieLoad in real-world attack scenarios. Section 7 discusses possible countermeasures. We conclude in Section 8.

Responsible Disclosure. We provided Intel with a PoC leaking uncacheable-typed memory locations from a concurrent hyperthread on March 28, 2018. We clarified to Intel on May 30, 2018, that we attribute the source of this leakage to the LFB. In our experiments, this works identically for Foreshadow (Meltdown-P), undermining the completeness of L1-flush-based mitigations. This issue was acknowledged by Intel and tracked under CVE-2019-11091. We responsibly disclosed the main attack presented in this paper to Intel on April 12, 2019. Intel verified and acknowledged our findings and assigned CVE-2018-12130 to this issue. Both issues were part of an embargo ending on May 14, 2019.

2 BACKGROUND
In this section, we describe the background required for this paper.

2.1 Transient Execution Attacks
Today’s high-performance processors typically implement an out-of-order execution design, allowing the CPU to utilize different execution units in parallel. The instruction stream is decoded in-order into simpler micro-operations (µOPs) [15] which can be executed as soon as the required operands are available. A dedicated reorder buffer stores intermediate results and ensures that instruction results are committed to the architectural state in the order specified by the program’s instruction stream. Any fault that occurred during the execution of an instruction is handled at instruction retirement, leading to a pipeline flush which squashes any outstanding µOP results from the reorder buffer.

In addition, modern CPUs employ speculative execution optimizations to avoid stalling the instruction pipeline until a conditional branch is resolved. The processor predicts the most likely outcome of the branch and continues execution along that direction. If the branch is resolved and the prediction was correct, the speculative results retire in-order, yielding a measurable performance improvement. On the other hand, if the prediction was wrong, the pipeline is flushed, and any speculative results are squashed in the reorder buffer. We refer to instructions that are executed speculatively or out-of-order but whose results are never architecturally committed as transient instructions [9, 45, 68].

While the results and the architectural effects of transient instructions are discarded, measurable microarchitectural side effects may remain and are not reverted. Attacks that exploit these side effects to observe sensitive information are called transient execution attacks [9, 42, 45]. Typically, these attacks utilize a cache-based covert channel to transmit the secret data observed transiently from the microarchitectural domain to an architectural state. However, other covert channels can be utilized as well [6, 62]. In line with a recent exhaustive survey [9], we refer to attacks exploiting misprediction [29, 40, 42, 43, 49] as Spectre-type, whereas attacks exploiting transient execution after a CPU exception [9, 40, 45, 67, 68, 75] are classified as Meltdown-type.

2.2 Memory Subsystem
The CPU architecture defines different instructions to load data from memory. In this section, we give a high-level overview of how out-of-order CPUs handle memory loads. However, as the actual implementation of the microarchitecture is usually not publicly documented, we rely on patents held by Intel to back up possible implementation details.

Caches. To improve the performance of memory accesses, CPUs contain small and fast internal caches that store frequently used data. Caches are typically organized in multiple levels that are either private per core or shared amongst them. Modern CPUs typically use n-way set-associative caches containing n cache lines per set, each typically 64 B wide. Usually, modern Intel CPUs have a private first-level instruction (L1I) and data cache (L1D) and a unified L2 cache. The last-level cache (LLC) is shared across all cores.
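
The set-associative organization described under Caches can be illustrated with a minimal sketch that splits an address into tag, set index, and byte offset. The class name `CacheGeometry`, its fields, and the 32 KiB, 8-way example geometry are assumptions chosen for illustration; only the 64 B line width comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical description of an n-way set-associative cache."""
    size_bytes: int        # total capacity of the cache
    ways: int              # n in "n-way set-associative"
    line_bytes: int = 64   # cache-line width stated in the text

    @property
    def num_sets(self) -> int:
        # Each set holds `ways` lines of `line_bytes` bytes each.
        return self.size_bytes // (self.ways * self.line_bytes)

    def decompose(self, address: int) -> tuple[int, int, int]:
        """Split an address into (tag, set index, byte offset in the line)."""
        offset = address % self.line_bytes
        line_number = address // self.line_bytes
        set_index = line_number % self.num_sets
        tag = line_number // self.num_sets
        return tag, set_index, offset


# Assumed example geometry: a 32 KiB, 8-way L1D with 64 B lines.
L1D = CacheGeometry(size_bytes=32 * 1024, ways=8)

if __name__ == "__main__":
    print(L1D.num_sets)              # number of sets for this geometry
    print(L1D.decompose(0x12345))    # (tag, set index, byte offset)
```

Under this assumed geometry, two addresses that differ by a multiple of `num_sets * line_bytes` fall into the same set and compete for its n ways, which is the property cache-based covert channels rely on.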

Virtual Memory. CPUs use virtual memory to provide memory isolation between processes. Virtual addresses are translated to physical memory locations using multi-level translation tables. The translation table entries define the properties, e.g., access control or memory type, of the referenced memory region. The CPU contains the translation lookaside buffer (TLB), consisting of additional caches to store address-translation information.

Memory Order Buffer. µOPs that deal with memory operations are handled by dedicated execution units. Typically, Intel CPUs contain two units responsible for loading data and one for storing data. While the reorder buffer resolves register dependencies, out-of-order executed µOPs can still have memory dependencies. In an out-of-order CPU, the memory order buffer (MOB), incorporating a load buffer and a store buffer, controls the dispatch of memory operations and tracks their progress to resolve memory dependencies.

Data Loads. For every dispatched load operation, an entry is allocated in the load buffer and in the reorder buffer. The allocated load-buffer entry holds information about the operation, e.g., ordering constraints, the reorder buffer ID, or the age of the most recent store. To determine the physical address, the upper 36 bits of the linear address are translated by the memory management unit. Concurrently, the untranslated lower 12 bits are already used to index the cache set in the L1D [19]. If the address translation is cached in the TLB, the physical address is available immediately. Otherwise, the page miss handler (PMH) is activated to perform a page-table walk to retrieve the address translation as well as the corresponding permission bits. With the physical address, the tag and, thus, the way of the cache is determined. If the requested data is in the L1D (cache hit), the load operation can be completed.

If data is not in the L1D, it needs to be served from higher levels of the cache or the main memory via the line-fill buffer (LFB). The LFB serves as an interface to other caches and the main memory and keeps track of outstanding loads. Memory accesses to uncacheable memory regions and non-temporal moves all go through the LFB. If a load corresponds to an entry of a previous load operation in the load buffer, the loads can be merged [1, 57].

On a fault, e.g., if a physical address is not available, the page-table walk will not immediately abort [19]. Still, an instruction in a pipelined implementation must undergo each stage regardless of whether a fault occurred or not [2], and is reissued in case of a fault. The fault is handled, and the pipeline is flushed, only at the retirement of the faulting µOP [18, 19]. If a fault occurs within a load operation, it is still marked as “valid and completed” in the MOB [2].

2.3 Processor Extensions
Microcode. Initially, all instructions were hardwired in the CPU core. However, to support more complex instructions, microcode allows implementing higher-level instructions using multiple hardware-level instructions. Importantly, this allows processor vendors to support complex behavior and even extend or modify CPU behavior through microcode updates [31]. Preferably, new architectural features are implemented as microcode extensions, e.g., Intel SGX [38].

While the execution units perform the fast-paths directly in hardware, more complex slow-path operations are typically performed by issuing a microcode assist which points the sequencer to a predefined microcode routine [13]. To do so, the execution unit associates an event code with the result of the faulting micro-op. When the micro-op of the execution unit is committed, the event code causes the out-of-order scheduler to squash all in-flight micro-ops in the reorder buffer [13]. The microcode sequencer uses the event code to read the micro-ops associated with the event in the microcode [7].

Intel TSX. Intel TSX is an x86 instruction set extension to support hardware transactional memory [35] which has been introduced with Intel Haswell CPUs. With TSX, particular code regions are executed transactionally. If the entire code region completes successfully, memory operations within the transaction appear as an atomic commit to other logical processors. If an issue occurs during the transaction, a transactional abort rolls back the execution to an architectural state before the transaction, thereby discarding all performed operations. Transactional aborts can be caused by different issues: Typically, a conflicting memory operation occurs where another logical processor either reads from an address which has been modified within the transaction or writes to an address which is used within the transaction. Further, the amount of read and written data within the transaction may not exceed the size of the LLC and L1 cache, respectively, for the transaction to succeed [31]. In addition, some instructions or system events might cause the transaction to abort as well [35].

Intel SGX. With the Skylake microarchitecture, Intel introduced Software Guard Extensions (SGX), an instruction-set extension for isolating trusted code [31]. SGX executes trusted code inside so-called enclaves, which are mapped in the virtual address space of a conventional host application process but are isolated from the rest of the system by the hardware itself. The threat model of SGX assumes that the operating system and all other running applications could be compromised and, therefore, cannot be trusted. Any attempt to access SGX enclave memory in non-enclave mode results in abort page semantics, i.e., regardless of the current privilege level, reads return the dummy value 0xff and writes are ignored [30]. Furthermore, to protect against powerful physical attackers probing the memory bus, the SGX hardware transparently encrypts the memory region used by enclaves [13].

A dedicated eenter instruction redirects control flow to an enclave entry point, whereas eexit transfers back to the untrusted host application. Furthermore, in case of an interrupt or fault, SGX securely saves CPU registers inside the enclave’s save state area (SSA) before vectoring to the untrusted operating system. Next, the eresume instruction can be used to restore processor state from the SSA frame and continue a previously interrupted enclave.

SGX-capable processors feature cryptographic key derivation facilities through the egetkey instruction, based on a CPU-level master secret and a secure measurement of the calling enclave’s initial code and data. Using this key, enclaves can securely seal secrets for untrusted persistent storage, and establish secure communication channels with other enclaves residing on the same processor. Furthermore, to enable remote attestation, Intel provides a trusted quoting enclave which unseals an Intel-private key and generates an asymmetric signature over the local enclave identity report.
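
The address handling described under Data Loads in Section 2.2 (upper 36 bits translated, lower 12 bits used untranslated for the L1D lookup) can be made concrete with a small sketch. The function name `split_linear_address` and the field names are hypothetical. Treating bits 6 to 11 as the set index is an inference from the 12 untranslated bits and the 64 B line width, not a documented layout.

```python
PAGE_OFFSET_BITS = 12   # untranslated lower bits of the linear address
PAGE_NUMBER_BITS = 36   # upper bits translated by the memory management unit
LINE_OFFSET_BITS = 6    # 64 B cache lines need 6 bits to select a byte


def split_linear_address(addr: int) -> dict[str, int]:
    """Split a 48-bit linear address into the parts named in the text."""
    if not 0 <= addr < 1 << (PAGE_NUMBER_BITS + PAGE_OFFSET_BITS):
        raise ValueError("expected a 48-bit linear address")
    page_offset = addr & ((1 << PAGE_OFFSET_BITS) - 1)
    return {
        # Needs translation (TLB hit or page-table walk) before use.
        "page_number": addr >> PAGE_OFFSET_BITS,
        # Identical in the linear and the physical address.
        "page_offset": page_offset,
        # Assumed: bits 6..11 select the L1D set, so the lookup can
        # start while the page number is still being translated.
        "l1d_set_index": page_offset >> LINE_OFFSET_BITS,
        # Bits 0..5 select the byte inside the 64 B line.
        "byte_in_line": page_offset & ((1 << LINE_OFFSET_BITS) - 1),
    }


if __name__ == "__main__":
    for name, value in split_linear_address(0x7FFF12345ABC).items():
        print(f"{name}: {value:#x}")
```

Because the set index lies entirely within the page offset under these assumptions, the cache set can be selected before the physical address is known; only the final tag comparison has to wait for the translation.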

Over the past years, researchers have demonstrated various attacks to leak sensitive data from SGX enclaves, e.g., through memory safety violations [44], race conditions [74], or side-channels [54, 63, 70, 71]. More recently, SGX was also compromised by transient execution attacks [11, 68] which necessitated microcode updates and increased the processor’s security version number (SVN). All SGX key derivations and attestations include SVN to reflect the current microcode version, and hence security level.

3 ATTACK OVERVIEW
In this section, we provide an overview of ZombieLoad. We describe what can be observed using ZombieLoad and how that fits into the landscape of existing side-channel attacks. We thereby show that ZombieLoad is a novel category of side-channel attacks, which we refer to as data-sampling attacks, opening a new research field.

3.1 Overview
ZombieLoad is a transient-execution attack [9] which observes the values of memory loads on the current physical CPU. ZombieLoad exploits the fact that the fill buffer is accessible by all logical CPUs of a physical CPU core and that it does not distinguish between processes or privilege levels.

The load buffer acts as a queue for all memory loads from the memory subsystem. Whenever the CPU encounters a memory load during execution, it reserves an entry in the load buffer. If the load was not an L1 hit, it additionally requires a fill-buffer entry. When the requested data has been loaded, the memory subsystem frees the corresponding load- and fill-buffer entries, at which point the corresponding load instruction may retire.

However, we observed that under certain complex microarchitectural conditions (e.g., a fault), where the load requires a microcode assist, it may first read stale values before being re-issued eventually. As with any Meltdown-type attack, this opens up a transient execution window in which this value can be used for subsequent calculations before the execution is aborted and rolled back. Thus, an attacker can encode the leaked value into a microarchitectural element, such as the cache.

In contrast to previous Meltdown-type attacks, however, it is not possible to select the value to leak based on an attacker-specified address. ZombieLoad simply leaks any value which is currently loaded by the physical CPU core. While this at first sounds like a massive limitation, we show that this opens a new field of side-channel attacks. We show that ZombieLoad is an even more powerful attack when combined with existing techniques known from traditional side-channel attacks.

3.2 Microarchitectural Root Cause
For Meltdown, Foreshadow, and Fallout, the source of the leakage is apparent. Moreover, for these attacks, there are plausible explanations on what is going wrong in the microarchitecture, i.e., what the root cause of the leakage is [45, 53, 68, 75]. For ZombieLoad, however, this is not entirely clear.

While we identified some necessary building blocks to observe the leakage (cf. Section 5), we can only provide a hypothesis on why the interaction of the building blocks leads to the observed leakage. As we could only observe data leakage on Intel CPUs, we assume that this is indeed an implementation issue (such as Meltdown) and not an issue with the underlying design (as with Spectre). For our hypothesis, we combined our observations with the nearly non-existent official documentation of the fill buffer [31, 32]. Ultimately, we could neither prove nor disprove our hypothesis, leaving the verification or falsification of our hypothesis to future work.

Stale-Entry Hypothesis. Every load is associated with an entry in the load buffer and potentially an entry in the fill buffer [32]. When a load encounters a complex situation, such as a fault, it requires a microcode assist [31]. This microcode assist triggers a machine clear, which flushes the pipeline. On a pipeline flush, instructions which are already in flight still finish execution [28]. As this has to be as fast as possible to not incur additional delays, we expect that fill-buffer entries are optimistically matched as long as parts of the physical address match. Thus, the load continues with a wrong fill-buffer entry, which was valid for a previous load. This leads to a use-after-free vulnerability [24] in the hardware. Intel documents the fill buffer as being competitively shared among hyperthreads [31], giving both logical cores access to the entire fill buffer (cf. Appendix A). Consequently, the stale fill-buffer entry can also be from a previous load of the sibling logical core. As a result, the load instruction loads valid data from a previous load.

Leakage Source. We devised two experiments to reduce the number of possible sources of the leaked data.

In our first experiment, we marked a page as “uncacheable” via the page-table entry and flushed the page from the cache. As a result, every memory load from the page circumvents all cache levels and directly travels from the main memory to the fill buffer [31]. We then wrote the secret onto the uncacheable memory page to ensure that there is no copy of the data in the cache. When loading data from the uncacheable memory page, we can see leakage, but the leakage rate is only on the order of bytes per second, e.g., 5.91 B/s (σx̄ = 0.18, n = 100) on an i7-8650U. We can attribute this leakage to the fill buffer. This was also exploited in concurrent work [72]. Our hypothesis is further backed by the MEM_LOAD_RETIRED.FB_HIT performance counter, which shows multiple thousand line-fill-buffer hits (117 330 FB_HIT/s (σx̄ = 511.57, n = 100)).

Intel claims that the leakage is entirely from the fill buffer. However, our second experiment shows that the line-fill buffer might not be the only source of the leakage. We rely on Intel TSX to ensure that memory accesses do not reach the line-fill buffer as follows. Inside a transaction, we first write the secret value to a memory location which was previously initialized with a different value. The write inside the transaction ensures that the address is in the write set of the transaction and thus in L1 [32, 60]. Evicting data from the write set from the cache leads to a transactional abort [32]. Hence, any subsequent memory access to data from the write set is guaranteed to be served from the L1, and therefore, no request to the line-fill buffer is sent [31]. In this experiment, we see a much higher rate of leakage which is on the order of kilobytes per second. More importantly, we only see the value written inside the TSX transaction and not the value that was at the memory location before starting the transaction. Our hypothesis that the line-fill buffer is not the only source of the leakage is further backed by observing performance counters. The MEM_LOAD_RETIRED.FB_HIT and MEM_LOAD_RETIRED.L1_MISS performance counters do not

increase significantly. In contrast, the MEM_LOAD_RETIRED.L1_HIT performance counter shows multiple thousand L1 hits.

While accessing the data to leak on the victim core, we monitored the MEM_LOAD_RETIRED.FB_HIT performance counter on the attacker core for 10 s. If the address was cached, we measured a Pearson correlation of r_p = 0.02 (n = 100) between the correct recoveries and line-fill buffer hits, indicating no association. However, while continuously flushing the data on the victim core, ensuring that a subsequent access must go through the LFB, we measure a strong correlation of r_p = 0.86 (n = 100). This result indicates that the line-fill buffer is not the only source of leakage. However, a different explanation might be that the performance counters are not reliable in such corner cases. Future work has to investigate whether other microarchitectural elements, e.g., the load buffer, are also involved in the observed data leakage.

3.3 Classification
In this section, we introduce a way to classify memory-based side-channel and transient-execution attacks. For all these attacks, we assume a target program which executes a memory operation at a certain address with a specific data value at the program’s current instruction pointer. Figure 1 illustrates these three properties as the corners of a triangle, together with techniques which let an attacker infer one of the properties based on one or both of the other properties.

[Figure 1 diagram: a triangle whose edges are labeled Memory-based Side-channel Attacks, Meltdown, and Data Sampling (this paper); the recoverable corner labels are Instruction Pointer and Data Address.]
Figure 1: The 3 properties of a memory operation: instruction pointer of the program, target address, and data value. So far, there are techniques to infer the instruction pointer from target address and the data value from the address. With ZombieLoad, we show the first instance of an attack which infers the data value from the instruction pointer.

Traditional memory-based side-channel attacks allow an attacker to observe the location of memory accesses. The granularity of the location observation depends on the spatial accuracy of the used side channel. Most common memory-based side-channel attacks [20, 22, 23, 25, 37, 56, 58, 71, 78, 79] have a granularity between one cache line [22, 23, 25, 79], i.e., usually 64 B, and one page [20, 37, 71, 78], i.e., usually 4 kB. These side channels establish a connection between the time domain and the space domain. The time domain can either be the wall time or also commonly the execution time of the program which correlates with the instruction pointer. These classic side channels provide means of connecting the address of a memory access to a set of possible instruction pointers, which then allows reconstructing the program flow. Thus, side-channel resistant applications have to avoid secret-dependent memory access to not leak secrets to a side-channel attacker.

Since early 2018, with transient execution attacks [9] such as Meltdown [45] and Spectre [42], there has been a second type of attack which allows an attacker to observe the value stored at a memory address. Meltdown provided the most control over the target address. With Meltdown, the full virtual address of the target data is provided, and the corresponding data value stored at this address is leaked. The success rate depends on the location of the data, i.e., whether it is in the cache or main memory. However, the only constraint for Meltdown is that the data is addressable using a virtual address [45]. Other Meltdown-type attacks [53, 68] also connect addresses to data values. However, they often impose additional constraints, such as that the data has to be cached in L1 [68, 75], the physical address has to be known [75], or that an attacker can choose only parts of the target address [53].

Figure 2 illustrates which parts of the virtual and physical address an attacker can choose to target data values to leak. For Meltdown, the virtual address is sufficient to target data in the same address space [45]. Foreshadow already requires knowledge of the physical address and the least-significant 12 bits of the virtual address to target any data in the L1, not limited to the attacker’s own address space [68, 75]. When leaking the last writes from the store buffer, an attacker is already limited in choosing which value to leak. It is only possible to filter stores based on the least-significant 12 bits of the virtual address; a more targeted leakage is not possible [53].

[Figure 2 diagram: one row each for Meltdown, Foreshadow, Fallout, and ZombieLoad, showing the physical page number (bits 51 to 12), the virtual page number (bits 47 to 12), and the page offset (bits 11 to 0, split into bits 11 to 6 and 5 to 0 for ZombieLoad).]
Figure 2: Meltdown-type attacks provide a varying degree of target control (gray hatched), from full virtual addresses in the case of Meltdown to nearly no control for ZombieLoad.

Zombie loads provide no control over the leaked address to an attacker. The only possible target selection is the byte index inside the loaded data, which can be seen as an address of up to 6 bits in case an entire cache line is loaded. Hence, we do not count ZombieLoad as an attack which leaks data values based on the address. Instead, from the viewpoint of the target control, ZombieLoad is more similar to traditional memory-based side-channel attacks. With ZombieLoad, an attacker observes the data value of a memory access. Thus, this side channel establishes a connection between the time domain and the data value. Again, the time domain correlates with the instruction pointer of the target address. ZombieLoad is the first instance of a class of attacks which connects the instruction pointer with the data value of a memory access. We refer to such attacks as data sampling attacks. Essentially, this new class of data sampling attacks is capable of breaking side-channel resistant applications, such as constant-time cryptographic algorithms [27].

Following the classification scheme from Canella et al. [9], ZombieLoad is a Meltdown-type transient execution attack, and we propose Meltdown-MCA as the generic name. This reflects that the (microarchitectural) fault type being exploited by ZombieLoad is the microcode assist (MCA, explained further).

4 ATTACK SCENARIOS & ATTACKER MODEL
Following most side-channel attacks, we assume the attacker can execute unprivileged native code on the target machine. Thus, we
assume a trusted operating system if not stated otherwise. This relatively weak attacker model is sufficient to mount ZombieLoad. However, we also show that the increased attacker capabilities offered in certain scenarios, e.g., SGX and hypervisor attacks, may further amplify the leakage while remaining within the threat model of the respective scenario.

At the hardware level, we assume a ubiquitous Intel CPU with simultaneous multithreading (SMT, also known as hyperthreading) enabled. Crucially, we do not rely on existing vulnerabilities, such as Meltdown [45], Foreshadow [68, 75], or Fallout [53].

User-Space Leakage. In the cross-process user-space scenario, an unprivileged attacker leaks values loaded by another concurrently running user-space application. We consider such a cross-process scenario most dangerous for end users, who are not commonly using Intel SGX nor virtual machines. Moreover, many secrets are likely to be found in user-space applications such as browsers or password managers.

The attacker can execute unprivileged code and is co-located with the victim on the same physical but a different logical CPU core. This is a typical case for hyperthreading, where attacker and victim each run on one hyperthread of the same physical core.

Kernel Leakage. In addition to leakage across user-space applications, ZombieLoad can also leak across the privilege boundary between user and kernel space. We demonstrate that the value of loads executed in kernel space is leaked to an unprivileged attacker, executing either on the same or a sibling logical core.

In this scenario, the unprivileged attacker performs a system call to the kernel, running on the same logical core. Importantly, we found that kernel load leakage may even survive the switch back from the kernel to user space. Hyperthreading is hence not a strict requirement for this scenario.

Intel SGX Leakage. In addition to leaking values loaded by the kernel, ZombieLoad can observe loads executed inside an Intel SGX enclave. In this scenario, the attacker is executing on a sibling logical core, co-located with the victim enclave on the same physical core. We demonstrate that ZombieLoad can leak secrets loaded during the enclave’s execution from a concurrent logical core, but we did not observe leakage on the same logical core after exiting the enclave synchronously (eexit) or asynchronously (on interrupt).

While in the aftermath of the Foreshadow [68] attack, current SGX attestations indicate whether hyperthreading has been enabled at boot time, Intel’s official security advisory [34] merely suggests that a remote verifier might reject attestations from a hyperthreading-enabled system “if it deems the risk of potential attacks from the sibling logical processor as not acceptable”. Hence, machines with up-to-date patched microcode may still run with hyperthreading enabled.

Within the SGX threat model, we can leverage the attacker’s first-rate control over the untrusted operating system. An attacker can, for instance, modify page table entries [71], or precisely execute the victim enclave at most one instruction at a time [69].

Virtual Machine Leakage. With ZombieLoad, it is possible to leak loaded values across virtual-machine boundaries. In this scenario, an attacker running inside a virtual machine can leak values from a different virtual machine co-located on the same physical but different logical core. Thus, an attacker can leak values loaded from a virtual machine running on the sibling logical core.

As the attacker is running inside an untrusted virtual machine, the attacker is not restricted to unprivileged code execution. Thus, the attacker can, for instance, modify guest-page-table entries.

Hypervisor Leakage. In the hypervisor scenario, an attacker running inside a virtual machine utilizes ZombieLoad to leak the value of loads executed by the hypervisor.

As the attacker is running inside an untrusted virtual machine, the attacker is not restricted to unprivileged code execution.

Table 1: Overview of different variants to induce zombie loads in different scenarios.

Scenario                      Variant 1   Variant 2
Unprivileged Attacker
Privileged Attacker (root)

Symbols indicate whether a variant can be used in the corresponding attack scenario ( ), can be used depending on the hardware configuration as discussed in Section 5.1 ( ), or cannot be used ( ).

Figure 3: Variant 1: Using huge kernel pages for ZombieLoad. Page p is mapped using a user-accessible address (v) and a kernel-space huge page (k). Flushing v and then reading from k using Meltdown leaks values from the fill buffer.

5 BUILDING BLOCKS
In this section, we describe the building blocks for the attack.

5.1 Zombie Loads
The main primitive for mounting ZombieLoad is a load which triggers a microcode assist, resulting in a transient load containing wrong data. We refer to such a load as a zombie load. Zombie loads are loads which either architecturally or microarchitecturally fault and thus cannot complete, requiring a re-issue of the load at a later point. We identified multiple different scenarios to create such zombie loads required for a successful attack. All variants have in common that they abuse the clflush instruction to reliably create the conditions required for leaking from a wrong destination (cf. Section 3.2). In this section, we describe 2 different variants that can be used to leak data (cf. Section 5.2) depending on the adversary’s capabilities. Table 1 gives an overview of which variant is applicable in which scenario, depending on the operating system and underlying hardware configuration.

Variant 1: Kernel Mapping. The first variant is a ZombieLoad setup which does not rely on any specific CPU feature. We require a kernel virtual address k, i.e., an address where the user-accessible
bit is not set in the page-table entry. In practice, the kernel is usually mapped with huge pages (i.e., 2 MB pages). Thus k refers to a 2 MB physical page p. Note that although we use such huge pages for our experiments, it is not strictly required, as the setup also works with 4 kB pages. We also require the user to have read access to the content of the physical page through a different virtual address v.

Figure 3 illustrates such a setup. In this setup, accessing the page p via the user-accessible virtual address v provides an architecturally valid way to access the contents of the page. Accessing the same page via the kernel address k results in a zombie load similar to Meltdown [45] requiring a microcode assist. Note that while there are other ways to construct an inaccessible address k, e.g., by clearing the present bit [68], we were only able to exploit zombie loads originating from kernel mappings.

To create precisely the scenario depicted in Figure 3, we allocate a page p in the user space with the virtual address v. Note that p is a regular 4 kB page which is accessible through the virtual address v. We retrieve its physical address through /proc/pagemap, or alternatively using a side channel [22, 36]. Using the physical address and the base address of the direct-physical map, we get an inaccessible kernel address k which maps to the allocated page p. If the operating system does not use stronger kernel isolation [21], e.g., KPTI [47], the direct-physical map in the kernel is mapped in the user space and uses huge pages which are marked as not user accessible. In the case of a privileged attacker, e.g., when attacking a hypervisor or SGX enclave, an attacker can easily create such pages if they do not exist.

Variant 2: Microcode-Assisted Page-Table Walk. A variant similar to Variant 1 is to trigger a microcode-assisted page-table walk. If a page-table walk requires an update to the accessed or dirty bit in the page-table entry, it falls back to a microcode assist [13].

In this setup, we require one physical page p which has 2 user-accessible virtual addresses, v and v2. This can be easily achieved by using a shared-memory segment or memory-mapped file, which is mapped twice in the application. The virtual address v can be used to access the contents of p architecturally. For v2, we have to clear the accessed bit in the page-table entry. On Linux, this is not possible in the case of an unprivileged attacker, and this variant can thus only be used in attacks where we assume a privileged attacker (cf. Section 4). However, we experimentally verified that Windows 10 (1803 build 17134.706) periodically clears the accessed bits. We assume that the page-replacement algorithm is responsible for this. Thus, this variant enables the attack on Windows for unprivileged attackers.

When accessing the page through the virtual address v2, the accessed bit of the page-table entry has to be set. This, however, cannot be done by the page-miss handler [13]. Instead, microarchitecturally, the load faults, and a microcode assist is triggered which repeats the page-table walk and sets the accessed bit [13]. If the access to v2 is done transiently, i.e., behind a misspeculated branch or after an exception, the accessed bit cannot be set architecturally. Thus, the leakage is not only exploitable once but instead for every access.

5.2 Data Leakage
To leak data with the setup described in Section 5.1, we constantly flush the first cache line of p through the virtual address v. We achieve this by executing the unprivileged clflush instruction (or clflushopt instruction if available) on the user-accessible virtual address v. For Variant 1, we leverage Meltdown to read from the kernel address k which maps to the cache line flushed before. As with Meltdown-US [45], various methods of preventing an architectural exception can be used. We verified that ZombieLoad with Variant 1 works with exception prevention (i.e., speculative execution), handling (i.e., a custom signal handler), and suppression (i.e., Intel TSX).

For Variant 2, we transiently, i.e., behind a mispredicted branch, read from the address v2.

Counterintuitively, the resulting values leaked for all variants are not coming from page p. Instead, we get access to data which is currently loaded on the current or sibling logical CPU core. Thus, it appears that we reuse fill-buffer entries, and leak the data which the entries reference. For Variant 1 and Variant 2, this allowed us to access all bytes from the cache line that the fill-buffer entry references.

5.3 Data Sampling
Independent of the setup for ZombieLoad, we cannot directly control the address of the data to leak. Both the virtual addresses k and v, as well as the physical address of p, are arbitrary and do not correlate with the leaked data. In any case, we simply get the value referenced by one fill-buffer entry which we cannot specify.

However, there is at least control within the fill-buffer entry, i.e., we can target specific bytes within the 64 B fill-buffer entry. The least-significant 6 bits of the virtual address v refer to the byte within the fill-buffer entry. Hence, we can target a single byte at a specific position from the fill-buffer entry. While at first, this does not sound powerful, it allows leaking sensitive information, such as AES keys, byte-by-byte as shown in Section 6.1.

As described in Section 4, the leakage is not limited to the attacker’s own process. With ZombieLoad, we observe values from all processes running on the same as well as on the sibling logical CPU core. Furthermore, we also observe leakage across privilege boundaries, i.e., from the kernel, hypervisor, and Intel SGX enclaves. Thus, ZombieLoad allows sampling of all data which is loaded by any application on the current physical CPU core.

5.4 Performance Evaluation
In this section, we evaluate ZombieLoad and the performance of our proof-of-concept implementations¹.

¹ https://github.com/IAIK/ZombieLoad

Environment. We evaluated the different variants of ZombieLoad, described in Section 5.1, on different environments listed in Table 2. The tested CPUs range from Sandy Bridge (released 2012) to Cascade Lake (released 2019). We were able to mount Variant 1 and Variant 2 on different microarchitectures except for Whiskey Lake, Coffee Lake-R, and Cascade Lake-SP.
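
The double mapping required for Variant 2 (Section 5.1) and the byte selection described in Section 5.3 can be illustrated in a few lines of C. This is a minimal sketch under stated assumptions, not the proof-of-concept code: the names map_page_twice and fill_buffer_byte_offset are hypothetical, a temporary file stands in for the shared-memory segment or memory-mapped file, and 4 kB pages with 64 B cache lines are assumed. Clearing the accessed bit for v2 and the transient load itself are not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_BYTES 4096u /* assumption: regular 4 kB pages */
#define LINE_BYTES 64u   /* assumption: 64 B fill-buffer entries */

/* Hypothetical helper: maps the first page of a temporary file twice, so
 * that *v and *v2 are two user-accessible virtual addresses backed by the
 * same physical page p. Returns 0 on success and -1 on failure. */
int map_page_twice(unsigned char **v, unsigned char **v2) {
    assert(v != NULL && v2 != NULL);
    FILE *backing = tmpfile();
    if (backing == NULL)
        return -1;
    int fd = fileno(backing);
    if (ftruncate(fd, PAGE_BYTES) != 0) {
        fclose(backing);
        return -1;
    }
    void *first = mmap(NULL, PAGE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    void *second = mmap(NULL, PAGE_BYTES, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    fclose(backing); /* the mappings stay valid after the file is closed */
    if (first == MAP_FAILED || second == MAP_FAILED) {
        if (first != MAP_FAILED)
            munmap(first, PAGE_BYTES);
        if (second != MAP_FAILED)
            munmap(second, PAGE_BYTES);
        return -1;
    }
    *v = first;
    *v2 = second;
    return 0;
}

/* The least-significant 6 bits of the address select the byte position
 * within the 64 B fill-buffer entry (Section 5.3). */
unsigned fill_buffer_byte_offset(const void *addr) {
    return (unsigned)((uintptr_t)addr & (LINE_BYTES - 1u));
}
```

Under these assumptions, a write through v is visible through v2, and an attacker who wants byte i of a fill-buffer entry would read from an address whose offset within the cache line is i.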

Table 2: Tested environments.

Setup   CPU                    µ-arch.            Variant 1   Variant 2
Lab     Core i7-3630QM         Ivy Bridge         ✓           ✓
Lab     Core i7-6700K          Skylake-S          ✓           ✓
Lab     Core i5-7300U          Kaby Lake          ✓           ✓
Lab     Core i7-7700           Kaby Lake          ✓           ✓
Lab     Core i7-8650U          Kaby Lake-R        ✓           ✓
Lab     Core i7-8565U          Whiskey Lake       ✗           ✗
Lab     Core i7-8700K          Coffee Lake-S      ✓           ✓
Lab     Core i9-9900K          Coffee Lake-R      ✗           ✗
Lab     Xeon E5-1630 v4        Broadwell-EP       ✓           ✓
Cloud   Xeon E5-2670           Sandy Bridge-EP    ✓           ✓
Cloud   Xeon Gold 5120         Skylake-SP         ✓           ✓
Cloud   Xeon Platinum 8175M    Skylake-SP         ✓           ✓
Cloud   Xeon Gold 5218         Cascade Lake-SP    ✗           ✗

Performance. To evaluate the performance of each variant, we performed the following experiment on an i7-8650U. While reading a specific value on one logical core, we performed each variant of ZombieLoad on the sibling logical core for 10 s, recording the number of successful and unsuccessful recoveries. For Variant 1 using TSX to suppress the exception, we achieved an average transmission rate of 5.30 kB/s (σx̄ = 0.076, n = 1000) and a true positive rate of 85.74 % (σx̄ = 0.0046, n = 1000). With Variant 2 in combination with signal handling, we achieved an average transmission rate of 0.08 kB/s (σx̄ = 0.002, n = 1000) and a true positive rate of 52.7 % (σx̄ = 0.0062, n = 1000). Variant 2 in combination with TSX achieved an average transmission rate of 7.73 kB/s (σx̄ = 0.21, n = 1000) and a true positive rate of 76.28 % (σx̄ = 0.0055, n = 1000).

6 CASE STUDY ATTACKS
In this section, we present 5 attacks using ZombieLoad in real-world scenarios.

6.1 AES-NI Key Leakage
To demonstrate that data sampling is a powerful side channel, we extract an AES-128 key. The victim application uses AES-NI, which is resistant against timing and cache-based side-channel attacks [27].

However, even with the hardware-assisted AES-NI, the key has to be loaded from memory to a 128-bit XMM register. This is usually the case before invoking AESKEYGENASSIST, which is used to derive the AES round keys. The round-key derivation is entirely done in hardware using the XMM registers. Hence, there is no memory load required for the derivation of the 11 round keys used in AES-128. Thus, the point where the key is loaded from memory, before the round-key derivation starts, is where we can mount ZombieLoad to leak the value of the key. For OpenSSL (v3.0.0), this is in the function aesni_set_encrypt_key which is called by EVP_EncryptInit_ex. Note that instead of leaking the key, we can also leak the round keys loaded in the encryption process. However, to attack the round keys, an attacker needs to leak (and distinguish) more different values, making the attack more complex.

When leaking the key using ZombieLoad, we first have to detect which load corresponds to the key. Moreover, as we can only leak one byte at a time, we also have to combine the leaked bytes to the full AES-128 key correctly.

Side-Channel Synchronization. For the attack, we assume a shared library implementing the AES encryption which can be used by both the attacker and the victim, e.g., OpenSSL. Even though OpenSSL (v3.0.0) has a side-channel resistant AES-NI implementation, we can still rely on classical memory-based side-channel attacks to monitor the control flow. For example, using Flush+Reload, we can detect when a specific part of the code is executed [16, 25]. While this does not leak any secrets, it acts as a synchronization primitive for ZombieLoad.

We constantly monitor a cache line of the code which is executed right before the key is loaded from memory. In OpenSSL (v3.0.0), this is the second cache line of aesni_set_encrypt_key, i.e., 64 B after the start of the function. Similarly to Schwarz et al. [60], we leverage the cache state of the cache line as a trigger for the actual attack. Only if we detect a cache hit on the monitored cache line, we start leaking values using ZombieLoad. Hence, we already filter out most bytes not related to the AES key.

Note that if there is no cache line before the load which can be used as a trigger, we can still use a nearby cache line (i.e., a cache line after the load) as a filter. In a parallel thread, we collect the timestamps of cache hits in the nearby cache line. If we also save the timestamps of the values leaked using ZombieLoad, in an offline post-processing step we can filter out values which were leaked at a different instruction-pointer location.

To further reduce unrelated loads, it is also possible to slow down the victim using performance-degradation techniques such as flushing the code [3, 16]. For OpenSSL, we used performance degradation on the code directly following the load of the key.

Figure 4: Additionally leaking domino bytes comprised of bits of different AES-key bytes to filter out unrelated loads. (Example shown: key_n = 0xD2 and key_n+1 = 0x1C, i.e., bits 1101 0010 0001 1100; (4,4)-domino_n,n+1 = 0x21; (7,1)-domino_n,n+1 = 0xA4.)

Domino Attack. Inevitably, even when synchronizing ZombieLoad by using a cache-based trigger, we also leak values not related to the key. Moreover, for practical reasons, the size of the Flush+Reload covert channel is limited, and we can only transmit a single key byte from the transient domain at a time. Hence, we have a probability distribution for every byte in the AES key. As the bytes in the AES key are independent of each other, we can only assume that the byte with the highest probability is the correct key byte. Thus, if there is a key byte suffering from noise from unrelated loads, we may assume that the noise is the correct key byte, which leads to a wrong key.

Therefore, we propose the Domino attack, an innovative transient error detection technique for reducing noise when leaking multi-byte loads. In addition to leaking every single key byte, we also transmit a specially crafted domino byte composed by combining bits from two adjacent key bytes. Note that creating such
a domino byte is possible, as the transient domain has access to the full AES key and can use it for arbitrary computations (cf. Section 6.3). Figure 4 illustrates the idea of the Domino attack. In this case, we leak (4,4) domino bytes consisting of 4 bits of each of two adjacent key bytes. By combining the lower nibble of one key byte with the higher nibble of the next key byte, we transmit a domino byte which encodes partial information of two key bytes. Hence, in a post-processing step, we combine the probability distribution of two adjacent key bytes with the probability distribution of the domino byte to select the two adjacent key bytes with the highest combined probability. Note that the selection of bits can be adapted to the noise which can be measured before leaking the key, e.g., multiple (7,1) domino bytes can be leaked that are shifted by only a single bit.

Results. We evaluated the attack in a cross-user-space attack (cf. Section 4). We always ran the attack until the correct key was recovered, i.e., until the key with the highest probability is the correct key. In a practical attack, the number of attacks can even be reduced, as typically it is easy to verify whether a key candidate is correct. Thus, an attacker can simply test all key candidates with a probability over a certain threshold and does not have to wait until the highest probability corresponds to the correct key.

On average, we recovered the entire AES-128 key of the victim in under 10 s using the cache-based trigger and the Domino attack. During this time, the key was loaded approximately 10 000 times by the victim.

6.2 SGX Sealing Key Extraction
In this section, we show that privileged SGX attackers can drastically improve ZombieLoad’s temporal resolution and bridge from incidental data sampling in the time domain to the targeted reconstruction of arbitrary enclave secrets (cf. Figure 1). We first explain how state-of-the-art enclave execution control and transient post-processing techniques can be leveraged to reliably leak register values at any point during an enclave invocation. Then we demonstrate the impact of this attack by recovering a full 128-bit SGX sealing key, as used by Intel’s trusted provision and quoting enclaves to decrypt the long-term EPID private attestation key.

Leaking Enclave Registers. We consider Intel SGX root attackers that co-locate with a victim enclave on the same physical CPU. As a system attacker, we can increase ZombieLoad’s temporal resolution by leveraging previous research results exploiting page faults [71, 78] or interrupts [54, 70] to regulate the victim enclave’s execution. We use the SGX-Step [69] framework to precisely single-step the victim enclave one instruction at a time, allowing the attacker to reach a code part where sensitive information is stored in CPU registers. At such a point, we switch to unlimited zero-stepping [68] by either setting the system timer interrupt to a very short interval or revoking code page execute permissions before resuming the victim enclave. This technique provides ZombieLoad attackers with a primitive to repeatedly force-reload CPU registers from the interrupted enclave’s SSA frame (cf. Section 2.3). Our experiments show that even though execution of the enclave instruction never completes, any direct operands plus SSA register file contents are loaded from memory each time. Importantly, since the enclave does not make progress, we can perform unlimited ZombieLoad attack attempts to reconstruct CPU register values from these implicit SSA memory accesses.

We further reduce noise from unrelated non-enclave loads on the victim CPU by opting for timer-based zero-stepping with a user space interrupt handler [70] to avoid repeatedly invoking the operating system. Furthermore, we found that executing the ZombieLoad attack code in a separate address space avoids unnecessarily slowing down the spy through implicit TLB invalidations on enclave entry/exit [30].

Note that the SSA frame spans multiple cache lines. With ZombieLoad, we do not have explicit address-based control over which cache line is being leaked. Hence, leaked data might come from different saved registers that are at the same offset within a cache line. To filter out such noisy observations, we use the Domino transient error detection technique introduced in Section 6.1. Specifically, we implemented a “sliding window” that transmits 7 different domino bytes for each candidate key byte, stuffed with increasing bits from the next adjacent key byte candidate. Any noisy observations that do not match the overlap can now efficiently be filtered out.

Attack on sgx_get_key. The Intel SGX design includes a secure key derivation facility through the egetkey instruction (cf. Section 2.3). Enclaves execute this instruction to query a 128-bit cryptographic key from the hardware, based on the calling enclave’s code layout or developer identity. This is the underlying primitive used by Intel’s trusted prebuilt quoting enclave to securely unseal a long-term private attestation key from persistent storage [13, 68].

The official Intel SGX SDK [30] offers a convenient sgx_get_key wrapper procedure that first executes egetkey with the necessary parameters, and eventually copies the retrieved key into a provided buffer. We reverse engineered the proprietary intel_fast_memcpy function and found that in this case, the key is copied using two 128-bit moves to/from the xmm0 SSE register. We revert to zero-stepping on the last instruction of the memcpy invocation. At this point, the attacker-induced zero-step enclave resumptions will repeatedly reload, among others, the xmm0 register containing the 128-bit key from the memory hierarchy.

Results. We evaluated the attack on a Kaby Lake i7-7700 CPU with an up-to-date Foreshadow-patched microcode revision 0x8e.

In the first experiment, we implemented a benchmark enclave that uses sgx_get_key to generate a new report key with different random key IDs. We performed 100 key-recovery experiments on sgx_get_key with different random keys. Our results show that 30 % of the time, the full 128-bit key is among the key candidates, with an average remaining key-space entropy of 8.8 bits. Among these cases, the exact full key was recovered 3 % of the time. In the other 70 % of the cases, where the full key is not among the key candidates, we have partial key bytes among the recovered key candidates 31 % of the time. On average, 10 out of 16 key bytes are correct, with a remaining global entropy of 13.59 bits. In the remaining 39 % of the cases, where the correct key is not among the key candidates, our attack which uses the Domino technique with a sliding window did not reveal any candidates, which means an attacker can simply repeat the attack in such cases. Also, in cases where some of the key bytes are part of the candidates, most of the failed key bytes reside in the first few bytes of the key. The reason
23 15 7 0
for this behavior is that the explained Domino attack will have a
stronger effect on key bytes in the middle that are surrounded by 0xFF SEQ DATA DATA
more key bytes. Figure 5: The packet format used in the covert channel. Every
In the second experiment, we perform an attack on Intel’s trusted 32-bit packet consists of 8 data bits, 8-bit checksum (two’s
quoting enclave. The quoting enclave performs a call to sgx_get_key complement), 8-bit sequence number, and a constant prefix.
to derive the sealing key which is used to decrypt the EPID provi-
sioning blob. We executed the attack on a quoting enclave that is
signed with debug keys, so we can use it as a ground truth to easily As a result, our proof-of-concept limits the transmission of actual
verify that we have recovered the correct sealing key. We executed data to a single byte per leaked load. However, we can use the
the attack multiple times on our setup, and we managed to recover remaining bits in the load to ensure that the channel is free of
the correct 128-bit sealing key after multiple executions of the at- errors.
tack and checking the candidates against each other. The recovered
sealing key matches the correct key, and can indeed successfully Transient Error Detection. The transmission of the data be-
decrypt the EPID blob for our debug signed quoting enclave. While tween sender and receiver is free of any noise. However, the re-
we did not yet reproduce this attack to recover the sealing key from ceiver does not only recover values from the sender, but also other
the official quoting enclave image signed by Intel, we believe that loads from the current and sibling logical core. Hence, to get rid of
this experimental evaluation showcased all the required primitives this noise, we encode the data as shown in Figure 5. This allows
to break Intel SGX’s remote attestation guarantees, as demonstrated the receiver to filter out data not originating from the sender.
before by Foreshadow [68]. Although we cannot transfer the entire packet into the archi-
tectural domain, we can compute on the packet in the transient
6.3 Cross-VM Covert Channel domain. Thus, we run the error detection in the transient domain,
and only transmit valid packets to the architectural domain.
To evaluate the performance of ZombieLoad, we implement a covert
The challenge to run the error detection in the transient domain
channel which can be used for all attack scenarios described in
is that the number of instructions is limited, and not all instructions
Section 4. However, in this section, we focus on the cross-VM covert
can be used. For reliable results, we cannot use instructions which
channel. While covert channels are possible for Intel SGX, the
speculate on either control or data flow. Hence, the error-detection
kernel, and the hypervisor, these are somewhat artificial scenarios.
code has to be as short as possible and branch free.
Moreover, there are various covert channels available to user-space
Our packet structure allows for extremely efficient error detec-
applications for stealthy inter-process communication [17, 51].
tion. We encode the data in the first byte and the two’s complement
For VMs, however, there are not many known covert chan-
of the data in the second byte as a checksum. To detect errors, we
nels which can be used between two VMs. So far, all cross-VM
XOR the value of the first byte (i.e., the data) onto the second byte
covert channels either relied on Prime+Probe [46, 50, 51, 59, 77],
(i.e., the two’s complement of the data). If both values are received
DRAMA [58], or bus locking [76]. We show that ZombieLoad can be
correctly, the XOR ensures that the bits 8 to 15 of the packet are
used as a fast and reliable covert channel between VMs scheduled
zero. Thus, for a correct packet, the least-significant 16 bits of the
on the same physical core.
packet represent a value between 0 and 255, and for a wrong packet,
Sender. For the fastest result, the sender repeatedly loads the these bits represent a value which is larger than 255. We use these
value to be transmitted from the L1 cache into a register. By not resulting 16-bit value as an index into our oracle array, i.e., an array
only loading the value from one memory address but instead from consisting of 256 pages. Therefore, any value which is not a correct
multiple memory addresses, the sender ensures that potentially byte is out of bounds and has thus no effect on the cache state of
multiple fill-buffer entries are used. In addition, this also thwarts an optimization of Intel CPUs which combines multiple loads from the same cache line to a single load [1].
On a CPU supporting AVX2, the sender can encode up to 256 bits per load (e.g., using the VMOVAPS load).

Receiver. The receiver mounts ZombieLoad to leak the values loaded by the sender. However, as the receiver leaks the loads only in the transient domain, the leaked values have to be transferred into the architectural domain. We encode the leaked values into the cache and recover them using Flush+Reload. When encoding values in the cache, we require at least 2 cache lines, i.e., 128 B, per bit to prevent the adjacent-cache-line prefetcher from interfering with the encoding. In practice, we require one physical page, i.e., 4 kB, per possible value to prevent interference of the prefetcher. To reduce the recovery bottleneck, we transfer single bytes from the transient to the architectural domain, which already requires 256 runs of Flush+Reload.

the array. A correct byte is also a valid index into the oracle array and ensures that the first cache line of the corresponding page is cached. Finally, by applying a cache-based side-channel attack, such as Flush+Reload, we can recover the byte from the cache state of the oracle array [42, 45].
The error detection in the transient domain has the advantage that we do not require computation time in the architectural domain. Instead of waiting for the exception to become architecturally visible by doing nothing, we already use this time to perform the required computation. An additional advantage is that while we are still in the transient domain, we can work on noise-free data. Thus, we do not require complex error correction after receiving the data [51].
In addition to the error detection, we also encode a sequence number into the packet. The sequence number allows ordering the received packets. It can be recovered using the same method as the data value, e.g., using an oracle array and a cache-based side-channel attack.
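
As a rough illustration of the cache encoding described for the receiver, the following C sketch shows only the index arithmetic and the decoding step: one 4 kB page of the oracle array per byte value, and a scan over all 256 pages for the single page that reloads quickly. The names oracle_offset and decode_byte, the latency array, and the hit threshold are assumptions made for this sketch. The timing measurement itself (the flush and the timed reload) is not shown; the latencies are supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE  4096u  /* one physical page per possible value */
#define NUM_VALUES 256u   /* one byte is transferred per run */

/* Offset into the oracle array of the page that encodes `value`. */
size_t oracle_offset(uint8_t value) {
    return (size_t)value * PAGE_SIZE;
}

/* Decode one byte from per-page reload latencies (in cycles).
 * Returns the value whose page reloaded faster than the (assumed)
 * hit threshold, or -1 if no page or more than one page did. */
int decode_byte(const uint32_t latency[NUM_VALUES], uint32_t hit_threshold) {
    assert(latency != NULL);
    int hit = -1;
    for (unsigned v = 0; v < NUM_VALUES; v++) {
        if (latency[v] < hit_threshold) {
            if (hit != -1)
                return -1;  /* ambiguous: two pages appear cached */
            hit = (int)v;
        }
    }
    return hit;
}
```

With 256 possible byte values and one 4 kB page per value, the oracle array in this sketch spans 256 × 4 kB = 1 MB, and every transferred byte costs one scan over all 256 pages.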

Results. We evaluate the covert channel both in a lab environment and in a public cloud. In the lab environment, we used 2 virtual machines running inside QEMU KVM on an i7-8650U. For the cloud scenario², we used 2 co-located virtual machines running CentOS 7.6.1810 with a Linux kernel version of 3.10.0-957 on a Xeon E5-2670 CPU.
Both in the cloud and on our lab machine, we achieved an error-free transmission. On our lab machine, we observed transmission rates of up to 26.8 kbit/s. As TSX was not available in the cloud scenario, we achieved a transmission rate of 1.99 kbit/s (σx̄ = 2.5 %, n = 1000) with Variant 1 and signal handling.

² The cloud provider asked us not to disclose its name at this point.

6.4 Browsing-Behavior Monitoring

ZombieLoad is also well suited for detecting specific byte sequences within loaded data. We demonstrate an attack for which we leverage ZombieLoad to fingerprint a web browser session. For this attack, we assume an unprivileged attacker running on one logical core and a web browser running on the sibling logical core. In this scenario, it is irrelevant whether the attacker and victim run on a native machine or whether they are in (different) virtual machines.
We present two different attacks, a keyword detection attack which can fingerprint website content, and a URL recovery attack to monitor a victim’s browsing behavior.

Keyword Detection. The keyword detection allows an attacker to gain information on the type of content the victim is consuming. For this attack, we constantly sample data using ZombieLoad and match leaked values against a list of pre-defined keywords.
We leverage the fact that we have access to a full cache line and can do arbitrary computations in the transient domain (cf. Section 6.3). As a result of the computation, we only have to externalize a small integer indicating which keyword has matched via a cache side channel.
One limitation is the length of the keyword list, as in the transient domain, only a limited number of memory accesses are possible before the transient execution aborts. The most reliable solution is to store the keyword list entirely in CPU registers. Hence, the length of the keyword list is limited by the available registers. Moreover, the length is also limited by the amount of code that is transiently executed to compare leaked values to the keyword list.

URL Recovery. In the second attack, we recover accessed websites from browser sessions without prior selection of interesting keywords. We take a more indirect approach that relies on modern websites performing many individual HTTP requests to the same domain, e.g., to load additional resources such as scripts and images.
In the transient domain, we again sample data using ZombieLoad. While still in the transient domain, we detect the substring “www.” inside the leaked data. When we discover a match, we leak the character following “www.” to the architectural domain using a cache side channel. This already results in a set of first characters of domain names which we refer to as the candidate set.
In the next iteration, for every domain in the candidate set, we take the last four leaked characters (e.g., “ww.X”). We use this string in the transient domain to filter leaked values, similar to the “www.” substring in the first iteration. If a match is found, we leak the next character. We can repeat these steps until we see a string ending with a top-level domain.
Note that this attack is not limited to URLs. Potentially all data which follows a predictable pattern, such as session cookies or credit-card numbers, can be leaked with this variant.

Results. We evaluated both attacks running an unmodified Firefox browser version 66.0.2 on the same physical core as the attacker. Our proof-of-concept implementation of the keyword-checking attack can check four keywords of up to 8 bytes each. Due to excessive precomputations of browsers when entering a URL, a keyword is sometimes already matched during the autocompletion of the URL. For highly dynamic websites, such as nytimes.com, keywords reliably match on the first access of the website. Accessing mostly static websites, such as gnupg.org, has a 60 % probability of matching a keyword in this setup. We observed false positives after the first website access when continuing to use the browser. We hypothesize that memory locations containing the keywords get re-used and may thus leak again at a later time.
For the URL recovery attack, we simulated user behavior by accessing popular websites and refreshing them in a defined time interval. We counted the number of refreshes necessary until we recovered the entire URL including the top-level domain. For each website, the experiment was repeated 100 times.
The actual number of refreshes needed depends on the nature of the website that is visited. If it is a highly dynamic page, such as facebook.com or nytimes.com, a small number of reloads is sufficient to recover the entire name. For static pages, such as gnupg.org or kernel.org, the number of reloads necessary increases by a factor of 10, approximately. See Table 3 for a detailed overview of required reloads.

Table 3: Number of accesses required to recover a website name. The experiment was repeated 100 times per website.

Website        Minimal   Average   Maximum
nytimes.com    1         1         3
facebook.com   1         2         4
kernel.org     2         6         13
gnupg.org      2         10        34

6.5 Targeted Data Leakage

Inherently, ZombieLoad is a 1-dimensional side channel, i.e., the leakage is only controlled by the time. Hence, leakage cannot be steered using specific addresses as is the case, e.g., for Meltdown [45]. While this data sampling is still sufficient for several real-world attacks, it is still a limiting factor for general attacks. In this section, we show how ZombieLoad can be combined with prefetch gadgets [9] for targeted data leakage.

    if (x < array_len) {
        y = array[x];
    }

Listing 1: A simple prefetch gadget relying on Spectre-PHT [42]. By mistraining the branch, this gadget loads an arbitrary out-of-bounds value for targeted leakage.
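
The branch mistraining that the caption of Listing 1 refers to can be illustrated with a toy model. The sketch below assumes a single 2-bit saturating counter as the branch predictor, which is a textbook simplification and not a description of any real Intel predictor; the type predictor_t and all function names are hypothetical. It counts how often an out-of-bounds index meets a predictor that still expects an in-bounds index, which is the situation in which the load of the gadget would be issued speculatively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy 2-bit saturating counter (an assumption of this sketch):
 * states 0 and 1 predict "out of bounds", states 2 and 3 predict
 * "in bounds", i.e. that the body of the bounds check is entered. */
typedef struct { unsigned state; } predictor_t;

bool predict_in_bounds(const predictor_t *p) {
    return p->state >= 2;
}

void update(predictor_t *p, bool in_bounds) {
    if (in_bounds) {
        if (p->state < 3) p->state++;
    } else {
        if (p->state > 0) p->state--;
    }
}

/* Run a sequence of indices through the modelled check
 * `if (x < array_len)` and count the out-of-bounds indices that
 * are predicted in bounds (mispredictions that issue the load). */
size_t count_speculative_oob(predictor_t *p, const size_t *indices,
                             size_t n, size_t array_len) {
    assert(p != NULL && indices != NULL);
    size_t oob_loads = 0;
    for (size_t i = 0; i < n; i++) {
        bool in_bounds = indices[i] < array_len;
        if (!in_bounds && predict_in_bounds(p))
            oob_loads++;
        update(p, in_bounds);
    }
    return oob_loads;
}
```

In this model, a few valid indices followed by one out-of-bounds index produce one speculative out-of-bounds load, and alternating valid and out-of-bounds indices keeps the counter in the "in bounds" half so that every out-of-bounds index is mispredicted.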

Speculative Data Leakage. Listing 1 illustrates such a gadget. It is a common pattern in software for accessing an element of an array [9]. First, the code checks whether the index lies within the bounds of the array. Only if this is the case, the element is accessed, i.e., loaded. While it is evident that for a user-controlled index the corresponding array element can be loaded, such a gadget is even more powerful.
On a CPU vulnerable to Spectre, an attacker can mistrain the branch predictor, e.g., by providing several valid values for the array index. Then, by providing an out-of-bounds index, the branch is misspeculated and speculatively accesses an out-of-bounds value. Alternatively, the attacker can alternate between valid and out-of-bounds indices randomly to achieve a high percentage of mispredictions without any prior branch predictor mistraining.
ZombieLoad can leak not only architecturally accessed data but also speculatively accessed data. Hence, ZombieLoad can even see the value of loads which are never architecturally visible. Such loads include, among others, speculative memory loads and prefetches. Thus, any Spectre gadget which is not hardened, e.g., using a memory fence [4, 5, 9, 33] or a mask [9, 10], can be used to specify data to leak.
Moreover, ZombieLoad does not require classic Spectre gadgets containing an indirect array access [42]. A simple out-of-bounds access (cf. Listing 1) is sufficient. While such gadgets have been demonstrated for breaking KASLR [62], they were considered relatively harmless as they do not leak data [9]. Hence, most approaches for finding gadgets do not consider such gadgets [26, 73]. In the Linux kernel, however, such gadgets are also patched if they are discovered, mainly as they can be used together with the Foreshadow vulnerability to leak arbitrary kernel memory [12, 66]. So far, 172 such gadgets have been fixed in kernel 5.0 [9]. With ZombieLoad, we show that such gadgets are indeed powerful and have to be patched as well.

Potential Incompleteness of Countermeasures. Mainly, there are 2 methods to prevent exploitation of Spectre-PHT: memory fences after branches [4, 5, 9, 33], or constraining the index to a valid range using a bitmask [9, 10]. The variant using fences is implemented in the Microsoft compiler [41, 42], whereas the variant using bitmasks is implemented in GCC [48] and LLVM [10], and also used in the Linux kernel [48].
Both methods prevent exploitation of Spectre-PHT [9], as the misspeculation cannot load any data. Hence, this is also effective against ZombieLoad, as fixed gadgets cannot be exploited to load arbitrary values.
However, even with these countermeasures in place, there is a remaining leakage which can be exploited using ZombieLoad. When architecturally loading an in-bounds value, ZombieLoad can leak up to 64 bytes of the load. Hence, with ZombieLoad, there is a potential leakage of up to 63 bytes which are out of bounds if the last in-bounds value is at the beginning of a cache line or the base of the array is at the end of a cache line.

Data Leakage. To demonstrate the feasibility of prefetch gadgets for targeted data leakage, we leverage an artificial prefetch gadget as given in Listing 1. For our evaluation, we used such a gadget in the system-call path of the Linux kernel 5.0.7. We execute ZombieLoad on one logical core and on the other we execute system calls that switch between out-of-bounds and in-bounds array indices to achieve a high frequency of mispredictions in the gadget.
This approach yields leaked values with a large noise component from unrelated loads. We repeat this setup without trying to generate mispredictions to generate a baseline of noise values. We generate frequency distributions for both runs and subtract the noise frequency from the misprediction run. We then choose the byte value that was seen most frequently.
With this crude statistical method, we can recover kernel memory at one byte per 10 s with 38 % accuracy. Probing bytes for 20 s improves the accuracy to 46 %.
As with Meltdown [45], common byte values such as 0x00 and 0xFF occur too often and have to be removed from the leaked data for the recovery to work. Our approach is thus blind to these values.
The speed and accuracy can be improved if there is a priori knowledge of the target data. For example, a 7-bit ASCII string can be leaked with a probing time of 10 s per byte with 72 % accuracy.

7 COUNTERMEASURES

As ZombieLoad leaks loaded values across logical cores, a straightforward mitigation is disabling the use of hyperthreading. Hyperthreading improves performance for certain workloads by 30 % to 40 % [8, 52], and as such disabling it may incur an unacceptable performance impact.

Co-Scheduling. Depending on the workload, a more efficient mitigation is the use of co-scheduling [55]. Co-scheduling can be configured to prevent the execution of code from different protection domains on a hyperthread pair. Current topology-aware co-scheduling algorithms [64] are not concerned with preventing kernel code from running concurrently with user-space code. With such a scheduling strategy, leaks between user processes can be prevented but leaks between kernel and user space cannot. To prevent leakage between kernel and user space, the kernel must additionally ensure that kernel entries on one logical core force the sibling logical core into the kernel as well. This discussion applies in an analogous way to hypervisors and virtual machines.

Flushing Buffers. We have demonstrated that ZombieLoad also works across protection boundaries on a single logical core. Hence, neither disabling hyperthreading nor co-scheduling is fully effective as a mitigation. We have not found an instruction sequence that reliably prevents leakage across protection boundaries. Even flushing the entire L1 data cache (using MSR_IA32_FLUSH_CMD) and issuing as many dummy loads as there are fill-buffer entries (“load stuffing”) is not sufficient. There is still remaining leakage, which we assume is caused by the replacement policy of the line-fill buffer. Hence, to fully mitigate the leakage, we require a microcode update which provides a method to flush the line-fill buffer.

Selective Feature Deactivation. Weaker countermeasures target individual building blocks (cf. Section 5). The operating system kernel can ensure that the accessed and dirty bits in page tables are always set to impair Variant 2. Unfortunately, Variant 1 is always possible if the attacker can identify an alias mapping of any accessible user page in the kernel. This is especially true if the attacker
is running in or can create a virtual machine. Hence, we also recommend disabling VT-x on systems that do not need to run virtual machines.

Removing Prefetch Gadgets. To prevent targeted data leakage, prefetch gadgets need to be neutralized, e.g., using array_index_nospec in the Linux kernel. This function clamps array indices into valid values and prevents arbitrary virtual memory from being prefetched. Placing these functions is currently a manual task and due to the incomplete documentation of how Intel CPUs prefetch data, these mitigations cannot be complete. Note that Spectre mitigations using lfence instructions might also be incomplete against ZombieLoad.
Another way to prevent prefetch gadgets from reaching sensitive data is to prevent this data from being mapped in the address space of the prefetch gadget. Exclusive Page-Frame Ownership [39] (XPFO) partially achieves this for the Linux kernel’s mapping of physical memory.
Prefetch gadgets can also be neutralized using Speculative Load Hardening [10] (SLH). SLH prevents speculative execution by introducing artificial data dependencies via a compiler pass. SLH incurs a performance overhead of 10 % to 50 % for typical applications. To the best of our knowledge, its overhead for kernel or hypervisor code has not been studied yet.

Instruction Filtering. The above discussion mostly focusses on attacks across process or virtual-machine boundaries. For attacks inside of a single process (e.g., JavaScript sandbox), the sandbox implementation must make sure that the requirements for mounting ZombieLoad are not met. One example is to prevent the generation and execution of the clflush instruction, which so far is a crucial part of the attack.

Secret Sharing. On the software side, we can also rely on secret sharing techniques used to protect against physical side-channel attacks [65]. We can ensure that a secret is never directly loaded from memory but instead only combined in registers before being used. As a consequence, observing the data of a load does not reveal the secret. For a successful attack, an attacker has to leak all shares of the secret. This mitigation is, of course, incomplete if register values are written to and subsequently loaded from memory as part of context switching.

8 CONCLUSION

With ZombieLoad, we showed a novel Meltdown-type attack targeting the processor’s fill-buffer logic. ZombieLoad enables an attacker to leak recently loaded values used by the current or sibling logical CPU. We show that ZombieLoad allows leaking across user-space processes, CPU protection rings, virtual machines, and SGX enclaves. We demonstrated the immense attack potential by monitoring browser behaviour, extracting AES keys, establishing cross-VM covert channels or recovering SGX sealing keys. Finally, we conclude that disabling hyperthreading is the only possible workaround to mitigate ZombieLoad on current processors.

ACKNOWLEDGMENTS

We thank Werner Haas (Cyberus Technology), Claudio Canella (Graz University of Technology), Jon Masters (Red Hat), Alex Ionescu (CrowdStrike), and Martin Schwarzl (Graz University of Technology). The research presented in this paper was partially supported by the Research Fund KU Leuven. Jo Van Bulck is supported by a grant of the Research Foundation – Flanders (FWO). The project was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 681402). It was also supported by the Austrian Research Promotion Agency (FFG) via the K-project DeSSnet, which is funded in the context of COMET – Competence Centers for Excellent Technologies by BMVIT, BMWFW, Styria and Carinthia. Additional funding was provided by a generous gift from Intel. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding parties.

REFERENCES

[1] Abramson, J. M., Akkary, H., Glew, A. F., Hinton, G. J., Konigsfeld, K. G., Madland, P. D., Papworth, D. B., and Fetterman, M. A. Method and apparatus for dispatching and executing a load operation to memory, Feb. 1998. US Patent 5,717,882.
[2] Abramson, J. M., Akkary, H., Glew, A. F., Hinton, G. J., Konigsfeld, K. G., Madland, P. D., Papworth, D. B., and Fetterman, M. A. Method and apparatus for dispatching and executing a load operation to memory, 1998. US Patent 5,717,882.
[3] Allan, T., Brumley, B. B., Falkner, K., Van de Pol, J., and Yarom, Y. Amplifying side channels through performance degradation. In ACSAC (2016).
[4] AMD. Software Techniques for Managing Speculation on AMD Processors, 2018. Revision 7.10.18.
[5] ARM Limited. Vulnerability of Speculative Processors to Cache Timing Side-Channel Mechanism, 2018.
[6] Bhattacharyya, A., Sandulescu, A., Neugschwandtner, M., Sorniotti, A., Falsafi, B., Payer, M., and Kurmus, A. SMoTherSpectre: exploiting speculative execution through port contention. arXiv:1903.01843 (2019).
[7] Boggs, D. D., and Rodgers, S. D. Microprocessor with novel instruction for signaling event occurrence and for providing event handling information in response thereto, Apr. 1997. US Patent 5,625,788.
[8] Bulpin, J. R., and Pratt, I. A. Multiprogramming performance of the Pentium 4 with Hyper-Threading. In Second Annual Workshop on Duplicating, Deconstruction and Debunking (WDDD) (2004).
[9] Canella, C., Van Bulck, J., Schwarz, M., Lipp, M., von Berg, B., Ortner, P., Piessens, F., Evtyushkin, D., and Gruss, D. A Systematic Evaluation of Transient Execution Attacks and Defenses. In USENIX Security Symposium (to appear) (2019).
[10] Carruth, C. RFC: Speculative Load Hardening (a Spectre variant #1 mitigation), Mar. 2018.
[11] Chen, G., Chen, S., Xiao, Y., Zhang, Y., Lin, Z., and Lai, T. H. SGXPECTRE Attacks: Leaking Enclave Secrets via Speculative Execution. arXiv:1802.09085 (2018).
[12] Corbet, J. Finding Spectre vulnerabilities with smatch, https://lwn.net/Articles/752408/, Apr. 2018.
[13] Costan, V., and Devadas, S. Intel SGX explained.
[14] Evtyushkin, D., Riley, R., Abu-Ghazaleh, N. C., and Ponomarev, D. Branchscope: A new side-channel attack on directional branch predictor. In ASPLOS’18 (2018).
[15] Fog, A. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers, 2016.
[16] García, C. P., and Brumley, B. B. Constant-time callees with variable-time callers. In USENIX Security Symposium (2017).
[17] Ge, Q., Yarom, Y., Cock, D., and Heiser, G. A Survey of Microarchitectural Timing Attacks and Countermeasures on Contemporary Hardware. Journal of Cryptographic Engineering (2016).
[18] Glew, A. F., Akkary, H., Colwell, R. P., Hinton, G. J., Papworth, D. B., and Fetterman, M. A. Method and apparatus for implementing a non-blocking translation lookaside buffer, Oct. 1996. US Patent 5,564,111.
[19] Glew, A. F., Akkary, H., and Hinton, G. J. Translation lookaside buffer that is non-blocking in response to a miss for use within a microprocessor capable of processing speculative instructions, 1997. US Patent 5,613,083.
[20] Gras, B., Razavi, K., Bos, H., and Giuffrida, C. Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. In USENIX Security Symposium (2018).
[21] Gruss, D., Lipp, M., Schwarz, M., Fellner, R., Maurice, C., and Mangard, S. KASLR is Dead: Long Live KASLR. In International Symposium on Engineering
Secure Software and Systems (2017).
[22] Gruss, D., Maurice, C., Fogh, A., Lipp, M., and Mangard, S. Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR. In CCS (2016).
[23] Gruss, D., Maurice, C., Wagner, K., and Mangard, S. Flush+Flush: A Fast and Stealthy Cache Attack. In DIMVA (2016).
[24] Gruss, D., Schwarz, M., Wübbeling, M., Guggi, S., Malderle, T., More, S., and Lipp, M. Use-after-freemail: Generalizing the use-after-free problem and applying it to email services. In AsiaCCS (2018).
[25] Gruss, D., Spreitzer, R., and Mangard, S. Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches. In USENIX Security Symposium (2015).
[26] Guarnieri, M., Köpf, B., Morales, J. F., Reineke, J., and Sánchez, A. SPECTECTOR: Principled Detection of Speculative Information Flows. arXiv:1812.08639 (2018).
[27] Gueron, S. Intel advanced encryption standard (intel aes) instructions set – rev 3.01, 2012.
[28] Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, 6 ed. Morgan Kaufmann, 2017.
[29] Horn, J. speculative execution, variant 4: speculative store bypass, 2018.
[30] Intel. Intel Software Guard Extensions SDK for Linux OS Developer Reference, May 2016. Rev 1.5.
[31] Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A, 3B & 3C): System Programming Guide.
[32] Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual, 2017.
[33] Intel. Intel Analysis of Speculative Execution Side Channels, https://software.intel.com/security-software-guidance/api-app/sites/default/files/336983-Intel-Analysis-of-Speculative-Execution-Side-Channels-White-Paper.pdf, July 2018.
[34] Intel. L1 Terminal Fault SA-00161, https://software.intel.com/security-software-guidance/software-guidance/l1-terminal-fault, Aug. 2018.
[35] Intel. Intel® C++ Compiler 19.0 Developer Guide and Reference, Apr. 2019.
[36] Islam, S., Moghimi, A., Bruhns, I., Krebbel, M., Gulmezoglu, B., Eisenbarth, T., and Sunar, B. SPOILER: Speculative Load Hazards Boost Rowhammer and Cache Attacks. arXiv:1903.00446 (2019).
[37] Jang, Y., Lee, S., and Kim, T. Breaking Kernel Address Space Layout Randomization with Intel TSX. In CCS (2016).
[38] Johnson, S. P., Savagaonkar, U. R., Scarlata, V. R., McKeen, F. X., and Rozas, C. V. Technique for supporting multiple secure enclaves, June 2012. US Patent 2012/0159184 A1.
[39] Kemerlis, V. P., Polychronakis, M., and Keromytis, A. D. ret2dir: Rethinking kernel isolation. In USENIX Security (2014).
[40] Kiriansky, V., and Waldspurger, C. Speculative Buffer Overflows: Attacks and Defenses. arXiv:1807.03757 (2018).
[41] Kocher, P. Spectre mitigations in Microsoft’s C/C++ compiler, 2018.
[42] Kocher, P., Horn, J., Fogh, A., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M., and Yarom, Y. Spectre attacks: Exploiting speculative execution. In S&P (2019).
[43] Koruyeh, E. M., Khasawneh, K., Song, C., and Abu-Ghazaleh, N. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In WOOT (2018).
[44] Lee, J., Jang, J., Jang, Y., Kwak, N., Choi, Y., Choi, C., Kim, T., Peinado, M., and Kang, B. B. Hacking in darkness: Return-oriented programming against secure enclaves. In USENIX Security (2017).
[45] Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Fogh, A., Horn, J., Mangard, S., Kocher, P., Genkin, D., Yarom, Y., and Hamburg, M. Meltdown: Reading Kernel Memory from User Space. In USENIX Security Symposium (2018).
[46] Liu, F., Yarom, Y., Ge, Q., Heiser, G., and Lee, R. B. Last-Level Cache Side-Channel Attacks are Practical. In S&P (2015).
[47] LWN. The current state of kernel page-table isolation, https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/, Dec. 2017.
[48] LWN. Spectre v1 defense in gcc, https://lwn.net/Articles/759423/, July 2018.
[49] Maisuradze, G., and Rossow, C. ret2spec: Speculative execution using return stack buffers. In CCS (2018).
[50] Maurice, C., Neumann, C., Heen, O., and Francillon, A. C5: Cross-Cores Cache Covert Channel. In DIMVA (2015).
[51] Maurice, C., Weber, M., Schwarz, M., Giner, L., Gruss, D., Alberto Boano, C., Mangard, S., and Römer, K. Hello from the Other Side: SSH over Robust Cache Covert Channels in the Cloud. In NDSS (2017).
[52] Michael Larabel. Intel Hyper Threading Performance With A Core i7 On Ubuntu 18.04 LTS, https://www.phoronix.com/scan.php?page=article&item=intel-ht-2018&num=4, June 2018.
[53] Minkin, M., Moghimi, D., Lipp, M., Schwarz, M., Van Bulck, J., Genkin, D., Gruss, D., Piessens, F., Sunar, B., and Yarom, Y. Fallout: Reading Kernel Writes From User Space, 2019.
[54] Moghimi, A., Irazoqi, G., and Eisenbarth, T. Cachezoom: How SGX amplifies the power of cache attacks. In CHES (2017).
[55] Ousterhout, J. K., et al. Scheduling techniques for concurrent systems. In ICDCS (1982).
[56] Percival, C. Cache missing for fun and profit. In BSDCan (2005).
[57] Peri, R., Fernando, J., and Kolagotla, R. Virtualized load buffers, 2008. US Patent 7,346,735.
[58] Pessl, P., Gruss, D., Maurice, C., Schwarz, M., and Mangard, S. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks. In USENIX Security Symposium (2016).
[59] Ristenpart, T., Tromer, E., Shacham, H., and Savage, S. Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds. In CCS (2009).
[60] Schwarz, M., Gruss, D., Lipp, M., Maurice, C., Schuster, T., Fogh, A., and Mangard, S. Automated Detection, Exploitation, and Elimination of Double-Fetch Bugs using Modern CPU Features. In AsiaCCS (2018).
[61] Schwarz, M., Lipp, M., Gruss, D., Weiser, S., Maurice, C., Spreitzer, R., and Mangard, S. KeyDrown: Eliminating Software-Based Keystroke Timing Side-Channel Attacks. In NDSS (2018).
[62] Schwarz, M., Schwarzl, M., Lipp, M., and Gruss, D. NetSpectre: Read Arbitrary Memory over Network. arXiv:1807.10535 (2018).
[63] Schwarz, M., Weiser, S., Gruss, D., Maurice, C., and Mangard, S. Malware Guard Extension: Using SGX to Conceal Cache Attacks. In DIMVA (2017).
[64] Schönherr, J. H., Juurlink, B., and Richling, J. Topology-aware equipartitioning with coscheduling on multicore systems. In 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS) (2013).
[65] Shamir, A. How to share a secret. Communications of the ACM (1979).
[66] Stecklina, J. [RFC] x86/speculation: add L1 Terminal Fault / Foreshadow demo, https://lkml.org/lkml/2019/1/21/606, Jan. 2019.
[67] Stecklina, J., and Prescher, T. LazyFP: Leaking FPU Register State using Microarchitectural Side-Channels. arXiv:1806.07480 (2018).
[68] Van Bulck, J., Minkin, M., Weisse, O., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Wenisch, T. F., Yarom, Y., and Strackx, R. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security Symposium (2018).
[69] Van Bulck, J., Piessens, F., and Strackx, R. Sgx-step: A practical attack framework for precise enclave execution control. In Workshop on System Software for Trusted Execution (2017).
[70] Van Bulck, J., Piessens, F., and Strackx, R. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In CCS (2018).
[71] Van Bulck, J., Weichbrodt, N., Kapitza, R., Piessens, F., and Strackx, R. Telling your secrets without page faults: Stealthy page table-based attacks on enclaved execution. In USENIX Security Symposium (2017).
[72] van Schaik, S., Milburn, A., Österlund, S., Frigo, P., Maisuradze, G., Razavi, K., Bos, H., and Giuffrida, C. RIDL. In S&P (2019).
[73] Wang, G., Chattopadhyay, S., Gotovchits, I., Mitra, T., and Roychoudhury, A. oo7: Low-overhead Defense against Spectre Attacks via Binary Analysis. arXiv:1807.05843 (2018).
[74] Weichbrodt, N., Kurmus, A., Pietzuch, P., and Kapitza, R. Asyncshock: Exploiting synchronisation bugs in Intel SGX enclaves. In ESORICS (2016).
[75] Weisse, O., Van Bulck, J., Minkin, M., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Strackx, R., Wenisch, T. F., and Yarom, Y. Foreshadow-NG: Breaking the Virtual Memory Abstraction with Transient Out-of-Order Execution, 2018.
[76] Wu, Z., Xu, Z., and Wang, H. Whispers in the Hyper-space: High-speed Covert Channel Attacks in the Cloud. In USENIX Security Symposium (2012).
[77] Xu, Y., Bailey, M., Jahanian, F., Joshi, K., Hiltunen, M., and Schlichting, R. An exploration of L2 cache covert channels in virtualized environments. In CCSW’11 (2011).
[78] Xu, Y., Cui, W., and Peinado, M. Controlled-Channel Attacks: Deterministic Side Channels for Untrusted Operating Systems. In S&P (May 2015).
[79] Yarom, Y., and Falkner, K. Flush+Reload: a High Resolution, Low Noise, L3 Cache Side-Channel Attack. In USENIX Security Symposium (2014).

A FILL-BUFFER SIZE

In this section, we analyze the size of the fill buffer in terms of fill-buffer entries usable per logical core. Intel describes the fill buffer as a “competitively-shared resource during HT operation” [31]. Hence, with 10 fill-buffer entries (Sandy Bridge and newer microarchitectures) [31], we expect that when hyperthreading is enabled, every logical core can use up to 10 entries.
Our experimental setup measures the time it takes to execute n stores to DRAM, for n = 1, . . . , 20. We expect that the time increases linearly with the number of stores n as long as there are unused fill-buffer entries. To ensure that the stores occupy the fill buffer, we leverage non-temporal stores which bypass the cache and directly go to DRAM. We repeated our experiments 1 000 000 times, and we
always measured the best case, i.e., the minimum latency, to get rid of any noise.
Figure 6 shows that both logical cores can indeed leverage the entire fill buffer. When running the experiment on one (isolated) logical core, while the other (isolated) logical core does nothing, we get a latency increase when executing more than 12 stores. When we run the experiment on both logical cores in parallel, the latency increase is still after 12 stores.

Figure 6: One logical core can leverage the entire fill buffer (12 entries). If both logical cores execute stores, the fill buffer is competitively shared, leading to an increased latency for both logical cores. [Plot: latency in cycles over the number of non-temporal stores, for one thread and for two threads.]

Interestingly, the documented number of fill buffers does not match our experiments for Skylake and newer microarchitectures. While we measure 10 entries on pre-Skylake CPUs as documented, we measure 12 entries on Skylake and newer (cf. Figure 7).

Figure 7: On pre-Skylake, we measure 10 fill-buffer entries, matching Intel’s documentation. On Skylake and newer, we measure 12 fill-buffer entries. [Plot: latency in cycles over the number of non-temporal stores, for Haswell and Skylake.]

From our experiments we conclude that both logical cores can leverage the entire fill buffer. Therefore, every logical core can potentially use any entry in the fill buffer.
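
The evaluation step of this experiment can be sketched as follows, under stated assumptions: latencies are reduced to their minimum over all repetitions, and the number of usable fill-buffer entries is read off as the last store count before the first large latency jump. The function names and the jump threshold are hypothetical choices for this sketch, and the measurement itself (non-temporal stores and cycle counting) is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimum over repeated latency samples, used to suppress noise. */
uint64_t min_latency(const uint64_t *samples, size_t count) {
    assert(samples != NULL && count > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < count; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* latency[i] is the minimum latency measured for i + 1 stores.
 * Returns the largest store count before the first step whose
 * latency increase exceeds `jump_threshold` cycles, or n if no
 * such jump is found within the measured range. */
size_t estimate_entries(const uint64_t *latency, size_t n,
                        uint64_t jump_threshold) {
    assert(latency != NULL && n > 0);
    for (size_t i = 1; i < n; i++) {
        if (latency[i] > latency[i - 1] &&
            latency[i] - latency[i - 1] > jump_threshold)
            return i;  /* i stores still fit, i + 1 stores do not */
    }
    return n;
}
```

The threshold has to be chosen between the small per-store increase seen while entries are free and the larger increase once the fill buffer is exhausted; a value that suits one microarchitecture is not guaranteed to suit another.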
