Zombie Load
Daniel Gruss
Graz University of Technology
daniel.gruss@iaik.tugraz.at
ABSTRACT
In early 2018, Meltdown first showed how to read arbitrary kernel memory from user space by exploiting side-effects from transient instructions. While this attack has been mitigated through stronger isolation boundaries between user and kernel space, Meltdown inspired an entirely new class of fault-driven transient execution attacks. Particularly, over the past year, Meltdown-type attacks have been extended to not only leak data from the L1 cache but also from various other microarchitectural structures, including the FPU register file and store buffer.
In this paper, we present the ZombieLoad attack which uncovers a novel Meltdown-type effect in the processor’s previously unexplored fill-buffer logic. Our analysis shows that faulting load instructions (i.e., loads that have to be re-issued for either architectural or microarchitectural reasons) may transiently dereference unauthorized destinations previously brought into the fill buffer by the current or a sibling logical CPU. Hence, we report data leakage of recently loaded stale values across logical cores. We demonstrate ZombieLoad’s effectiveness in a multitude of practical attack scenarios across CPU privilege rings, OS processes, virtual machines, and SGX enclaves. We discuss both short and long-term mitigation approaches and arrive at the conclusion that disabling hyperthreading is the only possible workaround to prevent this extremely powerful attack on current processors.
CCS CONCEPTS
• Security and privacy → Side-channel analysis and countermeasures; Systems security; Operating systems security.
KEYWORDS
side-channel attack, transient execution, fill buffer, Meltdown
1 INTRODUCTION
In 2018, Meltdown [45] was the first microarchitectural attack completely breaching the security boundary between the user and kernel space and, thus, allowed leaking arbitrary data. While Meltdown was fixed using a stronger isolation between user and kernel space, the underlying principle turned out to be an entire class of transient-execution attacks [9]. Over the past year, researchers have demonstrated that Meltdown-type attacks can not only leak kernel data to user space, but also leak data across user processes, virtual machines, and SGX enclaves [68, 75]. Furthermore, data can not only be leaked from the L1 cache but also from other microarchitectural structures, including the register file [67], the line-fill buffer [45, 72], and, as shown in concurrent work, the store buffer [53].
Instead of executing the instruction stream in order, most modern processors can re-order instructions while maintaining architectural equivalence, creating the illusion of an in-order machine. Instructions then may already have been executed when the CPU detects that a previous instruction raises an exception. Hence, such instructions following the faulting instruction (i.e., transient instructions) are rolled back. While the rollback ensures that there are no architectural effects, side effects might remain in the microarchitectural state. Most Meltdown-type data leaks exploit overly aggressive performance optimizations around out-of-order execution.
For many years, the microarchitectural state was considered invisible to applications, and hence security considerations were often limited to the architectural state. Specifically, microarchitectural elements often do not distinguish between different applications or privilege levels [9, 14, 37, 45, 58, 61, 63].
In this paper, we show that, first, there still are unexplored microarchitectural buffers, and second, both architectural and microarchitectural faults can be exploited. With our notion of “microarchitectural faults”, i.e., faults that cause a memory request to be re-issued internally without ever becoming architecturally visible, we demonstrate that Meltdown-type attacks can also be triggered without raising an architectural exception such as a page fault. Based on this, we demonstrate ZombieLoad, a novel, extremely powerful Meltdown-type attack targeting the fill buffer logic.
ZombieLoad exploits that load instructions which have to be re-issued internally may first transiently compute on stale values belonging to previous memory operations from either the current or a sibling hyperthread. Using established transient execution attack techniques, adversaries can recover the values of such “zombie load” operations. Importantly, in contrast to all previously known transient execution attacks [9], ZombieLoad reveals recent data values without adhering to any explicit address-based selectors. Hence, we consider ZombieLoad an instance of a novel type of microarchitectural data sampling attacks. We present microarchitectural data sampling as the missing link between traditional memory-based
Schwarz et al.
Virtual Memory. CPUs use virtual memory to provide memory isolation between processes. Virtual addresses are translated to physical memory locations using multi-level translation tables. The translation table entries define the properties, e.g., access control or memory type, of the referenced memory region. The CPU contains the translation lookaside buffer (TLB) consisting of additional caches to store address-translation information.
Memory Order Buffer. µOPs that deal with memory operations are handled by dedicated execution units. Typically, Intel CPUs contain 2 units responsible for loading data and one for storing data. While the reorder buffer resolves register dependencies, out-of-order executed µOPs can still have memory dependencies. In an out-of-order CPU, the memory order buffer (MOB), incorporating a load buffer and a store buffer, controls the dispatch of memory operations and tracks their progress to resolve memory dependencies.
Data Loads. For every dispatched load operation an entry is allocated in the load buffer and in the reorder buffer. The allocated load-buffer entry holds information about the operation, e.g., ordering constraints, the reorder buffer ID or the age of the most recent store. To determine the physical address, the upper 36 bits of the linear address are translated by the memory management unit. Concurrently, the untranslated lower 12 bits are already used to index the cache set in the L1D [19]. If the address translation is cached in the TLB, the physical address is available immediately. Otherwise, the page miss handler (PMH) is activated to perform a page-table walk to retrieve the address translation as well as the corresponding permission bits. With the physical address, the tag and, thus, the way of the cache is determined. If the requested data is in the L1D (cache hit), the load operation can be completed.
If data is not in the L1D, it needs to be served from higher levels of the cache or the main memory via the line-fill buffer (LFB). The LFB serves as an interface to other caches and the main memory and keeps track of outstanding loads. Memory accesses to uncacheable memory regions and non-temporal moves all go through the LFB. If a load corresponds to an entry of a previous load operation in the load buffer, the loads can be merged [1, 57].
On a fault, e.g., a physical address is not available, the page-table walk will not immediately abort [19]. Still, an instruction in a pipelined implementation must undergo each stage regardless of whether a fault occurred or not [2], and is reissued in case of a fault. Only at the retirement of the faulting µOP, the fault is handled, and the pipeline is flushed [18, 19]. If a fault occurs within a load operation, it is still marked as “valid and completed” in the MOB [2].
2.3 Processor Extensions
Microcode. Initially, all instructions were hardwired in the CPU core. However, to support more complex instructions, microcode allows implementing higher-level instructions using multiple hardware-level instructions. Importantly, this allows processor vendors to support complex behavior and even extend or modify CPU behavior through microcode updates [31]. Preferably, new architectural features are implemented as microcode extensions, e.g., Intel SGX [38].
While the execution units perform the fast-paths directly in hardware, more complex slow-path operations are typically performed by issuing a microcode assist which points the sequencer to a predefined microcode routine [13]. To do so, the execution unit associates an event code with the result of the faulting micro-op. When the micro-op of the execution unit is committed, the event code causes the out-of-order scheduler to squash all in-flight micro-ops in the reorder buffer [13]. The microcode sequencer uses the event code to read the micro-ops associated with the event in the microcode [7].
Intel TSX. Intel TSX is an x86 instruction set extension to support hardware transactional memory [35] which has been introduced with Intel Haswell CPUs. With TSX, particular code regions are executed transactionally. If the entire code region completes successfully, memory operations within the transaction appear as an atomic commit to other logical processors. If an issue occurs during the transaction, a transactional abort rolls back the execution to an architectural state before the transaction, thereby discarding all performed operations. Transactional aborts can be caused by different issues: Typically, a conflicting memory operation occurs where another logical processor either reads from an address which has been modified within the transaction or writes to an address which is used within the transaction. Further, the amount of read and written data within the transaction may not exceed the size of the LLC and L1 cache respectively for the transaction to succeed [31]. In addition, some instructions or system events might cause the transaction to abort as well [35].
Intel SGX. With the Skylake microarchitecture, Intel introduced Software Guard Extensions (SGX), an instruction-set extension for isolating trusted code [31]. SGX executes trusted code inside so-called enclaves, which are mapped in the virtual address space of a conventional host application process but are isolated from the rest of the system by the hardware itself. The threat model of SGX assumes that the operating system and all other running applications could be compromised and, therefore, cannot be trusted. Any attempt to access SGX enclave memory in non-enclave mode results in abort page semantics, i.e., regardless of the current privilege level, reads return the dummy value 0xff and writes are ignored [30]. Furthermore, to protect against powerful physical attackers probing the memory bus, the SGX hardware transparently encrypts the memory region used by enclaves [13].
A dedicated eenter instruction redirects control flow to an enclave entry point, whereas eexit transfers back to the untrusted host application. Furthermore, in case of an interrupt or fault, SGX securely saves CPU registers inside the enclave’s save state area (SSA) before vectoring to the untrusted operating system. Next, the eresume instruction can be used to restore processor state from the SSA frame and continue a previously interrupted enclave.
SGX-capable processors feature cryptographic key derivation facilities through the egetkey instruction, based on a CPU-level master secret and a secure measurement of the calling enclave’s initial code and data. Using this key, enclaves can securely seal secrets for untrusted persistent storage, and establish secure communication channels with other enclaves residing on the same processor. Furthermore, to enable remote attestation, Intel provides a trusted quoting enclave which unseals an Intel-private key and generates an asymmetric signature over the local enclave identity report.
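The address split described under Data Loads (upper 36 bits translated, lower 12 bits used untranslated to index the L1D) can be illustrated with a small sketch. The 36-bit and 12-bit widths come from the text; the 64 B line size and the 64-set L1D geometry are assumptions made here for illustration, and the function name is hypothetical.

```python
# Sketch of the linear-address split described under "Data Loads".
# From the text: upper 36 bits are translated, lower 12 bits are not.
# Assumptions (not from the text): 64 B cache lines and 64 L1D sets.
PAGE_OFFSET_BITS = 12
TRANSLATED_BITS = 36
LINE_OFFSET_BITS = 6   # assumed: 64 B cache line
L1D_SETS = 64          # assumed: e.g., a 32 KiB, 8-way L1D


def split_linear_address(addr):
    """Split a 48-bit linear address into the fields used during a load."""
    if not 0 <= addr < (1 << (PAGE_OFFSET_BITS + TRANSLATED_BITS)):
        raise ValueError("expected a 48-bit linear address")
    return {
        # Bits 47..12: must be translated (TLB hit or page-table walk).
        "virtual_page": addr >> PAGE_OFFSET_BITS,
        # Bits 11..0: identical in the virtual and the physical address.
        "page_offset": addr & ((1 << PAGE_OFFSET_BITS) - 1),
        # Bits 11..6: select the L1D set before translation finishes.
        "l1d_set": (addr >> LINE_OFFSET_BITS) % L1D_SETS,
        # Bits 5..0: byte position inside the cache line.
        "line_offset": addr & ((1 << LINE_OFFSET_BITS) - 1),
    }


fields = split_linear_address(0x7F1234567ABC)
```

Under these assumed parameters, the set index and the line offset together occupy exactly the 12 untranslated bits, which is why the cache set can be selected while the translation is still in progress.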
Over the past years, researchers have demonstrated various attacks to leak sensitive data from SGX enclaves, e.g., through memory safety violations [44], race conditions [74], or side channels [54, 63, 70, 71]. More recently, SGX was also compromised by transient execution attacks [11, 68] which necessitated microcode updates and increased the processor’s security version number (SVN). All SGX key derivations and attestations include SVN to reflect the current microcode version, and hence security level.
3 ATTACK OVERVIEW
In this section, we provide an overview of ZombieLoad. We describe what can be observed using ZombieLoad and how that fits into the landscape of existing side-channel attacks. By that, we show that ZombieLoad is a novel category of side-channel attacks, which we refer to as data-sampling attacks, opening a new research field.
3.1 Overview
ZombieLoad is a transient-execution attack [9] which observes the values of memory loads on the current physical CPU. ZombieLoad exploits that the fill buffer is accessible by all logical CPUs of a physical CPU core and that it does not distinguish between processes or privilege levels.
The load buffer acts as a queue for all memory loads from the memory subsystem. Whenever the CPU encounters a memory load during execution, it reserves an entry in the load buffer. If the load was not an L1 hit, it additionally requires a fill-buffer entry. When the requested data has been loaded, the memory subsystem frees the corresponding load- and fill-buffer entries, at which point the corresponding load instruction may retire.
However, we observed that under certain complex microarchitectural conditions (e.g., a fault), where the load requires a microcode assist, it may first read stale values before being re-issued eventually. As with any Meltdown-type attack, this opens up a transient execution window in which this value can be used for subsequent calculations before the execution is aborted and rolled back. Thus, an attacker can encode the leaked value into a microarchitectural element, such as the cache.
In contrast to previous Meltdown-type attacks, however, it is not possible to select the value to leak based on an attacker-specified address. ZombieLoad simply leaks any value which is currently loaded by the physical CPU core. While this at first sounds like a massive limitation, we show that this opens a new field of side-channel attacks. We show that ZombieLoad is an even more powerful attack when combined with existing techniques known from traditional side-channel attacks.
3.2 Microarchitectural Root Cause
For Meltdown, Foreshadow, and Fallout, the source of the leakage is apparent. Moreover, for these attacks, there are plausible explanations on what is going wrong in the microarchitecture, i.e., what the root cause of the leakage is [45, 53, 68, 75]. For ZombieLoad, however, this is not entirely clear.
While we identified some necessary building blocks to observe the leakage (cf. Section 5), we can only provide a hypothesis on why the interaction of the building blocks leads to the observed leakage. As we could only observe data leakage on Intel CPUs, we assume that this is indeed an implementation issue (such as Meltdown) and not an issue with the underlying design (as with Spectre). For our hypothesis, we combined our observations with the nearly non-existent official documentation of the fill buffer [31, 32]. Ultimately, we could neither prove nor disprove our hypothesis, leaving the verification or falsification of our hypothesis to future work.
Stale-Entry Hypothesis. Every load is associated with an entry in the load buffer and potentially an entry in the fill buffer [32]. When a load encounters a complex situation, such as a fault, it requires a microcode assist [31]. This microcode assist triggers a machine clear, which flushes the pipeline. On a pipeline flush, instructions which are already in flight still finish execution [28]. As this has to be as fast as possible to not incur additional delays, we expect that fill-buffer entries are optimistically matched as long as parts of the physical address match. Thus, the load continues with a wrong fill-buffer entry, which was valid for a previous load. This leads to a use-after-free vulnerability [24] in the hardware. Intel documents the fill buffer as being competitively shared among hyperthreads [31], giving both logical cores access to the entire fill buffer (cf. Appendix A). Consequently, the stale fill-buffer entry can also be from a previous load of the sibling logical core. As a result, the load instruction loads valid data from a previous load.
Leakage Source. We devised 2 experiments to reduce the number of possible sources of the leaked data.
In our first experiment, we marked a page as “uncacheable” via the page-table entry and flushed the page from the cache. As a result, every memory load from the page circumvents all cache levels and directly travels from the main memory to the fill buffer [31]. We then write the secret onto the uncacheable memory page to ensure that there is no copy of the data in the cache. When loading data from the uncacheable memory page, we can see leakage, but the leakage rate is only in the order of bytes per second, e.g., 5.91 B/s (σx̄ = 0.18, n = 100) on an i7-8650U. We can attribute this leakage to the fill buffer. This was also exploited in concurrent work [72]. Our hypothesis is further backed by the MEM_LOAD_RETIRED.FB_HIT performance counter, which shows multiple thousand line-fill-buffer hits (117 330 FB_HIT/s (σx̄ = 511.57, n = 100)).
Intel claims that the leakage is entirely from the fill buffer. However, our second experiment shows that the line-fill buffer might not be the only source of the leakage. We rely on Intel TSX to ensure that memory accesses do not reach the line-fill buffer as follows. Inside a transaction, we first write the secret value to a memory location which was previously initialized with a different value. The write inside the transaction ensures that the address is in the write set of the transaction and thus in L1 [32, 60]. Evicting data from the write set from the cache leads to a transactional abort [32]. Hence, any subsequent memory access to the data from the write set ensures that it is served from the L1, and therefore, no request to the line-fill buffer is sent [31]. In this experiment, we see a much higher rate of leakage which is in the order of kilobytes per second. More importantly, we only see the value written inside the TSX transaction and not the value that was at the memory location before starting the transaction. Our hypothesis that the line-fill buffer is not the only source of the leakage is further backed by observing performance counters. The MEM_LOAD_RETIRED.FB_HIT and MEM_LOAD_RETIRED.L1_MISS performance counters do not
ZombieLoad: Cross-Privilege-Boundary Data Sampling
increase significantly. In contrast, the MEM_LOAD_RETIRED.L1_HIT performance counter shows multiple thousand L1 hits.
While accessing the data to leak on the victim core, we monitored the MEM_LOAD_RETIRED.FB_HIT performance counter on the attacker core for 10 s. If the address was cached, we measured a Pearson correlation of rp = 0.02 (n = 100) between the correct recoveries and line-fill buffer hits, indicating no association. However, while continuously flushing the data on the victim core, ensuring that a subsequent access must go through the LFB, we measure a strong correlation of rp = 0.86 (n = 100). This result indicates that the line-fill buffer is not the only source of leakage. However, a different explanation might be that the performance counters are not reliable in such corner cases. Future work has to investigate whether other microarchitectural elements, e.g., the load buffer, are also involved in the observed data leakage.
3.3 Classification
In this section, we introduce a way to classify memory-based side-channel and transient-execution attacks. For all these attacks, we assume a target program which executes a memory operation at a certain address with a specific data value at the program’s current instruction pointer. Figure 1 illustrates these three properties as the corners of a triangle, and techniques which let an attacker infer one of the properties based on one or both of the other properties.
Figure 1: The 3 properties of a memory operation: instruction pointer of the program, target address, and data value. So far,
Traditional memory-based side-channel attacks allow an attacker to observe the location of memory accesses. The granularity of the location observation depends on the spatial accuracy of the used side channel. Most common memory-based side-channel attacks [20, 22, 23, 25, 37, 56, 58, 71, 78, 79] have a granularity between one cache line [22, 23, 25, 79], i.e., usually 64 B, and one page [20, 37, 71, 78], i.e., usually 4 kB. These side channels establish a connection between the time domain and the space domain. The time domain can either be the wall time or also commonly the execution time of the program which correlates with the instruction pointer. These classic side channels provide means of connecting the address of a memory access to a set of possible instruction pointers, which then allows reconstructing the program flow. Thus, side-channel resistant applications have to avoid secret-dependent memory access to not leak secrets to a side-channel attacker.
Since early 2018, with transient execution attacks [9] such as Meltdown [45] and Spectre [42], there is a second type of attack which allows an attacker to observe the value stored at a memory address. Meltdown provided the most control over the target address. With Meltdown, the full virtual address of the target data is provided, and the corresponding data value stored at this address is leaked. The success rate depends on the location of the data, i.e., whether it is in the cache or main memory. However, the only constraint for Meltdown is that the data is addressable using a virtual address [45]. Other Meltdown-type attacks [53, 68] also connect addresses to data values. However, they often impose additional constraints, such as that the data has to be cached in L1 [68, 75], the physical address has to be known [75], or that an attacker can choose only parts of the target address [53].
Figure 2 illustrates which parts of the virtual and physical address an attacker can choose to target data values to leak. For Meltdown, the virtual address is sufficient to target data in the same address space [45]. Foreshadow already requires knowledge of the physical address and the least-significant 12 bits of the virtual address to target any data in the L1, not limited to the own address space [68, 75]. When leaking the last writes from the store buffer, an attacker is already limited in choosing which value to leak. It is only possible to filter stores based on the least-significant 12 bits of the virtual address; a more targeted leakage is not possible [53].
[Figure 2: address-bit diagrams for Meltdown, Foreshadow, Fallout, and ZombieLoad, showing physical-address bits 51 to 12, virtual-address bits 47 to 12, and the low bits 11 to 0 (split into 11 to 6 and 5 to 0 for ZombieLoad).]
Zombie loads provide no control over the leaked address to an attacker. The only possible target selection is the byte index inside the loaded data, which can be seen as an address with up to 6 bits in case an entire cache line is loaded. Hence, we do not count ZombieLoad as an attack which leaks data values based on the address. Instead, from the viewpoint of the target control, ZombieLoad is more similar to traditional memory-based side-channel attacks. With ZombieLoad, an attacker observes the data value of a memory access. Thus, this side channel establishes a connection between the time domain and the data value. Again, the time domain correlates with the instruction pointer of the target address. ZombieLoad is the first instance of a class of attacks which connects the instruction pointer with the data value of a memory access. We refer to such attacks as data sampling attacks. Essentially, this new class of data sampling attacks is capable of breaking side-channel resistant applications, such as constant-time cryptographic algorithms [27].
Following the classification scheme from Canella et al. [9], ZombieLoad is a Meltdown-type transient execution attack, and we propose Meltdown-MCA as the generic name. This reflects that the (microarchitectural) fault type being exploited by ZombieLoad is the microcode assist (MCA, explained further).
4 ATTACK SCENARIOS & ATTACKER MODEL
Following most side-channel attacks, we assume the attacker can execute unprivileged native code on the target machine. Thus, we
assume a trusted operating system if not stated otherwise. This relatively weak attacker model is sufficient to mount ZombieLoad. However, we also show that the increased attacker capabilities offered in certain scenarios, e.g., SGX and hypervisor attacks, may further amplify the leakage while remaining within the threat model of the respective scenario.
At the hardware level, we assume a ubiquitous Intel CPU with simultaneous multithreading (SMT, also known as hyperthreading) enabled. Crucially, we do not rely on existing vulnerabilities, such as Meltdown [45], Foreshadow [68, 75], or Fallout [53].
User-Space Leakage. In the cross-process user-space scenario, an unprivileged attacker leaks values loaded by another concurrently running user-space application. We consider such a cross-process scenario most dangerous for end users, who are not commonly using Intel SGX nor virtual machines. Moreover, many secrets are likely to be found in user-space applications such as browsers or password managers.
The attacker can execute unprivileged code and is co-located with the victim on the same physical but a different logical CPU core. This is a typical case for hyperthreading, where attacker and victim each run on one hyperthread of the same CPU.
Kernel Leakage. In addition to leakage across user-space applications, ZombieLoad can also leak across the privilege boundary between user and kernel space. We demonstrate that the value of loads executed in kernel space is leaked to an unprivileged attacker, executing either on the same or a sibling logical core.
In this scenario, the unprivileged attacker performs a system call to the kernel, running on the same logical core. Importantly, we found that kernel load leakage may even survive the switch back from the kernel to user space. Hyperthreading is hence not a strict requirement for this scenario.
Intel SGX Leakage. In addition to leaking values loaded by the kernel, ZombieLoad can observe loads executed inside an Intel SGX enclave. In this scenario, the attacker is executing on a sibling logical core, co-located with the victim enclave on the same physical core. We demonstrate that ZombieLoad can leak secrets loaded during the enclave’s execution from a concurrent logical core, but we did not observe leakage on the same logical core after exiting the enclave synchronously (eexit) or asynchronously (on interrupt).
While in the aftermath of the Foreshadow [68] attack, current SGX attestations indicate whether hyperthreading has been enabled at boot time, Intel’s official security advisory [34] merely suggests that a remote verifier might reject attestations from a hyperthreading-enabled system “if it deems the risk of potential attacks from the sibling logical processor as not acceptable”. Hence, machines with up-to-date patched microcode may still run with hyperthreading enabled.
Within the SGX threat model, we can leverage the attacker’s first-rate control over the untrusted operating system. An attacker can, for instance, modify page table entries [71], or precisely execute the victim enclave at most one instruction at a time [69].
Virtual Machine Leakage. With ZombieLoad, it is possible to leak loaded values across virtual-machine boundaries. In this scenario, an attacker running inside a virtual machine can leak values from a different virtual machine co-located on the same physical but different logical core. Thus, an attacker can leak values loaded from a virtual machine running on the sibling logical core.
As the attacker is running inside an untrusted virtual machine, the attacker is not restricted to unprivileged code execution. Thus, the attacker can, for instance, modify guest-page-table entries.
Hypervisor Leakage. In the hypervisor scenario, an attacker running inside a virtual machine utilizes ZombieLoad to leak the value of loads executed by the hypervisor.
As the attacker is running inside an untrusted virtual machine, the attacker is not restricted to unprivileged code execution.
5 BUILDING BLOCKS
In this section, we describe the building blocks for the attack.
5.1 Zombie Loads
The main primitive for mounting ZombieLoad is a load which triggers a microcode assist, resulting in a transient load containing wrong data. We refer to such a load as a zombie load. Zombie loads are loads which either architecturally or microarchitecturally fault and thus cannot complete, requiring a re-issue of the load at a later point. We identified multiple different scenarios to create such zombie loads required for a successful attack. All variants have in common that they abuse the clflush instruction to reliably create the conditions required for leaking from a wrong destination (cf. Section 3.2). In this section, we describe 2 different variants that can be used to leak data (cf. Section 5.2) depending on the adversary’s capabilities. Table 1 gives an overview of which variant is applicable in which scenario, depending on the operating system and underlying hardware configuration.
Table 1: Overview of different variants to induce zombie loads in different scenarios.
Scenario                      Variant 1    Variant 2
Unprivileged Attacker
Privileged Attacker (root)
Symbols indicate whether a variant can be used in the corresponding attack scenario ( ), can be used depending on the hardware configuration as discussed in Section 5.1 ( ), or cannot be used ( ).
Figure 3: Variant 1: Using huge kernel pages for ZombieLoad. Page p is mapped using a user-accessible address (v) and a kernel-space huge page (k). Flushing v and then reading from k using Meltdown leaks values from the fill buffer. (Diagram labels: faulting load, flush, cache line, user mapping v, kernel address k, page p, 4 KB, 2 MB.)
Variant 1: Kernel Mapping. The first variant is a ZombieLoad setup which does not rely on any specific CPU feature. We require a kernel virtual address k, i.e., an address where the user-accessible
bit is not set in the page-table entry. In practice, the kernel is usually mapped with huge pages (i.e., 2 MB pages). Thus k refers to a 2 MB physical page p. Note that although we use such huge pages for our experiments, it is not strictly required, as the setup also works with 4 kB pages. We also require the user to have read access to the content of the physical page through a different virtual address v. Figure 3 illustrates such a setup. In this setup, accessing the page p via the user-accessible virtual address v provides an architecturally valid way to access the contents of the page. Accessing the same page via the kernel address k results in a zombie load similar to Meltdown [45] requiring a microcode assist. Note that while there are other ways to construct an inaccessible address k, e.g., by clearing the present bit [68], we were only able to exploit zombie loads originating from kernel mappings.

To create precisely the scenario depicted in Figure 3, we allocate a page p in the user space with the virtual address v. Note that p is a regular 4 kB page which is accessible through the virtual address v. We retrieve its physical address through /proc/pagemap, or alternatively using a side channel [22, 36]. Using the physical address and the base address of the direct-physical map, we get an inaccessible kernel address k which maps to the allocated page p. If the operating system does not use stronger kernel isolation [21], e.g., KPTI [47], the direct-physical map in the kernel is mapped in the user space and uses huge pages which are marked as not user accessible. In the case of a privileged attacker, e.g., when attacking a hypervisor or SGX enclave, an attacker can easily create such pages if they do not exist.

Variant 2: Microcode-Assisted Page-Table Walk. A variant similar to Variant 1 is to trigger a microcode-assisted page-table walk. If a page-table walk requires an update to the accessed or dirty bit in the page-table entry, it falls back to a microcode assist [13]. In this setup, we require one physical page p which has 2 user-accessible virtual addresses, v and v2. This can be easily achieved by using a shared-memory segment or memory-mapped file, which is mapped twice in the application. The virtual address v can be used to access the contents of p architecturally. For v2, we have to clear the accessed bit in the page-table entry. On Linux, this is not possible in the case of an unprivileged attacker, and can thus only be used in attacks where we assume a privileged attacker (cf. Section 4). However, we experimentally verified that Windows 10 (1803 build 17134.706) periodically clears the accessed bits. We assume that the page-replacement algorithm is responsible for this. Thus, this variant enables the attack on Windows for unprivileged attackers.

When accessing the page through the virtual address v2, the accessed bit of the page-table entry has to be set. This, however, cannot be done by the page-miss handler [13]. Instead, microarchitecturally, the load faults, and a microcode assist is triggered which repeats the page-table walk and sets the accessed bit [13]. If the access to v2 is done transiently, i.e., behind a misspeculated branch or after an exception, the accessed bit cannot be set architecturally. Thus, the leakage is not only exploitable once but instead for every access.

5.2 Data Leakage

To leak data with the setup described in Section 5.1, we constantly flush the first cache line of p through the virtual address v. We achieve this by executing the unprivileged clflush instruction (or clflushopt instruction if available) on the user-accessible virtual address v. For Variant 1, we leverage Meltdown to read from the kernel address k which maps to the cache line flushed before. As with Meltdown-US [45], various methods of preventing an architectural exception can be used. We verified that ZombieLoad with Variant 1 works with exception prevention (i.e., speculative execution), handling (i.e., a custom signal handler), and suppression (i.e., Intel TSX).

For Variant 2, we transiently, i.e., behind a mispredicted branch, read from the address v2.

Counterintuitively, the resulting values leaked for all variants do not come from page p. Instead, we get access to data which is currently loaded on the current or sibling logical CPU core. Thus, it appears that we reuse fill-buffer entries, and leak the data which the entries reference. For Variant 1 and Variant 2, this allowed us to access all bytes from the cache line that the fill-buffer entry references.

5.3 Data Sampling

Independent of the setup for ZombieLoad, we cannot directly control the address of the data to leak. The virtual addresses k and v, as well as the physical address of p, are arbitrary and do not correlate with the leaked data. In any case, we simply get the value referenced by one fill-buffer entry which we cannot specify.

However, there is at least control within the fill-buffer entry, i.e., we can target specific bytes within the 64 B fill-buffer entry. The least-significant 6 bits of the virtual address v refer to the byte within the fill-buffer entry. Hence, we can target a single byte at a specific position from the fill-buffer entry. While at first, this does not sound powerful, it allows leaking sensitive information, such as AES keys, byte-by-byte as shown in Section 6.1.

As described in Section 4, the leakage is not limited to the attacker’s own process. With ZombieLoad, we observe values from all processes running on the same as well as on the sibling logical CPU core. Furthermore, we also observe leakage across privilege boundaries, i.e., from the kernel, hypervisor, and Intel SGX enclaves. Thus, ZombieLoad allows sampling of all data which is loaded by any application on the current physical CPU core.

5.4 Performance Evaluation

In this section, we evaluate ZombieLoad and the performance of our proof-of-concept implementations¹.

Environment. We evaluated the different variants of ZombieLoad, described in Section 5.1, on different environments listed in Table 2. The tested CPUs range from Sandy Bridge (released 2012) to Cascade Lake (released 2019). We were able to mount Variant 1 and Variant 2 on different microarchitectures except for Whiskey Lake, Coffee Lake-R, and Cascade Lake-SP.

¹ https://github.com/IAIK/ZombieLoad
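As a small worked illustration of the byte selection described in Section 5.3: with a 64 B fill-buffer entry, the least-significant 6 bits of a virtual address give the byte position inside the entry, and the remaining bits identify the entry. The following is a minimal sketch of that address arithmetic only. The function names are hypothetical and not taken from the proof-of-concept, and no load is performed.

```c
#include <stdint.h>

/* Size of one fill-buffer entry (one cache line), as stated in the text. */
#define ENTRY_SIZE 64u

/* Byte position inside the 64 B entry: the least-significant 6 bits. */
unsigned entry_offset(uintptr_t addr) {
    return (unsigned)(addr & (ENTRY_SIZE - 1u));
}

/* Address of the first byte of the entry that contains addr. */
uintptr_t entry_base(uintptr_t addr) {
    return addr & ~(uintptr_t)(ENTRY_SIZE - 1u);
}

/* Same entry as addr, but pointing at byte `offset` (0..63) within it. */
uintptr_t target_byte(uintptr_t addr, unsigned offset) {
    return entry_base(addr) | (uintptr_t)(offset & (ENTRY_SIZE - 1u));
}
```

For example, the address 0x1234 lies at offset 0x34 of the entry starting at 0x1200, and target_byte(0x1234, 7) yields 0x1207.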
a domino byte is possible, as the transient domain has access to the full AES key and can use it for arbitrary computations (cf. Section 6.3). Figure 4 illustrates the idea of the Domino attack. In this case, we leak (4,4) domino bytes consisting of 4 bits of two adjacent key bytes respectively. By combining the lower nibble of one key byte with the higher nibble of the next key byte, we transmit a domino byte which encodes partial information of two key bytes. Hence, in a post-processing step, we combine the probability distribution of two adjacent key bytes with the probability distribution of the domino byte to select the two adjacent key bytes with the highest combined probability. Note that the selection of bits can be adapted to the noise which can be measured before leaking the key, e.g., multiple (7,1) domino bytes can be leaked that are shifted by only a single bit.

Results. We evaluated the attack in a cross-user-space attack (cf. Section 4). We always ran the attack until the correct key was recovered, i.e., until the key with the highest probability is the correct key. In a practical attack, the number of attacks can even be reduced, as typically it is easy to verify whether a key candidate is correct. Thus, an attacker can simply test all key candidates with a probability over a certain threshold and does not have to wait until the highest probability corresponds to the correct key.

On average, we recovered the entire AES-128 key of the victim in under 10 s using the cache-based trigger and the Domino attack. During this time, the key was loaded approximately 10 000 times by the victim.

6.2 SGX Sealing Key Extraction

In this section, we show that privileged SGX attackers can drastically improve ZombieLoad’s temporal resolution and bridge from incidental data sampling in the time domain to the targeted reconstruction of arbitrary enclave secrets (cf. Figure 1). We first explain how state-of-the-art enclave execution control and transient post-processing techniques can be leveraged to reliably leak register values at any point during an enclave invocation. Then we demonstrate the impact of this attack by recovering a full 128-bit SGX sealing key, as used by Intel’s trusted provision and quoting enclaves to decrypt the long-term EPID private attestation key.

Leaking Enclave Registers. We consider Intel SGX root attackers that co-locate with a victim enclave on the same physical CPU. As a system attacker, we can increase ZombieLoad’s temporal resolution by leveraging previous research results exploiting page faults [71, 78] or interrupts [54, 70] to regulate the victim enclave’s execution. We use the SGX-Step [69] framework to precisely single-step the victim enclave one instruction at a time, allowing the attacker to reach a code part where sensitive information is stored in CPU registers. At such a point, we switch to unlimited zero-stepping [68] by either setting the system timer interrupt to a very short interval or revoking code page execute permissions before resuming the victim enclave. This technique provides ZombieLoad attackers with a primitive to repeatedly force-reload CPU registers from the interrupted enclave’s SSA frame (cf. Section 2.3). Our experiments show that even though execution of the enclave instruction never completes, any direct operands plus SSA register file contents are loaded from memory each time. Importantly, since the enclave does not make progress, we can perform unlimited ZombieLoad attack attempts to reconstruct CPU register values from these implicit SSA memory accesses.

We further reduce noise from unrelated non-enclave loads on the victim CPU by opting for timer-based zero-stepping with a user-space interrupt handler [70] to avoid repeatedly invoking the operating system. Furthermore, we found that executing the ZombieLoad attack code in a separate address space avoids unnecessarily slowing down the spy through implicit TLB invalidations on enclave entry/exit [30].

Note that the SSA frame spans multiple cache lines. With ZombieLoad, we do not have explicit address-based control over which cache line is being leaked. Hence, leaked data might come from different saved registers that are at the same offset within a cache line. To filter out such noisy observations, we use the Domino transient error detection technique introduced in Section 6.1. Specifically, we implemented a “sliding window” that transmits 7 different domino bytes for each candidate key byte, stuffed with increasing bits from the next adjacent key byte candidate. Any noisy observations that do not match the overlap can now efficiently be filtered out.

Attack on sgx_get_key. The Intel SGX design includes a secure key derivation facility through the egetkey instruction (cf. Section 2.3). Enclaves execute this instruction to query a 128-bit cryptographic key from the hardware, based on the calling enclave’s code layout or developer identity. This is the underlying primitive used by Intel’s trusted prebuilt quoting enclave to securely unseal a long-term private attestation key from persistent storage [13, 68].

The official Intel SGX SDK [30] offers a convenient sgx_get_key wrapper procedure that first executes egetkey with the necessary parameters, and eventually copies the retrieved key into a provided buffer. We reverse engineered the proprietary intel_fast_memcpy function and found that in this case, the key is copied using two 128-bit moves to/from the xmm0 SSE register. We revert to zero-stepping on the last instruction of the memcpy invocation. At this point, the attacker-induced zero-step enclave resumptions will repeatedly reload, among others, the xmm0 register containing the 128-bit key from the memory hierarchy.

Results. We evaluated the attack on a Kaby Lake i7-7700 CPU with an up-to-date Foreshadow-patched microcode revision 0x8e. In the first experiment, we implemented a benchmark enclave that uses sgx_get_key to generate a new report key with different random key IDs. We performed 100 key-recovery experiments on sgx_get_key with different random keys. Our results show that in 30 % of the cases, the full 128-bit key is among the key candidates, with an average remaining key-space entropy of 8.8 bits. Among these cases, the exact full key was recovered 3 % of the time. In the other 70 % of the cases, where the full key is not among the key candidates, we have partial key bytes among the recovered key candidates 31 % of the time. On average, 10 out of 16 key bytes are correct, with a remaining global entropy of 13.59 bits. In the remaining 39 % of the cases, where the correct key is not among the key candidates, our attack, which uses the Domino technique with a sliding window, did not reveal any candidates, which means an attacker can simply repeat the attack in such cases. Also, in cases where some of the key bytes are part of the candidates, most of the failed key bytes reside in the first few bytes of the key. The reason
for this behavior is that the explained Domino attack will have a stronger effect on key bytes in the middle that are surrounded by more key bytes.

In the second experiment, we perform an attack on Intel’s trusted quoting enclave. The quoting enclave performs a call to sgx_get_key to derive the sealing key which is used to decrypt the EPID provisioning blob. We executed the attack on a quoting enclave that is signed with debug keys, so we can use it as a ground truth to easily verify that we have recovered the correct sealing key. We executed the attack multiple times on our setup, and we managed to recover the correct 128-bit sealing key after multiple executions of the attack and checking the candidates against each other. The recovered sealing key matches the correct key, and can indeed successfully decrypt the EPID blob for our debug-signed quoting enclave. While we did not yet reproduce this attack to recover the sealing key from the official quoting enclave image signed by Intel, we believe that this experimental evaluation showcased all the required primitives to break Intel SGX’s remote attestation guarantees, as demonstrated before by Foreshadow [68].

6.3 Cross-VM Covert Channel

To evaluate the performance of ZombieLoad, we implement a covert channel which can be used for all attack scenarios described in Section 4. However, in this section, we focus on the cross-VM covert channel. While covert channels are possible for Intel SGX, the kernel, and the hypervisor, these are somewhat artificial scenarios. Moreover, there are various covert channels available to user-space applications for stealthy inter-process communication [17, 51].

For VMs, however, there are not many known covert channels which can be used between two VMs. So far, all cross-VM covert channels either relied on Prime+Probe [46, 50, 51, 59, 77], DRAMA [58], or bus locking [76]. We show that ZombieLoad can be used as a fast and reliable covert channel between VMs scheduled on the same physical core.

Sender. For the fastest result, the sender repeatedly loads the value to be transmitted from the L1 cache into a register. By not only loading the value from one memory address but instead from multiple memory addresses, the sender ensures that potentially multiple fill-buffer entries are used. In addition, this also thwarts an optimization of Intel CPUs which combines multiple loads from the same cache line to a single load [1].

On a CPU supporting AVX2, the sender can encode up to 256 bits per load (e.g., using the VMOVAPS load).

Receiver. The receiver mounts ZombieLoad to leak the values loaded by the sender. However, as the receiver leaks the loads only in the transient domain, the leaked values have to be transferred into the architectural domain. We encode the leaked values into the cache and recover them using Flush+Reload. When encoding values in the cache, we require at least 2 cache lines, i.e., 128 B, per bit to prevent the adjacent-cache-line prefetcher from interfering with the encoding. In practice, we require one physical page, i.e., 4 kB, per possible value to prevent interference of the prefetcher. To reduce the recovery bottleneck, we transfer single bytes from the transient to the architectural domain which already requires 256 runs of Flush+Reload.

As a result, our proof-of-concept limits the transmission of actual data to a single byte per leaked load. However, we can use the remaining bits in the load to ensure that the channel is free of errors.

Bits 31-24: 0xFF | Bits 23-16: SEQ | Bits 15-8: checksum of DATA | Bits 7-0: DATA
Figure 5: The packet format used in the covert channel. Every 32-bit packet consists of 8 data bits, 8-bit checksum (two’s complement), 8-bit sequence number, and a constant prefix.

Transient Error Detection. The transmission of the data between sender and receiver is free of any noise. However, the receiver does not only recover values from the sender, but also other loads from the current and sibling logical core. Hence, to get rid of this noise, we encode the data as shown in Figure 5. This allows the receiver to filter out data not originating from the sender.

Although we cannot transfer the entire packet into the architectural domain, we can compute on the packet in the transient domain. Thus, we run the error detection in the transient domain, and only transmit valid packets to the architectural domain.

The challenge to run the error detection in the transient domain is that the number of instructions is limited, and not all instructions can be used. For reliable results, we cannot use instructions which speculate on either control or data flow. Hence, the error-detection code has to be as short as possible and branch-free.

Our packet structure allows for extremely efficient error detection. We encode the data in the first byte and the two’s complement of the data in the second byte as a checksum. To detect errors, we XOR the value of the first byte (i.e., the data) onto the second byte (i.e., the two’s complement of the data). If both values are received correctly, the XOR ensures that the bits 8 to 15 of the packet are zero. Thus, for a correct packet, the least-significant 16 bits of the packet represent a value between 0 and 255, and for a wrong packet, these bits represent a value which is larger than 255. We use the resulting 16-bit value as an index into our oracle array, i.e., an array consisting of 256 pages. Therefore, any value which is not a correct byte is out of bounds and thus has no effect on the cache state of the array. A correct byte is also a valid index into the oracle array and ensures that the first cache line of the corresponding page is cached. Finally, by applying a cache-based side-channel attack, such as Flush+Reload, we can recover the byte from the cache state of the oracle array [42, 45].

The error detection in the transient domain has the advantage that we do not require computation time in the architectural domain. Instead of waiting for the exception to become architecturally visible by doing nothing, we already use this time to perform the required computation. An additional advantage is that while we are still in the transient domain, we can work on noise-free data. Thus, we do not require complex error correction after receiving the data [51].

In addition to the error detection, we also encode a sequence number into the packet. The sequence number allows ordering the received packets. It can be recovered using the same method as the data value, e.g., using an oracle array and a cache-based side-channel attack.
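The packet layout of Figure 5 and the branch-free validity check can be illustrated with a minimal sketch in ordinary (architectural) C. It makes one arithmetic assumption that should be read with care: the text combines the two low bytes with an XOR, whereas this sketch adds the data byte to a two’s-complement check byte modulo 256, which is one way for a valid packet to cancel to zero in bits 8 to 15. That choice, and all function names, are assumptions of this sketch and not the proof-of-concept code.

```c
#include <stdint.h>

/* Constant prefix in bits 31-24, as in Figure 5. */
#define PACKET_PREFIX 0xFFu

/* Build a 32-bit packet: prefix | sequence number | check byte | data. */
uint32_t encode_packet(uint8_t data, uint8_t seq) {
    uint8_t check = (uint8_t)(0u - data);   /* two's complement of data */
    return ((uint32_t)PACKET_PREFIX << 24) | ((uint32_t)seq << 16) |
           ((uint32_t)check << 8) | (uint32_t)data;
}

/* Branch-free check on the least-significant 16 bits.
 * Valid packet:   returns the data byte (0..255).
 * Invalid packet: returns a value of 256 or more, i.e., an index
 *                 outside a 256-entry oracle array. */
uint32_t oracle_index(uint32_t packet) {
    uint32_t data    = packet & 0xFFu;
    uint32_t check   = (packet >> 8) & 0xFFu;
    uint32_t residue = (check + data) & 0xFFu;  /* 0 only if the bytes match */
    return (residue << 8) | data;
}

/* Sequence number stored in bits 23-16. */
uint8_t packet_seq(uint32_t packet) {
    return (uint8_t)((packet >> 16) & 0xFFu);
}
```

With this encoding, the data byte 0x41 with sequence number 3 becomes 0xFF03BF41, which maps to index 0x41. Flipping any bit in the check byte leaves a non-zero residue and pushes the index to 256 or above.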
Results. We evaluate the covert channel both in a lab environment as well as in a public cloud. In the lab environment, we used 2 virtual machines running inside QEMU KVM on an i7-8650U. For the cloud scenario, we used 2 co-located virtual machines running CentOS 7.6.1810 with a Linux kernel version of 3.10.0-957 on a Xeon E5-2670 CPU.

Both on the cloud, as well as on our lab machine, we achieved an error-free transmission. On our lab machine, we observed transmission rates of up to 26.8 kbit/s. As TSX was not available in the cloud scenario, we achieved a transmission rate of 1.99 kbit/s (σx̄ = 2.5 %, n = 1000) with Variant 1 and signal handling.

Table 3: Number of accesses required to recover a website name. The experiment was repeated 100 times per website.

Website        Minimal  Average  Maximum
nytimes.com    1        1        3
facebook.com   1        2        4
kernel.org     2        6        13
gnupg.org      2        10       34

Listing 1:
    if (x < array_len) {
        y = array[x];
    }
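For comparison with the unhardened pattern in Listing 1, the following minimal sketch shows the bitmask style of hardening, in which the index is clamped without a branch before it is used. The mask computation is written in the spirit of the generic index-masking approach used in the Linux kernel, but it is a standalone illustration with hypothetical function names. It assumes that the array length does not exceed LONG_MAX and that right-shifting a negative value is arithmetic, as on GCC and Clang. A production version must also keep the compiler from turning the mask back into a branch.

```c
#include <limits.h>

/* All-ones if index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (implementation-defined in C; holds for GCC and Clang). */
unsigned long index_mask(unsigned long index, unsigned long size) {
    long combined = (long)(index | (size - 1UL - index));
    return (unsigned long)(~combined >> (sizeof(long) * CHAR_BIT - 1));
}

/* Listing 1 with the index clamped before the load: an out-of-bounds
 * index is forced to 0, even on a mispredicted path. */
unsigned char load_clamped(const unsigned char *array,
                           unsigned long array_len, unsigned long x) {
    unsigned char y = 0;
    if (x < array_len) {
        x &= index_mask(x, array_len);
        y = array[x];
    }
    return y;
}
```

For an array of length 10, index 3 yields an all-ones mask and is left unchanged, while index 10 yields a zero mask.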
Speculative Data Leakage. Listing 1 illustrates such a gadget. It is a common pattern in software for accessing an element of an array [9]. First, the code checks whether the index lies within the bounds of the array. Only if this is the case, the element is accessed, i.e., loaded. While it is evident that for a user-controlled index the corresponding array element can be loaded, such a gadget is even more powerful.

On a CPU vulnerable to Spectre, an attacker can mistrain the branch predictor, e.g., by providing several valid values for the array index. Then, by providing an out-of-bounds index, the branch is misspeculated and speculatively accesses an out-of-bounds value. Alternatively, the attacker can alternate between valid and out-of-bounds indices randomly to achieve a high percentage of mispredictions without any prior branch predictor mistraining.

ZombieLoad can leak not only architecturally accessed data but also speculatively accessed data. Hence, ZombieLoad can even see the value of loads which are never architecturally visible. Such loads include, among others, speculative memory loads and prefetches. Thus, any Spectre gadget which is not hardened, e.g., using a memory fence [4, 5, 9, 33] or a mask [9, 10], can be used to specify data to leak.

Moreover, ZombieLoad does not require classic Spectre gadgets containing an indirect array access [42]. A simple out-of-bounds access (cf. Listing 1) is sufficient. While such gadgets have been demonstrated for breaking KASLR [62], they were considered as relatively harmless as they do not leak data [9]. Hence, most approaches for finding gadgets do not consider such gadgets [26, 73]. In the Linux kernel, however, such gadgets are also patched if they are discovered, mainly as they can be used together with the Foreshadow vulnerability to leak arbitrary kernel memory [12, 66]. So far, 172 such gadgets have been fixed in kernel 5.0 [9]. With ZombieLoad, we show that such gadgets are indeed powerful and have to be patched as well.

Potential Incompleteness of Countermeasures. Mainly, there are 2 methods to prevent exploitation of Spectre-PHT: memory fences after branches [4, 5, 9, 33], or constraining the index to a valid range using a bitmask [9, 10]. The variant using fences is implemented in the Microsoft compiler [41, 42], whereas the variant using bitmasks is implemented in GCC [48] and LLVM [10], and also used in the Linux kernel [48].

Both methods prevent exploitation of Spectre-PHT [9], as the misspeculation cannot load any data. Hence, this is also effective against ZombieLoad, as fixed gadgets cannot be exploited to load arbitrary values.

However, even with these countermeasures in place, there is a remaining leakage which can be exploited using ZombieLoad. When architecturally loading an in-bounds value, ZombieLoad can leak up to 64 bytes of the load. Hence, with ZombieLoad, there is a potential leakage of up to 63 bytes which are out of bounds if the last in-bounds value is at the beginning of a cache line or the base of the array is at the end of a cache line.

Data Leakage. To demonstrate the feasibility of prefetch gadgets for targeted data leakage, we leverage an artificial prefetch gadget as given in Listing 1. For our evaluation, we used such a gadget in the system-call path of the Linux kernel 5.0.7. We execute ZombieLoad on one logical core and on the other we execute system calls that switch between out-of-bounds and in-bounds array indices to achieve a high frequency of mispredictions in the gadget.

This approach yields leaked values with a large noise component from unrelated loads. We repeat this setup without trying to generate mispredictions to generate a baseline of noise values. We generate frequency distributions for both runs and subtract the noise frequency from the misprediction run. We then choose the byte value that was seen most frequently.

With this crude statistical method, we can recover kernel memory at one byte per 10 s with 38 % accuracy. Probing bytes for 20 s improves the accuracy to 46 %.

As with Meltdown [45], common byte values such as 0x00 and 0xFF occur too often and have to be removed from the leaked data for the recovery to work. Our approach is thus blind to these values.

The speed and accuracy can be improved if there is a priori knowledge of the target data. For example, a 7-bit ASCII string can be leaked with a probing time of 10 s per byte with 72 % accuracy.

7 COUNTERMEASURES

As ZombieLoad leaks loaded values across logical cores, a straightforward mitigation is disabling the use of hyperthreading. Hyperthreading improves performance for certain workloads by 30 % to 40 % [8, 52], and as such disabling it may incur an unacceptable performance impact.

Co-Scheduling. Depending on the workload, a more efficient mitigation is the use of co-scheduling [55]. Co-scheduling can be configured to prevent the execution of code from different protection domains on a hyperthread pair. Current topology-aware co-scheduling algorithms [64] are not concerned with preventing kernel code from running concurrently with user-space code. With such a scheduling strategy, leaks between user processes can be prevented but leaks between kernel and user space cannot. To prevent leakage between kernel and user space, the kernel must additionally ensure that kernel entries on one logical core force the sibling logical core into the kernel as well. This discussion applies in an analogous way to hypervisors and virtual machines.

Flushing Buffers. We have demonstrated that ZombieLoad also works across protection boundaries on a single logical core. Hence, neither disabling hyperthreading nor co-scheduling is fully effective as a mitigation. We have not found an instruction sequence that reliably prevents leakage across protection boundaries. Even flushing the entire L1 data cache (using MSR_IA32_FLUSH_CMD) and issuing as many dummy loads as there are fill-buffer entries (“load stuffing”) is not sufficient. There is still remaining leakage, which we assume is caused by the replacement policy of the line-fill buffer. Hence, to fully mitigate the leakage, we require a microcode update which provides a method to flush the line-fill buffer.

Selective Feature Deactivation. Weaker countermeasures target individual building blocks (cf. Section 5). The operating system kernel can make sure to always set the accessed and dirty bits in page tables to impair Variant 2. Unfortunately, Variant 1 is always possible, if the attacker can identify an alias mapping of any accessible user page in the kernel. This is especially true if the attacker
is running in or can create a virtual machine. Hence, we also recommend disabling VT-x on systems that do not need to run virtual machines.

Removing Prefetch Gadgets. To prevent targeted data leakage, prefetch gadgets need to be neutralized, e.g., using array_index_nospec in the Linux kernel. This function clamps array indices into valid values and prevents arbitrary virtual memory from being prefetched. Placing these functions is currently a manual task and due to the incomplete documentation of how Intel CPUs prefetch data, these mitigations cannot be complete. Note that Spectre mitigations using lfence instructions might also be incomplete against ZombieLoad.

Another way to prevent prefetch gadgets from reaching sensitive data is to prevent this data from being mapped in the address space of the prefetch gadget. Exclusive Page-Frame Ownership [39] (XPFO) partially achieves this for the Linux kernel’s mapping of physical memory.

Prefetch gadgets can also be neutralized using Speculative Load Hardening [10] (SLH). SLH prevents speculative execution by introducing artificial data dependencies via a compiler pass. SLH incurs a performance overhead of 10 % to 50 % for typical applications. To the best of our knowledge, its overhead for kernel or hypervisor code has not been studied yet.

Instruction Filtering. The above discussion mostly focusses on attacks across process or virtual-machine boundaries. For attacks inside of a single process (e.g., JavaScript sandbox), the sandbox implementation must make sure that the requirements for mounting ZombieLoad are not met. One example is to prevent the generation and execution of the clflush instruction, which so far is a crucial part of the attack.

Secret Sharing. On the software side, we can also rely on secret sharing techniques used to protect against physical side-channel attacks [65]. We can ensure that a secret is never directly loaded from memory but instead only combined in registers before being used. As a consequence, observing the data of a load does not reveal the secret. For a successful attack, an attacker has to leak all shares of the secret. This mitigation is, of course, incomplete if register values are written to and subsequently loaded from memory as part of context switching.

8 CONCLUSION

With ZombieLoad, we showed a novel Meltdown-type attack targeting the processor’s fill-buffer logic. ZombieLoad enables an attacker to leak recently loaded values used by the current or sibling logical CPU. We show that ZombieLoad allows leaking across user-space processes, CPU protection rings, virtual machines, and SGX enclaves. We demonstrated the immense attack potential by monitoring browser behaviour, extracting AES keys, establishing cross-VM covert channels or recovering SGX sealing keys. Finally, we conclude that disabling hyperthreading is the only possible workaround to mitigate ZombieLoad on current processors.

ACKNOWLEDGMENTS

We thank Werner Haas (Cyberus Technology), Claudio Canella (Graz University of Technology), Jon Masters (Red Hat), Alex Ionescu (CrowdStrike), and Martin Schwarzl (Graz University of Technology). The research presented in this paper was partially supported by the Research Fund KU Leuven. Jo Van Bulck is supported by a grant of the Research Foundation – Flanders (FWO). The project was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 681402). It was also supported by the Austrian Research Promotion Agency (FFG) via the K-project DeSSnet, which is funded in the context of COMET – Competence Centers for Excellent Technologies by BMVIT, BMWFW, Styria and Carinthia. Additional funding was provided by a generous gift from Intel. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding parties.

REFERENCES

[1] Abramson, J. M., Akkary, H., Glew, A. F., Hinton, G. J., Konigsfeld, K. G., Madland, P. D., Papworth, D. B., and Fetterman, M. A. Method and apparatus for dispatching and executing a load operation to memory, Feb. 1998. US Patent 5,717,882.
[2] Abramson, J. M., Akkary, H., Glew, A. F., Hinton, G. J., Konigsfeld, K. G., Madland, P. D., Papworth, D. B., and Fetterman, M. A. Method and apparatus for dispatching and executing a load operation to memory, 1998. US Patent 5,717,882.
[3] Allan, T., Brumley, B. B., Falkner, K., Van de Pol, J., and Yarom, Y. Amplifying side channels through performance degradation. In ACSAC (2016).
[4] AMD. Software Techniques for Managing Speculation on AMD Processors, 2018. Revision 7.10.18.
[5] ARM Limited. Vulnerability of Speculative Processors to Cache Timing Side-Channel Mechanism, 2018.
[6] Bhattacharyya, A., Sandulescu, A., Neugschwandtner, M., Sorniotti, A., Falsafi, B., Payer, M., and Kurmus, A. SMoTherSpectre: exploiting speculative execution through port contention. arXiv:1903.01843 (2019).
[7] Boggs, D. D., and Rodgers, S. D. Microprocessor with novel instruction for signaling event occurrence and for providing event handling information in response thereto, Apr. 1997. US Patent 5,625,788.
[8] Bulpin, J. R., and Pratt, I. A. Multiprogramming performance of the Pentium 4 with Hyper-Threading. In Second Annual Workshop on Duplicating, Deconstruction and Debunking (WDDD) (2004).
[9] Canella, C., Van Bulck, J., Schwarz, M., Lipp, M., von Berg, B., Ortner, P., Piessens, F., Evtyushkin, D., and Gruss, D. A Systematic Evaluation of Transient Execution Attacks and Defenses. In USENIX Security Symposium (to appear) (2019).
[10] Carruth, C. RFC: Speculative Load Hardening (a Spectre variant #1 mitigation), Mar. 2018.
[11] Chen, G., Chen, S., Xiao, Y., Zhang, Y., Lin, Z., and Lai, T. H. SGXPECTRE Attacks: Leaking Enclave Secrets via Speculative Execution. arXiv:1802.09085 (2018).
[12] Corbet, J. Finding Spectre vulnerabilities with smatch, https://lwn.net/Articles/752408/, Apr. 2018.
[13] Costan, V., and Devadas, S. Intel SGX explained.
[14] Evtyushkin, D., Riley, R., Abu-Ghazaleh, N. C., ECE, and Ponomarev, D. Branchscope: A new side-channel attack on directional branch predictor. In ASPLOS’18 (2018).
[15] Fog, A. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers, 2016.
[16] García, C. P., and Brumley, B. B. Constant-time callees with variable-time callers. In USENIX Security Symposium (2017).
[17] Ge, Q., Yarom, Y., Cock, D., and Heiser, G. A Survey of Microarchitectural Timing Attacks and Countermeasures on Contemporary Hardware. Journal of Cryptographic Engineering (2016).
[18] Glew, A. F., Akkary, H., Colwell, R. P., Hinton, G. J., Papworth, D. B., and Fetterman, M. A. Method and apparatus for implementing a non-blocking translation lookaside buffer, Oct. 1996. US Patent 5,564,111.
[19] Glew, A. F., Akkary, H., and Hinton, G. J. Translation lookaside buffer that is non-blocking in response to a miss for use within a microprocessor capable of processing speculative instructions, 1997. US Patent 5,613,083.
[20] Gras, B., Razavi, K., Bos, H., and Giuffrida, C. Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. In USENIX Security Symposium (2018).
[21] Gruss, D., Lipp, M., Schwarz, M., Fellner, R., Maurice, C., and Mangard, S. KASLR is Dead: Long Live KASLR. In International Symposium on Engineering
always measured the best case, i.e., the minimum latency, to get rid
[Figure: latency increase in cycles plotted against the number of non-temporal stores (6 to 14); plot labels: one thread, Skylake (10 entries), (12 entries)]
leverage the entire fill buffer. Therefore, every logical core can potentially use any entry in the fill buffer.