The Turtles Project: Design and Implementation of Nested Virtualization

Muli Ben-Yehuda†    Michael D. Day‡    Zvi Dubitzky†    Michael Factor†    Nadav Har'El†
muli@il.ibm.com     mdday@us.ibm.com   dubi@il.ibm.com  factor@il.ibm.com  nyh@il.ibm.com

Abel Gordon†    Anthony Liguori‡    Orit Wasserman†    Ben-Ami Yassour†
abelg@il.ibm.com    aliguori@us.ibm.com    oritw@il.ibm.com    benami@il.ibm.com

†IBM Research – Haifa        ‡IBM Linux Technology Center

Abstract

In classical machine virtualization, a hypervisor runs multiple operating systems simultaneously, each on its own virtual machine. In nested virtualization, a hypervisor can run multiple other hypervisors with their associated virtual machines. As operating systems gain hypervisor functionality—Microsoft Windows 7 already runs Windows XP in a virtual machine—nested virtualization will become necessary in hypervisors that wish to host them. We present the design, implementation, analysis, and evaluation of high-performance nested virtualization on Intel x86-based systems. The Turtles project, which is part of the Linux/KVM hypervisor, runs multiple unmodified hypervisors (e.g., KVM and VMware) and operating systems (e.g., Linux and Windows). Despite the lack of architectural support for nested virtualization in the x86 architecture, it can achieve performance that is within 6-8% of single-level (non-nested) virtualization for common workloads, through multi-dimensional paging for MMU virtualization and multi-level device assignment for I/O virtualization.

The scientist gave a superior smile before replying, "What is the tortoise standing on?" "You're very clever, young man, very clever", said the old lady. "But it's turtles all the way down!"¹

¹http://en.wikipedia.org/wiki/Turtles_all_the_way_down

1 Introduction

Commodity operating systems increasingly make use of virtualization capabilities in the hardware on which they run. Microsoft's newest operating system, Windows 7, supports a backward compatible Windows XP mode by running the XP operating system as a virtual machine. Linux has built-in hypervisor functionality via the KVM [29] hypervisor. As commodity operating systems gain virtualization functionality, nested virtualization will be required to run those operating systems/hypervisors themselves as virtual machines.

Nested virtualization has many other potential uses. Platforms with hypervisors embedded in firmware [1,20] need to support any workload and specifically other hypervisors as guest virtual machines. An Infrastructure-as-a-Service (IaaS) provider could give a user the ability to run a user-controlled hypervisor as a virtual machine. This way the cloud user could manage his own virtual machines directly with his favorite hypervisor of choice, and the cloud provider could attract users who would like to run their own hypervisors. Nested virtualization could also enable the live migration [14] of hypervisors and their guest virtual machines as a single entity for any reason, such as load balancing or disaster recovery. It also enables new approaches to computer security, such as honeypots capable of running hypervisor-level rootkits [43], hypervisor-level rootkit protection [39,44], and hypervisor-level intrusion detection [18,25]—for both hypervisors and operating systems. Finally, it could also be used for testing, demonstrating, benchmarking and debugging hypervisors and virtualization setups.

The anticipated inclusion of nested virtualization in x86 operating systems and hypervisors raises many interesting questions, but chief amongst them is its runtime performance cost. Can it be made efficient enough that the overhead doesn't matter? We show that despite the lack of architectural support for nested virtualization in the x86 architecture, efficient nested x86 virtualization—with as little as 6-8% overhead—is feasible even when running unmodified binary-only hypervisors executing non-trivial workloads.

Because of the lack of architectural support for nested virtualization, an x86 guest hypervisor cannot use the hardware virtualization support directly to run its own guests. Fundamentally, our approach for nested virtualization multiplexes multiple levels of virtualization
(multiple hypervisors) on the single level of architectural support available. We address each of the following areas: CPU (e.g., instruction-set) virtualization, memory (MMU) virtualization, and I/O virtualization.

x86 virtualization follows the "trap and emulate" model [21,22,36]. Since every trap by a guest hypervisor or operating system results in a trap to the lowest (most privileged) hypervisor, our approach for CPU virtualization works by having the lowest hypervisor inspect the trap and forward it to the hypervisors above it for emulation. We implement a number of optimizations to make world switches between different levels of the virtualization stack more efficient. For efficient memory virtualization, we developed multi-dimensional paging, which collapses the different memory translation tables into the one or two tables provided by the MMU [13]. For efficient I/O virtualization, we bypass multiple levels of hypervisor I/O stacks to provide nested guests with direct assignment of I/O devices [11, 31, 37, 52, 53] via multi-level device assignment.

Our main contributions in this work are:

• The design and implementation of nested virtualization for Intel x86-based systems. This implementation can run unmodified hypervisors such as KVM and VMware as guest hypervisors, and can run multiple operating systems such as Linux and Windows as nested virtual machines. Using multi-dimensional paging and multi-level device assignment, it can run common workloads with overhead as low as 6-8% of single-level virtualization.

• The first evaluation and analysis of nested x86 virtualization performance, identifying the main causes of the virtualization overhead, and classifying them into guest hypervisor issues and limitations in the architectural virtualization support. We also suggest architectural and software-only changes which could reduce the overhead of nested x86 virtualization even further.

2 Related Work

Nested virtualization was first mentioned and theoretically analyzed by Popek and Goldberg [21, 22, 36]. Belpaire and Hsu extended this analysis and created a formal model [10]. Lauer and Wyeth [30] removed the need for a central supervisor and based nested virtualization on the ability to create nested virtual memories. Their implementation required hardware mechanisms and corresponding software support, which bear little resemblance to today's x86 architecture and operating systems.

Belpaire and Hsu also presented an alternative approach for nested virtualization [9]. In contrast to today's x86 architecture which has a single level of architectural support for virtualization, they proposed a hardware architecture with multiple virtualization levels.

The IBM z/VM hypervisor [35] included the first practical implementation of nested virtualization, by making use of multiple levels of architectural support. Nested virtualization was also implemented by Ford et al. in a microkernel setting [16] by modifying the software stack at all levels. Their goal was to enhance OS modularity, flexibility, and extensibility, rather than run unmodified hypervisors and their guests.

During the last decade software virtualization technologies for x86 systems rapidly emerged and were widely adopted by the market, causing both AMD and Intel to add virtualization extensions to their x86 platforms (AMD SVM [4] and Intel VMX [48]). KVM [29] was the first x86 hypervisor to support nested virtualization. Concurrent with this work, Alexander Graf and Joerg Roedel implemented nested support for AMD processors in KVM [23]. Despite the differences between VMX and SVM—VMX takes approximately twice as many lines of code to implement—nested SVM shares many of the same underlying principles as the Turtles project. Multi-dimensional paging was also added to nested SVM based on our work, but multi-level device assignment is not implemented.

There was also a recent effort to incorporate nested virtualization into the Xen hypervisor [24], which again appears to share many of the same underlying principles as our work. It is, however, at an early stage: it can only run a single nested guest on a single CPU, does not have multi-dimensional paging or multi-level device assignment, and no performance results have been published.

Blue Pill [43] is a root-kit based on hardware virtualization extensions. It is loaded during boot time by infecting the disk master boot record. It emulates VMX in order to remain functional and avoid detection when a hypervisor is installed in the system. Blue Pill's nested virtualization support is minimal since it only needs to remain undetectable [17]. In contrast, a hypervisor with nested virtualization support must efficiently multiplex the hardware across multiple levels of virtualization, dealing with all of CPU, MMU, and I/O issues. Unfortunately, according to its creators, Blue Pill's nested VMX implementation cannot be published.

ScaleMP vSMP is a commercial product which aggregates multiple x86 systems into a single SMP virtual machine. ScaleMP recently announced a new "VM on VM" feature which allows running a hypervisor on top of their underlying hypervisor. No details have been published on the implementation.

Berghmans demonstrates another approach to nested x86 virtualization, where a software-only hypervisor is run on a hardware-assisted hypervisor [12]. In contrast,
our approach allows both hypervisors to take advantage
of the virtualization hardware, leading to a more efficient
implementation.

3 Turtles: Design and Implementation


The IBM Turtles nested virtualization project implements nested virtualization for Intel's virtualization technology based on the KVM [29] hypervisor. It can host multiple guest hypervisors simultaneously, each with its own multiple nested guest operating systems. We have tested it with unmodified KVM and VMware Server as guest hypervisors, and unmodified Linux and Windows as nested guest virtual machines. Since we treat nested hypervisors and virtual machines as unmodified black boxes, the Turtles project should also run any other x86 hypervisor and operating system.

The Turtles project is fairly mature: it has been tested running multiple hypervisors simultaneously, supports SMP, and takes advantage of two-dimensional page table hardware where available in order to implement nested MMU virtualization via multi-dimensional paging. It also makes use of multi-level device assignment for efficient nested I/O virtualization.

3.1 Theory of Operation

There are two possible models for nested virtualization, which differ in the amount of support provided by the underlying architecture. In the first model, multi-level architectural support for nested virtualization, each hypervisor handles all traps caused by sensitive instructions of any guest hypervisor running directly on top of it. This model is implemented for example in the IBM System z architecture [35].

The second model, single-level architectural support for nested virtualization, has only a single hypervisor mode, and a trap at any nesting level is handled by this hypervisor. As illustrated in Figure 1, regardless of the level in which a trap occurred, execution returns to the level 0 trap handler. Therefore, any trap occurring at any level from 1 . . . n causes execution to drop to level 0. This limited model is implemented by both Intel and AMD in their respective x86 virtualization extensions, VMX [48] and SVM [4].

Figure 1: Nested traps with single-level architectural support for virtualization

Since the Intel x86 architecture is a single-level virtualization architecture, only a single hypervisor can use the processor's VMX instructions to run its guests. For unmodified guest hypervisors to use VMX instructions, this single bare-metal hypervisor, which we call L0, needs to emulate VMX. This emulation of VMX can work recursively. Given that L0 provides a faithful emulation of the VMX hardware any time there is a trap on VMX instructions, the guest running on L1 will not know it is not running directly on the hardware. Building on this infrastructure, the guest at L1 is itself able to use the same techniques to emulate the VMX hardware to an L2 hypervisor which can then run its L3 guests. More generally, given that the guest at Ln−1 provides a faithful emulation of VMX to guests at Ln, a guest at Ln can use the exact same techniques to emulate VMX for a guest at Ln+1. We thus limit our discussion below to L0, L1, and L2.

Fundamentally, our approach for nested virtualization works by multiplexing multiple levels of virtualization (multiple hypervisors) on the single level of architectural support for virtualization, as can be seen in Figure 2. Traps are forwarded by L0 between the different levels.

Figure 2: Multiplexing multiple levels of virtualization on a single hardware-provided level of support

When L1 wishes to run a virtual machine, it launches it via the standard architectural mechanism. This causes a trap, since L1 is not running in the highest privilege level (as is L0). To run the virtual machine, L1 supplies a specification of the virtual machine to be launched, which includes properties such as its initial instruction pointer and its page table root. This specification must be translated by L0 into a specification that can be used to run L2 directly on the bare metal, e.g., by converting memory addresses from L1's physical address space to L0's physical address space. Thus L0 multiplexes the hardware between L1 and L2, both of which end up running as L0 virtual machines.

When any hypervisor or virtual machine causes a trap, the L0 trap handler is called.
The trap handler then inspects the trapping instruction and its context, and decides whether that trap should be handled by L0 (e.g., because the trapping context was L1) or whether to forward it to the responsible hypervisor (e.g., because the trap occurred in L2 and should be handled by L1). In the latter case, L0 forwards the trap to L1 for handling.

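In code, L0's decision can be pictured as a small dispatch routine in its exit handler. The following sketch is illustrative only; the type and helper names (exit_info, reflect_exit_to_l1, and so on) are hypothetical and do not correspond to the actual KVM implementation.

/* Sketch of L0's exit dispatch for nested guests (hypothetical names). */
#include <stdint.h>
#include <stdbool.h>

enum ctx { CTX_L1, CTX_L2 };

struct exit_info {
    enum ctx  who;          /* which guest was running when the trap hit */
    uint32_t  exit_reason;  /* reason reported by the hardware           */
};

/* Hypothetical helpers assumed to exist elsewhere in L0. */
void emulate_for_l1(struct exit_info *e);      /* emulate L1's trapped op  */
bool l1_wants_exit(uint32_t reason);           /* did VMCS1->2 ask for it? */
void handle_in_l0(struct exit_info *e);        /* e.g., external interrupt */
void reflect_exit_to_l1(struct exit_info *e);  /* fake an L2->L1 exit      */
void resume_l1(void);
void resume_l2(void);

void l0_handle_exit(struct exit_info *e)
{
    if (e->who == CTX_L1) {
        /* Trap caused directly by L1: emulate it and resume L1. */
        emulate_for_l1(e);
        resume_l1();
    } else if (!l1_wants_exit(e->exit_reason)) {
        /* L2 exit that only concerns L0: handle it transparently. */
        handle_in_l0(e);
        resume_l2();
    } else {
        /* L2 exit that L1 asked to trap: forward it to L1. */
        reflect_exit_to_l1(e);
        resume_l1();
    }
}

In the forwarding case, L1 later resumes L2 with vmresume or vmlaunch, which again traps to L0, closing the loop described above.
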
When there are n levels of nesting guests, but the hardware supports less than n levels of MMU or DMA translation tables, the n levels need to be compressed onto the levels available in hardware, as described in Sections 3.3 and 3.4.

3.2 CPU: Nested VMX Virtualization

Virtualizing the x86 platform used to be complex and slow [40, 41, 49]. The hypervisor was forced to resort to on-the-fly binary translation of privileged instructions [3], slow machine emulation [8], or changes to guest operating systems at the source code level [6] or during compilation [32].

In due time Intel and AMD incorporated hardware virtualization extensions in their CPUs. These extensions introduced two new modes of operation: root mode and guest mode, enabling the CPU to differentiate between running a virtual machine (guest mode) and running the hypervisor (root mode). Both Intel and AMD also added special in-memory virtual machine control structures (VMCS and VMCB, respectively) which contain environment specifications for virtual machines and the hypervisor.

The VMX instruction set and the VMCS layout are explained in detail in [27]. Data stored in the VMCS can be divided into three groups. Guest state holds virtualized CPU registers (e.g., control registers or segment registers) which are automatically loaded by the CPU when switching from root mode to guest mode on VMEntry. Host state is used by the CPU to restore register values when switching back from guest mode to root mode on VMExit. Control data is used by the hypervisor to inject events such as exceptions or interrupts into virtual machines and to specify which events should cause a VMExit; it is also used by the CPU to specify the VMExit reason to the hypervisor.

In nested virtualization, the hypervisor running in root mode (L0) runs other hypervisors (L1) in guest mode. L1 hypervisors have the illusion they are running in root mode. Their virtual machines (L2) also run in guest mode.

As can be seen in Figure 3, L0 is responsible for multiplexing the hardware between L1 and L2. The CPU runs L1 using the VMCS0→1 environment specification. Respectively, VMCS0→2 is used to run L2. Both of these environment specifications are maintained by L0. In addition, L1 creates VMCS1→2 within its own virtualized environment. Although VMCS1→2 is never loaded into the processor, L0 uses it to emulate a VMX-enabled CPU for L1.

Figure 3: Extending VMX for nested virtualization

3.2.1 VMX Trap and Emulate

VMX instructions can only execute successfully in root mode. In the nested case, L1 uses VMX instructions in guest mode to load and launch L2 guests, which causes VMExits. This enables L0, running in root mode, to trap and emulate the VMX instructions executed by L1.

In general, when L0 emulates VMX instructions, it updates VMCS structures according to the update process described in the next section. Then, L0 resumes L1, as though the instructions were executed directly by the CPU. Most of the VMX instructions executed by L1 cause, first, a VMExit from L1 to L0, and then a VMEntry from L0 to L1.

For the instructions used to run a new VM, vmresume and vmlaunch, the process is different, since L0 needs to emulate a VMEntry from L1 to L2. Therefore, any execution of these instructions by L1 causes, first, a VMExit from L1 to L0, and then, a VMEntry from L0 to L2.

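A minimal sketch of this trap-and-emulate loop for VMX instructions is shown below, assuming a hypothetical decoder has already identified which instruction L1 executed; none of the names are taken from the real KVM code.

/* Sketch of L0 dispatching VMX instructions trapped from L1
 * (hypothetical names only). */
enum vmx_insn { INSN_VMREAD, INSN_VMWRITE, INSN_VMPTRLD,
                INSN_VMLAUNCH, INSN_VMRESUME };

void emulate_vmread(void);    /* copy a field of VMCS1->2 to L1's register */
void emulate_vmwrite(void);   /* update a field of VMCS1->2               */
void emulate_vmptrld(void);   /* remember which VMCS L1 made current      */
void enter_l2(void);          /* merge into VMCS0->2 and run L2           */
void resume_l1(void);

void l0_emulate_vmx_insn(enum vmx_insn insn)
{
    switch (insn) {
    case INSN_VMREAD:   emulate_vmread();  resume_l1(); break;
    case INSN_VMWRITE:  emulate_vmwrite(); resume_l1(); break;
    case INSN_VMPTRLD:  emulate_vmptrld(); resume_l1(); break;
    case INSN_VMLAUNCH:                        /* fall through */
    case INSN_VMRESUME:
        /* L1 asked to run L2: emulate a VMEntry from L1 to L2 by
         * entering L2 on L1's behalf instead of returning to L1. */
        enter_l2();
        break;
    }
}
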
3.2.2 VMCS Shadowing

L0 prepares a VMCS (VMCS0→1) to run L1, exactly in the same way a hypervisor executes a guest with a single level of virtualization. From the hardware's perspective, the processor is running a single hypervisor (L0) in root mode and a guest (L1) in guest mode. L1 is not aware that it is running in guest mode and uses VMX instructions to create the specifications for its own guest, L2.

L1 defines L2's environment by creating a VMCS (VMCS1→2) which contains L2's environment from L1's perspective. For example, the VMCS1→2 GUEST-CR3 field points to the page tables that L1 prepared for L2. L0 cannot use VMCS1→2 to execute L2 directly, since VMCS1→2 is not valid in L0's environment and L0 cannot use L1's page tables to run L2.
Instead, L0 uses VMCS1→2 to construct a new VMCS (VMCS0→2) that holds L2's environment from L0's perspective.

L0 must consider all the specifications defined in VMCS1→2 and also the specifications defined in VMCS0→1 to create VMCS0→2. The host state defined in VMCS0→2 must contain the values required by the CPU to correctly switch back from L2 to L0. In addition, VMCS1→2 host state must be copied to VMCS0→1 guest state. Thus, when L0 emulates a switch from L2 to L1, the processor loads the correct L1 specifications.

The guest state stored in VMCS1→2 does not require any special handling in general, and most fields can be copied directly to the guest state of VMCS0→2.

The control data of VMCS1→2 and VMCS0→1 must be merged to correctly emulate the processor behavior. For example, consider the case where L1 specifies to trap an event EA in VMCS1→2 but L0 does not trap such an event for L1 (i.e., a trap is not specified in VMCS0→1). To forward the event EA to L1, L0 needs to specify the corresponding trap in VMCS0→2. In addition, the field used by L1 to inject events to L2 needs to be merged, as well as the fields used by the processor to specify the exit cause.

For the sake of brevity, we omit some details on how specific VMCS fields are merged. For the complete details, the interested reader is encouraged to refer to the KVM source code [29].

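The following sketch illustrates the spirit of the merge, using an invented three-field vmcs structure rather than the real VMCS layout; the exception-bitmap union is just one concrete example of control-data merging.

/* Rough sketch of constructing VMCS0->2 from VMCS1->2 and VMCS0->1.
 * The struct and field names are invented for illustration. */
struct vmcs {
    struct { unsigned long cr3, rip, rsp; }   guest;    /* guest state  */
    struct { unsigned long cr3, rip, rsp; }   host;     /* host state   */
    struct { unsigned int exception_bitmap; } control;  /* control data */
};

void merge_vmcs(const struct vmcs *vmcs12, const struct vmcs *vmcs01,
                struct vmcs *vmcs02)
{
    /* Guest state: mostly copied as-is from L1's view of L2. */
    vmcs02->guest = vmcs12->guest;

    /* Host state: the CPU must return control to L0, not L1, on exit. */
    vmcs02->host = vmcs01->host;

    /* Control data: trap everything either L0 or L1 wants trapped, so
     * that L0 can later decide whether to handle or forward the exit. */
    vmcs02->control.exception_bitmap =
        vmcs12->control.exception_bitmap |
        vmcs01->control.exception_bitmap;
}
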
3.2.3 VMEntry and VMExit Emulation

In nested environments, switches from L1 to L2 and back must be emulated. When L2 is running and a VMExit occurs there are two possible handling paths, depending on whether the VMExit must be handled only by L0 or must be forwarded to L1.

When the event causing the VMExit is related to L0 only, L0 handles the event and resumes L2. This kind of event can be an external interrupt, a non-maskable interrupt (NMI) or any trappable event specified in VMCS0→2 that was not specified in VMCS1→2. From L1's perspective this event does not exist because it was generated outside the scope of L1's virtualized environment. By analogy to the non-nested scenario, an event occurred at the hardware level, the CPU transparently handled it, and the hypervisor continued running as before.

The second handling path is caused by events related to L1 (e.g., trappable events specified in VMCS1→2). In this case L0 forwards the event to L1 by copying VMCS0→2 fields updated by the processor to VMCS1→2 and resuming L1. The hypervisor running in L1 believes there was a VMExit directly from L2 to L1. The L1 hypervisor handles the event and later on resumes L2 by executing vmresume or vmlaunch, both of which will be emulated by L0.

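A simplified sketch of this forwarding path, with hypothetical helper names, could look as follows.

/* Sketch of reflecting an L2 exit to L1 (hypothetical names, not the
 * actual KVM implementation). */
struct vmcs_exit_fields {
    unsigned int  exit_reason;
    unsigned long exit_qualification;
    unsigned int  interrupt_info;
};

struct vmcs_exit_fields read_exit_fields_from_vmcs02(void);
void write_exit_fields_to_vmcs12(struct vmcs_exit_fields f);
void load_l1_host_state_as_vmcs01_guest_state(void);
void resume_l1(void);

void forward_l2_exit_to_l1(void)
{
    /* 1. Copy the exit information the CPU recorded in VMCS0->2 into
     *    VMCS1->2, where L1 expects to find it. */
    write_exit_fields_to_vmcs12(read_exit_fields_from_vmcs02());

    /* 2. Make L1 resume at its own exit handler: the host state L1
     *    placed in VMCS1->2 becomes the guest state used to run L1. */
    load_l1_host_state_as_vmcs01_guest_state();

    /* 3. Run L1; it believes L2 exited directly to it and will later
     *    issue vmresume/vmlaunch, which L0 traps and emulates. */
    resume_l1();
}
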

3.3 MMU: Multi-dimensional Paging

In addition to virtualizing the CPU, a hypervisor also needs to virtualize the MMU: A guest OS builds a guest page table which translates guest virtual addresses to guest physical addresses. These must be translated again into host physical addresses. With nested virtualization, a third layer of address translation is needed.

These translations can be done entirely in software, or assisted by hardware. However, as we explain below, current hardware supports only one or two dimensions (levels) of translation, not the three needed for nested virtualization. In this section we present a new technique, multi-dimensional paging, for multiplexing the three needed translation tables onto the two available in hardware. In Section 4.1.2 we demonstrate the importance of this technique, showing that more naïve approaches (surveyed below) cause at least a three-fold slowdown of some useful workloads.

When no hardware support for memory management virtualization was available, a technique known as shadow page tables [15] was used. A guest creates a guest page table, which translates guest virtual addresses to guest physical addresses. Based on this table, the hypervisor creates a new page table, the shadow page table, which translates guest virtual addresses directly to the corresponding host physical address [3, 6]. The hypervisor then runs the guest using this shadow page table instead of the guest's page table. The hypervisor has to trap all guest paging changes, including page fault exceptions, the INVLPG instruction, context switches (which cause the use of a different page table) and all the guest updates to the page table.

To improve virtualization performance, x86 architectures recently added two-dimensional page tables [13]—a second translation table in the hardware MMU. When translating a guest virtual address, the processor first uses the regular guest page table to translate it to a guest physical address. It then uses the second table, called EPT by Intel (and NPT by AMD), to translate the guest physical address to a host physical address. When an entry is missing in the EPT table, the processor generates an EPT violation exception. The hypervisor is responsible for maintaining the EPT table and its cache (which can be flushed with INVEPT), and for handling EPT violations, while guest page faults can be handled entirely by the guest.

The hypervisor, depending on the processor's capabilities, decides whether to use shadow page tables or two-dimensional page tables to virtualize the MMU. In nested environments, both hypervisors, L0 and L1, determine independently the preferred mechanism. Thus, the L0 and L1 hypervisors can use the same or a different MMU virtualization mechanism. Figure 4 shows three different nested MMU virtualization models.

Figure 4: MMU alternatives for nested virtualization

Shadow-on-shadow is used when the processor does not support two-dimensional page tables, and is the least efficient method. Initially, L0 creates a shadow page table to run L1 (SPT0→1). L1, in turn, creates a shadow page table to run L2 (SPT1→2). L0 cannot use SPT1→2 to run L2 because this table translates L2 guest virtual addresses to L1 host physical addresses. Therefore, L0 compresses SPT0→1 and SPT1→2 into a single shadow page table, SPT0→2. This new table translates directly from L2 guest virtual addresses to L0 host physical addresses. Specifically, for each guest virtual address in SPT1→2, L0 creates an entry in SPT0→2 with the corresponding L0 host physical address.

Shadow-on-EPT is the most straightforward approach to use when the processor supports EPT. L0 uses the EPT hardware, but L1 cannot use it, so it resorts to shadow page tables. L1 uses SPT1→2 to run L2. L0 configures the MMU to use SPT1→2 as the first translation table and EPT0→1 as the second translation table. In this way, the processor first translates from L2 guest virtual address to L1 host physical address using SPT1→2, and then translates from the L1 host physical address to the L0 host physical address using EPT0→1.

Though the Shadow-on-EPT approach uses the EPT hardware, it still has a noticeable overhead due to page faults and page table modifications in L2. These must be handled in L1, to maintain the shadow page table. Each of these faults and writes causes VMExits and must be forwarded from L0 to L1 for handling. In other words, Shadow-on-EPT is slow for exactly the same reasons that Shadow itself was slow for single-level virtualization—but it is even slower because nested exits are slower than non-nested exits.

In multi-dimensional page tables, as in two-dimensional page tables, each level creates its own separate translation table. For L1 to create an EPT table, L0 exposes EPT capabilities to L1, even though the hardware only provides a single EPT table.

Since only one EPT table is available in hardware, the two EPT tables should be compressed into one: Let us assume that L0 runs L1 using EPT0→1, and that L1 creates an additional table, EPT1→2, to run L2, because L0 exposed a virtualized EPT capability to L1. The L0 hypervisor could then compress EPT0→1 and EPT1→2 into a single EPT0→2 table as shown in Figure 4. Then L0 could run L2 using EPT0→2, which translates directly from the L2 guest physical address to the L0 host physical address, reducing the number of page fault exits and improving nested virtualization performance. In Section 4.1.2 we demonstrate more than a three-fold speedup of some useful workloads with multi-dimensional page tables, compared to shadow-on-EPT.

The L0 hypervisor launches L2 with an empty EPT0→2 table, building the table on-the-fly, on L2 EPT-violation exits. These happen when a translation for a guest physical address is missing in the EPT table. If there is no translation in EPT1→2 for the faulting address, L0 first lets L1 handle the exit and update EPT1→2. L0 can now create an entry in EPT0→2 that translates the L2 guest physical address directly to the L0 host physical address: EPT1→2 is used to translate the L2 physical address to an L1 physical address, and EPT0→1 translates that into the desired L0 physical address.

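The core of this EPT-violation handler can be sketched as a composition of the two translations, here reduced to flat lookup arrays purely for illustration (real EPT structures are multi-level trees, and the names are invented).

/* Minimal sketch of multi-dimensional paging: on an L2 EPT violation,
 * L0 composes EPT1->2 and EPT0->1 to fill in EPT0->2. */
#include <stdint.h>

#define NOT_MAPPED UINT64_MAX

typedef uint64_t gfn_t;               /* guest-physical frame number    */
struct ept { uint64_t *map; };        /* map[from_gfn] == to_gfn        */

/* Hypothetical hook: let L1 handle the violation and fill EPT1->2.    */
void let_l1_handle_ept_violation(gfn_t l2_gfn);

void handle_l2_ept_violation(struct ept *ept12, struct ept *ept01,
                             struct ept *ept02, gfn_t l2_gfn)
{
    if (ept12->map[l2_gfn] == NOT_MAPPED) {
        /* L1 has no translation either: forward the violation to L1,
         * which updates EPT1->2; the next violation retries here.    */
        let_l1_handle_ept_violation(l2_gfn);
        return;
    }
    /* Compose the two translations: L2-phys -> L1-phys -> L0-phys.   */
    gfn_t l1_gfn = ept12->map[l2_gfn];
    gfn_t l0_gfn = ept01->map[l1_gfn];
    ept02->map[l2_gfn] = l0_gfn;       /* future accesses hit directly */
}
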
To maintain correctness of EPT0→2, the L0 hypervisor needs to know of any changes that L1 makes to EPT1→2. L0 sets the memory area of EPT1→2 as read-only, thereby causing a trap when L1 tries to update it. L0 will then update EPT0→2 according to the changed entries in EPT1→2. L0 also needs to trap all L1 INVEPT instructions, and invalidate the EPT cache accordingly.

By using huge pages [34] to back guest memory, L0 can create smaller and faster EPT tables. Finally, to further improve performance, L0 also allows L1 to use VPIDs. With this feature, the CPU tags each translation in the TLB with a numeric virtual-processor id, eliminating the need for TLB flushes on every VMEntry and VMExit. Since each hypervisor is free to choose these VPIDs arbitrarily, they might collide and therefore L0 needs to map the VPIDs that L1 uses into valid L0 VPIDs.

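The VPID remapping itself is a small bookkeeping table. The sketch below assumes a single L1 for simplicity and invents all names; a real implementation would also handle VPID exhaustion and teardown.

/* Sketch of remapping the VPIDs chosen by L1 onto VPIDs that are
 * unique from L0's point of view (illustrative only). */
#include <stdint.h>

#define MAX_VPID 65536

static uint16_t vpid_map[MAX_VPID];      /* L1 vpid -> L0 vpid, 0 = unused */
static uint16_t next_free_l0_vpid = 1;   /* VPID 0 is reserved for L0      */

/* Return the L0 VPID to place in VMCS0->2 for the VPID L1 put in VMCS1->2. */
uint16_t map_l1_vpid(uint16_t l1_vpid)
{
    if (vpid_map[l1_vpid] == 0)
        vpid_map[l1_vpid] = next_free_l0_vpid++;   /* allocate lazily */
    return vpid_map[l1_vpid];
}
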
3.4 I/O: Multi-level Device Assignment

I/O is the third major challenge in server virtualization. There are three approaches commonly used to provide I/O services to a guest virtual machine. Either the hypervisor emulates a known device and the guest uses an unmodified driver to interact with it [47], or a para-virtual driver is installed in the guest [6, 42], or the host assigns a real device to the guest which then controls the device directly [11, 31, 37, 52, 53]. Device assignment generally provides the best performance [33, 38, 53], since it minimizes the number of I/O-related world switches between the virtual machine and its hypervisor, and although it complicates live migration, device assignment and live migration can peacefully coexist [26, 28, 54].

These three basic I/O approaches for a single-level guest imply nine possible combinations in the two-level nested guest case. Of the nine potential combinations we evaluated the more interesting cases, presented in Table 1. Implementing the first four alternatives is straightforward. We describe the last option, which we call multi-level device assignment, below. Multi-level device assignment lets the L2 guest access a device directly, bypassing both hypervisors. This direct device access requires dealing with DMA, interrupts, MMIO, and PIOs [53].

I/O virtualization method    I/O virtualization method
between L0 & L1              between L1 & L2
Emulation                    Emulation
Para-virtual                 Emulation
Para-virtual                 Para-virtual
Device assignment            Para-virtual
Device assignment            Device assignment

Table 1: I/O combinations for a nested guest

Device DMA in virtualized environments is complicated, because guest drivers use guest physical addresses, while memory access in the device is done with host physical addresses. The common solution to the DMA problem is an IOMMU [2, 11], a hardware component which resides between the device and main memory. It uses a translation table prepared by the hypervisor to translate the guest physical addresses to host physical addresses. IOMMUs currently available, however, only support a single level of address translation. Again, we need to compress two levels of translation tables onto the one level available in hardware.

For modified guests this can be done using a paravirtual IOMMU: the code in L1 which sets a mapping on the IOMMU from L2 to L1 addresses is replaced by a hypercall to L0. L0 changes the L1 address in that mapping to the respective L0 address, and puts the resulting mapping (from L2 to L0 addresses) in the IOMMU.

A better approach, one which can run unmodified guests, is for L0 to emulate an IOMMU for L1 [5]. L1 believes that it is running on a machine with an IOMMU, and sets up mappings from L2 to L1 addresses on it. L0 intercepts these mappings, remaps the L1 addresses to L0 addresses, and builds the L2-to-L0 map on the real IOMMU.

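The intercept can be sketched as follows; the helper names are invented, and a real implementation would also batch invalidations and handle unmapping.

/* Sketch of multi-level device assignment: L0 intercepts the mapping
 * L1 installs in its emulated IOMMU and installs a remapped entry in
 * the physical IOMMU. Names are illustrative, not a real API. */
#include <stdint.h>

typedef uint64_t dma_addr_t;   /* address the device will issue       */
typedef uint64_t phys_addr_t;  /* a physical address at some level    */

/* Hypothetical helpers assumed to exist in L0. */
phys_addr_t l1_phys_to_l0_phys(phys_addr_t l1_phys);
void hw_iommu_map(dma_addr_t iova, phys_addr_t l0_phys, uint64_t len);

/* Called when L0 traps L1 writing an entry into the emulated IOMMU:
 * the entry maps an L2 DMA address to an L1-physical address.        */
void on_l1_iommu_map(dma_addr_t l2_iova, phys_addr_t l1_phys, uint64_t len)
{
    /* Replace the L1-physical target with the L0-physical page that
     * actually backs it, and program the real IOMMU with the result,
     * so DMA issued by L2's driver lands in the right memory.        */
    hw_iommu_map(l2_iova, l1_phys_to_l0_phys(l1_phys), len);
}
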
3.5.1 Optimizing transitions between L1 and L2
In current x86 architecture, interrupts always cause a guest exit to L0, which proceeds to forward the interrupt to L1. L1 will then inject it into L2. The EOI (end of interrupt) will also cause a guest exit. In Section 4.1.1 we discuss the slowdown caused by these interrupt-related exits, and propose ways to avoid it.

Memory-mapped I/O (MMIO) and Port I/O (PIO) for a nested guest work the same way they work for a single-level guest, without incurring exits on the critical I/O path [53].

3.5 Micro Optimizations

There are two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor. First, the transitions between L1 and L2 are slower than the transitions between L0 and L1. Second, the exit handling code running in the L1 hypervisor is slower than the same code running in L0. In this section we discuss these two issues, and propose optimizations that improve performance. Since we assume that both L1 and L2 are unmodified, these optimizations require modifying L0 only. We evaluate these optimizations in the evaluation section.

3.5.1 Optimizing transitions between L1 and L2

As explained in Section 3.2.3, transitions between L1 and L2 involve an exit to L0 and then an entry. In L0, most of the time is spent merging the VMCS's. We optimize this merging code to only copy data between VMCS's if the relevant values were modified. Keeping track of which values were modified has an intrinsic cost, so one must carefully balance full copying versus partial copying and tracking. We observed empirically that for common workloads and hypervisors, partial copying has a lower overhead.

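A sketch of the idea follows, with an invented field list standing in for the real VMCS fields and assuming the dirty bits are set when L0 emulates L1's vmwrite instructions.

/* Sketch of copying only modified VMCS fields during an L1<->L2
 * transition (field list and names are invented for illustration). */
#include <stdbool.h>
#include <stddef.h>

enum field { F_GUEST_RIP, F_GUEST_RSP, F_GUEST_CR3, F_NUM_FIELDS };

struct shadow_vmcs {
    unsigned long value[F_NUM_FIELDS];
    bool          dirty[F_NUM_FIELDS];   /* set when L1 writes the field */
};

/* Copy into the destination only the fields L1 actually touched,
 * instead of unconditionally copying the whole structure.            */
void sync_dirty_fields(struct shadow_vmcs *src, unsigned long *dst)
{
    for (size_t i = 0; i < F_NUM_FIELDS; i++) {
        if (src->dirty[i]) {
            dst[i] = src->value[i];
            src->dirty[i] = false;
        }
    }
}
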
VMCS merging could be further optimized by copying multiple VMCS fields at once. However, according to Intel's specifications, reads or writes to the VMCS area must be performed using the vmread and vmwrite instructions, which operate on a single field. We empirically noted that under certain conditions one could access VMCS data directly without ill side-effects, bypassing vmread and vmwrite and copying multiple fields at once with large memory copies. However, this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested.

In the evaluation section, we show that this optimization gives a significant performance boost in micro-benchmarks. However, it did not noticeably improve the other, more typical, workloads that we have evaluated.

3.5.2 Optimizing exit handling in L1

The exit-handling code in the hypervisor is slower when run in L1 than the same code running in L0. The main cause of this slowdown is additional exits caused by privileged instructions in the exit-handling code.
In Intel VMX, the privileged instructions vmread and vmwrite are used by the hypervisor to read and modify the guest and host specification. As can be seen in Section 4.3, these cause L1 to exit multiple times while it handles a single L2 exit.

In contrast, in AMD SVM, guest and host specifications can be read or written to directly using ordinary memory loads and stores. The clear advantage of that model is that L0 does not intervene while L1 modifies L2 specifications. Removing the need to trap and emulate special instructions reduces the number of exits and improves nested virtualization performance.

One thing L0 can do to avoid trapping on every vmread and vmwrite is binary translation [3] of problematic vmread and vmwrite instructions in the L1 instruction stream, by trapping the first time such an instruction is called and then rewriting it to branch to a non-trapping memory load or store. To evaluate the potential performance benefit of this approach, we tested a modified L1 that directly reads and writes VMCS1→2 in memory, instead of using vmread and vmwrite. The performance of this setup, which we call DRW (direct read and write), is described in the evaluation section.

4 Evaluation

We start the evaluation and analysis of nested virtualization with macro benchmarks that represent real-life workloads. Next, we evaluate the contribution of multi-level device assignment and multi-dimensional paging to nested virtualization performance. Most of our experiments are executed with KVM as the L1 guest hypervisor. In Section 4.2 we present results with VMware Server as the L1 guest hypervisor.

We then continue the evaluation with a synthetic, worst-case micro benchmark running on L2 which causes guest exits in a loop. We use this synthetic, worst-case benchmark to understand and analyze the overhead and the handling flow of a single L2 exit.

Our setup consisted of an IBM x3650 machine booted with a single Intel Xeon 2.9GHz core and with 3GB of memory. The host OS was Ubuntu 9.04 with a kernel that is based on the KVM git tree version kvm-87, with our nested virtualization support added. For both L1 and L2 guests we used an Ubuntu Jaunty guest with a kernel that is based on the KVM git tree, version kvm-87. L1 was configured with 2GB of memory and L2 was configured with 1GB of memory. For the I/O experiments we used a Broadcom NetXtreme 1Gb/s NIC connected via crossover-cable to an e1000e NIC on another machine.

4.1 Macro Workloads

kernbench is a general purpose compilation-type benchmark that compiles the Linux kernel multiple times. The compilation process is, by nature, CPU- and memory-intensive, and it also generates disk I/O to load the compiled files into the guest's page cache.

SPECjbb is an industry-standard benchmark designed to measure the server-side performance of Java run-time environments. It emulates a three-tier system and is primarily CPU-intensive.

We executed kernbench and SPECjbb in four setups: host, single-level guest, nested guest, and nested guest optimized with direct read and write (DRW) as described in Section 3.5.2. The optimizations described in Section 3.5.1 did not make a significant difference to these benchmarks, and are thus omitted from the results. We used KVM as both the L0 and L1 hypervisor with multi-dimensional paging. The results are depicted in Table 2.

Kernbench
                          Host     Guest    Nested   NestedDRW
Run time                  324.3    355      406.3    391.5
STD dev.                  1.5      10       6.7      3.1
% overhead vs. host       -        9.5      25.3     20.7
% overhead vs. guest      -        -        14.5     10.3
%CPU                      93       97       99       99

SPECjbb
                          Host     Guest    Nested   NestedDRW
Score                     90493    83599    77065    78347
STD dev.                  1104     1230     1716     566
% degradation vs. host    -        7.6      14.8     13.4
% degradation vs. guest   -        -        7.8      6.3
%CPU                      100      100      100      100

Table 2: kernbench and SPECjbb results

We compared the impact of running the workloads in a nested guest with running the same workload in a single-level guest, i.e., the overhead added by the additional level of virtualization. For kernbench, the overhead of nested virtualization is 14.5%, while for SPECjbb the score is degraded by 7.82%. When we discount the Intel-specific vmread and vmwrite overhead in L1, the overhead is 10.3% and 6.3% respectively.

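For concreteness, the nested overheads quoted above follow directly from the kernbench run times in Table 2:

\[
\frac{406.3 - 355}{355} \approx 14.5\%
\qquad\text{and}\qquad
\frac{391.5 - 355}{355} \approx 10.3\%
\]

The SPECjbb degradations are computed the same way from the scores.
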
To analyze the sources of overhead, we examine the time distribution between the different levels. Figure 5 shows the time spent in each level. It is interesting to compare the time spent in the hypervisor in the single-level case with the time spent in L1 in the nested guest case, since both hypervisors are expected to do the same work.
The times are indeed similar, although the L1 hypervisor takes more cycles due to cache pollution and TLB flushes, as we show in Section 4.3. The significant part of the virtualization overhead in the nested case comes from the time spent in L0 and the increased number of exits.

Figure 5: CPU cycle distribution

For SPECjbb, the total number of cycles across all levels is the same for all setups. This is because SPECjbb executed for the same pre-set amount of time in both cases and the difference was in the benchmark score.

Efficiently virtualizing a hypervisor is hard. Nested virtualization creates a new kind of workload for the L0 hypervisor which did not exist before: running another hypervisor (L1) as a guest. As can be seen in Figure 5, for kernbench L0 takes only 2.28% of the overall cycles in the single-level guest case, but takes 5.17% of the overall cycles for the nested-guest case. In other words, L0 has to work more than twice as hard when running a nested guest.

Not all exits of L2 incur the same overhead, as each type of exit requires different handling in L0 and L1. In Figure 6, we show the total number of cycles required to handle each exit type. For the single level guest we measured the number of cycles between VMExit and the consequent VMEntry. For the nested guest we measured the number of cycles spent between L2 VMExit and the consequent L2 VMEntry.

Figure 6: Cycle costs of handling different types of exits

There is a large variance between the handling times of different types of exits. The cost of each exit comes primarily from the number of privileged instructions performed by L1, each of which causes an exit to L0. For example, when L1 handles a PIO exit of L2, it generates on average 31 additional exits, whereas in the cpuid case discussed later in Section 4.3 only 13 exits are required. Discounting traps due to vmread and vmwrite, the average number of exits was reduced to 14 for PIO and to 2 for cpuid.

Another source of overhead is heavy-weight exits. The external interrupt exit handler takes approximately 64K cycles when executed by L0. The PIO exit handler takes approximately 12K cycles when executed by L0. However, when those handlers are executed by L1, they take much longer: approximately 192K cycles and 183K cycles, respectively. Discounting traps due to vmread and vmwrite, they take approximately 148K cycles and 130K cycles, respectively. This difference in execution times between L0 and L1 is due to two reasons: first, the handlers execute privileged instructions causing exits to L0. Second, the handlers run for a long time compared with other handlers and therefore more external events such as external interrupts occur during their run-time.

4.1.1 I/O Intensive Workloads

To examine the performance of a nested guest in the case of I/O intensive workloads we used netperf, a TCP streaming application that attempts to maximize the amount of data sent over a single TCP connection. We measured the performance on the sender side, with the default settings of netperf (16,384 byte messages).

Figure 7 shows the results for running the netperf TCP stream test on the host, in a single-level guest, and in a nested guest, using the five I/O virtualization combinations described in Section 3.4. We used KVM's default emulated NIC (RTL-8139), virtio [42] for a paravirtual NIC, and a 1 Gb/s Broadcom NetXtreme II with device assignment. All tests used a single CPU core.

Figure 7: Performance of netperf in various setups

On bare-metal, netperf easily achieved line rate (940 Mb/s) with 20% CPU utilization.

Emulation gives a much lower throughput, with full CPU utilization: On a single-level guest we get 25% of the line rate. On the nested guest the throughput is even lower and the overhead is dominated by the cost of device emulation between L1 and L2. Each L2 exit is trapped by L0 and forwarded to L1. For each L2 exit, L1 then executes multiple privileged instructions, incurring multiple exits back to L0. In this way the overhead for each L2 exit is multiplied.
The para-virtual virtio NIC performs better than emulation since it reduces the number of exits. Using virtio all the way up to L2 gives 75% of line rate with a saturated CPU, better but still considerably below bare-metal performance.

Multi-level device assignment achieved the best performance, with line rate at 60% CPU utilization (Figure 7, direct/direct). Using device assignment between L0 and L1 and virtio between L1 and L2 enables the L2 guest to saturate the 1Gb link with 92% CPU utilization (Figure 7, direct/virtio).

While multi-level device assignment outperformed the other methods, its measured performance is still suboptimal because 60% of the CPU is used for running a workload that only takes 20% on bare-metal. Unfortunately, on the current x86 architecture, interrupts cannot be assigned to guests, so both the interrupt itself and its EOI cause exits. The more interrupts the device generates, the more exits, and therefore the higher the virtualization overhead—which is more pronounced in the nested case. We hypothesize that these interrupt-related exits are the biggest source of the remaining overhead, so had the architecture given us a way to avoid these exits—by assigning interrupts directly to guests rather than having each interrupt go through both hypervisors—netperf performance on L2 would be close to that of bare-metal.

To test this hypothesis we reduced the number of interrupts by modifying the standard bnx2 network driver to work without any interrupts, i.e., to continuously poll the device for pending events.

Figure 8 compares some of the I/O virtualization combinations with this polling driver. Again, multi-level device assignment is the best option and, as we hypothesized, this time L2 performance is close to bare-metal.

Figure 8: Performance of netperf with interrupt-less network driver

With netperf's default 16,384 byte messages, the throughput is often capped by the 1 Gb/s line rate, so we ran netperf with smaller messages. As we can see in the figure, for 64-byte messages, for example, on L0 (bare metal) a throughput of 900 Mb/s is achieved, while on L2 with multi-level device assignment, we get 837 Mb/s, a mere 7% slowdown. The runner-up method, virtio on direct, was not nearly as successful, and achieved just 469 Mb/s, 50% below bare-metal performance. CPU utilization was 100% in all cases since a polling driver consumes all available CPU cycles.

4.1.2 Impact of Multi-dimensional Paging

To evaluate multi-dimensional paging, we compared each of the macro benchmarks described in the previous sections with and without multi-dimensional paging. For each benchmark we configured L0 to run L1 with EPT support. We then compared the case where L1 uses shadow page tables to run L2 ("Shadow-on-EPT") with the case of L1 using EPT to run L2 ("multi-dimensional paging").

Figure 9: Impact of multi-dimensional paging

Figure 9 shows the results. The overhead between the two cases is mostly due to the number of page-fault exits. When shadow paging is used, each page fault of the L2 guest results in a VMExit.
When multi-dimensional paging is used, only an access to a guest physical page that is not mapped in the EPT table will cause an EPT violation exit. Therefore the impact of multi-dimensional paging depends on the number of guest page faults, which is a property of the workload. The improvement is startling in benchmarks such as kernbench with a high number of page faults, and is less pronounced in workloads that do not incur many page faults.

4.2 VMware Server as a Guest Hypervisor

We also evaluated VMware as the L1 hypervisor to analyze how a different guest hypervisor affects nested virtualization performance. We used the hosted version, VMware Server v2.0.1, build 156745 x86-64, on top of Ubuntu based on kernel 2.6.28-11. We intentionally did not install VMware tools for the L2 guest, thereby increasing nested virtualization overhead. Due to similar results obtained for VMware and KVM as the nested hypervisor, we show only kernbench and SPECjbb results below.

Benchmark     % overhead vs. single-level guest
kernbench     14.98
SPECjbb       8.85

Table 3: VMware Server as a guest hypervisor

Examining L1 exits, we noticed VMware Server uses VMX initialization instructions (vmon, vmoff, vmptrld, vmclear) several times during L2 execution. Conversely, KVM uses them only once. This dissimilitude derives mainly from the approach used by VMware to interact with the host Linux kernel. Each time the monitor module takes control of the CPU, it enables VMX. Then, before it releases control to the Linux kernel, VMX is disabled. Furthermore, during this transition many non-VMX privileged instructions are executed by L1, increasing L0 intervention.

Although all these initialization instructions are emulated by L0, transitions from the VMware monitor module to the Linux kernel are less frequent for kernbench and SPECjbb. The VMware monitor module typically handles multiple L2 exits before switching to the Linux kernel. As a result, this behavior only slightly affected the nested virtualization performance.

4.3 Micro Benchmark Analysis

To analyze the cycle-costs of handling a single L2 exit, we ran a micro benchmark in L2 that does nothing except generate exits by calling cpuid in a loop. The virtualization overhead for running an L2 guest is the ratio between the effective work done by the L2 guest and the overhead of handling guest exits in L0 and L1. Based on this definition, this cpuid micro benchmark is a worst-case workload, since L2 does virtually nothing except generate exits. We note that cpuid cannot in the general case be handled by L0 directly, as L1 may wish to modify the values returned to L2.

Figure 10 shows the number of CPU cycles required to execute a single cpuid instruction. We ran the cpuid instruction 4×10^6 times and calculated the average number of cycles per iteration. We repeated the test for the following setups: 1. native, 2. running cpuid in a single level guest, and 3. running cpuid in a nested guest with and without the optimizations described in Section 3.5. For each execution, we present the distribution of the cycles between the levels: L0, L1, L2. CPU mode switch stands for the number of cycles spent by the CPU when performing a VMEntry or a VMExit. On bare metal cpuid takes about 100 cycles, while in a virtual machine it takes about 2,600 cycles (Figure 10, column 1), about 1,000 of which is due to the CPU mode switching. When run in a nested virtual machine it takes about 58,000 cycles (Figure 10, column 2).

Figure 10: CPU cycle distribution for cpuid

To understand the cost of handling a nested guest exit compared to the cost of handling the same exit for a single-level guest, we analyzed the flow of handling cpuid:

1. L2 executes a cpuid instruction
2. CPU traps and switches to root mode L0
3. L0 switches state from running L2 to running L1
4. CPU switches to guest mode L1
5. L1 modifies VMCS1→2
   repeat n times:
   (a) L1 accesses VMCS1→2
   (b) CPU traps and switches to root mode L0
   (c) L0 emulates the VMCS1→2 access and resumes L1
   (d) CPU switches to guest mode L1
6. L1 emulates cpuid for L2
7. L1 executes a resume of L2
8. CPU traps and switches to root mode L0
9. L0 switches state from running L1 to running L2
10. CPU switches to guest mode L2

In general, step 5 can be repeated multiple times. Each iteration consists of a single VMExit from L1 to L0. The total number of exits depends on the specific implementation of the L1 hypervisor. A nesting-friendly hypervisor will keep privileged instructions to a minimum. In any case, the L1 hypervisor must interact with VMCS1→2, as described in Section 3.2.2. In the case of cpuid, in step 5, L1 reads 7 fields of VMCS1→2, and writes 4 fields to VMCS1→2, which ends up as 11 VMExits from L1 to L0. Overall, for a single L2 cpuid exit there are 13 CPU mode switches from guest mode to root mode and 13 CPU mode switches from root mode to guest mode, specifically in steps 2, 4, 5b, 5d, 8, and 10.

The number of cycles the CPU spends in a single switch to guest mode plus the number of cycles to switch back to root mode is approximately 1,000. The total CPU switching cost is therefore around 13,000 cycles.

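Putting the figures above together as a back-of-the-envelope estimate:

\[
13 \times {\sim}1{,}000 \approx 13{,}000 \ \text{cycles},
\qquad
\frac{13{,}000}{58{,}000} \approx 22\%
\]

that is, roughly a fifth of the ~58,000-cycle nested cpuid exit is spent purely on hardware mode switches, before counting any work done in L0 or L1.
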
conversion of vmreads and vmwrites to non-trapping
The other two expensive steps are 3 and 9. As de- load/stores, para-virtualization could reduce the over-
scribed in Section 3.5, these switches can be optimized. head for kernbench from 14.5% to 10.3%.
Indeed as we show in Figure 10, column 3, using various
optimizations we can reduce the virtualization overhead
by 25%, and by 80% when using non-trapping vmread 5.1 Architectural Overhead
and vmwrite instructions.
Part of the overhead introduced with nested virtualization
By avoiding traps on vmread and vmwrite (Fig- is due to the architectural design choices of x86 hardware
ure 10, columns 4 and 5), we removed the exits caused virtualization extensions.
by VMCS1→2 accesses and the corresponding VMCS ac-
Virtualization API: Two performance sensitive areas
cess emulation, step 5. This optimization reduced the
in x86 virtualization are memory management and I/O
switching cost by 84.6%, from 13,000 to 2,000.
virtualization. With multi-dimensional paging we com-
While it might still be possible to optimize steps 3 pressed three MMU translation tables onto the two avail-
and 9 further, it is clear that the exits of L1 while han- able in hardware; multi-level device assignment does
dling a single exit of L2 , and specifically VMCS accesses, the same for IOMMU translation tables. Architectural
are a major source of overhead. Architectural support for support for multiple levels of MMU and DMA transla-
both faster world switches and VMCS updates without ex- tion tables—as many tables as there are levels of nested
its will reduce the overhead. hypervisors—will immediately improve MMU and I/O
Examining Figure 10, it seems that handling cpuid virtualization.
in L1 is more expensive than handling cpuid in L0 . Architectural support for delivering interrupts directly
Specifically, in column 3, the nested hypervisor L1 from the hardware to the L2 guest will remove L0 inter-
spends around 5,000 cycles to handle cpuid, while in vention on interrupt delivery and completion, interven-
column 1 the same hypervisor running on bare metal tion which, as we explained in Section 4.1.1, hurts nested
only spends 1500 cycles to handle the same exit (note performance. Such architectural support will also help
that these numbers do not include the mode switches). single-level I/O virtualization performance [33].
The code running in L1 and in L0 is identical; the differ- VMX features such as MSR bitmaps, I/O bitmaps, and
ence in cycle count is due to cache pollution. Running CR masks/shadows [48] proved to be effective in reduc-
the cpuid handling code incurs on average 5 L2 cache ing exit overhead. Any architectural feature that reduces
misses and 2 TLB misses when run in L0 , whereas run- single-level exit overhead also shortens the nested critical
ning the exact same code in L1 incurs on average 400 L2 path. Such features, however, also add implementation
cache misses and 19 TLB misses. complexity, since to exploit them in nested environments

12
Cache Pollution: Each time the processor switches between the guest and the host context on a single core, the effectiveness of its caches is reduced. This phenomenon is magnified in nested environments, due to the increased number of switches. As was seen in Section 4.3, even after discounting L0 intervention, the L1 hypervisor still took more cycles to handle an L2 exit than it took to handle the same exit for the single-level scenario, due to cache pollution. Dedicating cores to guests could reduce cache pollution [7, 45, 46] and increase performance.
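On Linux, a first approximation of dedicating cores is to pin each vCPU thread to its own physical core and keep other work off that core (for example with cpusets, not shown). The helper below is a generic sketch; the thread id and core number are placeholders.

    /*
     * One way to approximate dedicated cores on Linux: pin each vCPU thread
     * to its own physical core (and keep host work off that core, e.g. with
     * cpusets, not shown).  The thread id and core number are placeholders.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    int pin_vcpu_thread(pid_t vcpu_tid, int core)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(core, &mask);

        /* pins the given vCPU thread so it always runs on 'core' */
        if (sched_setaffinity(vcpu_tid, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }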
6 Conclusions and Future Work

Efficient nested x86 virtualization is feasible, despite the challenges stemming from the lack of architectural support for nested virtualization. Enabling efficient nested virtualization on the x86 platform through multi-dimensional paging and multi-level device assignment opens exciting avenues for exploration in such diverse areas as security, clouds, and architectural research.

We are continuing to investigate architectural and software-based methods to improve the performance of nested virtualization, while simultaneously exploring ways of building computer systems that have nested virtualization built in.

Last, but not least, while the Turtles project is fairly mature, we expect that the additional public exposure stemming from its open source release will help enhance its stability and functionality. We look forward to seeing in what interesting directions the research and open source communities will take it.

Acknowledgments

The authors would like to thank Alexander Graf and Joerg Roedel, whose KVM patches for nested SVM inspired parts of this work. The authors would also like to thank Ryan Harper, Nadav Amit, and our shepherd, Robert English, for insightful comments and discussions.

References

[1] Phoenix Hyperspace. http://www.hyperspace.com/.
[2] Abramson, D., Jackson, J., Muthrasanallur, S., Neiger, G., Regnier, G., Sankaran, R., Schoinas, I., Uhlig, R., Vembu, B., and Wiegert, J. Intel virtualization technology for directed I/O. Intel Technology Journal 10, 3 (August 2006), 179–192.
[3] Adams, K., and Agesen, O. A comparison of software and hardware techniques for x86 virtualization. SIGOPS Oper. Syst. Rev. 40, 5 (December 2006), 2–13.
[4] AMD. Secure Virtual Machine architecture reference manual.
[5] Amit, N., Ben-Yehuda, M., and Yassour, B.-A. IOMMU: Strategies for mitigating the IOTLB bottleneck. In WIOSCA '10: Sixth Annual Workshop on the Interaction between Operating Systems and Computer Architecture.
[6] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. Xen and the art of virtualization. In SOSP '03: Symposium on Operating Systems Principles (2003).
[7] Baumann, A., Barham, P., Dagand, P. E., Harris, T., Isaacs, R., Peter, S., Roscoe, T., Schüpbach, A., and Singhania, A. The multikernel: a new OS architecture for scalable multicore systems. In SOSP '09: 22nd ACM SIGOPS Symposium on Operating Systems Principles, pp. 29–44.
[8] Bellard, F. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference (2005), p. 41.
[9] Belpaire, G., and Hsu, N.-T. Hardware architecture for recursive virtual machines. In ACM '75: 1975 Annual ACM Conference, pp. 14–18.
[10] Belpaire, G., and Hsu, N.-T. Formal properties of recursive virtual machine architectures. SIGOPS Oper. Syst. Rev. 9, 5 (1975), 89–96.
[11] Ben-Yehuda, M., Mason, J., Xenidis, J., Krieger, O., Van Doorn, L., Nakajima, J., Mallick, A., and Wahlig, E. Utilizing IOMMUs for virtualization in Linux and Xen. In OLS '06: The 2006 Ottawa Linux Symposium, pp. 71–86.
[12] Berghmans, O. Nesting virtual machines in virtualization test frameworks. Master's thesis, University of Antwerp, May 2010.
[13] Bhargava, R., Serebrin, B., Spadini, F., and Manne, S. Accelerating two-dimensional page walks for virtualized systems. In ASPLOS '08: 13th International Conference on Architectural Support for Programming Languages and Operating Systems (2008).
[14] Clark, C., Fraser, K., Hand, S., Hansen, J. G., Jul, E., Limpach, C., Pratt, I., and Warfield, A. Live migration of virtual machines. In NSDI '05: Second Symposium on Networked Systems Design & Implementation (2005), pp. 273–286.
[15] Devine, S. W., Bugnion, E., and Rosenblum, M. Virtualization system including a virtual machine monitor for a computer with a segmented architecture. US Patent #6397242, May 2002.
[16] Ford, B., Hibler, M., Lepreau, J., Tullmann, P., Back, G., and Clawson, S. Microkernels meet recursive virtual machines. In OSDI '96: Second USENIX Symposium on Operating Systems Design and Implementation (1996), pp. 137–151.
[17] Garfinkel, T., Adams, K., Warfield, A., and Franklin, J. Compatibility is not transparency: VMM detection myths and realities. In HOTOS '07: 11th USENIX Workshop on Hot Topics in Operating Systems (2007), pp. 1–6.
[18] Garfinkel, T., and Rosenblum, M. A virtual machine introspection based architecture for intrusion detection. In Network & Distributed Systems Security Symposium (2003), pp. 191–206.
[19] Gavrilovska, A., Kumar, S., Raj, H., Schwan, K., Gupta, V., Nathuji, R., Niranjan, R., Ranadive, A., and Saraiya, P. High-performance hypervisor architectures: Virtualization in HPC systems. In HPCVIRT '07: 1st Workshop on System-level Virtualization for High Performance Computing.
[20] Gebhardt, C., and Dalton, C. LaLa: a late launch application. In STC '09: 2009 ACM Workshop on Scalable Trusted Computing (2009), pp. 1–8.
[21] Goldberg, R. P. Architecture of virtual machines. In Proceedings of the Workshop on Virtual Computer Systems (New York, NY, USA, 1973), ACM, pp. 74–112.
[22] Goldberg, R. P. Survey of virtual machine research. IEEE Computer Magazine (June 1974), 34–45.
[23] Graf, A., and Roedel, J. Nesting the virtualized world. Linux Plumbers Conference, Sep. 2009.
[24] He, Q. Nested virtualization on Xen. Xen Summit Asia 2009.
[25] Huang, J.-C., Monchiero, M., and Turner, Y. Ally: OS-transparent packet inspection using sequestered cores. In WIOV '10: The Second Workshop on I/O Virtualization.
[26] Huang, W., Liu, J., Koop, M., Abali, B., and Panda, D. Nomad: migrating OS-bypass networks in virtual machines. In VEE '07: 3rd International Conference on Virtual Execution Environments (2007), pp. 158–168.
[27] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual. 2009.
[28] Kadav, A., and Swift, M. M. Live migration of direct-access devices. In First Workshop on I/O Virtualization (WIOV '08).
[29] Kivity, A., Kamay, Y., Laor, D., Lublin, U., and Liguori, A. KVM: the Linux virtual machine monitor. In Ottawa Linux Symposium (July 2007), pp. 225–230.
[30] Lauer, H. C., and Wyeth, D. A recursive virtual machine architecture. In Workshop on Virtual Computer Systems (1973), pp. 113–116.
[31] LeVasseur, J., Uhlig, V., Stoess, J., and Götz, S. Unmodified device driver reuse and improved system dependability via virtual machines. In OSDI '04: 6th Symposium on Operating Systems Design & Implementation (2004), p. 2.
[32] LeVasseur, J., Uhlig, V., Yang, Y., Chapman, M., Chubb, P., Leslie, B., and Heiser, G. Pre-virtualization: Soft layering for virtual machines. In ACSAC '08: 13th Asia-Pacific Computer Systems Architecture Conference, pp. 1–9.
[33] Liu, J. Evaluating standard-based self-virtualizing devices: A performance study on 10 GbE NICs with SR-IOV support. In IPDPS '10: IEEE International Parallel and Distributed Processing Symposium (2010).
[34] Navarro, J., Iyer, S., Druschel, P., and Cox, A. Practical, transparent operating system support for superpages. In OSDI '02: 5th Symposium on Operating Systems Design and Implementation (2002), pp. 89–104.
[35] Osisek, D. L., Jackson, K. M., and Gum, P. H. ESA/390 interpretive-execution architecture, foundation for VM/ESA. IBM Systems Journal 30, 1 (1991).
[36] Popek, G. J., and Goldberg, R. P. Formal requirements for virtualizable third generation architectures. Commun. ACM 17, 7 (July 1974), 412–421.
[37] Raj, H., and Schwan, K. High performance and scalable I/O virtualization via self-virtualized devices. In HPDC '07: Proceedings of the 16th International Symposium on High Performance Distributed Computing (2007), pp. 179–188.
[38] Ram, K. K., Santos, J. R., Turner, Y., Cox, A. L., and Rixner, S. Achieving 10 Gbps using safe and transparent network interface virtualization. In VEE '09: The 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (March 2009).
[39] Riley, R., Jiang, X., and Xu, D. Guest-transparent prevention of kernel rootkits with VMM-based memory shadowing. In Recent Advances in Intrusion Detection, vol. 5230 of Lecture Notes in Computer Science. 2008, ch. 1, pp. 1–20.
[40] Robin, J. S., and Irvine, C. E. Analysis of the Intel Pentium's ability to support a secure virtual machine monitor. In 9th USENIX Security Symposium (2000), p. 10.
[41] Rosenblum, M. VMware's virtual platform: A virtual machine monitor for commodity PCs. In Hot Chips 11 (1999).
[42] Russell, R. virtio: towards a de-facto standard for virtual I/O devices. SIGOPS Oper. Syst. Rev. 42, 5 (2008), 95–103.
[43] Rutkowska, J. Subverting Vista kernel for fun and profit. Black Hat, Aug. 2006.
[44] Seshadri, A., Luk, M., Qu, N., and Perrig, A. SecVisor: a tiny hypervisor to provide lifetime kernel code integrity for commodity OSes. In SOSP '07: 21st ACM SIGOPS Symposium on Operating Systems Principles (2007), pp. 335–350.
[45] Shalev, L., Borovik, E., Satran, J., and Ben-Yehuda, M. IsoStack—highly efficient network processing on dedicated cores. In USENIX ATC '10: The 2010 USENIX Annual Technical Conference (2010).
[46] Shalev, L., Makhervaks, V., Machulsky, Z., Biran, G., Satran, J., Ben-Yehuda, M., and Shimony, I. Loosely coupled TCP acceleration architecture. In HOTI '06: Proceedings of the 14th IEEE Symposium on High-Performance Interconnects (Washington, DC, USA, 2006), IEEE Computer Society, pp. 3–8.
[47] Sugerman, J., Venkitachalam, G., and Lim, B.-H. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In USENIX Annual Technical Conference (2001).
[48] Uhlig, R., Neiger, G., Rodgers, D., Santoni, A. L., Martins, F. C. M., Anderson, A. V., Bennett, S. M., Kagi, A., Leung, F. H., and Smith, L. Intel virtualization technology. Computer 38, 5 (2005), 48–56.
[49] Waldspurger, C. A. Memory resource management in VMware ESX Server. In OSDI '02: 5th Symposium on Operating Systems Design and Implementation.
[50] Whitaker, A., Shaw, M., and Gribble, S. D. Denali: a scalable isolation kernel. In EW 10: 10th ACM SIGOPS European Workshop (2002), pp. 10–15.
[51] Whitaker, A., Shaw, M., and Gribble, S. D. Scale and performance in the Denali isolation kernel. SIGOPS Oper. Syst. Rev. 36, SI (2002), 195–209.
[52] Willmann, P., Shafer, J., Carr, D., Menon, A., Rixner, S., Cox, A. L., and Zwaenepoel, W. Concurrent direct network access for virtual machine monitors. In HPCA '07: IEEE 13th International Symposium on High Performance Computer Architecture (2007), pp. 306–317.
[53] Yassour, B.-A., Ben-Yehuda, M., and Wasserman, O. Direct device assignment for untrusted fully-virtualized virtual machines. Tech. rep., IBM Research Report H-0263, 2008.
[54] Zhai, E., Cummings, G. D., and Dong, Y. Live migration with pass-through device for Linux VM. In OLS '08: The 2008 Ottawa Linux Symposium (July 2008), pp. 261–268.