5.1 A Top-down Performance Analysis
In this section, we study the virtualization overhead of GiantVM compared to non-distributed QEMU-KVM and answer Question-1. We first run microbenchmarks in the guest to analyze performance at the top level of abstraction. Then we perform a run-time breakdown analysis to reveal the performance bottleneck. Finally, we compare the RDMA and TCP network backends.
Distributed vCPU: We run the CPU stress methods of Stress-ng [36] v0.05.23 to evaluate distributed vCPU performance, measuring the total throughput of all threads (Ops/s). These methods either have a small memory footprint or involve only local memory accesses, so DSM involvement is minimal. Figure 5 (left) shows that all CPU stress methods scale linearly. Performance keeps rising even when more than 8 vCPUs are used, 8 being the number of CPUs on each node of the cluster, so GiantVM aggregates CPU resources well for workloads with infrequent memory accesses. Figure 5 (right) reports the cost of CPU resource aggregation, comparing the \(4\times 4\) GiantVM with 16-vCPU QEMU-KVM. GiantVM is \(1.34\times\) slower than the baseline on average. The slowdown can be attributed to the DSM component: these tests run in iterations, and results are submitted to a global view at the end of each iteration, incurring DSM overhead because all nodes need the global results. Nevertheless, the results are acceptable; these are ideal workloads for GiantVM's CPU resource aggregation.
Distributed Memory and I/O: We run the memory and file I/O tests of Sysbench [39] v1.0.19 to evaluate distributed memory and I/O virtualization performance, respectively. Since our stand-alone machine has 24 CPUs, we can only run up to \(4\times 6\) configurations for GiantVM. The memory tests are configured as \(\lbrace size,scope,oper,mode\rbrace\), performing oper (read or write) on scope (local or global) memory under mode (GiantVM or baseline). Local means each thread accesses its own page-aligned memory space, while global means all threads access globally shared memory. The file I/O tests are configured as \(\lbrace size,mode\rbrace\), sequentially reading/writing one file of size (1 KiB or 1 MiB) under mode (GiantVM or baseline).
The total memory/file read/write speed of all threads is reported in Figure 6(a–f). In almost all tests, the baseline outperforms GiantVM significantly, especially in the file I/O tests (at most \(45.2\times\) slower than the single-node baseline). The results indicate a high overhead for providing a continuous address space across multiple nodes. When accessing local data or reading global data, the MSI protocol of the DSM lets all processors make progress at full speed. When writing global data, however, especially in the file I/O workloads, which modify data structures deep in the storage stack of the monolithic kernel, DSM involvement is frequent. A DSM PF has a penalty of ~15 \(\mu s\), whereas a cache-system miss costs only ~60 ns, so workloads that write frequently trigger excessive page faults and suffer a severe slowdown.
To further investigate the slowdown of the file I/O tests, we use perf to profile the time spent in each component of the system and plot the CDF of page-fault counts over the top-100 most frequently faulting pages, in Figure 6(g) and (h). Only 35.4% of the time is spent doing useful work, i.e., running the guest OS, while the DSM component occupies 43.8% of the time, which is catastrophic. The kernel storage system is carefully tuned for the single-machine cache system but not for the GiantVM DSM system: several locks and shared data structures are allocated on the same page, resulting in false sharing. The CDF of page-fault counts also reveals the issue: 70% of DSM page faults occur on pages totaling only 0.5 MiB. In contrast, interrupt and I/O forwarding, i.e., the Router part in Figure 6(g), has a negligible run-time share of 1.06%, so distributed I/O cannot be the performance bottleneck.
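To make the false-sharing pattern concrete, the following is a minimal, self-contained C++ sketch (illustrative only; it is neither GiantVM nor kernel code). Two threads update logically independent counters: when the counters are adjacent they share a cache line on an ordinary SMP machine, and under page-granularity DSM any two writers on the same 4 KiB page cause the page to bounce between nodes in exactly the same way.

```cpp
// Illustrative false-sharing sketch (not GiantVM or kernel code).
// Two threads update logically independent counters. Adjacent counters share a
// cache line (and, under page-granularity DSM, a 4 KiB page), so every write by
// one side invalidates the other side's copy and the line/page keeps bouncing.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

constexpr long kIters = 100'000'000;

struct Shared {                       // a and b are adjacent: same line, same page
    volatile uint64_t a = 0;
    volatile uint64_t b = 0;
};

struct alignas(4096) PageAligned {    // each instance occupies its own page
    volatile uint64_t v = 0;
};

static double run(volatile uint64_t *x, volatile uint64_t *y) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([x] { for (long i = 0; i < kIters; ++i) *x = *x + 1; });
    std::thread t2([y] { for (long i = 0; i < kIters; ++i) *y = *y + 1; });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    Shared s;
    PageAligned p1, p2;
    printf("same line/page : %.2f s\n", run(&s.a, &s.b));   // heavy invalidation traffic
    printf("separate pages : %.2f s\n", run(&p1.v, &p2.v)); // each writer runs at full speed
    return 0;
}
```

On GiantVM, each bounce is a ~15 \(\mu s\) DSM PF rather than a cache-line transfer, which is why allocating kernel locks and shared data on the same page is so costly.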
An anomaly is the \(\lbrace 4 KiB,global,write,*\rbrace\) memory test, in which we observe a “slower is faster” phenomenon: for some data points, IPoIB > RDMA > baseline (DRAM & SRAM). Because the memory test is throughput-oriented, one thread can read/write a local page at full speed until an Invalid message arrives from another node. Slow messages leave more time for useful work, while fast ones can lead to livelock. Slow messages also match the design philosophy of a weaker memory model [19], i.e., synchronizing data across nodes lazily; we discuss the possibility of relaxing the memory model to x86-TSO [55] in Section 6.1. The phenomenon disappears when we test \(\lbrace 4 MiB,global,write,*\rbrace\): the DSM now suffers from a large working set, while it is still sparse enough for the cache system, so there is little cache-line thrashing or livelock. At this point, the large gap in access speed between the L3 cache (or DRAM) and remote memory becomes visible.
Run Time Breakdown: We implement three big-data workloads for Linux in C++ and profile their run time using perf on a 2-node GiantVM, in the same way as for the Sysbench file I/O tests. These workloads represent typical big-data applications that could benefit from GiantVM, both for resource aggregation and for ease of programming, and let us identify the performance bottleneck of GiantVM for big-data applications. Figure 7(a–c) gives the results.
• WC reads a 5 GiB text file in the experiment, representing large-scale text-processing workloads. It runs one thread on each CPU, and the text file is divided equally among threads.
• Pi is a Monte Carlo-based \(\pi\) calculator that performs 20 billion random samplings, showing patterns similar to CPU-heavy workloads such as scientific computing. Pi creates one thread on each CPU (a minimal sketch follows this list).
• PageRank (PR) is representative of iterative tasks in data analysis and machine learning. PR works on a graph with 576 K nodes and 5.1 M edges and executes 20 iterations on it. It spawns one thread on each CPU, and the graph nodes are divided equally among threads.
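The following is a minimal C++ sketch of the Pi workload in the spirit described above; it is our reconstruction (thread count, RNG, and seeding are our choices), not the benchmark's actual source. The point is that all sampling state is thread-local, so under DSM only the final per-thread counts touch shared memory.

```cpp
// Reconstruction of a Monte Carlo Pi workload (not the exact benchmark source):
// each thread samples points with a thread-local RNG and only the final
// per-thread counts are combined, so almost all memory accesses stay node-local.
#include <algorithm>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

int main() {
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    const long long total = 20'000'000'000LL;          // 20 billion samples
    const long long per_thread = total / nthreads;

    std::vector<long long> hits(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([t, per_thread, &hits] {
            std::mt19937_64 rng(0xC0FFEE + t);          // thread-local state
            std::uniform_real_distribution<double> u(0.0, 1.0);
            long long local = 0;                        // node-local counter
            for (long long i = 0; i < per_thread; ++i) {
                double x = u(rng), y = u(rng);
                if (x * x + y * y <= 1.0) ++local;
            }
            hits[t] = local;                            // one shared write per thread
        });
    }
    long long sum = 0;
    for (unsigned t = 0; t < nthreads; ++t) { workers[t].join(); sum += hits[t]; }
    printf("pi ~= %.6f\n", 4.0 * sum / (per_thread * (long long)nthreads));
    return 0;
}
```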
(1) In all configurations, the Router overhead is negligible (\(\lt\)1% on average), since interrupt and I/O forwarding involve only a few bytes of network transfer containing register states. (2) WC and PR perform a large amount of disk reads for the text file and graph data; consequently, QEMU device emulation takes an observable share of the run time, 21.6% and 25.0% for WC and PR, respectively. (3) KVM overhead is only visible in \(2\times 8\) PR, at 2.02% of the run time. (4) WC and Pi are either CPU intensive or involve mostly local memory accesses, so their DSM overheads are relatively small, 8.4% and 4.0% on average. PR has poor memory-access locality and thus incurs significant DSM overhead.
For \(2\times 8\) PR, a scalability issue appears: DSM takes up to 31.3% of the execution time. This is because our current network implementation uses only two QPs (Queue Pairs, the basic connection data structure in RDMA) between each pair of machines, leading to significant resource contention. For workloads with random and frequent memory accesses, DSM becomes the bottleneck.
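One direction suggested by this observation is to allocate more than two QPs per peer, e.g., one per vCPU or per DSM worker, so that DSM messages from different vCPUs do not contend on a single send queue. The sketch below shows only the resource-allocation side with plain libibverbs; it is not GiantVM's networking code, the QP count and queue depths are arbitrary assumptions, and connection setup (exchanging QPNs/LIDs and moving the QPs to RTS) is omitted.

```cpp
// Hypothetical libibverbs sketch: create several reliable-connection QPs toward
// one peer instead of sharing a single pair. Compile with -libverbs.
#include <infiniband/verbs.h>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    int num_dev = 0;
    ibv_device **devs = ibv_get_device_list(&num_dev);
    if (!devs || num_dev == 0) { fprintf(stderr, "no RDMA device found\n"); return 1; }

    ibv_context *ctx = ibv_open_device(devs[0]);
    ibv_pd *pd = ibv_alloc_pd(ctx);

    const int kQpsPerPeer = 4;                       // assumption: 4 QPs toward this peer
    std::vector<std::pair<ibv_qp *, ibv_cq *>> conns;
    for (int i = 0; i < kQpsPerPeer; ++i) {
        ibv_cq *cq = ibv_create_cq(ctx, /*cqe=*/256, nullptr, nullptr, 0);
        ibv_qp_init_attr attr = {};
        attr.send_cq = cq;
        attr.recv_cq = cq;
        attr.qp_type = IBV_QPT_RC;                   // reliable connection
        attr.cap.max_send_wr = 128;
        attr.cap.max_recv_wr = 128;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;
        ibv_qp *qp = ibv_create_qp(pd, &attr);
        if (!qp) { fprintf(stderr, "ibv_create_qp failed\n"); return 1; }
        conns.emplace_back(qp, cq);
    }
    printf("created %zu QPs toward one peer\n", conns.size());

    for (auto &c : conns) { ibv_destroy_qp(c.first); ibv_destroy_cq(c.second); }
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Whether extra QPs actually relieve the contention depends on the HCA and the DSM protocol, and would have to be measured.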
Network Backend Comparisons: Figure 7(d) gives a breakdown comparison between RDMA and TCP for the big-data workloads, using the \(2\times 4\) GiantVM setting. For all workloads, and especially the memory-intensive PR, the RDMA speedup is significant: 60.1% of DSM time for PR is saved after switching to RDMA, and the total execution-time speedups are \(1.15\times\), \(1.04\times\), and \(4.46\times\) for WC, Pi, and PR, respectively.
We also evaluate the network backends at a lower level of abstraction, on a \(2\times 8\) GiantVM. Figure 8 reports the latency and bandwidth of the three network backends in the host kernel, and the latency of non-owner writes of DSM as a CDF.
(1) We measure the latency and bandwidth of the three network backends in the host kernel by sending and receiving requests from 64 B to 1 MiB, 4,000 times each, with a single-threaded client and server. Figure 8(a, b) shows TCP-Ethernet > IPoIB > RDMA for latency and the opposite ordering for bandwidth. IPoIB does not make full use of the HCA's capabilities, since its traffic goes through the TCP/IP stack and the CPU cannot process packets fast enough for a 56 Gbps IB link; the CPU is also the bottleneck for TCP-Ethernet. TCP-Ethernet and IPoIB achieve poor bandwidth (\(\lt\)245.7 and 770.0 MiB/s, respectively), whereas RDMA reaches a maximum bandwidth of 2,840.3 MiB/s.
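For readers who want a rough user-space analogue of this latency measurement (the harness above runs in the host kernel), the sketch below times a 64-byte TCP ping-pong over loopback with std::chrono; the port number and iteration count are arbitrary choices.

```cpp
// User-space TCP ping-pong latency sketch (not the in-kernel harness): a server
// thread echoes 64-byte messages over loopback and the client times round trips.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr int kPort = 18515;    // arbitrary port
constexpr int kMsg = 64;        // smallest message size in the paper's test
constexpr int kIters = 4000;    // same repetition count as the paper's harness

static void echo_server() {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(kPort);
    bind(srv, (sockaddr *)&addr, sizeof(addr));
    listen(srv, 1);
    int conn = accept(srv, nullptr, nullptr);
    char buf[kMsg];
    for (int i = 0; i < kIters; ++i) {
        recv(conn, buf, kMsg, MSG_WAITALL);
        send(conn, buf, kMsg, 0);
    }
    close(conn);
    close(srv);
}

int main() {
    std::thread server(echo_server);
    std::this_thread::sleep_for(std::chrono::milliseconds(100)); // let server bind

    int cli = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(cli, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); // avoid Nagle skew
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(kPort);
    connect(cli, (sockaddr *)&addr, sizeof(addr));

    char buf[kMsg] = {0};
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        send(cli, buf, kMsg, 0);
        recv(cli, buf, kMsg, MSG_WAITALL);
    }
    auto us = std::chrono::duration<double, std::micro>(
                  std::chrono::steady_clock::now() - t0).count();
    printf("avg round-trip latency: %.2f us\n", us / kIters);

    close(cli);
    server.join();
    return 0;
}
```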
(2) Figure 8(c) shows 1,000 non-owner write latencies as a CDF, measured while running the Sysbench file I/O \(\lbrace 1 MiB,\)GiantVM\(\rbrace\) tests. Recall from Section 3.3.1 that a non-owner-write DSM PF incurs the largest overhead among all kinds of DSM PF and is therefore the performance bottleneck of DSM. IPoIB performs similarly to RDMA, and both show great stability in latency. TCP-Ethernet, by contrast, has poor latency stability: its P90 (90th-percentile) latency is \(1.8\times\) its P50, whereas this ratio is \(1.25\times\) for RDMA and \(1.48\times\) for IPoIB. The instability of TCP-Ethernet sometimes causes timeout-based services to fail when bootstrapping Linux. The respective P90 latencies of RDMA, IPoIB, and TCP-Ethernet are \(10.0 \mu s\), \(18.3 \mu s\), and \(235.6 \mu s\). Running the Sysbench local memory write test, we measure a P90 latency of \(0.56 \mu s\) for 4 KiB DRAM writes. RDMA thus brings a large speedup to DSM, although it is still \(17.9\times\) slower than DRAM.
Summary: GiantVM presents linear scalability for distributed vCPUs and incurs negligible overhead for interrupt and I/O forwarding. The DSM component is the performance bottleneck of GiantVM, and its overhead grows with the number of vCPUs. RDMA brings a substantial speedup to DSM, which is a new opportunity for SSI systems.
5.2 DSM-aware Optimization Results
This section answers Question-2 by examining two DSM-aware optimizations. We also compare against the Message Passing Interface (MPI) [1], a distributed computation framework, to show the overhead GiantVM (with DSM-aware optimizations enabled) incurs in exchange for ease of programming.
Kernel Scalability is Crucial: One significant advantage of a distributed hypervisor over other SSI systems is that it provides a unified ISA interface, which lets us run alternative OSes. Barrelfish is a multi-kernel OS built on the assumption that cache coherence is expensive and kernel scalability is important; the hardware it targets resembles a network or distributed system more than the shared-memory architecture of a small-scale machine. Barrelfish runs multiple per-core kernels (hence multi-kernel), and messages among kernels and applications are exchanged via RPC channels, even though the channels are built on shared memory. The hardware model Barrelfish targets clearly matches the physical reality of GiantVM and minimizes DSM overhead.
We deploy two web servers, on Linux (Apache2, v2.4.18) and on Barrelfish (built-in, commit 78cf89d) respectively, and use the Apache HTTP server benchmarking tool ab [17] v2.3 to continuously GET a 298 KiB file. The Requests Per Second (RPS) are shown in Figure 9, where \(\lbrace n,os\rbrace\) means n clients send requests to os simultaneously. Barrelfish outperforms Linux by \(6.4\times\) on average. As the concurrency level (the number of concurrent clients) increases, Linux clearly lacks scalability and suffers a large overhead. As a kernel-intensive application, the Apache2 web server causes frequent false sharing when accessing shared data structures allocated on the same guest page. In Barrelfish, by contrast, page faults are limited to the RPC channels, most of which transfer valuable data rather than falsely shared lines; hence the performance improvement.
However, the single-node web server (8 vCPUs, 40 clients) still achieves \(8.2\times\) the RPS of the fastest Barrelfish result (\(4\times 2\), \(\lbrace 40,\)Barrelfish\(\rbrace\)), due to the memory-intensive nature of web server applications. Even so, the \(6.4\times\) improvement brought by Barrelfish indicates the importance of kernel scalability in the distributed environment.
DSM-aware Scheduling Results: We use the NPB suite v3.4.1 [9] to test the effectiveness of DaS (Section 3.3.2). We choose the OpenMP implementation of NPB to run on GiantVM, representing single-machine applications that could benefit from GiantVM. Its counterpart is the MPI [1] implementation, which adopts a message-passing programming model. We compare the OpenMP version running under DaS with the MPI version to quantify the cost of ease of programming.
All tests run on a \(2\times 8\) GiantVM, and all benchmarks run 16 threads. Each OpenMP benchmark runs as a single process. Each benchmark is named \(name.class\), where class can be S, W, or A–F, representing the compiled default input size. We choose the classes shown in Figure 10, which give the largest working sets that fit into a GiantVM guest with 64 GiB of RAM. Since our system does not support compiling classes D and E for IS, we use the longest-running IS.C.
(1) We first compare DaS, CFS, and NUMA balancing in Figure 10 (leftmost subfigure). We set \(N=2\) in Algorithm 1 for this experiment, equal to the number of NUMA nodes. The \(2\times 8\) MPI results are normalized to 1; in that setting, the MPI version of NPB runs on two nodes, each hosting 8 single-threaded processes of the benchmark. The OpenMP results come from a single benchmark running on the 16 vCPUs of GiantVM, and we compare the total throughput of all 16 benchmark threads (Mops/total). DaS performs best among the three scheduling policies, improving throughput by up to \(3.5\times\) over CFS for MG.D, while NUMA balancing causes a performance collapse of up to 28.7% for CG.D. The performance gap between MPI and OpenMP remains, however: running a shared address space across two nodes still incurs great overhead, even though DaS improves performance by up to \(3.5\times\).
We observe that workloads react differently to the scheduling policies. We analyze this by dumping the DSM matrices while the applications run; Figure 11 shows the most representative ones, with the working-set size marked in the title. EP.D is insensitive to scheduling policies, since it accesses memory infrequently. MG.D and SP.D have relatively regular memory-access patterns, and the clique-generation algorithm works well on them. For other workloads with very frequent or very infrequent memory accesses, DaS gives only a small improvement. NUMA balancing relies on page migration to make all memory accesses local, which is costly on GiantVM: by examining the /proc/vmstat file, we observe that NUMA balancing keeps migrating pages between hosts (around 5,000 pages/s). It lacks a history of the memory-access behavior of the whole application, introducing a large amount of unnecessary page migration. DaS addresses this issue by memorizing DSM PF statistics in the DSM matrix over the 6,000 ms scheduling period (Section 4.2.2).
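The /proc/vmstat check can be reproduced with a small helper like the following (our own observer, not part of GiantVM or DaS). It samples the numa_pages_migrated counter twice, one second apart, and prints the migration rate; the counter is only exposed when automatic NUMA balancing is compiled into the kernel.

```cpp
// Small observer for NUMA balancing activity: samples numa_pages_migrated in
// /proc/vmstat twice, one second apart, and reports the migration rate.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>

static long read_vmstat(const std::string &key) {
    std::ifstream f("/proc/vmstat");
    std::string name;
    long value;
    while (f >> name >> value)
        if (name == key) return value;
    return -1;                                    // field not found
}

int main() {
    long before = read_vmstat("numa_pages_migrated");
    std::this_thread::sleep_for(std::chrono::seconds(1));
    long after = read_vmstat("numa_pages_migrated");
    if (before < 0 || after < 0) {
        fprintf(stderr, "numa_pages_migrated not exposed on this kernel\n");
        return 1;
    }
    printf("NUMA balancing migrated %ld pages/s\n", after - before);
    return 0;
}
```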
(2) CPU and memory are of unequal importance to different workloads. Figure 10 (two rightmost subfigures) compares several configurations of CPU and memory. In this experiment, we run two groups of benchmarks: CG.D and EP.D in the first group, and FT.C and BT.D in the second. Benchmarks within each group have comparable run times. The configurations are read as Scheduler-nCPU. For \(nCPU=8\), the two workloads of the same group run together, scheduled by Scheduler. For \(nCPU=16\), a single workload runs under Scheduler. All tests except MPI run on a \(2\times 8\) GiantVM; for MPI, a single MPI task runs across 2 nodes, each node hosting 8 single-threaded processes of the task.
Note that for DaS-8 we set \(N=1\), so DaS schedules all threads of a single process to the same node, eliminating DSM overheads. For DaS-16, \(N=2\). Results of single-node 16-vCPU OpenMP runs on vanilla QEMU-KVM are normalized to 1.
Providing more CPUs (CFS-16) is more important than making memory accesses local (DaS-8) for CG.D and EP.D, since they do not perform intensive memory accesses (Figure 11); as a result, we see CFS-16 \(\gt\) DaS-8 for them. Also, DaS-16 achieves up to \(1.06\times\) the single-node OpenMP results, since threads that frequently access shared data are scheduled onto nearby CPU cores, making better use of the L3 cache.
For FT.C and BT.D, the results are the opposite. They are the most memory-intensive workloads in Figure 11, and DaS cannot split their peer threads into two groups with good memory locality (DaS-16). Making all memory accesses local (DaS-8) brings a huge performance boost for them; such workloads are not suited to running across nodes. The MPI results, thanks to the total elimination of false sharing by the message-passing model, outperform the single-node OpenMP setting by up to \(1.96\times\), but they trade ease of programming for performance.
(3) We also analyze the overhead that DaS itself introduces. Since DaS runs kernel threads \(t_s\) and \(t_m\) on vCPU 0 together with application threads, it incurs CPU and memory overhead. For CPU overhead, we annotate the 6,000 ms scheduling period with getnstimeofday calls and collect the time spent running Algorithm 1, which is at most 79,007 \(\mu\)s; threads \(t_s\) and \(t_m\) thus occupy around 0.16% of CPU time, which is negligible. For memory consumption, the main data structure is the DSM matrix. Modern applications have at most a few thousand threads, for which the DSM matrix size is 3.8 MiB; in our experiments with fewer than 32 threads, a DSM matrix consumes no more than 4 KiB of RAM.
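For reference, these figures are consistent with an \(n\times n\) matrix holding one 4-byte counter per ordered thread pair (the entry width is our assumption, not stated above): \[ 1000^2 \times 4\ \text{B} \approx 3.8\ \text{MiB}, \qquad 32^2 \times 4\ \text{B} = 4\ \text{KiB}. \]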
Summary: DaS on Linux guests can offset the cost of programming ease for workloads with moderate memory-access frequency (EP.C), and improves the performance of memory-intensive workloads by up to \(3.5\times\) (MG.D). A multi-kernel guest OS can further speed up a memory-intensive web server by \(6.4\times\).
5.3 Benefits of Migration with GiantVM
We first give a comprehensive comparison between migration with GiantVM and VM live migration, in terms of performance degradation and network bandwidth consumption. Then we co-locate LC tasks running on bare metal with migratable BE tasks running on GiantVM, to show the benefits of resource reallocation with GiantVM, including improved latency for LC tasks and higher CPU utilization of the cluster. Finally, we answer Question-3 with the experimental results.
Impact of Migration on QPS: In this experiment, we measure the impact of migration with GiantVM on the migrated tasks. We use the \(2\times 8\) GiantVM and run Apache2 v2.4.18, iPerf v2.0.5, and Redis v3.0.6 on it. We measure their Queries Per Second (QPS) during migration, i.e., while they are rescheduled in the guest OS from NUMA node 0 to node 1. Figure 12 (right) reports the results. The QPS of the three benchmarks degrades by up to 50% for 1–2 s after the migration. Nevertheless, this is not a primary concern for a BE task running on GiantVM.
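For context, this kind of intra-guest migration amounts to re-pinning a task onto the other NUMA node's vCPUs. In the experiment the guest scheduler performs the move; the sketch below is only a hypothetical user-space trigger, and the assumption that vCPUs 8–15 back guest node 1 corresponds to a \(2\times 8\) GiantVM with a linear vCPU-to-node mapping.

```cpp
// Hypothetical user-space trigger for an intra-guest NUMA migration: restrict
// the calling thread to the vCPUs assumed to back guest node 1.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 8; cpu <= 15; ++cpu)      // vCPUs assumed to belong to guest node 1
        CPU_SET(cpu, &mask);
    // pid 0 == the calling thread; moving a whole process would iterate /proc/<pid>/task
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("calling thread is now restricted to guest NUMA node 1\n");
    return 0;
}
```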
With VM live migration, these applications cannot respond to requests during the downtime, i.e., the time during which the applications are stopped and the guest pages are transferred, and the QPS drops to 0. We omit a downtime comparison here because the concept has no direct counterpart in migration with GiantVM.
Network Bandwidth Consumption of Migration: Our main claim for the benefit of migration with GiantVM is its low cost: the network bandwidth consumption is small compared to VM live migration. Note that in our evaluation, the workload has been scheduled to the destination at least once for GiantVM, so standby pages already exist at the migration destination. The comparisons with VM live migration are carried out on a \(2\times 8\) GiantVM. For the VM live migration configurations, we use the default settings of QEMU v2.8.1.1, in which Postcopy-RAM and XBZRLE [58] are enabled. The migrated workloads are Apache2 v2.4.18, Apache-build (building Apache2 v2.4.18 from source code), OpenSSL v1.0.2, Blowfish v2.2.2 [29], MD5 [64], POV-Ray v3.7, 7-Zip v19.00, Redis v3.0.6, and PHPBench v1.1.0.
Figure 12 (left) presents the average network bandwidth consumption during one migration. GiantVM saves 68% of the bandwidth on average (the median is 91%). The cost of migration with GiantVM depends heavily on the fraction of read-only pages in a workload (Section 3.3.4). 7-Zip frequently mallocs/frees bulk memory and tends to touch a huge range of the address space during migration; because migration with GiantVM has no stop-and-copy stage, pages are frequently transferred across nodes by DSM PFs during migration, and the bandwidth consumption rises to \(1.94\times\) that of VM live migration. For most applications, however, there are many read-only pages that can be exploited to lower the network cost, and VM live migration cannot detect them efficiently.
Application Co-location Results with LaS: This part examines the effectiveness of LaS (Section 4.2.3). We compare three settings: without co-location (w/o), naïve co-location (naïve), and LaS co-location (LaS). Without co-location, tasks achieve maximum performance but leave a large amount of surplus utilization. Naïve co-location only co-locates fixed tasks and lacks the ability to reallocate resources, so it suffers from performance degradation and QoS violations. LaS co-location runs LC tasks as fixed tasks on bare metal and BE tasks as migratable tasks on GiantVM. When the CPU demand of LC tasks bursts, LaS schedules BE tasks away from the overloaded node to ensure the QoS of LC tasks; meanwhile, BE tasks can use the surplus CPU resources on nodes with a low steal time (Section 4.2.3).
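Steal time is directly observable from inside the guest. The helper below (our own code, not LaS itself) reads the aggregate cpu line of /proc/stat, whose 8th field is the steal counter, and reports the steal fraction over a one-second interval; a scheduler like LaS can use the same signal to find nodes with surplus CPU.

```cpp
// Sample the guest's steal time from /proc/stat: the aggregate "cpu" line lists
// user nice system idle iowait irq softirq steal ..., so steal is field #8.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

struct CpuTimes { unsigned long long total = 0, steal = 0; };

static CpuTimes sample() {
    std::ifstream f("/proc/stat");
    std::string line;
    std::getline(f, line);                 // first line: aggregate "cpu  ..."
    std::istringstream ss(line);
    std::string tag;
    ss >> tag;                             // "cpu"
    std::vector<unsigned long long> v;
    unsigned long long x;
    while (ss >> x) v.push_back(x);
    CpuTimes t;
    for (auto e : v) t.total += e;
    if (v.size() >= 8) t.steal = v[7];     // steal is the 8th field
    return t;
}

int main() {
    CpuTimes a = sample();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    CpuTimes b = sample();
    double frac = double(b.steal - a.steal) / double(b.total - a.total);
    printf("steal time over the last second: %.1f%%\n", 100.0 * frac);
    return 0;
}
```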
In practical use, LC tasks are interactive jobs, such as a web server or an in-memory key-value store; they have strict demands for low latency and high QoS. BE tasks are batch jobs such as data processing and kernel compiling, and they can be sent to run on GiantVM without any modification. We run OpenSSL v1.0.2 as BE tasks on a \(4\times 8\) GiantVM, and Memcached v1.6.12 as LC tasks on the first physical node hosting GiantVM (Node 0). For w/o, we run Memcached alone on Node 0, then run OpenSSL alone in an 8-vCPU QEMU on Node 0. For naïve, we run Memcached together with an 8-vCPU QEMU hosting OpenSSL on Node 0. For Memcached, we measure latencies and QPS; for OpenSSL, we measure the reciprocal of the run time. The Memcached workload follows a Gaussian distribution.
Figure 13 shows the results. LaS decreases the P95, P99, and P99.9 latency of Memcached by 26.7%, 27.1%, and 22.8%, respectively, compared to naïve co-location. CPU utilization is also improved by 14.3% compared to naïve co-location. For the QPS and run-time-reciprocal results, we see w/o > LaS > naïve. Hence, the extra 14.3% CPU utilization is exploited by LaS to provide a better QoS guarantee for both BE and LC tasks.
Summary: As a novel approach to workload migration, GiantVM saves 68% of bandwidth on average compared to VM live migration and introduces no downtime. Resource reallocation with GiantVM improves CPU utilization by 14.3% compared to naïve co-location. As measured in Section 5.1, interrupt (including IPI) forwarding overhead is negligible in GiantVM, so distributed vCPU can be very scalable, and GiantVM as a job scheduler could be deployed on dozens of physical nodes as long as all workloads on GiantVM are scheduled to a single NUMA node.