Nothing Special   »   [go: up one dir, main page]

Academia.eduAcademia.edu
FlexSC: Flexible System Call Scheduling with Exception-Less System Calls Livio Soares University of Toronto Michael Stumm University of Toronto Abstract 1 Introduction System calls are the de facto interface to the operating system kernel. They are used to request services offered by, and implemented in the operating system kernel. While User-mode IPC (higher is faster) For the past 30+ years, system calls have been the de facto interface used by applications to request services from the operating system kernel. System calls have almost universally been implemented as a synchronous mechanism, where a special processor instruction is used to yield userspace execution to the kernel. In the first part of this paper, we evaluate the performance impact of traditional synchronous system calls on system intensive workloads. We show that synchronous system calls negatively affect performance in a significant way, primarily because of pipeline flushing and pollution of key processor structures (e.g., TLB, data and instruction caches, etc.). We propose a new mechanism for applications to request services from the operating system kernel: exception-less system calls. They improve processor efficiency by enabling flexibility in the scheduling of operating system work, which in turn can lead to significantly increased temporal and spacial locality of execution in both user and kernel space, thus reducing pollution effects on processor structures. Exception-less system calls are particularly effective on multicore processors. They primarily target highly threaded server applications, such as Web servers and database servers. We present FlexSC, an implementation of exceptionless system calls in the Linux kernel, and an accompanying user-mode thread package (FlexSC-Threads), binary compatible with POSIX threads, that translates legacy synchronous system calls into exception-less ones transparently to applications. We show how FlexSC improves performance of Apache by up to 116%, MySQL by up to 40%, and BIND by up to 105% while requiring no modifications to the applications. Syscall impact on user-mode IPC 1.5 1.3 1.1 0.9 Lost performance (cycles) 0.7 0.5 0.3 Syscall exception 0 2000 4000 6000 8000 10000 12000 14000 16000 Time (in cycles) Figure 1: User-mode instructions per cycles (IPC) of Xalan (from SPEC CPU 2006) in response to a system call exception event, as measured on an Intel Core i7 processor. different operating systems offer a variety of different services, the basic underlying system call mechanism has been common on all commercial multiprocessed operating systems for decades. System call invocation typically involves writing arguments to appropriate registers and then issuing a special machine instruction that raises a synchronous exception, immediately yielding user-mode execution to a kernel-mode exception handler. Two important properties of the traditional system call design are that: (1) a processor exception is used to communicate with the kernel, and (2) a synchronous execution model is enforced, as the application expects the completion of the system call before resuming user-mode execution. Both of these effects result in performance inefficiencies on modern processors. The increasing number of available transistors on a chip (Moore’s Law) has, over the years, led to increasingly sophisticated processor structures, such as superscalar and out-of-order execution units, multi-level caches, and branch predictors. These processor structures have, in turn, led to a large increase in the performance potential of software, but at the same time there is a widening gap between the performance of efficient software and the performance of inefficient software, primarily due to the increasing disparity of accessing different processor resources (e.g., registers vs. caches vs. memory). Server and system-intensive workloads, which are of particular interest in our work, are known to perform well below the potential processor throughput [11, 12, 19]. Most studies attribute this inefficiency to the lack of locality. We claim that part of this lack of locality, and resulting performance degradation, stems from the current synchronous system call interface. Synchronous implementation of system calls negatively impacts the performance of system intensive workloads, both in terms of the direct costs of mode switching and, more interestingly, in terms of the indirect pollution of important processor structures which affects both usermode and kernel-mode performance. A motivating example that quantifies the impact of system call pollution on application performance can be seen in Figure 1. It depicts the user-mode instructions per cycles (kernel cycles and instructions are ignored) of one of the SPEC CPU 2006 benchmarks (Xalan) immediately before and after a pwrite system call. There is a significant drop in instructions per cycle (IPC) due to the system call, and it takes up to 14,000 cycles of execution before the IPC of this application returns to its previous level. As we will show, this performance degradation is mainly due to interference caused by the kernel on key processor structures. To improve locality in the execution of system intensive workloads, we propose a new operating system mechanism: the exception-less system call. An exception-less system call is a mechanism for requesting kernel services that does not require the use of synchronous processor exceptions. In our implementation, system calls are issued by writing kernel requests to a reserved syscall page, using normal memory store operations. The actual execution of system calls is performed asynchronously by special in-kernel syscall threads, which post the results of system calls to the syscall page after their completion. Decoupling the system call execution from its invocation creates the possibility for flexible system call scheduling, offering optimizations along two dimensions. The first optimization allows for the deferred batch execution of system calls resulting in increased temporal locality of execution. The second provides the ability to execute system calls on a separate core, in parallel to executing usermode threads, resulting in spatial, per core locality. In both cases, system call threads become a simple, but powerful abstraction. One interesting feature of the proposed decoupled system call model is the possibility of dynamic core specialization in multicore systems. Cores can become temporarily specialized for either user-mode or kernel-mode execution, depending on the current system load. We describe how the operating system kernel can dynamically adapt core specialization to the demands of the workload. One important challenge of our proposed system is how to best use the exception-less system call interface. One option is to rewrite applications to directly interface with the exception-less system call mechanism. We believe the lessons learned by the systems community with eventdriven servers indicate that directly using exception-less system calls would be a daunting software engineering task. For this reason, we propose a new M -on-N threading package (M user-mode threads executing on N kernel-visible threads, with M >> N ). The main purpose of this threading package is to harvest independent system calls by switching threads, in user-mode, whenever a thread invokes a system call. This research makes the following contributions: • We quantify, at fine granularity, the impact of synchronous mode switches and system call execution on the micro-architectural processor structures, as well as on the overall performance of user-mode execution. • We propose a new operating system mechanism, the exception-less system call, and describe an implementation, FlexSC1 , in the Linux kernel. • We present a M -on-N threading system, compatible with PThreads, that transparently uses the new exception-less system call facility. • We show how exception-less system calls coupled with our M -on-N threading system improves performance of important system-intensive highly threaded workloads: Apache by up to 116%, MySQL by to 40%, and BIND by up to 105%. 2 The (Real) Costs of System Calls In this section, we analyze the performance costs associated with a traditional, synchronous system call. We analyze these costs in terms of mode switch time, the system call footprint, and the effect on user-mode and kernelmode IPC. We used the Linux operating system kernel and an Intel Nehalem (Core i7) processor, along with its performance counters to obtain our measurements. However, we believe the lessons learned are applicable to most modern high-performance processors2 and other operating system kernels. 2.1 Mode Switch Cost Traditionally, the performance cost attributed to system calls is the mode switch time. The mode switch time consists of the time necessary to execute the appropriate system call instruction in user-mode, resuming execution in an elevated protection domain (kernel-mode), and the return of control back to user-mode. Modern processors implement the mode switch as a processor exception: flushing the user-mode pipeline, saving a few registers onto the 1 Pronounced as “flex” (/’fleks/). 2 Experiments performed on an older PowerPC 970 processor yielded similar insights than the ones presented here. Syscall stat pread pwrite open+close mmap+munmap open+write+close Instructions 4972 3739 5689 6631 8977 9921 Cycles 13585 12300 31285 19162 19079 32815 IPC 0.37 0.30 0.18 0.34 0.47 0.30 i-cache 32 32 50 47 41 78 d-cache 186 294 373 240 233 481 L2 660 679 985 900 869 1462 L3 2559 2160 3160 3534 3913 5105 d-TLB 21 20 44 28 7 49 Table 1: System call footprint of different processor structures. For the processors structures (caches and TLB), the numbers represent number of entries evicted; the cache line for the processor is of 64-bytes. i-cache and d-cache refer to the instruction and data sections of the L1 cache, respectively. The d-TLB represents the data portion of the TLB. kernel stack, changing the protection domain, and redirecting execution to the registered exception handler. Subsequently, return from exception is necessary to resume execution in user-mode. We measured the mode switch time by implementing a new system call, gettsc that obtains the time stamp counter of the processor and immediately returns to user-mode. We created a simple benchmark that invoked gettsc 1 billion times, recording the time-stamp before and after each call. The difference between each of the three time-stamps identifies the number of cycles necessary to enter and leave the operating system kernel, namely 79 cycles and 71 cycles, respectively. The total round-trip time for the gettsc system call is modest at 150 cycles, being less than the latency of a memory access that misses the processor caches (250 cycles on our machine).3 2.2 System Call Footprint The mode switch time, however, is only part of the cost of a system call. During kernel-mode execution, processor structures including the L1 data and instruction caches, translation look-aside buffers (TLB), branch prediction tables, prefetch buffers, as well as larger unified caches (L2 and L3), are populated with kernel specific state. The replacement of user-mode processor state by kernel-mode processor state is referred to as the processor state pollution caused by a system call. To quantify the pollution caused by system calls, we used the Core i7 hardware performance counters (HPC). We ran a high instruction per cycle (IPC) workload, Xalan, from the SPEC CPU 2006 benchmark suite that is known to invoke few system calls. We configured an HPC to trigger infrequently (once every 10 million usermode instructions) so that the processor structures would be dominated with application state. We then set up the HPC exception handler to execute specific system calls, while measuring the replacement of application state in the processor structures caused by kernel execution (but not by the performance counter exception handler itself). 3 For all experiments presented in this paper, user-mode applications execute in 64-bit mode and when using synchronous system calls, use the “syscall” x86 64 instruction, which is currently the default in Linux. Table 1 shows the footprint on several processor structures for three different system calls and three system call combinations. The data shows that, even though the number of i-cache lines replaced is modest (between 2 and 5 KB), the number of d-cache lines replaced is significant. Given that the size of the d-cache on this processor is 32 KB, we see that the system calls listed pollute at least half of the d-cache, and almost all of the d-cache in the “open+write+close” case. The 64 entry first level dTLB is also significantly polluted by most system calls. Finally, it is interesting to note that the system call impact on the L2 and L3 caches is larger than on the L1 caches, primarily because the L2 and L3 caches use more aggressive prefetching. 2.3 System Call Impact on User IPC Ultimately, the most important measure of the real cost of system calls is the performance impact on the application. To quantify this, we executed an experiment similar to the one described in the previous subsection. However, instead of measuring kernel-mode events, we only measured user-mode instructions per cycle (IPC), ignoring all kernel execution. Ideally, user-mode IPC should not decrease as a result of invoking system calls, since the cycles and instructions executed as part of the system call are ignored in our measurements. In practice, however, user-mode IPC is affected by two sources of overhead: Direct: The processor exception associated with the system call instruction that flushes the processor pipeline. Indirect: System call pollution on the processor structures, as quantified in Table 1. Figures 2 and 3 show the degradation in user-mode IPC when running Xalan (from SPEC CPU 2006) and SPECJBB, respectively, given different frequencies of pwrite calls. These benchmarks were chosen since they have been created to avoid significant use of system services, and should spend only 1-2% of time executing in kernelmode. The graphs show that different workloads can have different sensitivities to system call pollution. Xalan has a baseline user-mode IPC of 1.46, but the IPC degrades by up to 65% when executing a pwrite every 1,0002,000 instructions, yielding an IPC between 0.58 and 0.50. 80% 70% 50% Degradation (lower is faster) Degradation (lower is faster) 70% Indirect Direct 60% 40% 30% 20% 10% 0% 1K 50% 40% 30% 20% 10% 0% 2K 5K 10K 20K 50K 100 100K 500K Figure 2: System call (pwrite) impact on user-mode IPC as a function of system call frequency for Xalan. 50% Indirect Direct 40% 30% 500 1K 2K 5K 10K 20K 50K 100K 500K instructions between interrupts (log scale) instructions between interrupts (log scale) Degradation (lower is faster) 60% Figure 4: System call (pwrite), impact on kernel-mode IPCs for x as a function of system call frequency. JBB, the data presented here indicates that server workloads suffer from significant performance degradation due to processor pollution of system calls. 20% 10% 0% 1K 2.4 2K 5K 10K 20K 50K 100K 500K instructions between interrupts (log scale) Figure 3: System call (pwrite) impact on user-mode IPC as a function of system call frequency for SPEC JBB. SPEC-JBB has a slightly lower baseline of 0.97, but still observes a 45% degradation of user-mode IPC. The figures also depict the breakdown of user-mode IPC degradation due to direct and indirect costs. The degradation due to the direct cost was measured by issuing a null system call, while the indirect portion is calculated subtracting the direct cost from the degradation measured when issuing a pwrite system call. For high frequency system call invocation (once every 2,000 instructions, or less), the direct cost of raising an exception and subsequent flushing of the processor pipeline is the largest source of user-mode IPC degradation. However, for medium frequencies of system call invocation (once per 2,000 to 100,000 instructions), the indirect cost of system calls is the dominant source of user-mode IPC degradation. To understand the implication of these results on typical server workloads, it is necessary to quantify the system call frequency of these workloads. The average usermode instruction count between consecutive system calls for three popular server workloads are shown in Table 2. For this frequency range in Figures 2 and 3 we observe user-mode IPC performance degradation between 20% and 60%. While the excecution of the server workloads listed in Table 2 is not identical to that of Xalan or SPECWorkload (server) DNSbench (BIND) ApacheBench (Apache) Sysbench (MySQL) Instructions per Syscall 2445 3368 12435 Table 2: The average number of instructions executed on different workloads before issuing a syscall. Mode Switching Cost on Kernel IPC The lack of locality due to frequent mode switches also negatively affects kernel-mode IPC. Figure 4 shows the impact of different system call frequencies on the kernelmode IPC. As expected, the performance trend is opposite to that of user-mode execution. The more frequent the system calls, the more kernel state is maintained in the processor. Note that the kernel-mode IPC listed in Table 1 for different system calls ranges from 0.18 to 0.47, with an average of 0.32. This is significantly lower than the 1.47 and 0.97 user-mode IPC for Xalan and SPEC-JBB, respectively; up to 8x slower. 3 Exception-Less System Calls To address (and partially eliminate) the performance impact of traditional, synchronous system calls on system intensive workloads, we propose a new operating system mechanism called exception-less system call. Exceptionless system call is a mechanism for requesting kernel services that does not require the use of synchronous processor exceptions. The key benefit of exception-less system calls is the flexibility in scheduling system call execution, ultimately providing improved locality of execution of both user and kernel code. We explore two use cases: System call batching: Delaying the execution of a series of system calls and executing them in batches minimizes the frequency of switching between user and kernel execution, eliminating some of the mode switch overhead and allowing for improved temporal locality. This improves both the direct and indirect costs of system calls. Core specialization: In multicore systems, exceptionless system calls allow a system call to be scheduled on a core different than the one on which the system call was invoked. Scheduling system calls on a separate processor core allows for improved spatial locality and with it lower sys call page User Kernel (a) Traditional, sync. system call (b) Exception-less system call Figure 5: Illustration of synchronous and exception-less system call invocation. The left diagram shows the sequential nature of exception-based system calls, while the right diagram depicts exception-less user and kernel communication through shared memory. syscall number number of arguments status arg 0 arg 6 return value Figure 6: 64-byte syscall entry from the syscall page. indirect costs. In an ideal scenario, no mode switches are necessary, eliminating the direct cost of system calls. The design of exception-less system calls consists of two components: (1) an exception-less interface for userspace threads to register system calls, along with (2) an in-kernel threading system that allows the delayed (asynchronous) execution of system calls, without interrupting or blocking the thread in user-space. 3.1 Exception-Less Syscall Interface The interface for exception-less system calls is simply a set of memory pages that is shared amongst user and kernel space. The shared memory page, henceforth referred to as syscall page, is organized to contain exception-less system call entries. Each entry contains space for the request status, system call number, arguments, and return value. With traditional synchronous system calls, invocation occurs by populating predefined registers with system call information and issuing a specific machine instruction that immediately raises an exception. In contrast, to issue an exception-less system call, the user-space threads must find a free entry in the syscall page and populate the entry with the appropriate values using regular store instructions. The user-space thread can then continue executing without interruption. It is the responsibility of the userspace thread to later verify the completion of the system call by reading the status information in the entry. None of these operations, issuing a system call or verifying its completion, causes exceptions to be raised. 3.2 Syscall Pages Syscall pages can be viewed as a table of syscall entries, each containing information specific to a single system call request, including the system call number, arguments, status (free/submitted/busy/done), and the result (Figure 6). In our 64-bit implementation, we have organized each entry to occupy 64 bytes. This size comes from the Linux ABI which allows any system call to have up to 6 arguments, and a return value, totalling 56 bytes. Although the remaining 3 fields (syscall number, status and number of arguments) could be packed in less than the remaining 8 bytes, we selected 64 bytes because 64 is a divisor of popular cache line sizes of today’s processor. To issue an exception-less system call, the user-space thread must find an entry in one of its syscall pages that contain a free status field. It then writes the syscall number and arguments to the entry. Lastly, the status field is changed to submitted4 , indicating to the kernel that the request is ready for execution. The thread must then check the status of the entry until it becomes done, consume the return value, and finally set the status of the entry to free. 3.3 Decoupling Execution from Invocation Along with the exception-less interface, the operating system kernel must support delayed execution of system calls. Unlike exception-based system calls, the exceptionless system call interface does not result in an explicit kernel notification, nor does it provide an execution stack. To support decoupled system call execution, we use a special type of kernel thread, which we call syscall thread. Syscall threads always execute in kernel mode, and their sole purpose is to pull requests from syscall pages and execute them on behalf of the user-space thread. Figure 5 illustrates the difference between traditional synchronous system calls, and our proposed split system call model. The combination of the exception-less system call interface and independent syscall threads allows for great flexibility in the scheduling the execution of system calls. Syscall threads may wake up only after user-space is unable to make further progress, in order to achieve temporal locality of execution on the processor. Orthogonally, syscall threads can be scheduled on a different processor core than that of the user-space thread, allowing for spatial locality of execution. On modern multicore processors, cache to cache communication is relatively fast (in the order of 10s of cycles), so communicating the entries of syscall pages from a user-space core to a kernel core, or vice-versa, should only cause a small number of processor stalls. 3.4 Implementation – FlexSC Our implementation of the exception-less system call mechanism is called FlexSC (Flexible System Call) and was prototyped as an extension to the Linux kernel. Although our implementation was influenced by a mono4 User-space must update the status field last, with an appropriate memory barrier, to prevent the kernel from selecting incomplete syscall entries to execute. lithic kernel architecture, we believe that most of our design could be effective with other kernel architectures, e.g., exception-less micro-kernel IPCs, and hypercalls in a paravirtualized environment. We have implemented FlexSC for the x86 64 and PowerPC64 processor architectures. Porting FlexSC to other architectures is trivial; a single function is needed, which moves arguments from the syscall page to appropriate registers, according to the ABI of the processor architecture. Two new system calls were added to Linux as part of FlexSC, flexsc register and flexsc wait. flexsc register() This system call is used by processes that wish to use the FlexSC facility. Making this registration procedure explicit is not strictly necessary, as processes can be registered with FlexSC upon creation. We chose to make it explicit mainly for convenience of prototyping, giving us more control and flexibility in userspace. One legitimate reason for making registration explicit is to avoid the extra initialization overheads incurred for processes that do not use exception-less system calls. Invocation of the flexsc register system call must use the traditional, exception-based system call interface to avoid complex bootstrapping; however, since this system call needs to execute only once, it does not impact application performance. Registration involves two steps: mapping one or more syscall pages into user-space virtual memory space, and spawning one syscall thread per entry in the syscall pages. flexsc wait() The decoupled execution model of exception-less system calls creates a challenge in userspace execution, namely what to do when the user-space thread has nothing more to execute and is waiting on pending system calls. With the proposed execution model, the OS kernel loses the ability to determine when a userspace thread should be put to sleep. With synchronous system calls, this is simply achieved by putting the thread to sleep while it is executing a system call if the call blocks waiting for a resource. The solution we adopted is to require that the user explicitly communicate to the kernel that it cannot progress until one of the issued system calls completes by invoking the flexsc wait system call. We implemented flexsc wait as an exception-based system call, since execution should be synchronously directed to the kernel. FlexSC will later wake up the user-space thread when at least one of posted system calls are complete. 3.5 Syscall Threads Syscall threads is the mechanism used by FlexSC to allow for exception-less execution of system calls. The Linux system call execution model has influenced some implementation aspects of syscall threads in FlexSC: (1) the virtual address space in which system call execution occurs is the address space of the corresponding process, and (2) the current thread context can be used to block execution should a necessary resource not be available (for example, waiting for I/O). To resolve the virtual address space requirement, syscall threads are created during flexsc register. Syscall threads are thus “cloned” from the registering process, resulting in threads that share the original virtual address space. This allows the transfer of data from/to userspace with no modification to Linux’s code. FlexSC would ideally never allow a syscall thread to sleep. If a resource is not currently available, notification of the resource becoming available should be arranged, and execution of the next pending system call should begin. However, implementing this behavior in Linux would require significant changes and a departure from the basic Linux architecture. Instead, we adopted a strategy that allows FlexSC to maintain the Linux thread blocking architecture, as well as requiring only minor modifications (3 lines of code) to Linux context switching code, by creating multiple syscall threads for each process that registers with FlexSC. In fact, FlexSC spawns as many syscall threads as there are entries available in the syscall pages mapped in the process. This provisions for the worst case where every pending system call blocks during execution. Spawning hundreds of syscall threads may seem expensive, but Linux in-kernel threads are typically much lighter weight than user threads: all that is needed is a task struct and a small, 2-page, stack for execution. All the other structures (page table, file table, etc.) are shared with the user process. In total, only 10KB of memory is needed per syscall thread. Despite spawning multiple threads, only one syscall thread is active per application and core at any given point in time. If system calls do not block all the work is executed by a single syscall thread, while the remaining ones sleep on a work-queue. When a syscall thread needs to block, for whatever reason, immediately before it is put to sleep, FlexSC notifies the work-queue. Another thread wakes-up and immediately starts executing the next system call. Later, when resources become free, current Linux code wakes up the waiting thread (in our case, a syscall thread), and resumes its execution, so it can post its result to the syscall page and return to wait in the FlexSC work-queue. 3.6 FlexSC Syscall Thread Scheduler FlexSC implements a syscall thread scheduler that is responsible for determining when and on which core system calls will execute. This scheduler is critical to performance, as it influences the locality of user and kernel execution. On a single-core environment, the FlexSC scheduler assumes the user-space will attempt to post as many exception-less system calls as possible, and subsequently call flexsc wait(). The FlexSC scheduler then wakes up an available syscall thread that starts executing the first system call. If the system call does not block, the same syscall thread continues to execute the next submitted syscall entry. If the execution of a syscall thread blocks, the currently scheduled syscall thread notifies the scheduler to wake another thread to continue to execute more system calls. The scheduler does not wake up the user-space thread until all available system calls have been issued, and have either finished or are currently blocked with at least one system call having been completed. This is done to minimize the number of mode switches to userspace. For multicore execution, the scheduler biases execution of syscall threads on a subset of available cores, dynamically specializing cores according to the workload requirements. In our current implementation, this is done by attempting to schedule syscall threads using a predetermined, static list of cores. Upon a scheduling decision, the first core on the list is selected. If a syscall thread of a process is currently running on that core, the next core on the list is selected as the target. If the selected core is not currently executing a syscall thread, an inter-processor interrupt is sent to the remote core, signalling that it must wake a syscall thread. As previously described, there is never more than one syscall thread concurrently executing per core, for a given process. However in the multicore case, for the same process, there can be as many syscall threads as cores concurrently executing on the entire system. To avoid cacheline contention of syscall pages amongst cores, before a syscall thread begins executing calls from a syscall page, it locks the page until all its submitted calls have been issued. Since FlexSC processes typically map multiple syscall pages, each core on the system can schedule a syscall thread to work independently, executing calls from different syscall pages. 4 System Calls Galore – FlexSC-Threads Exception-less system calls present a significant change to the semantics of the system call interface with potentially drastic implications for application code and programmers. Programming using exception-less system calls directly is more complex than using synchronous system calls, as they do not provide the same, easy-to-reasonabout sequentiality. In fact, our experience is that programming using exception-less system calls is akin to event-driven programming, which has itself been criticized for being a complex programming model [21]. The main difference is that with exception-less system calls, not only are I/O related calls scheduled for future completion, any system calls can be requested, verified for completion, and handled, as if it were an asynchronous event. To address the programming complexities, we propose the use of exception-less system calls in two different modes that might be used depending on the concurrency model adopted by the programmer. We argue that if used according to our recommendations, exception-less system calls should pose no more complexity than their synchronous counter-parts. 4.1 Event-driven Servers, a Case for Hybrid Execution For event-driven systems, we advocate a hybrid approach where both synchronous and exception-less system calls coexist. System calls that are executed in performance critical paths of applications should use exception-less calls while all other calls should be synchronous. After all, there is no good justification to make a simple getpid() complex to program. Event-driven servers already have their code structured so that performance critical paths of execution are split into three parts: request event, wait for completion and handle event. Adapting an event-driven server to use exception-less system calls, for the already considered events, should be straightforward. However, we have not yet attempted to evaluate the use of exception-less system calls in an event-driven program, and leave this as future work. 4.2 FlexSC-Threads Multiprocessing has become the default for computation on servers. With the emergence and ubiquity of multicore processors, along with projection of future chip manufacturing technologies, it is unlikely that this trend will reverse in the medium future. For this reason, and because of its relative simplicity vis-a-vis event-based programming, we believe that the multithreading concurrency model will continue to be the norm. In this section, we describe the design and implementation of FlexSC-Threads, a threading package that transforms legacy synchronous system calls into exceptionless ones transparently to applications. It is intended for server-type applications with many user-mode threads, such as Apache or MySQL. FlexSC-Threads is compliant with POSIX Threads, and binary compatible with NPTL [8], the default Linux thread library. As a result, Linux multi-threaded programs work with FlexSCThreads “out of the box” without modification or recompilation. FlexSC-Threads uses a simple M -on-N threading model (M user-mode threads executing on N kernel- user-mode switch one kernel-visible thread per core multiple user-mode threads sys call pages User Kernel z sys call pages multiple syscall threads per core z z z z flexsc_wait() sys call pages z Figure 7: The left-most diagram depicts the components of FlexSC-Threads pertaining to a single core. Each core executes a pinned kernel-visible thread, which in turn can multiplex multiple user-mode threads. Multiple syscall pages, and consequently syscall threads, are also allocated (and pinned) per core. The middle diagram depicts a user-mode thread being preempted as a result of issuing a system call. The right-most diagram depicts the scenario where all user-mode threads are waiting for system call requests; in this case FlexSC-Threads library synchronously invokes flexsc wait() to the kernel. User sys call pages sys call pages Kernel Core 0 Core 1 Figure 8: Multicore example. Opaque threads are active, while grayed-out threads are inactive. Syscall pages are accessible to both cores, as we run using shared-memory, leveraging the fast on-chip communication of multicores. visible threads). We rely on the ability to perform usermode thread switching solely in user-space to transparently transform legacy synchronous calls into exceptionless ones. This is done as follows: 1. We redirect to our library each libc call that issues a legacy system call. Typically, applications do not directly embed code to issue system calls, but instead call wrappers in the dynamically loaded libc. We use the dynamic loading capabilities of Linux to redirect execution of such calls to our library. 2. FlexSC-Threads then post the corresponding exception-less system call to a syscall page and switch to another user-mode thread that is ready. 3. If we run out of ready user-mode threads, FlexSC checks the syscall page for any syscall entries that have been completed, waking up the appropriate user-mode thread so it can obtain the result of the completed system call. 4. As a last resort, flexsc wait() is called, putting the kernel visible thread to sleep until one of the pending system calls has completed. specific core. Multiple user-mode threads multiplex execution on the kernel visible thread. Since kernel-visitble threads only block when there is no more available work, there is no need to create more than one kernel visible thread per core. Figure 7 depicts the components of FlexSC-Threads and how they interact during execution. As an optimization, we have designed FlexSC-Threads to register a private set of syscall pages per kernel visible thread (i.e., per core). Since syscall pages are private to each core, there is no need to synchronize their access with costly atomic instructions. The FlexSCThreads user-mode scheduler implements a simple form of cooperative scheduling, with system calls acting as yield points. Consequently, syscall pages behave as lockfree single-producer (kernel-visible thread) and singleconsumer (syscall thread) data structures. From the kernel side, although syscall threads are pinned to specific cores, they do not only execute system call requests from syscall pages registered to that core. An example of this is shown in Figure 8, where user-mode threads execute on core 0, while syscall threads running on core 1 are satisfying system call requests. It is important to note that FlexSC-Threads relies on a large number of independent user-mode threads to post concurrent exception-less system calls. Since threads are executing independently, there is no constraint on ordering or serialization of system call execution (thread-safety constraints should be enforced at the application level and is orthogonal to the system call execution model). FlexSC-Threads leverages the independent requests to efficiently schedule operating system work on single or multicore systems. For this reason, highly threaded workloads, such as internet/network servers, are ideal candidates for FlexSC-Threads. 5 FlexSC-Threads implements multicore support by creating a single kernel visible thread per core available to the process, and pinning each kernel visible thread to a Experimental Evaluation We first present the results of a microbenchmark that shows the overhead of the basic exception-less system Specification 4 64 B for all caches 32 KB, 3 cycle latency 32 KB, 4 cycle latency 512 KB, 11 cycle latency 8 MB, 35-40 cycle latency 250 cycle latency (avg.) 64 (data) + 64 (instr.) entries 512 entries 700 Time (nanoseconds) Component Cores Cache line Private L1 i-cache Private L1 d-cache Private L2 cache Shared L3 cache Memory TLB (L1) TLB (L2) 10 20 30 40 50 60 70 Time (nanoseconds) Number of batched requests Figure 9: Exception-less system call cost on a single-core. call mechanism, and then we show the performance of two popular server applications, Apache and MySQL, transparently using exception-less system calls through FlexSC-Threads. Finally, we analyze the sensitivity of the performance of FlexSC to the number of system call pages. FlexSC was implemented in the Linux kernel, version 2.6.33. The baseline line measurements we present were collected using unmodified Linux (same version), and the default native POSIX threading library (NPTL). We identify the baseline configuration as “sync”, and the system with exception-less system calls as “flexsc”. The experiments presented in this section were run on an Intel Nehalem (Core i7) processor with the characteristics shown in Table 3. The processor has 4 cores, each with 2 hyper-threads. We disabled the hyper-threads, as well as the “TurboBoost” feature, for all our experiments to more easily analyze the measurements obtained. For the Apache and MySQL experiments, requests were generated by a remote client connected to our test machine through a 1 Gbps network, using a dedicated router. The client machine contained a dual core Core2 processor, running the same Linux installation as the test machine, and was not CPU or network constrained in any of the experiments. All values reported in our evaluation represent the average of 5 separate runs. 5.1 400 300 200 100 0 10 20 30 40 50 60 70 Number of batched requests flexsc sync 0 500 0 Table 3: Characteristics of the 2.3GHz Core i7 processor. 90 80 70 60 50 40 30 20 10 0 flexsc sync (same core) 600 Overhead The overhead of executing an exception-less system call involves switching to a syscall thread, de-marshalling arguments from the appropriate syscall page entry, switch- Figure 10: Exception-less system call cost, in the worst case, for remote core execution. ing back to the user-thread, and retrieving the return value from the syscall page entry. To measure this overhead, we created a micro-benchmark that successively invokes a getppid() system call. Since the user and kernel footprints of this call is small, the time measured corresponds to the direct cost of issuing system calls. We varied the number of batched system calls, in the exception-less case, to verify if the direct costs are amortized when batching an increasing number of calls. The results obtained executing on a single core are shown in Figure 9. The baseline time, show as a horizontal line, is the time to execute an exception-based system call on a single core. Executing a single exception-less system call on a single core is 43% slower than a synchronous call. However, when batching 2 or more calls there is no overhead, and when batching 32 or more calls, the execution of each call is up to 130% faster than a synchronous call. We also measured the time to execute system calls on a remote core (Figure 10). In addition to the single core operations, remote core execution entails sending an interprocessor interrupt (IPI) to wake up the remote syscall thread. In the remote core case, the time to issue a single exception-less system call can be more than 10 times slower than a synchronous system call on the same core. This measurement represents a worst case scenario when there is no currently executing syscall thread. Despite the high overhead, the overhead on remote core execution is recouped when batching 32 or more system calls. 5.2 Apache We used Apache version 2.2.15 to evaluate the performance of FlexSC-Threads. Since FlexSC-Threads is binary compatible with NPTL, we used the same Apache binary for both FlexSC and Linux/NPTL experiments. We configured Apache to use a different maximum number of spawned threads for each case. The performance of Apache running on NPTL degrades with too many threads, and we experimentally determined that 200 was optimal for our workload and hence used that configuration for the NPTL case. For the FlexSC-Threads case, we raised the maximum number of threads to 1000. The workload we used was ApacheBench, a HTTP 35000 30000 25000 20000 15000 10000 5000 0 0 200 400 600 800 1000 45000 flexsc sync 40000 35000 30000 25000 20000 15000 10000 5000 0 0 200 Request Concurrency 400 600 800 Request Concurrency (a) 1 Core (b) 2 Cores 1000 Throughput (requests/sec.) flexsc sync 40000 Throughput (requests/sec.) Throughput (requests/sec.) 45000 45000 40000 35000 30000 25000 20000 15000 10000 flexsc sync 5000 0 0 200 400 600 800 1000 Request Concurrency (c) 4 Cores Figure 11: Comparison of Apache throughput of Linux/NPTL and FlexSC executing on 1, 2 and 4 cores. 100% 100% 90% 90% 36% 80% 80% 50% 70% 70% 60% 50% 40% idle user kernel 37% 22% 30% 0% 76% 86% 50% 40% idle user kernel 30% 20% 10% Time Time 60% 20% 28% 27% sync flexsc 10% 0% (a) Apache 24% 14% sync flexsc (b) MySQL Figure 12: Breakdown of execution time of Apache and MySQL workloads on 4 cores. 25 sync flexsc Latency (ms) 20 15 10 5 0 1 core 2 cores 0 cores Figure 13: Comparison of Apache latency of Linux/NPTL and FlexSC executing on 1, 2 and 4 cores, with 256 concurrent requests. workload generator that is distributed with Apache. It is designed to stress-test the Web server determining the number of requests per second that can be serviced, with varying number of concurrent requests. Figure 11 shows the results of Apache running on 1, 2 and 4 cores. For the single core experiments, FlexSC employs system call batching, and for the multicore experiments it additionally dynamically redirects system calls to maximize core locality. The results show that, except for a very low number of concurrent requests, FlexSC outperforms Linux/NPTL by a wide margin. With system call batching alone (1 core case), we observe a throughput improvement of up to 86%. The 2 and 4 core experiments show that FlexSC achieves up to 116% throughput improvement, showing the added benefit of dynamic core specialization. Table 4 shows the effects of FlexSC on the microarchitectural state of the processor while running Apache. It displays various processor metrics, collected using hardware performance counters during execution with 512 concurrent requests. The most important metric listed is the instruction per cycles (IPC) of the user and kernel mode for the different setups, as it summarizes the efficiency of execution. The other values listed are normalized values using misses per kilo-instructions (MPKI). MPKI is a widely used normalization method that makes it easy to compare values obtained from different executions. The most efficient execution of the four listed in the table is FlexSC on 1 core, yielding an IPC of 0.94 on both kernel and user execution, which is 95–108% higher than for NPTL. While the FlexSC execution of Apache on 4 cores is not as efficient as the single core case, with an average IPC of 0.75, there is still an 71% improvement, on average, over NPTL. Most metrics we collected are significantly improved with FlexSC. Of particular importance are the performance critical structures that have a high MPKI value on NPTL such as d-cache, i-cache, and L2 cache. The better use of these microarchitectural structures effectively demonstrates the premise of this work, namely that exception-less system calls can improve processor efficiency. The only structure which observes more misses on FlexSC is the user-mode TLB. We are currently investigating the reason for this. There is an interesting disparity between the throughput improvement (94%) and the IPC improvement (71%) in the 4 core case. The difference comes from the added benefit of localizing kernel execution with core specialization. Figure 12a shows the time breakdown of Apache executing on 4 cores. FlexSC execution yields significantly less idle time than the NPTL execution.5 The reduced idle time is a consequence of lowering the contention on a specific kernel semaphore. Linux protects address spaces with a per address-space read-write semaphore (mmap sem). Profiling shows that every Apache thread allocates and frees memory for serving requests, and both of these operations require the semaphore to be held with write permission. Further, the network code in Linux invokes copy user(), which transfers data in and out of the user address-space. This function verifies that the user-space memory is indeed valid, and to do so acquires 5 The execution of Apache on 1 or 2 core did not present idle time. Apache Setup sync (1 core) flexsc (1 core) sync (4 cores) flexsc (4 cores) IPC 0.48 0.94 0.45 0.74 L3 3.7 1.7 3.9 1.0 L2 68.9 27.5 64.6 37.5 User d-cache i-cache TLB 63.8 130.8 7.7 35.3 41.3 8.8 67.9 127.6 9.6 55.5 49.4 19.3 Branch 20.9 12.6 20.2 13.0 IPC 0.45 0.94 0.43 0.76 L3 1.4 1.0 4.4 1.5 L2 80.0 15.8 49.5 19.1 Kernel d-cache i-cache TLB 78.2 159.6 4.6 31.6 45.2 3.3 73.8 124.9 4.4 50.2 63.7 4.2 Branch 15.7 11.2 15.2 11.6 Table 4: Micro-architectural breakdown of Apache execution on uni- and quad-core setups. All values shown, except for IPC, are normalized using misses per kilo-instruction (MPKI): therefore, lower numbers yield more efficient execution and higher IPC. the semaphore with read permissions. In the NPTL case, threads from all 4 cores compete on this semaphore, resulting in 50% idle time. With FlexSC, kernel code is dynamically scheduled to run predominantly on 2 out of the 4 cores, halving the contention to this resource, eliminating 38% of the original idle time. Another important metric for servicing Web requests besides throughput is latency of individual requests. One might intuitively expect that latency of requests to be higher under FlexSC because of batching and asynchronous servicing of system calls, but the opposite is the case. Figure 13 shows the average latency of requests when processing 256 concurrent requests (other concurrency levels showed similar trends). The results show that Web requests on FlexSC are serviced within 50-60% of the time needed on NPTL, on average. 5.3 MySQL In the previous section, we demonstrated the effectiveness of FlexSC running on a workload with a significant proportion of kernel time. In this section, we experiment with OLTP on MySQL, a workload for which the proportion of kernel execution is smaller (roughly 25%). Our evaluation used MySQL version 5.5.4 with an InnoDB backend engine, and as in the Apache evaluation, we used the same binary for running on NPTL and on FlexSC. We also used the same configuration parameters for both the NPTL and FlexSC experiments, after tuning them for the best NPTL performance. To generate requests to MySQL, we used the sysbench system benchmark utility. Sysbench was created for benchmarking MySQL processor performance and contains an OLTP inspired workload generator. The benchmark allows executing concurrent requests by spawning multiple client threads, connecting to the server, and sequentially issuing SQL queries. To handle the concurrent clients, MySQL spawns a user-level thread per connection. At the end, sysbench reports the number of transactions per second executed by the database, as well as average latency information. For these experiments, we used a database with 5M rows, resulting in 1.2 GB of data. Since we were interested in stressing the CPU component of MySQL, we disabled synchronous transactions to disk. Given that the configured database was small enough to fit in memory, the workload presented no idle time due to disk I/O. Figure 14 shows the throughput numbers obtained on 1, 2 and 4 cores when varying the number of concurrent client threads issuing requests to the MySQL server.6 For this workload, system batching on one core provides modest improvements: up to 14% with 256 concurrent requests. On 2 and 4 cores, however, we see that FlexSC provides a consistent improvement with 16 or more concurrent clients, achieving up to 37%-40% higher throughput. Table 5 contains the microarchitectural processor metrics collected for the execution of MySQL. Because MySQL invokes the kernel less frequently than Apache, kernel execution yields high miss rates, resulting in a low IPC of 0.33 on NPTL. In the single core case, FlexSC does not greatly alter the execution of user-space, but increases kernel IPC by 36%. FlexSC allows the kernel to reuse state in the processor structures, yielding lower misses across most metrics. In the case of 4 cores, FlexSC also improves the performance of user-space IPC by as much as 30%, compared to NPTL. Despite making less of an impact in the kernel IPC than in single core execution, there is still a 25% kernel IPC improvement over NPTL. Figure 15 shows the average latencies of individual requests for MySQL execution with 256 concurrent clients. As is the case with Apache, the latency of requests on FlexSC is improved over execution on NPTL. Requests on FlexSC are satisfied within 70-88% of the time used by requests on NPTL. 5.4 Sensitivity Analysis In all experiments presented so far, FlexSC was configured to have 8 system call pages per core, allowing up to 512 concurrent exception-less system calls per core. Figure 16 shows the sensitivity of FlexSC to the number of available syscall entries. It depicts the throughput of Apache, on 1 and 4 cores, while servicing 2048 concurrent requests per core, so that there would always be more requests available than syscall entries. Uni-core performance approaches its best with 200 to 250 syscall entries 6 For both NPTL and FlexSC, increasing the load on MySQL yields peak throughput between 32 and 128 concurrent clients after which throughput degrades. The main reason for performance degradation is the costly and coarse synchronization used in MySQL. MySQL and Linux kernel developers have observed similar performance degradation. flexsc sync 0 50 100 150 200 250 300 700 600 500 400 300 200 flexsc sync 100 0 0 50 Request Concurrency 100 150 200 250 300 Throughput (requests/sec.) Throughput (requests/sec.) Throughput (requests/sec.) 500 450 400 350 300 250 200 150 100 50 0 1000 900 800 700 600 500 400 300 200 100 0 0 Request Concurrency (a) 1 Core flexsc sync 50 100 150 200 250 300 Request Concurrency (b) 2 Cores (c) 4 Cores Figure 14: Comparison of MySQL throughput of Linux/NPTL and FlexSC executing on 1, 2 and 4 cores. MySQL Setup sync (1 core) flexsc (1 core) sync (4 cores) flexsc (4 cores) IPC 1.12 1.10 0.55 0.72 L3 0.6 0.8 3.7 2.7 L2 21.1 19.6 15.8 16.7 User d-cache i-cache TLB 34.8 24.2 3.8 36.3 23.6 5.4 25.2 18.9 3.1 30.6 20.9 4.1 Branch 7.8 6.9 5.9 6.5 IPC 0.33 0.45 0.36 0.45 L3 16.5 23.2 16.6 18.4 L2 125.2 55.1 78.0 46.6 Kernel d-cache i-cache 209.6 184.9 131.9 86.5 147.0 120.0 104.4 63.5 TLB Branch 3.9 17.4 3.7 13.6 3.6 15.7 2.5 11.5 Table 5: Micro-architectural breakdown of MySQL execution on uni- and quad-core setups. All values shown, except for IPC, are normalized using misses per kilo-instruction (MPKI): therefore, lower numbers yield more efficient execution and higher IPC. 6 700 Latency (ms) 6.1 500 300 200 0 1 core 2 cores 4 cores Figure 15: Comparison of MySQL latency of Linux/NPTL and FlexSC executing on 1, 2 and 4 cores, with 256 concurrent requests. Throughput (requests/sec.) System Call Batching 400 100 50000 45000 40000 35000 30000 25000 20000 15000 10000 5000 0 4 cores 1 core 0 Related Work sync flexsc 600 100 200 300 400 500 600 Number of syscall entries (per core) Figure 16: Execution of Apache on FlexSC-Threads, showing the performance sensitivity of FlexSC to different number of syscall pages. Each syscall page contains 64 syscall entries. (3 to 4 syscall pages), while quad-core execution starts to plateau with 300 to 400 syscall entries (6 to 7 syscall pages). It is particularly interesting to compare Figure 16 with figures 9 and 10. The direct cost of mode switching, exemplified by the micro-benchmark, has a lesser effect on performance when compared to the indirect cost of mixing user- and kernel-mode execution. The idea of batching calls in order to save crossings has been extensively explored in the systems community. Specific to operating systems, multi-calls are used in both operating systems and paravirtualized hypervisors as a mechanism to address the high overhead of mode switching. Cassyopia is a compiler targeted at rewriting programs to collect many independent system calls, and submitting them as a single multi-call [18]. An interesting technique in Cassyopia, which could be eventually explored in conjunction with FlexSC, is the concept of a looped multi-call where the result of one system call can be automatically fed as an argument to another system call in the same multi-call. In the context of hypervisors, both Xen and VMware currently support a special multi-call hypercall feature [4][20]. An important difference between multi-calls and exception-less system calls is the level of flexibility exposed. The multi-call proposals do not investigate the possibility of parallel execution of system calls, or address the issue of blocking system calls. In multi-calls, system calls are executed sequentially; each system call must complete before a subsequent can be issued. With exception-less system calls, system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 6.2 Locality of Execution and Multicores Several researchers have studied the effects of operating system execution on application performance [1, 3, 7, 6, 11, 13]. Larus and Parkes also identified processor inef- ficiencies of server workloads, although not focusing on the interaction with the operating system. They proposed Cohort Scheduling to efficiently execute staged computations to improve locality of execution [11]. Techniques such as Soft Timers [3] and Lazy Receiver Processing [9] also address the issue of locality of execution, from the other side of the compute stack: handling device interrupts. Both techniques describe how to limit processor interference associated with interrupt handling, while not impacting the latency of servicing requests. Most similar to the multicore execution of FlexSC is Computation Spreading proposed by Chakraborty et. al [6]. They introduced processor modifications to allow for hardware migration of threads, and evaluated the effects on migrating threads upon entering the kernel to specialize cores. Their simulation-based results show an improvement of up to 20% on Apache, however, they explicitly do not model TLBs and provide for fast thread migration between cores. On current hardware, synchronous thread migration between cores requires a costly interprocessor interrupt. Recently, both Corey and Factored Operating System (fos) have proposed dedicating cores for specific operating system functionality [24, 25]. There are two main differences between the core specialization possible with these proposals and FlexSC. First, both Corey and fos require a micro-kernel design of the operating system kernel in order to execute specific kernel functionality on dedicated cores. Second, FlexSC can dynamically adapt the proportion of cores used by the kernel, or cores shared by user and kernel execution, depending on the current workload behavior. Explicit off-loading of select OS functionality to cores has also been studied for performance [15, 16] and power reduction in the presence of single-ISA heterogeneous multicores [14]. While these proposals rely on expensive inter-processor interrupts to offload system calls, we hope FlexSC can provide for a more efficient, and flexible, mechanism that can be used by such proposals. 6.3 Non-blocking Execution Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers have analyzed the use of threading, event-based (non-blocking), and hybrid systems for achieving high performance on server applications [2, 10, 17, 21, 22, 23]. Capriccio described techniques to improve performance of user-level thread libraries for server applications [22]. Specifically, Behren et al. showed how to efficiently manage thread stacks, minimizing wasted space, and propose resource aware scheduling to improver server performance. For an extensive performance comparison of thread-based and event-driven Web server architectures we refer the reader to [17]. Finally, the Linux community has proposed a generic mechanism for implementing non-blocking system calls, which is call asynchronous system calls [5]. In their proposal, system calls are still exception-based, and tentatively execute synchronously. Like scheduler activations, if a blocking condition is detected, they utilize a “syslet” thread to block, allowing the user thread to continue execution. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system call proposals completely decouple the invocation of the system call from its execution. As we have discussed, the flexibility resulting from this decoupling is crucial for efficiently exploring optimizations such as system call batching and core specialization. 7 Concluding Remarks In this paper, we introduced the concept of exception-less system calls that decouples system call invocation from execution. This allows for flexible scheduling of system call execution which in turn enables system call batching and dynamic core specialization that both improve locality in a significant way. System calls are issued by writing kernel requests to a reserved syscall page using normal store operations, and they are executed by special inkernel syscall threads, which then post the results to the syscall page. In fact, the concept of exception-less system calls originated as a mechanism for low-latency communication between user and kernel-space with hyper-threaded processors in mind. We had hoped that communicating directly through the shared L1 cache would be much more efficient than mode switching. However, the measurements presented in Section 2 made it clear that mixing user and kernel-mode execution on the same core would not be efficient for server class workloads. In future work we intend to study how to exploit exception-less system calls as a communication mechanism in hyper-threaded processors. We presented our implementation of FlexSC, a Linux kernel extension, and FlexSC-Threads, a M -on-N threading package that is binary compatible with NPTL and that transparently transforms synchronous system calls into exception-less ones. With this implementation, we demonstrated how FlexSC improves throughput of Apache by up to 116% and MySQL by up to 40% while requiring no modifications to the applications. We believe these two workloads are representative of other highly threaded server workloads that would benefit from FlexSC. For example, experiments with the BIND DNS server demonstrated throughput improvements of between 30% and 105% depending on the concurrency of requests. In the current implementation of FlexSC, syscall threads process system call requests in no specific order, opportunistically issuing calls as they are posted on syscall pages. The asynchronous execution model, however, would allow for different selection algorithms. For example, syscall threads could sort the requests to consecutively execute requests of the same type, potentially yielding greater locality of execution. Also, system calls that perform I/O could be prioritized so as to issue them as early as possible. Finally, if a large number of cores are available, cores could be dedicated to specific system call types to promote further locality gains. 8 Acknowledgements This work was supported in part by Discovery Grant funding from the Natural Sciences and Engineering Research Council (NSERC) of Canada. We would like to thank the feedback from the OSDI reviewers, and to Emmett Witchel for shepherding our paper. Special thanks to Ioana Burcea for encouraging the work in its early stages, and the Computer Systems Lab members (University of Toronto), as well as Benjamin Gamsa, for insightful comments on the work and drafts of this paper. References [1] AGARWAL , A., H ENNESSY, J., AND H OROWITZ , M. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst. 6, 4 (1988), 393–431. [2] A NDERSON , T. E., B ERSHAD , B. N., L AZOWSKA , E. D., AND L EVY, H. M. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Trans. Comput. Syst. 10, 1 (1992), 53–79. [3] A RON , M., AND D RUSCHEL , P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228. [4] BARHAM , P., D RAGOVIC , B., F RASER , K., H AND , S., H AR RIS , T., H O , A., N EUGEBAUER , R., P RATT, I., AND WARFIELD , A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177. [5] B ROWN , Z. Asynchronous system calls. In Proceedings of the Ottawa Linux Symposium (OLS) (2007), pp. 81–85. [6] C HAKRABORTY, K., W ELLS , P. M., AND S OHI , G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292. [7] C HEN , J. B., AND B ERSHAD , B. N. The impact of operating system structure on memory system performance. In Proceedings of the 14th ACM Symposium on Operating Systems Principles (SOSP) (1993), pp. 120–133. [8] D REPPER , U., AND M OLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. http://people.redhat.com/drepper/nptl-design.pdf. [9] D RUSCHEL , P., AND BANGA , G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275. [10] E LMELEEGY, K., C HANDA , A., C OX , A. L., AND Z WAENEPOEL , W. Lazy asynchronous I/O for event-driven servers. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2004), pp. 21–21. [11] L ARUS , J., AND PARKES , M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114. [12] L I , T., J OHN , L. K., S IVASUBRAMANIAM , A., V IJAYKRISH NAN , N., AND RUBIO , J. Understanding and Improving Operating System Effects in Control Flow Prediction. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2002), pp. 68–80. [13] M OGUL , J. C., AND B ORG , A. The Effect of Context Switches on Cache Performance. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (1991), pp. 75–84. [14] M OGUL , J. C., M UDIGONDA , J., B INKERT, N., R AN GANATHAN , P., AND TALWAR , V. Using asymmetric single-ISA CMPs to save energy on operating systems. IEEE Micro 28, 3 (2008), 26–41. [15] N ELLANS , D., BALASUBRAMONIAN , R., AND B RUNVAND , E. OS execution on multi-cores: is out-sourcing worthwhile? SIGOPS Oper. Syst. Rev. 43, 2 (2009), 104–105. [16] N ELLANS , D., S UDAN , K., B RUNVAND , E., AND BALASUBRA MONIAN , R. Improving Server Performance on Multi-Cores via Selective Off-loading of OS Functionality. In Sixth Annual Workshorp on the Interaction between Operating Systems and Computer Architecture (WIOSCA) (2010), pp. 13–20. [17] PARIAG , D., B RECHT, T., H ARJI , A., B UHR , P., S HUKLA , A., AND C HERITON , D. R. Comparing the performance of Web server architectures. In Proceedings of the 2nd European Conference on Computer Systems (Eurosys) (2007), pp. 231–243. [18] R AJAGOPALAN , M., D EBRAY, S. K., H ILTUNEN , M. A., AND S CHLICHTING , R. D. Cassyopia: compiler assisted system optimization. In Proceedings of the 9th conference on Hot Topics in Operating Systems (HotOS) (2003), pp. 18–18. [19] R EDSTONE , J. A., E GGERS , S. J., AND L EVY, H. M. An analysis of operating system behavior on a simultaneous multithreaded architecture. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2000), pp. 245–256. [20] VMWARE. VMWare Virtual Machine Interface Specification. http://www.vmware.com/pdf/vmi specs.pdf. [21] VON B EHREN , R., C ONDIT, J., AND B REWER , E. Why Events Are A Bad Idea (for high-concurrency servers). In Proceedings of the 9th conference on Hot Topics in Operating Systems (HotOS) (2003). [22] VON B EHREN , R., C ONDIT, J., Z HOU , F., N ECULA , G. C., AND B REWER , E. Capriccio: scalable threads for internet services. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 268–281. [23] W ELSH , M., C ULLER , D., AND B REWER , E. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP) (2001), pp. 230–243. [24] W ENTZLAFF , D., AND AGARWAL , A. Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores. SIGOPS Oper. Syst. Rev. 43, 2 (2009), 76–85. [25] W ICKIZER , S. B., C HEN , H., C HEN , R., M AO , Y., K AASHOEK , F., M ORRIS , R., P ESTEREV, A., S TEIN , L., W U , M., DAI , Y., Z HANG , Y., AND Z HANG , Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2008).