Construction of A Highly Dependable Operating System
Construction of A Highly Dependable Operating System
Construction of A Highly Dependable Operating System
Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum
Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
{jnherder,herbertb,beng,philip,ast}@cs.vu.nl
Abstract
It has been well established that most operating system crashes are due to bugs in device drivers. Because drivers are normally linked into the kernel address
space, a buggy driver can wipe out kernel tables and
bring the system crashing to a grinding halt.
We have greatly mitigated this problem by reducing
the kernel to an absolute minimum and running each
driver as a separate, unprivileged user-mode process. In
addition, we implemented a POSIX-conformant operating system, MINIX 3, as multiple user-mode servers. In
this design, a server or driver failure no longer is fatal
and does not require rebooting the computer.
This paper discusses how we designed and implemented the system, which problems we encountered, and
how we solved these problems. We also discuss the performance effects of our changes and evaluate how our
multiserver design improves operating system dependability over monolithic designs.
Perfection is not achieved when there is nothing left
to add, but when there is nothing left to take away.
Antoine de Saint-Exupery [2]
1. Introduction
For the vast majority of computer users, the biggest
perceived problem with using PCs is that they are not dependable and frequently fail. For example, think of the
scenario where you buy a new peripheral device, say,
an Ethernet card, plug it in, and install the driver. The
network connection seems to work fine, but not much
later your computer suddenly crashes while downloading a file. Crashes reoccur at seemingly random times,
and you start to wonder whether your computer upgrade
has anything to do with it.
In our research, we try to address this situation and
prevent serious bugs in the operating system, such as a
device driver dereferencing an invalid pointer, winding
up in an infinite loop, or executing an illegal instruction,
from crashing or hanging the computer.
on top of a tiny microkernel. The microkernel is responsible for low-level operations such as interrupt handling,
scheduling, and programming the MMU and CPU. On
top of the microkernel, we implemented user-mode device drivers to manage the hardware, including drivers
for hard disk, printer, video, and audio. The user-mode
drivers are managed by the reincarnation server, whereas
the file server, network server and process manager provide operating system services to the application layer.
To the best of our knowledge no one before has put
all the pieces together to build a fully modular, opensource, POSIX-conformant UNIX clone that is far more
fault tolerant than normal UNIX systems, with a performance loss of only 5% to 10% compared to our base
system with drivers in the kernel. In addition, our approach differs from related efforts as we do not focus
on commodity operating systems. Rather than patching
legacy systems, we use a new, lightweight design that
can make future operating systems more dependable.
Shell
User
mode
File
Server
CC
Reinc.
Server
HardDisk
Driver
Kernel
mode
Find
Printer
Driver
PS
Xdm
TCP/IP
Stack
Video
Driver
Diff
Process
Manager
Audio
Driver
Isolated Programs
Microkernel
(Interrupts, MMU, Scheduling, IPC)
Figure 1: The design of our operating system is fully compartmentalized to properly isolate faults and enable recovery.
The loosely layered structure of our revitalized operating system is illustrated in Fig. 1. All applications,
servers, and drivers run as isolated, user-mode programs
When a driver is called, it is run on a different virtual machine than the main system so that a crash or
other fault does not pollute the main system. In addition to isolation, this technique enables unmodified
reuse of drivers when experimenting with new operating systems.
A recent project ran Linux device drivers in user
mode with small changes to the Linux kernel [12]. This
work shows that drivers can be isolated in user-mode
processes without significant performance degradation.
While isolating device drivers helps to improve the
dependability of legacy operating systems, we believe
that a proper, fully modular design from scratch gives
more dependability. This includes encapsulating all operating system servers and drivers in independent, usermode processes.
This classification was helpful to find general approaches to resolve entire classes of dependencies,
instead of ad hoc solutions for individual dependencies.
Class 1 dependencies occur because device drivers
do I/O. With the exception of the RAM disk driver, all
drivers do actual I/O, which requires reading and writing
I/O device registers. On most machines, doing I/O is not
possible in user mode. On some machines there may
be a way to map a page containing I/O registers to user
space or map some of the I/O ports to user space, but
tiny microkernel is left in kernel mode. All that this microkernel does is catch interrupts and convert each one
to a message, select the next process to run, load the selected process registers and MMU entries, and handle
the transport of fixed-size message between processes
using the rendezvous principle. Everything else is done
by user-space processes.
In this section, we describe what was needed to resolve the driver dependencies discussed in Sec. 3. We
also discuss how we operationally removed all drivers
from the kernel and how we transformed them into isolated, user-mode processes.
VDEVIO
VIRCOPY
IRQCTL
GETINFO
SETALARM
PRIVCTL
Purpose
Read or write a vector of I/O ports
Safe copy between address spaces
Set or reset an interrupt policy
Get a copy of kernel information
Set or reset a synchronous alarm
Restrict a process privileges
Figure 2: A selection of common kernel calls. All calls require privileged operations and are handled by SYS.
need access to environment variables that contain information about the systems hardware configuration or
user-specific settings passed through the boot monitor.
The SYS SETALARM call replaces the old method of
handling watchdog timershaving each driver call internal clock routines to schedule a future call of one of
its own internal functions. Instead, they now call the
clock driver asking for a notification to be sent when the
timer goes off. In the old system, the watchdog procedure ran as part of the clock process. Now it runs as part
of the callers process, a much cleaner design.
The SYS PRIVCTL call can be used to set the privileges of each process in the system. The call can only
be used by a privileged user-mode server, and is used,
for example, to restrict the I/O ports that can be used by
individual drivers. Other resources that can be restricted
are discussed in Sec. 5.
In addition to the new calls listed above, another
dozen low-level kernel calls exist, some carried over unmodified from the base system and some modified for
the new system. They include process management,
memory management, copying data between processes,
device I/O and interrupt management, access to kernel
data structures, and clock services.
ciated with the IRQ line, and sends a nonblocking notification message to each of them. Drivers that control
multiple IRQ lines, such as our serial line driver, can
specify an identifier for each IRQ line that is returned
in the notification message. This way the driver can directly tell different interrupt requests apart.
6. Performance Optimizations
Performance measurements comparing MINIX 2 and
MINIX 3 show that the average overhead introduced by
our changes is limited to 510% [8]. While user-mode
drivers introduce a performance hit, the overhead we
measured is also due to other changes, such as stricter
checks on all IPC. However, if performance is crucial,
there are other areas where gains are possible. In the
following subsections we will discuss some of them.
1.4
1.2
1
0.8
0.6
0.4
0.2
0
1
16
64
256
1024
transfer size (kB)
4096
16384
65536
the call from the file server to the kernel to move data
from its cache to the callers address space not use the
same principle. If a process asks for n blocks worth of
data, the file server should give the kernel a list of blocks
and say: Move all of these. Currently, the file server
makes a separate kernel call for each block.
Although we have not looked carefully yet, we suspect that there are other trade-offs to be made that
win back some performance. For example, the base
system, MINIX 2, was never optimized for performance. Instead, the source code was setup to be
well-structured and readable, even if betterbut more
complexalgorithms were available. The same holds
for MINIX 3. While the system probably can be optimized in various ways to boost its performance, we are
careful not to introduce unnecessary complexity.
1000 LoC, (also see Sec. 1.1), there are less than 50 bugs
in the kernel. With under 4000 lines of executable code
(LoC) the size of our kernel is much smaller than most
other microkernels. For example, L4 [11] contains over
10,000 LoC. While this difference is partly due to differences in functionality, we explicitly strive for simplicity.
A small and simple kernel means the people can actually
understand its full working, which also helps to get the
implementation correct over time.
In contrast, a monolithic system such as Linux with
2.5 million lines of executable code in the kernel is
likely to have at least 10 x 2500 = 25,000 bugs. For
Windows XP, which has about 5 million lines of kernel code, the problem is twice as bad. Moreover, with
multimillion-line systems, no person will ever read the
complete code and fully understand its working, which
can lead to servers and drivers interacting in poorly understood ways.
8. Conclusions
The primary achievement of the work reported here
is that we have actually built a highly dependable, opensource, POSIX-conformant, multiserver operating system, MINIX 3, that can be freely downloaded for inspection. The fully modular structure of our system is
shown in Fig. 1. We have discussed the implementation
process and the resulting design in detail and evaluated
its dependability and performance.
Our system is based on a microkernel whose complete source is under 4000 lines of executable code. This
code represents the total amount of code that runs in
kernel mode. To the best of our knowledge, this is by
far the smallest microkernel in existence that supports a
POSIX-conformant multiserver operating system in user
mode. It is also the only one that has each server and
each device driver running as a separate user-mode process, with many encapsulation facilities, and the ability
to withstand and often recover from failures that normally would be fatal.
We make no claim that we can catch every bug, but
we greatly improve the operating systems dependability
by structurally eliminating many different types of failures. A small and understandable kernel means fewer
fatal bugs and thus fewer crashes. Furthermore, while
shifting code to user space does not eliminate bugs,
it makes the consequences of failures less devastating
since they are encapsulated in highly restricted usermode processes. Moreover, most servers and all device
drivers of the operating system are monitored and often
can be automatically revived if a problem is detected.
Concluding, we have shown how we moved device
drivers to user space and minimized the amount of code
that runs in kernel mode in order to improve operating
system dependability.
9. Availability
The system is called MINIX 3 because we started
with MINIX 2 as a base and then modified it very heavily. It is free, open-source software, available via the
Internet. You can download MINIX 3 from the official homepage at: http://www.minix3.org/, which also
contains the source code, documentation, news, contributed software packages, and more. Over 75,000
people have downloaded the CD-ROM image in the
past few months, resulting in a large and growing user
community that communicates using the USENET newgroup comp.os.minix. MINIX 3 is actively being developed, and your help and feedback are much appreciated.
10. Acknowledgements
We would like to thank Mischa Geldermans, Jan
Looyen, Ruedi Weis, Bruno Crispo, Chandana Gamage,
and Carol Conti for their help and feedback.
Supported by the Netherlands Organization for Scientific Research (NWO) under grant 612-060-420.
References
[1] A. Chou and J. Yang and B. Chelf and S. Hallem and D.
Engler. An Empirical Study of Operating System Errors.
In Proc. 18th ACM Symp. on Oper. Syst. Prin., pages 73
88, 2001.
[2] A.-M. de Saint-Exupery. Wind, Sand, and Stars. Harcourt, Brace & Co, NY, 1940.
[3] A. Gefflaut, T. Jaeger, Y. Park, J. Liedtke, K. Elphinstone, V. Uhlig, J. Tidswell, L. Deller, and L. Reuther.
The SawMill Multiserver Approach. In ACM SIGOPS
European Workshop, pages 109114, Sept. 2000.
[4] H. Hartig and M. Hohmuth and J. Liedtke and S. Schonberg and J. Wolter. The Performance of -Kernel-Based
Systems. In Proc. 16th ACM Symp. on Oper. Syst. Prin.,
pages 6677, Oct. 1997.
[5] H. Hartig and R. Baumgartl and M. Borriss and Cl.-J.
Hamann and M. Hohmuth and F. Mehnert and L. Reuther
and S. Schonberg and J. Wolter. DROPS OS Support for Distributed Multimedia Applications. In Proc.
8th ACM SIGOPS European Workshop, pages 203209,
Sept. 1998.
[6] J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. S.
Tanenbaum. Modular System Programming in MINIX 3.
;login: The USENIX Magazine, 31(1):1928, Apr. 2006.
[7] J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. S.
Tanenbaum. Reorganizing UNIX for Reliability. In Proc.
11th Asia-Pacific Computer Systems Architecture Conference, Sept. 2006.
[8] J. N. Herder, H. Bos, and A. S. Tanenbaum. A
Lightweight Method for Building Reliable Operating
Systems Despite Unreliable Device Drivers. In Technical Report IR-CS-018 [www.cs.vu.nl/jnherder/ir-cs018.pdf], Vrije Universiteit, Jan. 2006.