Despite many decades of research, the management of errors in a live operating system remains a challenging problem. This thesis presents CuriOS, an operating system that incorporates several new error management techniques that significantly improve reliability. Errors detected by both hardware and software are signaled using language exception handling mechanisms. Unhandled exceptions do not crash the operating system; instead, they are dispatched to recovery routines.
The architecture of CuriOS is influenced by microkernel design principles. Individual operating system services are assigned separate protection domains. The componentization provided by traditional microkernel designs helps confine errors. However, an error that occurs in a microkernel operating system service can still result in state corruption and service failure. A simple restart of the failed service is not always the best solution for reliability. Blindly restarting a service that maintains client-related state, such as session information, results in the loss of this state and affects all clients that were using the service. CuriOS adopts a novel design that uses lightweight distribution, isolation and persistence of client-related state information maintained by operating system services. This helps mitigate the problem of state loss during a restart. The design also achieves inter-client isolation by curtailing error propagation within services.
Fault injection experiments show that it is possible to recover from 87% or more of the manifested errors in operating system services such as the file system, timer, scheduler and network, while maintaining low performance overheads.
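The restart-without-state-loss idea described above can be illustrated with a small user-level sketch. It is not CuriOS code: the names `ClientStateStore`, `TimerService` and `Supervisor` are hypothetical, and Python dictionaries stand in for the isolated, persistent per-client state that the thesis places outside the service. The sketch assumes that a service sees only the state of the client it is currently serving, and that an unhandled exception triggers a recovery routine that restarts the service.

```python
class ClientStateStore:
    """Hypothetical stand-in for per-client state kept outside the service."""

    def __init__(self):
        self._regions = {}

    def region(self, client_id):
        # Each client gets its own region; it outlives any service instance.
        return self._regions.setdefault(client_id, {"ticks": 0})


class TimerService:
    """A service that holds no client-related state between requests."""

    def handle(self, state, request):
        if request == "fail":
            # Simulated error, signaled as a language exception.
            raise RuntimeError("simulated service error")
        state["ticks"] += 1
        return state["ticks"]


class Supervisor:
    """Dispatches requests and runs a recovery routine on unhandled errors."""

    def __init__(self):
        self.store = ClientStateStore()
        self.service = TimerService()
        self.restarts = 0

    def call(self, client_id, request):
        # Only the calling client's state is handed to the service.
        state = self.store.region(client_id)
        try:
            return self.service.handle(state, request)
        except Exception:
            # Recovery routine: restart the service instead of crashing.
            self.service = TimerService()
            self.restarts += 1
            return None
```

In this sketch, a failed request from one client restarts the service, while the tick counts of every client, including the one whose request failed, remain available to the new service instance. Real protection-domain isolation requires hardware and kernel support that a dictionary cannot model.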