research-article

Otherworld: giving applications a chance to survive OS kernel crashes

Authors: Alex Depoutovitch, Michael Stumm

EuroSys '10: Proceedings of the 5th European conference on Computer systems

Pages 181 - 194

https://doi.org/10.1145/1755913.1755933

Published: 13 April 2010

Abstract

The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications with an attendant loss of "work in progress" that is nonpersistent.

Otherworld is a mechanism that microreboots the operating system kernel when a critical error is encountered in the kernel, and it does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can have user-level recovery procedures registered with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources Otherworld was not able to restore.
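
The resurrection flow described above can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: the names `Process`, `resurrect`, `recovery_proc`, and `check_editor_buffer` are hypothetical and are not Otherworld's kernel interface, which the abstract does not specify. The sketch shows only the control-flow decision: restore the saved state, then either resume at the interruption point or hand control to a registered recovery procedure.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Process:
    """Toy stand-in for the state of one application (hypothetical)."""
    name: str
    memory: dict = field(default_factory=dict)      # models the address space
    open_files: list = field(default_factory=list)  # models open-file state
    # Optional user-level recovery procedure; returns True if data looks sane.
    recovery_proc: Optional[Callable[["Process"], bool]] = None


def resurrect(saved: Process) -> str:
    """Rebuild a process after a simulated kernel microreboot.

    Returns "resumed" when execution simply continues, "recovered" when a
    registered recovery procedure accepted the restored state, and "failed"
    when the recovery procedure rejected it.
    """
    # Step 1: restore resources (modeled here as copying the saved state).
    restored = Process(saved.name, dict(saved.memory),
                       list(saved.open_files), saved.recovery_proc)
    # Step 2: default case, no recovery procedure was registered.
    if restored.recovery_proc is None:
        return "resumed"
    # Step 3: pass control to the application's own recovery procedure.
    return "recovered" if restored.recovery_proc(restored) else "failed"


def check_editor_buffer(proc: Process) -> bool:
    """Example recovery procedure: verify a simple integrity invariant."""
    text = proc.memory.get("buffer", "")
    return proc.memory.get("length") == len(text)


editor = Process("editor", {"buffer": "draft", "length": 5},
                 ["notes.txt"], check_editor_buffer)
daemon = Process("daemon")
print(resurrect(editor))  # recovered
print(resurrect(daemon))  # resumed
```

In this model the recovery procedure is the only place where application-specific knowledge enters; a process without one takes the default path and continues unchanged.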

We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of the cases. In the default case, Otherworld adds zero overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection with overhead of between 4% and 12%.


Index Terms

Otherworld: giving applications a chance to survive OS kernel crashes
1. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance

Information & Contributors

Information

Published In

EuroSys '10: Proceedings of the 5th European conference on Computer systems

April 2010

388 pages

ISBN: 9781605585772

DOI: 10.1145/1755913

General Chair: Christine Morin, INRIA Rennes, France

Program Chair: Gilles Muller, INRIA/LIP6, France

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

EuroSys '10

Sponsor:

SIGOPS

EuroSys '10: Fifth EuroSys Conference 2010

April 13 - 16, 2010

Paris, France

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Bibliometrics & Citations

Article Metrics

Total Citations: 61
Total Downloads: 733
Downloads (Last 12 months): 40
Downloads (Last 6 weeks): 8

Reflects downloads up to 20 Nov 2024

Citations

Cited By

Liu J, Hao X, Arpaci-Dusseau A, Arpaci-Dusseau R, Chajed T (2024). Shadow Filesystems: Recovering from Filesystem Runtime Errors via Robust Alternative Execution. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 15-22. Online publication date: 8-Jul-2024.
https://dl.acm.org/doi/10.1145/3655038.3665942

Wada T, Yamada H (2024). Reboot-Based Recovery of Unikernels at the Component Level. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 15-28. Online publication date: 24-Jun-2024.
https://doi.org/10.1109/DSN58291.2024.00017

Decourcelle J, Ngoc T, Teabe B, Hagimont D (2023). Fast VM Replication on Heterogeneous Hypervisors for Robust Fault Tolerance. Proceedings of the 24th International Middleware Conference, 15-28. Online publication date: 27-Nov-2023.
https://dl.acm.org/doi/10.1145/3590140.3592849

Dinh Ngoc T, Teabe B, Tchana A, Muller G, Hagimont D (2023). HyperTP: A unified approach for live hypervisor replacement in datacenters. Journal of Parallel and Distributed Computing 181 (104733). Online publication date: Nov-2023.
https://doi.org/10.1016/j.jpdc.2023.104733

Segalini A, Lopez Pacheco D, Urvoy-Keller G, Hermenier F, Jacquemart Q (2022). Hy-FiX: Fast In-Place Upgrades of KVM Hypervisors. IEEE Transactions on Cloud Computing 10:4, 2679-2690. Online publication date: 1-Oct-2022.
https://doi.org/10.1109/TCC.2021.3056590

Wada T, Yamada H (2022). Towards Making Unikernels Rejuvenatable. 2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), 154-161. Online publication date: Oct-2022.
https://doi.org/10.1109/ISSREW55968.2022.00062

Iguchi T, Yamada H (2022). Graceful ECC-uncorrectable Error Handling in the Operating System Kernel. 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE), 109-120. Online publication date: Oct-2022.
https://doi.org/10.1109/ISSRE55969.2022.00021

Gouveia I, Völp M, Esteves-Verissimo P (2022). Behind the last line of defense. Computers and Security 123:C. Online publication date: 1-Dec-2022.
https://dl.acm.org/doi/10.1016/j.cose.2022.102920

JUMONJI Y, YAMADA H (2021). Efficient Reboot-Based Recovery of In-Memory Databases. IEICE Transactions on Information and Systems E104.D:12, 2164-2172. Online publication date: 1-Dec-2021.
https://doi.org/10.1587/transinf.2020ZDP7501

Ngoc T, Teabe B, Tchana A, Muller G, Hagimont D, Barbalace A, Bhatotia P, Alvisi L, Cadar C (2021). Mitigating vulnerability windows with hypervisor transplant. Proceedings of the Sixteenth European Conference on Computer Systems, 162-177. Online publication date: 21-Apr-2021.
https://dl.acm.org/doi/10.1145/3447786.3456235
