DOI: 10.1145/1755913.1755933
research-article

Otherworld: giving applications a chance to survive OS kernel crashes

Published: 13 April 2010

Abstract

The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications, and any "work in progress" that has not been made persistent is lost.
Otherworld is a mechanism that microreboots the operating system kernel when a critical error is encountered in the kernel, and it does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can have user-level recovery procedures registered with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources Otherworld was not able to restore.
We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of the cases. In the default case, Otherworld adds zero overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection with overhead of between 4% and 12%.
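
The optional user-level recovery procedures mentioned in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Otherworld's interface: the abstract does not describe how procedures are registered, so `register_recovery_procedure`, `resurrect`, `AppState`, and the CRC-based integrity check are all hypothetical names and choices introduced here for illustration. The sketch shows the two paths the abstract describes: continuing by default, or handing control to a registered procedure that checks application data after the state has been restored.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AppState:
    """Application data plus a checksum recorded when the data was last known good."""
    payload: bytes
    checksum: int


# Hypothetical stand-in for a kernel-side table of recovery procedures, keyed by pid.
_recovery_procedures: Dict[int, Callable[[AppState], bool]] = {}


def make_state(payload: bytes) -> AppState:
    """Build application state and record a CRC-32 of its contents."""
    return AppState(payload=payload, checksum=zlib.crc32(payload))


def register_recovery_procedure(pid: int, proc: Callable[[AppState], bool]) -> None:
    """Hypothetical registration call; the real mechanism is not given in the abstract."""
    _recovery_procedures[pid] = proc


def check_integrity(state: AppState) -> bool:
    """Example recovery procedure: accept the data only if the checksum still matches."""
    return zlib.crc32(state.payload) == state.checksum


def resurrect(pid: int, restored: AppState) -> str:
    """Simulate the step taken after a process's state has been restored."""
    proc = _recovery_procedures.get(pid)
    if proc is None:
        # Default case: resume from the point of interruption.
        return "continue"
    # A procedure is registered: let it decide whether the data can be trusted.
    return "continue" if proc(restored) else "restart"
```

In this sketch a process without a registered procedure always resumes, while a process that registered `check_integrity` resumes only if its data survived intact; what a real application does on a failed check (restart, reload from disk, and so on) is left open.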



    Published In

    cover image ACM Conferences
    EuroSys '10: Proceedings of the 5th European conference on Computer systems
    April 2010
    388 pages
    ISBN:9781605585772
    DOI:10.1145/1755913
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. crash kernel
    2. kernel
    3. microreboot
    4. recovery

    Qualifiers

    • Research-article

    Conference

    EuroSys '10
    EuroSys '10: Fifth EuroSys Conference 2010
    April 13 - 16, 2010
    Paris, France

    Acceptance Rates

    Overall Acceptance Rate 241 of 1,308 submissions, 18%
