Hive: Fault containment for shared-memory multiprocessors

J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, A. Gupta
Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, 1995. dl.acm.org
Abstract
Reliability and scalability are major concerns when designing operating systems for large-scale shared-memory multiprocessors. In this paper we describe Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This improves reliability because a hardware or software fault damages only one cell rather than the whole system, and improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. This paper focuses on Hive's solution to the following key challenges: (1) fault containment, i.e. confining the effects of hardware or software faults to the cell where they occur, and (2) memory sharing among cells, which is required to achieve application performance competitive with other multiprocessor operating systems. Fault containment in a shared-memory multiprocessor requires defending each cell against erroneous writes caused by faults in other cells. Hive prevents such damage by using the FLASH firewall, a write permission bit-vector associated with each page of memory, and by discarding potentially corrupt pages when a fault is detected. Memory sharing is provided through a unified file and virtual memory page cache across the cells, and through a unified free page frame pool. We report early experience with the system, including the results of fault injection and performance experiments using SimOS, an accurate simulator of FLASH. The effects of faults were contained to the cell in which they occurred in all 49 tests where we injected fail-stop hardware faults, and in all 20 tests where we injected kernel data corruption. The Hive prototype executes test workloads on a four-processor four-cell system with between 0% and 11% slowdown as compared to SGI IRIX 5.2 (the version of UNIX on which it is based).