Hive: Fault containment for shared-memory multiprocessors

J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, A. Gupta
Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, 1995. dl.acm.org
Abstract
Reliability and scalability are major concerns when designing operating systems for large-scale shared-memory multiprocessors. In this paper we describe Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This improves reliability because a hardware or software fault damages only one cell rather than the whole system, and improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. This paper focuses on Hive's solution to the following key challenges: (1) fault containment, i.e. confining the effects of hardware or software faults to the cell where they occur, and (2) memory sharing among cells, which is required to achieve application performance competitive with other multiprocessor operating systems. Fault containment in a shared-memory multiprocessor requires defending each cell against erroneous writes caused by faults in other cells. Hive prevents such damage by using the FLASH firewall, a write permission bit-vector associated with each page of memory, and by discarding potentially corrupt pages when a fault is detected. Memory sharing is provided through a unified file and virtual memory page cache across the cells, and through a unified free page frame pool. We report early experience with the system, including the results of fault injection and performance experiments using SimOS, an accurate simulator of FLASH. The effects of faults were contained to the cell in which they occurred in all 49 tests where we injected fail-stop hardware faults, and in all 20 tests where we injected kernel data corruption. The Hive prototype executes test workloads on a four-processor four-cell system with between 0% and 11% slowdown as compared to SGI IRIX 5.2 (the version of UNIX on which it is based).