Nothing Special   »   [go: up one dir, main page]

skip to main content
article

BASE: Using abstraction to improve fault tolerance

Published: 01 August 2003 Publication History

Abstract

Software errors are a major cause of outages and they are increasingly exploited in malicious attacks. Byzantine fault tolerance allows replicated systems to mask some software errors but it is expensive to deploy. This paper describes a replication technique, BASE, which uses abstraction to reduce the cost of Byzantine fault tolerance and to improve its ability to mask software errors. BASE reduces cost because it enables reuse of off-the-shelf service implementations. It improves availability because each replica can be repaired periodically using an abstract view of the state stored by correct replicas, and because each replica can run distinct or nondeterministic service implementations, which reduces the probability of common mode failures. We built an NFS service where each replica can run a different off-the-shelf file system implementation, and an object-oriented database where the replicas ran the same, nondeterministic implementation. These examples suggest that our technique can be used in practice---in both cases, the implementation required only a modest amount of new code, and our performance results indicate that the replicated services perform comparably to the implementations that they reuse.

References

[1]
Adya, A., Gruber, R., Liskov, B., and Maheshwari, U. 1995. Efficient Optimistic Concurrency Control using Loosely Synchronized Clocks. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data. San Jose, California, 23--34.]]
[2]
Amir, Y., Danilov, C., Miskin-Amir, M., Stanton, J., and Tutu, C. 2002. Practical Wide-Area Database Replication. Tech. Rep. CNDS-2002-1, Johns Hopkins University.]]
[3]
Birman, K., Schiper, A., and Stephenson, P. 1991. Lightweight causal and atomic group multicast. In ACM Trans. Comput. Syst. Vol. 9(3).]]
[4]
Bressoud, T. and Schneider, F. 1995. Hypervisor-based Fault Tolerance. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles. Copper Mountain Resort, Colorado, 1--11.]]
[5]
Callaghan, B. 1999. NFS Illustrated. Addison-Wesley, Reading, Massachusetts.]]
[6]
Carey, M. J., DeWitt, D. J., and Naughton, J. F. 1993. The OO7 Benchmark. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington D.C., 12--21.]]
[7]
Castro, M. 2000. Practical Byzantine fault-tolerance. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts.]]
[8]
Castro, M., Adya, A., Liskov, B., and Myers, A. 1997. HAC: Hybrid Adaptive Caching for Distributed Storage Systems. In Proceedings of the Sixteenth ACM Symposium on Operating System Principles. Saint Malo, France, 102--115.]]
[9]
Castro, M. and Liskov, B. 1999. Practical Byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation. New Orleans, Louisiana, 173--186.]]
[10]
Castro, M. and Liskov, B. 2000. Proactive recovery in a Byzantine-fault-tolerant system. In Proceedings of the Fourth Symposium on Operating Systems Design and Implementation. San Diego, California, 273--288.]]
[11]
Castro, M. and Liskov, B. 2002. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans. Comput. Syst. 20, 4 (Nov.), 398--461.]]
[12]
Chen, L. and Avizienis, A. 1978. N-Version Programming: A Fault-Tolerance Approach to Reliability of Software Operation. In Fault Tolerant Computing, FTCS-8. 3--9.]]
[13]
Chen, P. M., Lee, E. K., Gibson, G. A., Katz, R. H., and Patterson, D. A. 1994. RAID: High-performance, reliable secondary storage. ACM Comput. Surv. 26, 2, 145--185.]]
[14]
Cooper, E. 1985. Replicated Distributed Programs. In Proceedings of the Tenth ACM Symposium on Operating System Principles. Orcas Island, Washington, 63--78.]]
[15]
Geiger, K. 1995. Inside ODBC. Microsoft Press.]]
[16]
Ghemawat, S. 1995. The Modified Object Buffer: a storage management technique for object-oriented databases. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts.]]
[17]
Gray, J. and Siewiorek, D. 1991. High-availability computer systems. IEEE Comput. 24, 9 (Sept.), 39--48.]]
[18]
Gray, J. N. and Reuter, A. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc.]]
[19]
Hayden, M. 1998. The Ensemble System. Tech. Rep. TR98-1662, Cornell University, Ithaca, New York. Jan.]]
[20]
Herlihy, M. P. and Wing, J. M. 1987. Axioms for Concurrent Objects. In Conference Record of the Fourteenth Annual ACM Symposium on Principles of Programming Languages. Munich, Germany, 13--26.]]
[21]
Howard, J., Kazar, M., Menees, S., Nichols, D., Satyanarayanan, M., Sidebotham, R., and West, M. 1988. Scale and performance in a distributed file system. ACM Trans. Comput. Syst. 6, 1 (Feb.), 51--81.]]
[22]
Huang, Y., Kintala, C., Kolettis, N., and Fulton, N. D. 1995. Software rejuvenation: Analysis, modules and applications. In Digest of Papers: FTCS-25, The Twenty-Fifth International Symposium on Fault-Tolerant Computing. Pasadena, California, 381--390.]]
[23]
Jiménez-Peris, R., Patiño-Martínez, M., Kemme, B., and Alonso, G. 2002. Improving the Scalability of Fault-Tolerant Database Clusters. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02). Vienna, Austria.]]
[24]
Kemme, B. and Alonso, G. 2000. Don't be lazy be consistent: Postgres-R, a new way to implement Database Replication. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases. Cairo, Egypt.]]
[25]
Lamport, L. 1978. Time, Clocks, and the Ordering of Events in a Distributed System. Comm. ACM 21, 7 (July), 558--565.]]
[26]
Liskov, B., Castro, M., Shrira, L., and Adya, A. 1999. Providing persistent objects in distributed systems. In Proceedings of the 13th European Conference on Object-Oriented Programming (ECOOP '99). Lisbon, Portugal, 230--257.]]
[27]
Liskov, B., Ghemawat, S., Gruber, R., Johnson, P., Shrira, L., and Williams, M. 1991. Replication in the Harp File System. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles. Pacific Grove, California, 226--238.]]
[28]
Liskov, B. and Guttag, J. 2000. Program Development in Java: Abstraction, Specification, and Object-Oriented Design. Addison-Wesley.]]
[29]
Maffeis, S. 1995. Adding group communication and fault tolerance to CORBA. In Proceedings of the Second USENIX Conference on Object-Oriented Technologies. Toronto, Canada, 135--146.]]
[30]
Marzullo, K. and Schmuck, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems. San Jose, California, 447--453.]]
[31]
Mills, D. L. 1992. Network Time Protocol (Version 3) Specification, Implementation and Analysis. Network Working Report RFC 1305.]]
[32]
Minnich, R. 2000. The Linux BIOS Home Page. http://www.acl.lanl.gov/linuxbios.]]
[33]
Moser, L., Melliar-Smith, P., and Narasimhan, P. 1998. Consistent object replication in the eternal system. Theory and Practice of Object Systems 4, 2 (Jan.), 81--92.]]
[34]
Narasimhan, P., Kihlstrom, K., Moser, L., and Melliar-Smith, P. 1999. Providing Support for Survivable CORBA Applications with the Immune System. In Proceedings of the 19th International Conference on Distributed Computing Systems. Austin, Texas, 507--516.]]
[35]
Object Management Group. 1999. The Common Object Request Broker: Architecture and Specification. Omg techical committee document formal/98-12-01. June.]]
[36]
Object Management Group. 2000. Fault Tolerant CORBA. Omg techical committee document orbos/2000-04-04. Mar.]]
[37]
Ousterhout, J. 1990. Why Aren't Operating Systems Getting Faster as Fast as Hardware? In Proceedings of the Usenix Summer 1990 Technical Conference. Anaheim, California, 247--256.]]
[38]
Pease, M., Shostak, R., and Lamport, L. 1980. Reaching Agreement in the Presence of Faults. J. ACM 27, 2 (Apr.), 228--234.]]
[39]
Ramkumar, B. and Strumpen, V. 1997. Portable checkpointing for heterogeneous architectures. In Digest of Papers: FTCS-27, The Twenty-Seventh Annual International Symposium on Fault-Tolerant Computing. Seattle, Washington, 58--67.]]
[40]
RFC-1014 1987. Network working group request for comments: 1014. XDR: External data representation standard.]]
[41]
RFC-1094 1989. Network working group request for comments: 1094. NFS: Network file system protocol specification.]]
[42]
Rodrigues, R. 2001. Combining abstraction with Byzantine fault-tolerance. M.S. thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts.]]
[43]
Romanovsky, A. 2000. Faulty version recovery in object-oriented N-version programming. IEE Proc. Soft. 147, 3 (June), 81--90.]]
[44]
Salles, F., Rodríguez, M., Fabre, J., and Arlat, J. 1999. MetaKernels and Fault Containment Wrappers. In Digest of Papers: FTCS-29, The Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing. Madison, Wisconsin, 22--29.]]
[45]
Schneider, F. 1990. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22, 4 (Dec.), 299--319.]]
[46]
Stonebraker, M., Rowe, L. A., and Hirohama, M. 1990. The implementation of POSTGRES. IEEE Trans. Knowl. Data Eng. 2, 1 (Mar.), 125--142.]]
[47]
Tso, K. and Avizienis, A. 1987. Community error recovery in N-version software: A design study with experimentation. In Digest of Papers: FTCS-17, the Seventeenth Annual Symposium on Fault Tolerant Computing. Pittsburgh, Pennsylvania, 127--133.]]
[48]
W3C. 2000. Extensible Markup Language (XML) 1.0 (Second Edition). W3C recommendation.]]

Cited By

View all
  • (2024)Follow the Leader: Alternating CPU/GPU Computations in PDESProceedings of the 38th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation10.1145/3615979.3656056(47-51)Online publication date: 24-Jun-2024
  • (2024)Specular: Towards Secure, Trust-minimized Optimistic Blockchain Execution2024 IEEE Symposium on Security and Privacy (SP)10.1109/SP54263.2024.00175(3943-3960)Online publication date: 19-May-2024
  • (2023)Intrusion Resilience Systems for Modern Vehicles2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring)10.1109/VTC2023-Spring57618.2023.10200532(1-7)Online publication date: Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Computer Systems
ACM Transactions on Computer Systems  Volume 21, Issue 3
August 2003
106 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/859716
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2003
Published in TOCS Volume 21, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Byzantine fault tolerance
  2. N-version programming
  3. asynchronous systems
  4. proactive recovery
  5. state machine replication

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)17
  • Downloads (Last 6 weeks)0
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Follow the Leader: Alternating CPU/GPU Computations in PDESProceedings of the 38th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation10.1145/3615979.3656056(47-51)Online publication date: 24-Jun-2024
  • (2024)Specular: Towards Secure, Trust-minimized Optimistic Blockchain Execution2024 IEEE Symposium on Security and Privacy (SP)10.1109/SP54263.2024.00175(3943-3960)Online publication date: 19-May-2024
  • (2023)Intrusion Resilience Systems for Modern Vehicles2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring)10.1109/VTC2023-Spring57618.2023.10200532(1-7)Online publication date: Jun-2023
  • (2023)SoK: Essentials of BFT Consensus for Blockchains2023 Fifth International Conference on Blockchain Computing and Applications (BCCA)10.1109/BCCA58897.2023.10338868(315-328)Online publication date: 24-Oct-2023
  • (2022)Distributed Ledger Technology (DLT) and Byzantine Fault Tolerance in BlockchainSoft Computing: Theories and Applications10.1007/978-981-19-0707-4_86(971-981)Online publication date: 2-Jun-2022
  • (2021)Byzantine Fault-tolerant State-machine Replication from a Systems PerspectiveACM Computing Surveys10.1145/343672854:1(1-38)Online publication date: 11-Feb-2021
  • (2021)Byzantine GeoconsensusNetworked Systems10.1007/978-3-030-91014-3_2(19-35)Online publication date: 19-May-2021
  • (2021)Byzantine Fault ToleranceFrom Traditional Fault Tolerance to Blockchain10.1002/9781119682127.ch7(245-293)Online publication date: 18-Jun-2021
  • (2020)Byzantine ordered consensus without byzantine oligarchyProceedings of the 14th USENIX Conference on Operating Systems Design and Implementation10.5555/3488766.3488802(633-649)Online publication date: 4-Nov-2020
  • (2020)From Analyzing Operating System Vulnerabilities to Designing Multiversion Intrusion-Tolerant ArchitecturesIEEE Transactions on Reliability10.1109/TR.2019.289724869:1(22-39)Online publication date: Mar-2020
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media