Rollback Recovery in Multiprocessor Ring Configurations

L. Lehmann & J. Brehm

Part of the book series: Informatik-Fachberichte (INFORMATIK, volume 147)

Abstract

This paper describes a technique for distributed recovery in multiprocessor ring configurations. The technique has been developed and implemented for DIRMU 25, a 25-processor system in operation at the University of Erlangen-Nuremberg. First, a short overview of the DIRMU hardware architecture and the distributed operating system DIRMOS is given. Then the steps of distributed recovery using distributed system checkpoints are described. Finally, the efficiency of the technique is discussed in comparison to recovery techniques using central system checkpoints, based on measurements of the runtime overhead of a realistic application (2D Poisson multigrid).

Author information

Authors and Affiliations

Department of Computer Science (IMMD), University of Erlangen-Nuremberg, Martensstr. 3, D-8520, Erlangen, Federal Republic of Germany
L. Lehmann & J. Brehm

Editor information

Editors and Affiliations

Fachbereich 2, Hochschule Bremerhaven, Bürgermeister-Smidt-Straße 20, D-2850, Bremerhaven, Germany
F. Belli
Institut für Rechnerentwurf und Fehlertoleranz Fakultät für Informatik, Universität Karlsruhe, Postfach 6980, D-7500, Karlsruhe 1, Germany
W. Görke

About this paper

Cite this paper

Lehmann, L., Brehm, J. (1987). Rollback Recovery in Multiprocessor Ring Configurations. In: Belli, F., Görke, W. (eds) Fehlertolerierende Rechensysteme / Fault-Tolerant Computing Systems. Informatik-Fachberichte, vol 147. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45628-2_19

DOI: https://doi.org/10.1007/978-3-642-45628-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-18294-8
Online ISBN: 978-3-642-45628-2
eBook Packages: Springer Book Archive
