Abstract
We describe a replication-based protocol that uses group communication for fault tolerance in the Computational Grid. The Grid is partitioned into a number of clusters, and each cluster has a designated coordinator that manages the states of the replicas within its cluster. The coordinators belong to a process group, and the proposed protocol ensures that the coordinators deliver messages to the replicas in the correct sequence. Any failing node of the Grid is replaced by an active replica so that the application continues to operate correctly. We present the theoretical framework, illustrate the replication protocol, report implementation results, and analyze the protocol's performance and scalability.
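To make the roles named in the abstract concrete, the following is a minimal sketch of a single cluster, not the protocol from the paper. It assumes a coordinator that stamps each message with a sequence number, delivers it to every live replica, and promotes a live replica when the serving one fails. The class names `Replica` and `Coordinator`, the methods `multicast` and `fail`, and the promotion rule (first live replica) are all hypothetical; the process group that connects coordinators across clusters is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    log: list = field(default_factory=list)  # messages applied, in delivery order

    def deliver(self, seq, msg):
        self.log.append((seq, msg))


@dataclass
class Coordinator:
    """Hypothetical cluster coordinator: orders messages, tracks replica state."""
    replicas: list
    next_seq: int = 0
    primary: int = 0  # index of the replica currently serving the application

    def multicast(self, msg):
        # Stamp the message with a sequence number so that every live
        # replica applies the same messages in the same order.
        seq = self.next_seq
        self.next_seq += 1
        for r in self.replicas:
            if r.alive:
                r.deliver(seq, msg)
        return seq

    def fail(self, name):
        # Mark the named replica as failed; if it was serving the
        # application, promote the first live replica (an assumed rule).
        for r in self.replicas:
            if r.name == name:
                r.alive = False
        if not self.replicas[self.primary].alive:
            live = [i for i, r in enumerate(self.replicas) if r.alive]
            if not live:
                raise RuntimeError("no live replica left in cluster")
            self.primary = live[0]

    def current(self):
        return self.replicas[self.primary]


def demo():
    c = Coordinator([Replica("r0"), Replica("r1"), Replica("r2")])
    c.multicast("a")
    c.multicast("b")
    c.fail("r0")       # the serving replica fails
    c.multicast("c")   # delivery continues to the surviving replicas
    return c
```

Because every live replica has applied the same sequence of messages, the promoted replica in `demo()` already holds the state needed to continue, which is the property the abstract attributes to active replication.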
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Erciyes, K. (2006). A Replication-Based Fault Tolerance Protocol Using Group Communication for the Grid. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2006. Lecture Notes in Computer Science, vol 4330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946441_62
DOI: https://doi.org/10.1007/11946441_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68067-3
Online ISBN: 978-3-540-68070-3
eBook Packages: Computer Science, Computer Science (R0)