CONSISTENCY OF DISTRIBUTED SYSTEM WITH ACTIVE INITIATOR PROCESS WITHOUT USELESS CHECKPOINTS
DOI:
https://doi.org/10.47839/ijc.5.1.387
Keywords:
Asynchronous distributed system, software fault tolerance, consistent global checkpoint, useless checkpoint, checkpointing interval, initiator process, consistent state
Abstract
Checkpointing is one of the most attractive approaches for providing software fault tolerance in distributed message-passing systems. This paper implements a distributed checkpointing technique that eliminates the drawbacks of the centralized approach, such as the “domino effect”, “useless checkpoints” (checkpoints that do not contribute to global consistency), and “hidden and zigzag” dependencies. The proposed checkpointing protocol has a checkpoint initiator, but coordination among the local checkpoints is done in a distributed fashion. The guarantee that no message is lost when a failure occurs is maintained by exchanging information among the processes. There is no central checkpoint initiator; instead, the processes take turns acting as the initiator. Processes take local checkpoints only after being notified by the initiator, and they synchronize their activities for the current checkpointing interval before finally committing their checkpoints. Thus, the checkpointing pattern described in this paper takes only those checkpoints that contribute to a consistent global snapshot, thereby eliminating useless checkpoints.
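The abstract does not give the protocol's message formats or data structures, so the following is only a minimal single-threaded sketch of the general idea: a blocking coordinated checkpoint round in which the initiator role rotates among the processes, checkpoints are taken only on the initiator's request, and the round commits only once the interval's sent and received message counts agree. The names `Process`, `Cluster`, `drain`, and `checkpoint_round`, the integer stand-in for application state, and the choice to flush in-transit messages before checkpointing are all assumptions made for illustration, not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    state: int = 0      # stand-in for the application state (assumption)
    sent: int = 0       # messages sent in the current interval
    received: int = 0   # messages received in the current interval
    committed: list = field(default_factory=list)

    def take_tentative(self, interval):
        # Called only after the initiator's notification, never spontaneously.
        return {"interval": interval, "pid": self.pid, "state": self.state}

    def commit(self, checkpoint):
        self.committed.append(checkpoint)
        self.sent = self.received = 0   # a new checkpointing interval begins


class Cluster:
    """Hypothetical in-memory model of the processes and their channels."""

    def __init__(self, n):
        self.procs = [Process(pid) for pid in range(n)]
        self.in_transit = deque()   # (destination pid, payload)
        self.interval = 0

    def send(self, src, dst, payload):
        self.procs[src].sent += 1
        self.in_transit.append((dst, payload))

    def drain(self):
        # Deliver every message that is still in a channel.
        while self.in_transit:
            dst, payload = self.in_transit.popleft()
            self.procs[dst].received += 1
            self.procs[dst].state += payload

    def checkpoint_round(self):
        # Processes take turns as initiator: no fixed central coordinator.
        initiator = self.interval % len(self.procs)
        # Synchronize the interval: flush channels so no message is lost.
        self.drain()
        # The initiator notifies everyone; each takes a tentative checkpoint.
        tentative = [p.take_tentative(self.interval) for p in self.procs]
        # Commit only if every message sent in the interval was received.
        total_sent = sum(p.sent for p in self.procs)
        total_received = sum(p.received for p in self.procs)
        if total_sent != total_received:
            raise RuntimeError("messages still in transit; cannot commit")
        for p, checkpoint in zip(self.procs, tentative):
            p.commit(checkpoint)
        self.interval += 1
        return initiator
```

In this model every committed checkpoint belongs to exactly one consistent global checkpoint (the one for its interval), which is the sense in which no checkpoint is useless. A real asynchronous system would replace the shared `in_transit` queue with actual channels and would need the information exchange that the abstract mentions to detect when the interval's messages have all arrived.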
License
International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.