Abstract
In recent years, the cloud has been widely used to host numerous distributed applications. This expanding usage has made the environment more sensitive to failures, so most applications require an effective fault-tolerance mechanism. Such a mechanism involves both detection of and recovery from failures; traditionally, checkpointing has served this purpose. Conventional checkpointing methods, such as periodic checkpointing and application-based checkpointing, have also been tried in the cloud; however, periodic checkpointing is time inefficient and application-based checkpointing is space inefficient. Moreover, these methods have been implemented using the synchronous approach, which is inherently message inefficient, less scalable, and has high synchronization latency. Asynchronous approaches, in turn, are not practically viable owing to their inability to detect failures. Because the cloud also entails massive scalability, we propose a quasi-synchronous checkpointing algorithm for cloud-based distributed applications that exhibits better space efficiency while keeping latency under strict control. Our claims are substantiated with static analysis and suitable simulation experiments.
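The abstract does not spell out the proposed algorithm, so the following is only a minimal sketch of the general quasi-synchronous (communication-induced) idea, assuming a simplified sequence-number scheme: each process checkpoints on its own schedule, piggybacks its checkpoint sequence number on outgoing messages, and takes a forced checkpoint before delivering a message that carries a higher number. The names `Process`, `Message`, `basic_checkpoint`, and `receive` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint sequence number, piggybacked on the message


@dataclass
class Process:
    pid: int
    sn: int = 0                                       # local checkpoint sequence number
    state: list = field(default_factory=list)         # toy application state (event log)
    checkpoints: list = field(default_factory=list)   # (sn, kind, snapshot) tuples

    def _save(self, kind: str) -> None:
        # Store a copy of the state so later events do not alter the snapshot.
        self.checkpoints.append((self.sn, kind, list(self.state)))

    def basic_checkpoint(self) -> None:
        # Checkpoint taken independently by the process (e.g. on a timer).
        self.sn += 1
        self._save("basic")

    def send(self, payload: str) -> Message:
        self.state.append(f"sent:{payload}")
        return Message(payload, self.sn)

    def receive(self, msg: Message) -> None:
        # Forced checkpoint: if the sender is ahead, catch up BEFORE
        # delivering the message, with no extra coordination messages.
        if msg.sn > self.sn:
            self.sn = msg.sn
            self._save("forced")
        self.state.append(f"recv:{msg.payload}")


p0, p1 = Process(0), Process(1)
p0.basic_checkpoint()          # p0 checkpoints on its own
p1.receive(p0.send("x"))       # p1 is behind, so it takes a forced checkpoint first
p0.receive(p1.send("y"))       # sequence numbers match, no forced checkpoint
print(p0.checkpoints, p1.checkpoints)
```

In this toy run the only coordination cost is one integer per message, which is the property that distinguishes the quasi-synchronous family from the synchronous approach criticised above; how the paper's algorithm chooses when to checkpoint and what to store is not described here and may differ.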
Cite this article
Sinha, B., Singh, A.K. & Saini, P. A hybrid approach towards reduced checkpointing overhead in cloud-based applications. Peer-to-Peer Netw. Appl. 15, 473–483 (2022). https://doi.org/10.1007/s12083-021-01230-2
DOI: https://doi.org/10.1007/s12083-021-01230-2