Abstract
With the rapid growth in the size and complexity of high-performance computing systems, passive fault tolerance can no longer ensure system reliability effectively because of its high overhead and poor scalability. The hybrid fault-tolerant method, which combines passive and active fault-tolerant approaches, has the potential to be widely used in exascale systems. However, many issues with this method still need to be resolved. This paper focuses on checkpointing in the hybrid fault-tolerant method, where a common question is how to optimize the checkpoint interval. The paper proposes two models of systems that adopt hybrid fault tolerance and evaluates their effectiveness by comparing their results with simulation. Experimental results show that the modified model not only predicts the total work time accurately but also predicts the optimum checkpoint interval precisely.
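For readers unfamiliar with the checkpoint-interval question, the following minimal sketch may help. It does not implement either of the two hybrid models proposed in this paper, which the abstract does not specify. It shows only the classical baseline for purely passive checkpointing, Young's first-order approximation, in which the optimum interval is sqrt(2 * delta * M) for a checkpoint cost delta and a mean time between failures M. The function name, parameter names, and numeric values are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval.

    checkpoint_cost: time needed to write one checkpoint (delta).
    mtbf: mean time between failures of the system (M).
    Both arguments must use the same time unit; the result uses that unit too.
    This is the passive-checkpointing baseline, not the paper's hybrid model.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Illustrative values only: a 5-minute checkpoint and a 24-hour MTBF, in seconds.
tau = young_interval(300.0, 86400.0)
print(f"Approximate optimum interval: {tau:.0f} s")
```

With these assumed values the interval comes out at two hours. A hybrid scheme that also predicts and avoids some failures would change the effective failure rate seen by the checkpointing layer, which is the kind of effect the models in this paper are meant to capture.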
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, L., Gu, J., Wang, Y., Zhao, T. (2013). Research on Optimum Checkpoint Interval for Hybrid Fault Tolerance. In: Wu, C., Cohen, A. (eds) Advanced Parallel Processing Technologies. APPT 2013. Lecture Notes in Computer Science, vol 8299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45293-2_28
DOI: https://doi.org/10.1007/978-3-642-45293-2_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45292-5
Online ISBN: 978-3-642-45293-2
eBook Packages: Computer Science (R0)