Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3030207.3030209acmconferencesArticle/Chapter ViewAbstractPublication PagesicpeConference Proceedingsconference-collections
research-article
Public Access

Analytic Models of Checkpointing for Concurrent Component-Based Software Systems

Published: 17 April 2017 Publication History

Abstract

Checkpointing and rollback is a key mechanism used to improve the reliability of software systems. The benefits of this mechanism can be offset by the overhead of checkpointing when the failure rate is low. The problem of developing analytic models of rollback and checkpointing has been continuously addressed for over four decades using different assumptions. This paper examines the problem under a more realistic angle, i.e., one in which there are several software components sharing resources (e.g., processors and I/O devices) among themselves and with the checkpointing processes. Additionally, the paper allows for different components to have different computing, rollback, and checkpointing demands, as well as different failure distributions. Our models also allow for various checkpointing processes to be executing concurrently to checkpoint the state of different software components. The analytic models developed here combine Markov Chains and Queuing Networks and allow us to compute the following metrics: (1) average time needed by a component to complete its execution, (2) average throughput of a component, (3) availability of a component, and (4) checkpointing overhead. The models were validated through extensive simulation and experimentation.

References

[1]
M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. Checkpointing strategies for parallel jobs. In 2011 Intl. Conf. High Performance Computing, Networking, Storage and Analysis (SC), pages 1--11. IEEE, 2011.
[2]
K. M. Chandy, J. C. Browne, C. W. Dissly, and W. R. Uhrig. Analytic models for rollback and recovery strategies in data base systems. IEEE Tr. Software Engineering, (1):100--110, 1975.
[3]
N. Chen and S. Ren. Adaptive optimal checkpoint interval and its impact on system's overall quality in soft real-time applications. In Proc. 2009 ACM Symp. Applied Computing, pages 1015--1020. ACM, 2009.
[4]
J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303--312, 2006.
[5]
S. Di, L. Bautista-Gomez, and F. Cappello. Optimization of a multilevel checkpoint model with uncertain execution scales. In Proc. Intl. Conf. High Performance Computing, Networking, Storage and Analysis, pages 907--918. IEEE Press, 2014.
[6]
B. Dimitrov, Z. Khalil, N. Kolev, and P. Petrov. On the optimal total processing time using checkpoints. IEEE Tr. Software Engineering, 17(5):436, 1991.
[7]
E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3):375--408, 2002.
[8]
E. Elsayed. Reliability Engineering. Number v. 1 in Reliability Engineering. Addison Wesley Longman, 1996.
[9]
S. Gao, B. He, and J. Xu. Real-time in-memory checkpointing for future hybrid memory systems. In Proc. 29th ACM on Intl. Conf. Supercomputing, pages 263--272. ACM, 2015.
[10]
S. Garg, Y. Huang, C. Kintala, and K. S. Trivedi. Minimizing completion time of a program by checkpointing and rejuvenation. In ACM SIGMETRICS Performance Evaluation Review, volume 24, pages 252--261. ACM, 1996.
[11]
E. Gelenbe. A model of roll-back recovery with multiple checkpoints. In Proc. 2nd Intl. Conf. Software Engineering, pages 251--255. IEEE Computer Society Press, 1976.
[12]
E. Gelenbe. On the optimum checkpoint interval. J. ACM, 26(2):259--270, 1979.
[13]
E. Gelenbe and D. Derochette. Performance of rollback recovery systems under intermittent failures. C. ACM, 21(6):493--499, 1978.
[14]
W. M. Jones, J. T. Daly, and N. DeBardeleben. Application monitoring and checkpointing in hpc: looking towards exascale systems. In Proc. 50th Annual Southeast Regional Conference, pages 262--267. ACM, 2012.
[15]
L. Kleinrock. Theory, Volume 1, Queueing Systems. Wiley-Interscience, 1975.
[16]
E. Lazowska, J. Zahorjan, G. Graham, and K. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice Hall, 1984.
[17]
Y. Ling, J. Mi, and X. Lin. A variational calculus approach to optimal checkpoint placement. IEEE Tr. Computers, 50(7):699--708, 2001.
[18]
J. D. Little. A proof for the queuing formula: L=λw. Operations Research, 9(3):383--387, 1961.
[19]
G. Lu, Z. Zheng, and A. A. Chien. When is multi-version checkpointing needed? In Proc. 3rd Wkhp. Fault-tolerance for HPC at extreme scale, pages 49--56. ACM, 2013.
[20]
D. A. Menasce, V. A. Almeida, Dowdy, and L.W. Dowdy. Performance by design: computer capacity planning by example. Prentice Hall Professional, 2004.
[21]
V. F. Nicola and J. M. Van Spanje. Comparative analysis of different models of checkpointing and recovery. IEEE Tr. Software Engineering, 16(8):807--821, 1990.
[22]
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.
[23]
A. N. Tantawi and M. Ruschitzka. Performance analysis of checkpointing strategies. ACM Tr. Computer Systems (TOCS), 2(2):123--144, 1984.
[24]
S. Urbanek. Rserve: Binary R server, 2013. R package version 1.7--3.
[25]
K. Wolter. Stochastic Models for Fault Tolerance, Restart, Rejuvenation, and Checkpointing. Springer Verlag, 2010.
[26]
J. W. Young. A first order approximation to the optimum checkpoint interval. C. ACM, 17(9):530--531, 1974.
[27]
W. Zhao. Building dependable distributed systems. John Wiley & Sons, 2014

Cited By

View all
  • (2018)Pattern-based Modeling of Multiresilience Solutions for High-Performance ComputingProceedings of the 2018 ACM/SPEC International Conference on Performance Engineering10.1145/3184407.3184421(80-87)Online publication date: 30-Mar-2018
  • (2018)Efficient modeling and optimizing of checkpointing in concurrent component-based software systemsJournal of Systems and Software10.1016/j.jss.2018.01.032139:C(1-13)Online publication date: 1-May-2018

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ICPE '17: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering
April 2017
450 pages
ISBN:9781450344043
DOI:10.1145/3030207
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. analytic models
  2. checkpointing
  3. concurrent components
  4. markov chains
  5. queuing networks

Qualifiers

  • Research-article

Funding Sources

  • Air Force Office of Scientific Research

Conference

ICPE '17
Sponsor:

Acceptance Rates

ICPE '17 Paper Acceptance Rate 27 of 83 submissions, 33%;
Overall Acceptance Rate 252 of 851 submissions, 30%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)54
  • Downloads (Last 6 weeks)6
Reflects downloads up to 26 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2018)Pattern-based Modeling of Multiresilience Solutions for High-Performance ComputingProceedings of the 2018 ACM/SPEC International Conference on Performance Engineering10.1145/3184407.3184421(80-87)Online publication date: 30-Mar-2018
  • (2018)Efficient modeling and optimizing of checkpointing in concurrent component-based software systemsJournal of Systems and Software10.1016/j.jss.2018.01.032139:C(1-13)Online publication date: 1-May-2018

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media