DOI: 10.1109/SC.2018.00047
research-article

Partial redundancy in HPC systems with non-uniform node reliabilities

Published: 26 July 2019

Abstract

We study the usefulness of partial redundancy in HPC message-passing systems where individual node failure distributions are not identical. Prior research on fault tolerance has generally assumed identical failure distributions for the nodes of the system. In such settings, partial replication has never been shown to outperform the two extremes (full replication and no replication) for any significant range of node counts. In this work, we argue that partial redundancy may provide the best performance under the more realistic assumption of non-identical node failure distributions. We provide theoretical results on arranging nodes with different reliability values among replicas such that system reliability is maximized. Moreover, using system reliability to compute the MTTI (mean time to interrupt) and the expected completion time of a partially replicated system, we numerically determine the optimal partial replication degree. Our results indicate that partial replication can be a more efficient alternative to full replication at system scales where checkpoint/restart alone is not sufficient.
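The idea of arranging unequal nodes among replicas can be illustrated with a small sketch. It is a toy model built on assumptions that are not taken from the paper: node failures are independent, each replicated process runs on exactly two nodes, and a process is lost only when all of its nodes fail. The function names and the failure probabilities in `q` are hypothetical. The code scores a given arrangement and finds the best pairing of four nodes by exhaustive search.

```python
from itertools import chain
from math import prod


def system_reliability(pairs, singles, q):
    """Probability that every process keeps at least one live node.

    q[i] is the assumed, independent probability that node i fails
    during the interval of interest.
    """
    # A replicated process is lost only if both of its nodes fail.
    paired = prod(1.0 - q[a] * q[b] for a, b in pairs)
    # An unreplicated process is lost as soon as its single node fails.
    alone = prod(1.0 - q[k] for k in singles)
    return paired * alone


def all_pairings(nodes):
    """Yield every way to split an even-sized list into unordered pairs."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in all_pairings(remaining):
            yield [(first, partner)] + tail


def best_pairing(nodes, q):
    """Exhaustive search; only practical for a handful of nodes."""
    return max(all_pairings(nodes), key=lambda p: system_reliability(p, [], q))


# Hypothetical failure probabilities, from most to least reliable node.
q = [0.01, 0.05, 0.10, 0.30]
best = best_pairing([0, 1, 2, 3], q)
used = sorted(chain.from_iterable(best))
print("best pairing:", best, "covers nodes", used)
print("reliability:", system_reliability(best, [], q))
```

In this toy instance the search pairs the least reliable node with the most reliable one, which matches what a rearrangement-style argument would suggest for this product form. The `singles` argument lets the same function score a partially replicated arrangement, where some processes run without a replica.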

Cited By

  • (2019) FailAmp. ACM Transactions on Architecture and Code Optimization 16:4 (1-21). DOI: 10.1145/3369381. Online publication date: 18-Dec-2019
  • (2019) Bi-Source Verification Against Silent Data Corruption in High Performance Computing. Proceedings of the 9th Balkan Conference on Informatics (1-4). DOI: 10.1145/3351556.3351567. Online publication date: 26-Sep-2019
  • (2019) Replication is more efficient than you think. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.1145/3295500.3356171. Online publication date: 17-Nov-2019

Information & Contributors

Information

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018
932 pages

Sponsors

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. HPC
  2. checkpoint
  3. fault tolerance
  4. replication
  5. resilience

Qualifiers

  • Research-article

Conference

SC18
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
