Partial redundancy in HPC systems with non-uniform node reliabilities

Authors:

Rami Melhem

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

Article No.: 44, Pages 1 - 11

https://doi.org/10.1109/SC.2018.00047

Published: 26 July 2019

Abstract

We study the usefulness of partial redundancy in HPC message passing systems where individual node failure distributions are not identical. Prior research on fault tolerance has generally assumed identical failure distributions for the nodes of the system. In such settings, partial replication has never been shown to outperform the two extremes (full replication and no replication) for any significant range of node counts. In this work, we argue that partial redundancy may provide the best performance under the more realistic assumption of non-identical node failure distributions. We provide theoretical results on arranging nodes with different reliability values among replicas such that system reliability is maximized. Moreover, using system reliability to compute the MTTI (mean time to interrupt) and the expected completion time of a partially replicated system, we numerically determine the optimal partial replication degree. Our results indicate that partial replication can be a more efficient alternative to full replication at system scales where Checkpoint/Restart alone is not sufficient.
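To make the node-arrangement question concrete, the following is a minimal sketch, not the paper's model. It assumes every node is placed in a replica pair, that failures are independent, and that each node has a fixed survival probability over the interval of interest; the reliability values and all function names are hypothetical. Under those assumptions a pair survives unless both of its nodes fail, and the system survives only if every pair does. The sketch compares pairing the most reliable node with the least reliable one against an exhaustive search over all pairings.

```python
from itertools import permutations
from math import prod


def pair_reliability(r_a, r_b):
    """A replicated pair survives unless both of its nodes fail."""
    return 1.0 - (1.0 - r_a) * (1.0 - r_b)


def system_reliability(pairs):
    """The system survives only if every pair survives (independence assumed)."""
    return prod(pair_reliability(a, b) for a, b in pairs)


def opposite_pairing(reliabilities):
    """Pair the least reliable node with the most reliable one, moving inward."""
    s = sorted(reliabilities)
    return [(s[i], s[-1 - i]) for i in range(len(s) // 2)]


def adjacent_pairing(reliabilities):
    """Pair nodes of similar reliability (a baseline for comparison)."""
    s = sorted(reliabilities)
    return list(zip(s[0::2], s[1::2]))


def best_pairing_brute_force(reliabilities):
    """Try every pairing; only practical for a handful of nodes."""
    best_score, best_pairs = -1.0, None
    for perm in permutations(reliabilities):
        pairs = list(zip(perm[0::2], perm[1::2]))
        score = system_reliability(pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs


# Hypothetical per-node survival probabilities (illustrative only).
nodes = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]

opposite = system_reliability(opposite_pairing(nodes))
adjacent = system_reliability(adjacent_pairing(nodes))
exhaustive, _ = best_pairing_brute_force(nodes)

print(f"opposite pairing:  {opposite:.4f}")
print(f"adjacent pairing:  {adjacent:.4f}")
print(f"exhaustive search: {exhaustive:.4f}")
```

In this toy setting the opposite pairing matches the exhaustive optimum, which is consistent with an exchange argument based on the rearrangement inequality. Partial replication, where only some nodes are paired, and time-dependent failure distributions are not covered by this sketch.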

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

November 2018

932 pages

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

IEEE Press

Qualifiers

Research-article

Conference

SC18

Sponsor:

SIGHPC

SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 16, 2018

Dallas, Texas

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
