ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner

Published: 31 May 2016
DOI: 10.1145/2909428.2909429

Abstract

We present a task-based domain-decomposition preconditioner for partial differential equations (PDEs) resilient to silent data corruption (SDC) and hard faults.
The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a regression-based solution update that is resilient to SDC. We adopt a server-client model implemented using the User Level Failure Mitigation extension of MPI (ULFM-MPI). All state information is held by the servers, while the clients act purely as computational units. The task-based structure of the algorithm, combined with the recovery capabilities of ULFM, allows the application to tolerate missing tasks, making it resilient to hard faults affecting the clients.
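
To make the server-client recovery pattern concrete, the following is a minimal C sketch (not the authors' implementation) of how a server loop might detect a crashed client and repair its communicator, assuming the ULFM extensions as exposed by Open MPI (mpi-ext.h, MPIX_Comm_shrink, MPIX_Comm_failure_ack, MPIX_ERR_PROC_FAILED). The task count NUM_TASKS and the helper requeue_failed_tasks() are hypothetical placeholders for the application's bookkeeping; the client side is not shown.

    /* Sketch of a ULFM-based server loop: clients compute tasks and send
     * results back; a client crash is detected as an MPI error, the
     * communicator is shrunk, and the orphaned task is requeued. */
    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM prototypes (MPIX_...) in Open MPI builds */

    enum { NUM_TASKS = 128 };              /* illustrative task count */

    /* Hypothetical placeholder: put tasks owned by dead clients back on
     * the server's work queue. */
    static void requeue_failed_tasks(void) { /* ... */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm comm = MPI_COMM_WORLD;
        /* Return error codes to the caller instead of aborting the job. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        int remaining = NUM_TASKS;
        while (remaining > 0) {
            double result;
            MPI_Status status;
            /* Servers hold all state; clients only return task results. */
            int rc = MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                              MPI_ANY_TAG, comm, &status);

            if (rc != MPI_SUCCESS) {
                int eclass;
                MPI_Error_class(rc, &eclass);
                if (eclass == MPIX_ERR_PROC_FAILED ||
                    eclass == MPIX_ERR_REVOKED) {
                    /* A client crashed: acknowledge the failure, rebuild the
                     * communicator without the dead ranks (ranks are
                     * renumbered), and reschedule the lost work. */
                    MPIX_Comm_failure_ack(comm);
                    MPI_Comm shrunk;
                    MPIX_Comm_shrink(comm, &shrunk);
                    MPI_Comm_set_errhandler(shrunk, MPI_ERRORS_RETURN);
                    comm = shrunk;     /* old comm leaked for brevity */
                    requeue_failed_tasks();
                }
                continue;
            }
            /* The regression-based solution update would consume `result`
             * here; a requeued task's result simply arrives later. */
            remaining--;
        }

        MPI_Finalize();
        return 0;
    }

The key design point reflected in the sketch is that a client failure never destroys state: the server acknowledges the failure, shrinks the communicator, and simply reschedules the orphaned task.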
Weak and strong scaling tests on up to ~115k cores show excellent performance, with efficiencies above 90%, demonstrating the application's suitability for large-scale runs. We demonstrate the resilience of the application for a 2D elliptic PDE by injecting SDC using a random single bit-flip model, and hard faults in the form of client crashes. We show that in all cases the application converges to the correct solution. We analyze the overhead caused by the faults and show that, for the test problem considered, the overhead incurred due to SDC is minimal compared to that caused by hard faults.
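
For reference, a random single bit-flip model of this kind is commonly implemented by choosing one of the 64 bits of an IEEE-754 double uniformly at random and inverting it; a minimal C sketch under that assumption follows (the helper name inject_bitflip is illustrative, not taken from the paper).

    /* Minimal sketch of a random single bit-flip SDC injector: flip one
     * uniformly chosen bit of an IEEE-754 double. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static double inject_bitflip(double x)    /* hypothetical helper name */
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);           /* type-pun safely */
        bits ^= (uint64_t)1 << (rand() % 64);     /* flip one random bit */
        memcpy(&x, &bits, sizeof bits);
        return x;
    }

    int main(void)
    {
        srand(42);
        double sample = 3.141592653589793;
        printf("original:  %.17g\n", sample);
        printf("corrupted: %.17g\n", inject_bitflip(sample));
        return 0;
    }

Depending on whether the flipped bit lands in the sign, the exponent, or the mantissa, the perturbation ranges from negligible to many orders of magnitude, which is why SDC must be handled at the algorithm level.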




Published In

FTXS '16: Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
May 2016
58 pages
ISBN: 9781450343497
DOI: 10.1145/2909428

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Funding Sources

  • DOE Office of Science ASCR
  • NERSC
  • Lockheed Martin Corporation for the U.S. Department of Energy's National Nuclear Security Administration

Conference

HPDC '16

Acceptance Rates

Overall Acceptance Rate: 16 of 25 submissions, 64%


Cited By

  • (2023) Towards elastic in situ analysis for high-performance computing simulations. Journal of Parallel and Distributed Computing, 177:106-116. DOI: 10.1016/j.jpdc.2023.02.014. Online publication date: Jul 2023.
  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications, 36(2):251-285. DOI: 10.1177/10943420211055188. Online publication date: 1 Mar 2022.
  • (2021) Legio: fault resiliency for embarrassingly parallel MPI applications. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03951-w. Online publication date: 25 Jun 2021.
  • (2020) Tree-based fault-tolerant collective operations for MPI. Concurrency and Computation: Practice and Experience, 33(14). DOI: 10.1002/cpe.5826. Online publication date: 15 Jun 2020.
  • (2018) Towards Ad Hoc Recovery for Soft Errors. 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pages 1-10. DOI: 10.1109/FTXS.2018.00004. Online publication date: Nov 2018.
  • (2017) Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications. The Journal of Supercomputing, 73(1):316-329. DOI: 10.1007/s11227-016-1863-z. Online publication date: 1 Jan 2017.
