ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner

Published: 31 May 2016
DOI: 10.1145/2909428.2909429

Abstract

We present a task-based domain-decomposition preconditioner for partial differential equations (PDEs) resilient to silent data corruption (SDC) and hard faults.
The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a regression-based solution update that is resilient to SDC. We adopt a server-client model implemented using the User Level Failure Mitigation extension of MPI (ULFM-MPI). All state information is held by the servers, while the clients act purely as computational units. The task-based structure of the algorithm, combined with the recovery capabilities of ULFM, allows the application to tolerate missing tasks, making it resilient to hard faults affecting the clients.
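
To make the server-client recovery pattern concrete, the following is a minimal C sketch (not the authors' implementation) of how a server loop might detect a crashed client and repair its communicator, assuming the ULFM extensions as exposed by Open MPI (mpi-ext.h, MPIX_Comm_shrink, MPIX_Comm_failure_ack, MPIX_ERR_PROC_FAILED). The task count NUM_TASKS and the helper requeue_failed_tasks() are hypothetical placeholders for the application's bookkeeping; the client side is not shown.

    /* Sketch of a ULFM-based server loop: clients compute tasks and send
     * results back; a client crash is detected as an MPI error, the
     * communicator is shrunk, and the orphaned task is requeued. */
    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM prototypes (MPIX_...) in Open MPI builds */

    enum { NUM_TASKS = 128 };              /* illustrative task count */

    /* Hypothetical placeholder: put tasks owned by dead clients back on
     * the server's work queue. */
    static void requeue_failed_tasks(void) { /* ... */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm comm = MPI_COMM_WORLD;
        /* Return error codes to the caller instead of aborting the job. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        int remaining = NUM_TASKS;
        while (remaining > 0) {
            double result;
            MPI_Status status;
            /* Servers hold all state; clients only return task results. */
            int rc = MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                              MPI_ANY_TAG, comm, &status);

            if (rc != MPI_SUCCESS) {
                int eclass;
                MPI_Error_class(rc, &eclass);
                if (eclass == MPIX_ERR_PROC_FAILED ||
                    eclass == MPIX_ERR_REVOKED) {
                    /* A client crashed: acknowledge the failure, rebuild the
                     * communicator without the dead ranks (ranks are
                     * renumbered), and reschedule the lost work. */
                    MPIX_Comm_failure_ack(comm);
                    MPI_Comm shrunk;
                    MPIX_Comm_shrink(comm, &shrunk);
                    MPI_Comm_set_errhandler(shrunk, MPI_ERRORS_RETURN);
                    comm = shrunk;     /* old comm leaked for brevity */
                    requeue_failed_tasks();
                }
                continue;
            }
            /* The regression-based solution update would consume `result`
             * here; a requeued task's result simply arrives later. */
            remaining--;
        }

        MPI_Finalize();
        return 0;
    }

The key design point reflected in the sketch is that a client failure never destroys state: the server acknowledges the failure, shrinks the communicator, and simply reschedules the orphaned task.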
Weak and strong scaling tests on up to ~115k cores show excellent performance, with efficiencies above 90%, demonstrating the application's suitability for large-scale runs. We demonstrate the resilience of the application for a 2D elliptic PDE by injecting SDC using a random single bit-flip model, and hard faults in the form of client crashes. We show that in all cases the application converges to the correct solution. We analyze the overhead caused by the faults and show that, for the test problem considered, the overhead incurred due to SDC is minimal compared to that caused by hard faults.
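
For reference, a random single bit-flip model of this kind is commonly implemented by choosing one of the 64 bits of an IEEE-754 double uniformly at random and inverting it; a minimal C sketch under that assumption follows (the helper name inject_bitflip is illustrative, not taken from the paper).

    /* Minimal sketch of a random single bit-flip SDC injector: flip one
     * uniformly chosen bit of an IEEE-754 double. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static double inject_bitflip(double x)    /* hypothetical helper name */
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);           /* type-pun safely */
        bits ^= (uint64_t)1 << (rand() % 64);     /* flip one random bit */
        memcpy(&x, &bits, sizeof bits);
        return x;
    }

    int main(void)
    {
        srand(42);
        double sample = 3.141592653589793;
        printf("original:  %.17g\n", sample);
        printf("corrupted: %.17g\n", inject_bitflip(sample));
        return 0;
    }

Depending on whether the flipped bit lands in the sign, the exponent, or the mantissa, the perturbation ranges from negligible to many orders of magnitude, which is why SDC must be handled at the algorithm level.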




Published In

FTXS '16: Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
May 2016
58 pages
ISBN: 9781450343497
DOI: 10.1145/2909428

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Funding Sources

  • DOE Office of Science ASCR
  • NERSC
  • Lockheed Martin Corporation for the U.S. Department of Energy's National Nuclear Security Administration

Conference

HPDC '16

Acceptance Rates

Overall Acceptance Rate: 16 of 25 submissions, 64%


Cited By

  • (2023) Towards elastic in situ analysis for high-performance computing simulations. Journal of Parallel and Distributed Computing, 177:106-116. DOI: 10.1016/j.jpdc.2023.02.014. Online publication date: Jul 2023.
  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications, 36(2):251-285. DOI: 10.1177/10943420211055188. Online publication date: 1 Mar 2022.
  • (2021) Legio: fault resiliency for embarrassingly parallel MPI applications. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03951-w. Online publication date: 25 Jun 2021.
  • (2020) Tree-based fault-tolerant collective operations for MPI. Concurrency and Computation: Practice and Experience, 33(14). DOI: 10.1002/cpe.5826. Online publication date: 15 Jun 2020.
  • (2018) Towards Ad Hoc Recovery for Soft Errors. 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pages 1-10. DOI: 10.1109/FTXS.2018.00004. Online publication date: Nov 2018.
  • (2017) Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications. The Journal of Supercomputing, 73(1):316-329. DOI: 10.1007/s11227-016-1863-z. Online publication date: 1 Jan 2017.
