Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Public Access

Silent Data Corruption Resilient Two-sided Matrix Factorizations

Published: 26 January 2017 Publication History

Abstract

This paper presents an algorithm based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two sided factorizations are usually the prerequisites to computing eigenvalues/eigenvectors and singular value decomposition. Algorithm based fault tolerance has been shown to work on three main one-sided matrix factorizations: LU, Cholesky, and QR, but extending it to cover two sided factorizations is non-trivial because there are no obvious \textit{offline, problem} specific maintenance of checksums. We thus develop an \textit{online, algorithm} specific checksum scheme and show how to systematically adapt the two sided factorization algorithms used in LAPACK and ScaLAPACK packages to introduce the algorithm based fault tolerance.
The resulting ABFT scheme can detect and correct arithmetic errors \textit{continuously} during the factorizations that allow timely error handling. Detailed analysis and experiments are conducted to show the cost and the gain in resilience. We demonstrate that our scheme covers a significant portion of the operations of the factorizations. Our checksum scheme achieves high error detection coverage and error correction coverage compared to the state of the art, with low overhead and high scalability.

References

[1]
C. Anfinson and F. Luk. A linear algebraic model of algorithm-based fault tolerance, 1988. ISSN 00189340. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=9736.
[2]
L. S. Blackford, J. Society for Industrial and Applied Mathematics., A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and J. J. Whaley, R. C./Dongarra. ScaLAPACK user's guide. SIAM, 1997. ISBN 0898713978.
[3]
A. Bouteiller, T. Herault, G. Bosilca, P. Du, and J. Dongarra. Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy. URL http://www.netlib.org/utk/people/JackDongarra/PAPERS/topc.pdf.
[4]
F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. Toward Exascale Resilience: 2014 update. Supercomputing frontiers and innovations, 1(1):5--28, jun 2014. URL http://superfri.org/superfri/article/view/14http://superfri.org/superfri/article/download/14/7.
[5]
Z. Chen and J. Dongarra. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 19(12):1628--1641, dec 2008. ISSN 10459219. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4492768.
[6]
T. Davies and Z. Chen. Correcting soft errors online in LU factorization. pages 167--178. ACM, 2013. URL http://dl.acm.org/citation.cfm?id=2462920.
[7]
N. DeBardeleben, J. Laros, J. T. Daly, S. L. Scott, C. Engelmann, and B. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Whitepaper, 2009.
[8]
J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of matrices to condensed forms for eigenvalue computations. Journal of Computational and Applied Mathematics, 27(1--2):215--227, sep 1989. ISSN 03770427. URL http://linkinghub.elsevier.com/retrieve/pii/0377042789903671.
[9]
P. Du and P. Luszczek. High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors. Procedia Computer Science, 9:216--225, 2012. ISSN 18770509.
[10]
P. Du, P. Luszczek, S. Tomov, and J. Dongarra. Soft error resilient QR factorization for hybrid system with GPGPU. Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11, page 11, 2011. URL http://dl.acm.org/citation.cfm?doid=2133173.2133179.
[11]
P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based Fault Tolerance for Dense Matrix Factorizations. PPoPP '12, pages 225--234, New York, NY, USA, 2012. ACM. ISBN 978--1--4503--1160--1. URL http://doi.acm.org/10.1145/2145816.2145845http://dl.acm.org/ft{_}gateway.cfm?id=2145845{&}type=pdf.
[12]
D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. Detection and Correction of Silent Data Corruption for Large-scale High-performance Computing. SC '12, pages 78:1--78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press. ISBN 978--1--4673-0804--5.URL http://dl.acm.org/citation.cfm?id=2388996.2389102http://dl.acm.org/ft{_}gateway.cfm?id=2389102{&}type=pdf.
[13]
D. Hakkarinen, P. Wu, and Z. Chen. Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition. IEEE Transactions on Parallel and Distributed Systems, 26(5):1323--1335, 2015. ISSN 10459219. URL http://ieeexplore.ieee.org/ielx7/71/4359390/06805637.pdf?tp={&}arnumber=6805637{&}isnumber=4359390http://ieeexplore.ieee.org/xpls/abs{_}all.jsp?arnumber=6805637.
[14]
T. Hoefler and R. Belli. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses when Reporting Performance Results. SC '15, pages 73:1--73:12, New York, NY, USA, 2015. ACM. ISBN 978--1--4503--3723--6. URL http://doi.acm.org/10.1145/2807591.2807644http://dl.acm.org/ft{_}gateway.cfm?id=2807644{&}type=pdf.
[15]
K.-h. Huang and J. A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Transactions on Computers, C-33(6):518--528, jun 1984. ISSN 0018--9340. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1676475.
[16]
Y. Jia, G. Bosilca, P. Luszczek, and J. J. Dongarra. Parallel reduction to hessenberg form with algorithm-based fault tolerance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13, pages 1--11, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450323789. URL http://dl.acm.org/citation.cfm?id=2503210.2503249.
[17]
Y. Jia, P. Luszczek, G. Bosilca, and J. J. Dongarra. CPUGPU hybrid bidiagonal reduction with soft error resilience. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13, pages 1--5, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450325080. URL http://dl.acm.org/citation.cfm?id=2530268.2530270.
[18]
F. Luk and H. Park. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Transactions on Computers, 37(11):1434--1438, 1988. ISSN 00189340. URL http://ieeexplore.ieee.org/xpl/freeabs{_}all.jsp?arnumber=8712.
[19]
F. T. Luk and H. Park. An Analysis of Algorithm-based Fault Tolerance Techniques. J. Parallel Distrib. Comput., 5(2):172--184, apr 1988. ISSN 0743--7315. URL http://dx.doi.org/10.1016/0743--7315(88)90027--5.
[20]
T. May and M. H. Woods. Alpha-particle-induced soft errors in dynamic memories. IEEE Transactions on Electron Devices, 26(1):2--9, jan 1979. ISSN 0018--9383. URL http://ieeexplore.ieee.org/ielx5/16/31785/01479948.pdf?tp={&}arnumber=1479948{&}isnumber=31785http://ieeexplore.ieee.org/xpls/abs{_}all.jsp?arnumber=1479948{&}tag=1.
[21]
S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, S. Wender, and Others. Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer. Device and Materials Reliability, IEEE Transactions on, 5(3):329--335, 2005.
[22]
S. Mukherjee. Architecture design for soft errors. Morgan Kaufmann, 2011. URL http://app.knovel.com/web/toc.v/cid:kpADSE0016https://books.google.com/books?hl=en{&}lr={&}id=wBuy0oLXEuQC{&}oi=fnd{&}pg=PP1{&}dq=architecture+design+for+soft+errors{&}ots=8xRhRBNaaq{&}sig= Gk6RnKWTc09HszxcshwNKHVNzCc{#}v=onepage{&}q=architecturedesignforsofterrors{&}f=f.
[23]
E. Normand. Single event upset at ground level. IEEE transactions on Nuclear Science, 43(6):2742--2750, 1996.
[24]
M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. DeBardeleben, P. C. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing failures in exascale computing. International Journal of High Performance Computing Applications, 28(2):129--173, may 2014. ISSN 1094--3420, 1741--2846. URL http://hpc.sagepub.com/content/28/2/129http://hpc.sagepub.com/content/28/2/129.full.pdfhttp://www.mcs.anl.gov/publication/addressing-failures-exascale-computing-0.
[25]
Q. Wang, X. Zhang, Y. Zhang, and Q. Yi. AUGEM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on- SC '13, pages 1--12, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450323789. URL http://dl.acm.org/citation.cfm?id=2503210.2503219.
[26]
P. Wu and Z. Chen. FT-ScaLAPACK: Correcting Soft Errors On-line for ScaLAPACK Cholesky, QR, and LU Factorization Routines. HPDC '14, pages 49--60, New York, NY, USA, 2014. ACM. ISBN 978--1--4503--2749--7. URL http://doi.acm.org/10.1145/2600212.2600232http://dl.acm.org/ft{_}gateway.cfm?id=2600232{&}type=pdf.
[27]
P. Wu, C. Ding, L. Chen, F. Gao, T. Davies, C. Karlsson, and Z. Chen. Fault tolerant matrix-matrix multiplication. In Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11, page 25, New York, New York, USA, 2011. ACM Press. ISBN 9781450311809. URL http://dl.acm.org/citation.cfm?doid=2133173.2133185.
[28]
P. Wu, C. Ding, L. Chen, T. Davies, C. Karlsson, and Z. Chen. On-line soft error correction in matrix-matrix multiplication. Journal of Computational Science, 4(6):165--172, nov 2013. ISSN 18777503. URL http://www.sciencedirect.com/science/article/pii/S1877750313000641http://www.sciencedirect.com/science/article/pii/S1877750313000641/pdf?md5=740c9447815ceb3ef4e1fc79e3e60af2{&}pid=1-s2.0-S1877750313000641-main.pdf.
[29]
P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao, X. Liang, J. Chen, and Z. Chen. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16, pages 31--42, New York, New York, USA, 2016. ACM Press. ISBN 9781450343145. URL http://dl.acm.org/citation.cfm?doid=2907294.2907315.
[30]
J. F. Ziegler and H. Puchner. SER--history, Trends and Challenges: A Guide for Designing with Memory ICs. Cypress, 2004.

Cited By

View all
  • (2024)Comparative analysis of soft-error sensitivity in LU decomposition algorithms on diverse GPUsThe Journal of Supercomputing10.1007/s11227-024-05925-0Online publication date: 22-Feb-2024
  • (2023)Anomaly Detection in Scientific Datasets using Sparse RepresentationProceedings of the First Workshop on AI for Systems10.1145/3588982.3603610(13-18)Online publication date: 10-Aug-2023
  • (2023)Improving Energy Saving of One-Sided Matrix Decompositions on CPU-GPU Heterogeneous SystemsProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3572848.3577496(274-287)Online publication date: 25-Feb-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 52, Issue 8
PPoPP '17
August 2017
442 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3155284
Issue’s Table of Contents
  • cover image ACM Conferences
    PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    January 2017
    476 pages
    ISBN:9781450344937
    DOI:10.1145/3018743
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 January 2017
Published in SIGPLAN Volume 52, Issue 8

Check for updates

Author Tags

  1. abft
  2. algorithm based fault tolerance
  3. eigenvalue decomposition
  4. singular value decomposition
  5. svd

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)140
  • Downloads (Last 6 weeks)15
Reflects downloads up to 24 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Comparative analysis of soft-error sensitivity in LU decomposition algorithms on diverse GPUsThe Journal of Supercomputing10.1007/s11227-024-05925-0Online publication date: 22-Feb-2024
  • (2023)Anomaly Detection in Scientific Datasets using Sparse RepresentationProceedings of the First Workshop on AI for Systems10.1145/3588982.3603610(13-18)Online publication date: 10-Aug-2023
  • (2023)Improving Energy Saving of One-Sided Matrix Decompositions on CPU-GPU Heterogeneous SystemsProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3572848.3577496(274-287)Online publication date: 25-Feb-2023
  • (2022)Reliability Evaluation of LU Decomposition on GPU-Accelerated System-on-Chip Under Proton IrradiationIEEE Transactions on Nuclear Science10.1109/TNS.2022.315582069:7(1467-1474)Online publication date: Jul-2022
  • (2022)Efficient Soft-Error Detection for Low-precision Deep Learning Recommendation Models2022 IEEE International Conference on Big Data (Big Data)10.1109/BigData55660.2022.10020972(1556-1563)Online publication date: 17-Dec-2022
  • (2021)Resilient error-bounded lossy compressor for data transferProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476195(1-14)Online publication date: 14-Nov-2021
  • (2021)TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2021.02.013151(70-85)Online publication date: May-2021
  • (2020)FPDetectACM Transactions on Architecture and Code Optimization10.1145/340245117:3(1-27)Online publication date: 17-Aug-2020
  • (2020)Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults2020 International Symposium on Reliable Distributed Systems (SRDS)10.1109/SRDS51746.2020.00034(266-275)Online publication date: Sep-2020
  • (2019)BonVoisionProceedings of the ACM International Conference on Supercomputing10.1145/3330345.3330388(484-496)Online publication date: 26-Jun-2019
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media