VOCL-FT: introducing techniques for efficient soft error coprocessor recovery

Authors: Antonio J. Peña, Pavan Balaji

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 71, Pages 1 - 12

https://doi.org/10.1145/2807591.2807640

Published: 15 November 2015

Abstract

Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze these opportunities as the building blocks of an efficient fault-tolerant system for accelerators. Although we apply our techniques to protect against detected but uncorrected ECC errors in the device memory of OpenCL-accelerated applications, coprocessor reliability solutions based on different error detectors and similar API semantics can directly adopt the techniques we propose. Adding error detection and protection involves a tradeoff between runtime overhead and recovery time. Although optimal configurations depend on the particular application, the length of the run, the error rate, and the temporary storage speed, our test cases reveal a good balance with significantly reduced runtime overheads.

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2015, 985 pages

ISBN: 9781450337236

DOI: 10.1145/2807591

General Chair: Jackie Kern, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Jeffrey S. Vetter, Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee

Copyright © 2015 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Conference

SC15

Sponsors: SIGHPC, SIGARCH, IEEE-CS

SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 15 - 20, 2015

Austin, Texas

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
