DOI: 10.1145/2807591.2807640

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery

Published: 15 November 2015

Abstract

Popular accelerator programming models rely on offloading computation and the corresponding data transfers to coprocessors, with synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not exploited by traditional checkpoint/restart systems, and we analyze them as the building blocks of an efficient fault-tolerant system for accelerators. Although we apply our techniques to protect against detected but uncorrected ECC errors in the device memory of OpenCL-accelerated applications, coprocessor reliability solutions based on different error detectors and similar API semantics can directly adopt the techniques we propose. Adding error detection and protection involves a tradeoff between runtime overhead and recovery time. Although optimal configurations depend on the particular application, the length of the run, the error rate, and the speed of temporary storage, our test cases reveal a good balance with significantly reduced runtime overheads.
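
The tradeoff between runtime overhead and recovery time mentioned in the abstract can be illustrated with a classical first-order checkpointing model. The sketch below is not the method of the paper: it substitutes Young's well-known approximation for the checkpoint interval, and the function names and sample values (a 2-second checkpoint cost, a 10,000-second mean time between errors) are assumptions chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint (assumed constant).
    mtbf_s: mean time between errors (assumed known and much larger
            than the checkpoint cost).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float,
                      mtbf_s: float) -> float:
    """First-order fraction of run time lost to fault tolerance.

    The first term is the cost of taking checkpoints (runtime overhead);
    the second is the expected rework after an error (recovery time),
    assuming an error lands on average halfway through an interval.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 2.0, 10_000.0  # hypothetical values, in seconds
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(f"interval={interval:7.1f} s  "
              f"overhead={overhead_fraction(interval, cost, mtbf):.4f}")
```

Under these assumed values the model places the interval at about 200 seconds with roughly 2% of run time lost. Checkpointing twice as often or half as often both raise the loss, which is the balance the abstract refers to; a faster temporary storage lowers the checkpoint cost and shifts the optimum toward shorter intervals.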

Information & Contributors

Information

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ECC
  2. VOCL
  3. co-processor
  4. fault tolerance
  5. virtualization

Qualifiers

  • Research-article

Conference

SC15
Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 64
  • Downloads (last 6 weeks): 7
Reflects downloads up to 24 Sep 2024.

Cited By

  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications 36:2 (251-285). DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022
  • (2020) Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (44-57). DOI: 10.1109/HPCA47549.2020.00014. Online publication date: Feb-2020
  • (2020) Checkpoint Restart Support for Heterogeneous HPC Applications. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) (242-251). DOI: 10.1109/CCGrid49817.2020.00-69. Online publication date: May-2020
  • (2019) Checkpointing Kernel Executions of MPI+CUDA Applications. Euro-Par 2019: Parallel Processing Workshops (694-706). DOI: 10.1007/978-3-030-48340-1_53. Online publication date: 26-Aug-2019
  • (2019) Concepts for OpenMP Target Offload Resilience. OpenMP: Conquering the Full Hardware Spectrum (78-93). DOI: 10.1007/978-3-030-28596-8_6. Online publication date: 9-Aug-2019
  • (2018) A Lightweight Approach to GPU Resilience. Euro-Par 2018: Parallel Processing Workshops (826-838). DOI: 10.1007/978-3-030-10549-5_64. Online publication date: 31-Dec-2018
  • (2017) Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques. Proceedings of the International Conference on Supercomputing (1-10). DOI: 10.1145/3079079.3079093. Online publication date: 14-Jun-2017
  • (2017) A portable and adaptable fault tolerance solution for heterogeneous applications. Journal of Parallel and Distributed Computing 104:C (146-158). DOI: 10.1016/j.jpdc.2017.01.020. Online publication date: 1-Jun-2017
  • (2017) Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization. The Journal of Supercomputing. DOI: 10.1007/s11227-017-2116-5. Online publication date: 20-Aug-2017
  • (2016) Understanding error propagation in GPGPU applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.5555/3014904.3014932. Online publication date: 13-Nov-2016
