PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Xin-Hai Xu¹, Xue-Jun Yang¹, Jing-Ling Xue², Yu-Fei Lin¹ & Yi-Song Lin¹


Abstract

GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially re-computing the region; describe a checkpoint-based fault-tolerance framework developed on PartialRC; and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur earlier during execution and by 74.6% when errors occur later. In addition, PartialRC reduces the error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault occurs.
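
The contrast between full and partial recomputation can be illustrated with a toy model. The sketch below is plain Python, not CUDA, and is not the paper's implementation. It assumes that an error detector can localise a fault to one block of independent threads and that the region's inputs were checkpointed beforehand. The block size and the names `kernel`, `run_block`, `run_region`, `full_recover`, and `partial_recover` are hypothetical and chosen only for illustration.

```python
"""Toy comparison of full vs. partial recomputation after a detected fault."""

BLOCK_SIZE = 4  # assumed recovery granularity (hypothetical, not from the paper)


def kernel(x):
    """Stand-in for the work done by one GPU thread."""
    return x * x + 1


def run_block(checkpoint, block):
    """Recompute the outputs of one block from the checkpointed inputs."""
    start = block * BLOCK_SIZE
    return [kernel(x) for x in checkpoint[start:start + BLOCK_SIZE]]


def run_region(checkpoint):
    """Compute the whole code region, block by block."""
    n_blocks = (len(checkpoint) + BLOCK_SIZE - 1) // BLOCK_SIZE
    out = []
    for b in range(n_blocks):
        out.extend(run_block(checkpoint, b))
    return out


def full_recover(checkpoint):
    """Full recomputation: roll back and re-execute every thread."""
    return run_region(checkpoint), len(checkpoint)


def partial_recover(checkpoint, output, faulty_blocks):
    """Partial recomputation: re-execute only the blocks flagged as faulty."""
    repaired = list(output)
    recomputed = 0
    for b in sorted(faulty_blocks):
        start = b * BLOCK_SIZE
        fresh = run_block(checkpoint, b)
        repaired[start:start + len(fresh)] = fresh
        recomputed += len(fresh)
    return repaired, recomputed


if __name__ == "__main__":
    checkpoint = list(range(16))      # inputs saved before the region runs
    golden = run_region(checkpoint)   # fault-free reference result
    corrupted = list(golden)
    corrupted[5] ^= 1                 # inject a bit flip into block 1
    fixed, cost = partial_recover(checkpoint, corrupted, {1})
    _, full_cost = full_recover(checkpoint)
    print(fixed == golden, cost, full_cost)  # prints: True 4 16
```

In this toy, repairing one faulty block re-executes 4 threads instead of 16. The ratio depends entirely on the assumed block size and fault pattern, so it says nothing about the 73.5% and 74.6% figures reported in the abstract.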



Author information

Authors and Affiliations

National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, 410073, China
Xin-Hai Xu (Student Member, CCF, ACM), Xue-Jun Yang (Senior Member, CCF, Member, ACM, IEEE), Yu-Fei Lin (Student Member, CCF, ACM) & Yi-Song Lin
Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Jing-Ling Xue (Senior Member, IEEE, Member, ACM)


Corresponding author

Correspondence to Xin-Hai Xu.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant Nos. 60921062, 61003087, 61120106005 and 61170049.



About this article

Cite this article

Xu, XH., Yang, XJ., Xue, JL. et al. PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs. J. Comput. Sci. Technol. 27, 240–255 (2012). https://doi.org/10.1007/s11390-012-1220-5


Received: 13 June 2011
Revised: 06 January 2012
Published: 05 March 2012
Issue Date: March 2012
DOI: https://doi.org/10.1007/s11390-012-1220-5
