Abstract
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the computing power of GPGPUs. In this paper, we introduce the PartialRC method, which recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework built on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC: on average, by 73.5% when errors occur early in execution and by 74.6% when errors occur late. In addition, PartialRC reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overheads when no fault occurs.
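The contrast between partial and full recomputing can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the implementation described in the paper: the block-level granularity, the helper names (`run_blocks`, `recover`, `demo`), the toy kernel, and the assumption that an error detector reports exactly which blocks are faulty are all hypothetical. It also assumes that blocks are independent and that the checkpointed region inputs are never overwritten, so a block can be safely re-executed on its own.

```python
from typing import Callable, Iterable, List, Set


def run_blocks(kernel: Callable[[int], int], checkpoint: List[int],
               out: List[int], block_size: int, block_ids: Iterable[int]) -> int:
    """Execute the code region for the given blocks only.

    Reads from the checkpointed inputs, writes into `out`, and returns
    the number of elements computed (a stand-in for recovery cost).
    """
    work = 0
    for b in block_ids:
        start = b * block_size
        for i in range(start, min(start + block_size, len(checkpoint))):
            out[i] = kernel(checkpoint[i])
            work += 1
    return work


def recover(kernel: Callable[[int], int], checkpoint: List[int], out: List[int],
            block_size: int, faulty: Set[int], partial: bool) -> int:
    """Roll back to the checkpoint and recompute.

    partial=True  : re-execute only the blocks reported as faulty.
    partial=False : re-execute the whole region (full recomputing).
    """
    num_blocks = (len(checkpoint) + block_size - 1) // block_size
    targets = sorted(faulty) if partial else range(num_blocks)
    return run_blocks(kernel, checkpoint, out, block_size, targets)


def demo() -> dict:
    kernel = lambda x: x * x + 1        # toy per-thread computation (assumed)
    checkpoint = list(range(64))        # region inputs saved at the checkpoint
    block_size = 8
    num_blocks = len(checkpoint) // block_size
    expected = [kernel(x) for x in checkpoint]

    results = {}
    for partial in (True, False):
        out = [0] * len(checkpoint)
        run_blocks(kernel, checkpoint, out, block_size, range(num_blocks))
        out[19] ^= 0x4                  # simulated bit flip in block 2
        out[42] ^= 0x1                  # simulated bit flip in block 5
        faulty = {2, 5}                 # assumed output of an error detector
        work = recover(kernel, checkpoint, out, block_size, faulty, partial)
        assert out == expected          # both strategies restore correct results
        results["partial" if partial else "full"] = work
    return results


if __name__ == "__main__":
    print(demo())
```

In this toy setting both strategies restore the correct output, but the partial strategy recomputes only the elements of the two faulty blocks, whereas the full strategy recomputes the entire region. The recovery figures reported in the abstract come from real CUDA programs and are not reproduced by this simulation.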
Additional information
This work was supported by the National Natural Science Foundation of China under Grant Nos. 60921062, 61003087, 61120106005 and 61170049.
Cite this article
Xu, XH., Yang, XJ., Xue, JL. et al. PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs. J. Comput. Sci. Technol. 27, 240–255 (2012). https://doi.org/10.1007/s11390-012-1220-5
DOI: https://doi.org/10.1007/s11390-012-1220-5