GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerance mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the computing power of GPGPUs. In this paper, we introduce the PartialRC method, which recovers from errors detected in a code region by partially recomputing that region; describe a checkpoint-based fault-tolerance framework built on PartialRC; and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC: by 73.5% on average when errors occur earlier during execution and by 74.6% on average when errors occur later. In addition, PartialRC reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overhead when no fault occurs.
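
The difference between full and partial recomputation can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the PartialRC implementation: it assumes the threads of a protected region are independent, so that only the threads whose results are flagged as faulty need to be re-executed from the checkpointed inputs. The names `kernel`, `run_threads`, `detect_errors`, and `execute_region` are hypothetical, and the detector shown simply compares against a reference value.

```python
from typing import Dict, Iterable, List, Set, Tuple


def kernel(x: int) -> int:
    """Per-thread work of the protected region (placeholder computation)."""
    return x * x + 1


def run_threads(data: List[int], thread_ids: Iterable[int],
                corrupt: Set[int]) -> Dict[int, int]:
    """Simulate SIMT execution of the region for the given thread ids.

    Threads listed in `corrupt` return a wrong value, which stands in
    for a transient hardware fault (an assumption of this sketch).
    """
    return {t: kernel(data[t]) + (1 if t in corrupt else 0) for t in thread_ids}


def detect_errors(data: List[int], outputs: Dict[int, int]) -> List[int]:
    """Hypothetical detector: flag threads whose output disagrees with a
    reference value. A real detector would use a cheaper check."""
    return [t for t, v in outputs.items() if v != kernel(data[t])]


def execute_region(inputs: List[int], faulty_threads: Set[int],
                   partial: bool) -> Tuple[List[int], int]:
    """Run the region once, then recover from any detected errors.

    Returns the final outputs and the number of threads re-executed.
    """
    checkpoint = list(inputs)              # save the region's input data
    all_ids = list(range(len(inputs)))
    outputs = run_threads(inputs, all_ids, faulty_threads)

    bad = detect_errors(checkpoint, outputs)
    if not bad:
        return [outputs[t] for t in all_ids], 0

    # Partial recovery re-executes only the flagged threads;
    # full recovery rolls back and re-executes every thread.
    redo = bad if partial else all_ids
    outputs.update(run_threads(checkpoint, redo, set()))
    return [outputs[t] for t in all_ids], len(redo)


if __name__ == "__main__":
    data = list(range(8))
    print(execute_region(data, {2, 5}, partial=True))
    print(execute_region(data, {2, 5}, partial=False))
```

In this toy setting both strategies produce the same final outputs, but the partial strategy re-executes only the two faulty threads rather than all eight. The saving depends on the independence assumption above; a region in which threads communicate would need a larger recomputation set.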