PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Source: Journal of Computer Science and Technology (English edition)
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially recomputing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC: on average, by 73.5% when errors occur earlier during execution and by 74.6% when errors occur later. In addition, PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overheads when no fault happens.
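The contrast between full and partial recomputing can be illustrated with a small CPU-side simulation. The abstract does not spell out how PartialRC selects the work to redo, so the sketch below rests on stated assumptions: a code region is modeled as a list of independent blocks, a hypothetical detector can attribute an error to individual blocks, and the checkpoint is simply a copy of the region's inputs. The names `run_blocks`, `detect_faulty`, and `recover` are illustrative only and do not come from the paper.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Kernel = Callable[[float], float]


def run_blocks(inputs: List[float], block_ids: Iterable[int],
               kernel: Kernel, inject: Set[int]) -> Dict[int, float]:
    """Run `kernel` for the given blocks; blocks in `inject` yield a corrupted value."""
    out: Dict[int, float] = {}
    for b in block_ids:
        value = kernel(inputs[b])
        # Simulated transient fault (assumption): perturb the block's result.
        out[b] = value + 1.0 if b in inject else value
    return out


def detect_faulty(inputs: List[float], outputs: Dict[int, float],
                  kernel: Kernel) -> Set[int]:
    """Hypothetical detector: flag blocks whose output disagrees with a reference run."""
    return {b for b, v in outputs.items() if v != kernel(inputs[b])}


def recover(inputs: List[float], kernel: Kernel, inject: Set[int],
            partial: bool) -> Tuple[List[float], int]:
    """Execute a region once, then recover; return (results, blocks recomputed)."""
    checkpoint = list(inputs)  # checkpoint taken at region entry
    n = len(checkpoint)
    outputs = run_blocks(checkpoint, range(n), kernel, inject)
    faulty = detect_faulty(checkpoint, outputs, kernel)
    if partial:
        redo = sorted(faulty)                    # partial: only the faulty blocks
    else:
        redo = list(range(n)) if faulty else []  # full: roll back the whole region
    # Re-execution from the checkpoint is assumed to be fault-free.
    outputs.update(run_blocks(checkpoint, redo, kernel, set()))
    return [outputs[b] for b in range(n)], len(redo)


def square(x: float) -> float:
    return x * x


data = [float(i) for i in range(8)]
full_result, full_redo = recover(data, square, {2, 5}, partial=False)
part_result, part_redo = recover(data, square, {2, 5}, partial=True)
```

Both strategies end with the same corrected results; they differ only in how many blocks are executed again during recovery. The simulation says nothing about the actual overhead reductions reported above, which depend on the GPGPU workload and on where in the execution the error occurs.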