User-level failure detection and auto-recovery of parallel programs in HPC systems

Source: Frontiers of Computer Science
As the mean time between failures (MTBF) continues to decline with the increasing number of components in large-scale high performance computing (HPC) systems, program failures are likely to occur during execution. Ensuring the successful execution of HPC programs has therefore become an issue that unprivileged users must be concerned with. From the user's perspective, a program failure that is not detected and handled in time wastes resources and delays the progress of program execution. Unfortunately, unprivileged users are unable to check program state themselves, because execution is controlled by the job management system and their privileges are limited. Currently, automated tools that support user-level failure detection and auto-recovery of parallel programs in HPC systems are missing. This paper proposes an innovative method that lets an unprivileged user detect failures of job execution and automatically resubmit failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL detects execution failures effectively on Tianhe-2 and that the communication and performance overhead it causes is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.
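
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: a checker that polls job states and resubmits failed jobs, with two checkers that also watch each other. It is not ARL's actual code. `FakeScheduler`, `check_once`, the state names, and the reading of "dual-checker" as mutual monitoring are all assumptions made for illustration; a real checker would query the site's job management system instead of an in-memory stand-in.

```python
from typing import Dict, List

# Hypothetical job states; a real job management system reports its own names.
RUNNING, COMPLETED, FAILED = "RUNNING", "COMPLETED", "FAILED"


class FakeScheduler:
    """In-memory stand-in for a job management system (illustrative assumption)."""

    def __init__(self) -> None:
        self._states: Dict[int, str] = {}
        self._next_id = 1

    def submit(self, name: str) -> int:
        """Register a new job, mark it running, and return its job id."""
        job_id = self._next_id
        self._next_id += 1
        self._states[job_id] = RUNNING
        return job_id

    def state(self, job_id: int) -> str:
        return self._states[job_id]

    def set_state(self, job_id: int, state: str) -> None:
        """Test hook used here to simulate a failure."""
        self._states[job_id] = state


def check_once(sched: FakeScheduler, jobs: Dict[str, int], watch: List[str]) -> List[str]:
    """One polling pass by a checker.

    `jobs` maps a role name to its current job id and is updated in place.
    Every watched role whose job has failed is resubmitted; the list of
    resubmitted roles is returned.
    """
    resubmitted = []
    for role in watch:
        if sched.state(jobs[role]) == FAILED:
            jobs[role] = sched.submit(role)
            resubmitted.append(role)
    return resubmitted


# Usage: the user job and both checkers are submitted as separate jobs.
sched = FakeScheduler()
jobs = {role: sched.submit(role) for role in ("user", "checker_a", "checker_b")}

# Simulate a failure of the user job and of one checker.
sched.set_state(jobs["user"], FAILED)
sched.set_state(jobs["checker_b"], FAILED)

# The surviving checker_a watches the user job and its peer checker.
recovered = check_once(sched, jobs, watch=["user", "checker_b"])
```

In this sketch each checker would run `check_once` periodically with a watch list containing the user job and the other checker, so that the loss of one checker does not stop monitoring.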