To provide stable computing resources and thereby raise the completion rate of grid users' jobs, this paper proposes an active fault management method that addresses the stability of high-performance computing (HPC) systems. The method is implemented in three steps. First, the system's main fault modes are extracted from its operational history data. Next, the system state is monitored in real time at both the software and hardware levels, and system faults are diagnosed from the monitoring data. Finally, faulty components are isolated according to the diagnosis, which prevents faults from propagating and reduces the impact of low-level faults on upper-layer applications. The method achieved good results on a real production system: the interval between system-wide failures increased from 8 d to 28 d; the average fault repair time fell from 10 h to 16 min; and the proportion of jobs that failed because of node faults dropped from 4.6% to 1.3%. Practice shows that active fault management reduces the overhead of system faults and improves the completion rate of parallel jobs; deployed on the HPC systems at CNGrid nodes, it can further improve CNGrid's quality of service.
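The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fault labels, the metric names (`cpu_temp_c`, `disk_used_pct`, `link_up`), the thresholds, and the `Node` type are all hypothetical, chosen only to show how fault-mode extraction, diagnosis, and isolation might fit together.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    metrics: dict
    isolated: bool = False


def extract_fault_modes(history, top_n=2):
    """Step 1: keep the most frequent fault labels found in the history log."""
    counts = Counter(history)
    return [mode for mode, _ in counts.most_common(top_n)]


# Hypothetical diagnosis rules: fault mode -> predicate over a node's metrics.
RULES = {
    "overheat": lambda m: m.get("cpu_temp_c", 0) > 90,
    "disk_full": lambda m: m.get("disk_used_pct", 0) > 95,
    "net_down": lambda m: not m.get("link_up", True),
}


def diagnose(node, fault_modes):
    """Step 2: return the tracked fault modes that the node's metrics match."""
    return [
        mode for mode in fault_modes
        if mode in RULES and RULES[mode](node.metrics)
    ]


def isolate_faulty(nodes, fault_modes):
    """Step 3: flag faulty nodes so that no new jobs are scheduled on them."""
    report = {}
    for node in nodes:
        faults = diagnose(node, fault_modes)
        if faults:
            node.isolated = True
            report[node.name] = faults
    return report


if __name__ == "__main__":
    history = ["overheat", "disk_full", "overheat",
               "net_down", "disk_full", "overheat"]
    modes = extract_fault_modes(history)
    nodes = [
        Node("n1", {"cpu_temp_c": 95, "disk_used_pct": 40}),
        Node("n2", {"cpu_temp_c": 60, "disk_used_pct": 50}),
    ]
    print(modes, isolate_faulty(nodes, modes))
```

In this sketch, isolation only sets a flag on the node; in a real system the scheduler would need to honor that flag, and the rules would be derived from the extracted fault modes rather than written by hand.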