Building on existing monitoring methods, multi-level, centralized monitoring software for the key components of a cluster system is designed and implemented. The software monitors a rich set of state parameters covering four categories of runtime state: the physical state of components, the load state of nodes, the event information of nodes, and the state of digital circuit logic. State data are stored centrally in a database, which simplifies the retrieval and analysis of historical data. Because all state data share a unified clock, the software can reconstruct the runtime scene of the cluster system at a given moment in the past. Results from running on a real system show that the online automatic fault-handling mechanism built on this software improves system stability and the job success rate.
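The combination of centralized database storage and a unified clock can be illustrated with a minimal sketch. It is not the paper's implementation: the SQLite schema, the category labels, and the function names `open_store`, `record`, and `snapshot` are all assumptions made for illustration. The idea is only that if every sample carries a timestamp from one shared clock, the state of the cluster at a past moment is the latest sample per node and metric at or before that moment.

```python
import sqlite3

# Labels for the four state categories named in the abstract (illustrative names).
CATEGORIES = ("physical", "load", "event", "logic")


def open_store(path=":memory:"):
    """Open a central store for state samples (hypothetical schema)."""
    conn = sqlite3.connect(path)
    # ts is assumed to come from the clock shared by all monitored nodes.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS samples (
               ts REAL NOT NULL,
               node TEXT NOT NULL,
               category TEXT NOT NULL,
               metric TEXT NOT NULL,
               value TEXT NOT NULL
           )"""
    )
    return conn


def record(conn, ts, node, category, metric, value):
    """Store one state sample; values are kept as text for simplicity."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    conn.execute(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
        (ts, node, category, metric, str(value)),
    )


def snapshot(conn, at):
    """Return the latest value of each (node, metric) at or before time `at`."""
    rows = conn.execute(
        """SELECT s.node, s.metric, s.value
           FROM samples AS s
           JOIN (SELECT node, metric, MAX(ts) AS ts
                 FROM samples
                 WHERE ts <= ?
                 GROUP BY node, metric) AS latest
             ON s.node = latest.node
            AND s.metric = latest.metric
            AND s.ts = latest.ts""",
        (at,),
    )
    return {(node, metric): value for node, metric, value in rows}
```

Calling `snapshot(conn, t)` for an earlier `t` ignores every sample recorded after `t`, which is what makes a historical view possible; a real deployment would also need indexing and retention policies that this sketch leaves out.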