For large-scale or complex stochastic dynamic programming systems, the "curse of dimensionality" and "difficulty of modeling" problems can be tackled by exploiting the system's hierarchical structure or by introducing hierarchical control, with the help of Hierarchical Reinforcement Learning (HRL). HRL belongs to the class of sample-data-driven optimization methods and can effectively accelerate policy learning through spatial/temporal abstraction mechanisms. Among HRL methods, the Option approach, which decomposes a system's overall task into multiple subtasks that are learned and executed separately, has a clear hierarchical structure and is one of the representative HRL methods. Traditional Option algorithms are mainly built on discrete-time Semi-Markov Decision Processes (SMDP) and the discounted performance criterion, so they cannot be applied directly to continuous-time infinite-horizon problems. Therefore, within the continuous-time SMDP framework and its performance-potential theory, this paper combines ideas from existing Option algorithms with the learning formulas of continuous-time SMDPs to establish a unified continuous-time Option hierarchical reinforcement learning model applicable to either the average or the discounted performance criterion, and presents the corresponding online learning and optimization algorithm. Finally, a robot garbage-collection system is used as a simulation example to demonstrate the effectiveness of this HRL algorithm for continuous-time infinite-horizon optimal control problems, and to show that, compared with continuous-time simulated-annealing Q-learning, it saves storage space and achieves higher optimization accuracy and faster optimization speed.
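This excerpt does not reproduce the paper's learning formulas, so the following is only a sketch of the standard performance-potential-style SMDP Q-learning update that a unified average/discounted Option model of this kind typically builds on; the notation (discount rate \beta, sojourn time \tau, accumulated reward R_\beta, average-reward estimate \hat{\eta}, step size \gamma_k) is ours, not the paper's. When an option o, started in state s, finishes after sojourn time \tau in state s':

% A hedged sketch, not the paper's exact formula.
\[
  Q(s,o) \leftarrow Q(s,o) + \gamma_k \Big[ R_\beta - \hat{\eta}\, h_\beta(\tau)
      + e^{-\beta\tau} \max_{o'} Q(s',o') - Q(s,o) \Big],
\]
\[
  R_\beta = \int_0^{\tau} e^{-\beta t}\, r_t \,\mathrm{d}t, \qquad
  h_\beta(\tau) =
    \begin{cases}
      (1 - e^{-\beta\tau})/\beta, & \beta > 0 \ \ (\text{discounted criterion}),\\[2pt]
      \tau, & \beta = 0 \ \ (\text{average criterion}).
    \end{cases}
\]

Since h_\beta(\tau) \to \tau and e^{-\beta\tau} \to 1 as \beta \to 0, a single update rule covers both performance criteria, which is the sense in which such a model is "unified."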
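To make the online procedure concrete, here is a minimal runnable sketch of average-criterion (\beta = 0) Option Q-learning on a toy continuous-time SMDP. Everything in it (the simulate() stand-in, the state and option sets, all parameter values) is a hypothetical illustration, not the paper's robot garbage-collection system or its exact algorithm.

"""Minimal sketch of online Option Q-learning for a continuous-time SMDP
under the average-reward criterion. The environment, option set, and
parameter values are hypothetical stand-ins; the paper's robot
garbage-collection system is not specified in this excerpt."""

import random

STATES = range(4)
OPTIONS = range(2)

def simulate(state, option):
    """Toy stand-in for executing one option to completion: returns
    (next_state, accumulated_reward, sojourn_time)."""
    next_state = random.randrange(len(STATES))
    sojourn = random.uniform(0.5, 2.0)                      # tau > 0
    reward = (1.0 if option == state % 2 else 0.2) * sojourn
    return next_state, reward, sojourn

Q = {(s, o): 0.0 for s in STATES for o in OPTIONS}
eta = 0.0                      # running estimate of the average reward rate
total_r = total_t = 0.0
epsilon, lr = 0.1, 0.1

s = 0
for k in range(50_000):
    # epsilon-greedy choice among options
    if random.random() < epsilon:
        o = random.choice(OPTIONS)
    else:
        o = max(OPTIONS, key=lambda a: Q[(s, a)])
    s2, R, tau = simulate(s, o)
    # average-criterion SMDP Q-learning update (beta = 0 case of the
    # unified formula sketched above)
    target = R - eta * tau + max(Q[(s2, a)] for a in OPTIONS)
    Q[(s, o)] += lr * (target - Q[(s, o)])
    # crude running estimate of the reward rate; SMART-style variants
    # update eta only on greedy steps
    total_r += R
    total_t += tau
    eta = total_r / total_t
    s = s2

print("estimated average reward rate:", round(eta, 3))

Note the role of the sojourn time tau: because options run for random continuous durations, both the reward-rate estimate and the Q-update must weight by tau, which is exactly what distinguishes the continuous-time SMDP update from ordinary discrete-time Q-learning.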