Combining the augmented Markov decision process (AMDP), the Monte Carlo partially observable Markov decision process (MC-POMDP), and Q-learning, the AMDP-Q learning (AMDP-Q) algorithm is proposed. The main idea of the algorithm is as follows. First, a low-dimensional sufficient statistic is used to represent the original belief-state space; the maximum-likelihood state and the information entropy of the belief state are typically used as the sufficient statistic, and the space they form is called the augmented state space. Then, this space is discretized with a set of reference states, and Q-learning together with Shepard interpolation is used to obtain the transition function and reward function over continuous states. Finally, an ε-greedy policy with knowledge-based exploration is used for action selection. Experimental results show that AMDP-Q converges faster than MC-POMDP.
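The core constructions described in the abstract can be sketched in a few functions: mapping a belief to the low-dimensional sufficient statistic (maximum-likelihood state plus belief entropy), Shepard (inverse-distance-weighted) interpolation of Q-values over a set of reference states, and ε-greedy selection. This is a minimal illustrative sketch, not the paper's implementation; the function names, the Euclidean distance metric, and the interpolation exponent `power` are assumptions.

```python
import numpy as np

def augmented_state(belief):
    """Map a belief vector to the sufficient statistic assumed here:
    (maximum-likelihood state index, information entropy of the belief)."""
    ml_state = int(np.argmax(belief))
    p = belief[belief > 0]                     # avoid log(0)
    entropy = float(-np.sum(p * np.log(p)))
    return np.array([ml_state, entropy])

def shepard_q(x, ref_states, ref_q, power=2.0):
    """Shepard interpolation of Q-values at a continuous augmented state x.
    ref_states: (n_refs, d) array of discretized reference states.
    ref_q: (n_refs, n_actions) table of learned Q-values."""
    d = np.linalg.norm(ref_states - x, axis=1)
    if np.any(d < 1e-12):                      # exact hit on a reference state
        return ref_q[int(np.argmin(d))]
    w = 1.0 / d ** power                       # inverse-distance weights
    return (w @ ref_q) / w.sum()

def epsilon_greedy(q_values, epsilon, rng):
    """ε-greedy action selection over the interpolated Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

For example, the belief `[0.7, 0.2, 0.1]` maps to maximum-likelihood state 0 with entropy ≈ 0.802 nats; a query state equidistant from two reference states receives the average of their Q-value rows.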