A reinforcement learning agent solves decision problems by learning an optimal policy, that is, a mapping from states to actions. In reinforcement learning, the agent improves its own behavior through trial-and-error interaction with the environment. The Markov decision process (MDP) model is a general framework for solving reinforcement learning problems. This paper proposes a new approach that trades optimality for robustness, and presents a family of approximation algorithms together with their convergence results. By replacing the optimality operator max (or min) with a generalized averaging operator, the two most important classes of reinforcement learning algorithms, dynamic programming and Q-learning, are studied and their convergence is discussed. The aim is to improve the robustness of reinforcement learning algorithms.
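To make the idea of swapping the max operator for an averaging operator concrete, the following is a minimal sketch of value iteration on a tabular MDP. The abstract does not specify which generalized averaging operator the paper uses, so the log-mean-exp operator here is an assumption chosen for illustration, as are the names `generalized_mean`, `value_iteration`, and the parameter `beta`. This operator reduces to the arithmetic mean at `beta = 0` and approaches max (or min) as `beta` grows large and positive (or negative).

```python
import math


def generalized_mean(values, beta):
    """Log-mean-exp average of `values` (an illustrative choice, not the paper's).

    beta = 0 gives the arithmetic mean; beta -> +inf approaches max;
    beta -> -inf approaches min.
    """
    if beta == 0:
        return sum(values) / len(values)
    # Shift by the extreme value so every exponent is <= 0 (avoids overflow).
    shift = max(values) if beta > 0 else min(values)
    total = sum(math.exp(beta * (v - shift)) for v in values)
    return shift + math.log(total / len(values)) / beta


def value_iteration(P, R, gamma, beta, tol=1e-8, max_iter=10_000):
    """Value iteration with max replaced by `generalized_mean`.

    P[s][a] is a list of (probability, next_state) pairs.
    R[s][a] is the immediate reward for taking action a in state s.
    """
    V = [0.0] * len(P)
    for _ in range(max_iter):
        new_V = []
        for s in range(len(P)):
            # One-step lookahead value of each action in state s.
            q = [
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            ]
            new_V.append(generalized_mean(q, beta))
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V
    return V


# Hypothetical two-state, two-action MDP: action 0 moves to state 0,
# action 1 moves to state 1, both deterministically.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 0)], [(1.0, 1)]]]
R = [[0.0, 1.0], [0.0, 2.0]]

V_soft = value_iteration(P, R, gamma=0.9, beta=1.0)
V_sharp = value_iteration(P, R, gamma=0.9, beta=200.0)
print(V_soft, V_sharp)
```

A small `beta` averages over actions and so gives lower values than the optimal ones, while a large `beta` approaches the ordinary max-based solution. This mirrors the trade-off between robustness and optimality that the abstract describes, under the stated assumption about the operator.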