The task of learning the value function under a fixed policy in continuous Markov decision processes (MDPs) is considered. Although ELM-based approximators can achieve better performance on learning prediction problems than traditional methods (such as LSTD with radial basis functions) for high-dimensional problems, the random initialization of ELM parameters leads to fluctuating performance. To overcome this problem, a least-squares temporal difference algorithm with eligibility traces based on the regularized extreme learning machine (RELM-LSTD(λ)) is proposed. The regularized extreme learning machine (RELM) is used to approximate value functions, and an eligibility-trace term is introduced to increase data efficiency. In experiments, the performance of the proposed algorithm is demonstrated and compared with that of LSTD and ELM-LSTD. Experimental results show that the proposed algorithm achieves more stable and better performance in approximating the value function under a fixed policy.
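The combination described above can be sketched in a few lines: LSTD(λ) statistics are accumulated over the features of a random (ELM-style) hidden layer, and the final solve is ridge-regularized, which is the stabilizing role RELM plays here. This is a minimal illustrative sketch, not the paper's implementation; the function name `relm_lstd_lambda`, the `tanh` hidden activation, and all parameter defaults are assumptions for illustration.

```python
import numpy as np

def relm_lstd_lambda(transitions, n_hidden=50, gamma=0.95, lam=0.8,
                     reg=1e-3, seed=0):
    """Sketch of LSTD(lambda) policy evaluation with a random
    ELM-style feature map and a ridge-regularized solve.

    transitions: list of (s, r, s_next, done), s and s_next 1-D arrays.
    Returns a callable approximating the value function V(s).
    """
    rng = np.random.default_rng(seed)
    dim = transitions[0][0].shape[0]
    # ELM hidden layer: random input weights, fixed after initialization
    W = rng.standard_normal((n_hidden, dim))
    b = rng.standard_normal(n_hidden)
    phi = lambda s: np.tanh(W @ s + b)

    A = np.zeros((n_hidden, n_hidden))
    c = np.zeros(n_hidden)
    z = np.zeros(n_hidden)                      # eligibility trace
    for s, r, s_next, done in transitions:
        f = phi(s)
        f_next = np.zeros(n_hidden) if done else phi(s_next)
        z = gamma * lam * z + f                 # decay and accumulate trace
        A += np.outer(z, f - gamma * f_next)    # LSTD(lambda) statistics
        c += z * r
        if done:
            z = np.zeros(n_hidden)              # reset trace at episode end
    # Ridge (Tikhonov) term: the "regularized" part of RELM, which damps
    # the sensitivity to the random hidden-layer initialization
    theta = np.linalg.solve(A + reg * np.eye(n_hidden), c)
    return lambda s: phi(s) @ theta
```

Without the `reg * np.eye(n_hidden)` term this reduces to plain ELM-LSTD(λ), whose solution can vary sharply with the random draw of `W` and `b`; the abstract's stability claim corresponds to adding that regularizer.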