LIDAR: learning from imperfect demonstrations with advantage rectification

Source: Frontiers of Computer Science

Authors: Xiaoqin ZHANG, Huimin MA, Xiong LUO, Jian YUAN

Affiliations: Department of EE, Tsinghua University, Beijing 100084, China; School of Computer and Communication Engin

Journal: Frontiers of Computer Science

Publication date: 2022, Issue 1

Keywords: learning from demonstrations, actor-critic reinforcement learning, advantage rect


Excerpt from the paper

In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration at the beginning of training and to lead to overestimated value estimates and suboptimal policies. In this paper, we address the problem by executing advantage rectification with imperfect demonstrations, thus reducing the function estimation errors. Pretraining with expert demonstrations has been widely adopted to accelerate the learning process of deep reinforcement learning when simulations are expensive to obtain. However, existing methods, such as behavior cloning, often assume that the demonstrations carry additional information or performance labels, such as an optimality assumption, which is usually incorrect and of little use in the real world. In this paper, we explicitly handle imperfect demonstrations within actor-critic RL frameworks and propose a new method called learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selected demonstrations; the loss is derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from contradictions caused by estimation errors and, in turn, reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3, and SAC, and experiments show that our method noticeably reduces function estimation errors, effectively leverages demonstrations that are far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
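
The excerpt does not give the rectified loss itself, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes a hinge-style penalty: if the demonstrator is presumed to be at least as good as the current policy, the estimated advantage Q(s, a_demo) - Q(s, pi(s)) should be non-negative, and only samples that contradict this contribute to the loss. The function name `rectified_advantage_loss` and the `margin` parameter are hypothetical.

```python
def rectified_advantage_loss(q_demo, q_policy, margin=0.0):
    """Hinge-style penalty over demonstration samples (illustrative assumption).

    q_demo:   critic estimates Q(s, a_demo) for the demonstrated actions
    q_policy: critic estimates Q(s, pi(s)) for the current policy's actions
    margin:   hypothetical slack; 0.0 means "demo should be at least as good"

    Returns the mean penalty and a mask of the samples that were selected,
    i.e. those whose estimated advantage contradicts the assumption.
    """
    if len(q_demo) != len(q_policy) or not q_demo:
        raise ValueError("inputs must be non-empty and of equal length")

    # Estimated advantage of the demonstrated action over the policy's action.
    advantages = [qd - qp for qd, qp in zip(q_demo, q_policy)]

    # Rectification: only advantages below the margin produce a penalty.
    violations = [max(0.0, margin - adv) for adv in advantages]
    selected = [v > 0.0 for v in violations]

    loss = sum(violations) / len(violations)
    return loss, selected


# Toy critic outputs for four demonstration samples.
q_demo = [1.0, 0.5, 2.0, -1.0]
q_policy = [0.5, 1.5, 2.0, 0.0]

loss, selected = rectified_advantage_loss(q_demo, q_policy)
print(loss)      # 0.5
print(selected)  # [False, True, False, True]
```

In this toy example, only the second and fourth samples are used: they are the ones where the critic rates the policy's own action above the demonstrated one, which under the stated assumption signals an estimation error worth correcting.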
