Subsampling bias and the best-discrepancy systematic cross validation

Source: Science China Mathematics (English edition)
Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo cross-validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors and jeopardizes the reliability and effectiveness of cross-validation. Based on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure that replaces the pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and leads to more precise expected-prediction-error (EPE) estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree and naive Bayes) show that, in general, our cross-validation procedure can reduce the subsampling bias of the MCCV, lowering the EPE by around 7.18% and the variances by around 26.73%. In comparison, the stratified MCCV reduces the EPE and variances of the MCCV by around 1.58% and 11.85%, respectively. Leave-one-out (LOO) cross-validation can lower the EPE by around 2.50%, but its variances are much higher than those of any other cross-validation (CV) procedure. The computational time of our cross-validation procedure is just 8.64% of that of the MCCV, 8.67% of that of the stratified MCCV and 16.72% of that of the LOO. Experiments also show that our approach is more beneficial for datasets characterized by relatively small size and large aspect ratio. This makes our approach particularly pertinent when solving bioscience classification problems. Our proposed systematic subsampling technique could be generalized to other machine learning algorithms that involve a random subsampling mechanism.
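The idea of partitioning with a deterministic low-discrepancy sequence instead of a pseudo-random one can be illustrated with a minimal sketch. The abstract does not specify how the best-discrepancy sequence is constructed, so the code below assumes a Kronecker-type sequence, the fractional parts of i·α with α the golden-ratio conjugate, and assigns every k-th instance in the resulting order to the same fold. The function names and the choice of α are hypothetical and are not taken from the paper.

```python
import math

# Assumption: golden-ratio conjugate as the irrational generator.
# The paper's actual best-discrepancy sequence may differ.
GOLDEN_CONJUGATE = (math.sqrt(5) - 1) / 2


def low_discrepancy_order(n, alpha=GOLDEN_CONJUGATE):
    """Order indices 0..n-1 by the fractional parts of (i + 1) * alpha."""
    keys = [((i + 1) * alpha) % 1.0 for i in range(n)]
    return sorted(range(n), key=keys.__getitem__)


def systematic_kfold(n, k, alpha=GOLDEN_CONJUGATE):
    """Split n instances into k folds by systematic sampling.

    Instances are first ordered by the low-discrepancy keys, then
    every k-th instance in that order goes to the same fold.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    order = low_discrepancy_order(n, alpha)
    return [order[j::k] for j in range(k)]


def train_test_splits(n, k, alpha=GOLDEN_CONJUGATE):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = systematic_kfold(n, k, alpha)
    for j, test in enumerate(folds):
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield sorted(train), sorted(test)
```

Because no random number generator is involved, repeated calls with the same n and k return the same partition. This sketch says nothing about the EPE, variance or runtime figures reported above.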