CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs

Source: Frontiers of Information Technology & Electronic Engineering (English edition)
As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator widely used in most recent supercomputers. It uses a large number of threads to hide long latencies with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor (SM). The GPU cache is inefficient because of a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency because of their poor warp scheduling methods. Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to improve cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line, and employ a prioritised cache allocation unit (PCAU) that combines the data reuse information with time-stamp information to evict the lines with the lowest reuse possibility. Moreover, the warp scheduler uses the locality information to build an intelligent warp reordering scheme that captures locality and hides latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
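The eviction idea described in the abstract (PC-indexed reuse information combined with time stamps) can be illustrated with a toy model. The sketch below is not the paper's design: the class names `Line` and `ToyCacheSet`, the 2-bit saturating counter, and the tie-breaking rule are all assumptions made for illustration. It models a single cache set in which each line remembers the PC of the load that allocated it, hits raise that PC's reuse counter, and the victim is the line with the lowest predicted reuse, with the oldest time stamp breaking ties.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    tag: int
    pc: int        # PC of the load that allocated this line (assumed bookkeeping)
    last_use: int  # time stamp of the most recent access


@dataclass
class ToyCacheSet:
    """One cache set; a hypothetical stand-in, not the paper's PCAU."""
    ways: int
    reuse: dict = field(default_factory=dict)  # PC -> saturating reuse counter
    lines: list = field(default_factory=list)
    clock: int = 0

    def access(self, tag: int, pc: int) -> bool:
        """Return True on a hit, False on a miss (the line is then allocated)."""
        self.clock += 1
        for line in self.lines:
            if line.tag == tag:
                # Hit: credit the PC that brought the line in (saturates at 3).
                self.reuse[line.pc] = min(self.reuse.get(line.pc, 0) + 1, 3)
                line.last_use = self.clock
                return True
        if len(self.lines) == self.ways:
            # Victim: lowest predicted reuse, ties broken by oldest time stamp.
            victim = min(
                self.lines,
                key=lambda l: (self.reuse.get(l.pc, 0), l.last_use),
            )
            self.lines.remove(victim)
        self.lines.append(Line(tag, pc, self.clock))
        return False


# A reused line (PC 0x10) survives a streaming access pattern (PC 0x20).
s = ToyCacheSet(ways=2)
s.access(1, 0x10)              # miss, allocate tag 1
s.access(1, 0x10)              # hit, PC 0x10 earns reuse credit
for t in (2, 3, 4):
    s.access(t, 0x20)          # streaming misses evict each other
survived = s.access(1, 0x10)   # tag 1 is still resident
```

In this toy run, plain LRU would have evicted tag 1 as soon as tag 3 arrived; keeping the line whose allocating PC has shown reuse is the kind of locality protection the abstract describes, under the assumptions stated above.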