Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers

Source: Journal of Computer Science & Technology
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and much time is spent on the contention that arises at lock hand-off. To be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller that keeps the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent in all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. The mechanism requires neither compiler nor programmer support, and no changes to the ISA or the coherence protocol. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in the remaining four applications, which have a large amount of synchronization activity, we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, and observe a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
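The ordering idea behind request bypassing can be illustrated with a toy model. The abstract does not describe the hardware implementation, so everything in the sketch below is an assumption made for illustration: the function names `service_order` and `winner_delay` are hypothetical, the controller is assumed to service one buffered request per time step, and all requests are assumed to be buffered before servicing starts.

```python
def service_order(arrivals, winner=None):
    """Order in which a toy controller services buffered requests to one line.

    arrivals: processor ids in arrival order (hypothetical input format).
    winner:   id of the processor that won the lock; None models plain FIFO
              buffering, otherwise the winner's requests bypass the others.
    """
    order = []
    for proc in arrivals:
        if winner is not None and proc == winner:
            # Bypass: place the request ahead of all non-winner requests,
            # but behind the winner's own earlier requests.
            pos = 0
            while pos < len(order) and order[pos] == winner:
                pos += 1
            order.insert(pos, proc)
        else:
            # Ordinary buffering: append in arrival order.
            order.append(proc)
    return order


def winner_delay(order, winner):
    """Time steps until the winner's last request has been serviced,
    assuming one request is serviced per step (an assumption of this model)."""
    last = max(i for i, proc in enumerate(order) if proc == winner)
    return last + 1


# Three losing processors (1, 2, 3) arrive first, then the winner (0) issues
# two requests to the same line, e.g. one inside the critical section and
# one to release the lock.
arrivals = [1, 2, 3, 0, 0]
fifo = service_order(arrivals)
bypass = service_order(arrivals, winner=0)
print(fifo, winner_delay(fifo, 0))      # [1, 2, 3, 0, 0] 5
print(bypass, winner_delay(bypass, 0))  # [0, 0, 1, 2, 3] 2
```

In this model the winner's requests complete after 2 steps instead of 5. The figures come only from the toy assumptions above and say nothing about the simulated results reported in the paper.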