论文部分内容阅读
To have good performance and scalability, parallel applications should be sophisticatedly optimized to exploit intra-node parallelism and reduce inter-node communication on multicore clusters. This paper investigates the automatic tuning of the sparse matrix-vector(Sp MV) multiplication kernel implemented in a partitioned global address space language, which supports a hybrid thread- and process-based communication layer for multicore systems. One-sided communication is used for inter-node data exchange, while intra-node communication uses a mix of process shared memory and multithreading. We develop performance models to facilitate selecting the best configuration of threads and processes hybridization as well as the best communication pattern for Sp MV. As a result, our tuned Sp MV in the hybrid runtime environment consumes less memory and reduces inter-node communication volume, without damaging the data locality. Experiments are conducted on 12 real sparse matrices. On 16-node Xeon and 8-node Opteron clusters, our tuned Sp MV kernel gets on average 1.4X and 1.5X improvement in performance over the well-optimized process-based message-passing implementation, respectively.
To have good performance and scalability, parallel applications should be sophisticatedly optimized to exploit intra-node parallelism and reduce inter-node communication on multicore clusters. This paper investigates the automatic tuning of the sparse matrix-vector (Sp MV) multiplication kernel implemented in a partitioned global address space language, which supports a hybrid thread- and process-based communication layer for multicore systems. One-sided communication is used for inter-node data exchange, while intra-node communication uses a mix of process shared memory and multithreading. We a develop performance models to facilitate selecting the best configuration of threads and processes hybridization as well as the best communication pattern for Sp MV. As a result, our tuned Sp MV in the hybrid runtime environment consumes less memory and reduces inter-node communication volume, without damaging the data locality. Experiments are conducted on 12 real sparse matrices. On 16-node Xeon and 8-node Opteron clusters, our tuned Sp MV kernel gets on average 1.4X and 1.5X improvement in performance over the well-optimized process-based message-passing implementation, respectively.