The BLAS library provides two classes of routines: complex and real. Matrix multiplication is the core routine of BLAS, and many other BLAS routines call it in their implementations. Based on the characteristics of the Loongson 3A (Godson 3A) architecture and an analysis of the matrix multiplication computation, this paper first partitions the matrices into blocks and then divides the work among threads, which reduces the amount of data copied and improves the reuse of the copied data. The computation in each worker thread is further optimized with loop unrolling, instruction scheduling, and data blocking. In multithreaded operation, the optimized ZGEMM routine runs twice as fast as the one in the ATLAS library.
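The idea of blocking the matrices first and then dividing the blocks among threads can be illustrated with a minimal C sketch. This is not the paper's implementation. The function name `zgemm_blocked_worker`, the block size `BLOCK`, the row-major square matrices, and the round-robin assignment of row blocks to threads are all assumptions made for illustration, and the loop unrolling and instruction scheduling that the paper applies on Loongson 3A are left out.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical block edge; a tuned kernel would size this to the cache. */
#define BLOCK 4

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Work done by one thread: C += A * B, restricted to the row blocks of C
 * that this thread owns. All matrices are n x n, row-major (an assumption;
 * reference BLAS uses column-major storage). Row blocks are handed out
 * round-robin, so threads write to disjoint rows of C and need no locking.
 */
void zgemm_blocked_worker(size_t n,
                          const double complex *A,
                          const double complex *B,
                          double complex *C,
                          size_t tid, size_t nthreads)
{
    assert(nthreads > 0 && tid < nthreads);
    size_t nblocks = (n + BLOCK - 1) / BLOCK;

    for (size_t bi = tid; bi < nblocks; bi += nthreads) {
        size_t i0 = bi * BLOCK, i1 = min_sz(i0 + BLOCK, n);
        for (size_t k0 = 0; k0 < n; k0 += BLOCK) {
            size_t k1 = min_sz(k0 + BLOCK, n);
            for (size_t j0 = 0; j0 < n; j0 += BLOCK) {
                size_t j1 = min_sz(j0 + BLOCK, n);
                /* Multiply one BLOCK x BLOCK tile of A by one tile of B. */
                for (size_t i = i0; i < i1; ++i) {
                    for (size_t k = k0; k < k1; ++k) {
                        double complex a = A[i * n + k];
                        for (size_t j = j0; j < j1; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Each thread would call `zgemm_blocked_worker` with its own `tid`; calling it sequentially for `tid = 0 .. nthreads - 1` produces the same result, which makes the sketch easy to check against a naive triple loop.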