Merged from https://github.com/CNugteren/myGEMM . The accompanying tutorial is at https://cnugteren.github.io/tutorial/pages/page1.html
-
Compile the code:
make build
Compiles the benchmarking infrastructure and the myGEMM kernels. Make sure the "bin" and "obj" directories exist. Note that you might have to edit the Makefile to set the proper locations of the CUDA and OpenCL installations on your system.
-
Run the code:
make run
This runs the code for matrices ranging from MINSIZE to MAXSIZE (defined in src/common.h). It will run cuBLAS, clBLAS, and the CUDA and OpenCL versions of the myGEMM kernels. The particular kernel to be executed is selected with the KERNEL define in src/settings.h. This file also contains other settings you might want to modify for your particular GPU.
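For example, the size range and the kernel selection are plain compile-time defines (the names MINSIZE, MAXSIZE, and KERNEL come from the files mentioned above; the values here are only an illustration):

    /* src/common.h -- size range of the benchmarked matrices (illustrative values) */
    #define MINSIZE (512)
    #define MAXSIZE (4096)

    /* src/settings.h -- which myGEMM kernel to benchmark (illustrative value: kernel 3) */
    #define KERNEL 3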
-
Inspect the code:
make inspect
This generates all kinds of assembly-like versions of the CUDA kernels in the "bin" subdirectory. It also prints out statistics of the kernels such as the register usage.
-
myGEMM1:
Plain (naive) matrix multiplication: every work-item computes one element of C directly from global memory.
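A minimal sketch of such a kernel in OpenCL, following the row-major indexing used in the snippets below (the kernel name and exact signature are illustrative; the repository's kernel may differ):

    // Each work-item computes one element of C = A * B, with A of size M x K,
    // B of size K x N, and all matrices stored row-major.
    __kernel void myGEMM1_sketch(const int M, const int N, const int K,
                                 const __global float* A,
                                 const __global float* B,
                                 __global float* C) {
        const int globalRow = get_global_id(0);   // row of C (0..M-1)
        const int globalCol = get_global_id(1);   // column of C (0..N-1)

        // Dot product of one row of A with one column of B, read straight from global memory.
        float acc = 0.0f;
        for (int k = 0; k < K; k++) {
            acc += A[globalRow * K + k] * B[k * N + globalCol];
        }
        C[globalRow * N + globalCol] = acc;
    }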
-
myGEMM2:
The data of matrices A and B lives in global memory. The tile each work-group needs is first copied into local memory, so that the computation reads directly from local memory. Two local 2D arrays, Asub and Bsub, of size TS x TS are declared; this is the amount of local memory each work-group uses.
    const int numTiles = K / TS;
    for (int t = 0; t < numTiles; t++) {
        // Load one tile of A and B into local memory.
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        // Copy the data from global memory into local memory.
        Asub[row][col] = A[globalRow * K + tiledCol];
        Bsub[row][col] = B[tiledRow * N + globalCol];

        // Wait until all work-items in the group reach this point;
        // Asub and Bsub are now fully populated.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Do the computation; the 2*TS values per work-item are now read
        // from local memory, which is much faster than global memory.
        for (int k = 0; k < TS; k++) {
            acc += Asub[row][k] * Bsub[k][col];
        }

        // Wait until all work-items in the group are done with this tile.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
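The loop above relies on setup that is not shown: the local/global indices, the local tiles, and the accumulator. A hedged sketch of that surrounding context (the variable names match the snippet; the exact organisation in the repository is an assumption):

    // Work-group size: TS x TS; each work-item produces one element of C.
    const int row = get_local_id(0);                   // local row within the tile (0..TS-1)
    const int col = get_local_id(1);                   // local column within the tile (0..TS-1)
    const int globalRow = TS * get_group_id(0) + row;  // row of C this work-item computes
    const int globalCol = TS * get_group_id(1) + col;  // column of C this work-item computes

    // One TS x TS tile of A and one of B in local memory, shared by the work-group.
    __local float Asub[TS][TS];
    __local float Bsub[TS][TS];

    float acc = 0.0f;                                  // accumulator for C[globalRow][globalCol]

    // ... tile loop from the snippet above ...

    // After the last tile, write the result back to global memory.
    C[globalRow * N + globalCol] = acc;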
-
myGEMM3:
In myGEMM2, each work-item loads a value and uses it for only one multiply-accumulate, even though different output elements reuse the same row or column of data. The code is therefore changed so that one work-item computes WPT output elements, which reduces the number of loads.
    const int numTiles = K / TS;
    for (int t = 0; t < numTiles; t++) {
        // Load one tile of A and B into local memory.
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        for (int w = 0; w < WPT; w++) {
            Asub[row + w * RTS][col] = A[(globalRow + w * RTS) * K + tiledCol];
            Bsub[row + w * RTS][col] = B[(tiledRow + w * RTS) * N + globalCol];
        }

        // Synchronise to make sure the tile is loaded.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Perform the computation for a single tile.
        for (int k = 0; k < TS; k++) {
            for (int w = 0; w < WPT; w++) {
                // In the inner loop, Bsub[k][col] stays in a register,
                // so it does not need to be re-read from local memory.
                acc[w] += Asub[row + w * RTS][k] * Bsub[k][col];
            }
        }

        // Synchronise before loading the next tile.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
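Here RTS = TS/WPT is the reduced work-group size along the row dimension. A hedged sketch of the setup and write-back that this loop assumes (names match the snippet above; the details are an assumption):

    // Each work-group still produces a TS x TS tile of C, but with only
    // RTS x TS work-items, where RTS = TS / WPT: every work-item computes
    // WPT elements of C, spaced RTS rows apart.
    const int row = get_local_id(0);                   // 0..RTS-1
    const int col = get_local_id(1);                   // 0..TS-1
    const int globalRow = TS * get_group_id(0) + row;
    const int globalCol = TS * get_group_id(1) + col;

    __local float Asub[TS][TS];
    __local float Bsub[TS][TS];

    // One accumulator per output element handled by this work-item.
    float acc[WPT];
    for (int w = 0; w < WPT; w++) {
        acc[w] = 0.0f;
    }

    // ... tile loop from the snippet above ...

    // Write back the WPT results.
    for (int w = 0; w < WPT; w++) {
        C[(globalRow + w * RTS) * N + globalCol] = acc[w];
    }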
-
myGEMM4:
myGEMM2 rewritten to use vectorized (wide) loads and computation.
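A hedged sketch of the idea, fixing the vector width to OpenCL's float4 (the repository's kernel uses a configurable width and may organise the loads differently): each work-item now produces four adjacent columns of C, and B and C are read and written as float4 values.

    // myGEMM2 with float4 vectorization along the column direction.
    // Assumes N and K are multiples of WIDTH (for indexing and 16-byte alignment).
    // TS is the tile size from the settings; the local work size must be TS x (TS/WIDTH).
    #define WIDTH 4
    __kernel void myGEMM4_sketch(const int M, const int N, const int K,
                                 const __global float* A,
                                 const __global float* B,
                                 __global float* C) {
        const int row = get_local_id(0);                         // 0..TS-1
        const int col = get_local_id(1);                         // 0..TS/WIDTH-1, in float4 units
        const int globalRow = TS * get_group_id(0) + row;        // row of C
        const int globalCol = (TS / WIDTH) * get_group_id(1) + col; // float4 column block of C

        const __global float4* A4 = (const __global float4*)A;
        const __global float4* B4 = (const __global float4*)B;
        __global float4* C4 = (__global float4*)C;

        __local float  Asub[TS][TS];           // A tile stored as scalars
        __local float4 Bsub[TS][TS / WIDTH];   // B tile stored as float4

        float4 acc = (float4)(0.0f);           // four adjacent columns of C

        const int numTiles = K / TS;
        for (int t = 0; t < numTiles; t++) {
            // Vectorized loads: one float4 of A and one float4 of B per work-item.
            const int tiledRow = TS * t + row;
            const int tiledCol = (TS / WIDTH) * t + col;
            const float4 vecA = A4[globalRow * (K / WIDTH) + tiledCol];
            Asub[row][WIDTH * col + 0] = vecA.x;
            Asub[row][WIDTH * col + 1] = vecA.y;
            Asub[row][WIDTH * col + 2] = vecA.z;
            Asub[row][WIDTH * col + 3] = vecA.w;
            Bsub[row][col] = B4[tiledRow * (N / WIDTH) + globalCol];
            barrier(CLK_LOCAL_MEM_FENCE);

            // Scalar-times-vector multiply-accumulate: four columns of C at once.
            for (int k = 0; k < TS; k++) {
                acc += Asub[row][k] * Bsub[k][col];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C4[globalRow * (N / WIDTH) + globalCol] = acc;
    }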
-
myGEMM5:
The computation is done with matrix B transposed beforehand.
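The transposition is done by a separate kernel before the GEMM itself. A minimal sketch of such a kernel, assuming B is stored row-major as K x N (the kernel name and signature are illustrative):

    // Writes BT = B^T: B is K x N, BT is N x K, both row-major.
    // A tiled version that stages through local memory would also make the
    // writes coalesced; this only shows the basic idea.
    __kernel void transpose_sketch(const int K, const int N,
                                   const __global float* B,
                                   __global float* BT) {
        const int k = get_global_id(0);   // 0..K-1
        const int n = get_global_id(1);   // 0..N-1
        BT[n * K + k] = B[k * N + n];
    }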