Merged from https://github.com/CNugteren/myGEMM . The accompanying tutorial is at https://cnugteren.github.io/tutorial/pages/page1.html
-
Compile the code:
make build
Compiles the benchmarking infrastructure and the myGEMM kernels. Make sure the "bin" and "obj" directories exist. Note that you might have to edit the Makefile to set the proper locations of the CUDA and OpenCL installations on your system.
-
Run the code:
make run
This runs the code for matrices ranging from MINSIZE to MAXSIZE (defined in src/common.h). It will run cuBLAS, clBLAS, and the CUDA and OpenCL versions of the myGEMM kernels. The particular kernel to be executed is selected with the KERNEL define in src/settings.h. This file also contains other settings you might want to modify for your particular GPU.
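For example, the size range and the kernel selection are plain compile-time defines (the names MINSIZE, MAXSIZE, and KERNEL come from the files mentioned above; the values here are only an illustration):

    /* src/common.h -- size range of the benchmarked matrices (illustrative values) */
    #define MINSIZE (512)
    #define MAXSIZE (4096)

    /* src/settings.h -- which myGEMM kernel to benchmark (illustrative value: kernel 3) */
    #define KERNEL 3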
-
Inspect the code:
make inspect
This generates all kinds of assembly-like versions of the CUDA kernels in the "bin" subdirectory. It also prints out statistics of the kernels such as the register usage.
-
myGEMM1:
Plain (naive) matrix multiplication: every work-item computes one element of C directly from global memory.
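A minimal sketch of such a kernel in OpenCL, following the row-major indexing used in the snippets below (the kernel name and exact signature are illustrative; the repository's kernel may differ):

    // Each work-item computes one element of C = A * B, with A of size M x K,
    // B of size K x N, and all matrices stored row-major.
    __kernel void myGEMM1_sketch(const int M, const int N, const int K,
                                 const __global float* A,
                                 const __global float* B,
                                 __global float* C) {
        const int globalRow = get_global_id(0);   // row of C (0..M-1)
        const int globalCol = get_global_id(1);   // column of C (0..N-1)

        // Dot product of one row of A with one column of B, read straight from global memory.
        float acc = 0.0f;
        for (int k = 0; k < K; k++) {
            acc += A[globalRow * K + k] * B[k * N + globalCol];
        }
        C[globalRow * N + globalCol] = acc;
    }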
-
myGEMM2:
The data of matrices A and B lives in global memory. The tile each work-group needs is first copied into local memory, so that the computation reads directly from local memory. Two local 2D arrays, Asub and Bsub, of size TS x TS are declared; this is the amount of local memory each work-group uses.
    const int numTiles = K / TS;
    for (int t = 0; t < numTiles; t++) {
        // Load one tile of A and B into local memory.
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        // Copy the data from global memory into local memory.
        Asub[row][col] = A[globalRow * K + tiledCol];
        Bsub[row][col] = B[tiledRow * N + globalCol];

        // Wait until all work-items in the group reach this point;
        // Asub and Bsub are now fully populated.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Do the computation; the 2*TS values per work-item are now read
        // from local memory, which is much faster than global memory.
        for (int k = 0; k < TS; k++) {
            acc += Asub[row][k] * Bsub[k][col];
        }

        // Wait until all work-items in the group are done with this tile.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
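The loop above relies on setup that is not shown: the local/global indices, the local tiles, and the accumulator. A hedged sketch of that surrounding context (the variable names match the snippet; the exact organisation in the repository is an assumption):

    // Work-group size: TS x TS; each work-item produces one element of C.
    const int row = get_local_id(0);                   // local row within the tile (0..TS-1)
    const int col = get_local_id(1);                   // local column within the tile (0..TS-1)
    const int globalRow = TS * get_group_id(0) + row;  // row of C this work-item computes
    const int globalCol = TS * get_group_id(1) + col;  // column of C this work-item computes

    // One TS x TS tile of A and one of B in local memory, shared by the work-group.
    __local float Asub[TS][TS];
    __local float Bsub[TS][TS];

    float acc = 0.0f;                                  // accumulator for C[globalRow][globalCol]

    // ... tile loop from the snippet above ...

    // After the last tile, write the result back to global memory.
    C[globalRow * N + globalCol] = acc;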
-
myGEMM3:
In myGEMM2, each work-item loads a value and uses it for only one multiply-accumulate, even though different output elements reuse the same row or column of data. The code is therefore changed so that one work-item computes WPT output elements, which reduces the number of loads.
    const int numTiles = K / TS;
    for (int t = 0; t < numTiles; t++) {
        // Load one tile of A and B into local memory.
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        for (int w = 0; w < WPT; w++) {
            Asub[row + w * RTS][col] = A[(globalRow + w * RTS) * K + tiledCol];
            Bsub[row + w * RTS][col] = B[(tiledRow + w * RTS) * N + globalCol];
        }

        // Synchronise to make sure the tile is loaded.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Perform the computation for a single tile.
        for (int k = 0; k < TS; k++) {
            for (int w = 0; w < WPT; w++) {
                // In the inner loop, Bsub[k][col] stays in a register,
                // so it does not need to be re-read from local memory.
                acc[w] += Asub[row + w * RTS][k] * Bsub[k][col];
            }
        }

        // Synchronise before loading the next tile.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
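Here RTS = TS/WPT is the reduced work-group size along the row dimension. A hedged sketch of the setup and write-back that this loop assumes (names match the snippet above; the details are an assumption):

    // Each work-group still produces a TS x TS tile of C, but with only
    // RTS x TS work-items, where RTS = TS / WPT: every work-item computes
    // WPT elements of C, spaced RTS rows apart.
    const int row = get_local_id(0);                   // 0..RTS-1
    const int col = get_local_id(1);                   // 0..TS-1
    const int globalRow = TS * get_group_id(0) + row;
    const int globalCol = TS * get_group_id(1) + col;

    __local float Asub[TS][TS];
    __local float Bsub[TS][TS];

    // One accumulator per output element handled by this work-item.
    float acc[WPT];
    for (int w = 0; w < WPT; w++) {
        acc[w] = 0.0f;
    }

    // ... tile loop from the snippet above ...

    // Write back the WPT results.
    for (int w = 0; w < WPT; w++) {
        C[(globalRow + w * RTS) * N + globalCol] = acc[w];
    }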
-
myGEMM4:
myGEMM2 rewritten to use vectorized (wide) loads and computation.
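A hedged sketch of the idea, fixing the vector width to OpenCL's float4 (the repository's kernel uses a configurable width and may organise the loads differently): each work-item now produces four adjacent columns of C, and B and C are read and written as float4 values.

    // myGEMM2 with float4 vectorization along the column direction.
    // Assumes N and K are multiples of WIDTH (for indexing and 16-byte alignment).
    // TS is the tile size from the settings; the local work size must be TS x (TS/WIDTH).
    #define WIDTH 4
    __kernel void myGEMM4_sketch(const int M, const int N, const int K,
                                 const __global float* A,
                                 const __global float* B,
                                 __global float* C) {
        const int row = get_local_id(0);                         // 0..TS-1
        const int col = get_local_id(1);                         // 0..TS/WIDTH-1, in float4 units
        const int globalRow = TS * get_group_id(0) + row;        // row of C
        const int globalCol = (TS / WIDTH) * get_group_id(1) + col; // float4 column block of C

        const __global float4* A4 = (const __global float4*)A;
        const __global float4* B4 = (const __global float4*)B;
        __global float4* C4 = (__global float4*)C;

        __local float  Asub[TS][TS];           // A tile stored as scalars
        __local float4 Bsub[TS][TS / WIDTH];   // B tile stored as float4

        float4 acc = (float4)(0.0f);           // four adjacent columns of C

        const int numTiles = K / TS;
        for (int t = 0; t < numTiles; t++) {
            // Vectorized loads: one float4 of A and one float4 of B per work-item.
            const int tiledRow = TS * t + row;
            const int tiledCol = (TS / WIDTH) * t + col;
            const float4 vecA = A4[globalRow * (K / WIDTH) + tiledCol];
            Asub[row][WIDTH * col + 0] = vecA.x;
            Asub[row][WIDTH * col + 1] = vecA.y;
            Asub[row][WIDTH * col + 2] = vecA.z;
            Asub[row][WIDTH * col + 3] = vecA.w;
            Bsub[row][col] = B4[tiledRow * (N / WIDTH) + globalCol];
            barrier(CLK_LOCAL_MEM_FENCE);

            // Scalar-times-vector multiply-accumulate: four columns of C at once.
            for (int k = 0; k < TS; k++) {
                acc += Asub[row][k] * Bsub[k][col];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C4[globalRow * (N / WIDTH) + globalCol] = acc;
    }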
-
myGEMM5:
The computation is done with matrix B transposed beforehand.
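The transposition is done by a separate kernel before the GEMM itself. A minimal sketch of such a kernel, assuming B is stored row-major as K x N (the kernel name and signature are illustrative):

    // Writes BT = B^T: B is K x N, BT is N x K, both row-major.
    // A tiled version that stages through local memory would also make the
    // writes coalesced; this only shows the basic idea.
    __kernel void transpose_sketch(const int K, const int N,
                                   const __global float* B,
                                   __global float* BT) {
        const int k = get_global_id(0);   // 0..K-1
        const int n = get_global_id(1);   // 0..N-1
        BT[n * K + k] = B[k * N + n];
    }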