Exploring the performance of SGEMM in OpenCL on NVIDIA GPUs

Merged from https://github.com/CNugteren/myGEMM. The accompanying tutorial is at https://cnugteren.github.io/tutorial/pages/page1.html.

Usage

  • Compile the code:

    make build
    

    Compiles the benchmarking infrastructure and the myGEMM kernels. Make sure the "bin" and "obj" directories exist. Note that you might have to edit the Makefile to set the proper locations of the CUDA and OpenCL installations on your system.

  • Run the code:

    make run
    

    This runs the code for matrices ranging from MINSIZE to MAXSIZE (defined in src/common.h). It will run cuBLAS, clBLAS, and the CUDA and OpenCL versions of the myGEMM kernels. The particular kernel to be executed is selected with the KERNEL keyword in src/settings.h. This file also contains other settings you might want to modify for your particular GPU (see the example after this list).

  • Inspect the code:

    make inspect
    

    This generates all kinds of assembly-like versions of the CUDA kernels in the "bin" subdirectory. It also prints out statistics of the kernels such as the register usage.
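
As an example, the size range and kernel selection mentioned under "Run the code" might look like this (hypothetical values; check src/common.h and src/settings.h for the actual names and defaults):

    // src/common.h (hypothetical excerpt): matrix sizes swept by `make run`
    #define MINSIZE (1024)
    #define MAXSIZE (4096)

    // src/settings.h (hypothetical excerpt): which myGEMM kernel to benchmark
    #define KERNEL 3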

Notes

  • myGEMM1:

    Plain (naive) matrix multiplication.
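
    A minimal sketch of what this looks like as an OpenCL kernel, following the row-major indexing used in the snippets below (kernel and variable names here are illustrative and may differ from the actual source):

    __kernel void gemm_naive(const int M, const int N, const int K,
                             const __global float* A,   // M x K
                             const __global float* B,   // K x N
                             __global float* C) {       // M x N
        const int globalRow = get_global_id(0);   // row of C computed by this work-item
        const int globalCol = get_global_id(1);   // column of C

        // One full dot product per output element, read entirely from global memory
        float acc = 0.0f;
        for (int k=0; k<K; k++) {
            acc += A[globalRow * K + k] * B[k * N + globalCol];
        }
        C[globalRow * N + globalCol] = acc;
    }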

  • myGEMM2:

    Matrices A and B live in global memory. The data each work-group needs is first copied into local memory, so that the computation can read directly from local memory. Two local 2D arrays Asub and Bsub of size TS x TS are declared; this is the amount of local memory each work-group uses.

    const int numTiles = K / TS;
    for (int t=0; t < numTiles; t++) {
        // Load one tile of A and B into local memory
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        // Copy from global memory into local memory
        Asub[row][col] = A[globalRow * K + tiledCol];
        Bsub[row][col] = B[tiledRow * N + globalCol];
        // Wait until every work-item in the group reaches this point;
        // Asub and Bsub are now fully populated
        barrier(CLK_LOCAL_MEM_FENCE);

        // Accumulate the result: each iteration reads two values from local
        // memory (2 x TS per tile), which is much faster than global memory
        for (int k=0; k<TS; k++) {
            acc += Asub[row][k] * Bsub[k][col];
        }

        // Wait for all work-items in the group before loading the next tile
        barrier(CLK_LOCAL_MEM_FENCE);
    }
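
    For reference, a sketch of the index setup and local-array declarations this loop relies on (following the tutorial; the exact code in the repository may differ):

    const int row = get_local_id(0);                   // local row within the TS x TS tile
    const int col = get_local_id(1);                   // local column within the tile
    const int globalRow = TS * get_group_id(0) + row;  // row of C computed by this work-item
    const int globalCol = TS * get_group_id(1) + col;  // column of C

    __local float Asub[TS][TS];                        // per-work-group tile of A
    __local float Bsub[TS][TS];                        // per-work-group tile of B

    float acc = 0.0f;                                  // accumulator for one element of C
    // ... tiled loop shown above ...
    C[globalRow * N + globalCol] = acc;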
    
  • myGEMM3:

    In myGEMM2, each work-item fetches a value once but uses it in only one computation, even though different output points need the same row or column of data. The code is therefore changed so that one work-item computes WPT points, which reduces the number of data reads.

    const int numTiles = K/TS;
    for (int t=0; t<numTiles; t++) {
    
        // Load one tile of A and B into local memory
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        for (int w=0; w<WPT; w++) {
            Asub[row + w * RTS][col] = A[(globalRow + w * RTS) * K + tiledCol];
            Bsub[row + w * RTS][col] = B[(tiledRow + w * RTS) * N + globalCol];
        }
    
        // Synchronise to make sure the tile is loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    
        // Perform the computation for a single tile
        for (int k=0; k<TS; k++) {
            for (int w=0; w<WPT; w++) {
                // Within this inner loop, Bsub[k][col] stays in a register,
                // so it does not need to be re-read from local memory
                acc[w] += Asub[row + w * RTS][k] * Bsub[k][col];
            }
        }
        // Synchronise before loading the next tile
        barrier(CLK_LOCAL_MEM_FENCE);
    }
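
    For reference, a sketch of the extra setup used here (illustrative): RTS = TS / WPT is the reduced work-group size in the row dimension, and acc becomes an array of WPT accumulators.

    // Each work-item now owns WPT rows of the tile, spaced RTS apart
    float acc[WPT];
    for (int w=0; w<WPT; w++) {
        acc[w] = 0.0f;
    }
    // ... tiled loop shown above ...
    for (int w=0; w<WPT; w++) {
        C[(globalRow + w * RTS) * N + globalCol] = acc[w];
    }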
    
  • myGEMM4:

    myGEMM2 modified to use vectorized computation (wide vector data types such as float4).
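
    The core idea, sketched here without the local-memory tiling (assumes row-major storage and N divisible by 4; names are illustrative): one float4 load of B feeds four multiply-adds.

    __kernel void gemm_vectorized(const int M, const int N, const int K,
                                  const __global float*  A,
                                  const __global float4* B,   // B viewed as N/4 float4s per row
                                  __global float4* C) {       // C viewed the same way
        const int row  = get_global_id(0);   // row of C
        const int col4 = get_global_id(1);   // which group of four adjacent columns

        float4 acc = (float4)(0.0f);
        for (int k=0; k<K; k++) {
            const float  a = A[row * K + k];          // one scalar of A
            const float4 b = B[k * (N / 4) + col4];   // four adjacent values of row k of B
            acc += a * b;                             // four multiply-adds per load of B
        }
        C[row * (N / 4) + col4] = acc;
    }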

  • myGEMM5:

    Matrix B is transposed before the computation.
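
    A minimal sketch of the pre-transpose step (the kernel in the repository may additionally tile through local memory). After transposition, the GEMM inner loop can read BT[globalCol * K + k], so consecutive k values are contiguous in memory:

    __kernel void transpose_B(const int K, const int N,
                              const __global float* B,   // K x N, row-major
                              __global float* BT) {      // N x K, row-major
        const int k = get_global_id(0);   // row index into B
        const int n = get_global_id(1);   // column index into B
        BT[n * K + k] = B[k * N + n];
    }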
