Ramping up on parallel programming basics and GPU architecture experiments.
Approach: execute a synthetic vector operation that performs a fixed number of floating-point operations (FLOPs) per memory element accessed. For a given number of FLOPs per element (`FLOP_COUNT`), each thread reads `A[i]`, `B[i]`, `D[i]`, and `E[i]` and writes `C[i]`. With single-precision data, that is four 4-byte reads and one 4-byte write (20 bytes) per element, so arithmetic intensity is roughly `FLOP_COUNT` / 20 FLOPs per byte.

The `vec_opN` kernel is used to sweep arithmetic intensity by increasing `FLOP_COUNT` and analyzing how performance shifts from memory-bound to compute-bound behavior.
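A minimal sketch of what such a kernel might look like. Only the name `vec_opN`, the five arrays, and `FLOP_COUNT` come from the notes above; the kernel body is an assumption. Since one FMA counts as two FLOPs, `FLOP_COUNT / 2` FMA iterations deliver approximately the target FLOP count per element:

```cuda
#include <cuda_runtime.h>

#define FLOP_COUNT 64  // swept at compile time in this sketch

__global__ void vec_opN(const float* A, const float* B,
                        const float* D, const float* E,
                        float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float b   = B[i];
        float de  = D[i] * E[i];  // one extra multiply, negligible vs. the loop
        float acc = A[i];
        #pragma unroll
        for (int k = 0; k < FLOP_COUNT / 2; ++k)
            acc = fmaf(acc, b, de);  // 1 FMA = 2 FLOPs
        C[i] = acc;
    }
}
```

Making `FLOP_COUNT` a compile-time constant lets the compiler fully unroll the FLOP loop, keeping the per-element work purely arithmetic.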
We also investigated standard dense matrix multiplication: we read matrices `A` and `B` and write `C`, each assumed square of size N × N.

The `matmul` kernel evaluates a more realistic compute-bound workload where intensity increases with matrix size: the multiply performs about 2N³ FLOPs over at least 3N² · 4 bytes of mandatory memory traffic, so arithmetic intensity grows roughly as N/6 FLOPs per byte.
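The matmul benchmark presumably follows the standard one-thread-per-output-element pattern; a naive sketch under that assumption (row-major layout and argument names are illustrative, not the actual benchmark code):

```cuda
#include <cuda_runtime.h>

// Naive dense matmul: C = A * B for square N x N matrices,
// one thread per element of C.
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc = fmaf(A[row * N + k], B[k * N + col], acc);
        C[row * N + col] = acc;
    }
}
```

Each output element does 2N FLOPs against 2N loads, but because rows of `A` and columns of `B` are reused across threads, effective intensity rises with N once caching is accounted for.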
These two benchmarks let us sweep across a range of arithmetic intensities, visualizing the memory- and compute-bound regions on the roofline plot of the RTX 3090 (whose FP32 ridge point sits near peak FLOPs over peak bandwidth, roughly 35.6 TFLOP/s / 936 GB/s ≈ 38 FLOPs per byte). Red dots represent `vec_opN`, and black/gray triangles represent `matmul`.
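To place a point on the roofline, each kernel launch can be timed with CUDA events and converted to (intensity, GFLOP/s) coordinates. A self-contained sketch with a stand-in kernel (the harness, sizes, and names here are illustrative assumptions, not the actual benchmark code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a benchmark kernel: 1 FMA (2 FLOPs) per element,
// two reads and one write.
__global__ void standin(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = fmaf(A[i], 2.0f, B[i]);
}

int main() {
    const int n = 1 << 24;
    size_t bytes = n * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    int threads = 256, blocks = (n + threads - 1) / threads;
    standin<<<blocks, threads>>>(A, B, C, n);   // warm-up launch
    cudaEventRecord(start);
    standin<<<blocks, threads>>>(A, B, C, n);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops   = 2.0 * n;       // 1 FMA per element
    double traffic = 3.0 * bytes;   // 2 reads + 1 write
    printf("intensity %.3f FLOP/B, %.1f GFLOP/s\n",
           flops / traffic, flops / ms / 1e6);
    return 0;
}
```

Running the same harness over `vec_opN` at several `FLOP_COUNT` values, and over `matmul` at several N, yields the red-dot and triangle series on the roofline plot.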