Ramping up on parallel programming basics and GPU architecture experiments.
Approach: execute a synthetic vector operation that performs a fixed number of floating-point operations (FLOPs) per memory element accessed. For a given number of FLOPs per element (`FLOP_COUNT`), each thread reads `A[i]`, `B[i]`, `D[i]`, and `E[i]` and writes `C[i]`. With single-precision data, that is four 4-byte reads and one 4-byte write (20 bytes) per element, so arithmetic intensity is roughly `FLOP_COUNT` / 20 FLOPs per byte.

The `vec_opN` kernel is used to sweep arithmetic intensity by increasing `FLOP_COUNT` and analyzing how performance shifts from memory-bound to compute-bound behavior.
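A minimal sketch of what such a kernel might look like. Only the name `vec_opN`, the five arrays, and `FLOP_COUNT` come from the notes above; the kernel body is an assumption. Since one FMA counts as two FLOPs, `FLOP_COUNT / 2` FMA iterations deliver approximately the target FLOP count per element:

```cuda
#include <cuda_runtime.h>

#define FLOP_COUNT 64  // swept at compile time in this sketch

__global__ void vec_opN(const float* A, const float* B,
                        const float* D, const float* E,
                        float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float b   = B[i];
        float de  = D[i] * E[i];  // one extra multiply, negligible vs. the loop
        float acc = A[i];
        #pragma unroll
        for (int k = 0; k < FLOP_COUNT / 2; ++k)
            acc = fmaf(acc, b, de);  // 1 FMA = 2 FLOPs
        C[i] = acc;
    }
}
```

Making `FLOP_COUNT` a compile-time constant lets the compiler fully unroll the FLOP loop, keeping the per-element work purely arithmetic.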
We also investigated standard dense matrix multiplication: we read matrices `A` and `B` and write `C`, each assumed square of size N × N.

The `matmul` kernel evaluates a more realistic compute-bound workload where intensity increases with matrix size: the multiply performs about 2N³ FLOPs over at least 3N² · 4 bytes of mandatory memory traffic, so arithmetic intensity grows roughly as N/6 FLOPs per byte.
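The matmul benchmark presumably follows the standard one-thread-per-output-element pattern; a naive sketch under that assumption (row-major layout and argument names are illustrative, not the actual benchmark code):

```cuda
#include <cuda_runtime.h>

// Naive dense matmul: C = A * B for square N x N matrices,
// one thread per element of C.
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc = fmaf(A[row * N + k], B[k * N + col], acc);
        C[row * N + col] = acc;
    }
}
```

Each output element does 2N FLOPs against 2N loads, but because rows of `A` and columns of `B` are reused across threads, effective intensity rises with N once caching is accounted for.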
These two benchmarks let us sweep across a range of arithmetic intensities, visualizing the memory- and compute-bound regions on the roofline plot of the RTX 3090 (whose FP32 ridge point sits near peak FLOPs over peak bandwidth, roughly 35.6 TFLOP/s / 936 GB/s ≈ 38 FLOPs per byte). Red dots represent `vec_opN`, and black/gray triangles represent `matmul`.
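To place a point on the roofline, each kernel launch can be timed with CUDA events and converted to (intensity, GFLOP/s) coordinates. A self-contained sketch with a stand-in kernel (the harness, sizes, and names here are illustrative assumptions, not the actual benchmark code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a benchmark kernel: 1 FMA (2 FLOPs) per element,
// two reads and one write.
__global__ void standin(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = fmaf(A[i], 2.0f, B[i]);
}

int main() {
    const int n = 1 << 24;
    size_t bytes = n * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    int threads = 256, blocks = (n + threads - 1) / threads;
    standin<<<blocks, threads>>>(A, B, C, n);   // warm-up launch
    cudaEventRecord(start);
    standin<<<blocks, threads>>>(A, B, C, n);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops   = 2.0 * n;       // 1 FMA per element
    double traffic = 3.0 * bytes;   // 2 reads + 1 write
    printf("intensity %.3f FLOP/B, %.1f GFLOP/s\n",
           flops / traffic, flops / ms / 1e6);
    return 0;
}
```

Running the same harness over `vec_opN` at several `FLOP_COUNT` values, and over `matmul` at several N, yields the red-dot and triangle series on the roofline plot.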