Kernel filtering with Omniperf profile · Issue #325 · ROCm/rocprofiler-compute · GitHub

Kernel filtering with Omniperf profile #325

Closed
ausellis0 opened this issue Mar 20, 2024 · 3 comments
ausellis0 commented Mar 20, 2024

Describe the bug
On the 2.x branch, the kernel filtering option -k of omniperf profile does not limit which kernels are profiled. After running omniperf analyze, all kernels are still present. Filtering at profile time is needed for applications with a very large number of kernels and dispatches. The -k option does work as expected with omniperf analyze.

Development Environment:

  • Linux Distribution: RHEL/8.9 (TOSS)
  • Omniperf Version: 2.0.0-RC1 (6222138)
  • GPU: MI300A
  • Cluster (if applicable): LLNL System

To Reproduce
Steps to reproduce the behavior:

git clone https://github.com/ROCm/HIP-Examples.git
cd HIP-Examples/add4

./buildit.sh

ROCPROF=${ROCM_PATH}/bin/rocprofv2 omniperf profile -n add4_test -k "add" -- ./gpu-stream-hip
omniperf analyze -p workloads/add4_test/MI300A_A1
# All kernels are present in Top Stats, and all metrics are essentially the same as they would be without any filtering in `omniperf profile`.

Expected behavior
omniperf profile with -k is expected to collect information only on kernels that match the filter passed to the flag.
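
One way to check whether the filter took effect is to look for kernel names in the collected data that do not contain the filter string. The following is a minimal Python sketch, not part of Omniperf. It assumes the collected data can be read as CSV with a Kernel_Name column (the column name matches the analyze tables, but the file layout is an assumption) and that -k behaves like a plain substring match. The helper name and the sample rows are hypothetical.

```python
import csv
import io


def unfiltered_kernels(csv_text, kernel_filter, column="Kernel_Name"):
    """Return the sorted, distinct kernel names that do NOT contain kernel_filter.

    Assumption: substring matching, which may differ from what the profiler does.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    names = {row[column] for row in reader}
    return sorted(name for name in names if kernel_filter not in name)


# Hypothetical sample data; a real run would read the CSV from the workload directory.
SAMPLE = (
    "Dispatch_ID,Kernel_Name\n"
    '1,"void init_kernel<double>(double*, double*, double*, double, double, double) (.kd)"\n'
    '2,"void copy_kernel<double>(double const*, double*) (.kd)"\n'
    '3,"void add_kernel<double>(double const*, double const*, double*) (.kd)"\n'
    '4,"void copy_kernel<double>(double const*, double*) (.kd)"\n'
)

leftovers = unfiltered_kernels(SAMPLE, "add")
for name in leftovers:
    print("not filtered out:", name)
```

An empty result would suggest the profile-time filter was applied; any names printed are kernels that were collected despite -k.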

@coleramos425 (Collaborator)

@ausellis0 After toying around with a reproducer, I found that this issue can be isolated to rocprofv2. In the two cases below, Case 1 yields the same input file as Case 2, so the profiler input is consistent. I suspect rocprofv2 either expects a different format or skips the filter altogether.

ROCm version: 6.0.2
Distro: RHEL 8.9

Case 1 (rocprofv1)

$ omniperf profile -n stream_k_filt -k "add" -- hpc_apps/HIP-Examples/GPU-STREAM/build/hip-stream
...
$ omniperf analyze -p workloads/stream_k_filt/MI200/ -b 0

  ___                  _                  __
 / _ \ _ __ ___  _ __ (_)_ __   ___ _ __ / _|
| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_
| |_| | | | | | | | | | | |_) |  __/ |  |  _|
 \___/|_| |_| |_|_| |_|_| .__/ \___|_|  |_|
                        |_|

Analysis mode = cli
[analysis] deriving Omniperf metrics...

--------------------------------------------------------------------------------
0. Top Stats
0.1 Top Kernels
╒════╤══════════════════════════════════════════╤═════════╤═════════════╤════════════╤══════════════╤════════╕
│    │ Kernel_Name                              │   Count │     Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪══════════════════════════════════════════╪═════════╪═════════════╪════════════╪══════════════╪════════╡
│  0 │ void add_kernel<double>(double const*, d │  100.00 │ 63611084.00 │  636110.84 │    636002.00 │ 100.00 │
│    │ ouble const*, double*) [clone .kd]       │         │             │            │              │        │
╘════╧══════════════════════════════════════════╧═════════╧═════════════╧════════════╧══════════════╧════════╛
0.2 Dispatch List
╒════╤═══════════════╤════════════════════════════════════════════════════════════════════════════╤══════════╕
│    │   Dispatch_ID │ Kernel_Name                                                                │   GPU_ID │
╞════╪═══════════════╪════════════════════════════════════════════════════════════════════════════╪══════════╡
│  0 │             3 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  1 │             8 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  2 │            13 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  3 │            18 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  4 │            23 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  5 │            28 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  6 │            33 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  7 │            38 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  8 │            43 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
├────┼───────────────┼────────────────────────────────────────────────────────────────────────────┼──────────┤
│  9 │            48 │ void add_kernel<double>(double const*, double const*, double*) [clone .kd] │        2 │
╘════╧═══════════════╧════════════════════════════════════════════════════════════════════════════╧══════════╛
$ cat workloads/stream_k_filt/MI200/perfmon/pmc_perf_0.txt
pmc:  SQ_CYCLES SQ_BUSY_CYCLES SQ_WAVES SQ_INSTS_VALU_CVT SQ_INSTS_VMEM_WR SQ_INSTS_VMEM_RD SQ_INSTS_VMEM SQ_INSTS_SALU GRBM_COUNT GRBM_GUI_ACTIVE TCP_GATE_EN1_sum TCP_GATE_EN2_sum TCP_TD_TCP_STALL_CYCLES_sum TCP_TCR_TCP_STALL_CYCLES_sum TA_TA_BUSY_sum TA_BUFFER_WAVEFRONTS_sum TD_TD_BUSY_sum TD_TC_STALL_sum SPI_CSN_WINDOW_VALID SPI_CSN_BUSY CPC_CPC_STAT_BUSY CPC_CPC_STAT_IDLE CPF_CPF_STAT_BUSY CPF_CPF_STAT_STALL TCC_CYCLE_sum TCC_BUSY_sum TCC_PROBE_sum TCC_PROBE_ALL_sum

gpu:
range:
kernel: add

Case 2 (rocprofv2)

$ export ROCPROF=rocprofv2
$ omniperf profile -n stream_k_filt_2 -k "add" -- hpc_apps/HIP-Examples/GPU-STREAM/build/hip-stream
...
$ omniperf analyze -p workloads/stream_k_filt_2/MI200/ -b 0

  ___                  _                  __
 / _ \ _ __ ___  _ __ (_)_ __   ___ _ __ / _|
| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_
| |_| | | | | | | | | | | |_) |  __/ |  |  _|
 \___/|_| |_| |_|_| |_|_| .__/ \___|_|  |_|
                        |_|

Analysis mode = cli
[analysis] deriving Omniperf metrics...

--------------------------------------------------------------------------------
0. Top Stats
0.1 Top Kernels
╒════╤══════════════════════════════════════════╤═════════╤═════════════╤════════════╤══════════════╤═══════╕
│    │ Kernel_Name                              │   Count │     Sum(ns) │   Mean(ns) │   Median(ns) │   Pct │
╞════╪══════════════════════════════════════════╪═════════╪═════════════╪════════════╪══════════════╪═══════╡
│  0 │ void add_kernel<double>(double const*, d │  100.00 │ 63845927.00 │  638459.27 │    638460.75 │ 26.23 │
│    │ ouble const*, double*) (.kd)             │         │             │            │              │       │
├────┼──────────────────────────────────────────┼─────────┼─────────────┼────────────┼──────────────┼───────┤
│  1 │ void triad_kernel<double>(double*, doubl │  100.00 │ 63837188.00 │  638371.88 │    638368.00 │ 26.22 │
│    │ e const*, double const*) (.kd)           │         │             │            │              │       │
├────┼──────────────────────────────────────────┼─────────┼─────────────┼────────────┼──────────────┼───────┤
│  2 │ void copy_kernel<double>(double const*,  │  100.00 │ 38598405.50 │  385984.05 │    385985.00 │ 15.85 │
│    │ double*) (.kd)                           │         │             │            │              │       │
├────┼──────────────────────────────────────────┼─────────┼─────────────┼────────────┼──────────────┼───────┤
│  3 │ void mul_kernel<double>(double*, double  │  100.00 │ 38538754.50 │  385387.54 │    385398.50 │ 15.83 │
│    │ const*) (.kd)                            │         │             │            │              │       │
├────┼──────────────────────────────────────────┼─────────┼─────────────┼────────────┼──────────────┼───────┤
│  4 │ void dot_kernel<double>(double const*, d │  100.00 │ 38082883.00 │  380828.83 │    380813.00 │ 15.64 │
│    │ ouble const*, double*, int) (.kd)        │         │             │            │              │       │
├────┼──────────────────────────────────────────┼─────────┼─────────────┼────────────┼──────────────┼───────┤
│  5 │ void init_kernel<double>(double*, double │    1.00 │   551168.00 │  551168.00 │    551168.00 │  0.23 │
│    │ *, double*, double, double, double) (.kd │         │             │            │              │       │
│    │ )                                        │         │             │            │              │       │
╘════╧══════════════════════════════════════════╧═════════╧═════════════╧════════════╧══════════════╧═══════╛
0.2 Dispatch List
╒════╤═══════════════╤═══════════════════════════════════════════════════════════════════════════════════╤══════════╕
│    │   Dispatch_ID │ Kernel_Name                                                                       │   GPU_ID │
╞════╪═══════════════╪═══════════════════════════════════════════════════════════════════════════════════╪══════════╡
│  0 │             1 │ void init_kernel<double>(double*, double*, double*, double, double, double) (.kd) │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  1 │             2 │ void copy_kernel<double>(double const*, double*) (.kd)                            │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  2 │             3 │ void mul_kernel<double>(double*, double const*) (.kd)                             │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  3 │             4 │ void add_kernel<double>(double const*, double const*, double*) (.kd)              │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  4 │             5 │ void triad_kernel<double>(double*, double const*, double const*) (.kd)            │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  5 │             6 │ void dot_kernel<double>(double const*, double const*, double*, int) (.kd)         │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  6 │             7 │ void copy_kernel<double>(double const*, double*) (.kd)                            │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  7 │             8 │ void mul_kernel<double>(double*, double const*) (.kd)                             │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  8 │             9 │ void add_kernel<double>(double const*, double const*, double*) (.kd)              │     9354 │
├────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────┼──────────┤
│  9 │            10 │ void triad_kernel<double>(double*, double const*, double const*) (.kd)            │     9354 │
╘════╧═══════════════╧═══════════════════════════════════════════════════════════════════════════════════╧══════════╛
$ cat workloads/stream_k_filt_2/MI200/perfmon/pmc_perf_0.txt
pmc:  SQ_CYCLES SQ_BUSY_CYCLES SQ_WAVES SQ_INSTS_VALU_CVT SQ_INSTS_VMEM_WR SQ_INSTS_VMEM_RD SQ_INSTS_VMEM SQ_INSTS_SALU GRBM_COUNT GRBM_GUI_ACTIVE TCP_GATE_EN1_sum TCP_GATE_EN2_sum TCP_TD_TCP_STALL_CYCLES_sum TCP_TCR_TCP_STALL_CYCLES_sum TA_TA_BUSY_sum TA_BUFFER_WAVEFRONTS_sum TD_TD_BUSY_sum TD_TC_STALL_sum SPI_CSN_WINDOW_VALID SPI_CSN_BUSY CPC_CPC_STAT_BUSY CPC_CPC_STAT_IDLE CPF_CPF_STAT_BUSY CPF_CPF_STAT_STALL TCC_CYCLE_sum TCC_BUSY_sum TCC_PROBE_sum TCC_PROBE_ALL_sum

gpu:
range:
kernel: add
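
The claim that both cases receive the same profiler input can be checked mechanically. Below is a small Python sketch that parses the "key: value" lines seen in pmc_perf_0.txt and compares two inputs. The parser is an illustration based only on the file contents shown in this thread, not the format definition used by rocprof, and the counter list in the sample is shortened for readability.

```python
def parse_perfmon_input(text):
    """Parse 'key: value ...' lines into a dict mapping key -> list of tokens."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a colon
        fields[key.strip()] = value.split()
    return fields


# Shortened copy of the input shown above (the real pmc line lists more counters).
V1_INPUT = "pmc:  SQ_CYCLES SQ_BUSY_CYCLES SQ_WAVES\n\ngpu:\nrange:\nkernel: add\n"
V2_INPUT = "pmc:  SQ_CYCLES SQ_BUSY_CYCLES SQ_WAVES\n\ngpu:\nrange:\nkernel: add\n"

v1 = parse_perfmon_input(V1_INPUT)
v2 = parse_perfmon_input(V2_INPUT)
print("kernel filter:", v1["kernel"])
print("inputs match:", v1 == v2)
```

If the parsed inputs match but the collected kernels differ, as in the two cases above, the difference is more likely in how each rocprof version handles the kernel line than in what Omniperf writes.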

coleramos425 self-assigned this Mar 21, 2024
coleramos425 added the rocprofiler label and removed the bug label Mar 21, 2024
jandrej commented Aug 23, 2024

Filtering is still an issue with rocprofv2:

$ rocprofv2 --version
ROCm version: 6.1.2-119
ROCProfiler version: 2.0
$ omniperf --version
----------------------------------------
Omniperf version: 2.0.0 (release)
Git revision:     62221383
----------------------------------------

@ppanchad-amd

@ausellis0 Closing this ticket since the issue has been fixed in ROCm.
@jandrej Please open a new ticket for your issue under 'rocprofiler'. Thanks!
