Profiler with Kineto has "orphan+childless" function events (on P100) · Issue #54267 · pytorch/pytorch · GitHub

Profiler with Kineto has "orphan+childless" function events (on P100) #54267

Closed
indigoviolet opened this issue Mar 18, 2021 · 7 comments
Labels
oncall: profiler profiler-related issues (cpu, gpu, kineto) triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@indigoviolet
indigoviolet commented Mar 18, 2021

🐛 Bug

With the use_kineto=True flag on a P100 GPU, the torch profiler returns some FunctionEvents that have neither a parent nor children.

For AlexNet, here are the names of these events:

{'Memcpy DtoD (Device -> Device)',
 'aten::dropout',
 'maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt',
 'void at::native::(anonymous namespace)::adaptive_average_pool<float>(float*, float*, int, int, int, int, long, long, long)',
 'void at::native::(anonymous namespace)::max_pool_forward_nchw<float, float>(int, float const*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*, long*)',
 'void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*, 3>, OffsetCalculator<2, unsigned int>, OffsetCalculator<1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, at::native::AddFunctor<float>, at::detail::Array<char*, 3>, OffsetCalculator<2, unsigned int>, OffsetCalculator<1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast)',
 'void at::native::vectorized_elementwise_kernel<4, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}, at::detail::Array<char*, 3> >(int, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}, at::detail::Array<char*, 3>)',
 'void cudnn::detail::explicit_convolve_sgemm<float, int, 128, 5, 5, 3, 3, 3, 0, true>(int, int, int, float const*, int, float const*, int, float*, kernel_conv_params, int, int, float, float, int, float*, float*)',
 'void cudnn::winograd::generateWinogradTilesKernel<0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)',
 'void gemv2T_kernel_val<int, int, float, float, float, 128, 16, 2, 2, false, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>, float, float)',
 'void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const*, float*, int)'}
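
For reference, a minimal sketch of how such events can be collected (not the Colab notebook itself; it assumes the 1.8-era autograd profiler API, where each FunctionEvent exposes cpu_parent and cpu_children):

    import torch
    import torchvision
    import torch.autograd.profiler as tprofiler

    model = torchvision.models.alexnet().cuda().eval()
    input_batch = torch.randn(8, 3, 224, 224, device="cuda")

    with tprofiler.profile(use_cuda=True, use_kineto=True) as prof:
        with tprofiler.record_function("Overall"):
            model(input_batch)
            torch.cuda.synchronize()

    # "Orphan+childless" events: no CPU parent and no CPU children.
    orphans = {
        evt.name
        for evt in prof.function_events
        if evt.cpu_parent is None and not evt.cpu_children
    }
    print(orphans)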

To Reproduce

https://colab.research.google.com/drive/1kiOtdCilQ96lM_3WhT_A14PGOjEGluKE#scrollTo=pV6sSyiDqVP-

Expected behavior

I had expected the set of events to be a tree with a single root, especially if there is a top-level record_function and we are doing torch.cuda.synchronize. This has been the case with use_kineto=False in my experience. Perhaps my mental model is incorrect, in which case please point me to any documentation about this -- I have not been able to find it.

Environment


Collecting environment information...
PyTorch version: 1.8.0+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.0.221
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.8.0+cu101
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.9.0
[pip3] torchvision==0.9.0+cu101
[conda] Could not collect

Additional context

On a K80, there was only a single aten::dropout event that showed up in this list.

cc @ilia-cher @gdankel @ngimel

@ejguan ejguan added the oncall: profiler profiler-related issues (cpu, gpu, kineto) and triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Mar 18, 2021
@gdankel
Contributor
gdankel commented Mar 23, 2021

Thanks for reporting. We currently do include GPU events that have no corresponding CPU event by default. Let me take a look at where that comes from in this particular case.

@indigoviolet
Author

@gdankel Thanks for investigating. My confusion stems from the fact that I'm wrapping my entire profiled code in a record_function block, and I'm using torch.cuda.synchronize, so my mental model is that there should be no GPU events outside a CPU event (at least timewise). Would love to understand better.

    # assuming: import torch.autograd.profiler as tprofiler
    with tprofiler.profile(use_cuda=True, use_kineto=True) as prof:
        with tprofiler.record_function('Overall'):
            output = model(input_batch)
            torch.cuda.synchronize()

@ilia-cher
Contributor

I had expected the set of events to be a tree with a single root

I believe that might just not be true anymore: we treat on-device events as a separate class of events, outside the CPU hierarchy, since they are not executed on the CPU. We do, however, save the information (a correlation id) needed to associate on-device events with CPU events.
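
Roughly, from the Python side (a sketch, assuming FunctionEvent's cpu_parent, cpu_children, and kernels fields from the 1.8-era API; prof is the profiler context from the snippet above):

    events = prof.function_events

    # CPU ops still form a parent/child tree under the top-level record_function.
    cpu_roots = [e for e in events if e.cpu_parent is None and e.cpu_children]

    # Device-side events sit outside that tree, but each CPU op carries the
    # kernels attributed to it (via correlation id) in its .kernels list.
    for e in events:
        for k in e.kernels:
            print(f"{e.name} -> {k.name} ({k.duration} us on device {k.device})")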

@ilia-cher
Contributor

in which case please point me to any documentation about this

We have basic docs for the new profiler, but we'll make sure to extend the existing tutorial to cover it for the 1.9 release (in 1.8 the new profiler is an experimental preview).

@indigoviolet
Author

@gdankel @ilia-cher I think I'm still a bit confused: is this a bug or not? What does it mean that these GPU events don't belong within a CPU event, especially the top-level record_function call? Doesn't that break the purpose of record_function?

@ilia-cher
Contributor

The purpose of record_function is to record custom user-level events (a custom label, as a CPU event).
To see what we record and how, I'd recommend exporting the profile to a Chrome trace (with prof.export_chrome_trace(path)) and visualizing the events with chrome://tracing; then I think the relationship between CPU and CUDA events should be clear.
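
For example (a minimal sketch; the output file path is illustrative):

    with tprofiler.profile(use_cuda=True, use_kineto=True) as prof:
        with tprofiler.record_function("Overall"):
            output = model(input_batch)
            torch.cuda.synchronize()

    # Writes a Chrome trace JSON; load it at chrome://tracing. CPU ops and CUDA
    # kernels appear on separate tracks, tied together by correlation ids.
    prof.export_chrome_trace("/tmp/alexnet_trace.json")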

@aaronenyeshi
Member

Closing old issue that has an answer.
