Profiler with Kineto has "orphan+childless" function events (on P100) · Issue #54267 · pytorch/pytorch · GitHub

Profiler with Kineto has "orphan+childless" function events (on P100) #54267

Closed
indigoviolet opened this issue Mar 18, 2021 · 7 comments
Labels
oncall: profiler profiler-related issues (cpu, gpu, kineto) triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@indigoviolet
indigoviolet commented Mar 18, 2021

🐛 Bug

With the use_kineto=True flag on a P100 GPU, the torch profiler returns some FunctionEvents that have neither a parent nor children.

For AlexNet, here are the names of these events:

{'Memcpy DtoD (Device -> Device)',
 'aten::dropout',
 'maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt',
 'void at::native::(anonymous namespace)::adaptive_average_pool<float>(float*, float*, int, int, int, int, long, long, long)',
 'void at::native::(anonymous namespace)::max_pool_forward_nchw<float, float>(int, float const*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*, long*)',
 'void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*, 3>, OffsetCalculator<2, unsigned int>, OffsetCalculator<1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, at::native::AddFunctor<float>, at::detail::Array<char*, 3>, OffsetCalculator<2, unsigned int>, OffsetCalculator<1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast)',
 'void at::native::vectorized_elementwise_kernel<4, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}, at::detail::Array<char*, 3> >(int, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}, at::detail::Array<char*, 3>)',
 'void cudnn::detail::explicit_convolve_sgemm<float, int, 128, 5, 5, 3, 3, 3, 0, true>(int, int, int, float const*, int, float const*, int, float*, kernel_conv_params, int, int, float, float, int, float*, float*)',
 'void cudnn::winograd::generateWinogradTilesKernel<0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)',
 'void gemv2T_kernel_val<int, int, float, float, float, 128, 16, 2, 2, false, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>, float, float)',
 'void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const*, float*, int)'}
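
For reference, a minimal sketch of how such events can be collected (not the Colab notebook itself; it assumes the 1.8-era autograd profiler API, where each FunctionEvent exposes cpu_parent and cpu_children):

    import torch
    import torchvision
    import torch.autograd.profiler as tprofiler

    model = torchvision.models.alexnet().cuda().eval()
    input_batch = torch.randn(8, 3, 224, 224, device="cuda")

    with tprofiler.profile(use_cuda=True, use_kineto=True) as prof:
        with tprofiler.record_function("Overall"):
            model(input_batch)
            torch.cuda.synchronize()

    # "Orphan+childless" events: no CPU parent and no CPU children.
    orphans = {
        evt.name
        for evt in prof.function_events
        if evt.cpu_parent is None and not evt.cpu_children
    }
    print(orphans)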

To Reproduce

https://colab.research.google.com/drive/1kiOtdCilQ96lM_3WhT_A14PGOjEGluKE#scrollTo=pV6sSyiDqVP-

Expected behavior

I had expected the set of events to be a tree with a single root, especially if there is a top-level record_function and we are doing torch.cuda.synchronize. This has been the case with use_kineto=False in my experience. Perhaps my mental model is incorrect, in which case please point me to any documentation about this -- I have not been able to find it.

Environment


Collecting environment information...
PyTorch version: 1.8.0+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.0.221
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.8.0+cu101
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.9.0
[pip3] torchvision==0.9.0+cu101
[conda] Could not collect

Additional context

On a K80, there was only a single aten::dropout event that showed up in this list.

cc @ilia-cher @gdankel @ngimel

@ejguan ejguan added the oncall: profiler profiler-related issues (cpu, gpu, kineto) and triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Mar 18, 2021
@gdankel
Contributor
gdankel commented Mar 23, 2021

Thanks for reporting. We currently do include GPU events that have no corresponding CPU event by default. Let me take a look at where that comes from in this particular case.

@indigoviolet
Author

@gdankel Thanks for investigating. My confusion stems from the fact that I'm wrapping my entire profiled code in a record_function block, and I'm using torch.cuda.synchronize, so my mental model is that there should be no GPU events outside a CPU event (at least timewise). Would love to understand better.

    # assuming: import torch.autograd.profiler as tprofiler
    with tprofiler.profile(use_cuda=True, use_kineto=True) as prof:
        with tprofiler.record_function('Overall'):
            output = model(input_batch)
            torch.cuda.synchronize()

@ilia-cher
Contributor

I had expected the set of events to be a tree with a single root

I believe that might just not be true anymore: we treat on-device events as a separate class of events, outside the CPU hierarchy, since they are not executed on the CPU. We do, however, save the information (a correlation id) needed to associate on-device events with CPU events.
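
Roughly, from the Python side (a sketch, assuming FunctionEvent's cpu_parent, cpu_children, and kernels fields from the 1.8-era API; prof is the profiler context from the snippet above):

    events = prof.function_events

    # CPU ops still form a parent/child tree under the top-level record_function.
    cpu_roots = [e for e in events if e.cpu_parent is None and e.cpu_children]

    # Device-side events sit outside that tree, but each CPU op carries the
    # kernels attributed to it (via correlation id) in its .kernels list.
    for e in events:
        for k in e.kernels:
            print(f"{e.name} -> {k.name} ({k.duration} us on device {k.device})")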

@ilia-cher
Contributor

in which case please point me to any documentation about this

We have basic docs for the new profiler, but we'll make sure to extend the existing tutorial to cover it for the 1.9 release (in 1.8 the new profiler is an experimental preview).

@indigoviolet
Author

@gdankel @ilia-cher I think I'm still a bit confused: is this a bug or not? What does it mean that these GPU events don't belong within a CPU event, especially the top-level record_function call? Doesn't that break the purpose of record_function?

@ilia-cher
Contributor

The purpose of record_function is to record custom user-level events (a custom label, as a CPU event).
To see what we record and how, I'd recommend exporting the profile to a Chrome trace (with prof.export_chrome_trace(path)) and visualizing the events with chrome://tracing; then I think the relationship between CPU and CUDA events should be clear.
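
For example (a minimal sketch; the output file path is illustrative):

    with tprofiler.profile(use_cuda=True, use_kineto=True) as prof:
        with tprofiler.record_function("Overall"):
            output = model(input_batch)
            torch.cuda.synchronize()

    # Writes a Chrome trace JSON; load it at chrome://tracing. CPU ops and CUDA
    # kernels appear on separate tracks, tied together by correlation ids.
    prof.export_chrome_trace("/tmp/alexnet_trace.json")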

@aaronenyeshi
Member

Closing old issue that has an answer.
