Multi-GPU training hangs: one GPU at 0% utilization, the others at 100% · Issue #11832 · open-mmlab/mmdetection · GitHub

Multi-GPU training hangs: one GPU at 0% utilization, the others at 100% #11832

Open
WangJian981002 opened this issue Jul 5, 2024 · 4 comments
@WangJian981002

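As shown in the screenshot:
image
After I define a custom model, training keeps stalling. Intermittently, one GPU drops to 0% utilization while all the others sit at 100%, and training speed falls sharply.

Models that are already set up in the training configs do not have this problem.

I have already tried export NCCL_P2P_DISABLE=1 and adjusting num_workers; neither helped.

Does anyone know how to fix this?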
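
For readers following along, the two workarounds mentioned above would typically look something like the sketch below. This only illustrates what was tried and is not a fix (the poster reports that neither helped). The train_dataloader layout assumes an MMDetection 3.x-style config, and the specific values are illustrative assumptions, not the poster's actual settings.

```python
import os

# Workaround 1 from the post: disable NCCL peer-to-peer transfers.
# Equivalent to `export NCCL_P2P_DISABLE=1` in the launching shell;
# it has to be set before the distributed process group is initialised.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Workaround 2 from the post: adjust the number of dataloader workers.
# Assumed MMDetection 3.x-style config fragment; values are illustrative only.
train_dataloader = dict(
    batch_size=2,             # hypothetical per-GPU batch size
    num_workers=2,            # the setting the poster tried varying
    persistent_workers=True,  # keep workers alive between epochs
)
```

Setting the variable from Python only affects processes started afterwards from the same environment, so exporting it in the shell before launching the distributed job is usually the safer route.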

@WangJian981002
Author

image
Training gets slower and slower because of the stalls, while the other GPUs sit waiting.

@kimsolo
kimsolo commented Jul 31, 2024

Has this been resolved? I am running into the same problem.

@DemoGit4LIANG

Same problem here.

@Jctrp
Jctrp commented Apr 24, 2025

Hello, I ran into the same problem. Could you tell me how you solved it?
