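As shown in the screenshot: after switching to a custom model, training keeps stalling. Intermittently a single GPU shows 0% utilization while all the others sit at 100%, and training speed drops sharply.
Models that are already set up in the provided training configs do not have this problem.
I have already tried export NCCL_P2P_DISABLE=1 and adjusting num_workers; neither helped.
Does anyone know how to fix this?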
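For anyone reproducing this, the two workarounds mentioned above can be sketched roughly as follows. This is only an illustration of what was tried, not a fix: the helper name `dataloader_cfg`, the `batch_size` value, and the list of worker counts are hypothetical, and only `NCCL_P2P_DISABLE` and `num_workers` come from the post.

```python
import os

# Equivalent of `export NCCL_P2P_DISABLE=1`. It has to be set before the
# training process initializes NCCL, so do it at the very top of the launcher.
os.environ["NCCL_P2P_DISABLE"] = "1"


def dataloader_cfg(num_workers):
    """Build a dataloader settings dict (hypothetical helper).

    Only `num_workers` is taken from the post; `batch_size` is a placeholder.
    `persistent_workers` is a PyTorch DataLoader option that is only valid
    when num_workers > 0, hence the guard.
    """
    return dict(
        batch_size=2,
        num_workers=num_workers,
        persistent_workers=num_workers > 0,
    )


# Worker counts to try one at a time while watching per-GPU utilization.
candidates = [dataloader_cfg(n) for n in (0, 2, 4, 8)]

for cfg in candidates:
    print(cfg)
```

Running with `num_workers=0` is a useful baseline, since it rules out worker-process issues in data loading at the cost of speed.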
Training keeps getting slower because of the stalls, while the remaining GPUs sit waiting.
Has this been solved? I'm running into the same problem.
Same problem here.
Hello, I ran into this problem too. Could you tell me how you solved it?
BIGWangYuDong