-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Add kwargs to _moments to support additional Keras parameters #3775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: TJ <tjx@nvidia.com>
Unit Test Results 1 155 files + 106 1 155 suites +106 13h 17m 48s ⏱️ + 1h 26m 21s Results for commit 177e730. ± Comparison against base commit 811cf67. ♻️ This comment has been updated with latest results. |
Unit Test Results (with flaky tests) 1 314 files - 63 1 314 suites - 63 14h 2m 2s ⏱️ - 9m 13s For more details on these failures, see this check. Results for commit 177e730. ± Comparison against base commit 811cf67. ♻️ This comment has been updated with latest results. |
try to fix keras legacy optimizer Signed-off-by: TJ <tjx@nvidia.com>
We are now seeing
https://buildkite.com/horovod/horovod/builds/8663#018489d6-74e6-4fb7-b91c-b4952ab5e9a0/232-367 Shall we fix this in a separate PR? |
found this line in the logs: |
Only cuda11.8 is available with cudnn 8.6. Should I also change the base to |
Signed-off-by: TJ <tjx@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment, LGTM!
CUDA_DOCKER_VERSION: 11.6.1-devel-ubuntu20.04 | ||
CUDNN_VERSION: 8.4.1.50-1+cuda11.6 | ||
CUDA_DOCKER_VERSION: 11.8.0-devel-ubuntu20.04 | ||
CUDNN_VERSION: 8.6.0.163-1+cuda11.8 | ||
NCCL_VERSION_OVERRIDE: 2.11.4-1+cuda11.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also move NCCL to cuda 11.8, e.g. libnccl2_2.15.5-1+cuda11.8_amd64.deb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, i think we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm looks like the new nccl has some issues, seeing this in the tests:
"[1]<stdout>:E tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__HorovodAllgather_device_/job:localhost/replica:0/task:0/device:GPU:0}} ncclAllGather failed: invalid argument [Op:HorovodAllgather] name: test_start.sz"
will have to upgrade it in another separate pr then.
Signed-off-by: TJ <tjx@nvidia.com>
Signed-off-by: TJ <tjx@nvidia.com>
NCCL upgrade was still worth trying. Thanks for fixing this, this is ready to be merged then. |
Agreed. I'm tracking it in a separate issue here. |
Signed-off-by: TJ tjx@nvidia.com
Checklist before submitting
Description
Add kwargs to support addition parameters added by keras.layers.batchnormalization.
Fixes # (issue).
Review process to land