Add kwargs to _moments to support additional Keras parameters #3775

Tixxx · 2022-11-17T20:01:02Z

Signed-off-by: TJ <tjx@nvidia.com>

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

Add kwargs to _moments to support the additional parameters added by keras.layers.BatchNormalization.
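A minimal sketch of why the override needs to accept kwargs. The classes below are hypothetical stand-ins written in plain Python, not the Keras or Horovod code, and the parameter name `mask` is only an assumption for illustration; the description above does not name the new Keras parameters.

```python
from statistics import fmean, pvariance


class BaseNorm:
    """Hypothetical stand-in for a base layer whose _moments hook gained a parameter."""

    def _moments(self, inputs, reduction_axes, keep_dims, mask=None):
        # `mask` plays the role of a newly added keyword (assumed name).
        if mask is not None:
            inputs = [x for x, keep in zip(inputs, mask) if keep]
        return fmean(inputs), pvariance(inputs)

    def call(self, inputs, mask=None):
        # The newer base class passes the extra keyword on to the hook.
        return self._moments(inputs, reduction_axes=[0], keep_dims=False, mask=mask)


class OldSyncNorm(BaseNorm):
    """Override with the old fixed signature: breaks once the base passes `mask`."""

    def _moments(self, inputs, reduction_axes, keep_dims):
        return super()._moments(inputs, reduction_axes, keep_dims)


class SyncNorm(BaseNorm):
    """Override that accepts and forwards any extra keywords."""

    def _moments(self, inputs, reduction_axes, keep_dims, **kwargs):
        mean, var = super()._moments(inputs, reduction_axes, keep_dims, **kwargs)
        # A real synchronized layer would combine statistics across workers here.
        return mean, var


def old_override_fails():
    """Return True if the fixed-signature override rejects the new keyword."""
    try:
        OldSyncNorm().call([1.0, 2.0, 3.0], mask=[True, True, True])
    except TypeError:
        return True
    return False
```

With this shape, `SyncNorm().call(values, mask=...)` keeps working when the base class starts passing a keyword the subclass never declared, while the fixed-signature override raises a `TypeError`.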

Fixes # (issue).

Review process to land

All tests and other checks must succeed.
At least one member of the technical steering committee must review and approve.
If any member of the technical steering committee requests changes, they must be addressed.

github-actions · 2022-11-18T01:28:05Z

Unit Test Results

  1 155 files +  106   1 155 suites +106 13h 17m 48s ⏱️ + 1h 26m 21s
    840 tests ±      0     787 ✔️ +    11     53 💤 -     5 0 ❌ - 6
23 767 runs +2 149 16 987 ✔️ +1 736 6 780 💤 +422 0 ❌ - 9

Results for commit 177e730. ± Comparison against base commit 811cf67.

♻️ This comment has been updated with latest results.

github-actions · 2022-11-18T01:28:21Z

Unit Test Results (with flaky tests)

  1 314 files -     63   1 314 suites - 63 14h 2m 2s ⏱️ - 9m 13s
    840 tests ±      0     786 ✔️ +    11     53 💤 -     5 1 ❌ -   6
27 184 runs - 1 816 19 024 ✔️ - 1 001 8 159 💤 - 788 1 ❌ - 27

For more details on these failures, see this check.

Results for commit 177e730. ± Comparison against base commit 811cf67.

♻️ This comment has been updated with latest results.

try to fix keras legacy optimizer Signed-off-by: TJ <tjx@nvidia.com>

EnricoMi · 2022-11-18T11:03:48Z

We are now seeing

[0]<stderr>:Node: 'sequential/conv2d/Relu'
[0]<stderr>:DNN library is not found.
[0]<stderr>:     [[{{node sequential/conv2d/Relu}}]] [Op:__inference_training_step_1359]

https://buildkite.com/horovod/horovod/builds/8663#018489d6-74e6-4fb7-b91c-b4952ab5e9a0/232-367

Shall we fix this in a separate PR?

Tixxx · 2022-11-18T18:47:14Z

> We are now seeing
>
> [0]<stderr>:Node: 'sequential/conv2d/Relu'
> [0]<stderr>:DNN library is not found.
> [0]<stderr>:     [[{{node sequential/conv2d/Relu}}]] [Op:__inference_training_step_1359]
>
> https://buildkite.com/horovod/horovod/builds/8663#018489d6-74e6-4fb7-b91c-b4952ab5e9a0/232-367
>
> Shall we fix this in a separate PR?

Found this line in the logs:
Loaded runtime CuDNN library: 8.4.1 but source was compiled with: 8.6.0.
Let me try changing the cuDNN version to 8.6. If that doesn't fix it, I will open another PR to address it.
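
A small sketch for spotting this kind of mismatch in CI logs. The helper names are hypothetical, and the check only flags a runtime cuDNN that is older than the compiled-against version; the exact compatibility rule TensorFlow applies is not stated in this thread.

```python
import re

# Log line quoted above.
LOG_LINE = "Loaded runtime CuDNN library: 8.4.1 but source was compiled with: 8.6.0."

# Captures the runtime version and the compiled-against version.
PATTERN = re.compile(
    r"Loaded runtime CuDNN library: (\d+(?:\.\d+)*) "
    r"but source was compiled with: (\d+(?:\.\d+)*)"
)


def cudnn_versions(line):
    """Return (runtime, compiled) version tuples, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    runtime, compiled = (
        tuple(int(part) for part in version.split("."))
        for version in match.groups()
    )
    return runtime, compiled


def runtime_is_older(line):
    """True when the line reports a runtime cuDNN older than the compiled-against one."""
    versions = cudnn_versions(line)
    return versions is not None and versions[0] < versions[1]
```

Running `runtime_is_older(LOG_LINE)` on the line above flags the 8.4.1 runtime as older than the 8.6.0 build.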

EnricoMi · 2022-11-18T18:51:33Z

That would be libcudnn8-dev_8.6.0.163-1+cuda11.8_amd64.deb
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/

Tixxx · 2022-11-18T19:04:38Z

> That would be libcudnn8-dev_8.6.0.163-1+cuda11.8_amd64.deb
> https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/

cuDNN 8.6 is only available as a cuda11.8 build. Should I also change the base to 11.8.0-devel-ubuntu20.04?

EnricoMi

Minor comment, LGTM!

EnricoMi · 2022-11-19T16:46:24Z

docker-compose.test.yml

-        CUDA_DOCKER_VERSION: 11.6.1-devel-ubuntu20.04
-        CUDNN_VERSION: 8.4.1.50-1+cuda11.6
+        CUDA_DOCKER_VERSION: 11.8.0-devel-ubuntu20.04
+        CUDNN_VERSION: 8.6.0.163-1+cuda11.8
         NCCL_VERSION_OVERRIDE: 2.11.4-1+cuda11.6


Should we also move NCCL to CUDA 11.8, e.g. libnccl2_2.15.5-1+cuda11.8_amd64.deb?

Yep, I think we should.

Hmm, looks like the new NCCL has some issues. I'm seeing this in the tests:
"[1]<stdout>:E tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__HorovodAllgather_device_/job:localhost/replica:0/task:0/device:GPU:0}} ncclAllGather failed: invalid argument [Op:HorovodAllgather] name: test_start.sz"
We will have to upgrade it in a separate PR then.

EnricoMi · 2022-11-20T21:29:47Z

NCCL upgrade was still worth trying. Thanks for fixing this, this is ready to be merged then.

Tixxx · 2022-11-20T21:47:51Z

> NCCL upgrade was still worth trying. Thanks for fixing this, this is ready to be merged then.

Agreed. I'm tracking it in a separate issue here.

add kwargs to support addition parameters added by keras

6cbed04

Signed-off-by: TJ <tjx@nvidia.com>

change torch nightly to use cuda 11.6

483f8ec

try to fix keras legacy optimizer Signed-off-by: TJ <tjx@nvidia.com>

EnricoMi changed the title ~~add kwargs to support addition parameters added by keras~~ Add kwargs to _moments to support additional Keras parameters Nov 18, 2022

use cudnn8.6 and cuda 11.8 to test CI

d700027

Signed-off-by: TJ <tjx@nvidia.com>

Tixxx requested review from EnricoMi and chongxiaoc November 18, 2022 23:26

chongxiaoc approved these changes Nov 18, 2022

EnricoMi approved these changes Nov 19, 2022

Tixxx added 2 commits November 19, 2022 20:37

change nccl to use cuda11.8 build

e443e2d

Signed-off-by: TJ <tjx@nvidia.com>

revert nccl upgrade change

177e730

Signed-off-by: TJ <tjx@nvidia.com>

Tixxx merged commit 0268506 into master Nov 21, 2022

Tixxx deleted the fix_ci branch November 21, 2022 04:45
