8000 Add kwargs to _moments to support additional Keras parameters by Tixxx · Pull Request #3775 · horovod/horovod · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add kwargs to _moments to support additional Keras parameters #3775

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 21, 2022
Merged

Conversation

Tixxx
Copy link
Collaborator
@Tixxx Tixxx commented Nov 17, 2022

Signed-off-by: TJ tjx@nvidia.com

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Add kwargs to support addition parameters added by keras.layers.batchnormalization.

Fixes # (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: TJ <tjx@nvidia.com>
@github-actions
Copy link
github-actions bot commented Nov 18, 2022

Unit Test Results

  1 155 files  +   106    1 155 suites  +106   13h 17m 48s ⏱️ + 1h 26m 21s
     840 tests ±       0       787 ✔️ +     11       53 💤  -     5  0  - 6 
23 767 runs  +2 149  16 987 ✔️ +1 736  6 780 💤 +422  0  - 9 

Results for commit 177e730. ± Comparison against base commit 811cf67.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
github-actions bot commented Nov 18, 2022

Unit Test Results (with flaky tests)

  1 314 files   -      63    1 314 suites   - 63   14h 2m 2s ⏱️ - 9m 13s
     840 tests ±       0       786 ✔️ +     11       53 💤  -     5  1  -   6 
27 184 runs   - 1 816  19 024 ✔️  - 1 001  8 159 💤  - 788  1  - 27 

For more details on these failures, see this check.

Results for commit 177e730. ± Comparison against base commit 811cf67.

♻️ This comment has been updated with latest results.

try to fix keras legacy optimizer

Signed-off-by: TJ <tjx@nvidia.com>
@EnricoMi EnricoMi changed the title add kwargs to support addition parameters added by keras Add kwargs to _moments to support additional Keras parameters Nov 18, 2022
@EnricoMi
Copy link
Collaborator

We are now seeing

[0]<stderr>:Node: 'sequential/conv2d/Relu'
[0]<stderr>:DNN library is not found.
[0]<stderr>:     [[{{node sequential/conv2d/Relu}}]] [Op:__inference_training_step_1359]

https://buildkite.com/horovod/horovod/builds/8663#018489d6-74e6-4fb7-b91c-b4952ab5e9a0/232-367

Shall we fix this in a separate PR?

@Tixxx
Copy link
Collaborator Author
Tixxx commented Nov 18, 2022

We are now seeing

[0]<stderr>:Node: 'sequential/conv2d/Relu'
[0]<stderr>:DNN library is not found.
[0]<stderr>:     [[{{node sequential/conv2d/Relu}}]] [Op:__inference_training_step_1359]

https://buildkite.com/horovod/horovod/builds/8663#018489d6-74e6-4fb7-b91c-b4952ab5e9a0/232-367

Shall we fix this in a separate PR?

found this line in the logs:
Loaded runtime CuDNN library: 8.4.1 but source was compiled with: 8.6.0.
Let me try changing the cudnn version to 8.6, if it doesnt fix it then i will open another pr to address it.

@EnricoMi
Copy link
Collaborator

@Tixxx
Copy link
Collaborator Author
Tixxx commented Nov 18, 2022

That would be libcudnn8-dev_8.6.0.163-1+cuda11.8_amd64.deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/

Only cuda11.8 is available with cudnn 8.6. Should I also change the base to 11.8.0-devel-ubuntu20.04 ?

Signed-off-by: TJ <tjx@nvidia.com>
@Tixxx Tixxx requested review from EnricoMi and chongxiaoc November 18, 2022 23:26
Copy link
Collaborator
@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment, LGTM!

CUDA_DOCKER_VERSION: 11.6.1-devel-ubuntu20.04
CUDNN_VERSION: 8.4.1.50-1+cuda11.6
CUDA_DOCKER_VERSION: 11.8.0-devel-ubuntu20.04
CUDNN_VERSION: 8.6.0.163-1+cuda11.8
NCCL_VERSION_OVERRIDE: 2.11.4-1+cuda11.6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also move NCCL to cuda 11.8, e.g. libnccl2_2.15.5-1+cuda11.8_amd64.deb?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, i think we should.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm looks like the new nccl has some issues, seeing this in the tests:
"[1]<stdout>:E tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__HorovodAllgather_device_/job:localhost/replica:0/task:0/device:GPU:0}} ncclAllGather failed: invalid argument [Op:HorovodAllgather] name: test_start.sz"
will have to upgrade it in another separate pr then.

Signed-off-by: TJ <tjx@nvidia.com>
Signed-off-by: TJ <tjx@nvidia.com>
@EnricoMi
Copy link
Collaborator

NCCL upgrade was still worth trying. Thanks for fixing this, this is ready to be merged then.

@Tixxx
Copy link
Collaborator Author
Tixxx commented Nov 20, 2022

NCCL upgrade was still worth trying. Thanks for fixing this, this is ready to be merged then.

Agreed. I'm tracking it in a separate issue here.

@Tixxx Tixxx merged commit 0268506 into master Nov 21, 2022
@Tixxx Tixxx deleted the fix_ci branch November 21, 2022 04:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0