8000 Re-enabling new keras optimizers by nvcastet · Pull Request #3860 · horovod/horovod · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Re-enabling new keras optimizers #3860

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Apr 17, 2023

Conversation

nvcastet
Copy link
Collaborator
@nvcastet nvcastet commented Mar 10, 2023

Description

Re-enabling TF Keras 2.11+ optimizers. Those and the legacy ones should work out the box with Horovod.
Using new optimizers with LossScaleOptimizer is broken in 2.11 but fixed in 2.12.
See discussions at keras-team/tf-keras#349
and commit at keras-team/keras@c98ea36

EDIT:
They are quite a few changes in API between the 2 optimizer versions. But hopefully this PR properly integrates Horovod to the new ones while maintaining support for the old ones.

@nvcastet nvcastet force-pushed the enable_new_keras_opts branch from cc8f797 to 43e2045 Compare March 10, 2023 21:29
@github-actions
Copy link
github-actions bot commented Mar 11, 2023

Unit Test Results

  1 156 files   -   35    1 156 suites   - 35   13h 1m 30s ⏱️ - 1h 9m 14s
     887 tests  -     1       820 ✔️ ±    0       59 💤 +    5    8  - 4 
26 018 runs   - 415  17 863 ✔️  - 862  8 141 💤 +468  14  - 5 

For more details on these failures, see this check.

Results for commit 9fe3eec. ± Comparison against base commit a418125.

This pull request removes 2 and adds 1 tests. Note that renamed tests count towards both.
test.parallel.test_tensorflow2_keras
test.parallel.test_tensorflow_keras
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_legacy_optimizer
This pull request skips 5 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

@github-actions
Copy link
github-actions bot commented Mar 11, 2023

Unit Test Results (with flaky tests)

  1 520 files   -      93    1 520 suites   - 93   15h 49m 17s ⏱️ - 1h 29m 14s
     887 tests  -        1       819 ✔️  -        1         59 💤 +    5    9  -   3 
35 055 runs   - 1 614  23 221 ✔️  - 2 392  11 791 💤 +840  43  - 14 

For more details on these failures, see this check.

Results for commit 9fe3eec. ± Comparison against base commit a418125.

This pull request removes 2 and adds 1 tests. Note that renamed tests count towards both.
test.parallel.test_tensorflow2_keras
test.parallel.test_tensorflow_keras
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_legacy_optimizer
This pull request skips 5 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

@nvcastet
Copy link
Collaborator Author
nvcastet commented Mar 24, 2023

@MrAta Maybe, you have an idea on the remaining issue. It is failing because variables in the test_gradient_aggregation_with_local_vars test is a list of lists of variables instead of a list of variables. Could we make just make it a list of variables?
https://github.com/horovod/horovod/blob/master/test/parallel/test_tensorflow2_keras.py#L428

@nvcastet
Copy link
Collaborator Author

@MrAta Maybe, you have an idea on the remaining issue. It is failing because variables in the test_gradient_aggregation_with_local_vars test is a list of lists of variables instead of a list of variables. Could we just make a list of variables? https://github.com/horovod/horovod/blob/master/test/parallel/test_tensorflow2_keras.py#L428

I think flattening the list did the trick.

@nvcastet nvcastet requested a review from maxhgerlach March 24, 2023 18:19
@nvcastet nvcastet force-pushed the enable_new_keras_opts branch from 86e0030 to dd0cfbc Compare March 24, 2023 20:13
@nvcastet
Copy link
Collaborator Author

@maxhgerlach Do you know why the build heads are failing?

@nvcastet nvcastet requested a review from chongxiaoc March 30, 2023 15:56
@@ -59,7 +54,7 @@ def main():

# Horovod: adjust learning rate based on number of GPUs.
scaled_lr = 0.001 * hvd.size()
opt = optimizers.Adam(scaled_lr)
opt = tf.optimizers.Adam(scaled_lr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but a question about tf.keras and keras compatibility in horovod:

  • After tf 2.6, tf.keras == keras.
  • Before tf 2.6, this might be an issue?
  • Or our CI already dropped support for TF2 versions < 2.6?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know exactly about the test matrix of our CI.
Technically it should work for tf < 2.6 version when using keras package. I maintained the code path for pure keras (older version of TF and Keras).

Copy link
Collaborator
@chongxiaoc chongxiaoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just not sure TF2.6- compatibilities.
Wait for head CIs to pass?

@maxhgerlach
Copy link
Collaborator

@maxhgerlach Do you know why the build heads are failing?

There are build fixes for tf-head in #3864. Ideally someone other than me should approve the PR as I added C++ changes to it.

@nvcastet nvcastet force-pushed the enable_new_keras_opts branch from dd0cfbc to 320be63 Compare April 4, 2023 16:18
@maxhgerlach
Copy link
Collaborator

Argh, there appears to be a new change on tf-head:

2023-04-04T19:34:12.6761918Z [0]<stdout>:___________ ERROR collecting test/parallel/test_tensorflow2_keras.py ___________
2023-04-04T19:34:12.6765384Z [0]<stdout>:test_tensorflow2_keras.py:32: in <module>
2023-04-04T19:34:12.6769596Z [0]<stdout>:    if version.parse(keras.__version__.replace("-tf", "+tf")) < version.parse("2.9.0"):
2023-04-04T19:34:12.6772749Z [0]<stdout>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/lazy_loader.py:59: in __getattr__
2023-04-04T19:34:12.6775470Z [0]<stdout>:    return getattr(module, item)
2023-04-04T19:34:12.6778755Z [0]<stdout>:E   AttributeError: module 'keras.api._v2.keras' has no attribute '__version__'

@nvcastet nvcastet force-pushed the enable_new_keras_opts branch from 2e694aa to 752580a Compare April 13, 2023 14:59
nvcastet added 13 commits April 17, 2023 11:17
…3822)"

This reverts commit 305280d.

Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
New Keras optimizers should work without issues with Horovod.

Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
@nvcastet nvcastet force-pushed the enable_new_keras_opts branch from 2bfef56 to ff7052c Compare April 17, 2023 16:22
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
@nvcastet
Copy link
Collaborator Author

The TF head failures are unrelated to those changes. Those can be fixed/investigated separately.

@nvcastet nvcastet merged commit 2ef8ff9 into horovod:master Apr 17, 2023
@nvcastet nvcastet deleted the enable_new_keras_opts branch April 17, 2023 22:54
@LuFinch
Copy link
LuFinch commented Apr 18, 2023

Hi, may I know when will Horovod v0.28.0 release? I use v0.27.0 now and want to use branch with this PR for TF keras 2.12.

@maxhgerlach
Copy link
Collabo 9E81 rator

Hi, may I know when will Horovod v0.28.0 release? I use v0.27.0 now and want to use branch with this PR for TF keras 2.12.

I think once the CI is updated and we can be sure tests pass with TF 2.12, we should be able to get the next release out soon. Does Horovod master work for you in the mean time, @LuFinch?

@LuFinch
Copy link
LuFinch commented Apr 20, 2023

Hi, may I know when will Horovod v0.28.0 release? I use v0.27.0 now and want to use branch with this PR for TF keras 2.12.

I think once the CI is updated and we can be sure tests pass with TF 2.12, we should be able to get the next release out soon. Does Horovod master work for you in the mean time, @LuFinch?

@maxhgerlach Thanks for the info. We are rebasing our fork repo to v0.27.0 but some models using keras optimizers meet issues, so we want to directly rebese to v0.28.0 including this PR. Glad to hear that you will release v0.28.0 soon and thanks for your jobs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Keras: Support Keras reworked optimizers in Keras 2.11+
4 participants
0