-
Notifications
You must be signed in to change notification settings - Fork 6.5k
Fix autoscaler recovery docker config to use node-specific settings #53992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix autoscaler recovery docker config to use node-specific settings #53992
Conversation
Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
@rueian could you review this one? |
@@ -1261,7 +1261,7 @@ def recover_if_needed(self, node_id, now): | |||
process_runner=self.process_runner, | |||
use_internal_ip=True, | |||
is_head_node=False, | |||
docker_config=self.config.get("docker"), | |||
docker_config=self._get_node_specific_docker_config(node_id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rueian could you review this one?
Sure. Hi @JohnsonKuan, the change of using _get_node_specific_docker_config
looks good to me.
def _get_node_specific_docker_config(self, node_id):
if "docker" not in self.config:
return {}
docker_config = copy.deepcopy(self.config.get("docker", {}))
node_specific_docker = self._get_node_type_specific_fields(node_id, "docker")
docker_config.update(node_specific_docker)
return docker_config
I think using _get_node_specific_docker_config
is correct and pretty safe. However, can we still have a test in test_autoscaler.py
? I think we could use runner.assert_has_call
or runner.command_history
to test which image it uses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rueian! Yes, absolutely. I’ve added a test in test_autoscaler.py
as suggested using runner.assert_has_call
to verify the correct image is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jjyao, this PR is ready to be merged! PTAL.
…during recovery Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
…ay-project#53992) The autoscaler's `recover_if_needed` method currently uses `self.config.get("docker")` which only retrieves the top-level docker configuration. However, regular node updates use `self._get_node_specific_docker_config(node_id)` which properly merges top-level and node-type-specific docker settings. This inconsistency means that when a node type has specific docker requirements (different image, volumes, environment variables, etc.), recovery attempts will ignore those node-type-specific settings and only use the global docker configuration. This could cause recovery to fail or behave differently than the original node setup. For example, if a node type is configured with a specific docker image or custom volumes, the recovery process will not use those settings, potentially leading to recovery failures or inconsistent behavior between initial node setup and recovery attempts. The fix ensures that recovery attempts use the same docker configuration logic as regular node updates, maintaining consistency and ensuring that node-type-specific docker settings are properly applied during recovery. Closes ray-project#53987 --------- Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
Why are these changes needed?
The autoscaler's
recover_if_needed
method currently usesself.config.get("docker")
which only retrieves the top-level docker configuration. However, regular node updates useself._get_node_specific_docker_config(node_id)
which properly merges top-level and node-type-specific docker settings.This inconsistency means that when a node type has specific docker requirements (different image, volumes, environment variables, etc.), recovery attempts will ignore those node-type-specific settings and only use the global docker configuration. This could cause recovery to fail or behave differently than the original node setup.
For example, if a node type is configured with a specific docker image or custom volumes, the recovery process will not use those settings, potentially leading to recovery failures or inconsistent behavior between initial node setup and recovery attempts.
The fix ensures that recovery attempts use the same docker configuration logic as regular node updates, maintaining consistency and ensuring that node-type-specific docker settings are properly applied during recovery.
Related issue number
Closes #53987
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.