Fix autoscaler recovery docker config to use node-specific settings by JohnsonKuan · Pull Request #53992 · ray-project/ray · GitHub

Fix autoscaler recovery docker config to use node-specific settings #53992


Merged

Conversation

JohnsonKuan
Copy link
Contributor
@JohnsonKuan JohnsonKuan commented Jun 21, 2025

Why are these changes needed?

The autoscaler's recover_if_needed method currently uses self.config.get("docker") which only retrieves the top-level docker configuration. However, regular node updates use self._get_node_specific_docker_config(node_id) which properly merges top-level and node-type-specific docker settings.

This inconsistency means that when a node type has specific docker requirements (different image, volumes, environment variables, etc.), recovery attempts will ignore those node-type-specific settings and only use the global docker configuration. This could cause recovery to fail or behave differently than the original node setup.

For example, if a node type is configured with a specific docker image or custom volumes, the recovery process will not use those settings, potentially leading to recovery failures or inconsistent behavior between initial node setup and recovery attempts.

The fix ensures that recovery attempts use the same docker configuration logic as regular node updates, maintaining consistency and ensuring that node-type-specific docker settings are properly applied during recovery.
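To make the difference concrete, the merge behavior can be sketched in plain Python. The config dict and node type name below are hypothetical, and the helper is a simplified stand-in for the logic of `_get_node_specific_docker_config`, not the exact Ray implementation:

```python
import copy

# Hypothetical cluster config, mirroring the shape of a Ray autoscaler YAML.
config = {
    "docker": {"image": "rayproject/ray:latest", "container_name": "ray"},
    "available_node_types": {
        "gpu_worker": {"docker": {"image": "rayproject/ray:latest-gpu"}},
    },
}

def node_specific_docker_config(config, node_type):
    """Merge top-level docker settings with node-type overrides
    (simplified sketch of what _get_node_specific_docker_config does)."""
    docker_config = copy.deepcopy(config.get("docker", {}))
    docker_config.update(
        config["available_node_types"].get(node_type, {}).get("docker", {})
    )
    return docker_config

# Before the fix, recovery saw only the top-level docker section:
print(config.get("docker")["image"])  # rayproject/ray:latest
# After the fix, recovery sees the node-type override merged in:
print(node_specific_docker_config(config, "gpu_worker")["image"])  # rayproject/ray:latest-gpu
```

Note that keys not overridden by the node type (here `container_name`) still fall through from the top-level docker section, so recovery keeps global defaults while honoring per-node-type overrides.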

Related issue number

Closes #53987

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
@JohnsonKuan JohnsonKuan requested a review from a team as a code owner June 21, 2025 23:04
@jjyao
Copy link
Collaborator
jjyao commented Jun 22, 2025

@rueian could you review this one?

@@ -1261,7 +1261,7 @@ def recover_if_needed(self, node_id, now):
     process_runner=self.process_runner,
     use_internal_ip=True,
     is_head_node=False,
-    docker_config=self.config.get("docker"),
+    docker_config=self._get_node_specific_docker_config(node_id),
Copy link
Contributor


> @rueian could you review this one?

Sure. Hi @JohnsonKuan, the change of using _get_node_specific_docker_config looks good to me.

    def _get_node_specific_docker_config(self, node_id):
        if "docker" not in self.config:
            return {}
        docker_config = copy.deepcopy(self.config.get("docker", {}))
        node_specific_docker = self._get_node_type_specific_fields(node_id, "docker")
        docker_config.update(node_specific_docker)
        return docker_config

I think using _get_node_specific_docker_config is correct and pretty safe. However, can we still have a test in test_autoscaler.py? I think we could use runner.assert_has_call or runner.command_history to test which image it uses.
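The testing idea above can be sketched with a minimal fake process runner that records issued commands and asserts on them. Ray's real MockProcessRunner in test_autoscaler.py works along these lines, but the class and method shapes below are illustrative, not the exact Ray test API:

```python
# Minimal sketch: record the commands a node updater would run,
# then assert that the node-type-specific docker image appears.
class FakeProcessRunner:
    def __init__(self):
        self.command_history = []

    def check_call(self, cmd):
        # Record each command as a single string, like a command history.
        self.command_history.append(" ".join(cmd))

    def assert_has_call(self, pattern):
        # Fail loudly if no recorded command contains the pattern.
        assert any(pattern in cmd for cmd in self.command_history), (
            f"{pattern!r} not found in {self.command_history}"
        )

runner = FakeProcessRunner()
# Simulate the docker run command a recovery would issue once the
# node-specific image is merged in:
runner.check_call(["docker", "run", "--name", "ray", "rayproject/ray:latest-gpu"])
runner.assert_has_call("rayproject/ray:latest-gpu")
```

In the actual test, the recovery path would populate the runner's history, and the assertion verifies the node-type image rather than the global one was used.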

Copy link
Contributor Author


Thanks @rueian! Yes, absolutely. I’ve added a test in test_autoscaler.py as suggested using runner.assert_has_call to verify the correct image is used.

Copy link
Contributor


Thanks. LGTM!

Copy link
Contributor


Hi @jjyao, this PR is ready to be merged! PTAL.

…during recovery

Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
@rueian rueian self-requested a review June 23, 2025 16:40
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jun 23, 2025
@edoakes edoakes enabled auto-merge (squash) June 23, 2025 17:10
@github-actions github-actions bot disabled auto-merge June 24, 2025 21:30
@JohnsonKuan
Copy link
Contributor Author
JohnsonKuan commented Jun 24, 2025

Hi @jjyao @edoakes, may I get your help to merge this one?

Thanks!

@edoakes edoakes merged commit 3bd20ba into ray-project:master Jun 25, 2025
5 checks passed
@JohnsonKuan JohnsonKuan deleted the fix-autoscaler-recovery-docker-config branch June 26, 2025 22:17
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…ay-project#53992)

Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] Autoscaler Node Recovery Ignores Node-Specific Docker Config
4 participants