Fix autoscaler recovery docker config to use node-specific settings by JohnsonKuan · Pull Request #53992 · ray-project/ray · GitHub

Fix autoscaler recovery docker config to use node-specific settings #53992


Merged

Conversation

JohnsonKuan
Copy link
Contributor
@JohnsonKuan JohnsonKuan commented Jun 21, 2025

Why are these changes needed?

The autoscaler's recover_if_needed method currently uses self.config.get("docker") which only retrieves the top-level docker configuration. However, regular node updates use self._get_node_specific_docker_config(node_id) which properly merges top-level and node-type-specific docker settings.

This inconsistency means that when a node type has specific docker requirements (different image, volumes, environment variables, etc.), recovery attempts will ignore those node-type-specific settings and only use the global docker configuration. This could cause recovery to fail or behave differently than the original node setup.

For example, if a node type is configured with a specific docker image or custom volumes, the recovery process will not use those settings, potentially leading to recovery failures or inconsistent behavior between initial node setup and recovery attempts.

The fix ensures that recovery attempts use the same docker configuration logic as regular node updates, maintaining consistency and ensuring that node-type-specific docker settings are properly applied during recovery.
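To make the difference concrete, the merge behavior can be sketched in plain Python. The config dict and node type name below are hypothetical, and the helper is a simplified stand-in for the logic of `_get_node_specific_docker_config`, not the exact Ray implementation:

```python
import copy

# Hypothetical cluster config, mirroring the shape of a Ray autoscaler YAML.
config = {
    "docker": {"image": "rayproject/ray:latest", "container_name": "ray"},
    "available_node_types": {
        "gpu_worker": {"docker": {"image": "rayproject/ray:latest-gpu"}},
    },
}

def node_specific_docker_config(config, node_type):
    """Merge top-level docker settings with node-type overrides
    (simplified sketch of what _get_node_specific_docker_config does)."""
    docker_config = copy.deepcopy(config.get("docker", {}))
    docker_config.update(
        config["available_node_types"].get(node_type, {}).get("docker", {})
    )
    return docker_config

# Before the fix, recovery saw only the top-level docker section:
print(config.get("docker")["image"])  # rayproject/ray:latest
# After the fix, recovery sees the node-type override merged in:
print(node_specific_docker_config(config, "gpu_worker")["image"])  # rayproject/ray:latest-gpu
```

Note that keys not overridden by the node type (here `container_name`) still fall through from the top-level docker section, so recovery keeps global defaults while honoring per-node-type overrides.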

Related issue number

Closes #53987

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
@JohnsonKuan JohnsonKuan requested a review from a team as a code owner June 21, 2025 23:04
@jjyao
Copy link
Collaborator
jjyao commented Jun 22, 2025

@rueian could you review this one?

@@ -1261,7 +1261,7 @@ def recover_if_needed(self, node_id, now):
     process_runner=self.process_runner,
     use_internal_ip=True,
     is_head_node=False,
-    docker_config=self.config.get("docker"),
+    docker_config=self._get_node_specific_docker_config(node_id),
Copy link
Contributor


> @rueian could you review this one?

Sure. Hi @JohnsonKuan, the change of using _get_node_specific_docker_config looks good to me.

    def _get_node_specific_docker_config(self, node_id):
        if "docker" not in self.config:
            return {}
        docker_config = copy.deepcopy(self.config.get("docker", {}))
        node_specific_docker = self._get_node_type_specific_fields(node_id, "docker")
        docker_config.update(node_specific_docker)
        return docker_config

I think using _get_node_specific_docker_config is correct and pretty safe. However, can we still have a test in test_autoscaler.py? I think we could use runner.assert_has_call or runner.command_history to test which image it uses.
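The testing idea above can be sketched with a minimal fake process runner that records issued commands and asserts on them. Ray's real MockProcessRunner in test_autoscaler.py works along these lines, but the class and method shapes below are illustrative, not the exact Ray test API:

```python
# Minimal sketch: record the commands a node updater would run,
# then assert that the node-type-specific docker image appears.
class FakeProcessRunner:
    def __init__(self):
        self.command_history = []

    def check_call(self, cmd):
        # Record each command as a single string, like a command history.
        self.command_history.append(" ".join(cmd))

    def assert_has_call(self, pattern):
        # Fail loudly if no recorded command contains the pattern.
        assert any(pattern in cmd for cmd in self.command_history), (
            f"{pattern!r} not found in {self.command_history}"
        )

runner = FakeProcessRunner()
# Simulate the docker run command a recovery would issue once the
# node-specific image is merged in:
runner.check_call(["docker", "run", "--name", "ray", "rayproject/ray:latest-gpu"])
runner.assert_has_call("rayproject/ray:latest-gpu")
```

In the actual test, the recovery path would populate the runner's history, and the assertion verifies the node-type image rather than the global one was used.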

Copy link
Contributor Author


Thanks @rueian! Yes, absolutely. I’ve added a test in test_autoscaler.py as suggested using runner.assert_has_call to verify the correct image is used.

Copy link
Contributor


Thanks. LGTM!

Copy link
Contributor


Hi @jjyao, this PR is ready to be merged! PTAL.

…during recovery

Signed-off-by: JohnsonKuan <johnson.h.kuan@gmail.com>
@rueian rueian self-requested a review June 23, 2025 16:40
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jun 23, 2025
@edoakes edoakes enabled auto-merge (squash) June 23, 2025 17:10
@github-actions github-actions bot disabled auto-merge June 24, 2025 21:30
@JohnsonKuan
Copy link
Contributor Author
JohnsonKuan commented Jun 24, 2025

Hi @jjyao @edoakes, may I get your help to merge this one?

Thanks!

@edoakes edoakes merged commit 3bd20ba into ray-project:master Jun 25, 2025
5 checks passed
@JohnsonKuan JohnsonKuan deleted the fix-autoscaler-recovery-docker-config branch June 26, 2025 22:17
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…ay-project#53992)

Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] Autoscaler Node Recovery Ignores Node-Specific Docker Config
4 participants