[autoscaler][core] Safe node termination

Related to #15282

Currently the autoscaler issues NodeProvider.terminate_nodes requests without interacting with Ray processes on the nodes to be terminated.
It's left to the cloud provider to stop processes on the node and de-provision the node. The time between terminate node request and Ray process termination is dependent on the implementation of NodeProvider.terminate_nodes.

In principle, this is could happen:
(1) Node passes idle timeout
(2) Autoscaler issues provider.terminate_node request
(3) Ray task is scheduled on the node to be terminated
(4) Ray processes are stopped on the node, task fails
(5) Node is de-provisioned

Ray nodes should be somehow cordoned right before the terminate_node request, preferably by a Ray-internal mechanism (as opposed to, say, CommandRunner.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions