Description
Related to #15282
Currently the autoscaler issues NodeProvider.terminate_nodes
requests without interacting with Ray processes on the nodes to be terminated.
It's left to the cloud provider to stop processes on the node and de-provision the node. The time between terminate node request and Ray process termination is dependent on the implementation of NodeProvider.terminate_nodes
.
In principle, this is could happen:
(1) Node passes idle timeout
(2) Autoscaler issues provider.terminate_node request
(3) Ray task is scheduled on the node to be terminated
(4) Ray processes are stopped on the node, task fails
(5) Node is de-provisioned
Ray nodes should be somehow cordoned right before the terminate_node request, preferably by a Ray-internal mechanism (as opposed to, say, CommandRunner.)