[autoscaler][core] Safe node termination · Issue #16975 · ray-project/ray · GitHub
[autoscaler][core] Safe node termination #16975
Open
@DmitriGekhtman

Description

Related to #15282

Currently, the autoscaler issues NodeProvider.terminate_nodes requests without interacting with the Ray processes on the nodes being terminated.
It is left to the cloud provider to stop the processes on the node and de-provision it. The time between the terminate request and Ray process termination depends on the implementation of NodeProvider.terminate_nodes.

In principle, the following sequence could happen:
(1) Node passes idle timeout
(2) Autoscaler issues provider.terminate_node request
(3) Ray task is scheduled on the node to be terminated
(4) Ray processes are stopped on the node, task fails
(5) Node is de-provisioned
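
The sequence above can be replayed with a toy model. This is only an illustrative sketch: `ToyNode`, `schedule`, and `run_race` are hypothetical names invented here, not Ray classes or functions, and the model assumes the scheduler has no visibility into the pending termination.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a Ray node; not a Ray class."""
    node_id: str
    terminating: bool = False      # known only to the autoscaler/provider
    processes_alive: bool = True
    tasks: list = field(default_factory=list)


def schedule(node: ToyNode, task: str) -> bool:
    """Toy scheduler. It never looks at `terminating`, which models the gap
    described in this issue. Returns True if the task was placed."""
    if not node.processes_alive:
        return False
    node.tasks.append(task)
    return True


def run_race() -> list:
    """Replay steps (1) to (5) and return a log of what happened."""
    events = []
    node = ToyNode("node-1")
    events.append("(1) idle timeout passed")
    node.terminating = True                      # autoscaler asks the provider
    events.append("(2) terminate_node requested")
    placed = schedule(node, "task-A")            # scheduler is unaware
    events.append(f"(3) task scheduled: {placed}")
    node.processes_alive = False                 # provider stops Ray processes
    failed = list(node.tasks)
    node.tasks.clear()
    events.append(f"(4) processes stopped, failed tasks: {failed}")
    events.append("(5) node de-provisioned")
    return events


if __name__ == "__main__":
    for line in run_race():
        print(line)
```

In this model the task is accepted at step (3) because nothing tells the scheduler that the node is on its way out, and it is lost at step (4).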

Ray nodes should be cordoned in some way (that is, prevented from accepting new tasks) immediately before the terminate_node request, preferably by a Ray-internal mechanism (as opposed to, say, CommandRunner).
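
One possible shape for a cordon-then-terminate flow is sketched below. It is a sketch under stated assumptions, not a proposed implementation: `cordon`, `uncordon`, and `is_idle` are hypothetical callbacks standing in for whatever Ray-internal mechanism might exist, and `Provider` models only the `terminate_node` call mentioned in this issue. `FakeProvider` exists purely so the example runs without a cluster.

```python
from typing import Callable, Protocol


class Provider(Protocol):
    """Minimal slice of a node provider: only the call named in this issue."""

    def terminate_node(self, node_id: str) -> None: ...


def safe_terminate(
    provider: Provider,
    node_id: str,
    cordon: Callable[[str], None],     # hypothetical: stop new task placement
    uncordon: Callable[[str], None],   # hypothetical: undo the cordon
    is_idle: Callable[[str], bool],    # hypothetical: re-check after cordoning
) -> bool:
    """Cordon the node, confirm it is still idle, then terminate it.

    Returns True if the terminate request was issued, False if the node
    turned out to be busy and was uncordoned instead.
    """
    cordon(node_id)
    if not is_idle(node_id):
        # Work arrived between the idle timeout and the cordon: back off.
        uncordon(node_id)
        return False
    provider.terminate_node(node_id)
    return True


class FakeProvider:
    """In-memory provider used only for this demonstration."""

    def __init__(self) -> None:
        self.terminated: list = []

    def terminate_node(self, node_id: str) -> None:
        self.terminated.append(node_id)


if __name__ == "__main__":
    cordoned: set = set()
    busy = {"node-2"}
    provider = FakeProvider()
    for nid in ("node-1", "node-2"):
        ok = safe_terminate(
            provider, nid, cordoned.add, cordoned.discard,
            lambda n: n not in busy,
        )
        print(nid, "terminated" if ok else "kept")
```

Re-checking idleness after the cordon is what closes the window in step (3): once the node no longer accepts new tasks, an idle result cannot be invalidated before the terminate request is sent.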

Metadata

Assignees: No one assigned

Labels:
- P2: Important issue, but not time-critical
- enhancement: Request for new feature and/or capability
- pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
