[Core] The Idle worker killing feature slows down tasks #27863
Open
@kfstorm

Description

What happened + What you expected to happen

Raylet periodically kills idle workers when the number of workers on a node exceeds the node's num_cpus. This is fine if the node only runs tasks. However, if the node also hosts actors that occupy no CPU resources, we may see many workers started and killed over time. The killing happens whenever the job's workload drops from fully occupying the node's CPUs to less than full.

Let's assume that the node's num_cpus is 4, and there is an actor with num_cpus=0 running on this node. When 4 tasks with num_cpus=1 are running on this node, there are 5 workers in total. Although 5 > 4, all workers are busy, so Raylet won't kill any of them. When the 4 tasks finish execution, 4 workers become idle; after a while (an idle worker is killed only once it has stayed idle for 1 second), Raylet kills one of the 4 workers because 5 > 4. Later, when we want to run another 4 tasks with num_cpus=1, only 3 idle workers are left, so Raylet spawns a new worker to occupy all CPU resources of the node. Now there are again 5 workers and none of them is idle.
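
A minimal sketch of this accounting, in illustrative Python (the real logic lives in the raylet's C++ worker pool; the names and structure here are assumptions, not Ray's API):

import time

SOFT_LIMIT = 4          # the node's num_cpus
IDLE_THRESHOLD_S = 1.0  # idle_worker_killing_time_threshold_ms

def workers_to_kill(workers, now):
    # workers: worker name -> time it became idle (None while busy).
    # Kill idle workers only while the total count exceeds the soft limit.
    if len(workers) <= SOFT_LIMIT:
        return []
    excess = len(workers) - SOFT_LIMIT
    idle = [name for name, idle_since in workers.items()
            if idle_since is not None and now - idle_since >= IDLE_THRESHOLD_S]
    return idle[:excess]

now = time.time()
workers = {f"task_worker_{i}": now - 2.0 for i in range(4)}  # idle for 2s
workers["actor_worker"] = None  # the num_cpus=0 actor is never idle
print(workers_to_kill(workers, now))  # -> one of the 4 task workers

With 5 registered workers against a soft limit of 4, exactly one idle worker becomes eligible each round, which is why the churn repeats on every busy/idle cycle.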

If we make the workload of the job switch between busy and idle periodically, workers will be started and killed constantly.

This behavior is harmful when tasks take a while to initialize (e.g. loading required resources, or importing a large module). Every newly started worker has to redo the initialization, only to possibly be killed later. The whole job is slowed down by idle worker killing.

Versions / Dependencies

Master (0dceddb)

Reproduction script

The script below is an extreme case. There are 4 CPUs on the node and 4 dummy actors; in each round only 4 tasks are scheduled (just enough to occupy all CPU resources), and task initialization takes 5 seconds.

The printed time cost is 78s. If we run the script with IDLE_WORKER_KILLING_ENABLED = False, the time cost drops to only 18s.

In the IDLE_WORKER_KILLING_ENABLED = True case, the log line "The worker pool has xxx registered workers which exceeds the soft limit of 4, and worker xxx with pid xxx has been idle for a a while. Kill it" appears 49 times.

import os
import time

import ray


IDLE_WORKER_KILLING_ENABLED = True

# Turn on verbose raylet logging.
os.environ["RAY_BACKEND_LOG_LEVEL"] = "debug"
system_config = {}
if not IDLE_WORKER_KILLING_ENABLED:
    # An interval of 0 disables the periodic idle worker killing.
    system_config["kill_idle_workers_interval_ms"] = 0
ray.init(num_cpus=4, object_store_memory=100 * 1024 * 1024, _system_config=system_config)


# The actor occupies no CPU resources but still counts as a registered worker,
# pushing the worker count above the soft limit of num_cpus.
@ray.remote(num_cpus=0)
class DummyActor:
    pass


@ray.remote(num_cpus=1)
def do_work():
    # Simulate a 5-second one-time initialization per worker process.
    if not hasattr(do_work, "warmup"):
        time.sleep(5)
        print(f"Warmup done. {os.getpid()}")
        do_work.warmup = True
    time.sleep(0.1)


start = time.time()

actors = [DummyActor.remote() for _ in range(4)]
for _ in range(10):
    ray.get([do_work.remote() for _ in range(4)])
    # Sleep past kill_idle_workers_interval_ms + idle_worker_killing_time_threshold_ms
    # so the idle workers become eligible for killing.
    time.sleep(1.2)

end = time.time()

print(f"Time cost: {end - start}s")

Issue Severity

Low: It annoys or frustrates me.

Metadata

Assignees

No one assigned

Labels

P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-worker, pending-cleanup (This issue is pending cleanup. It will be removed in 2 weeks after being assigned.)
