Description
What happened + What you expected to happen
Raylet periodically kills idle workers when the number of workers on a node exceeds the node's num_cpus. This is fine if we only run tasks on the node. However, if there are actors on the node and these actors are not CPU-bound, we may see workers started and killed repeatedly. The killing happens whenever the workload of a job drops from fully loaded to partially loaded.
Let's assume the node's num_cpus is 4 and there is an actor with num_cpus=0 running on this node. When 4 tasks with num_cpus=1 are running on this node, there are 5 workers in total. Although 5 > 4, all workers are busy, so Raylet won't kill any of them. When the 4 tasks finish execution, 4 workers become idle; after a while (an idle worker is killed only if it stays idle for 1 second), one of the 4 workers is killed by Raylet because 5 > 4. Later, when we want to run another 4 tasks with num_cpus=1, there are only 3 idle workers, so Raylet spawns a new worker so that all CPU resources of the node can be occupied. Now there are again 5 workers and none of them are idle.
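A minimal sketch of this sequence (same resource numbers as above; the actor and task bodies are just placeholders, and the names are made up for illustration):

import ray
import time

ray.init(num_cpus=4)

@ray.remote(num_cpus=0)
class NotCpuBound:
    pass

@ray.remote(num_cpus=1)
def task():
    time.sleep(0.1)

actor = NotCpuBound.remote()                 # 1 actor worker, holds no CPU
ray.get([task.remote() for _ in range(4)])   # 4 task workers + 1 actor worker = 5, all busy
time.sleep(2)                                # job goes idle; Raylet kills one idle worker (5 > 4)
ray.get([task.remote() for _ in range(4)])   # only 3 idle workers are left, so a new one is started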
If the workload of the job switches between busy and idle periodically, workers are started and killed constantly.
This behavior is bad when tasks take a while to initialize (e.g. loading required resources, or importing a large module). Every newly started worker has to redo the initialization, only to possibly be killed again later, so the whole job is slowed down by idle worker killing.
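For example, if a task lazily loads a large resource the first time it runs in a worker process, that cost is paid again by every replacement worker. A hypothetical sketch of the pattern (the 5-second sleep stands in for the real loading work, mirroring the warm-up trick in the reproduction script below):

import ray
import time

ray.init(num_cpus=4)

@ray.remote(num_cpus=1)
def predict(x):
    # One-time, per-worker-process initialization: only the first call in a
    # given worker pays this cost. If Raylet kills the worker, the
    # replacement worker pays it again.
    if not hasattr(predict, "model"):
        time.sleep(5)  # stands in for loading a large model or importing a heavy module
        predict.model = lambda v: v * 2
    return predict.model(x)

print(ray.get(predict.remote(21)))  # 42, after a 5s warm-up in that worker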
Versions / Dependencies
Master (0dceddb)
Reproduction script
The script below is an extreme case: the node has 4 CPUs and 4 dummy actors, each round schedules only 4 tasks (just enough to occupy all CPU resources), and the per-worker initialization takes 5 seconds.
The printed time cost is 78s. If we rerun the script with IDLE_WORKER_KILLING_ENABLED=False, the time cost is only 18s.
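As a rough sanity check of the 18s figure (the numbers come from the script below: 10 rounds, ~0.1s per task, a 1.2s sleep per round, 5s warm-up): with killing disabled, the first round costs about 5 + 0.1 + 1.2 ≈ 6.3s while the 4 workers warm up in parallel, and each of the remaining 9 rounds costs about 0.1 + 1.2 = 1.3s, i.e. roughly 6.3 + 9 × 1.3 ≈ 18s. With killing enabled, nearly every round has to re-run the 5-second warm-up on at least one freshly started worker, which is where most of the extra ~60 seconds go.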
In the IDLE_WORKER_KILLING_ENABLED=True case, the log message "The worker pool has xxx registered workers which exceeds the soft limit of 4, and worker xxx with pid xxx has been idle for a while. Kill it" occurs 49 times.
import ray
import time
import os

IDLE_WORKER_KILLING_ENABLED = True

os.environ["RAY_BACKEND_LOG_LEVEL"] = "debug"

system_config = {}
if not IDLE_WORKER_KILLING_ENABLED:
    # Setting the interval to 0 disables periodic idle worker killing.
    system_config["kill_idle_workers_interval_ms"] = 0
ray.init(num_cpus=4, object_store_memory=100 * 1024 * 1024, _system_config=system_config)


@ray.remote(num_cpus=0)
class DummyActor:
    pass


@ray.remote(num_cpus=1)
def do_work():
    # One-time per-worker warm-up that takes 5 seconds.
    if not hasattr(do_work, "warmup"):
        time.sleep(5)
        print(f"Warmup done. {os.getpid()}")
        do_work.warmup = True
    time.sleep(0.1)


start = time.time()
actors = [DummyActor.remote() for _ in range(4)]
for _ in range(10):
    ray.get([do_work.remote() for _ in range(4)])
    # kill_idle_workers_interval_ms + idle_worker_killing_time_threshold_ms
    time.sleep(1.2)
end = time.time()
print(f"Time cost: {end - start}s")
Issue Severity
Low: It annoys or frustrates me.