Description
What happened + What you expected to happen
Raylet periodically kills idle workers when the number of workers on a node exceeds the node's num_cpus. This is fine if we only run tasks on the node. However, if there are actors on the node and these actors are not CPU-bound, we may see workers started and killed repeatedly. The killing happens whenever the workload of a job drops from fully loaded to partially loaded.
Let's assume the node's num_cpus is 4 and there is an actor with num_cpus=0 running on this node. When 4 tasks with num_cpus=1 are running on this node, there are 5 workers in total. Although 5 > 4, all workers are busy, so Raylet won't kill any of them. When the 4 tasks finish execution, 4 workers become idle; after a while (an idle worker is killed only if it stays idle for 1 second), one of the 4 workers is killed by Raylet because 5 > 4. Later, when we want to run another 4 tasks with num_cpus=1, there are only 3 idle workers, so Raylet spawns a new worker so that all CPU resources of the node can be occupied. Now there are again 5 workers and none of them are idle.
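A minimal sketch of this sequence (same resource numbers as above; the actor and task bodies are just placeholders, and the names are made up for illustration):

import ray
import time

ray.init(num_cpus=4)

@ray.remote(num_cpus=0)
class NotCpuBound:
    pass

@ray.remote(num_cpus=1)
def task():
    time.sleep(0.1)

actor = NotCpuBound.remote()                 # 1 actor worker, holds no CPU
ray.get([task.remote() for _ in range(4)])   # 4 task workers + 1 actor worker = 5, all busy
time.sleep(2)                                # job goes idle; Raylet kills one idle worker (5 > 4)
ray.get([task.remote() for _ in range(4)])   # only 3 idle workers are left, so a new one is started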
If the workload of the job switches between busy and idle periodically, workers are started and killed constantly.
This behavior is bad when tasks take a while to initialize (e.g. loading required resources, or importing a large module). Every newly started worker has to redo the initialization, only to possibly be killed again later, so the whole job is slowed down by idle worker killing.
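For example, if a task lazily loads a large resource the first time it runs in a worker process, that cost is paid again by every replacement worker. A hypothetical sketch of the pattern (the 5-second sleep stands in for the real loading work, mirroring the warm-up trick in the reproduction script below):

import ray
import time

ray.init(num_cpus=4)

@ray.remote(num_cpus=1)
def predict(x):
    # One-time, per-worker-process initialization: only the first call in a
    # given worker pays this cost. If Raylet kills the worker, the
    # replacement worker pays it again.
    if not hasattr(predict, "model"):
        time.sleep(5)  # stands in for loading a large model or importing a heavy module
        predict.model = lambda v: v * 2
    return predict.model(x)

print(ray.get(predict.remote(21)))  # 42, after a 5s warm-up in that worker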
Versions / Dependencies
Master (0dceddb)
Reproduction script
The script below is an extreme case: the node has 4 CPUs and 4 dummy actors, each round schedules only 4 tasks (just enough to occupy all CPU resources), and the per-worker initialization takes 5 seconds.
The printed time cost is 78s. If we rerun the script with IDLE_WORKER_KILLING_ENABLED=False, the time cost is only 18s.
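As a rough sanity check of the 18s figure (the numbers come from the script below: 10 rounds, ~0.1s per task, a 1.2s sleep per round, 5s warm-up): with killing disabled, the first round costs about 5 + 0.1 + 1.2 ≈ 6.3s while the 4 workers warm up in parallel, and each of the remaining 9 rounds costs about 0.1 + 1.2 = 1.3s, i.e. roughly 6.3 + 9 × 1.3 ≈ 18s. With killing enabled, nearly every round has to re-run the 5-second warm-up on at least one freshly started worker, which is where most of the extra ~60 seconds go.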
In the IDLE_WORKER_KILLING_ENABLED=True case, the log message "The worker pool has xxx registered workers which exceeds the soft limit of 4, and worker xxx with pid xxx has been idle for a while. Kill it" occurs 49 times.
import ray
import time
import os

IDLE_WORKER_KILLING_ENABLED = True

os.environ["RAY_BACKEND_LOG_LEVEL"] = "debug"

system_config = {}
if not IDLE_WORKER_KILLING_ENABLED:
    # Setting the interval to 0 disables periodic idle worker killing.
    system_config["kill_idle_workers_interval_ms"] = 0
ray.init(num_cpus=4, object_store_memory=100 * 1024 * 1024, _system_config=system_config)


@ray.remote(num_cpus=0)
class DummyActor:
    pass


@ray.remote(num_cpus=1)
def do_work():
    # One-time per-worker warm-up that takes 5 seconds.
    if not hasattr(do_work, "warmup"):
        time.sleep(5)
        print(f"Warmup done. {os.getpid()}")
        do_work.warmup = True
    time.sleep(0.1)


start = time.time()
actors = [DummyActor.remote() for _ in range(4)]
for _ in range(10):
    ray.get([do_work.remote() for _ in range(4)])
    # kill_idle_workers_interval_ms + idle_worker_killing_time_threshold_ms
    time.sleep(1.2)
end = time.time()
print(f"Time cost: {end - start}s")
Issue Severity
Low: It annoys or frustrates me.