Description
What happened + What you expected to happen
After running batch tasks concurrently multiple times, the memory held by the ray::IDLE and spill workers is not released, occupying a large amount of memory in the cluster.
1. Why is the memory of the IDLE and spill workers not released after the tasks finish?
2. Why is the total memory usage shown on the dashboard (22.55GiB) much smaller than the sum of the individual IDLE and spill workers?
3. In top on the node, the RES of the IDLE and spill workers adds up to far more than 22.55GiB, yet free -h reports only 3.5GiB used. The figures from these different statistics are very confusing. What is the difference between them?
Output of the top command on the node:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
54 ray 20 0 58.0g 18.2g 17.9g S 0.7 29.9 62:09.37 raylet
19770 ray 35 15 20.8g 17.7g 17.6g S 0.0 29.1 2:22.07 ray::IDLE_Spill
19796 ray 35 15 20.8g 17.7g 17.6g S 0.0 29.1 2:20.27 ray::IDLE_Spill
19790 ray 35 15 20.8g 17.7g 17.6g S 0.3 29.0 2:22.18 ray::IDLE_Spill
19765 ray 35 15 20.8g 11.5g 11.5g S 0.3 18.9 2:16.03 ray::IDLE_Spill
27799 ray 35 15 20.8g 11.1g 11.1g S 0.0 18.2 1:50.15 ray::IDLE_Resto
28304 ray 35 15 20.8g 10.8g 10.8g S 0.0 17.8 1:48.80 ray::IDLE_Resto
28292 ray 35 15 20.8g 10.7g 10.6g S 0.0 17.5 1:48.42 ray::IDLE_Resto
27628 ray 35 15 20.8g 10.4g 10.4g S 0.0 17.1 1:53.71 ray::IDLE_Resto
19676 ray 35 15 22.4g 4.1g 3.9g S 0.0 6.7 2:51.97 ray::IDLE
19674 ray 35 15 22.4g 4.0g 3.9g S 0.0 6.6 2:53.72 ray::IDLE
19567 ray 35 15 22.4g 4.0g 3.8g S 0.3 6.5 2:53.83 ray::IDLE
19675 ray 35 15 22.4g 3.9g 3.8g S 0.0 6.4 2:55.70 ray::IDLE
19568 ray 35 15 22.4g 3.9g 3.7g S 0.0 6.4 2:53.99 ray::IDLE
26192 ray 35 15 22.4g 3.1g 2.9g S 0.3 5.1 4:01.70 ray::IDLE
17082 ray 35 15 22.4g 3.1g 2.9g S 0.3 5.1 4:29.38 ray::IDLE
13122 ray 35 15 22.4g 3.0g 2.8g S 0.3 5.0 4:28.34 ray::IDLE
14116 ray 35 15 22.4g 2.8g 2.7g S 0.0 4.7 2:50.36 ray::IDLE
9 ray 20 0 2846660 160868 69872 S 0.0 0.3 0:14.86 ray
208 ray 20 0 1915232 116648 33336 R 65.1 0.2 655:28.68 python
26450 ray 35 15 20.8g 85304 33344 S 0.7 0.1 1:01.80 ray::IDLE
95 ray 20 0 1649048 72180 26376 S 7.3 0.1 85:20.53 python
Output of the free -h command on the same node:
total used free shared buff/cache available
Mem: 60Gi 3.5Gi 33Gi 17Gi 24Gi 39Gi
Swap: 0B 0B 0B
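If I understand correctly, each worker's RES in top also counts the /dev/shm pages of the plasma object store that the worker has mapped, while free reports those tmpfs pages under shared/buff/cache rather than used, which might explain part of the gap. To check this on the node I can run something like the sketch below (psutil is assumed to be installed; the PID is just one of the ray::IDLE_Spill workers from the top output above):
import shutil
import psutil  # assumption: psutil is available on the node

PID = 19770  # hypothetical: one of the ray::IDLE_Spill workers from top

p = psutil.Process(PID)
mem = p.memory_full_info()
# rss    = all resident pages, including pages shared with other processes
# shared = pages shared with other processes (e.g. mapped /dev/shm objects)
# uss    = memory unique to this process (what would be freed if it exited)
print(f"rss={mem.rss / 2**30:.2f} GiB, "
      f"shared={mem.shared / 2**30:.2f} GiB, "
      f"uss={mem.uss / 2**30:.2f} GiB")

# /dev/shm is where the object store keeps objects by default on Linux;
# tmpfs usage shows up under "shared" and "buff/cache" in free, not "used".
total, used, free_ = shutil.disk_usage("/dev/shm")
print(f"/dev/shm used: {used / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
Even so, it is unclear to me which of these numbers the dashboard's 22.55GiB corresponds to.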
Versions / Dependencies
ray, version 3.0.0.dev0
Cluster deployed with KubeRay; cluster status:
Node status
---------------------------------------------------------------
Healthy:
1 wg
1 head-group
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/16.0 CPU
0.0/1.0 GPU
0B/89.41GiB memory
0B/26.71GiB object_store_memory
Demands:
(no resource demands)
Reproduction script
Run the following script concurrently multiple times:
import ray
import time

assert (
    ray.__version__ >= "2.3.0"
), f"The version of ray must be at least 2.3.0, the current version is {ray.__version__}"

ray.init()


# Two dummy stages that just sleep and pass the batch through.
class Process1:
    def __call__(self, df):
        time.sleep(0.2)
        return df


class Process2:
    def __call__(self, df):
        time.sleep(0.1)
        return df


# 500 blocks of 3x1024x1024 tensors, processed by two actor-pool stages.
ds = ray.data.range_tensor(500, shape=(3, 1024, 1024), parallelism=500)
pipe = ds.map_batches(
    Process1,
    batch_size=1,
    num_cpus=0.5,
    compute=ray.data.ActorPoolStrategy(1, 1),
).map_batches(
    Process2,
    batch_size=1,
    num_cpus=1,
    compute=ray.data.ActorPoolStrategy(1, 1),
)

# Consume the dataset batch by batch.
for batch in pipe.iter_batches(batch_size=1):
    ...
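To run the script concurrently multiple times, I use something like the following minimal launcher (the script path "repro.py" and the number of copies are placeholders):
import subprocess

# Start N copies of the reproduction script at once and wait for all of them.
N = 4
procs = [subprocess.Popen(["python", "repro.py"]) for _ in range(N)]
for p in procs:
    p.wait()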
Issue Severity
High: It blocks me from completing my task.