[Core] After the task ends, the memory of IDLE and spill workers is not released #34613
Open
@v4if

Description


What happened + What you expected to happen

After running batch tasks concurrently multiple times, the memory of the IDLE and spill workers is not released, and they end up occupying a large amount of memory in the cluster.

[screenshot omitted]

1. Why is the memory of the IDLE and spill workers not released after the task ends?
2. Why is the total occupied memory shown on the dashboard (22.55 GiB) much smaller than the sum of the individual IDLE and spill workers?
3. In top on the node, the IDLE and spill workers sum to much more than 22.55 GiB, yet free -h reports only 3.5 GiB used. The various statistics are very confusing; what is the difference between them? (A per-process PSS cross-check is sketched after the free output below.)

Output of top on the node:

   54 ray       20   0   58.0g  18.2g  17.9g S   0.7  29.9  62:09.37 raylet
19770 ray       35  15   20.8g  17.7g  17.6g S   0.0  29.1   2:22.07 ray::IDLE_Spill
19796 ray       35  15   20.8g  17.7g  17.6g S   0.0  29.1   2:20.27 ray::IDLE_Spill
19790 ray       35  15   20.8g  17.7g  17.6g S   0.3  29.0   2:22.18 ray::IDLE_Spill
19765 ray       35  15   20.8g  11.5g  11.5g S   0.3  18.9   2:16.03 ray::IDLE_Spill
27799 ray       35  15   20.8g  11.1g  11.1g S   0.0  18.2   1:50.15 ray::IDLE_Resto
28304 ray       35  15   20.8g  10.8g  10.8g S   0.0  17.8   1:48.80 ray::IDLE_Resto
28292 ray       35  15   20.8g  10.7g  10.6g S   0.0  17.5   1:48.42 ray::IDLE_Resto
27628 ray       35  15   20.8g  10.4g  10.4g S   0.0  17.1   1:53.71 ray::IDLE_Resto
19676 ray       35  15   22.4g   4.1g   3.9g S   0.0   6.7   2:51.97 ray::IDLE
19674 ray       35  15   22.4g   4.0g   3.9g S   0.0   6.6   2:53.72 ray::IDLE
19567 ray       35  15   22.4g   4.0g   3.8g S   0.3   6.5   2:53.83 ray::IDLE
19675 ray       35  15   22.4g   3.9g   3.8g S   0.0   6.4   2:55.70 ray::IDLE
19568 ray       35  15   22.4g   3.9g   3.7g S   0.0   6.4   2:53.99 ray::IDLE
26192 ray       35  15   22.4g   3.1g   2.9g S   0.3   5.1   4:01.70 ray::IDLE
17082 ray       35  15   22.4g   3.1g   2.9g S   0.3   5.1   4:29.38 ray::IDLE
13122 ray       35  15   22.4g   3.0g   2.8g S   0.3   5.0   4:28.34 ray::IDLE
14116 ray       35  15   22.4g   2.8g   2.7g S   0.0   4.7   2:50.36 ray::IDLE
    9 ray       20   0 2846660 160868  69872 S   0.0   0.3   0:14.86 ray
  208 ray       20   0 1915232 116648  33336 R  65.1   0.2 655:28.68 python
26450 ray       35  15   20.8g  85304  33344 S   0.7   0.1   1:01.80 ray::IDLE
   95 ray       20   0 1649048  72180  26376 S   7.3   0.1  85:20.53 python

Output of free on the node:

              total        used        free      shared  buff/cache   available
Mem:           60Gi       3.5Gi        33Gi        17Gi        24Gi        39Gi
Swap:            0B          0B          0B
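
In the top output above, RES for the IDLE and spill workers is almost entirely SHR, which suggests that most of it is shared mappings (for example the plasma object store) counted once per process rather than private worker memory; free reports that memory under shared and buff/cache rather than used. One way to cross-check the numbers is to sum PSS (proportional set size), which charges shared pages fractionally to each process that maps them. The sketch below is illustrative only (not part of the original report) and assumes a Linux kernel that exposes /proc/<pid>/smaps_rollup and read access to the Ray processes:

import os

def pss_kib(pid):
    """Return the PSS (proportional set size) of a process in KiB, or None."""
    try:
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                if line.startswith("Pss:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None

total_kib = 0
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmd = f.read().replace(b"\x00", b" ").decode(errors="replace").strip()
    except OSError:
        continue
    # Only consider raylet and Ray worker processes (their titles show up in cmdline).
    if "ray::" in cmd or "raylet" in cmd:
        pss = pss_kib(pid)
        if pss is not None:
            print(f"{pid:>7}  {pss / 1024 / 1024:6.2f} GiB  {cmd[:50]}")
            total_kib += pss

print(f"Total PSS across Ray processes: {total_kib / 1024 / 1024:.2f} GiB")

If the summed PSS lands near the dashboard figure rather than near the summed RES, the apparent discrepancy is mostly double-counted shared memory.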

Versions / Dependencies

ray, version 3.0.0.dev0
The cluster is deployed with KubeRay. Cluster info:

Node status
---------------------------------------------------------------
Healthy:
 1 wg
 1 head-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/16.0 CPU
 0.0/1.0 GPU
 0B/89.41GiB memory
 0B/26.71GiB object_store_memory

Demands:
 (no resource demands)

Reproduction script

Run the following script concurrently multiple times:

import ray
import time

assert (
    ray.__version__ >= "2.3.0"
), f"The version of ray must be at least 2.3.0, the current version is {ray.__version__}"

ray.init()


class Process1:
    def __call__(self, df):
        time.sleep(0.2)
        return df


class Process2:
    def __call__(self, df):
        time.sleep(0.1)
        return df


# 500 rows of 3x1024x1024 tensors, split into 500 blocks.
ds = ray.data.range_tensor(500, shape=(3, 1024, 1024), parallelism=500)

# Two dummy map stages backed by single-actor pools; each stage just
# sleeps briefly and returns the batch unchanged.
pipe = ds.map_batches(
    Process1,
    batch_size=1,
    num_cpus=0.5,
    compute=ray.data.ActorPoolStrategy(1, 1),
).map_batches(
    Process2,
    batch_size=1,
    num_cpus=1,
    compute=ray.data.ActorPoolStrategy(1, 1),
)

# Consume the dataset batch by batch.
for batch in pipe.iter_batches(batch_size=1):
    ...
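
For reference, here is a minimal sketch of one way to drive the script above concurrently; the file name repro.py and the number of drivers are illustrative assumptions, not part of the original report:

import subprocess

# Launch several copies of the reproduction script at once and wait for
# all of them to finish; repeating this a few times matches the
# "run concurrently multiple times" setup described above.
NUM_DRIVERS = 4
procs = [subprocess.Popen(["python", "repro.py"]) for _ in range(NUM_DRIVERS)]
for p in procs:
    p.wait()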

Issue Severity

High: It blocks me from completing my task.


Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), dashboard (issues specific to the Ray Dashboard)
