vine: reaccounting disk allocation of tasks in workers · Issue #4063 · cooperative-computing-lab/cctools

Open
tphung3 opened this issue Feb 14, 2025 · 4 comments

@tphung3
Contributor
tphung3 commented Feb 14, 2025

A worker, by default, reports the disk usage of its cache plus its tasks' disk allocations as its total disk usage to the manager. However, if a task's input files are already in the cache, their size is counted twice: once in the vine cache and again in the task's disk allocation.

For example, a worker W with 30 GB of disk is assigned a task T1 with a 20 GB disk allocation, of which 19 GB are cacheable input files. To run T1, W fetches the 19 GB of cacheable input files into its cache. W then reports to the manager a total disk usage of vine cache + task disk allocation = 19 GB + 20 GB = 39 GB, while the true usage is 19 GB (the cache) plus whatever files in T1's sandbox are not cached. As a result, the manager stops sending tasks to W even though W has room for them.
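As a minimal sketch of the double counting (hypothetical names, not the actual worker code):

```c
#include <stdint.h>

/* Sketch of the accounting described above: the worker reports its cache
 * size plus every task's full disk allocation, so inputs that already
 * live in the cache are counted once in each term. */
int64_t reported_disk_usage(int64_t cache_bytes,
                            const int64_t *task_alloc_bytes, int ntasks)
{
	int64_t total = cache_bytes;          /* 19 GB of cached inputs */
	for (int i = 0; i < ntasks; i++)
		total += task_alloc_bytes[i]; /* + T1's full 20 GB      */
	return total;                         /* reports 39 GB, though the
	                                         true usage is about 20 GB */
}
```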

To fix this problem, when the manager matches a task to a worker, it should reduce the task's disk allocation by the size of the input files already cached there. In the example above, the manager should adjust T1's disk allocation from 20 GB to (20 - 19) = 1 GB.
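A sketch of that match-time adjustment, using hypothetical types and helpers rather than the actual taskvine structures:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical records; the real manager keeps richer structures. */
struct input_file { const char *cachename; int64_t size; };
struct worker_cache { const char **names; int count; };

static int cache_contains(const struct worker_cache *c, const char *name)
{
	for (int i = 0; i < c->count; i++)
		if (strcmp(c->names[i], name) == 0)
			return 1;
	return 0;
}

/* Proposed fix: before matching, subtract the size of the task's inputs
 * that the worker already caches from the task's disk allocation. */
int64_t adjusted_disk_allocation(int64_t task_disk_bytes,
                                 const struct input_file *inputs, int ninputs,
                                 const struct worker_cache *cache)
{
	int64_t cached_bytes = 0;
	for (int i = 0; i < ninputs; i++)
		if (cache_contains(cache, inputs[i].cachename))
			cached_bytes += inputs[i].size;
	return task_disk_bytes - cached_bytes; /* 20 GB - 19 GB = 1 GB for T1 */
}
```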

Points of contact: @tphung3 @colinthomas-z80

@dthain
Member
dthain commented Feb 14, 2025

Again, this is just another example of how we are not carefully adhering to a consistent underlying model of storage management.

At the worker:

  • Input files go in the cache.
  • Input files are linked into sandboxes.
  • Sandboxes contain only intermediate and output files.

And so:

  • The sandbox allocation has nothing to do with the size of input files. It only contains intermediate and output files.
  • The worker's storage consumption is the size of the cache plus the sum of all sandboxes.

However:

  • When the manager wants to send a task to a worker, it should check that the available space is big enough for the sandbox PLUS the size of input files not already present. (And note that this is not the same as just making the sandbox bigger; see the sketch below.)
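A minimal sketch of that check, assuming the model above and hypothetical names ("not already present" meaning inputs missing from the worker's cache):

```c
#include <stdint.h>

/* Under the model above, a worker's consumption is cache + sandboxes.
 * A new task needs space for its sandbox (intermediate and output files
 * only) plus whichever of its inputs still have to be fetched. */
int task_fits_on_worker(int64_t worker_disk_total,
                        int64_t worker_disk_used, /* cache + all sandboxes */
                        int64_t sandbox_bytes,
                        int64_t uncached_input_bytes)
{
	int64_t needed = sandbox_bytes + uncached_input_bytes;
	return (worker_disk_total - worker_disk_used) >= needed;
}
```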

@dthain
Member
dthain commented Feb 14, 2025

To be clear:

The manager should not take the user's sandbox size of 20 GB and reduce it by 19 GB. The sandbox size should have been 1 GB in the first place, and the manager should then further account for the size of the input files needed.

@JinZhou5042
Member

Does #4060 seem relevant?

@tphung3
Contributor Author
tphung3 commented Feb 14, 2025

Does #4060 seem relevant?

No, I don't think so. I think this issue is tangential to whether or not there are replicas.
