vine: reaccounting disk allocation of tasks in workers · Issue #4063 · cooperative-computing-lab/cctools

Open
tphung3 opened this issue Feb 14, 2025 · 4 comments

@tphung3
Contributor
tphung3 commented Feb 14, 2025

A worker, by default, reports the disk usage of its cache plus its tasks' disk allocations as its total disk usage to the manager. However, if a task's input files are already in the cache, their size is counted twice: once in the vine cache and again in the task's disk allocation.

For example, a worker W with 30 GB of disk is assigned a task T1 with a 20 GB disk allocation, of which 19 GB are cacheable input files. To run T1, W fetches the 19 GB of cacheable input files into its cache. W then reports to the manager a total disk usage of vine cache + task disk allocation = 19 GB + 20 GB = 39 GB, while the true usage is 19 GB (the cache) plus whatever files in T1's sandbox are not cached. As a result, the manager stops sending tasks to W even though W has room for them.
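As a minimal sketch of the double counting (hypothetical names, not the actual worker code):

```c
#include <stdint.h>

/* Sketch of the accounting described above: the worker reports its cache
 * size plus every task's full disk allocation, so inputs that already
 * live in the cache are counted once in each term. */
int64_t reported_disk_usage(int64_t cache_bytes,
                            const int64_t *task_alloc_bytes, int ntasks)
{
	int64_t total = cache_bytes;          /* 19 GB of cached inputs */
	for (int i = 0; i < ntasks; i++)
		total += task_alloc_bytes[i]; /* + T1's full 20 GB      */
	return total;                         /* reports 39 GB, though the
	                                         true usage is about 20 GB */
}
```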

To fix this problem, when the manager matches a task to a worker, it should reduce the task's disk allocation by the size of the input files already cached there. In the example above, the manager should adjust T1's disk allocation from 20 GB to (20 - 19) = 1 GB.
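A sketch of that match-time adjustment, using hypothetical types and helpers rather than the actual taskvine structures:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical records; the real manager keeps richer structures. */
struct input_file { const char *cachename; int64_t size; };
struct worker_cache { const char **names; int count; };

static int cache_contains(const struct worker_cache *c, const char *name)
{
	for (int i = 0; i < c->count; i++)
		if (strcmp(c->names[i], name) == 0)
			return 1;
	return 0;
}

/* Proposed fix: before matching, subtract the size of the task's inputs
 * that the worker already caches from the task's disk allocation. */
int64_t adjusted_disk_allocation(int64_t task_disk_bytes,
                                 const struct input_file *inputs, int ninputs,
                                 const struct worker_cache *cache)
{
	int64_t cached_bytes = 0;
	for (int i = 0; i < ninputs; i++)
		if (cache_contains(cache, inputs[i].cachename))
			cached_bytes += inputs[i].size;
	return task_disk_bytes - cached_bytes; /* 20 GB - 19 GB = 1 GB for T1 */
}
```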

Points of contact: @tphung3 @colinthomas-z80

@dthain
Member
dthain commented Feb 14, 2025

Again, this is just another example of how we are not carefully adhering to a consistent underlying model of storage management.

At the worker:

  • Input files go in the cache.
  • Input files are linked into sandboxes.
  • Sandboxes contain only intermediate and output files.

And so:

  • The sandbox allocation has nothing to do with the size of input files. It only contains intermediate and output files.
  • The worker's storage consumption is the size of the cache plus the sum of all sandboxes.

However:

  • When the manager wants to send a task to a worker, it should check that the available space is big enough for the sandbox PLUS the size of input files not already present. (And note that this is not the same as just making the sandbox bigger; see the sketch below.)
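A minimal sketch of that check, assuming the model above and hypothetical names ("not already present" meaning inputs missing from the worker's cache):

```c
#include <stdint.h>

/* Under the model above, a worker's consumption is cache + sandboxes.
 * A new task needs space for its sandbox (intermediate and output files
 * only) plus whichever of its inputs still have to be fetched. */
int task_fits_on_worker(int64_t worker_disk_total,
                        int64_t worker_disk_used, /* cache + all sandboxes */
                        int64_t sandbox_bytes,
                        int64_t uncached_input_bytes)
{
	int64_t needed = sandbox_bytes + uncached_input_bytes;
	return (worker_disk_total - worker_disk_used) >= needed;
}
```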

@dthain
Member
dthain commented Feb 14, 2025

To be clear:

The manager should not take the user's sandbox size of 20 GB and reduce it by 19 GB. The sandbox size should have been 1 GB in the first place, and the manager should then further account for the size of the input files needed.

@JinZhou5042
Member

Does #4060 seem relevant?

@tphung3
Contributor Author
tphung3 commented Feb 14, 2025

Does #4060 seem relevant?

No, I don't think so. I think this issue is tangential to whether or not there are replicas.
