vine: cache-invalid associated files must be removed · Issue #4134 · cooperative-computing-lab/cctools · GitHub

vine: cache-invalid associated files must be removed #4134


Closed
JinZhou5042 opened this issue Apr 22, 2025 · 4 comments
Labels
bug For modifications that fix a flaw in the code. TaskVine

Comments

@JinZhou5042
Member
JinZhou5042 commented Apr 22, 2025

This is another cause of workflow slowdown.

If a transfer fails with a cache-invalid message, it means the source worker is unable to fetch the file from the destination worker, indicating that the file is corrupted or missing on either the source or the destination.

To prevent future replication attempts or task executions from using this invalid file, we need to remove it from both the source and destination workers.

Otherwise, the manager may get stuck trying to use what it thinks are valid files.

Specifically, we should:

  • remove the replica from vine_file_replica_table
  • explicitly clean the file by sending an unlink message

Calling delete_worker_file seems appropriate.
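
The two steps above can be sketched as a toy model. This is Python for brevity, not the manager's C code: the `Manager` class, its `replica_table` dict, and the `send` callback are hypothetical stand-ins for vine_file_replica_table and the manager-to-worker connection, and the exact format of the unlink message is assumed.

```python
from collections import defaultdict


class Manager:
    """Toy model of the manager's replica bookkeeping (not TaskVine's real API)."""

    def __init__(self, send):
        # cachename -> set of worker ids believed to hold a valid replica
        self.replica_table = defaultdict(set)
        # send(worker_id, message) stands in for the manager-to-worker link
        self.send = send

    def add_replica(self, worker, cachename):
        self.replica_table[cachename].add(worker)

    def delete_worker_file(self, worker, cachename):
        # Step 1: forget the replica so it is never chosen as a source again.
        self.replica_table[cachename].discard(worker)
        if not self.replica_table[cachename]:
            del self.replica_table[cachename]
        # Step 2: ask the worker to remove whatever it has on disk.
        # The message format here is an assumption.
        self.send(worker, f"unlink {cachename}")

    def handle_cache_invalid(self, source, destination, cachename):
        # Either end may hold the bad copy, so clean both.
        for worker in (source, destination):
            self.delete_worker_file(worker, cachename)
```

In this model, replicas on workers that were not part of the failed transfer stay in the table, so later replication attempts can still draw on them.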

@JinZhou5042 JinZhou5042 added bug For modifications that fix a flaw in the code. TaskVine labels Apr 22, 2025
@btovar
Member
btovar commented Apr 22, 2025

Also, note that we count transfer failures and disconnect workers after some threshold. Is this not working as intended?

@btovar
Member
btovar commented Apr 22, 2025

We don't want to immediately take action on a cache-invalid message, as some connection errors are transient.
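
One way to tolerate transient errors while still cleaning up eventually is to act only after repeated failures on the same replica. A rough sketch in Python, where the `InvalidTracker` class, its method names, and the threshold of 3 are all hypothetical and not taken from the manager's code:

```python
from collections import Counter


class InvalidTracker:
    """Counts cache-invalid reports per (worker, file) before acting."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # arbitrary value chosen for illustration
        self.failures = Counter()

    def report_invalid(self, worker, cachename):
        """Record one failure; return True once the replica should be removed."""
        key = (worker, cachename)
        self.failures[key] += 1
        if self.failures[key] >= self.threshold:
            del self.failures[key]
            return True
        return False

    def report_success(self, worker, cachename):
        # A successful transfer suggests earlier errors were transient.
        self.failures.pop((worker, cachename), None)
```

A single failed transfer leaves the replica alone; only a run of failures with no success in between triggers removal.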

@JinZhou5042
Member Author

Yes, I agree that aggressively unlinking files upon cache-invalid may conceal the root cause. However, in my case, this fix is necessary for the workflow to complete when the replica count is set to 10. Without it, the workflow slows down significantly and never finishes.

@JinZhou5042
Member Author

See #4152
