-
Notifications
You must be signed in to change notification settings - Fork 177
fix: Taps running in parallel use a different temporary catalog file #8794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: Taps running in parallel use a different temporary catalog file #8794
Conversation
✅ Deploy Preview for meltano canceled.
|
@@ -233,7 +233,7 @@ def config_files(self): # noqa: ANN201 | |||
"""Get the configuration files for this tap.""" | |||
return { | |||
"config": f"tap.{self.instance_uuid}.config.json", | |||
"catalog": "tap.properties.json", | |||
"catalog": f"tap.{self.instance_uuid}.properties.json", | |||
"catalog_cache_key": "tap.properties.cache_key", | |||
"state": "state.json", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was checking up on the other issue and was curious about the fix. Similar but unrelated, I'm wonder
8000
ing if state.json here also needs a fix? It seems to be the same state file that is stored in azure when using different state backend. However, the difference I've noticed is that on azure, it actually gets split out based on --state-id-suffix
provided to meltano run
. I suppose something similar would happen if running meltano el
with --state-id
set. But locally, it also maintains this state but just in a fixed location. Is it also intended for it to still maintain this local state file when using different state backend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't know if we should keep the local state.json
after the pipeline finishes 🤔.
FWIW I don't think this PR is currently on the right path. A more correct approach may be to have a dedicated directory per run. That way, plugins could run in parallel without clashing with each other's files:
tap-example --config /path/to/run-1/tap.config.json --state /path/to/run-1/state.json --catalog /path/to/run-1/tap.properties.json
tap-example --config /path/to/run-2/tap.config.json --state /path/to/run-2/state.json --catalog /path/to/run-2/tap.properties.json
During the preparation phase, we'd
- dump resolved config to
/path/to/run-1/tap.config.json
- get
/path/to/run-1/state.json
from the state backend - copy
/path/to/run-1/tap.properties.json
from cache, or regenerate it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I did mention that as a possibility as well, but I'm not familiar with what's involved to do all of that. The idea is that it should use the run uuid to create a unique folder to use. You theoretically would get the same behavior if all of the files had the uuid as well, just a matter of what the desired folder structure is. I believe the local state does get removed once the run is done -- I just wasn't sure if the local state is needed if it's kept in a different backend anyway, but I guess it does need to be there in order to upload? In that case, maybe the folder uuid is still ultimately better because that way, the state file wouldn't need to be renamed. This does seem to get complicated, though, because you would want to reuse the same state somehow across runs.
Maybe it should read the state from whatever the state backend is, regardless of whether that is from cloud or local, in a central location without the uuid. Then, the execution will create the state in a uuid folder, and finally update the state backend after completion? I'm also not sure how state gets merged if there are multiple processes that end up writing to the same file in the state backend that started in parallel. For us, we maintain a separate state file per table using --state-id-suffix
, so it's not really an issue.
Or I might just be overthinking this whole thing.
There was a problem hiding this comment.
8000Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, though the failing tests point to an assumption I'm violating here 🤔
I believe we need to handle the I am not sure whether the targeted changes are just for the |
Description
Related Issues