8000 fix: Taps running in parallel use a different temporary catalog file by edgarrmondragon · Pull Request #8794 · meltano/meltano · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

fix: Taps running in parallel use a different temporary catalog file #8794

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

edgarrmondragon
Copy link
Collaborator
@edgarrmondragon edgarrmondragon commented Sep 23, 2024

Copy link
netlify bot commented Sep 23, 2024

Deploy Preview for meltano canceled.

Name Link
🔨 Latest commit 7e27411
🔍 Latest deploy log https://app.netlify.com/sites/meltano/deploys/66f1c36fe7b12f00089fb43a

@@ -233,7 +233,7 @@ def config_files(self): # noqa: ANN201
"""Get the configuration files for this tap."""
return {
"config": f"tap.{self.instance_uuid}.config.json",
"catalog": "tap.properties.json",
"catalog": f"tap.{self.instance_uuid}.properties.json",
"catalog_cache_key": "tap.properties.cache_key",
"state": "state.json",
Copy link
@dluo-sig dluo-sig Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was checking up on the other issue and was curious about the fix. Similar but unrelated, I'm wonder 8000 ing if state.json here also needs a fix? It seems to be the same state file that is stored in azure when using different state backend. However, the difference I've noticed is that on azure, it actually gets split out based on --state-id-suffix provided to meltano run. I suppose something similar would happen if running meltano el with --state-id set. But locally, it also maintains this state but just in a fixed location. Is it also intended for it to still maintain this local state file when using different state backend?

Copy link
Collaborator Author
@edgarrmondragon edgarrmondragon Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't know if we should keep the local state.json after the pipeline finishes 🤔.

FWIW I don't think this PR is currently on the right path. A more correct approach may be to have a dedicated directory per run. That way, plugins could run in parallel without clashing with each other's files:

tap-example --config /path/to/run-1/tap.config.json --state /path/to/run-1/state.json --catalog /path/to/run-1/tap.properties.json
tap-example --config /path/to/run-2/tap.config.json --state /path/to/run-2/state.json --catalog /path/to/run-2/tap.properties.json

During the preparation phase, we'd

  • dump resolved config to /path/to/run-1/tap.config.json
  • get /path/to/run-1/state.json from the state backend
  • copy /path/to/run-1/tap.properties.json from cache, or regenerate it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I did mention that as a possibility as well, but I'm not familiar with what's involved to do all of that. The idea is that it should use the run uuid to create a unique folder to use. You theoretically would get the same behavior if all of the files had the uuid as well, just a matter of what the desired folder structure is. I believe the local state does get removed once the run is done -- I just wasn't sure if the local state is needed if it's kept in a different backend anyway, but I guess it does need to be there in order to upload? In that case, maybe the folder uuid is still ultimately better because that way, the state file wouldn't need to be renamed. This does seem to get complicated, though, because you would want to reuse the same state somehow across runs.

Maybe it should read the state from whatever the state backend is, regardless of whether that is from cloud or local, in a central location without the uuid. Then, the execution will create the state in a uuid folder, and finally update the state backend after completion? I'm also not sure how state gets merged if there are multiple processes that end up writing to the same file in the state backend that started in parallel. For us, we maintain a separate state file per table using --state-id-suffix, so it's not really an issue.

Or I might just be overthinking this whole thing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, though the failing tests point to an assumption I'm violating here 🤔

@s7clarke10
Copy link

I believe we need to handle the meltano run command with a --state-id-suffix or use a meltano el command with a --state-id to provide uniqueness regardless of the command which is issued.

I am not sure whether the targeted changes are just for the meltano run command or whether it will help with the meltano el commands as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4359
Development

Successfully merging this pull request may close these issues.

feature: Catalog extra still puts a properties.json file in .meltano
3 participants
0