-
Notifications
You must be signed in to change notification settings - Fork 156
fix: Purge rule table on index build failure #2538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Purge rule table on index build failure #2538
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at this on a mobile at the moment so apologies for asking questions instead of testing them out. Don't we have a similar problem with database (i.e. non-indexed) stores as well? It's possible to save an invalid policy to a database as well so how are those stores immune from the problem? If they are, can we use the same mechanism for indexed stores as well?
Yes! I've been working on this along with another issue related to mutable stores, I'm going to bundle up both fixes into this PR--hopefully won't be long. Apologies, forgot to put this into Draft. |
I've checked this manually against disk, git, blob, and all three mutable stores. All now handle invalid indexes correctly. Non-mutable stores continue to return/act on the previous valid index state, mutable stores return errors until the policy set is fixed (I presume this was pre-existing behaviour?). All recover and start returning correct results once any invalid policies have been resolved. In addition, this solves a different case that was raised regarding mutable stores in this community Slack thread.
It seems to boil down to differing failure conditions for each store. For mutable stores, the compile manager forwards on the event to any subscribers including the rule table regardless of compile failures, whereas the other stores fail in different places and don't forward on the event (disk and git stores fail here for example, and return early, missing the event forwarding below). There's every chance that it's safe to forward on the event under all circumstances and that would allow us to remove this new event type, but I'm anxious to do that in a rush in case there are other hidden implications. Perhaps, if this change looked good, we could merge this now to free up production, and I could investigate tidying this up in a future PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the dilemma with the differing behaviour of stores and the way they deal with them. However, since we are kind of "doing surgery" anyway by introducing a new event as a stop gap measure, it's probably worth taking a look at fixing the underlying issue itself? I don't think we have a show-stopper that requires an immediate fix in prod so we can take a bit more time to sort things out.
From a cursory look, I think it might be the case that we could forward on the event from all sources as long as we add a new field to the event to indicate the state of the store (compiling/not-compiling/unknown -- I think it needs to be ternary because not all stores compile policies on the fly). The subscribers can inspect that field if they are doing anything that requires a valid store. It should be fairly easy to find all the subscription points by looking up implementations of OnStorageEvent
so I am fairly confident that we can locate and patch all the places 🤞🏽 WDYT?
😆 I entirely agree! I'd thought that the fix was more time pressured given it's in production, but if that's not the case then absolutely, I'll dig in. There's a chance that the rule table might not need to know the state of the index health and forwarding events might be enough. Let's see 🤷♂️ |
d6cc413
to
755dbe6
Compare
I don't think there was need for ternary state. In practice, only the schema manager and the rule table subscribed to the compile manager. I handled appropriately in the rule table and bypassed those events in the schema manager to maintain behaviour. |
755dbe6
to
4e689a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a subtle bug in the schema manager. Other than that, LGTM 👍🏽
When an index build fails, ensure the rule table is purged to avoid operating on stale data. Adds EventDisableRuleTable event type and makes rule table rebuild from scratch on next policy event. Signed-off-by: Sam Lock <sam@swlock.co.uk>
Signed-off-by: Sam Lock <sam@swlock.co.uk>
* emit disable rule table event for invalid mutable stores * fix invalid query generation for what should be no-ops Signed-off-by: Sam Lock <sam@swlock.co.uk>
Signed-off-by: Sam Lock <sam@swlock.co.uk>
…nel in favour of atomic.Bool Signed-off-by: Sam Lock <sam@swlock.co.uk>
…n existing behaviour Signed-off-by: Sam Lock <sam@swlock.co.uk>
Signed-off-by: Sam Lock <sam@swlock.co.uk>
ffd7f63
to
933cd2c
Compare
When an index build fails, ensure the rule table is purged to avoid operating on stale data. Adds EventDisableRuleTable event type and makes rule table rebuild from scratch on next policy event.
Also handles an issue in mutable stores where intended no-ops were generating invalid queries and breaking evaluations.