fix: Purge rule table on index build failure #2538

Sambigeara · 2025-04-02T08:22:42Z

When an index build fails, ensure the rule table is purged to avoid operating on stale data. Adds EventDisableRuleTable event type and makes rule table rebuild from scratch on next policy event.
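As a rough illustration of the behaviour described above, the following sketch shows a subscriber that empties its table on a disable event and reloads everything on the next policy event. Only the `EventDisableRuleTable` and `OnStorageEvent` names come from this PR; the `Event` struct, the `source` callback, the other event kind and the method signature are assumptions made for the example and are not the actual Cerbos types.

```go
package main

import (
	"fmt"
	"sync"
)

// EventKind is a hypothetical stand-in for the storage event kinds.
type EventKind int

const (
	EventAddOrUpdatePolicy EventKind = iota // assumed name for a regular policy event
	EventDisableRuleTable                   // name taken from the PR description
)

// Event is an illustrative event shape, not the real storage event type.
type Event struct {
	Kind     EventKind
	PolicyID string
}

// RuleTable is a toy table mapping policy IDs to rule strings.
type RuleTable struct {
	mu     sync.Mutex
	rules  map[string][]string
	purged bool
	source func() map[string][]string // hypothetical loader for the full policy set
}

func NewRuleTable(source func() map[string][]string) *RuleTable {
	return &RuleTable{rules: snapshot(source()), source: source}
}

// snapshot copies the map so the table never aliases the store's map.
func snapshot(in map[string][]string) map[string][]string {
	out := make(map[string][]string, len(in))
	for id, rules := range in {
		out[id] = rules
	}
	return out
}

// OnStorageEvent purges on a disable event and rebuilds from scratch on the
// first policy event that follows a purge.
func (rt *RuleTable) OnStorageEvent(events ...Event) {
	rt.mu.Lock()
	defer rt.mu.Unlock()
	for _, evt := range events {
		switch evt.Kind {
		case EventDisableRuleTable:
			rt.rules = map[string][]string{}
			rt.purged = true
		case EventAddOrUpdatePolicy:
			if rt.purged {
				rt.rules = snapshot(rt.source())
				rt.purged = false
				continue
			}
			// Incremental update while the table is healthy.
			if rules, ok := rt.source()[evt.PolicyID]; ok {
				rt.rules[evt.PolicyID] = rules
			}
		}
	}
}

func (rt *RuleTable) Len() int {
	rt.mu.Lock()
	defer rt.mu.Unlock()
	return len(rt.rules)
}

func main() {
	store := map[string][]string{"a": {"allow read"}}
	rt := NewRuleTable(func() map[string][]string { return store })

	rt.OnStorageEvent(Event{Kind: EventDisableRuleTable})
	fmt.Println("after purge:", rt.Len())

	store["b"] = []string{"deny write"}
	rt.OnStorageEvent(Event{Kind: EventAddOrUpdatePolicy, PolicyID: "b"})
	fmt.Println("after rebuild:", rt.Len())
}
```

The point of the `purged` flag is that a partial, incremental update is never applied on top of an emptied table; the first event after a purge always triggers a full reload.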

Also handles an issue in mutable stores where intended no-ops were generating invalid queries and breaking evaluations.

charithe

Just looking at this on a mobile at the moment so apologies for asking questions instead of testing them out. Don't we have a similar problem with database (i.e. non-indexed) stores as well? It's possible to save an invalid policy to a database as well so how are those stores immune from the problem? If they are, can we use the same mechanism for indexed stores as well?

Sambigeara · 2025-04-03T09:28:05Z

> Just looking at this on a mobile at the moment so apologies for asking questions instead of testing them out. Don't we have a similar problem with database (i.e. non-indexed) stores as well? It's possible to save an invalid policy to a database as well so how are those stores immune from the problem? If they are, can we use the same mechanism for indexed stores as well?

Yes! I've been working on this along with another issue related to mutable stores, I'm going to bundle up both fixes into this PR--hopefully won't be long. Apologies, forgot to put this into Draft.

Sambigeara · 2025-04-03T10:02:47Z

I've checked this manually against disk, git, blob, and all three mutable stores. All now handle invalid indexes correctly. Non-mutable stores continue to return/act on the previous valid index state, mutable stores return errors until the policy set is fixed (I presume this was pre-existing behaviour?). All recover and start returning correct results once any invalid policies have been resolved.

In addition, this solves a different case that was raised regarding mutable stores in this community Slack thread.

> Don't we have a similar problem with database (i.e. non-indexed) stores as well? It's possible to save an invalid policy to a database as well so how are those stores immune from the problem?

It seems to boil down to differing failure conditions for each store. For mutable stores, the compile manager forwards on the event to any subscribers including the rule table regardless of compile failures, whereas the other stores fail in different places and don't forward on the event (disk and git stores fail here for example, and return early, missing the event forwarding below).
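To make the difference concrete, here is a minimal sketch of the two failure paths as described in the comment above. The function names, the string-typed events and the `recorder` subscriber are all hypothetical; the real stores and compile manager are structured differently, and this only models the control flow.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscriber is a simplified stand-in for anything that receives storage events.
type Subscriber interface {
	OnStorageEvent(events ...string)
}

// recorder remembers every event it was sent.
type recorder struct{ seen []string }

func (r *recorder) OnStorageEvent(events ...string) {
	r.seen = append(r.seen, events...)
}

var errBuild = errors.New("index build failed")

// indexedStoreReload models the early-return path described for disk and git
// stores: a failed build returns before the event is forwarded.
func indexedStoreReload(build func() error, sub Subscriber, evt string) error {
	if err := build(); err != nil {
		return err // subscribers never hear about the change
	}
	sub.OnStorageEvent(evt)
	return nil
}

// mutableStoreUpdate models the path described for mutable stores: the event
// reaches subscribers regardless of whether compilation succeeded.
func mutableStoreUpdate(compile func() error, sub Subscriber, evt string) error {
	err := compile()
	sub.OnStorageEvent(evt) // forwarded even when err != nil
	return err
}

func main() {
	failing := func() error { return errBuild }

	indexed, mutable := &recorder{}, &recorder{}
	_ = indexedStoreReload(failing, indexed, "policy-changed")
	_ = mutableStoreUpdate(failing, mutable, "policy-changed")

	fmt.Println("indexed subscriber saw:", len(indexed.seen), "event(s)")
	fmt.Println("mutable subscriber saw:", len(mutable.seen), "event(s)")
}
```

In this model a subscriber behind the early-return path keeps whatever state it had before the failed build, which is the stale-data risk the PR is addressing.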

There's every chance that it's safe to forward on the event under all circumstances and that would allow us to remove this new event type, but I'm anxious to do that in a rush in case there are other hidden implications. Perhaps, if this change looked good, we could merge this now to free up production, and I could investigate tidying this up in a future PR?

charithe

I see the dilemma with the differing behaviour of stores and the way they deal with them. However, since we are kind of "doing surgery" anyway by introducing a new event as a stop gap measure, it's probably worth taking a look at fixing the underlying issue itself? I don't think we have a show-stopper that requires an immediate fix in prod so we can take a bit more time to sort things out.

From a cursory look, I think it might be the case that we could forward on the event from all sources as long as we add a new field to the event to indicate the state of the store (compiling/not-compiling/unknown -- I think it needs to be ternary because not all stores compile policies on the fly). The subscribers can inspect that field if they are doing anything that requires a valid store. It should be fairly easy to find all the subscription points by looking up implementations of OnStorageEvent so I am fairly confident that we can locate and patch all the places 🤞🏽 WDYT?
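A minimal sketch of that proposal, assuming a three-valued state field carried on each event. The `IndexState` type, its constant names, the `Event` shape and the `derivedCache` subscriber are invented for illustration; only `OnStorageEvent` and the compiling/not-compiling/unknown split come from the comment above.

```go
package main

import "fmt"

// IndexState is a hypothetical three-valued field describing the store.
type IndexState int

const (
	IndexStateUnknown      IndexState = iota // store does not compile policies on the fly
	IndexStateCompiling                      // the policy set compiles
	IndexStateNotCompiling                   // at least one policy fails to compile
)

// Event is an illustrative event carrying the store state alongside the change.
type Event struct {
	PolicyID string
	State    IndexState
}

// derivedCache is a toy subscriber that needs a valid store to be useful.
type derivedCache struct {
	entries map[string]bool
}

// OnStorageEvent inspects the state field before acting on the event.
func (c *derivedCache) OnStorageEvent(events ...Event) {
	for _, evt := range events {
		if evt.State == IndexStateNotCompiling {
			// The store is known to be invalid: drop derived data.
			c.entries = map[string]bool{}
			continue
		}
		if c.entries == nil {
			c.entries = map[string]bool{}
		}
		c.entries[evt.PolicyID] = true
	}
}

func main() {
	c := &derivedCache{}
	c.OnStorageEvent(
		Event{PolicyID: "a", State: IndexStateCompiling},
		Event{PolicyID: "b", State: IndexStateUnknown},
	)
	fmt.Println("entries while healthy:", len(c.entries))

	c.OnStorageEvent(Event{PolicyID: "c", State: IndexStateNotCompiling})
	fmt.Println("entries after invalid store:", len(c.entries))
}
```

Making "unknown" the zero value means stores that never set the field are treated as they are today, while subscribers that care about validity only need to check for the one failing state.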

internal/engine/ruletable/rule_table.go

Sambigeara · 2025-04-04T07:29:09Z

> "doing surgery"

😆

I entirely agree! I'd thought that the fix was more time pressured given it's in production, but if that's not the case then absolutely, I'll dig in. There's a chance that the rule table might not need to know the state of the index health and forwarding events might be enough. Let's see 🤷‍♂️

Sambigeara · 2025-04-04T11:00:50Z

I don't think there was a need for a ternary state. In practice, only the schema manager and the rule table subscribe to the compile manager. I handled the events appropriately in the rule table and bypassed them in the schema manager to maintain existing behaviour.

charithe

I think there's a subtle bug in the schema manager. Other than that, LGTM 👍🏽

internal/schema/schema.go

Sambigeara · 2025-04-07T06:48:28Z

https://status.coveralls.io/incidents/ph6p14vg1fyr 🙄

Sambigeara changed the title ~~fix: Disable rule table on index build failure~~ fix: Purge rule table on index build failure Apr 2, 2025

charithe reviewed Apr 3, 2025

Sambigeara marked this pull request as draft April 3, 2025 09:28

Sambigeara marked this pull request as ready for review April 3, 2025 11:02

charithe reviewed Apr 4, 2025

internal/engine/ruletable/rule_table.go (outdated, resolved)

Sambigeara force-pushed the fix/index-build-failure-stale-ruletable branch from d6cc413 to 755dbe6 April 4, 2025 10:58

Sambigeara force-pushed the fix/index-build-failure-stale-ruletable branch from 755dbe6 to 4e689a6 April 4, 2025 11:06

charithe reviewed Apr 5, 2025

internal/schema/schema.go (outdated, resolved)

charithe approved these changes Apr 7, 2025

Sambigeara added 8 commits April 7, 2025 10:20

fix: Disable rule table on index build failure

772a1a3

When an index build fails, ensure the rule table is purged to avoid operating on stale data. Adds EventDisableRuleTable event type and makes rule table rebuild from scratch on next policy event. Signed-off-by: Sam Lock <sam@swlock.co.uk>

comments for clarity

ef8be5d

Signed-off-by: Sam Lock <sam@swlock.co.uk>

mutable store fixes

bf25466

* emit disable rule table event for invalid mutable stores * fix invalid query generation for what should be no-ops Signed-off-by: Sam Lock <sam@swlock.co.uk>

comment for clarity

37504b9

Signed-off-by: Sam Lock <sam@swlock.co.uk>

remove new event and pass index health state in events. swop out chan…

dc75a3c

…nel in favour of atomic.Bool Signed-off-by: Sam Lock <sam@swlock.co.uk>

ignore invalid events in the schema storage event handler, to maintai…

7f685f1

…n existing behaviour Signed-off-by: Sam Lock <sam@swlock.co.uk>

just lint

0792f58

Signed-off-by: Sam Lock <sam@swlock.co.uk>

remove buggy event bypass in schema manager

933cd2c

Signed-off-by: Sam Lock <sam@swlock.co.uk>

charithe force-pushed the fix/index-build-failure-stale-ruletable branch from ffd7f63 to 933cd2c April 7, 2025 09:20

charithe approved these changes Apr 7, 2025

Sambigeara merged commit 03982ea into cerbos:main Apr 7, 2025
21 checks passed

Sambigeara deleted the fix/index-build-failure-stale-ruletable branch April 7, 2025 09:59
