making the update reason and recreate annotations stable by shawkins · Pull Request #38844 · keycloak/keycloak · GitHub

making the update reason and recreate annotations stable #38844


Merged: 2 commits merged into keycloak:main on Apr 14, 2025

Conversation

shawkins (Contributor)

closes: #38487

The second commit shows what this test looks like against just the api server; it can be left out if desired, as it has the obvious drawback of needing to emulate a statefulset controller. @pruivo @vmuzikar @mabartos what do you think?

The core fix is that we want to preserve the new annotations from the existing deployment when possible - they are still relevant to the current status even after the rolling or recreate update has completed.
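A minimal sketch of that idea, using plain maps and a hypothetical annotation key (the actual operator uses Constants.KEYCLOAK_UPDATE_REASON_ANNOTATION, visible in the diff below); this is illustrative, not the operator's code:

```java
import java.util.Map;

// Illustrative sketch of the core fix, not the actual operator code: if the
// current reconcile did not compute a fresh update reason, carry the reason
// annotation forward from the existing deployment so the status stays
// explainable after the rolling/recreate update completes.
class AnnotationCarryOver {
    // hypothetical key; the real one is Constants.KEYCLOAK_UPDATE_REASON_ANNOTATION
    static final String UPDATE_REASON = "example.keycloak.org/update-reason";

    static void preserveIfAbsent(Map<String, String> desired, Map<String, String> existing) {
        // only fall back to the previously recorded reason; never overwrite a new one
        if (!desired.containsKey(UPDATE_REASON)) {
            String previous = existing.get(UPDATE_REASON);
            if (previous != null) {
                desired.put(UPDATE_REASON, previous);
            }
        }
    }
}
```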

shawkins (Contributor Author)

The test failure in the first run is due to a weakness in our testing assumptions, and it similarly causes some of the other tests to be flaky as well.

Here we're checking for Ready=False with a particular reason, but that is an ephemeral state. Even using informers to pick up all status changes does not guarantee that we'll see that state, for two reasons:

  1. the informer can reconnect to the api server in such a way as to restart the watch, so we skip over intermediate events - this is not very likely given that bookmarks are in use, but it can still happen.
  2. the event(s) that would have ultimately resulted in the update of the Keycloak status into the desired ephemeral state, were interrupted in some way - most likely a 409 error updating something - such that by the time we retry, we're in a different state.
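Given that ephemeral states may legitimately never be observed, a more robust assertion polls for an eventually-stable condition rather than a transient one. The helper below is an illustrative sketch, not the actual test utilities:

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch, not the actual test code: rather than asserting an
// ephemeral Ready=False state that an informer may never deliver (restarted
// watches, interrupted status updates), poll until an eventually-stable
// condition holds, tolerating skipped intermediate events.
class EventuallyAssert {
    static boolean eventually(BooleanSupplier condition, int maxAttempts, long intervalMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (condition.getAsBoolean()) {
                return true; // stable state reached; missed intermediate states don't matter
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // condition never became true within the budget
    }
}
```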

@shawkins shawkins marked this pull request as ready for review April 11, 2025 12:12
@shawkins shawkins requested review from a team as code owners April 11, 2025 12:12
vmuzikar (Contributor)

> The second commit shows what this test looks like against just the api server; it can be left out if desired, as it has the obvious drawback of needing to emulate a statefulset controller. @pruivo @vmuzikar @mabartos what do you think?

I would rather stick to an integration test in this case; it feels too "artificial" to mock the statefulset controller.

shawkins (Contributor Author)

> I would rather stick to an integration test in this case; it feels too "artificial" to mock the statefulset controller.

There are of course tradeoffs. The reason the mocking here is so complete is to ensure that we generate intermediate statefulset states matching our test expectations for things like Ready=False; as explained above, those assertions are actually a little problematic in any case.

We could simplify the assertions and the mock logic: instead of simulating up/down handling, it could just make the new statefulset seem immediately available.

In the end it's a tradeoff between the amount of coverage and how long the test takes. As written, the api server version of the test runs in less than 20 seconds. The integration test currently takes 1-2 minutes. I can tweak the integration test to instead disable probes (we don't really need the server to come up for this test), but it will still take nearly twice as long as the apiserver test, without any additional increase in coverage from the perspective of the operator.

Another benefit of the apiserver testing locally is that it's faster to restart testing - there's no namespace to clean up or set up, and no database needed.

The last thought is that these testing approaches don't have to be mutually exclusive. The test logic here is the same whether it's being run against the apiserver only or a full kubernetes; if we're sufficiently clever, then we could have a single definition of a test, but three testing modes: local, remote, and apiserver.

// version 22 changed the match labels, account for older versions
if (!existingDeployment.isMarkedForDeletion() && !hasExpectedMatchLabels(existingDeployment, primary)) {
context.getClient().resource(existingDeployment).lockResourceVersion().delete();
Log.info("Existing Deployment found with old label selector, it will be recreated");
}

baseDeployment.getMetadata().getAnnotations().put(Constants.KEYCLOAK_UPDATE_REASON_ANNOTATION, ContextUtils.getUpdateReason(context));
Contributor

Is updating the metadata safe without using the "builder" classes?

shawkins (Contributor Author)

It is. We don't need another copy because the desired is a working copy.

We also want to avoid using the actual state as a whole:

The operator sdk was allowing this to happen, but should not have been (we already removed similar logic from fabric8). When using server side apply, there could be other state contributed by other controllers, and applying all of the actual state would make our operator assume ownership of everything.

Contributor

Thank you for the explanation.

var updateCondition = assertUnknownUpdateTypeStatus(kc);

// fake statefulset controller
k8sclient.apps().statefulSets().withName(KeycloakDeploymentDependentResource.getName(kc))
Contributor

close() is never invoked on the informer. Does it matter?

shawkins (Contributor Author)

As written, no, because we just have a single test. To be more correct, the informer should be created in a before-all and closed in an after-all.
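The lifecycle being described can be sketched roughly as follows; the informer is stubbed as a plain AutoCloseable (fabric8's SharedIndexInformer is also AutoCloseable in recent versions), and all names here are illustrative, not the actual test code:

```java
// Illustrative lifecycle sketch, not the actual test code: hold the informer
// at class scope, create it in the before-all, and close it in the after-all
// so the watch does not leak across tests.
class InformerLifecycle {
    static AutoCloseable informer;
    static boolean closed;

    static void beforeAll() {
        closed = false;
        // stand-in for something like client.resources(Keycloak.class).inform(handler)
        informer = () -> closed = true;
    }

    static void afterAll() {
        if (informer != null) {
            try {
                informer.close(); // releases the watch connection and event threads
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```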

Comment on lines 99 to 101
if (!Objects.equals(revision, obj.getStatus().getUpdateRevision())) {
actualReplicas = 0;
} else if (replicas == 0) {
Contributor

Can this branch be removed?
If the replicas < status.replicas then you decrement the number of replicas, otherwise you increment.

shawkins (Contributor Author)

If the revision doesn't match the updateRevision, then we are rolling in another spec change. Setting the replicas to 0 represents taking everything offline; subsequent events then handle the roll up. It's a hybrid of serial and parallel handling.

You don't have to have a replica change with a spec change, so the other handling with incrementing / decrementing won't do anything if the replicas are left the same in the CR.
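The replica handling being discussed can be distilled into a small pure function, sketched below; the method and parameter names are illustrative, not the actual test code:

```java
import java.util.Objects;

// Hypothetical distillation of the fake statefulset controller's replica
// handling discussed above: a revision mismatch takes everything offline
// first, and later events step the replica count toward the desired size one
// at a time - the hybrid of serial and parallel handling described above.
class FakeStatefulSetController {
    static int nextReplicas(String revision, String updateRevision,
                            int desiredReplicas, int actualReplicas) {
        if (!Objects.equals(revision, updateRevision)) {
            return 0; // rolling in another spec change: drop everything, roll back up later
        }
        if (actualReplicas < desiredReplicas) {
            return actualReplicas + 1; // roll/scale up one replica per event
        }
        if (actualReplicas > desiredReplicas) {
            return actualReplicas - 1; // scale down one replica per event
        }
        return actualReplicas; // steady state: nothing to do
    }
}
```

Note that the revision check fires even when the CR's replica count is unchanged, which is why the increment/decrement branches alone would not generate intermediate states for a pure spec change.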

Contributor

Right! forgot about the rolling update 🤦

pruivo (Contributor) commented Apr 14, 2025

> The test failure in the first run is due to a weakness in our testing assumptions, and it similarly causes some of the other tests to be flaky as well.
>
> Here we're checking for Ready=False with a particular reason, but that is an ephemeral state. Even using informers to pick up all status changes does not guarantee that we'll see that state, for two reasons:
>
>   1. the informer can reconnect to the api server in such a way as to restart the watch, so we skip over intermediate events - this is not very likely given that bookmarks are in use, but it can still happen.
>   2. the event(s) that would have ultimately resulted in the update of the Keycloak status into the desired ephemeral state were interrupted in some way - most likely a 409 error updating something - such that by the time we retry, we're in a different state.

This is what I didn't figure out when I tried to fix the annotations here and it caused the test to fail more often 🤣 #38588 (comment)

> The second commit shows what this test looks like against just the api server; it can be left out if desired, as it has the obvious drawback of needing to emulate a statefulset controller. @pruivo @vmuzikar @mabartos what do you think?

It makes things easy to debug, which was a problem I had here. I suggest having the Explicit and RecreateOnImageChange strategies in this format. Just my 2 cents, and I'm fine either way.

pruivo (Contributor) left a comment

LGTM

closes: keycloak#38487

Signed-off-by: Steve Hawkins <shawkins@redhat.com>
Signed-off-by: Steve Hawkins <shawkins@redhat.com>
shawkins (Contributor Author)

To keep this PR moving quickly, I've pulled out the apiserver test approach; I'll refine that and bring it back in another PR.

ahus1 (Contributor) left a comment

Approving based on @pruivo's review. Thank you for this change, @shawkins, you're going to amazing depths with the Operator SDK here.

@ahus1 ahus1 merged commit f21c486 into keycloak:main Apr 14, 2025
77 checks passed
Development

Successfully merging this pull request may close these issues.

[Keycloak Operator CI] - Test remote (slow) - UpdateTest.testExplicitStrategy
4 participants