making the update reason and recreate annotations stable by shawkins · Pull Request #38844 · keycloak/keycloak · GitHub

making the update reason and recreate annotations stable #38844


Merged: 2 commits merged into keycloak:main on Apr 14, 2025

Conversation

shawkins (Contributor)

closes: #38487

The second commit shows what this test looks like against just the api server; it can be left out if desired, as it has the obvious drawback of needing to emulate a statefulset controller. @pruivo @vmuzikar @mabartos what do you think?

The core fix is that we want to preserve the new annotations from the existing deployment when possible - they are still relevant to the current status even after the rolling or recreate update has completed.
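A minimal sketch of that idea, using plain maps and a hypothetical annotation key (the actual operator uses Constants.KEYCLOAK_UPDATE_REASON_ANNOTATION, visible in the diff below); this is illustrative, not the operator's code:

```java
import java.util.Map;

// Illustrative sketch of the core fix, not the actual operator code: if the
// current reconcile did not compute a fresh update reason, carry the reason
// annotation forward from the existing deployment so the status stays
// explainable after the rolling/recreate update completes.
class AnnotationCarryOver {
    // hypothetical key; the real one is Constants.KEYCLOAK_UPDATE_REASON_ANNOTATION
    static final String UPDATE_REASON = "example.keycloak.org/update-reason";

    static void preserveIfAbsent(Map<String, String> desired, Map<String, String> existing) {
        // only fall back to the previously recorded reason; never overwrite a new one
        if (!desired.containsKey(UPDATE_REASON)) {
            String previous = existing.get(UPDATE_REASON);
            if (previous != null) {
                desired.put(UPDATE_REASON, previous);
            }
        }
    }
}
```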

shawkins (Contributor Author)

The test failure in the first run is due to a weakness in our testing assumptions, and it similarly causes some of the other tests to be flaky as well.

Here we're checking for Ready=False with a particular reason, but that is an ephemeral state. Even using informers to pick up all status changes does not guarantee that we'll see that state, for two reasons:

  1. the informer can reconnect to the api server in such a way as to restart the watch, so we skip over intermediate events - this is not very likely given that bookmarks are in use, but it can still happen.
  2. the event(s) that would have ultimately resulted in the update of the Keycloak status into the desired ephemeral state, were interrupted in some way - most likely a 409 error updating something - such that by the time we retry, we're in a different state.
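Given that ephemeral states may legitimately never be observed, a more robust assertion polls for an eventually-stable condition rather than a transient one. The helper below is an illustrative sketch, not the actual test utilities:

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch, not the actual test code: rather than asserting an
// ephemeral Ready=False state that an informer may never deliver (restarted
// watches, interrupted status updates), poll until an eventually-stable
// condition holds, tolerating skipped intermediate events.
class EventuallyAssert {
    static boolean eventually(BooleanSupplier condition, int maxAttempts, long intervalMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (condition.getAsBoolean()) {
                return true; // stable state reached; missed intermediate states don't matter
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // condition never became true within the budget
    }
}
```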

@shawkins shawkins marked this pull request as ready for review April 11, 2025 12:12
@shawkins shawkins requested review from a team as code owners April 11, 2025 12:12
vmuzikar (Contributor)

> The second commit shows what this test looks like against just the api server; it can be left out if desired, as it has the obvious drawback of needing to emulate a statefulset controller. @pruivo @vmuzikar @mabartos what do you think?

I would rather stick to an integration test in this case; it feels too "artificial" to mock the statefulset controller.

shawkins (Contributor Author)

> I would rather stick to an integration test in this case; it feels too "artificial" to mock the statefulset controller.

There are of course tradeoffs. The reason the mocking here is so complete is to ensure that we generate intermediate statefulset states matching our test expectations for things like Ready=False; as explained above, those assertions are actually a little problematic in any case.

We could simplify the assertions and the mock logic: instead of simulating up/down handling, it could just make the new statefulset seem immediately available.

In the end it's a tradeoff between the amount of coverage and how long the test takes. As written, the api server version of the test runs in less than 20 seconds. The integration test currently takes 1-2 minutes. I can tweak the integration test to instead disable probes (we don't really need the server to come up for this test), but it will still take nearly twice as long as the apiserver test, without any additional increase in coverage from the perspective of the operator.

Another benefit of the apiserver testing locally is that it's faster to restart testing - there's no namespace to clean up or set up, and no database needed.

The last thought is that these testing approaches don't have to be mutually exclusive. The test logic here is the same whether it's being run against the apiserver only or a full kubernetes; if we're sufficiently clever, then we could have a single definition of a test, but three testing modes: local, remote, and apiserver.

// version 22 changed the match labels, account for older versions
if (!existingDeployment.isMarkedForDeletion() && !hasExpectedMatchLabels(existingDeployment, primary)) {
context.getClient().resource(existingDeployment).lockResourceVersion().delete();
Log.info("Existing Deployment found with old label selector, it will be recreated");
}

baseDeployment.getMetadata().getAnnotations().put(Constants.KEYCLOAK_UPDATE_REASON_ANNOTATION, ContextUtils.getUpdateReason(context));
Contributor

Is updating the metadata safe without using the "builder" classes?

shawkins (Contributor Author)

It is. We don't need another copy because the desired is a working copy.

We also want to avoid using the actual state as a whole:

The operator sdk was allowing this to happen, but should not have been (we already removed similar logic from fabric8). When using server side apply, there could be other state contributed by other controllers, and applying all of the actual state would make our operator assume ownership of everything.

Contributor

Thank you for the explanation.

var updateCondition = assertUnknownUpdateTypeStatus(kc);

// fake statefulset controller
k8sclient.apps().statefulSets().withName(KeycloakDeploymentDependentResource.getName(kc))
Contributor

close() is never invoked on the informer. Does it matter?

shawkins (Contributor Author)

As written, no, because we just have a single test. To be more correct, the informer should be created in a before-all and closed in an after-all.
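The lifecycle being described can be sketched roughly as follows; the informer is stubbed as a plain AutoCloseable (fabric8's SharedIndexInformer is also AutoCloseable in recent versions), and all names here are illustrative, not the actual test code:

```java
// Illustrative lifecycle sketch, not the actual test code: hold the informer
// at class scope, create it in the before-all, and close it in the after-all
// so the watch does not leak across tests.
class InformerLifecycle {
    static AutoCloseable informer;
    static boolean closed;

    static void beforeAll() {
        closed = false;
        // stand-in for something like client.resources(Keycloak.class).inform(handler)
        informer = () -> closed = true;
    }

    static void afterAll() {
        if (informer != null) {
            try {
                informer.close(); // releases the watch connection and event threads
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```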

Comment on lines 99 to 101
if (!Objects.equals(revision, obj.getStatus().getUpdateRevision())) {
actualReplicas = 0;
} else if (replicas == 0) {
Contributor

Can this branch be removed?
If the replicas < status.replicas then you decrement the number of replicas, otherwise you increment.

shawkins (Contributor Author)

If the revision doesn't match the updateRevision, then we are rolling in another spec change. Setting the replicas to 0 represents taking everything offline; subsequent events then handle the roll up. It's a hybrid of serial and parallel handling.

You don't have to have a replica change with a spec change, so the other handling with incrementing / decrementing won't do anything if the replicas are left the same in the CR.
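The replica handling being discussed can be distilled into a small pure function, sketched below; the method and parameter names are illustrative, not the actual test code:

```java
import java.util.Objects;

// Hypothetical distillation of the fake statefulset controller's replica
// handling discussed above: a revision mismatch takes everything offline
// first, and later events step the replica count toward the desired size one
// at a time - the hybrid of serial and parallel handling described above.
class FakeStatefulSetController {
    static int nextReplicas(String revision, String updateRevision,
                            int desiredReplicas, int actualReplicas) {
        if (!Objects.equals(revision, updateRevision)) {
            return 0; // rolling in another spec change: drop everything, roll back up later
        }
        if (actualReplicas < desiredReplicas) {
            return actualReplicas + 1; // roll/scale up one replica per event
        }
        if (actualReplicas > desiredReplicas) {
            return actualReplicas - 1; // scale down one replica per event
        }
        return actualReplicas; // steady state: nothing to do
    }
}
```

Note that the revision check fires even when the CR's replica count is unchanged, which is why the increment/decrement branches alone would not generate intermediate states for a pure spec change.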

Contributor

Right! forgot about the rolling update 🤦

pruivo (Contributor) commented Apr 14, 2025

> The test failure in the first run is due to a weakness in our testing assumptions, and it similarly causes some of the other tests to be flaky as well.
>
> Here we're checking for Ready=False with a particular reason, but that is an ephemeral state. Even using informers to pick up all status changes does not guarantee that we'll see that state, for two reasons:
>
>   1. the informer can reconnect to the api server in such a way as to restart the watch, so we skip over intermediate events - this is not very likely given that bookmarks are in use, but it can still happen.
>   2. the event(s) that would have ultimately resulted in the update of the Keycloak status into the desired ephemeral state were interrupted in some way - most likely a 409 error updating something - such that by the time we retry, we're in a different state.

This is what I didn't figure out when I tried to fix the annotations here and it caused the test to fail more often 🤣 #38588 (comment)

> The second commit shows what this test looks like against just the api server; it can be left out if desired, as it has the obvious drawback of needing to emulate a statefulset controller. @pruivo @vmuzikar @mabartos what do you think?

It makes things easy to debug, which was a problem I had here. I suggest having the Explicit and RecreateOnImageChange strategies in this format. Just my 2 cents, and I'm fine either way.

pruivo (Contributor) left a comment

LGTM

closes: keycloak#38487

Signed-off-by: Steve Hawkins <shawkins@redhat.com>
Signed-off-by: Steve Hawkins <shawkins@redhat.com>
shawkins (Contributor Author)

To keep this PR moving quickly, I've pulled out the apiserver test approach; I'll refine that and bring it back in another PR.

ahus1 (Contributor) left a comment

Approving based on @pruivo's review. Thank you for this change, @shawkins, you're going to amazing depths with the Operator SDK here.

@ahus1 ahus1 merged commit f21c486 into keycloak:main Apr 14, 2025
77 checks passed
Development

Successfully merging this pull request may close these issues.

[Keycloak Operator CI] - Test remote (slow) - UpdateTest.testExplicitStrategy
4 participants