fix: adding logic to isolate realm migration processing #39377
Conversation
Checked the v26.1 to 26.2 migration - and it's very fast even without this change, at least for simple realms. If there is anything that was missed with what is lazily loaded though (#38553 (review)), then this proposed change is still beneficial.
Marking as ready for review. There is still no direct performance test included, but I'm open to feedback on including one. Also added a general note about import / export / migration performance - cc @mabartos
Thank you for this change, looks good to me. I see that it is still running in one transaction, which is IMHO important to avoid an incomplete migration.
I was wondering if we could actually commit to the DB after the migration is complete for a specific version - but then one would also need to update the migration model to that version, so it can be restarted later. That seems to be a bit fragile and beyond this PR, so I'm good with this improvement.
@shawkins - thinking about it more I have some doubts about what happens if a realm is touched twice when there are multiple migrations. The delegate might have been fetched, but it is then disconnected from the context on clear/flush. I will need to look into it more.
My understanding from the code is that each time we obtain the realm like this, it is attached to the current context.
@shawkins - I was a bit worried as the RealmAdapter that is part of the Infinispan caching layer is holding a reference to a JPA entity:

keycloak/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/RealmAdapter.java (line 73 in 4dc4de7):

    protected volatile RealmModel updated;
After the first flush-and-clear, this is detached. Any further updates to that object instance will not be persisted.
Then I found that after the realm was modified, for each subsequent lookup of the realm, the delegate is returned and not the cached instance:
keycloak/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/RealmCacheSession.java (lines 464 to 465 in 4dc4de7):

    if (invalidations.contains(id)) {
        return getRealmDelegate().getRealm(id);
If the EM was flushed and cleared, this is then a new instance of the entity, which is again attached to the EM, and all changes are persisted.
So as long as we don't reuse any previously looked-up realm after a clear (which you are not doing, as you are explicitly looking up the realm again), this should be safe.
This mechanism seems to be consistent for other entities as well (like organizations).
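The flush-and-clear behavior discussed above can be illustrated with a toy simulation. This is not Keycloak or JPA code: `ToyContext`, `RealmEntity`, and their methods are invented stand-ins for an EntityManager and a JPA entity. The sketch shows why an entity reference held across `clear()` silently stops being persisted, while a fresh lookup after the clear works:

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation (not Keycloak/JPA code): why a cached adapter must not keep
// using an entity reference across a persistence-context clear().
public class DetachDemo {

    // Toy entity: the name field is only written back while the entity is managed.
    static class RealmEntity {
        String name;
        RealmEntity(String name) { this.name = name; }
    }

    // Minimal stand-in for an EntityManager: managed entities are written back on flush.
    static class ToyContext {
        final Map<String, String> database = new HashMap<>();
        final Map<String, RealmEntity> managed = new HashMap<>();

        RealmEntity find(String id) {
            return managed.computeIfAbsent(id, k -> new RealmEntity(database.get(k)));
        }
        void flush() { managed.forEach((id, e) -> database.put(id, e.name)); }
        void clear() { managed.clear(); } // every held reference is now "detached"
    }

    public static void main(String[] args) {
        ToyContext em = new ToyContext();
        em.database.put("realm1", "v1");

        RealmEntity stale = em.find("realm1"); // e.g. held by a cache adapter
        em.flush();
        em.clear();                            // "stale" is now detached

        stale.name = "v2";                     // lost: no longer tracked
        em.flush();
        System.out.println(em.database.get("realm1")); // still "v1"

        // Safe pattern: look the realm up again after clear(), as the PR does.
        RealmEntity fresh = em.find("realm1");
        fresh.name = "v2";
        em.flush();
        System.out.println(em.database.get("realm1")); // now "v2"
    }
}
```

The second lookup mirrors the RealmCacheSession behavior described above: once the id is invalidated, the delegate is fetched again and attached to the current context, so its changes are persisted.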
model/storage-private/src/main/java/org/keycloak/migration/migrators/MigrateTo24_0_0.java (outdated review thread, resolved)
@@ -11,6 +11,12 @@ Asynchronous logging might be useful for deployments requiring **high throughput

For more details, see the https://www.keycloak.org/server/logging[Logging guide].

= Improvements to import / export and migration

Previously it was recommended that import / export operations be performed while the server was stopped to ensure consistency due to splitting the work over multiple transactions. Import / export operations now use a single transaction, thus there is no need to stop the server to ensure consistency.
Are we sure this was the only reason? What if some other parts of KC don't handle transactions properly, i.e. use multiple transactions to perform an atomic operation? Then the export might hit an inconsistent DB state.
I know this would basically mean we have a transactions related bug somewhere but stating it's no longer needed to shutdown the server before export would expose it more likely.
Are we sure this was the only reason?
That is what was originally explained.
What if some other parts of KC don't handle transactions properly, i.e. use multiple transactions to perform an atomic operation? Then the export might hit an inconsistent DB state.
I know this would basically mean we have a transactions related bug somewhere but stating it's no longer needed to shutdown the server before export would expose it more likely.
This really isn't a bug situation. It has more to do with locking and isolation level. When performing an export or import we are not explicitly using a JPA lock to prevent any other modifications, and the isolation level will typically be set to read committed - not snapshot or serializable. So yes, there is a possibility of non-repeatable / phantom reads, but that is no different than any other code path in Keycloak. The persistence context is isolating us from this problem via its cache - you'd have to keep refetching the same query to hit an inconsistency. I'm honestly not sure if we're hitting that situation with the export code - EDIT: this can happen with export because of how pagination is implemented.
We can certainly warn about the isolation level wrt exports, and for those that may be worried about excluding realm modifications made concurrently with the export they can still shut everything down.
Might be good to break down the cases a bit more:
Single file export / import is functionally unchanged.
Multiple file export / import:
| | Export | Import |
|---|---|---|
| Old | Each realm export completed in a single transaction; groups of n users (typically 50) exported in individual transactions. | All of the import completed in a single transaction; groups of n users (typically 50) imported in individual transactions. Concurrent usage of the realm is possible after it is committed. Failures in the import process after the realm is committed leave things in an inconsistent state. |
| New | Single transaction. | Single transaction - the realm is not usable until the whole import commits, and any failure in the import process causes the entire import to fail. |
So there's definitely an improvement on import. However, concurrent modification to a realm during an import would either result in those modifications effectively being thrown away, or cause the import to fail. Do you want a warning to that effect?
At an isolation level of read committed (or below), either the old or the new behavior is susceptible to having the pagination thrown off by adding or deleting users during the export. I can add a warning about that.
We can also consider changing the pagination logic to fetch much larger sets of ids (with partitioning on id or username for extreme cases) instead of using limit / offset queries.
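The pagination hazard under read committed can be sketched with a self-contained simulation. This is not the actual export code: `OffsetPaginationDemo` and its method names are hypothetical, and a plain `List` stands in for the database. It shows how limit/offset paging silently skips a row when a concurrent transaction deletes an earlier row mid-export:

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation (not Keycloak code): limit/offset pagination can skip rows
// when another transaction deletes earlier rows between page reads.
public class OffsetPaginationDemo {

    // Export "users" page by page using limit/offset against a live list.
    // deleteAfterPage: after that many pages, a concurrent delete removes row 0.
    static List<String> exportWithOffset(List<String> db, int pageSize, int deleteAfterPage) {
        List<String> exported = new ArrayList<>();
        int offset = 0;
        int page = 0;
        while (true) {
            int from = Math.min(offset, db.size());
            int to = Math.min(offset + pageSize, db.size());
            List<String> batch = new ArrayList<>(db.subList(from, to));
            if (batch.isEmpty()) break;
            exported.addAll(batch);
            offset += pageSize;
            page++;
            // Concurrent transaction deletes the first row after a page is read:
            // later offsets now point past a row that was never exported.
            if (page == deleteAfterPage && !db.isEmpty()) {
                db.remove(0);
            }
        }
        return exported;
    }

    public static void main(String[] args) {
        List<String> db = new ArrayList<>(List.of("u1", "u2", "u3", "u4", "u5", "u6"));
        List<String> exported = exportWithOffset(db, 2, 1);
        // u1,u2 are read; u1 is deleted; the next offset of 2 now lands on u4,
        // so u3 is silently skipped.
        System.out.println(exported);
    }
}
```

Keyset-style pagination (ordering by a unique column and resuming from the last seen value) avoids this, which is the spirit of the suggestion above to partition on id or username for extreme cases.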
@ahus1 @vmuzikar I've added a commit that removes index-based pagination from export. This makes it more tolerant to read committed isolation. We also require a tweak to UserCacheSession.managedUsers (I believe @ahus1 mentioned this at some point) because it will accumulate users for the entirety of an export. I've based the change on whether we're in a batch operation, but in general these entries shouldn't be needed with JPA because they effectively compete with the persistence context. If it makes more sense to add a method on UserCache to flush the managed users, or to avoid creating managed user entries for JPA, let me know.
The other thing I noticed was quarkusio/quarkus#48135. It looks like there are a couple hundred bytes per statement held in memory until the transaction completes. With simple users and default memory settings, this creates a problem at around 2 million users / 10 million statements. The only thing that could improve upon this with the single-transaction logic would be greater use of statement-level batching.
If we need to support several millions of users per import / export in the near term, then I believe I'll have to further tweak things to still support a multi-transaction mode.
WDYT?
Another detail related to all this is that import processing is limited by file size, because the placeholder logic reads the entire file into memory and holds it. If that were tweaked, then we should be able to handle much larger user counts in a single file.
Unreported flaky test detected, please review
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.forms.VerifyProfileTest#testRequiredOnlyIfUser
Keycloak CI - Forms IT (chrome)
java.lang.IllegalArgumentException: No enum constant org.keycloak.testsuite.pages.AppPage.RequestType.localhost
at java.base/java.lang.Enum.valueOf(Enum.java:293)
at org.keycloak.testsuite.pages.AppPage$RequestType.valueOf(AppPage.java:57)
at org.keycloak.testsuite.pages.AppPage.getRequestType(AppPage.java:46)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
Unreported flaky test detected, please review

If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.

org.keycloak.testsuite.cluster.SessionFailoverClusterTest#sessionFailover
There are three pieces to this PR:
I think 1 should get in prior to 26.3. 2 and 3 are currently intertwined. If we don't want to do 2, then 3 should be changed to recommend always shutting down the cluster for export.
@shawkins Sorry for the late reply and thanks for the updates.
In general, the changes look good. Though I'm not sure how, we should run some perf tests.
docs/guides/server/importExport.adoc (outdated)
@@ -16,6 +16,8 @@ The default count of users per file and per transaction is fifty, but you may us

The `import` and `export` commands are essentially server launches that exit before bringing up the full server. They are not currently designed to be run from the same machine as another server instance, which may result in port or other conflicts.
====

It is recommended that all {project_name} nodes are stopped prior to using the `kc.[sh|bat] export` command. This ensures that the resulting operations will have no consistency issues with concurrent requests.
Just a nit. I'd move this into the note box above.
Moved
@@ -8,6 +8,12 @@ Asynchronous logging might be useful for deployments requiring high throughput a

For more details, see the https://www.keycloak.org/server/logging[Logging guide].

= Improvements to import / export and migration

Previously it was recommended that import operations be performed while the server was stopped to ensure consistency due to splitting the work over multiple transactions. Import / export operations now use a single transaction, thus there is no need to stop the server to ensure import consistency.
Should we mention here the recommendation to stop the server when doing export? Or do you perceive the risk of phantom reads of the users negligible with the latest changes around how we paginate the users? (Phantom reads probably can't happen as we're basically fetching all users in a single query.)
Should we mention here the recommendation to stop the server when doing export?
This text was narrowed to only mention that there's no need to shut down during import. The recommendation to shut down for export is still mentioned in the import / export docs.
We can reiterate that here if you want.
My 2 cents here:
- Export operations won't be consistent if there are concurrent operations due to different DB isolation levels, and results would differ from database to database
- Import operations on the CLI will not invalidate the caches of the running Keycloak cluster. So having that cluster running while doing the import on the CLI is still not supported. The only way to do an import at runtime is via the Admin CLI, as that would properly invalidate any caches. AFAIK there are some (small) changes to the master realm when you import a new realm.
TL;DR: I would just mention that it is now faster. The advice to shut down KC is already in the docs, and people have either respected that or ignored it. Either way, I wouldn't mention it here.
BTW, the Keycloak Operator used to restart Keycloak after the realm import. Does it still do that? Given my statement about the caches above, it might still be the right thing to do.
I believe caching is not such a problem at import, is it? We're mostly talking about new Realm imports, no? Accessing a previously non-existent Realm inevitably leads to a cache miss, so no invalidation is needed after the import? The only issue would be when an import overwrites an existing Realm.
I think we might add at least one new client to the master realm, and I'm not sure if we cache the realm list at some point. All of that is IMHO not the core of this PR, and I'd suggest deferring this to a separate analysis after 26.3 has been released, and merging the code changes as is, not promising anything else than the speed improvement, which is a big step forward for a lot of users.
I'd suggest deferring this to a separate analysis after 26.3 has been released
+1
merge the code changes as is
Yes but the release notes still say:
thus there is no need to stop the server to ensure import consistency
The operator only allows for import of a new realm, it will not overwrite, and yes it still causes a rolling restart at the end of the import.
    if (!exportUsersIntoRealmFile) {
        usersHolder.totalCount = session.users().getUsersCount(realm, true);

    if (usersExportStrategy != UsersExportStrategy.SKIP && !exportUsersIntoRealmFile) {
        Stream<UserModel> users = UserStoragePrivateUtil.userLocalStorage(session).searchForUserStream(realm, Map.of());
This basically means we're fetching all (un-paginated) users into the memory, no? I wonder what are the perf consequences.
Before, we were fetching users in a paginated way with the risk of phantom reads.
This basically means we're fetching all (un-paginated) users into the memory, no? I wonder what are the perf consequences.
It shouldn't. Hibernate stream calls are backed by the scrollable result, so they have a limited memory footprint: https://github.com/hibernate/hibernate-orm/blob/f07a6f41a7bb04fdb583ff776b3da0f158992874/hibernate-core/src/main/java/org/hibernate/query/spi/AbstractSelectionQuery.java#L262-L272
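As a rough illustration of why a lazily backed stream keeps the memory footprint bounded, here is a self-contained sketch. This is not Hibernate code: `LazyStreamDemo`, `userStream`, and the fetch counter are invented stand-ins; a lazily generated `java.util.stream.Stream` plays the role of a scrollable-result-backed query stream. Only the elements actually pulled by the terminal operation are materialized:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Toy simulation (not Hibernate code): a lazy stream only "fetches" the rows
// the consumer actually pulls, regardless of how large the result set is.
public class LazyStreamDemo {

    static AtomicInteger rowsFetched = new AtomicInteger();

    // Stand-in for a scrollable-result-backed query stream of user ids.
    static Stream<String> userStream(int totalRows) {
        return Stream.iterate(0, i -> i + 1)
                .limit(totalRows)
                .map(i -> {
                    rowsFetched.incrementAndGet(); // the "fetch" happens per element
                    return "user-" + i;
                });
    }

    public static void main(String[] args) {
        // Consume only the first 100 elements of a million-row "result set".
        userStream(1_000_000).limit(100).forEach(u -> {});
        System.out.println("fetched " + rowsFetched.get() + " of 1,000,000 rows");
    }
}
```

The real mechanism in Hibernate is the scrollable result referenced in the link above; the point of the sketch is only that pull-based evaluation, not the result-set size, determines how much is held in memory at once.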
Based on the further conversation below, it appears we might not need this change after all?
@vmuzikar the new logic is still more performant for export - it's interrogating only local storage and will perform only a single query for users rather than the paginated fetching. But yes, if we are only recommending / requiring export to be performed with the server shut down, then it's not strictly necessary to include this improvement.
Ok, let's keep it.
also adding an info log for each realm migrated closes: keycloak#33978 keycloak#38649 Signed-off-by: Steve Hawkins <shawkins@redhat.com>
also preventing creating cached users during export Signed-off-by: Steve Hawkins <shawkins@redhat.com>
Signed-off-by: Steve Hawkins <shawkins@redhat.com>
Signed-off-by: Steve Hawkins <shawkins@redhat.com>
Signed-off-by: Steve Hawkins <shawkins@redhat.com>
AFAIK, allowing exports on a running instance would be a deal-breaker for many users, as they would like to run exports for backups and get snapshots of the state at some point in time. So it means for some cases, they need to support some kind of eventual consistency. If they want to have data fully consistent, they need to basically shut down the server. It'd be up to them what to choose. If we're >~98% sure nothing breaks, it'd be great to include it all into 26.3 as well. So I like this PR idea. But I don't have so many details in this matter, just ventilating some frequent community requests :)) Relevant issues:
As @mabartos is referring to the logic about using exports as a backup: I would consider this a bad practice and possibly not the intended use case for this, and the results would depend on the database and its (default) isolation levels.

Intended use cases for export/import:
- When you migrate from one setup to another, you would first block traffic to the old installation, then do the migration, then direct all users to the new system

Abuse for this feature:

Reasons why not to do this with Keycloak:

The way to utilize a DB's features:
Moving realms one at a time is especially problematic. Shutting down your old and new instance per realm - and having poor import / export performance to boot - causes a lot of outages.
We weren't preventing it before, just recommending the cluster be stopped. We can scale back the doc change here if need be.
+1
Signed-off-by: Alexander Schwartz <alexander.schwartz@gmx.net>
I assume we're all good here, but I didn't dare to press the merge button yet. Let me know (@vmuzikar or @shawkins) if you think this is now ready to be pressed. As @pskopek might be releasing on Monday, I've added the related issue to milestone 26.3 and added it as a blocker for the release, given that for me it is ready to be merged and as we seem to agree to merge it as it is now. If someone disagrees with the merge, just remove the milestone from #38649 and remove it from the related project board.
= Client initiated linking of the user account to identity provider is based on AIA

When using, for example, a one-time-password (OTP) generator as a second factor for authenticating users (2FA), a user can get locked out of their account when they, for example, lose the phone that contains the OTP generator.
To prepare for such a case, the recovery codes feature allows users to print a set of recovery codes as an additional second factor.
If the recovery codes are then allowed as an alternative 2FA in the login flow, they can be used instead of the OTP generated passwords.

With this release, the recovery codes feature is promoted from preview to a supported feature.
For newly created realms, the browser flow now includes the Recovery Authentication Code Form as _Disabled_, and it can be switched to _Alternative_ by admins if they want to use this feature.

For more information about this 2FA method, see the link:{adminguide_link}#_recovery-codes[Recovery Codes] chapter in the {adminguide_name}.
I believe this section got here by accident?
I took the liberty of removing the rogue release note. Let me know if that works for you.
Signed-off-by: Václav Muzikář <vmuzikar@redhat.com>
* fix: adding logic to isolate realm migration processing - also adding an info log for each realm migrated; closes: keycloak#33978 keycloak#38649 (Signed-off-by: Steve Hawkins <shawkins@redhat.com>)
* switching to an export strategy tolerant to read committed - also preventing creating cached users during export (Signed-off-by: Steve Hawkins <shawkins@redhat.com>)
* updating the docs to still recommend shutting the server down for export (Signed-off-by: Steve Hawkins <shawkins@redhat.com>)
* accounting for null managed users (Signed-off-by: Steve Hawkins <shawkins@redhat.com>)
* refinements based upon review comments (Signed-off-by: Steve Hawkins <shawkins@redhat.com>)
* Scaling back the docs (Signed-off-by: Alexander Schwartz <alexander.schwartz@gmx.net>)
* Remove rogue release note (Signed-off-by: Václav Muzikář <vmuzikar@redhat.com>)

Signed-off-by: Steve Hawkins <shawkins@redhat.com>
Signed-off-by: Alexander Schwartz <alexander.schwartz@gmx.net>
Signed-off-by: Václav Muzikář <vmuzikar@redhat.com>
Co-authored-by: Alexander Schwartz <alexander.schwartz@gmx.net>
Co-authored-by: Václav Muzikář <vmuzikar@redhat.com>
also adding an info log for each realm migrated
closes: #33978
closes: #38649
cc @ahus1 @pedroigor @vmuzikar - where would be a good place to add a performance test for a change like this? And how far back in the migrations would you take the change?
I wasn't sure if we could generally force the query mode to commit - but that could help even more.