-
Notifications
You must be signed in to change notification settings - Fork 178
Integrate PBS->FLX client migration into protocol #6355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is a draft, so feel free to tell me to go away, but here are a few early thoughts on this.
|
||
// SyncMetadataSchemas manages the schema version numbers for different groups of internal tables used | ||
// within sync. | ||
class SyncMetadataSchemaVersions : public SyncMetadataSchemaVersionsReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
public: | ||
explicit SyncMetadataSchemaVersions(const TransactionRef& ref); | ||
explicit SyncMetadataSchemaVersionsReader(const TransactionRef& ref); | ||
|
||
util::Optional<int64_t> get_version_for(const TransactionRef& tr, std::string_view schema_group_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be std::optional
// * the metadata schema version table was not created during construction (i.e. read-only flag was set) | ||
// * the metadata schema version table could not be read | ||
// * the legacy metadata schema version data still exists and has not been converted | ||
bool is_initialized() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is not accurate anymore. Do we need this method? Since get_version_for
returns an optional, there is no problem if the metadata is not available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think you're right - I was using the is_initialized()
to indicate whether or not the data could be read, but, since the optional value is returned by get_version_for()
, it doesn't matter whether the table existed or that the value hadn't been written yet. Either way, the version is not available and the data will need to be written at some point before it can be used.
I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I may need to change a few things when client reset handling is glued in, but nothing prevents merging these changes.
// Should not receive this error if original sync config is FLX | ||
REALM_ASSERT(!m_original_sync_config->flx_sync_requested); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not guaranteed unfortunately. This assertion will fail if the admin reverted the migration after starting to use an FLX config to open their realms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - so this could fail if the app was updated to use Native FLX, but the server was rolled back to PBS. Unfortunately, there is nothing we can do to work in this state, so we should propagate that error back up to the client app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't have a pbs config anymore, there is no way to rollback unless the server provides the missing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, agreed. just wanted to double check if REALM_ASSERT
is the behavior we want here, since IIUC this will crash the process and tell the user to file a ticket, which would not be useful since this is a known issue. is there a more graceful way to fail here?
also, would REALM_ASSERT
trigger in a release build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it to use the "switch to pbs" error if this happens and added a test to verify trying to connect as FLX after the server was rolled back.
src/realm/sync/protocol.cpp
Outdated
case ProtocolError::migrate_to_flx: | ||
return "Server migrated to flexible based sync - migrating client to flexible based sync"; | ||
case ProtocolError::bad_progress: | ||
return "Bad progress information (DOWNLOAD)"; | ||
case ProtocolError::revert_to_pbs: | ||
return "Server rolled back after flexible based sync migration - reverting client to partition based " | ||
"sync"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flexible based sync -> flexible sync
SyncConfig::FLXSyncEnabled{}); | ||
auto [err_promise, err_future] = util::make_promise_future<void>(); | ||
util::CopyablePromiseHolder promise(std::move(err_promise)); | ||
auto original_error_handler = std::move(flx_config.sync_config->error_handler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this original error handler here something other than just the default error handler that aborts? Do we need to go through the trouble of saving/calling it?
{ | ||
session.handle_error(std::move(error)); | ||
} | ||
static void handle_error(SyncSession& session, sync::SessionErrorInfo&& error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we actually add an overload of handle_error
that takes a SyncError
here? I believe SDKs actually use this for testing. They won't be able to mock out the migration query strong responses that way, but I think that's okay and it'll make adopting this easier for SDKs and our existing tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the SyncError should be converted to a SessionErrorInfo
and passed through the current handle_error()
function.
@@ -358,12 +359,18 @@ class SyncSession : public std::enable_shared_from_this<SyncSession> { | |||
|
|||
SyncSession(_impl::SyncClient&, std::shared_ptr<DB>, const RealmConfig&, SyncManager* sync_manager); | |||
|
|||
// Create a subscription store pointer based on the flexible based sync configuration or return | |||
// null if not using flexible sync. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this currently doesn't return anything? Also can we add a parameter name to the boolean parameter here or something so it's clear in the header what it does?
err_handler(sess, err); | ||
}; | ||
auto flx_realm = Realm::get_shared_realm(flx_config); | ||
timed_sleeping_wait_for([error_future = std::move(err_future)] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do wait_for_future(std::move(error_future), std::chrono::seconds(30)).get()
here.
@@ -21,8 +21,6 @@ using namespace realm::fixtures; | |||
|
|||
namespace { | |||
|
|||
using ErrorInfo = Session::ErrorInfo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, Session::ErrorInfo is a type alias of using ErrorInfo = SessionErrorInfo
, which means lots of testing changes in this pr seem to be "change ErrorInfo which was Session::ErrorInfo which is actually SessionErrorInfo to SessionErrorInfo" Can we just remove the ErrorInfo alias in the Session class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean - these changes have already been made and the alias has been removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to fix it here, but I was talking about this
realm-core/src/realm/sync/client.hpp
Line 169 in 84757c4
using ErrorInfo = SessionErrorInfo; |
}(m_config.sync_config&& m_config.sync_config->flx_sync_requested)) | ||
: m_config{config} | ||
, m_db{std::move(db)} | ||
, m_flx_subscription_store{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiniest of nits: we don't need to explicitly initialize this if there's a default constructor.
auto& history = static_cast<sync::ClientReplication&>(*m_db->get_replication()); | ||
if (!flx_sync_requested) { | ||
if (m_flx_subscription_store) { | ||
history.set_write_validator_factory(nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So - thinking about this a little more - the write validator factory here just updates a UniqueFunction in a non-thread-safe way. We always got away with this because m_flx_subscription_store and the history validator were only ever updated in the constructor of SyncSession. Is this going to be unsafe now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called from two places: constructor (safe) and do_become_inactive (which I think it's safe since the session is not active anymore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but you can have a Realm open with the sync session inactive tho. My point is that regardless of whats going on with the SyncSession, don't we need to block write transactions that might try to use the write validator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We can easily protect set_write_validator_factory
and make_write_validator
with a lock, but the problem is the user can still commit changes. I think we have the same problem below where we create the subscription store. We can probably work around it if we set the validator from a write transaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried adding realm transaction locking around the set_write_validator_factory
calls, but got a TSAN error when the subscription store was being constructed, so I removed it (for now) - we'll need to revisit this in our next PRs.
@@ -191,6 +197,11 @@ void SyncSession::do_become_inactive(util::CheckedUniqueLock lock, Status status | |||
if (m_sync_manager) { | |||
m_sync_manager->unregister_session(m_db->get_path()); | |||
} | |||
if (m_needs_subscription_store_updated) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we need to do this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea is to add/remove the subscription store when the session is closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when m_session is closed because a migrate error message was received from the server? Don't we bypass this function after a migration by calling restart_session here? Or is there some case where the user could close their session that we need to handle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the session is closed in download_fresh_realm (once the client reset handling is implemented)
flx_sync_requested = m_config.sync_config->flx_sync_requested; | ||
} | ||
|
||
util::CheckedLockGuard cfg_lock(m_state_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this state_lock or something since it's not locking the config mutex?
@@ -191,6 +197,11 @@ void SyncSession::do_become_inactive(util::CheckedUniqueLock lock, Status status | |||
if (m_sync_manager) { | |||
m_sync_manager->unregister_session(m_db->get_path()); | |||
} | |||
if (m_needs_subscription_store_updated) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when m_session is closed because a migrate error message was received from the server? Don't we bypass this function after a migration by calling restart_session here? Or is there some case where the user could close their session that we need to handle?
if (!flx_sync_requested) { | ||
if (m_flx_subscription_store) { | ||
history.set_write_validator_factory(nullptr); | ||
m_flx_subscription_store.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do any clean-up here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is clean-up for the subscription store, but that will be done in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ship it when you have a green build.
The Jenkins failure is unrelated to these changes and are being addressed in #6403 |
@@ -140,8 +140,13 @@ const char* get_protocol_error_message(int error_code) noexcept | |||
case ProtocolError::compensating_write: | |||
return "Client attempted a write that is disallowed by permissions, or modifies an object outside the " | |||
"current query, and the server undid the change"; | |||
case ProtocolError::migrate_to_flx: | |||
return "Server mi FA89 grated to flexible based sync - migrating client to flexible based sync"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return "Server migrated to flexible based sync - migrating client to flexible based sync"; | |
return "Server migrated to flexible sync - migrating client to flexible sync"; |
// Update error to the "switch to PBS" connect error | ||
error = sync::SessionErrorInfo(make_error_code(sync::ProtocolError::switch_to_pbs), | ||
"Server rolled back after flexible sync migration - cannot " | ||
"connect with flexible based sync config", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"connect with flexible based sync config", | |
"connect with flexible sync config", |
What, How & Why?
Implementation of the protocol updates to start/cancel the PBS->FLX client migration. The protocol version will not be bumped to v8 until the end of the project.
Fixes #6341
☑️ ToDos