-
Notifications
You must be signed in to change notification settings - Fork 178
Reset client reset cycle detection after upload/download complete #6196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@ironage / @michael-wb , I think this is ready for a look now. |
@@ -2181,6 +2221,10 @@ std::error_code Session::receive_ident_message(SaltedFileIdent client_file_ident | |||
REALM_ASSERT_EX(m_last_version_selected_for_upload == 0, m_last_version_selected_for_upload); | |||
|
|||
get_transact_reporter()->report_sync_transact(client_reset_old_version, client_reset_new_version); | |||
|
|||
if (has_pending_client_reset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is correct to clear the pending reset state here because the reset has just finished locally and it hasn't received any acknowledgement from the server yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't necessarily clear anything, it schedules the async waiting for the server to ack all uploads and tell the client when there are no more downloads, which will then clear the reset state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, the reason we check if there's a pending client reset and then only call this if there is one is so that we don't need to open a separate read transaction after calling get_status()
if there are no client resets to do this dance with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it thanks! I hadn't understood that async_wait_for
is initiating an upload/download completion handler.
auto ft = m_db->start_frozen(); | ||
return _impl::client_reset::has_pending_reset(ft); | ||
}(); | ||
REALM_ASSERT(pending_reset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it safer to just return early if there is no pending reset?
This flow feels racy because we check for pending reset state in a read transaction and then clear it on a separate write transaction. Are we allowed to assume this here because of the single threaded nature of the sync client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda think it's equally safe to check pre-conditions vs returning early if pre-conditions aren't met?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm ok. Can we be certain that no other writer can clear the pending reset between when it gets written and when this frozen read transaction starts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the places we call this function we can guarantee it because we always call it after checking pre-conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this!
Just needs a bugfix entry in the changelog, otherwise LGTM! 👍
m_sess->logger.info("Tracking pending client reset of type \"%1\" from %2", pending_reset->type, | ||
pending_reset->time); | ||
util::bind_ptr<SessionWrapper> self(this); | ||
async_wait_for(true, true, [self = std::move(self), pending_reset = *pending_reset](std::error_code ec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's a problem if more than one async_wait_for()
calls are waiting at a time to clear the client reset? Such as if process_pending_flx_bootstrap()
throws an exception and is run again or if a connection drops after receiving the ident message and then Session::receive_ident_message()
is called again when the client reconnects.
Probably it won't be a problem since the write transaction will serialize the calls to remove_pending_client_resets()
and this function is a noop if there is nothing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at worst it will be a no-op. We used to always remove the pending client reset whenever we got a download that advanced the upload cursor. so this should be no worse than that? I've added some extra guards and logging so that if we end up with multiple callbacks from async_wait_for, only the one that matches the pending reset info that kicked off the waiting for actually triggers removing the pending resets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Walked through the flow and didn't see any issues. LGTM
src/realm/sync/client.cpp
Outdated
auto cur_pending_reset = _impl::client_reset::has_pending_reset(wt); | ||
if (!cur_pending_reset) { | ||
logger.debug( | ||
"Was going to remove client reset tracker for type \"%1\" from %2, but it was already removed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing values for %1
and %2
- assuming this is supposed to be pending_reset.type, pending_reset.time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
What, How & Why?
When we start a client reset with recovery we insert an object into the realm to track whether we've already started a client reset with recovery so that if the client reset fails (say because we recover local changes that cannot be applied to the server) we don't end up in an endless loop. Previously we cleared this tracking tombstone when we received a download message that advanced the upload progress - the idea being that if the server acknowledged our client reset with a download ack then the recovered changes must have been valid.
It turns out that if client reset with recovery doesn't actually recover anything, then there's nothing for the server to acknowledge and we never clear that tracking tombstone, even though everything succeeded. If there is another client reset, then we'll enter into a cycle where the sync client thinks there's a pending client reset.
This change moves where we remove the tombstone from just whenever we receive a download with an advancing upload cursor to after we've received an upload/download ACK (i.e. a download message with an advancing upload cursor and/or a MARK message indicating the server has nothing else to send us).
I also fixed the connection-level error handling so it respects the retry-timeout info sent from the server rather than a hardcoded 5 minute timeout, and unified the try again backoff logic into a helper class.
☑️ ToDos
* [ ] C-API, if public C++ API changed.