mempool: Handle concurrent requests in recheck callback by hvanz · Pull Request #895 · cometbft/cometbft

mempool: Handle concurrent requests in recheck callback #895


Closed
wants to merge 12 commits

Conversation

hvanz
Member
@hvanz hvanz commented May 30, 2023

Addresses tendermint/tendermint#5519

Also, this PR simplifies the implementation of the recheck callback by eliminating the global pointers recheckCursor and recheckEnd, which were used to traverse the list of transactions while processing recheck requests. These variables were also used in other parts of the code to decide whether there were still unprocessed recheck requests. This PR replaces them with the existing txsMap, which maps tx keys to tx entries, and with a new counter variable that keeps track of how many requests are still to be processed.

Previously it was assumed that re-CheckTx requests were processed and handled sequentially, but this does not hold for gRPC ABCI clients (see comments in tendermint/tendermint#5519).
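Roughly, the idea can be pictured like the sketch below. This is illustrative only, with made-up names (recheckState, txEntry, onRecheckResponse), not the actual code in this PR; it just shows how a key-to-entry map plus a counter can tolerate out-of-order responses where cursor pointers could not.

```go
package mempoolsketch

import "sync"

// Illustrative sketch: pending rechecks are tracked with the existing
// key->entry map plus a counter, instead of recheckCursor/recheckEnd
// pointers that assumed in-order responses.
type txKey [32]byte

type txEntry struct {
	tx []byte
	// ... other per-transaction bookkeeping
}

type recheckState struct {
	mtx        sync.Mutex
	txsMap     map[txKey]*txEntry // key -> entry, already maintained by the mempool
	numPending int                // re-CheckTx responses still outstanding
}

// onRecheckResponse handles one re-CheckTx response. The entry is looked up
// by key, so responses may arrive in any order, as they can with the gRPC
// ABCI client.
func (s *recheckState) onRecheckResponse(key txKey, valid bool, removeTx func(*txEntry)) {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	if e, ok := s.txsMap[key]; ok && !valid {
		removeTx(e)
		delete(s.txsMap, key)
	}
	s.numPending--
}

// recheckDone reports whether all outstanding rechecks have been processed,
// replacing the cursor-reached-end check.
func (s *recheckState) recheckDone() bool {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return s.numPending == 0
}
```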

Built on top of #894


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments

@hvanz hvanz added mempool wip Work in progress labels May 30, 2023
@hvanz hvanz self-assigned this May 30, 2023
@otrack
Contributor
otrack commented May 31, 2023

Handling ABCI calls in order at the client is mandatory. If not, a FlushAppConn operation may return earlier than a prior resCbFirstTime or resCbRecheck call. It follows that Update may execute concurrently with such a (re-)check operation. This breaks some base invariants in clist_mempool.go and/or may raise exceptions (e.g., removing the same element twice in CList).
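To picture the invariant at stake, here is a hypothetical guard; the names are invented and this is not how CList is actually implemented. It only illustrates that if Update and a late recheck callback could race, a second removal of the same element must become a no-op rather than a panic:

```go
package mempoolsketch

import "sync"

// listEntry is a stand-in for a list element that two code paths
// (Update and a recheck callback) might both try to remove.
type listEntry struct {
	mtx     sync.Mutex
	removed bool
}

// removeOnce runs detach at most once; the loser of the race simply
// observes that the entry is already gone.
func (e *listEntry) removeOnce(detach func()) bool {
	e.mtx.Lock()
	defer e.mtx.Unlock()
	if e.removed {
		return false
	}
	e.removed = true
	detach()
	return true
}
```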

In my view, the gRPC ABCI client does provide the desired in-order semantics.

I would not add more concurrency to CListMempool, because it seems there is no need for it performance-wise.
In fact, I would do the opposite, that is, add locking, typically in the callbacks. In its current state, this class works because it relies heavily on (partly missing) assumptions with respect to consensus and the client application.

Nevertheless, the idea of removing the cursor is, I believe, a nice one, because it does simplify the codebase :)

@hvanz hvanz added the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Jun 5, 2023
@hvanz hvanz marked this pull request as ready for review June 5, 2023 17:43
@hvanz hvanz requested a review from a team as a code owner June 5, 2023 17:43
@hvanz hvanz removed the wip Work in progress label Jun 5, 2023
@hvanz
Member Author
hvanz commented Jun 5, 2023

It seems that the ABCI Flush method is not relevant at all (see tendermint/tendermint#6994). While the client interface was being simplified for v0.36, it was proposed to remove it (see comments in tendermint/tendermint#7607), but that never happened. The ABCI client interface has a comment saying that this method should be removed as it is not implemented. And even the SDK does not implement a handler for Flush requests.

@sergio-mena
Contributor

I didn't go through the patch thoroughly, but I have some comments:

  • I feel that we shouldn't backport this to v0.38.x: the QA has finished, so this is not covered by those tests (it will be as part of the v0.39.x QA). These changes are far from trivial and change the concurrency in the mempool: an extra reason not to backport them.
  • Some info on some of the comments above
    • the "official" reason for having the Flush ABCI call is in this section of the spec. Now, I am aware that the semantics we provide for each of the clients (socket, gRPC, local) are currently different (actually, inconsistent). But I still think Flush makes sense in the context of socket and gRPC. We can discuss this synchronously if you think it'd help
    • The SDK does not implement Flush probably because it is irrelevant for a local client (the SDK uses the local client, whereby ABCI calls become function calls, so there is nothing to flush). It's a different story for remote clients (gRPC and socket)

@hvanz hvanz removed the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Jun 8, 2023
@hvanz
Member Author
hvanz commented Jun 8, 2023

> I feel that we shouldn't backport this to v0.38.x:

You're right. I removed the label.

@hvanz
Member Author
hvanz commented Jun 8, 2023
> • the "official" reason for having the Flush ABCI call is in this section of the spec.

This is what the spec says:

> Before invoking Commit, CometBFT locks the mempool and flushes the mempool connection. This ensures that no new messages will be received on the mempool connection during this processing step, providing an opportunity to safely update all four connection states to the latest committed state at the same time.

I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?
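For context, here is a minimal sketch of the sequence the spec describes, against a simplified hypothetical interface (not CometBFT's exact API); the question above is whether the FlushAppConn step is strictly necessary on top of the lock:

```go
package mempoolsketch

// mempoolLike is a hypothetical, simplified view of the mempool used only
// for this sketch. Lock() stops new CheckTx requests from entering the
// mempool; FlushAppConn() additionally waits until the application has
// answered every request already in flight, so no late callback can run
// while Update rewrites the mempool state.
type mempoolLike interface {
	Lock()
	Unlock()
	FlushAppConn() error
	Update(height int64, committedTxs [][]byte) error
}

func lockFlushUpdate(mem mempoolLike, height int64, committedTxs [][]byte) error {
	mem.Lock()
	defer mem.Unlock()

	if err := mem.FlushAppConn(); err != nil { // drain in-flight ABCI requests
		return err
	}
	// ... Commit is invoked on the application here ...
	return mem.Update(height, committedTxs)
}
```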

@otrack
Contributor
otrack commented Jun 8, 2023
> I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

I believe that this is needed, otherwise we might end up with two concurrent remove calls for the same transaction on the txs field, leading to a panic. I will add a test based on your PR which exercises this.

To avoid the above problem, I am advocating taking locks in the callbacks. However, this poses a problem because the local client hangs: the local client is synchronous and locks are not re-entrant in Go. We should change this pattern because this client is not doing what it is supposed to. However, as the SDK is using it, I wonder how tractable such a change is.
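A hypothetical illustration of that re-entrancy problem (names invented for the sketch): with an in-process client, the response callback runs synchronously on the same goroutine that issued the request, so taking the mempool lock again deadlocks.

```go
package mempoolsketch

import "sync"

// lockedMempool stands in for a mempool whose callbacks take the lock.
type lockedMempool struct {
	mtx sync.Mutex
}

// checkTxLocal shows why locking inside the callback hangs with a local
// (in-process) ABCI client: the callback is invoked before callApp returns,
// on the same goroutine, and sync.Mutex is not re-entrant.
func (m *lockedMempool) checkTxLocal(tx []byte, callApp func(tx []byte, onResponse func())) {
	m.mtx.Lock()
	defer m.mtx.Unlock()

	callApp(tx, func() {
		m.mtx.Lock() // deadlock: this goroutine already holds m.mtx
		defer m.mtx.Unlock()
		// ... record the CheckTx result in the mempool ...
	})
}
```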

@otrack
Contributor
otrack commented Jun 8, 2023

A bad interleaving that explains why we need flush (at least for the moment) is here.

@otrack
Contributor
otrack commented Jun 9, 2023

Just a few more thoughts about improving the parallelism by allowing concurrent CheckTx.

It seems that there is no easy way to do this. The knot of the problem is this strange pattern in the mempool where one passes a callback to another callback.

Maybe the best approach would be to have a re-entrant read/write lock for the CListMempool. Another idea is to have AppConnMempool::CheckTxAsync take as a parameter the callback to invoke once the application answers.
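A sketch of the second idea, assuming such a per-request-callback API existed (this is not the current AppConnMempool interface): each CheckTxAsync call carries its own response callback, so per-request context such as the sender travels with the request instead of being recovered through a global callback.

```go
package mempoolsketch

// checkTxResult is a simplified stand-in for the ABCI CheckTx response.
type checkTxResult struct {
	Code uint32 // 0 means the application accepted the tx
}

// appConnMempool is a hypothetical variant of the mempool app connection
// where the callback is passed per request rather than registered globally.
type appConnMempool interface {
	CheckTxAsync(tx []byte, onResponse func(checkTxResult)) error
}

func checkTxWithSender(
	conn appConnMempool,
	tx []byte,
	sender string,
	addToMempool func(tx []byte, sender string),
) error {
	return conn.CheckTxAsync(tx, func(res checkTxResult) {
		if res.Code == 0 {
			addToMempool(tx, sender) // the sender is captured by this request's closure
		}
	})
}
```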

All in all, I am not sure this is a priority, given the speed of the current implementation: around 60µs to check a tx with the remote client.

@sergio-mena
Contributor

> I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

This has never been thoroughly discussed. Maybe we can include it in today's meeting?

@hvanz
Member Author
hvanz commented Jun 20, 2023

> It seems that there is no easy way to do this. The knot of the problem is this strange pattern in the mempool where one passes a callback to another callback.

@otrack Here's an idea to untangle this knot. The callback resCbFirstTime is implemented this way because it needs to record the sender when the transaction is added to the mempool. Each transaction is stored in a mempoolTx together with its senders. We can take the senders out of mempoolTx (and therefore out of txs) and store them in a new map txSenders from txKey to the list of senders. This map will need to be kept consistent with the transactions in the mempool. Then we could call resCbFirstTime from globalCb and take the senders from txSenders when we need to add the transaction to the mempool. See #1010 and #1032.
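A rough sketch of how that could look (illustrative only; names such as senderIndex are invented here and are not taken from the actual PRs):

```go
package mempoolsketch

import "sync"

// senderIndex keeps senders in their own map, keyed by transaction key,
// instead of inside each mempoolTx. Recording a sender then needs no access
// to the list entry, so the first-time CheckTx handling can be driven
// directly from the global callback.
type txKey [32]byte

type senderIndex struct {
	mtx     sync.Mutex
	senders map[txKey]map[string]struct{} // tx key -> set of peer IDs
}

func newSenderIndex() *senderIndex {
	return &senderIndex{senders: make(map[txKey]map[string]struct{})}
}

func (si *senderIndex) addSender(key txKey, peerID string) {
	si.mtx.Lock()
	defer si.mtx.Unlock()
	if si.senders[key] == nil {
		si.senders[key] = make(map[string]struct{})
	}
	si.senders[key][peerID] = struct{}{}
}

// removeTx must be called whenever the transaction leaves the mempool,
// to keep this index consistent with txsMap.
func (si *senderIndex) removeTx(key txKey) {
	si.mtx.Lock()
	defer si.mtx.Unlock()
	delete(si.senders, key)
}
```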

@hvanz hvanz marked this pull request as draft June 27, 2023 08:01
@hvanz hvanz added the wip Work in progress label Jun 27, 2023
@melekes
Contributor
melekes commented Jan 31, 2024

@hvanz very interesting 👍 any plans to resume this?

@hvanz
Member Author
hvanz commented Apr 26, 2024

Closing as it is currently not an option to break the FIFO ordering when checking transactions.

@hvanz hvanz closed this Apr 26, 2024
@hvanz hvanz removed the wip Work in progress label Apr 26, 2024
@hvanz hvanz deleted the hernan/mempool-concurrent-rechecks branch July 11, 2024 15:30