fix(mempool): Fix data race when rechecking with async ABCI client #2268

hvanz · 2024-02-08T09:51:14Z

Fixes ~~#2225 and~~ #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the CheckTxAsync request for the last transaction fails, the resCbRecheck callback on the response is never called, and the recheck variables are left in an inconsistent state (recheckCursor != nil, meaning that recheck has not finished). This causes a panic the next time a new transaction arrives and its CheckTx response finds that rechecking has not finished.

This problem only happens when using the non-local ABCI client, where CheckTx responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this.

1. When we call CheckTxAsync, block waiting for a response. If the response never arrives, it will block Update forever.
2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it.

This PR implements option 2, as we cannot risk blocking the node forever while waiting for a response.

With the proposed changes, when we reach the end of the rechecking process, all requests and responses will have been processed or discarded, and recheckCursor will always be nil.

This PR also:

- refactors all recheck logic to put it into a separate recheck struct. The fix to the bug described above is the only change in the recheck logic.
- adds 4 new tests.

PR checklist

Tests written/updated
Changelog entry added in .changelog (we use unclog to manage our changelog)
Updated relevant documentation (docs/ or spec/) and code comments
Title follows the Conventional Commits spec

faddat · 2024-02-10T10:23:26Z

thank you

cason

Left some comments, some are just me thinking aloud.

I am not sure that forcing a maximum delay for ReCheckTx to return is the right approach here. Is it possible to come up with a different (asynchronous) solution?

cason · 2024-04-17T07:18:38Z

mempool/clist_mempool_test.go

+}
+
+// This test used to cause a data race when rechecking (see https://github.com/cometbft/cometbft/issues/1827).
+func TestMempoolRecheckRace(t *testing.T) {


This test passes with the original code (in main).

I think it should break there, and pass here.

cason · 2024-04-17T07:24:55Z

mempool/clist_mempool.go

@@ -649,29 +610,139 @@ func (mem *CListMempool) Update(
 	return nil
 }

+// recheckTxs sends all transactions in the mempool to the app for re-validation. When the function
+// returns, all recheck responses from the app have been processed.


This is not about this PR, but blocking here is terrible. Why do we need to conclude the recheck before moving on? And, if this is indeed the case, why do we use CheckTxAsync, which is expected to be... asynchronous?

~~It's not blocking, that's why there is a timeout, so that rechecking can finish even when not all the CheckTx responses have arrived.~~

Remember that recheck happens after the mempool has been updated. And once rechecking finishes then the updateMtx lock is released and adding new transactions via CListMempool.CheckTx is possible again.

Sorry, ignore above when I say it's non-blocking. Rechecking needs to block waiting for all reCheckTx responses because it needs to finish before new transactions are allowed to be checked again. This is to not break the sequential order when rechecking transactions.

Bear in mind that, even if we wait here until all TXs have been re-Checked, it still makes sense that CheckTx is async: in non-local ABCI clients we will pipeline this:

- request serialization,
- request transmission (over TCP or gRPC),
- request deserialization at the app, etc.

If we make CheckTx synchronous, all of that will have to complete for one TX before we start with the next. This may make no difference with the local client, but is likely to have a high performance impact over a socket or gRPC.

sergio-mena · 2024-04-30T10:19:48Z

mempool/clist_mempool.go


-	mem.recheckCursor = mem.txs.Front()
-	mem.recheckEnd = mem.txs.Back()
+	mem.recheck.init(mem.txs.Front(), mem.txs.Back())


Sorry, I forgot why we need to keep track of mem.txs.Back(). I imagine it's to avoid re-checking newly received TXs while running recheckTxs().
However, haven't we locked the mempool so that this doesn't happen?

This is part of the original recheck logic and it's to check if recheck.cursor reached the end of the list, that is, when recheck.cursor == recheck.end. You're right that the mempool is locked, so the list doesn't change during one rechecking process, but I think it's better to store the last value in the Recheck struct so the whole recheck logic is isolated.

…2268)

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)

…2268)

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)

# Conflicts:
# .changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md
# config/toml.go
# docs/references/config/config.toml.md
# mempool/clist_mempool.go
# mempool/clist_mempool_test.go

…ackport #2268) (#3019)

This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>

…ackport #2268) (#3020)

This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
Co-authored-by: hvanz <hernan.vanzetto@gmail.com>

…lient (backport cometbft#2268) (cometbft#3020)" This reverts commit 9ccdb9b.

hvanz added mempool backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x backport-to-v1.x Tell Mergify to backport the PR to v1.x labels Feb 8, 2024

hvanz self-assigned this Feb 8, 2024

hvanz linked an issue Feb 8, 2024 that may be closed by this pull request

mempool: Data race when rechecking with socket connection #1827

Closed

hvanz added 6 commits February 8, 2024 17:34

refactor resCbFirstTime and resCbRecheck

f88b93d

Fix and refactor recheck

b9c4f8d

fix typos

3ec6ed5

fix lints

886a346

comments

b97c975

Don't log the whole tx

d9af5d0

hvanz force-pushed the hvanz/mempool-fix-recheck-2225-1827 branch from 22ec65a to d9af5d0 Compare February 8, 2024 16:47

hvanz changed the base branch from main to hvanz/mempool-refactor-callbacks February 8, 2024 16:48

rename setDone; comments

db14f51

hvanz added the bug Something isn't working label Feb 9, 2024

Base automatically changed from hvanz/mempool-refactor-callbacks to main February 9, 2024 08:04

hvanz removed the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Feb 9, 2024

hvanz added the wip Work in progress label Feb 14, 2024

hvanz added this to the 2024-Q2 milestone Apr 4, 2024

hvanz and others added 2 commits April 4, 2024 13:10

Merge branch 'main' into hvanz/mempool-fix-recheck-2225-1827

17fae3d

Merge branch 'main' into hvanz/mempool-fix-recheck-2225-1827

0bd0747

adizere marked this pull request as ready for review April 15, 2024 14:27

adizere requested a review from a team as a code owner April 15, 2024 14:27

adizere requested a review from a team April 15, 2024 14:27

adizere assigned cason Apr 15, 2024

adizere mentioned this pull request Apr 15, 2024

Mempool Lanes: introduce QoS to the mempool [tracking issue] #2803

Closed

42 tasks

andynog added 2 commits April 16, 2024 14:46

Merge branch 'main' into hvanz/mempool-fix-recheck-2225-1827

cab4b5b

Merge branch 'main' into hvanz/mempool-fix-recheck-2225-1827

681abf3

cason reviewed Apr 17, 2024

melekes approved these changes Apr 29, 2024

andynog mentioned this pull request Apr 29, 2024

Mempool Rechecking all txs blocks consensus #2925

Open

Add changelog for bug fix

e2c73fa