fix(mempool): Fix data race when rechecking with async ABCI client #2268
Conversation
Force-pushed from 22ec65a to d9af5d0.
thank you
Left some comments, some are just me thinking aloud.
I am not sure that forcing a maximum delay for ReCheckTx to return is the right approach here. Is it possible to come up with a different (asynchronous) solution?
}

// This test used to cause a data race when rechecking (see https://github.com/cometbft/cometbft/issues/1827).
func TestMempoolRecheckRace(t *testing.T) {
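For context, the class of race the linked issue describes comes down to one goroutine (the async ABCI client's response callback) mutating the recheck cursor while another goroutine reads it. Below is a stripped-down toy sketch of that shape, with illustrative names rather than CometBFT's actual types, that `go test -race` will flag:

```go
package mempoolsketch

import (
	"sync"
	"testing"
)

// toyMempool models only the shared state the original race is about: a
// recheck cursor that the async ABCI response callback advances while other
// code paths (e.g. a new CheckTx) inspect it. Names are illustrative, not
// CometBFT's actual types.
type toyMempool struct {
	recheckCursor int
}

// resCbRecheck stands in for the response callback run on the ABCI client's
// goroutine; it advances the cursor without any synchronization.
func (m *toyMempool) resCbRecheck() { m.recheckCursor++ }

// rechecking stands in for the check done when a new CheckTx response arrives.
func (m *toyMempool) rechecking() bool { return m.recheckCursor != 0 }

// TestRecheckCursorRaceSketch runs both accesses concurrently, the way an
// asynchronous ABCI client does; `go test -race` reports the unsynchronized
// read/write on recheckCursor.
func TestRecheckCursorRaceSketch(t *testing.T) {
	m := &toyMempool{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			m.resCbRecheck()
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = m.rechecking()
		}
	}()
	wg.Wait()
}
```

Running it with `go test -race` makes the detector report the conflicting accesses on `recheckCursor`.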
This test passes with the original code (in `main`).
I think it should break there, and pass here.
@@ -649,29 +610,139 @@ func (mem *CListMempool) Update(
	return nil
}

// recheckTxs sends all transactions in the mempool to the app for re-validation. When the function
// returns, all recheck responses from the app have been processed.
This is not about this PR, but blocking here is terrible. Why do we need to conclude the recheck before moving on? And, if this is indeed the case, why do we use `CheckTxAsync`, which is expected to be... asynchronous?
It's not blocking, that's why there is a timeout, so that rechecking can finish even when not all the CheckTx responses have arrived.
Remember that recheck happens after the mempool has been updated. And once rechecking finishes, the `updateMtx` lock is released and adding new transactions via `CListMempool.CheckTx` is possible again.
Sorry, ignore the above where I say it's non-blocking. Rechecking needs to block waiting for all recheck responses because it needs to finish before new transactions are allowed to be checked again. This is to avoid breaking the sequential order when rechecking transactions.
Bear in mind that, even if we wait here until all TXs have been re-checked, it still makes sense that `CheckTx` is async: in non-local ABCI clients we will pipeline this:
- Request serialization, request transmission (over TCP or gRPC), request deserialization at the app, etc.

If we make `CheckTx` synchronous, all of that has to happen for one TX before we start with the next. This may make no difference with the local client, but it is likely to have a high performance impact over a socket or gRPC.
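To make the pipelining point concrete, here is a rough sketch of the flow being discussed, written against a stand-in client interface rather than the real `abcicli` API (the method names `CheckTxAsync` and `FlushAsync` are modeled on this thread and are not guaranteed to match the actual signatures):

```go
package mempoolsketch

import "time"

// abciClient is a stand-in for the asynchronous ABCI client; the method names
// are modeled on this discussion, not on the real abcicli interface.
type abciClient interface {
	// CheckTxAsync enqueues a CheckTx request; the response arrives later via
	// the client's callback goroutine.
	CheckTxAsync(tx []byte)
	// FlushAsync asks the client to push all queued requests to the app.
	FlushAsync()
}

// recheckTxsSketch shows the pipelined shape described above: every request is
// sent without waiting for its response, and only at the very end do we block
// until the callback reports completion or a timeout fires, so that a late or
// lost response cannot wedge Update forever.
func recheckTxsSketch(client abciClient, txs [][]byte, done <-chan struct{}, timeout time.Duration) {
	for _, tx := range txs {
		client.CheckTxAsync(tx) // requests pipeline through serialization, transport and the app
	}
	client.FlushAsync()

	select {
	case <-done: // closed by the callback once the last recheck response is processed
	case <-time.After(timeout): // give up on late responses and finalize rechecking anyway
	}
}
```

With a synchronous client the loop body would instead wait for each response before sending the next request, which is exactly the per-TX round trip the comment above is warning about.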
mem.recheckCursor = mem.txs.Front()
mem.recheckEnd = mem.txs.Back()
mem.recheck.init(mem.txs.Front(), mem.txs.Back())
Sorry, I forgot why we need to keep track of `mem.txs.Back()`. I imagine it's to avoid re-checking newly received TXs while running `recheckTxs()`.
However, haven't we locked the mempool so that this doesn't happen?
This is part of the original recheck logic and it's to check if `recheck.cursor` reached the end of the list, that is, when `recheck.cursor == recheck.end`. You're right that the mempool is locked, so the list doesn't change during one rechecking process, but I think it's better to store the last value in the `recheck` struct so the whole recheck logic is isolated.
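For readers following along, here is a simplified sketch of what isolating this bookkeeping in a `recheck`-style struct might look like. The field and method names (`cursor`, `end`, `init`, `step`, `done`) follow the discussion above and are not necessarily the merged implementation:

```go
package mempoolsketch

import "sync"

// element stands in for the mempool's concurrent-list element type
// (*clist.CElement in CometBFT); it is kept abstract so the sketch is
// self-contained.
type element struct {
	next *element
}

// recheckSketch groups the bookkeeping that used to live as loose fields on
// the mempool (recheckCursor, recheckEnd), so that "are we still rechecking?"
// is answered in exactly one place.
type recheckSketch struct {
	mtx    sync.Mutex
	cursor *element      // next tx whose recheck response we are waiting for
	end    *element      // last element snapshotted when rechecking started
	doneCh chan struct{} // closed once cursor has moved past end
}

// init snapshots the front and back of the tx list at the start of a recheck.
func (r *recheckSketch) init(front, back *element) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.cursor = front
	r.end = back
	r.doneCh = make(chan struct{})
}

// step is called by the CheckTx response callback for each recheck response.
func (r *recheckSketch) step() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cursor == nil {
		return // late response after rechecking already finished: ignore it
	}
	if r.cursor == r.end {
		r.cursor = nil
		close(r.doneCh) // tell recheckTxs that all responses were processed
		return
	}
	r.cursor = r.cursor.next
}

// done reports whether rechecking has finished (or never started).
func (r *recheckSketch) done() bool {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	return r.cursor == nil
}
```

Keeping `end` inside the struct, as discussed, means the termination condition and the "ignore late responses" rule live next to each other instead of being spread over the mempool's callbacks.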
…2268)

Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives and the `CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this.

1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk of blocking the node forever waiting for a response.

With the proposed changes, when we reach the end of the rechecking process, all requests and responses will have been processed or discarded, and `recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>

(cherry picked from commit f3775f4)
…2268)

(cherry picked from commit f3775f4)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md
#	config/toml.go
#	docs/references/config/config.toml.md
#	mempool/clist_mempool.go
#	mempool/clist_mempool_test.go
…ackport #2268) (#3019)

This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
…ackport #2268) (#3020)

This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
Co-authored-by: hvanz <hernan.vanzetto@gmail.com>
…lient (backport cometbft#2268) (cometbft#3020)" This reverts commit 9ccdb9b.