8000 fix(mempool): Fix data race when rechecking with async ABCI client by hvanz · Pull Request #2268 · cometbft/cometbft · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

fix(mempool): Fix data race when rechecking with async ABCI client #2268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
May 7, 2024

Conversation

hvanz
Copy link
Member
@hvanz hvanz commented Feb 8, 2024

Fixes #2225 and #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the CheckTxAsync request for the last transaction fails, then the resCbRecheck callback on the response is not called, and the recheck variables end up in a wrong state (recheckCursor != nil, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the CheckTx response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where CheckTx responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this.

  1. When we call CheckTxAsync, block waiting for a response. If the response never arrives, it will block Update forever.
  2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and recheckCursor will always be nil.

This PR also:

  • refactors all recheck logic to put it into a separate recheck struct. The fix to the bug described above is the only change in the recheck logic.
  • adds 4 new tests.

PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@hvanz hvanz added mempool backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x backport-to-v1.x Tell Mergify to backport the PR to v1.x labels Feb 8, 2024
@hvanz hvanz self-assigned this Feb 8, 2024
@hvanz hvanz linked an issue Feb 8, 2024 that may be closed by this pull request
@hvanz hvanz force-pushed the hvanz/mempool-fix-recheck-2225-1827 branch from 22ec65a to d9af5d0 Compare February 8, 2024 16:47
@hvanz hvanz changed the base branch from main to hvanz/mempool-refactor-callbacks February 8, 2024 16:48
@hvanz hvanz added the bug Something isn't working label Feb 9, 2024
Base automatically changed from hvanz/mempool-refactor-callbacks to main February 9, 2024 08:04
@hvanz hvanz removed the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Feb 9, 2024
@faddat
Copy link
Contributor
faddat commented Feb 10, 2024

thank you

@hvanz hvanz added the wip Work in progress label Feb 14, 2024
@hvanz hvanz added this to the 2024-Q2 milestone Apr 4, 2024
@adizere adizere marked this pull request as ready for review April 15, 2024 14:27
@adizere adizere requested review from a team as code owners April 15, 2024 14:27
Copy link
Contributor
@cason cason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, some are just me thinking aloud.

I am not sure that forcing a maximum delay for ReCheckTx to return is the right approach here. Is is possible to come up with a different (asynchronous) solution?

}

// This test used to cause a data race when rechecking (see https://github.com/cometbft/cometbft/issues/1827).
func TestMempoolRecheckRace(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test passes with the original code (in main).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should break there, and pass here.

@@ -649,29 +610,139 @@ func (mem *CListMempool) Update(
return nil
}

// recheckTxs sends all transactions in the 8000 mempool to the app for re-validation. When the function
// returns, all recheck responses from the app have been processed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about this PR, but blocking here is terrible. Why we need to conclude the recheck before move on? And, if this is indeed the case, why we use CheckTxAsync, expected to be... asynchronous.

Copy link
Member Author
@hvanz hvanz Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not blocking, that's why there is a timeout, so that rechecking can finish even when not all the CheckTx responses have arrived.

Remember that recheck happens after the mempool has been updated. And once rechecking finishes then the updateMtx lock is released and adding new transactions via CListMempool.CheckTx is possible again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, ignore above when I say it's non-blocking. Rechecking needs to block waiting for all reCheckTx responses because it needs to finish before new transactions are allowed to be checked again. This is to not break the sequential order when rechecking transactions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bear in mind that, even if we wait here until all TXs have been re-Checked, it still makes sense that CheckTx is async: in non-local ABCI clients we will pipeline this:

  • Request serialization, request transmission (over TCP or gRPC), Request deserialization at the app, etc
    If we make CheckTx synchronous, all that will have to happen for one TX until we start with the next. This may make no difference with local client, but is likely to have a high performance impact over a socket or gRPC


mem.recheckCursor = mem.txs.Front()
mem.recheckEnd = mem.txs.Back()
mem.recheck.init(mem.txs.Front(), mem.txs.Back())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot why we need to keep track of mem.txs.Back(). I imagine it's to avoid re-checking newly received TXs while running recheckTxs().
However, haven't we locked the mempool so that this doesn't happen?

Copy link
Member Author
@hvanz hvanz May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the original recheck logic and it's to check if recheck.cursor reached the end of the list, that is, when recheck.cursor == recheck.end. You're right that the mempool is locked, so the list doesn't change during one rechecking process, but I think it's better to store the last value in the Recheck struct so the whole recheck logic is isolated.

@hvanz hvanz unassigned cason May 1, 2024
@hvanz hvanz added this pull request to the merge queue May 7, 2024
Merged via the queue into main with commit f3775f4 May 7, 2024
39 checks passed
@hvanz hvanz deleted the hvanz/mempool-fix-recheck-2225-1827 branch May 7, 2024 06:40
mergify bot pushed a commit that referenced this pull request May 7, 2024
…2268)

Fixes ~~#2225 and~~ #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)
mergify bot pushed a commit that referenced this pull request May 7, 2024
…2268)

Fixes ~~#2225 and~~ #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md
#	config/toml.go
#	docs/references/config/config.toml.md
#	mempool/clist_mempool.go
#	mempool/clist_mempool_test.go
hvanz added a commit that referenced this pull request May 7, 2024
…ackport #2268) (#3019)

Fixes ~~#2225 and~~ #1827 

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [X] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2268 done by
[Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
hvanz added a commit that referenced this pull request May 7, 2024
…ackport #2268) (#3020)

Fixes ~~#2225 and~~ #1827 

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [X] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2268 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
Co-authored-by: hvanz <hernan.vanzetto@gmail.com>
dudong2 added a commit to b-harvest/cometbft that referenced this pull request Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x backport-to-v1.x Tell Mergify to backport the PR to v1.x bug Something isn't working mempool
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

mempool: Data race when rechecking with socket connection
7 participants
0