Mempool Rechecking all txs blocks consensus #2925
Comments
Some refactoring of the re-checking logic happened in PR #2268. It may not solve the problem from the perspective @ValarDragon reported, but I am adding it here for visibility. @hvanz might have a better opinion in this case. The mempool recheck can also be controlled with a configuration parameter
Thanks @ValarDragon for reporting this problem! I agree that rechecking should not block consensus, though it will still need to block the mempool from checking new incoming transactions. This is to avoid breaking the FIFO ordering when rechecking txs. Currently
Also, #2268 is about a corner case in the rechecking logic, not related to the current issue.
I think what we should do here is:
Should we add this to #2803?
This is a problem in itself; we will probably have to address it in another issue.
As commented in #3008, we cannot call While we can discuss whether this is the best way to go, this is the current contract with the ABCI application.
The block execution should trigger the execution of
First step to fixing #2925

PR'ing this to see if we have any test failures. Note that this is safe in the happy path, as Reap and CheckTx both share this same lock. The functionality behavior is that:

- Full nodes and non-proposers `timeout_prevote` beginning should not block on updating the mempool
- Block proposers get _very slight_ increased concurrency before reaping their next block. (Should be significantly fixed in subsequent PR's in #2925)
- Reap takes a lock on the mempool mutex, so there is no concurrency safety issues right now.
- Mempool errors will not halt consensus, instead they just log an error and call mempool flush. I actually think this may be better behavior? If we want to preserve the old behavior, we can thread a generic "consensus halt error" channel perhaps? I'm not sure how/where to best document this.

Please also let me know if tests need creating. Seems like the create empty block tests sometimes hit failures, I'll investigate tmrw

Also please feel free to take over this PR, just thought I'd make it to help us with performance improvements. Happy to get this into an experimental release to test on mainnets.

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Sergio Mena <sergio@informal.systems>
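To make the locking argument in this PR description concrete, here is a minimal sketch of the idea, not the code from the PR. `toyMempool`, `AsyncUpdate`, `updateLocked`, and the `fail` flag are hypothetical names invented for illustration. The sketch assumes the update takes the shared mutex before returning and releases it from a goroutine, which is what lets a later `Reap` see a consistent mempool without the caller waiting for the update. On error it logs and flushes rather than halting.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyMempool is a hypothetical stand-in for the real mempool. It models only
// the single mutex shared by CheckTx, Reap, and the post-commit update.
type toyMempool struct {
	mtx sync.Mutex
	txs []string
}

// CheckTx appends a tx, preserving FIFO order.
func (m *toyMempool) CheckTx(tx string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.txs = append(m.txs, tx)
}

// Reap returns a copy of up to limit txs from the front of the mempool.
func (m *toyMempool) Reap(limit int) []string {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if limit > len(m.txs) {
		limit = len(m.txs)
	}
	return append([]string(nil), m.txs[:limit]...)
}

// updateLocked drops committed txs. The caller must hold m.mtx. The fail flag
// exists only to simulate an update error in this sketch.
func (m *toyMempool) updateLocked(committed []string, fail bool) error {
	if fail {
		return errors.New("simulated update failure")
	}
	drop := make(map[string]bool, len(committed))
	for _, tx := range committed {
		drop[tx] = true
	}
	kept := m.txs[:0]
	for _, tx := range m.txs {
		if !drop[tx] {
			kept = append(kept, tx)
		}
	}
	m.txs = kept
	return nil
}

// AsyncUpdate takes the lock before returning and releases it from a goroutine
// once the update finishes, so the caller does not wait for the update itself.
func (m *toyMempool) AsyncUpdate(committed []string, fail bool) {
	m.mtx.Lock()
	go func() {
		defer m.mtx.Unlock()
		if err := m.updateLocked(committed, fail); err != nil {
			// Log and flush instead of halting the caller.
			fmt.Println("mempool update failed, flushing:", err)
			m.txs = nil
		}
	}()
}

func main() {
	m := &toyMempool{}
	for _, tx := range []string{"a", "b", "c"} {
		m.CheckTx(tx)
	}
	m.AsyncUpdate([]string{"a"}, false)
	fmt.Println(m.Reap(10)) // Reap waits on the shared lock until the update is done
}
```

Because the lock is acquired synchronously, any `Reap` or `CheckTx` issued after `AsyncUpdate` returns queues behind the update. This is the sense in which the change is safe in the happy path: a proposer that needs to reap still waits, while a node that does not touch the mempool proceeds immediately.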
There is an additional problem:
So, the mempool channel should not block (as in #2685) because we are waiting for something that may take a long time to finish. This is a problem in general. Then, the second problem: if re-check can be slow, it must not block consensus/block execution and the mempool.
…3362) This is an automatic backport of pull request #3008 done by [Mergify](https://mergify.com).
Feature Request
Summary
Rechecking all txs in the mempool seems to block consensus. We see this from:
This is problematic because larger mempools will delay consensus for longer. (Also, a change has crept into IBC that is making RecheckTx overly expensive.)
Here is a pprof from a live-syncing Osmosis full node over 1 hour, with relatively average tx volume:

We see that it blocks Commit right here: https://github.com/cometbft/cometbft/blob/main/state/execution.go#L419-L426
If you look into the relevant code, each recheck call is actually synchronous due to how the callbacks are structured.
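To illustrate the difference between synchronous and pipelined rechecks, here is a toy sketch. It does not use the real ABCI client; `request`, `app`, `recheckSync`, and `recheckPipelined` are hypothetical names, and the validity rule is made up. The synchronous variant pays one full round trip per tx before sending the next, while the pipelined variant sends everything first and then collects replies in FIFO order.

```go
package main

import "fmt"

// request models one hypothetical recheck round trip to the application.
type request struct {
	tx    string
	reply chan bool
}

// app is a hypothetical application that answers recheck requests in order.
func app(reqs <-chan request) {
	for r := range reqs {
		r.reply <- len(r.tx) > 0 // toy validity rule: non-empty txs are valid
	}
}

// recheckSync waits for each reply before sending the next request.
func recheckSync(reqs chan<- request, txs []string) []bool {
	out := make([]bool, 0, len(txs))
	for _, tx := range txs {
		reply := make(chan bool, 1)
		reqs <- request{tx: tx, reply: reply}
		out = append(out, <-reply)
	}
	return out
}

// recheckPipelined sends every request first, then collects the replies in
// the same FIFO order in which the requests were sent.
func recheckPipelined(reqs chan<- request, txs []string) []bool {
	replies := make([]chan bool, len(txs))
	for i, tx := range txs {
		replies[i] = make(chan bool, 1)
		reqs <- request{tx: tx, reply: replies[i]}
	}
	out := make([]bool, 0, len(txs))
	for _, reply := range replies {
		out = append(out, <-reply)
	}
	return out
}

func main() {
	reqs := make(chan request)
	go app(reqs)
	txs := []string{"a", "", "c"}
	fmt.Println(recheckSync(reqs, txs))
	fmt.Println(recheckPipelined(reqs, txs))
	close(reqs)
}
```

Both variants return the same results in the same order. The only difference is how much of the per-request latency overlaps, which matters more as the mempool grows.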
Problem Definition
We should make the mempool rechecking not block `BlockExecutor.ApplyBlock`. Ideally it should only be blocking `ProposeBlock` until either everything in the mempool is rechecked or `blockGas` worth of txs are rechecked. It should never be blocking for `timeout_prevote`.
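A rough sketch of the requested behavior, under stated assumptions: the recheck pass runs in the background, block execution never waits for it, and only the proposer waits, and only until the pass is finished or a gas budget is covered. The names `tx`, `recheck`, `startRecheck`, and `waitForProposal` are hypothetical and do not exist in CometBFT. The real mempool would also have to handle incoming txs and FIFO ordering, which this sketch ignores.

```go
package main

import (
	"fmt"
	"sync"
)

// tx is a hypothetical mempool entry with the gas it would consume.
type tx struct {
	id  string
	gas int64
}

// recheck tracks the progress of one background recheck pass.
type recheck struct {
	mtx   sync.Mutex
	cond  *sync.Cond
	valid []tx  // txs that passed recheck so far, in FIFO order
	gas   int64 // total gas of the txs in valid
	done  bool  // true once every pending tx has been rechecked
}

// startRecheck launches the pass in a goroutine and returns immediately, so
// the caller (block execution in this sketch) never waits for it.
func startRecheck(pending []tx, isValid func(tx) bool) *recheck {
	r := &recheck{}
	r.cond = sync.NewCond(&r.mtx)
	go func() {
		for _, t := range pending {
			ok := isValid(t) // the slow app call runs outside the lock
			r.mtx.Lock()
			if ok {
				r.valid = append(r.valid, t)
				r.gas += t.gas
			}
			r.mtx.Unlock()
			r.cond.Broadcast()
		}
		r.mtx.Lock()
		r.done = true
		r.mtx.Unlock()
		r.cond.Broadcast()
	}()
	return r
}

// waitForProposal blocks only until the pass is finished or at least blockGas
// worth of txs has been rechecked, then returns a copy of the valid txs.
func (r *recheck) waitForProposal(blockGas int64) []tx {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	for !r.done && r.gas < blockGas {
		r.cond.Wait()
	}
	return append([]tx(nil), r.valid...)
}

func main() {
	pending := []tx{{"a", 40}, {"b", 40}, {"c", 40}}
	r := startRecheck(pending, func(t tx) bool { return t.id != "b" })
	// The budget is never reached here, so this waits for the full pass.
	fmt.Println(r.waitForProposal(1000))
}
```

With a small budget, `waitForProposal` can return before the pass completes, which is the case that keeps a proposer from waiting on a very large mempool. Non-proposers never call it at all.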