8000 perf(mempool): disable rechecking when node is catching up by hvanz · Pull Request #3343 · cometbft/cometbft · GitHub

perf(mempool): disable rechecking when node is catching up #3343


Closed
wants to merge 9 commits into from

Conversation

hvanz
Member
@hvanz hvanz commented Jun 26, 2024

When CheckTx is slow (on the application side), a node may fail to catch up to the latest height, as seen in #2925. In particular, rechecking may take a long time to finish, blocking the mempool from receiving new transactions and never allowing the node to catch up. #3008 improved the performance, but we can still reproduce the problem with this e2e testnet.

This PR improves the performance of nodes that are lagging behind by temporarily disabling rechecking. A node is considered to be lagging behind when at least one of its peers is at a later height. In the code, "later" means a difference of at least two blocks. Each node keeps track of the latest heights of its peers, which send their height in proposal and vote messages.

If a node is not at the latest height, I consider it safe not to recheck transactions for some blocks, until the node catches up. The transactions in its mempool are either old transactions, which were already included in a committed block by some other node, or new transactions received via RPC, meaning that most probably no other node has them. The node won't be able to create a block with the new transactions until it catches up, but it can disseminate them to be committed by other validators. Also, the node won't receive new transactions from peers, because nodes don't gossip transactions to peers that are lagging behind (code, RFC-103).

Might be related to #3340
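The catching-up check described above can be sketched as a small helper. This is an illustrative sketch only: `isLaggingBehind` and its parameters are hypothetical names, not the PR's actual code; only the threshold of 2 blocks comes from the PR.

```go
package main

import "fmt"

// maxHeightDiff mirrors the PR's hard-coded threshold of 2 blocks.
const maxHeightDiff = 2

// isLaggingBehind reports whether any peer is at least maxHeightDiff
// blocks ahead of our own height. Peer heights are learned from
// proposal and vote messages, as described above.
func isLaggingBehind(ourHeight int64, peerHeights []int64) bool {
	for _, h := range peerHeights {
		if h >= ourHeight+maxHeightDiff {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isLaggingBehind(10, []int64{11, 13})) // a peer is 3 blocks ahead
	fmt.Println(isLaggingBehind(10, []int64{10, 11})) // no peer is 2+ blocks ahead
}
```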


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@hvanz hvanz added the mempool label Jun 26, 2024
@hvanz hvanz self-assigned this Jun 26, 2024
@hvanz hvanz changed the title perf: disable recheck when node is catching up perf(mempool)!: disable recheck when node is catching up Jun 26, 2024
@hvanz hvanz marked this pull request as ready for review June 26, 2024 15:06
@hvanz hvanz requested review from a team as code owners June 26, 2024 15:06
cason
cason previously requested changes Jun 26, 2024
Contributor
@cason cason left a comment


I disagree with this approach.

We are adding redundant information to the peer key-value store; it is already there, produced by the consensus reactor and read by the mempool and evidence reactors. See details here: https://github.com/cometbft/cometbft/blob/main/spec/p2p/reactor-api/p2p-api.md#key-value-store

Contributor


All routines contain:

ps, ok := e.Src.Get(types.PeerStateKey).(*PeerState)

This is the peer state known by the node, which is stored in the Peer object. The last known height is just ps.PRS.Height. No need for new data structures to keep this information.
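The lookup the comment describes might look like the following minimal sketch. The types here are simplified stand-ins for illustration (the real `PeerState` lives in the consensus reactor, and `Peer` exposes a key-value store); only the `Get(types.PeerStateKey)` pattern and `ps.PRS.Height` field access come from the comment above.

```go
package main

import "fmt"

// Simplified stand-ins for the consensus reactor's types.
type PeerRoundState struct{ Height int64 }
type PeerState struct{ PRS PeerRoundState }

// peer mimics the key-value store exposed by the Peer object.
type peer struct{ data map[string]any }

func (p *peer) Get(key string) any { return p.data[key] }

const PeerStateKey = "ConsensusReactor.peerState"

func main() {
	src := &peer{data: map[string]any{
		PeerStateKey: &PeerState{PRS: PeerRoundState{Height: 42}},
	}}
	// The pattern quoted above: fetch the peer state, then read the
	// last known height from it.
	if ps, ok := src.Get(PeerStateKey).(*PeerState); ok {
		fmt.Println(ps.PRS.Height) // prints 42
	}
}
```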

Member Author


Thanks for pointing that out. Now I'm using this approach, which is much simpler.

@hvanz hvanz force-pushed the hvanz/disable-recheck-when-catching-up branch from 0c6583f to e4f9a9c Compare June 26, 2024 16:01
@hvanz hvanz added the backport-to-v1.x Tell Mergify to backport the PR to v1.x label Jun 26, 2024
Contributor
@andynog andynog left a comment


lgtm, minor suggestions and questions.

@@ -20,6 +20,8 @@ import (
"github.com/cometbft/cometbft/types"
)

const maxHeightDiffForRechecking = 2
Contributor


Should this be a config parameter? Are we OK having this value hard coded?

Member Author


I'm asking myself the same question. As an example, in the current code the mempool reactor doesn't send transactions to a peer if the height difference with that peer is more than one block.

Contributor


This is pretty arbitrary; making it a config flag probably won't help much, although it is not a bad idea.

In any case, 2 is a low value. We are talking about a node that is slightly behind and can catch up. The real problem is when it is 10 or more heights behind, I would say.

@@ -96,6 +100,10 @@ func NewCListMempool(
return mp
}

func (mem *CListMempool) SetSwitch(sw *p2p.Switch) {
Contributor


The p2p switch is only needed in the CList mempool, right? Are there any other mempools that would need this? If so, we could add it to the interface, but that might be breaking.

Contributor


I wouldn't change the interface, as this is to me a hopefully temporary workaround.

if doRecheck {
mem.recheckTxs()
} else {
mem.logger.Debug("recheck disabled at this height", "height", height)
Contributor


This recheck=false branch is taken when the node is behind by the maximum defined in the logic above, so the debug message could reflect that too.

Contributor


And this should be an INFO message.

@hvanz hvanz changed the title perf(mempool)!: disable recheck when node is catching up perf(mempool)!: disable rechecking when node is catching up Jun 26, 2024
@hvanz hvanz changed the title perf(mempool)!: disable rechecking when node is catching up perf(mempool): disable rechecking when node is catching up Jun 26, 2024
Contributor
@melekes melekes left a comment


I appreciate the effort here, but I don't like that:

a) Switch is added to the mempool
b) New attack vectors are introduced (see my comment)

The general direction is okay, IMHO. We need to inform Comet components that the node is syncing (either via block syncing or consensus). Previously, we did similar things via pubsub to avoid tight coupling of components.

// maxHeightDiffForRechecking.
doRecheck := true
if mem.sw != nil {
mem.sw.Peers().ForEach(func(peer p2p.Peer) {
Contributor


This could be a method on the Switch.
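Such a method might look like the sketch below. This is a hypothetical illustration: `Switch`, `Peer`, and `MaxPeerHeight` here are simplified stand-ins, not the actual p2p API; only the idea of hiding the `ForEach` iteration behind one Switch method comes from the comment.

```go
package main

import "fmt"

// Simplified stand-ins for the p2p types.
type Peer struct{ height int64 }

type Switch struct{ peers []Peer }

// ForEach mirrors the iteration used in the diff above.
func (sw *Switch) ForEach(f func(Peer)) {
	for _, p := range sw.peers {
		f(p)
	}
}

// MaxPeerHeight returns the highest height reported by any peer,
// so callers don't have to iterate over peers themselves.
func (sw *Switch) MaxPeerHeight() int64 {
	var best int64
	sw.ForEach(func(p Peer) {
		if p.height > best {
			best = p.height
		}
	})
	return best
}

func main() {
	sw := &Switch{peers: []Peer{{height: 5}, {height: 9}, {height: 7}}}
	fmt.Println(sw.MaxPeerHeight()) // prints 9
}
```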

Contributor


Possible attack: I construct a node, which reports height=10000000, but otherwise appears to be valid.

Contributor


Agree with the attack, we cannot trust individual reports of height.

Notice that we use the reported height to decide what to send to a peer, not to define our own behavior as a node.

Member Author


The main problem with this attack is that transactions may get stuck in the mempool, because with PrepareProposal rechecking is needed for garbage collection. Otherwise the transactions will only be validated when proposing.

@cason
Contributor
cason commented Jun 27, 2024

In particular, rechecking may take a long time to finish, blocking the mempool from receiving new transactions and never allowing the node to catch up.

Well, the first is a non-problem, except for the fact that the mempool will be blocking the connections with peers if they continue to send us transactions (which they shouldn't). The problem here is the mempool blocking the connection with peers.

The second problem probably derives partially from the first. If the mempool can't process messages, thus blocking the channel, consensus/block sync messages are not processed either. There is also the fact that block execution takes longer, but this should not be the core problem here, in my opinion.

From this analysis, I think the most general solution is not to block the mempool channel, namely, to drop transactions received while the node is busy with rechecking or other work. Or to not block the mempool while rechecking is happening.
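The "drop instead of block" idea could be sketched with a non-blocking channel send. This is illustrative only, under the assumption of a bounded input queue; `tryEnqueue` is a hypothetical helper, not how the real mempool reactor works today.

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send: if the mempool's input
// queue is full (e.g. because a recheck is in progress), the
// transaction is dropped instead of blocking the p2p connection.
func tryEnqueue(queue chan []byte, tx []byte) bool {
	select {
	case queue <- tx:
		return true
	default:
		return false // queue full: drop rather than block
	}
}

func main() {
	queue := make(chan []byte, 1)
	fmt.Println(tryEnqueue(queue, []byte("tx1"))) // true: accepted
	fmt.Println(tryEnqueue(queue, []byte("tx2"))) // false: dropped, queue is full
}
```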

In any case, this is something to discuss in an issue, not a PR.

@cason cason self-requested a review June 27, 2024 09:07
Contributor
@cason cason left a comment


Still consider this a workaround, agree that it can be abused.


@@ -608,7 +616,25 @@ func (mem *CListMempool) Update(

// Recheck txs left in the mempool to remove them if they became invalid in the new state.
if mem.config.Recheck {
mem.recheckTxs()
// Don't perform recheck if one of the peers is known to be at least at our height +
Contributor


This is a case for creating a new method that does this verification, clearing this method's code of the workaround. That way, if we find a better way to implement it, we change the new method, not this one.


@cason cason dismissed their stale review June 27, 2024 09:11

The code has changed a lot.

@hvanz
Member Author
hvanz commented Jun 27, 2024

Thanks for all the feedback! I agree with all that was said above. I also don't like the approach of passing the switch to the mempool and depending on the heights that peers report. This was mostly a rushed attempt to fix the nightly tests. What we can take from this is that temporarily disabling rechecking could be a solution to some of the known problems. At least it makes the nightly tests pass.

@hvanz
Member Author
hvanz commented Jun 27, 2024

Results of nightly tests: https://github.com/cometbft/cometbft/actions/runs/9696318374

@hvanz hvanz closed this Jun 28, 2024
@hvanz hvanz deleted the hvanz/disable-recheck-when-catching-up branch December 20, 2024 22:11
Labels
backport-to-v1.x Tell Mergify to backport the PR to v1.x mempool
Projects
No open projects
Status: Done
4 participants