perf(mempool): disable rechecking when node is catching up #3343
Conversation
I disagree with this approach.
We are adding redundant information to the peer key-value store; it is already there, produced by the consensus reactor and read by the mempool and evidence reactors. See details here: https://github.com/cometbft/cometbft/blob/main/spec/p2p/reactor-api/p2p-api.md#key-value-store
internal/consensus/reactor.go
Outdated
All routines contain:
ps, ok := e.Src.Get(types.PeerStateKey).(*PeerState)
This is the peer state known by the node, which is stored in the Peer object. The last known height is just ps.PRS.Height. No need for new data structures to keep this information.
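The pattern the reviewer describes (reading a peer's last known height out of the per-peer key-value store) can be sketched as follows, using simplified stand-ins for cometbft's types (PeerRoundState, PeerState, and the store itself are reduced here to minimal hypothetical versions for illustration):

```go
package main

import "fmt"

// Simplified stand-ins for cometbft's consensus/p2p types (hypothetical, for illustration).
type PeerRoundState struct {
	Height int64
}

type PeerState struct {
	PRS PeerRoundState
}

// peerStore mimics the per-peer key-value store exposed via Peer.Get/Set.
type peerStore map[string]any

const PeerStateKey = "ConsensusReactor.peerState"

// lastKnownHeight returns the peer's last known height, if the consensus
// reactor has already stored a *PeerState under PeerStateKey.
func lastKnownHeight(store peerStore) (int64, bool) {
	ps, ok := store[PeerStateKey].(*PeerState)
	if !ok {
		return 0, false
	}
	return ps.PRS.Height, true
}

func main() {
	store := peerStore{PeerStateKey: &PeerState{PRS: PeerRoundState{Height: 42}}}
	h, ok := lastKnownHeight(store)
	fmt.Println(h, ok) // prints: 42 true
}
```

The point of the comment is that the consensus reactor already maintains this state, so the mempool can read it instead of tracking heights itself.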
Thanks for pointing that out. Now I'm using this approach, which is much simpler.
(force-pushed from 0c6583f to e4f9a9c)
lgtm, minor suggestions and questions.
@@ -20,6 +20,8 @@ import (
	"github.com/cometbft/cometbft/types"
)

const maxHeightDiffForRechecking = 2
Should this be a config parameter? Are we OK with having this value hard-coded?
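For reference, if it were made configurable, such a parameter would presumably live in the mempool section of config.toml, along the lines of the sketch below (hypothetical; this option does not exist in CometBFT):

```toml
[mempool]
# Maximum height difference with a peer before rechecking is
# temporarily disabled while the node catches up (hypothetical option).
max_height_diff_for_rechecking = 2
```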
I'm asking myself the same. As a point of comparison, in the current code the mempool reactor doesn't send transactions to a peer that is more than one block behind.
This is pretty arbitrary; making it a config flag probably won't help much, although it is not a bad idea.
In any case, 2 is a low value. We are talking about a node that is slightly behind and can catch up. The problem is when it is 10 or more heights behind, I would say.
mempool/clist_mempool.go
Outdated
@@ -96,6 +100,10 @@ func NewCListMempool(
	return mp
}

func (mem *CListMempool) SetSwitch(sw *p2p.Switch) {
The p2p switch is only needed in the CList mempool, right? Are there any other mempool implementations that would need this? If so, we could add it to the interface, but that might be a breaking change.
I wouldn't change the interface, as this is to me a hopefully temporary workaround.
mempool/clist_mempool.go
Outdated
if doRecheck {
	mem.recheckTxs()
} else {
	mem.logger.Debug("recheck disabled at this height", "height", height)
This condition (recheck=false) holds when the node is behind its peers by more than the threshold in the logic above, so the debug message could reflect that as well.
And this should be an INFO message.
I appreciate the effort here, but I wouldn't say I like that:
a) Switch is added to the mempool
b) New attack vectors are introduced (see my comment)
The general direction is okay, IMHO. We need to inform Comet components that the node is syncing (either via block syncing or consensus). Previously, we did similar things via pubsub to avoid tight coupling of components.
// maxHeightDiffForRechecking.
doRecheck := true
if mem.sw != nil {
	mem.sw.Peers().ForEach(func(peer p2p.Peer) {
This could be a method on the Switch.
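A hypothetical helper on the Switch could look like the sketch below. The types here are simplified stand-ins, and HasPeerAhead is not an existing cometbft API; note also, as discussed elsewhere in this thread, that peer heights are self-reported and cannot be trusted individually:

```go
package main

import "fmt"

// Simplified stand-ins for cometbft's p2p.Peer and p2p.Switch (hypothetical).
type Peer struct{ height int64 }

type Switch struct{ peers []Peer }

// HasPeerAhead reports whether any connected peer claims to be at least
// maxDiff blocks above our height, i.e. whether we appear to be catching up.
// Heights are self-reported by peers, so a single report is not trustworthy.
func (sw *Switch) HasPeerAhead(ourHeight, maxDiff int64) bool {
	for _, p := range sw.peers {
		if p.height >= ourHeight+maxDiff {
			return true
		}
	}
	return false
}

func main() {
	sw := &Switch{peers: []Peer{{height: 100}, {height: 105}}}
	fmt.Println(sw.HasPeerAhead(100, 2)) // 105 >= 102, prints: true
	fmt.Println(sw.HasPeerAhead(104, 2)) // no peer at 106 or above, prints: false
}
```

Moving the loop into such a method would keep the mempool's Update path free of p2p iteration details.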
Possible attack: I construct a node, which reports height=10000000, but otherwise appears to be valid.
Agree with the attack, we cannot trust individual reports of height.
Notice that we use the reported height to decide what to send to a peer, not to define our own behavior as a node.
The main problem with this attack is that transactions may get stuck in the mempool, because with PrepareProposal rechecking is needed to do garbage collection. Otherwise the transactions will be validated when proposing.
Well, the first is a non-problem, except for the fact that the mempool will be blocking the connections with peers if they continue to send us transactions (while they shouldn't). The problem here is the mempool blocking the connection with the peers.
The second problem probably derives partially from the first. If the mempool can't process messages, thus blocking the channel, consensus/block-sync messages are not processed either. There is also the fact that block execution takes longer, but this should not be the core problem here, in my opinion.
From this analysis, I think the most general solution is not to block the mempool channel, namely, to drop transactions received while the node is busy with recheck or other work. Or do not block the mempool while rechecking is happening. In any case, this is something to discuss in an issue, not a PR.
Still consider this a workaround, agree that it can be abused.
@@ -608,7 +616,25 @@ func (mem *CListMempool) Update(

	// Recheck txs left in the mempool to remove them if they became invalid in the new state.
	if mem.config.Recheck {
		mem.recheckTxs()
	// Don't perform recheck if one of the peers is known to be at least at our height +
This is a case for extracting this verification into a new method, keeping the workaround out of Update. That way, if we find a better way to implement this, we change only the new method, not Update.
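A minimal sketch of such an extracted helper, using simplified stand-in types (the struct fields, the helper name shouldRecheck, and peerHeights are all hypothetical; the real mempool would consult the switch or the consensus reactor's peer state instead):

```go
package main

import "fmt"

const maxHeightDiffForRechecking = 2

// Simplified stand-in for the CList mempool and its view of peers (hypothetical).
type CListMempool struct {
	height      int64   // last committed height known to this node
	recheckOn   bool    // the mempool.recheck config flag
	peerHeights []int64 // latest heights reported by peers (untrusted)
}

// shouldRecheck isolates the catch-up workaround in one place: recheck is
// skipped if any peer claims to be at least maxHeightDiffForRechecking
// blocks ahead, i.e. if this node appears to be catching up.
func (mem *CListMempool) shouldRecheck() bool {
	if !mem.recheckOn {
		return false
	}
	for _, ph := range mem.peerHeights {
		if ph >= mem.height+maxHeightDiffForRechecking {
			return false // a peer is well ahead: skip recheck while catching up
		}
	}
	return true
}

func main() {
	mem := &CListMempool{height: 10, recheckOn: true, peerHeights: []int64{10, 13}}
	fmt.Println(mem.shouldRecheck()) // a peer is at 13 >= 12, prints: false
}
```

With this shape, Update would just call mem.shouldRecheck() and the workaround could later be replaced without touching Update itself.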
Thanks for all the feedback! I agree with all that was said above. I also don't like the approach of passing the switch to the mempool and depending on the heights that peers report. This was mostly a rushed attempt to fix the nightly tests. What we can take from it is that temporarily disabling rechecking could be a solution to some of the known problems. At least it makes the nightly tests pass.
Results of nightly tests: https://github.com/cometbft/cometbft/actions/runs/9696318374
When CheckTx is slow (on the application), a node may fail to catch up to the latest height, as seen in #2925. In particular, rechecking may take a long time to finish, blocking the mempool from receiving new transactions and not allowing the node to ever catch up. #3008 improved the performance, but we can still reproduce the problem with this e2e testnet.
This PR improves the performance of nodes that are lagging behind by temporarily disabling rechecking. A node is considered to be lagging behind when at least one of its peers is at a later height; in the code, "later" means a difference of at least two blocks. Nodes keep track of the latest heights of their peers, which report their height in proposal and vote messages.
If a node is not at the latest height, I consider it safe to skip rechecking transactions for some blocks, until the node catches up. The transactions in its mempool are either old transactions, which were already included in a committed block by some other node, or new transactions received via RPC, which most probably no other node has. The node won't be able to create a block with the new transactions until it catches up, but it can disseminate them to be committed by other validators. Also, the node won't receive new transactions from peers, because nodes don't gossip transactions to peers that are lagging behind (code, RFC-103).
Might be related to #3340
PR checklist
- Changelog entry added in .changelog (we use unclog to manage our changelog)
- Updated relevant documentation (docs/ or spec/) and code comments