perf(mempool): disable rechecking when node is catching up #3343
Conversation
I disagree with this approach.
We are adding redundant information to the peer key-value store; it is already there, produced by the consensus reactor and read by the mempool and evidence reactors. See details here: https://github.com/cometbft/cometbft/blob/main/spec/p2p/reactor-api/p2p-api.md#key-value-store
internal/consensus/reactor.go
Outdated
All routines contain:
ps, ok := e.Src.Get(types.PeerStateKey).(*PeerState)
This is the peer state known by the node, which is stored in the Peer object. The last known height is just ps.PRS.Height. No need for new data structures to keep this information.
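The pattern the reviewer describes (reading a peer's last known height out of the per-peer key-value store) can be sketched as follows, using simplified stand-ins for cometbft's types (PeerRoundState, PeerState, and the store itself are reduced here to minimal hypothetical versions for illustration):

```go
package main

import "fmt"

// Simplified stand-ins for cometbft's consensus/p2p types (hypothetical, for illustration).
type PeerRoundState struct {
	Height int64
}

type PeerState struct {
	PRS PeerRoundState
}

// peerStore mimics the per-peer key-value store exposed via Peer.Get/Set.
type peerStore map[string]any

const PeerStateKey = "ConsensusReactor.peerState"

// lastKnownHeight returns the peer's last known height, if the consensus
// reactor has already stored a *PeerState under PeerStateKey.
func lastKnownHeight(store peerStore) (int64, bool) {
	ps, ok := store[PeerStateKey].(*PeerState)
	if !ok {
		return 0, false
	}
	return ps.PRS.Height, true
}

func main() {
	store := peerStore{PeerStateKey: &PeerState{PRS: PeerRoundState{Height: 42}}}
	h, ok := lastKnownHeight(store)
	fmt.Println(h, ok) // prints: 42 true
}
```

The point of the comment is that the consensus reactor already maintains this state, so the mempool can read it instead of tracking heights itself.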
Thanks for pointing that out. Now I'm using this approach, which is much simpler.
(force-pushed from 0c6583f to e4f9a9c)
lgtm, minor suggestions and questions.
@@ -20,6 +20,8 @@ import (
	"github.com/cometbft/cometbft/types"
)

const maxHeightDiffForRechecking = 2
Should this be a config parameter? Are we OK with having this value hard-coded?
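For reference, if it were made configurable, such a parameter would presumably live in the mempool section of config.toml, along the lines of the sketch below (hypothetical; this option does not exist in CometBFT):

```toml
[mempool]
# Maximum height difference with a peer before rechecking is
# temporarily disabled while the node catches up (hypothetical option).
max_height_diff_for_rechecking = 2
```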
I'm asking myself the same. As a point of comparison, in the current code the mempool reactor doesn't send transactions to a peer that is more than one block behind.
This is pretty arbitrary; making it a config flag probably won't help much, although it is not a bad idea.
In any case, 2 is a low value. We are talking about a node that is slightly behind and can catch up. The problem is when it is 10 or more heights behind, I would say.
mempool/clist_mempool.go
Outdated
@@ -96,6 +100,10 @@ func NewCListMempool(
	return mp
}

func (mem *CListMempool) SetSwitch(sw *p2p.Switch) {
The p2p switch is only needed in the CList mempool, right? Are there any other mempool implementations that would need this? If so, we could add it to the interface, but that might be a breaking change.
I wouldn't change the interface, as this is to me a hopefully temporary workaround.
mempool/clist_mempool.go
Outdated
if doRecheck {
	mem.recheckTxs()
} else {
	mem.logger.Debug("recheck disabled at this height", "height", height)
This condition (recheck=false) holds when the node is behind its peers by more than the threshold in the logic above, so the debug message could reflect that as well.
And this should be an INFO message.
I appreciate the effort here, but I wouldn't say I like that:
a) Switch is added to the mempool
b) New attack vectors are introduced (see my comment)
The general direction is okay, IMHO. We need to inform Comet components that the node is syncing (either via block syncing or consensus). Previously, we did similar things via pubsub to avoid tight coupling of components.
// maxHeightDiffForRechecking.
doRecheck := true
if mem.sw != nil {
	mem.sw.Peers().ForEach(func(peer p2p.Peer) {
This could be a method on the Switch.
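A hypothetical helper on the Switch could look like the sketch below. The types here are simplified stand-ins, and HasPeerAhead is not an existing cometbft API; note also, as discussed elsewhere in this thread, that peer heights are self-reported and cannot be trusted individually:

```go
package main

import "fmt"

// Simplified stand-ins for cometbft's p2p.Peer and p2p.Switch (hypothetical).
type Peer struct{ height int64 }

type Switch struct{ peers []Peer }

// HasPeerAhead reports whether any connected peer claims to be at least
// maxDiff blocks above our height, i.e. whether we appear to be catching up.
// Heights are self-reported by peers, so a single report is not trustworthy.
func (sw *Switch) HasPeerAhead(ourHeight, maxDiff int64) bool {
	for _, p := range sw.peers {
		if p.height >= ourHeight+maxDiff {
			return true
		}
	}
	return false
}

func main() {
	sw := &Switch{peers: []Peer{{height: 100}, {height: 105}}}
	fmt.Println(sw.HasPeerAhead(100, 2)) // 105 >= 102, prints: true
	fmt.Println(sw.HasPeerAhead(104, 2)) // no peer at 106 or above, prints: false
}
```

Moving the loop into such a method would keep the mempool's Update path free of p2p iteration details.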
Possible attack: I construct a node, which reports height=10000000, but otherwise appears to be valid.
Agree with the attack, we cannot trust individual reports of height.
Notice that we use the reported height to decide what to send to a peer, not to define our own behavior as a node.
The main problem with this attack is that transactions may get stuck in the mempool, because with PrepareProposal rechecking is needed to do garbage collection. Otherwise the transactions will be validated when proposing.
Well, the first is a non-problem, except for the fact that the mempool will be blocking the connections with peers if they continue to send us transactions (while they shouldn't). The problem here is the mempool blocking the connection with the peers.
The second problem probably derives partially from the first. If the mempool can't process messages, thus blocking the channel, consensus/block-sync messages are not processed either. There is also the fact that block execution takes longer, but this should not be the core problem here, in my opinion.
From this analysis, I think the most general solution is not to block the mempool channel, namely, to drop transactions received while the node is busy with recheck or other work. Or do not block the mempool while rechecking is happening. In any case, this is something to discuss in an issue, not a PR.
Still consider this a workaround, agree that it can be abused.
@@ -608,7 +616,25 @@ func (mem *CListMempool) Update(

	// Recheck txs left in the mempool to remove them if they became invalid in the new state.
	if mem.config.Recheck {
		mem.recheckTxs()
	// Don't perform recheck if one of the peers is known to be at least at our height +
This is a case for extracting this verification into a new method, keeping the workaround out of Update. That way, if we find a better way to implement this, we change only the new method, not Update.
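A minimal sketch of such an extracted helper, using simplified stand-in types (the struct fields, the helper name shouldRecheck, and peerHeights are all hypothetical; the real mempool would consult the switch or the consensus reactor's peer state instead):

```go
package main

import "fmt"

const maxHeightDiffForRechecking = 2

// Simplified stand-in for the CList mempool and its view of peers (hypothetical).
type CListMempool struct {
	height      int64   // last committed height known to this node
	recheckOn   bool    // the mempool.recheck config flag
	peerHeights []int64 // latest heights reported by peers (untrusted)
}

// shouldRecheck isolates the catch-up workaround in one place: recheck is
// skipped if any peer claims to be at least maxHeightDiffForRechecking
// blocks ahead, i.e. if this node appears to be catching up.
func (mem *CListMempool) shouldRecheck() bool {
	if !mem.recheckOn {
		return false
	}
	for _, ph := range mem.peerHeights {
		if ph >= mem.height+maxHeightDiffForRechecking {
			return false // a peer is well ahead: skip recheck while catching up
		}
	}
	return true
}

func main() {
	mem := &CListMempool{height: 10, recheckOn: true, peerHeights: []int64{10, 13}}
	fmt.Println(mem.shouldRecheck()) // a peer is at 13 >= 12, prints: false
}
```

With this shape, Update would just call mem.shouldRecheck() and the workaround could later be replaced without touching Update itself.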
Thanks for all the feedback! I agree with all that was said above. I also don't like the approach of passing the switch to the mempool and depending on the heights that peers report. This was mostly a rushed attempt to fix the nightly tests. What we can take from it is that temporarily disabling rechecking could be a solution to some of the known problems. At least it makes the nightly tests pass.
Results of nightly tests: https://github.com/cometbft/cometbft/actions/runs/9696318374
When CheckTx is slow (on the application), a node may fail to catch up to the latest height, as seen in #2925. In particular, rechecking may take a long time to finish, blocking the mempool from receiving new transactions and not allowing the node to ever catch up. #3008 improved the performance, but we can still reproduce the problem with this e2e testnet.
This PR improves the performance of nodes that are lagging behind by temporarily disabling rechecking. A node is considered to be lagging behind when at least one of its peers is at a later height; in the code, "later" means a difference of at least two blocks. Nodes keep track of the latest heights of their peers, which report their height in proposal and vote messages.
If a node is not at the latest height, I consider it safe to skip rechecking transactions for some blocks, until the node catches up. The transactions in its mempool are either old transactions, which were already included in a committed block by some other node, or new transactions received via RPC, which most probably no other node has. The node won't be able to create a block with the new transactions until it catches up, but it can disseminate them to be committed by other validators. Also, the node won't receive new transactions from peers, because nodes don't gossip transactions to peers that are lagging behind (code, RFC-103).
Might be related to #3340
PR checklist
- Changelog entry added in .changelog (we use unclog to manage our changelog)
- Updated relevant documentation (docs/ or spec/) and code comments