perf(mempool): disable rechecking when node is catching up by hvanz · Pull Request #3343 · cometbft/cometbft · GitHub

perf(mempool): disable rechecking when node is catching up #3343


Closed
hvanz wants to merge 9 commits

Changes from all commits
@@ -0,0 +1,3 @@
- `[mempool]` Do not recheck transactions when the node is lagging behind with respect to its peers.
This should allow the node to catch up faster.
([\#3343](https://github.com/cometbft/cometbft/pull/3343))
32 changes: 31 additions & 1 deletion mempool/clist_mempool.go
@@ -20,6 +20,8 @@ import (
"github.com/cometbft/cometbft/types"
)

const maxHeightDiffForRechecking = 2
Contributor: Should this be a config parameter? Are we OK having this value hard-coded?

Member Author: I'm asking this myself. As a point of comparison, in the current code the mempool reactor doesn't send transactions to a peer if the height difference with that peer is more than one block.

Contributor: This is pretty arbitrary; making it a config flag probably won't help much, although it's not a bad idea.

In any case, 2 is a low value. We are talking about a node that is only slightly behind and can still catch up. The real problem is when it is 10 or more heights behind, I would say.
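A minimal sketch of what such a knob might look like, assuming a hypothetical RecheckMaxHeightDiff field on the mempool config (the field name is invented for illustration and is not part of this PR):

// Hypothetical config field; this PR hard-codes the value as
// maxHeightDiffForRechecking = 2 instead.
type MempoolConfig struct {
	Recheck              bool  // existing flag that enables rechecking
	RecheckMaxHeightDiff int64 // skip recheck when a peer is this many blocks ahead; 0 disables the check
}

The comparison in Update would then read peerState.GetHeight() >= height+mem.config.RecheckMaxHeightDiff.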


// CListMempool is an ordered in-memory pool for transactions before they are
// proposed in a consensus round. Transaction validity is checked using the
// CheckTx abci message before the transaction is added to the pool. The
@@ -56,6 +58,8 @@ type CListMempool struct {
// This reduces the pressure on the proxyApp.
cache TxCache

sw *p2p.Switch

logger log.Logger
metrics *Metrics
}
@@ -147,6 +151,14 @@ func (mem *CListMempool) SetLogger(l log.Logger) {
mem.logger = l
}

// SetSwitch sets the switch to access the peers.
func (mem *CListMempool) SetSwitch(sw *p2p.Switch) {
if sw == nil {
mem.logger.Error("switch passed as argument is empty")
}
mem.sw = sw
}

// WithPreCheck sets a filter for the mempool to reject a tx if f(tx) returns
// false. This is ran before CheckTx. Only applies to the first created block.
// After that, Update overwrites the existing value.
@@ -608,7 +620,25 @@ func (mem *CListMempool) Update(

// Recheck txs left in the mempool to remove them if they became invalid in the new state.
if mem.config.Recheck {
mem.recheckTxs()
// Don't perform recheck if one of the peers is known to be at least at our height +
Contributor: This is a case for creating a new method to do this verification, keeping this workaround out of Update. That way, if we find a better way to implement it, we change the new method, not this one.

// maxHeightDiffForRechecking.
doRecheck := true
if mem.sw != nil {
mem.sw.Peers().ForEach(func(peer p2p.Peer) {
Contributor: This could be a method on the Switch.

Contributor: Possible attack: I construct a node that reports height=10000000 but otherwise appears to be valid.

Contributor: Agree with the attack; we cannot trust individual reports of height.

Note that we use the reported height to decide what to send to a peer, not to define our own behavior as a node.

Member Author: The main problem with this attack is that transactions may get stuck in the mempool: with PrepareProposal, rechecking is needed to do garbage collection; otherwise the transactions are only validated when proposing.

if peerState, ok := peer.Get(types.PeerStateKey).(PeerState); ok {
if peerState.GetHeight() >= height+maxHeightDiffForRechecking {
doRecheck = false
return
}
}
})
}

if doRecheck {
mem.recheckTxs()
} else {
mem.logger.Info("recheck disabled at this height", "height", height)
}
}

// Notify if there are still txs left in the mempool.
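The reviewers above suggest both moving this check into its own method and not trusting a single peer's reported height. A hedged sketch combining the two ideas, reusing the Switch and PeerState accessors from the diff (the helper name and the majority rule are illustrative, not part of this PR):

// shouldSkipRecheck reports whether rechecking can be skipped because this
// node appears to be lagging. To blunt a single peer lying about its height,
// it only skips when a majority of peers with known state report being at
// least maxHeightDiffForRechecking blocks ahead of us.
func (mem *CListMempool) shouldSkipRecheck(height int64) bool {
	if mem.sw == nil {
		return false
	}
	total, ahead := 0, 0
	mem.sw.Peers().ForEach(func(peer p2p.Peer) {
		if peerState, ok := peer.Get(types.PeerStateKey).(PeerState); ok {
			total++
			if peerState.GetHeight() >= height+maxHeightDiffForRechecking {
				ahead++
			}
		}
	})
	return total > 0 && 2*ahead > total
}

With such a helper, Update would reduce to calling mem.shouldSkipRecheck(height) instead of carrying the inline loop.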
1 change: 1 addition & 0 deletions mempool/reactor.go
@@ -67,6 +67,7 @@ func (memR *Reactor) OnStart() error {
if !memR.config.Broadcast {
memR.Logger.Info("Tx broadcasting is disabled")
}
memR.mempool.SetSwitch(memR.Switch)
return nil
}

222 changes: 222 additions & 0 deletions test/e2e/networks_regressions/slow_recheck.toml
@@ -0,0 +1,222 @@
# In this testnet, the full nodes start late and then never catch up to the latest height because
# rechecking transactions is very slow (due to the `check_tx_delay` setting). Solved by disabling
# rechecking when the nodes are lagging behind their peers by more than N blocks (in the PR, N=2).
# (https://github.com/cometbft/cometbft/pull/3343)

ipv6 = false
initial_height = 0
disable_pex = false
key_type = ""
abci_protocol = "tcp"
prepare_proposal_delay = "200ms"
process_proposal_delay = "200ms"
check_tx_delay = "120ms"
vote_extension_delay = "100ms"
finalize_block_delay = "500ms"
upgrade_version = ""
load_tx_size_bytes = 0
load_tx_batch_size = 0
load_tx_connections = 0
load_max_txs = 0
prometheus = true
block_max_bytes = 0
vote_extension_size = 2048
vote_extensions_enable_height = 11
vote_extensions_update_height = 0
peer_gossip_intraloop_sleep_duration = "0s"
experimental_max_gossip_connections_to_persistent_peers = 0
experimental_max_gossip_connections_to_non_persistent_peers = 0
abci_tests_enabled = false
default_zone = ""
pbts_enable_height = 11
pbts_update_height = 0

[initial_state]

[validators]
validator01 = 83
validator02 = 46
validator03 = 50
validator04 = 36
validator05 = 70

[node]
[node.full01]
mode = "full"
version = ""
persistent_peers = ["validator04", "validator01"]
privval_protocol = "tcp"
start_at = 20
block_sync_version = "v0"
state_sync = true
persist_interval = 0
snapshot_interval = 3
retain_blocks = 0
enable_companion_pruning = false
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.full02]
mode = "full"
version = ""
persistent_peers = ["validator02", "validator05"]
privval_protocol = "file"
start_at = 30
block_sync_version = "v0"
state_sync = true
persist_interval = 0
snapshot_interval = 3
retain_blocks = 0
enable_companion_pruning = true
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.light01]
mode = "light"
version = ""
persistent_peers = ["validator04", "validator03", "validator01", "validator02"]
privval_protocol = ""
start_at = 10
block_sync_version = ""
state_sync = false
persist_interval = 0
snapshot_interval = 0
retain_blocks = 0
enable_companion_pruning = false
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.light02]
mode = "light"
version = ""
persistent_peers = ["validator04", "validator03", "validator01", "validator02"]
privval_protocol = ""
start_at = 15
block_sync_version = ""
state_sync = false
persist_interval = 0
snapshot_interval = 0
retain_blocks = 0
enable_companion_pruning = false
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.validator01]
mode = "validator"
version = ""
privval_protocol = "file"
start_at = 0
block_sync_version = "v0"
state_sync = false
persist_interval = 1
snapshot_interval = 3
retain_blocks = 0
enable_companion_pruning = false
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.validator02]
mode = "validator"
version = ""
persistent_peers = ["validator01"]
privval_protocol = "file"
start_at = 0
block_sync_version = "v0"
state_sync = false
persist_interval = 5
snapshot_interval = 3
retain_blocks = 0
enable_companion_pruning = false
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.validator03]
mode = "validator"
version = ""
persistent_peers = ["validator01"]
privval_protocol = "tcp"
start_at = 0
block_sync_version = "v0"
state_sync = false
persist_interval = 0
snapshot_interval = 0
retain_blocks = 0
enable_companion_pruning = false
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.validator04]
mode = "validator 950E "
version = ""
persistent_peers = ["validator01", "validator02"]
privval_protocol = "tcp"
start_at = 0
block_sync_version = "v0"
state_sync = false
persist_interval = 0
snapshot_interval = 3
retain_blocks = 0
enable_companion_pruning = true
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"
[node.validator05]
mode = "validator"
version = ""
persistent_peers = ["validator04"]
privval_protocol = "tcp"
start_at = 0
block_sync_version = "v0"
state_sync = false
persist_interval = 5
snapshot_interval = 3
retain_blocks = 56
enable_companion_pruning = false
send_no_load = false
zone = ""
experimental_db_key_layout = ""
compact = false
compaction_interval = 0
discard_abci_responses = false
indexer = ""
clock_skew = "0s"