Node starting from BlockSync may never catch up to latest height #3398
Comments
Another solution is to fix this logic. I thought we compared the current height of the node with the peer's last known height. We should have the latest height of a node from the … With this solution, the peer only starts receiving transactions when it is caught up, the cost of which is that some slow nodes may never receive transactions...
In other words, instead of delaying the start of the mempool reactor for an arbitrary amount of time, make it reject transactions when it is not able to keep up with the incoming tx load...
See also #2925 (comment).
Summary
A node starting from BlockSync may never reach the latest height. We have observed in e2e testnets that, at the end of BlockSync, when the node switches to consensus, it is still lagging by 2 or 3 blocks. Simultaneously, the mempool is enabled and starts receiving a flood of transactions from its peers, while consensus is still trying to catch up.
What happens?
At the end of BlockSync we have the following scenario: the node enables the mempool (EnableInOutTxs) and consensus (SwitchToConsensus) at the same time.
Example
In these pictures we see an e2e testnet with node validator05 starting at height 10, and node full01 starting at height 30 with StateSync enabled. They are able to catch up only after the tx load finishes and their mempools are empty. Note that in this testnet we inject a constant load of 2 tx/s, with each tx having 1kb. In real-world scenarios the load is not constant, which can give the node time to catch up faster.
This is the manifest file of the testnet. In particular, check_tx_delay is set to a high value (150ms) to be able to reproduce the failing scenario consistently.
Do not send txs to lagging nodes
Nodes are not supposed to send transactions to peers that are lagging, because of this condition checked before sending a transaction:
Here, peerState.GetHeight() is the height that the peer knows our node to be at, and memTx.Height() is the height at which the transaction was added to the peer's mempool (not necessarily the current height of the peer).
For example, this is a scenario observed in the testnet: the peer adds a tx to its mempool at height 9 while our node is at height 10. The check peerState.GetHeight() < memTx.Height()-1, equivalent to 10 < 9-1, is false, so the peer sends the transaction, even though our node is still catching up.
Possible solutions
I found that a simple solution is just to enable the mempool reactor a bit later than consensus. We see in these metrics that, with a 5-second delay before starting the mempool, the nodes catch up pretty fast.
The ideal solution would be to start the mempool only when the node is at the latest height (minus one), and then go back to BlockSync whenever the node is lagging (see #3372). The latest height could be defined as a function of the node's state, for instance, as a height that 1/3+ of the stake has reached.