Description
Node operators have reported that while the p2p layer is attempting to reconnect to a persistent peer (typically because the peer is unavailable, offline, etc.), the overall performance of the node degrades substantially. This is especially relevant in networks with short block times, where operators observe increased block times and proposers failing to get their blocks committed.
The method responsible for persistently attempting to dial a peer address is p2p.Switch.reconnectToPeer(*NetAddress). There is nothing really special about it in terms of resource consumption. Its main calls are to p2p.Switch.DialPeerWithAddress(*NetAddress), the same method used to dial any address, and to sleep.
The re-dialing is done using a standard (hard-coded) procedure, summarized here. There are 20 attempts at linear intervals (5s plus a random jitter of up to 3s), after which the intervals become exponential, increasing as powers of 3s, with the same jitter. At most 10 attempts are performed with exponential intervals, so at most 30 attempts are performed in total.
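To make the schedule above concrete, here is a minimal Go sketch that lists the base wait before each attempt and sums them. It is not the code in p2p.Switch: the constant and function names are hypothetical, jitter is left out of the intervals, and it assumes that the i-th exponential attempt waits 3^i seconds, which is one reading of "increasing powers of 3s".

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the description above; the names are hypothetical
// and are not claimed to match the identifiers used in the p2p package.
const (
	linearAttempts  = 20
	linearInterval  = 5 * time.Second
	backoffAttempts = 10
	backoffBase     = 3 // base of the exponential intervals, in seconds
	maxJitter       = 3 * time.Second
)

// baseIntervals returns the wait before each attempt, without jitter.
func baseIntervals() []time.Duration {
	out := make([]time.Duration, 0, linearAttempts+backoffAttempts)
	for i := 0; i < linearAttempts; i++ {
		out = append(out, linearInterval)
	}
	wait := time.Second
	for i := 1; i <= backoffAttempts; i++ {
		wait *= backoffBase // assumed: 3^i seconds on the i-th backoff attempt
		out = append(out, wait)
	}
	return out
}

// totalWait sums the given intervals.
func totalWait(intervals []time.Duration) time.Duration {
	var sum time.Duration
	for _, d := range intervals {
		sum += d
	}
	return sum
}

func main() {
	s := baseIntervals()
	fmt.Println("attempts:", len(s))
	fmt.Println("longest interval:", s[len(s)-1])
	fmt.Println("total wait without jitter:", totalWait(s))
	fmt.Println("maximum added jitter:", time.Duration(len(s))*maxJitter)
}
```

Under these assumptions the linear phase lasts 100s, the longest exponential interval is about 16.4 hours, and the whole procedure spans roughly 24.6 hours plus at most 90s of jitter. For nearly all of that time the routine is sleeping, which is consistent with the observation that nothing in it looks expensive.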
Turning the parameters used by this procedure into configuration parameters has been proposed several times by block operators.
But this issue should focus, in my opinion, on understanding the source of the overhead that has been observed.