p2p: investigate why re-dialing persistent peers consumes so many resources

There are reports from node operators that while the p2p layer is attempting to reconnect to a persistent peer (typically because it is unavailable, offline, etc.), the overall performance of the node degrades substantially. This is especially relevant in networks with short block times, where operators observe increased block times and proposers failing to get their blocks committed.

The method responsible for persistently attempting to dial a peer address is p2p.Switch.reconnectToPeer(*NetAddress). There is nothing really special about it in terms of resource consumption. Its main calls are sleeps and the dialing of the peer address, which uses the same p2p.Switch.DialPeerWithAddress(*NetAddress) used to dial any address.

The re-dialing follows a standard (hard-coded) procedure, summarized here: first there are 20 attempts at linear intervals (5s plus a random jitter of up to 3s); after that the intervals are exponential, increasing powers of 3s, with the same jitter. At most 10 attempts are performed with exponential intervals, so at most 30 attempts are performed in total.

Turning the parameters used by this procedure into configuration parameters has been proposed several times by node operators.

But this issue should focus, in my opinion, on understanding the source of the overhead that has been observed.
