Description
Feature Request
Summary
We want to be able to ingest packets from peers faster. There is some evidence (cc @evan-forbes) that we have blocking behavior in our ability to ingest data from peers, and as the explanation below goes on, it should become intuitively clear that this is a problem.
Here is a profile from Osmosis mainnet, taken last night over one hour, from recvRoutine (note that p2p.createMConnection.func1 is reactor.OnReceive):

[profile screenshot]
The flow we currently have is:
recvRoutine:
- Check whether the flowrate limit allows reading the max packet size (1024 bytes)
- Try to read the next packet
- Protobuf-decode the packet
- Find the packet's corresponding channel
- Buffer the proto-decoded packet data
- If the buffer now holds a full logical packet:
  - Run reactor.OnReceive
- Go back to the beginning
The issue is that reactor.OnReceive blocks reading and proto-decoding of further data for all channels to that peer. Some messages, e.g. IBC txs, take over 5ms to process. This means that if a peer gossips an IBC tx to you, it will be at least 5ms before you even attempt to decode the subsequent packets they send you (enabling IBC-based DoS attacks, among many others).
We see that even under low load, CheckTx is already the dominant item here.
Another problem is that some consensus packets block on the consensus-state (cs) mutex, which is locked during block execution and vote processing.
It's clear we need to split these into different processes.
Proposal
If we need to guarantee in-order delivery across each channel, then I think the answers are:
- Short term: use a Go channel within each Channel to buffer incoming packets and process them in a different goroutine than the recv routine.
- Long term: change the channel.onReceive API.
If we don't need to guarantee in-order delivery, and only need it to hold in almost every case, then we can simply call this function in a goroutine: https://github.com/cometbft/cometbft/blob/main/p2p/conn/connection.go#L671