8000 Make reactor.onReceive not block unmarshalling more packets from peer · Issue #3199 · cometbft/cometbft · GitHub

Make reactor.onReceive not block unmarshalling more packets from peer #3199


Open
Tracked by #3053
ValarDragon opened this issue Jun 6, 2024 · 3 comments
Labels
enhancement New feature or request p2p

Comments

ValarDragon commented Jun 6, 2024

Feature Request

Summary

We want to be able to ingest packets from peers faster. There is some evidence (cc @evan-forbes) that we have blocking behavior on our ability to ingest data from peers. As the explanation below goes on, it should become intuitively clear that this is a problem.

Here is a profile from osmosis mainnet taken last night over one hour, from recvRoutine:
[profile screenshot]

p2p.createMConnection.func1 is reactor.onReceive.

The flow we currently have is:

recvRoutine:
- Checks if flowrate is satisfied for max packet size (1024 bytes)
- Tries to read the next packet
- Protobuf Decodes packet
- Find the corresponding packet's channel
- Buffer the proto-decoded packet data
- If buffer corresponds to a full logical packet
  - Run reactor.OnReceive
- Go back to beginning

The issue is that reactor.OnReceive blocks reading and proto-decoding more data for all channels to that peer. Some messages, e.g. IBC txs, take over 5ms to handle. This means that if a peer gossips you an IBC tx, it will take at least 5ms before you even attempt to decode the subsequent packets they send you (leading to IBC-enabled DoS vectors, among many others).

We see that even under low load, CheckTx is already the dominant item here.

Another problem is that some consensus packets block on the consensus state (cs) mutex, which is held during block execution and vote processing.

It's clear we need to split these into different processes.

Proposal

If we need to guarantee in-order delivery across each channel, then I think the answers are:

  • Short term: Use a go channel within each channel, for buffering incoming packets and processing them in a different thread than the recv routine.
  • Long term: change the channel.onReceive API.

If we don't need to guarantee in-order delivery, and it's enough to achieve it in almost every case, then we can just call this function in a goroutine: https://github.com/cometbft/cometbft/blob/main/p2p/conn/connection.go#L671

@ValarDragon ValarDragon added enhancement New feature or request needs-triage This issue/PR has not yet been triaged by the team. labels Jun 6, 2024
ValarDragon commented Jun 6, 2024

I propose a two-PR short-term fix, and then decide on a longer term API breaking fix later.

PR #1:

First notice that the key property is that we only need in-order delivery per peer<>channel pair, but we have no in-order delivery guarantees across channels. So we can achieve this by making every onReceive push to a go channel per channel, that does the proto marshal/unmarshal.

PR #2:

Make each of the main reactors have an incoming buffer, and a separate goroutine that processes the inbound messages. This would eventually be an API break on reactor, but short term it should drop these incoming costs notably. @evan-forbes has a change for this that he's already been testing.
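As a rough sketch of what a reactor-side inbound buffer could look like (hypothetical types, not @evan-forbes's actual change): Receive only enqueues the envelope, and a dedicated goroutine drains the queue, so the p2p recv routine never waits on heavy reactor work.

```go
package main

import (
	"fmt"
	"sync"
)

// envelope is a stand-in for the p2p message handed to a reactor.
type envelope struct {
	channelID byte
	payload   []byte
}

// bufferedReactor decouples the p2p recv routine from message processing:
// Receive is cheap (just a channel send) and the drain goroutine does the work.
type bufferedReactor struct {
	inbound   chan envelope
	wg        sync.WaitGroup
	processed int
}

func newBufferedReactor() *bufferedReactor {
	r := &bufferedReactor{inbound: make(chan envelope, 1024)}
	r.wg.Add(1)
	go func() { // separate goroutine drains the inbound buffer
		defer r.wg.Done()
		for range r.inbound {
			r.processed++ // heavy reactor work (e.g. CheckTx) would go here
		}
	}()
	return r
}

// Receive is what the p2p layer calls; it only enqueues, so the recv
// routine is free to keep reading and decoding packets from the peer.
func (r *bufferedReactor) Receive(e envelope) { r.inbound <- e }

// shutdown closes the buffer and waits for the drain goroutine to finish.
func (r *bufferedReactor) shutdown() { close(r.inbound); r.wg.Wait() }

func main() {
	r := newBufferedReactor()
	for i := 0; i < 10; i++ {
		r.Receive(envelope{channelID: 0x30, payload: []byte("msg")})
	}
	r.shutdown()
	fmt.Println("processed:", r.processed) // prints "processed: 10"
}
```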

(Optional) PR #3: Change the onReceive channel setup to enable parallel proto cloning/marshalling. It is unclear that this is needed short term, but it is definitely a long-term direction to have.

cason commented Jun 11, 2024

For context, before tendermint/tendermint#9622, reactors received raw messages (bytes), not protobuf messages decoded from the raw bytes. Decoding in the p2p layer allows producing per-message-type p2p metrics and saves some code in the reactor implementations, but it of course introduced unmarshalling into the critical path of the p2p layer. And this is the main problem here.

cason commented Jun 11, 2024

> First notice that the key property is that we only need in-order delivery per peer<>channel pair, but we have no in-order delivery guarantees across channels. So we can achieve this by making every onReceive push to a go channel per channel, that does the proto marshal/unmarshal.

This probably means one goroutine per channel per peer. That is a lot of goroutines to handle; I am afraid the overhead here might not pay off the possible benefits.

@cason cason removed the needs-triage This issue/PR has not yet been triaged by the team. label Jun 11, 2024