Make sure that reactors' implementations of the Receive() method are non-blocking
#2685
Comments
One simple solution we played around with was increasing the envelope buffer sizes: they have to be quite large in order not to block the consensus reactor, at least when blocks are large. As one might expect, most of the messages that end up blocking are small messages, not larger messages such as block parts. The most common blocking channels are consensus state and votes, then block parts, at least in some recent experiments. We're collecting some TCP packet traces soon, but when traced on a local machine we can see that the reason we're not able to utilize all of the bandwidth is simply that we are unable to empty the TCP buffers fast enough. This also explains why other optimizations (like not using a mempool) or changes to prioritization don't make any meaningful difference to throughput, at least for large blocks.

One reason, but not the only reason, we block is that all incoming messages from all peers are effectively processed synchronously in the consensus reactor. This also explains the age-old issue where connecting more peers reduces throughput. While not frequent, this can result in processing a vote, block part, or state message taking up to 700ms (!). Below is a graph of a particularly egregious example where we see max waits of 2.5 seconds.

[Graph: time taken to process a message after receiving it (ms) on the y axis, channel on the x axis; max, average with stdev bars, and the number of messages per channel that took over 100ms.]

Another reason is that we are not buffering the TCP connection properly. For example, when we increase this constant, we see a meaningful but modest increase in throughput. I'm still working through the roughly five buffered and unbuffered `io.Reader`/`io.Writer` wrappers around the TCP connections; there are so many of them that it's difficult to grok which need buffers and which actually degrade performance when we increase them.
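For illustration only, here is a minimal Go sketch of the kind of buffering change mentioned above: layering larger `bufio` readers/writers over the raw TCP connection so the kernel's TCP buffers can be drained in bigger chunks. The constant name and the 1 MiB size are assumptions made for the example, not the actual constant or value used in `p2p/conn`.

```go
package p2psketch

import (
	"bufio"
	"net"
)

// connBufferSize is a hypothetical value chosen only to illustrate "larger";
// the real constant in p2p/conn and its tuning are different.
const connBufferSize = 1 << 20 // 1 MiB

// wrapConn layers a buffered reader/writer over a raw TCP connection so that
// data can be pulled out of the kernel's TCP buffers in larger chunks,
// independently of how fast each individual message is consumed.
func wrapConn(c net.Conn) (*bufio.Reader, *bufio.Writer) {
	return bufio.NewReaderSize(c, connBufferSize), bufio.NewWriterSize(c, connBufferSize)
}
```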
Processing evidence of misbehavior is a very costly procedure, which is mostly implemented by a blocking call. For context, from the log produced by this experiment:
Motivated by #2533.
The `Reactor` interface defines methods that a component must implement in order to use the communication services provided by the p2p layer. A core method that a reactor has to implement is `Receive()`, invoked by the p2p layer to deliver a message to the reactor. The implementation of this method should be non-blocking (see details here).

If the implementation of `Receive()` in some reactor blocks when processing a message received from a given peer, the receive routine (`recvRoutine()`) of the p2p connection with that peer also blocks until the `Receive()` call returns:

cometbft/p2p/conn/connection.go
Lines 654 to 655 in 0147e63
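To make the interaction concrete, below is a simplified Go sketch of the pattern described above. The names and types (`Envelope`, `recvRoutine`, the per-channel reactor map) are illustrative stand-ins; the real `p2p.Reactor` interface and connection internals in CometBFT are more involved.

```go
package p2psketch

// Envelope is a stand-in for the p2p envelope carrying a message and the
// channel it was received on.
type Envelope struct {
	ChannelID byte
	Message   []byte
}

// Reactor shows only the method relevant here: the p2p layer calls Receive
// to deliver each incoming message to the reactor owning the channel.
type Reactor interface {
	Receive(e Envelope)
}

// recvRoutine models the per-peer receive loop: every message read from the
// wire is handed to the corresponding reactor synchronously, so a blocking
// Receive stalls delivery of all further messages from this peer.
func recvRoutine(incoming <-chan Envelope, reactors map[byte]Reactor) {
	for e := range incoming {
		if r, ok := reactors[e.ChannelID]; ok {
			r.Receive(e) // if this call blocks, the whole loop blocks
		}
		// unknown channels are silently dropped in this sketch
	}
}
```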
This scenario has some consequences: while the receive routine is blocked, no other message from that peer is delivered to any reactor until the `Receive()` call returns.

Possible solutions
This situation could be avoided if the `Receive()` call were not executed by the receive routine of the p2p connection. This change, however, is far from trivial, given the design of the p2p multiplex connections. Moreover, even if the `Receive()` call is executed in a different routine, that other routine will eventually block if the `Receive()` method of a reactor blocks for a long period of time.

The other approach, suggested in this issue, is to review the implementation of each reactor to make sure that the `Receive()` method is not blocking. This approach is also not trivial, as ultimately the way to prevent blocking is to buffer received messages that cannot be immediately processed. Since buffers are always finite in size, this would eventually lead to dropping messages.
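As a rough sketch of this second approach, and under the assumption that dropping is acceptable once the buffer fills up, a reactor could make `Receive()` non-blocking by enqueuing envelopes into a bounded queue that is drained by a separate goroutine. This is not CometBFT's actual implementation, only an illustration of the trade-off described above.

```go
package reactorsketch

// Envelope is the same stand-in type as in the previous sketch.
type Envelope struct {
	ChannelID byte
	Message   []byte
}

// BufferedReactor decouples message reception from message processing:
// Receive only enqueues, a worker goroutine does the slow work.
type BufferedReactor struct {
	queue chan Envelope // bounded buffer; its size is an arbitrary example
}

func NewBufferedReactor(size int) *BufferedReactor {
	r := &BufferedReactor{queue: make(chan Envelope, size)}
	go r.processLoop()
	return r
}

// Receive never blocks: it either enqueues the envelope or drops it.
func (r *BufferedReactor) Receive(e Envelope) {
	select {
	case r.queue <- e:
	default:
		// Queue full. Dropping the message (or disconnecting the peer) is
		// the price of keeping Receive non-blocking with a finite buffer.
	}
}

// processLoop runs in its own goroutine and may take arbitrarily long per
// message without stalling the peer's receive routine.
func (r *BufferedReactor) processLoop() {
	for e := range r.queue {
		r.process(e)
	}
}

func (r *BufferedReactor) process(e Envelope) {
	_ = e // expensive, possibly blocking work would go here
}
```

The drop-on-full `select` is what keeps `Receive()` non-blocking; the alternatives (a blocking send or an unbounded queue) simply move the problem back to the receive routine or into unbounded memory growth.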