Dynamic Optimal Graph (DOG) gossip protocol #3263
Comments
JFYI, this proposal is isomorphic to libp2p/specs#548 and libp2p/specs#413
Not really. The semantics of the messages are different. First, the IDONTWANT message is an improvement to a push/pull protocol, while DOG is a push protocol. The IDONTWANT message is used to tell a peer to not send a specific message ID, even if the node doesn't know whether the peer ever saw that message. On the other hand, our HaveTx message is sent only after receiving a duplicate transaction, and is used to tell a peer to not send any more transactions from the same source. In this sense, it resembles the CHOKE message, because both aim to stop receiving duplicate messages/transactions, but a key difference is that we use the sender of the transaction to form a superposition of multiple spanning-tree topologies. And a spanning tree is the optimal topology one could have when gossiping.
There might be some misunderstanding. GossipSub is not a push/pull protocol but a push protocol in the first place. GossipSub has a pulling mechanism between overlay meshes to protect against a rare case where meshes disconnect, but this mechanism is strictly additive to the core pushing part. If GossipSub were a pull/push protocol, it wouldn't be able to gossip ETH blocks fast enough.
Still, IDONTWANT is only used in the pulling mechanism. I don't see how it's relevant to DOG, which is a purely push protocol.
That's not the case. The first paragraph from the spec:
"Broadcast ... to all its mesh peers" implies pushing. |
@Wondertan I see from your comments that you are very interested in DOG, and it's no surprise to me: indeed Celestia is potentially one of the biggest beneficiaries of the huge bandwidth savings that we have observed at scale. If you wish to discuss the details of the algorithm, we will be more than happy to do it in person: please come to our next CometBFT community call (in two weeks' time) where we can discuss those details at length. I don't think a long discussion here is helping other readers of this issue in any meaningful way.
Afaiu, "pushing" in push gossip protocols refers to sending the information that one wants to disseminate. IDONTWANT is a control message; it does not transmit information, only metadata. |
My impression is that the point here is to inform all peers that the node has a message, so that no one sends the same message again to the node. This is not a pull operation. This is a control message to prevent pushing of messages, which is not the same.
So, regarding the differences between the protocols. First of all, I have to say that when looking at the DOG design, the first thing that came to my mind was libp2p's Gossipsub. They are protocols essentially trying to solve the same problem: produce a low-cost dissemination sub-graph in a p2p/partially-connected network. The DOG protocol reacts to the reception of duplicate messages. An important difference, though, is how we define "a kind of message". In the case of libp2p, but correct me if I am wrong, we refer to a specific message (identified by its ID), while in DOG we refer to all transactions coming from the same source. Again, there is no pulling in any of the protocols. They only push messages to everyone until they receive some information from peers telling them to chill out and send them less or no messages.
Anyway, I think that libp2p/specs#413 is relevant and we should follow the outcomes of it.
@sergio-mena, thanks for the invite! I agree it is worth discussing this in person, but I can't imagine comparing DOG to other protocols on Comet's Community Call. These calls need to cover multiple topics in a fixed, limited time and involve a lot of parties, while detailed comparisons between protocols seem to be a very low-level and lengthy discussion. The issue with the proposal seems like the right place to compare these protocols, unless you can propose a better alternative.
@hvanz, this is correct, but the point is that IDONTWANT is sent right before pushing the user payload, and the payload isn't being pulled. @cason summarised it well here: The idea is to front-run your peer with an IDONTWANT message to prevent it from pushing the payload to you.
I just wonder whether I understood it correctly. This message says: "the transactions you are sending me that come from this source are duplicates; stop sending them". The typical use of a message with this name would match more the libp2p protocol whose spec is drafted in a post above: it means that we prefer to inform all our peers that we already have some data than to receive the same data from them and discard it because it is a duplicate. This is used in our consensus protocol for votes and block parts.
Another thing I wonder about is a way to formalize the protocol using graph theory. We have an undirected graph, which is the network topology (nodes connected with their peers). And, from this protocol's point of view, we have messages (txs) that a node receives from a peer and relays to other peers, where the routes a node keeps are labelled by the source of the transaction. The spanning structure is therefore defined over this kind of hyper-graph, where we connect two nodes once per transaction source rather than once overall. All this reasoning is, again, to differentiate this solution from the ones adopted by libp2p in gossipsub, which at the end are mostly based on this paper: https://ieeexplore.ieee.org/document/4365705 (or https://asc.di.fct.unl.pt/~jleitao/pdf/srds07-leitao.pdf). That algorithm indeed produces a spanning tree on the "mesh" or p2p-connections graph, by removing (pruning) redundant links.
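One way to make that intuition precise (a sketch only; the symbols G, D(v), and G_x are my own notation, grounded in the route definition and the "superposition of spanning trees" description in the proposal below):

```latex
% Sketch of a possible formalization (notation is illustrative, not from the proposal).
Let $G = (V, E)$ be the undirected p2p connection graph. Each node $v$ with
peer set $P(v)$ keeps a set of disabled routes
\[
  D(v) \subseteq P(v) \times P(v),
\]
where $(s, t) \in D(v)$ means that $v$ does not forward to peer $t$ the
transactions it received from peer $s$. For each source node $x \in V$, the
routes that remain enabled induce a dissemination subgraph $G_x \subseteq G$.
DOG aims at a stable state in which every $G_x$ is a spanning tree of $G$
rooted at $x$, so the overall structure is a superposition of $|V|$ spanning
trees (one per source), rather than a single spanning tree of $G$ as in
Plumtree-style protocols.
```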
This would be amazing to have; this is a great idea! One important point that I don't believe is mentioned above is that non-zero redundancy may actually have performance benefits for the network, depending on topology and especially if transactions enter the network at random nodes (rather than mostly at a particular node or nodes). Even in the case where you don't have Byzantine or faulty nodes, my intuition is that a redundancy of 1 or greater may improve performance in some cases.
Take, for example, the simple case of having validators that all receive transactions directly from clients. This is making me think that it would be beneficial if the value of the target redundancy could be tuned to the network's topology and transaction entry pattern.
How does the system handle this case that I think can cause starvation: Consider a 4-node network where nodes are connected in a cycle. Let us be node Z, connected to peers A and B, with a fourth node X connected to both A and B.
Let's say that our redundancy is set at 0. If two transactions originating at X reach me in opposite orders (the first arriving via A and then duplicated via B, the second arriving via B and then duplicated via A), I will reply HaveTx to both A and B. Later, I receive no more transactions originating at X, because both of my peers have disabled their route from X to me. How is this solved for?
That's a good point. Redundancy > 0 means that the node is receiving transactions from more than one peer, which could be, for instance, the peers with the lowest latencies. So when the path to one of these peers becomes slower, the closed-loop controller allows the node to reset and pick a new path with lower latency. We still need to run a lot of performance tests, especially to find the ideal parameters, which would allow us to analyse how often we want to reset paths or create new ones. As before, it will be a tradeoff between redundancy and latency.
Good observation. Not mentioned above, but the protocol has the following rule: after sending a HaveTx message, a node will wait for some time before sending another one.
In your example, let's say node Z first replies HaveTx to B. The goal of this rule is to allow incoming traffic from B to gradually stop, while all traffic is diverted to A. How much time to wait before sending HaveTx again? At least the time it takes already-dispatched messages from B to arrive at Z. In the meantime Z will receive duplicate transactions from B, but we don't care about that. We don't want to render Z isolated from the traffic coming from X, so we need to be conservative about the time we abstain from sending HaveTx. In the current prototype we measure time in number of transactions received, e.g. 500 txs, which is higher than the saturation point of the network.
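For illustration, a minimal Go sketch of such a rule, counting received transactions rather than wall-clock time as described above; the type, the method names, and the package are assumptions, not the actual prototype code (only the 500-transaction figure comes from the comment above):

```go
package dog

// haveTxCooldown sketches the rule above: after sending a HaveTx, wait
// before sending another one, measuring the wait in number of received
// transactions (e.g. 500) rather than in wall-clock time.
type haveTxCooldown struct {
	threshold int64 // e.g. 500 transactions
	received  int64 // transactions received since the last HaveTx was sent
}

// OnTxReceived is called for every received transaction, duplicate or not.
func (c *haveTxCooldown) OnTxReceived() { c.received++ }

// MaySendHaveTx reports whether the cooldown window has passed.
func (c *haveTxCooldown) MaySendHaveTx() bool { return c.received >= c.threshold }

// OnHaveTxSent restarts the cooldown window.
func (c *haveTxCooldown) OnHaveTxSent() { c.received = 0 }
```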
Closing issue: the first version of the protocol was implemented in #3297. |
Tracking issue: #3297
Abstract
This issue describes DOG, a novel transaction dissemination algorithm for CometBFT's mempool. At its current prototype stage, it has already shown compelling results at scale, where it is seen to reduce the bandwidth taken by transaction dissemination by more than 75% (see section "Large scale results" below).
Introduction
We propose a new protocol for transaction dissemination called Dynamic Optimal Graph (DOG), designed to reduce the number of received duplicate transactions to a pre-configured value such as 1, 0.5, or even 0! This reduction in duplicates could, in principle, bring the bandwidth used for transmitting transactions down to the theoretical minimum needed to get all transactions to all nodes.
Consider the following small example with 7 nodes, where a load of 300 tx/s is injected at nodes 1 and 7 (the setup is described in more detail in the "Example" section below).
The top graphs show the number of duplicate transactions each node receives over time. The lower graphs display the bandwidth used per message type (measured in average kilobytes sent), with BlockParts messages in light green and Txs messages in dark green.
Achieving zero duplicate transactions is feasible only in non-Byzantine networks. In adversarial environments, as expected, some level of redundancy is necessary to counteract possible attacks. The DOG protocol implements a Redundancy Control mechanism (effectively a closed-loop controller) that allows tuning the level of redundant transactions, ensuring that all transactions reach all nodes and none are lost. By adjusting the redundancy level, we can also achieve latency comparable to the base algorithm.
The DOG protocol builds on top of the current mempool protocol, often referred to as “flood” or “v0”. This protocol broadcasts transactions found in the mempool to all of the node's peers, except the transaction sender. While it is very resilient to Byzantine attacks, it generates significant redundancy: numerous duplicate transactions, which result in sub-optimal bandwidth usage and transaction processing.
Notably, the DOG protocol is not limited to the mempool. It is a generic BFT gossip protocol that can be applied to disseminate other types of data as well. In the coming months, we will be exploring its applicability to other CometBFT messages, such as block parts or votes.
Large scale results
Before getting into the details of the protocol, let us present the compelling results we obtained on large-scale networks.
The following results were obtained by running a testnet with 200 full nodes (175 of them being validators) with a simple key-value store as the application. The CometBFT team uses this topology as part of its regular QA process.
A transaction load of 400 transactions per second, each transaction of size 1KiB, is injected for a period of time on one of the nodes. This workload is known to be close to the saturation point of a network with these settings (the saturation point is reached when the network is unable to process all the transactions it receives and transaction latencies start growing without limit).
The X axis in all graphs is experiment time, and all graphs in a test are aligned on the X axis.
Test A: the base FLOOD algorithm
This is how the mempool size evolves at all nodes. We can see that the network has received 4 "bursts" of close-to-saturation transaction workload.
The graph below shows the bandwidth consumed (averaged over 30s) by the different message types of CometBFT.
The important line is the violet one, which represents the bandwidth taken by transaction dissemination. We see all nodes collectively consume around 3.2 GiB/s under high load.
Test B: DOG
These results have been obtained by patching our prototype implementation of DOG directly on top of the version used for Test A. The target redundancy was set to 1 (one duplicate per transaction received on average), and the closed-loop controller was set to kick in slowly, to avoid unwanted oscillations.
This is the mempool size evolution. The network has received 2 bursts (longer than in Test A) of close-to-saturation transaction workload.
And this is the collective bandwidth consumed by the network (light green: transaction dissemination).
At the start, it is comparable to Test A (around 3.2 GiB/s), but as DOG kicks in we see the bandwidth consumed progressively decrease down to around 1.3 GiB/s (less than half!). In the second workload burst, the algorithm starts lower, as it remembers the disabled routes, and further decreases the transaction dissemination bandwidth to 760 MiB/s. This amounts to a bandwidth reduction in transaction dissemination of more than 75%.
Note that:
Finally, it is important to point out that consensus performance was unaffected in Test B (with respect to Test A). In both Test A and Test B:
Base protocol
The core idea of the protocol is the following. Consider a node A that receives from node B a transaction that it already has. Let's assume B itself had received the transaction from C. The fact that A received from B a transaction it already has means that there must exist a cycle in the network topology. Therefore, A tells B to stop sending the transactions B would be getting from C (i.e. A tells B to disable the route C → A on B's side).
Later on, if node A is not receiving enough transactions, it will tell its peers to gradually unblock disabled routes.
DOG simply adds two new messages:
- `HaveTx(TxKey)` for cutting cycles; it tells a peer "I already have this transaction; don't send me more from the same source", and
- `Reset` for dynamic rerouting when a peer disconnects, or when the node wants to receive more transactions (more on this later); it is used to tell a peer "my situation has changed; remove all disabled routes involving me on your side".

Each node keeps a list of disabled routes. A route in a node is a tuple `sourceNodeID → targetNodeID`, with source and target being any of the node's peers. Initially, all nodes have all their routes enabled. On receiving a `HaveTx(tx)` message from node A, node B will disable the route `sender(tx) → A`, where `sender(tx)` is the node from which B received `tx`. In general, when A is about to send `tx` to some peer C, it will check that the route `sender(tx) → C` is not in its list of disabled routes.
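To make the routing rules concrete, here is a minimal Go sketch of the per-node routing state described above. It is illustrative only, not the actual CometBFT implementation; `NodeID`, `TxKey`, `Router`, and the method names are assumptions.

```go
package dog

// NodeID and TxKey stand in for the corresponding CometBFT types.
type NodeID string
type TxKey [32]byte

// route is a (source → target) pair local to a node: transactions that the
// node receives from `source` should not be forwarded to `target`.
type route struct {
	source NodeID
	target NodeID
}

// Router holds the per-node routing state described above.
type Router struct {
	disabled map[route]struct{} // disabled routes; initially empty
}

func NewRouter() *Router {
	return &Router{disabled: make(map[route]struct{})}
}

// HandleHaveTx: on receiving HaveTx(tx) from peer `from`, disable the route
// sender(tx) → from, where `sender` is the peer from which this node received tx.
func (r *Router) HandleHaveTx(sender, from NodeID) {
	r.disabled[route{source: sender, target: from}] = struct{}{}
}

// ShouldForward: before sending tx to peer `to`, check that the route
// sender(tx) → to is not disabled (and never send a transaction back to the
// peer it came from).
func (r *Router) ShouldForward(sender, to NodeID) bool {
	if sender == to {
		return false
	}
	_, isDisabled := r.disabled[route{source: sender, target: to}]
	return !isDisabled
}
```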
Dynamic routing
What happens when a node joins the network? Its list of disabled routes is empty and, as the new node connects to peers, those peers will have no disabled routes involving the new node. From that point on, the new node and its peers will progressively adapt and close routes if they start receiving duplicate transactions.
What happens when a node is disconnected from the network, for whatever reason? We use the `Reset` message. A node sending `Reset` signals that its situation has changed and its routing data should be reset on its peers. From that point on, the routes will be adapted as the various nodes involved start receiving duplicate transactions and sending `HaveTx` messages.
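Continuing the illustrative `Router` sketch from the previous section (again an assumption, not the actual implementation), handling a received `Reset`, or a peer disconnection, amounts to dropping every disabled route in which that peer appears:

```go
// HandleReset: on receiving a Reset message from peer `from` (or on
// detecting that `from` disconnected), re-enable every disabled route in
// which `from` appears as source or target.
func (r *Router) HandleReset(from NodeID) {
	for rt := range r.disabled {
		if rt.source == from || rt.target == from {
			delete(r.disabled, rt) // deleting while ranging over a map is safe in Go
		}
	}
}
```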
We have run experiments with perturbations to the nodes, such as restarting or disconnecting nodes, and the results are as expected. On each perturbation we observed that there's a small spike in the number of duplicate transactions, and a few `HaveTx` and `Reset` messages were sent. Routes change but, after each perturbation, the traffic eventually stabilizes and comes back to normal.

Example
The small experiment we ran to obtain the metrics shown in the introduction section used 7 nodes. In the following figure, the red arrows are the disabled routes in each node that resulted from the initial interactions between the peers. The transaction load was 300 tx/s sent to nodes 1 and 7 for 120 seconds.
In this experiment we set the target redundancy to 0 (no duplicates). This was done for demonstration purposes; in real conditions, we may want to increase the target redundancy to 0.5, 1, or higher to increase the network's fault tolerance. For instance, a target redundancy of 1 means we are OK with receiving 1 duplicate for each transaction received.
When the network stabilizes, the result is a superposition of spanning trees, one for each node. Transactions entering via a node will have only one path to reach another node in the network, thus forming a spanning tree. In the current example, these are the spanning trees for nodes 1 and 7:
We see that transactions entering from node 1 have only one path to get to node 7 (the path 1 → 2 → 4 → 5 → 7). Transactions entering from node 7 follow the path 7 → 6 → 4 → 3 → 1 to get to node 1.
Not shown here, but all metrics other than bandwidth and duplicate transactions are the same as when we run the current protocol. Experiments with different topologies and loads have shown the expected results.
Byzantine Fault Tolerance
It is not obvious for a byzantine node to affect the traffic of other nodes by abusing the algorithm. With `HaveTx` messages, a node can only stop peers from sending transactions to itself; it cannot affect how other nodes receive transactions. Likewise for the `Reset` message, which only affects the routes of the node sending it.

There are two known ways for a byzantine node (or cabal of nodes) to cause trouble with the base algorithm: by not relaying transactions they are expected to relay, and by sending `HaveTx` messages they shouldn't send.

The way we have come up with to address those possible attacks is to target a non-zero redundancy in transaction transmission. The way we do it is explained in the next section.
Redundancy Control mechanism
As discussed above, the base protocol (which targets 0 redundancy) is vulnerable to attacks that may render a node isolated or may cause the network to lose some transactions. Hence, we introduce a Redundancy Control mechanism that maintains a predefined target level of duplicate transactions, to mitigate potential attacks on the network.
This mechanism is implemented as a closed-loop controller. The intuition is the following.
Let’s say a node sets a target redundancy of 1 (for each received transaction the node is OK to receive the same transaction one more time). To achieve the target, a node monitors the average duplicates it receives, and periodically adjusts its redundancy:
- If the measured redundancy is below the target, the node asks for more transactions by sending a `Reset` message.
- If the measured redundancy is above the target, the node reduces incoming traffic by replying to the next duplicate transaction with a `HaveTx` message.
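As a rough sketch of what this closed-loop adjustment could look like (illustrative Go only; the field names, the measurement window, and the choice of how `Reset` is sent are assumptions, not the actual prototype):

```go
package dog

// RedundancyControl periodically compares the observed redundancy
// (duplicate transactions per first-time transaction) against a target and
// decides whether to ask for more traffic or to allow cutting some of it.
type RedundancyControl struct {
	target       float64 // e.g. 1.0: one duplicate per new transaction
	firstTimeTxs int64   // transactions seen for the first time in this window
	duplicateTxs int64   // duplicate transactions received in this window
	blockHaveTx  bool    // while true, the node does not reply with HaveTx
}

// Adjust is called periodically (for instance every N received transactions).
// sendReset asks peers for more traffic, e.g. by sending Reset to one peer
// (to which peer(s) it is sent is left open here).
func (rc *RedundancyControl) Adjust(sendReset func()) {
	if rc.firstTimeTxs == 0 {
		return // nothing measured in this window yet
	}
	redundancy := float64(rc.duplicateTxs) / float64(rc.firstTimeTxs)
	switch {
	case redundancy < rc.target:
		// Too few duplicates: ask for more transactions and refrain from
		// cutting routes for a while.
		rc.blockHaveTx = true
		sendReset()
	case redundancy > rc.target:
		// Too many duplicates: allow HaveTx replies again, so routes get
		// disabled on the next duplicates received.
		rc.blockHaveTx = false
	}
	// Start a new measurement window.
	rc.firstTimeTxs, rc.duplicateTxs = 0, 0
}
```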
Formal specification
We are working on a Quint specification of the base protocol. The process of writing the formal spec has been very useful during the early design of the protocol, even before writing any code, especially to help us think about corner cases. The spec is still a work in progress.