Dynamic Optimal Graph (DOG) gossip protocol · Issue #3263 · cometbft/cometbft · GitHub

Dynamic Optimal Graph (DOG) gossip protocol #3263


Closed
hvanz opened this issue Jun 13, 2024 · 19 comments
Assignees
Labels
enhancement New feature or request mempool P:bandwidth-optimization Priority: Optimize bandwidth usage protocol-change A protocol change proposal wip Work in progress
Milestone

Comments

@hvanz
Member
hvanz commented Jun 13, 2024

Tracking issue: #3297

Abstract

This issue describes DOG, a novel transaction dissemination algorithm for CometBFT's mempool. At its current prototype stage, it has already shown compelling results at scale, reducing the bandwidth taken by transaction dissemination by more than 75% (see the "Large scale results" section below).

Introduction

We propose a new protocol for transaction dissemination called Dynamic Optimal Graph (DOG), designed to reduce the number of received duplicate transactions to a pre-configured value such as 1, 0.5, or even 0! This reduction in duplicates could, in principle, bring the bandwidth used for transmitting transactions down to the theoretical minimum needed to get all transactions to all nodes.

Consider the following small example with 7 nodes, where

  • the results using the current basic FLOOD mempool algorithm are displayed on the left, and
  • the results using the DOG protocol are displayed on the right:

[Figure: duplicate transactions and bandwidth per message type, FLOOD (left) vs DOG (right)]
The top graphs show the number of duplicate transactions each node receives over time. The lower graphs display the bandwidth used per message type (measured in average kilobytes sent), with BlockParts messages in light green and Txs messages in dark green.

Achieving zero duplicate transactions is feasible only in non-Byzantine networks. In adversarial environments, as expected, some level of redundancy is necessary to counteract possible attacks. The DOG protocol implements a Redundancy Control mechanism (effectively a closed-loop controller) that allows tuning the level of redundant transactions, ensuring that all transactions reach all nodes and none are lost. By adjusting the redundancy level, we can also achieve latency comparable to the base algorithm.

The DOG protocol builds on top of the current mempool protocol, often referred to as “flood” or “v0”. This protocol broadcasts transactions found in the mempool to all of the node's peers, except the transaction sender. While it is very resilient to Byzantine attacks, it generates significant redundancy: numerous duplicate transactions that translate into sub-optimal bandwidth usage and transaction processing.

Notably, the DOG protocol is not limited to the mempool. It is a generic BFT gossip protocol that can be applied to disseminate other types of data as well. In the next months, we will be exploring its applicability to other CometBFT messages such as block parts, or votes.

Important: We are still at early stages of design and prototyping, so future changes to this proposal are expected.

Large scale results

Before getting into the details of the protocol, let us present the compelling results we obtained on large scale networks.

The following results were obtained by running a testnet with 200 full nodes (175 of them being validators) with a simplistic kv store as application. The CometBFT team uses this topology as part of its regular QA process.

A transaction load of 400 transactions per second, each transaction of size 1KiB, is injected for a period of time on one of the nodes. This workload is known to be close to the saturation point of a network with these settings, that is, the load beyond which the network is unable to process all the transactions it receives and transaction latencies start growing without limit.

The X axis in all graphs is experiment time, and all graphs in a test are aligned on the X axis.

Test A: the base FLOOD algorithm

This is how the mempool size evolves at all nodes. We can see that the network has received 4 "bursts" of close-to-saturation transaction workload.

[Figure: mempool size at all nodes over time (Prometheus)]

The graph below shows the bandwidth consumed (averaged over 30s) by the different message types of CometBFT.
The important line is the violet one, which represents the bandwidth taken by transaction dissemination. We see all nodes collectively consume around 3.2 GiB/s under high load.

[Figure: bandwidth consumed per message type (Prometheus)]

Test B: DOG

These results have been obtained by patching our prototype implementation of DOG directly on top of the version used for Test A. The target redundancy was set to 1 (one duplicate per transaction received on average), and the closed-loop controller was set to kick in slowly, to avoid unwanted oscillations.

This is the mempool size evolution. The network has received 2 bursts (longer than in Test A) of close-to-saturation transaction workload.

[Figure: mempool size at all nodes over time (Prometheus)]

And this is the collective bandwidth consumed by the network (light green: transaction dissemination).

[Figure: bandwidth consumed per message type (Prometheus)]

At the start, it is comparable to Test A (around 3.2 GiB/s), but as DOG kicks in we see the bandwidth consumed progressively decrease down to around 1.3 GiB/s (less than half!). In the second workload burst, the algorithm starts lower, as it remembers the disabled routes, and further decreases the transaction dissemination bandwidth to 760 MiB/s. This amounts to a bandwidth reduction in transaction dissemination of more than 75%.

Note that:

  • the various parameters that influence the closed-loop controller have been set using conservative guesses, so there is still much room for optimizing them.
  • given those conservative values chosen for the closed-loop controller, the experiment was stopped before the algorithm reached the target redundancy of 1.

Finally, it is important to point out that consensus performance was unaffected in Test B (with respect to Test A). In both Test A and Test B:

  • The average block production rate under load was similar (around 17 blocks/min)
  • The average rate of transactions included in a block was also similar (somewhat above 23000 tx/min)

Base protocol

The core idea of the protocol is the following. Consider a node A that receives from node B a transaction that it already has. Let's assume B itself had received the transaction from C. The fact that A received from B a transaction it already has means that there must exist a cycle in the network topology. Therefore, A tells B to stop sending the transactions B would be getting from C (i.e. A asks B to disable the route that carries C's transactions through B to A).

Later on, if node A is not receiving enough transactions, it will tell its peers to gradually unblock disabled routes.

DOG simply adds two new messages (sketched in code right after this list):

  • HaveTx(TxKey) for cutting cycles; it tells a peer “I already have this transaction; don’t send me more from the same source”, and
  • Reset for dynamic rerouting when a peer disconnects, or when the node wants to receive more transactions (more on this later); it is used to tell a peer “my situation has changed; remove all disabled routes involving me on your side”.
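For illustration only, here is a minimal Go sketch of how these two control messages could be modelled. The names and layout (TxKey, HaveTx, Reset) are assumptions made for this sketch, not the actual CometBFT wire types.

```go
package dogsketch

// TxKey identifies a transaction; here it is assumed to be a fixed-size
// hash of the transaction bytes.
type TxKey [32]byte

// HaveTx is sent after receiving a duplicate transaction. It asks the
// peer that sent the duplicate to stop forwarding transactions that it
// first receives from the same source as the transaction with this key.
type HaveTx struct {
	Key TxKey
}

// Reset asks a peer to drop, on its side, every disabled route that
// involves the sender of the Reset (e.g. after a reconnection, or when
// the sender wants to receive more transactions again).
type Reset struct{}
```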

Each node keeps a list of disabled routes. A route in a node is a tuple sourceNodeID → targetNodeID, with source and target being any of the node's peers. Initially, all nodes have all their routes enabled. On receiving a HaveTx(tx) message from node A, node B will disable the route sender(tx) → A, where sender(tx) is the node from which B received tx. In general, when A is about to send tx to some peer C, it will check that the route sender(tx) → C is not in its list of disabled routes.
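Continuing the same hypothetical sketch (same package as above), the per-node routing state could look roughly like this; NodeID, Router and the method names are invented for illustration, not taken from the implementation.

```go
import "sync"

// NodeID identifies a peer (a stand-in for the real peer ID type).
type NodeID string

// route is the tuple sourceNodeID -> targetNodeID kept in the list of
// disabled routes: transactions first received from Source are not
// forwarded to Target while the route is disabled.
type route struct {
	Source NodeID
	Target NodeID
}

// Router holds the node's disabled routes. Initially all routes are
// enabled, i.e. the set of disabled routes is empty.
type Router struct {
	mtx      sync.Mutex
	disabled map[route]struct{}
}

func NewRouter() *Router {
	return &Router{disabled: make(map[route]struct{})}
}

// OnHaveTx handles a HaveTx received from peer: it disables the route
// senderOfTx -> peer, where senderOfTx is the peer from which this node
// first received the transaction referenced by the HaveTx.
func (r *Router) OnHaveTx(peer, senderOfTx NodeID) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.disabled[route{Source: senderOfTx, Target: peer}] = struct{}{}
}

// ShouldForward reports whether a transaction first received from
// senderOfTx may be forwarded to peer, i.e. whether the route
// senderOfTx -> peer is still enabled.
func (r *Router) ShouldForward(senderOfTx, peer NodeID) bool {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	_, disabled := r.disabled[route{Source: senderOfTx, Target: peer}]
	return !disabled
}
```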

Dynamic routing

What happens when a node joins the network? Its list of disabled routes is empty and, as the new node connects to peers, those peers will have no disabled routes involving the new node. From that point on, the new node and its peers will progressively adapt and close routes if they start receiving duplicate transactions.

What happens when a node is disconnected from the network, for whatever reason? We use the Reset message. A node sending Reset signals that its situation has changed and its routing data should be reset on its peers. From that point on, the routes will be adapted as the various nodes involved start receiving duplicate transactions and sending HaveTx messages.
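Under the same assumptions as the sketches above, handling a Reset (and, similarly, a peer disconnection) can be as simple as dropping every disabled route that involves that peer:

```go
// OnReset handles a Reset received from peer (or peer disconnecting):
// it removes every disabled route in which peer appears as source or
// target, so traffic towards it can flow again and be re-pruned by new
// HaveTx messages if duplicates reappear.
func (r *Router) OnReset(peer NodeID) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	for rt := range r.disabled {
		if rt.Source == peer || rt.Target == peer {
			delete(r.disabled, rt)
		}
	}
}
```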

We have run experiments with perturbations to the nodes, such as restarting or disconnecting nodes, and the results are as expected. On each perturbation we observed a small spike in the number of duplicate transactions, and a few HaveTx and Reset messages being sent. Routes change but, after each perturbation, the traffic eventually stabilizes and comes back to normal.

Example

The small experiment we ran to obtain the metrics shown in the introduction section was made with 7 nodes. In the following figure, the red arrows are the disabled routes in each node that resulted from the initial interactions between the peers. The transaction load was 300 tx/s sent to nodes 1 and 7 for 120 seconds.

In this experiment we set the target redundancy to 0 (no duplicates). This was done for demonstration purposes; in real conditions, we may want to increase the target redundancy to 0.5, 1, or higher to increase the network's fault tolerance. For instance, a target redundancy of 1 means we are OK with receiving 1 duplicate for each transaction received.

When the network stabilizes, the result is a superposition of spanning trees, one for each node. Transactions entering via a node will have only one path to reach another node in the network, thus forming a spanning tree. In the current example, these are the spanning trees for nodes 1 and 7:

We see that transactions entering from node 1 have only one path to get to node 7 (the path 1 → 2 → 4 → 5 → 7). Transactions entering from node 7 follow the path 7 → 6 → 4 → 3 → 1 to get to node 1.

Not shown here, but all other metrics except for bandwidth and duplicate transactions are the same as when we run the current protocol. Experiments with different topologies and loads have shown the expected results.

Byzantine Fault Tolerance

There is no obvious way for a Byzantine node to affect the traffic of other nodes by abusing the algorithm. With HaveTx messages, a node can only stop peers from sending transactions to itself; it cannot affect how other nodes receive transactions. Likewise for the Reset message, which only affects the routes of the node sending it.

There are two known ways for a Byzantine node (or a cabal of nodes) to cause trouble with the base algorithm:

  • stay in the network for a while, wait patiently until it has many peers and, eventually, stop forwarding transactions it is supposed to forward according to the protocol.
  • inject the same transaction at two different points of the network to fool some nodes in the middle, making them believe there is a cycle somewhere and causing them to send HaveTx messages they shouldn't send.

The way we address these possible attacks is by targeting a non-zero redundancy in transaction transmission, as explained in the next section.

Redundancy Control mechanism

As discussed above, the base protocol (which targets 0 redundancy) is vulnerable to attacks that may render a node isolated or may cause the network to lose some transactions. Hence, we introduce a Redundancy Control mechanism that maintains a predefined target level of duplicate transactions, mitigating potential attacks on the network.

This mechanism is implemented as a closed-loop controller. The intuition is the following.

Let’s say a node sets a target redundancy of 1 (for each transaction it receives, the node is OK with receiving the same transaction one more time). To achieve the target, the node monitors the average number of duplicates it receives and periodically adjusts its redundancy as follows (a sketch of this adjustment step is given after the list):

  • If current redundancy < target redundancy, then send a Reset message.
  • If current redundancy >= target redundancy, then allow sending one HaveTx message.
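The following is a minimal Go sketch of one such adjustment step, in the same hypothetical style as the sketches in the previous sections; the counters, the sendReset callback, and the haveTxBudget mechanism are assumptions, and the real controller also shapes how quickly it reacts.

```go
// Controller accumulates counters over a measurement interval and
// periodically compares the measured redundancy with the target.
type Controller struct {
	mtx              sync.Mutex
	targetRedundancy float64 // e.g. 1.0: one duplicate per first-time tx
	firstTimeTxs     int64   // txs received for the first time in this interval
	duplicateTxs     int64   // duplicate txs received in this interval
	haveTxBudget     int64   // how many HaveTx messages may still be sent
	sendReset        func()  // asks peers to re-enable routes involving us
}

// adjust runs periodically: too few duplicates means routes were cut
// too aggressively (ask for more traffic with Reset); enough duplicates
// means we can afford to cut one more route (allow one HaveTx).
func (c *Controller) adjust() {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	if c.firstTimeTxs == 0 {
		return // nothing received in this interval; nothing to conclude
	}
	current := float64(c.duplicateTxs) / float64(c.firstTimeTxs)

	if current < c.targetRedundancy {
		c.sendReset() // re-open some routes
	} else {
		c.haveTxBudget++ // allow cutting one more route
	}

	// Start a new measurement interval.
	c.firstTimeTxs, c.duplicateTxs = 0, 0
}
```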

Formal specification

We are working on a Quint specification of the base protocol. Writing the formal spec has been very useful during the early design of the protocol, even before writing any code, especially for thinking through corner cases. The spec is still a work in progress.

@hvanz hvanz added enhancement New feature or request mempool wip Work in progress protocol-change A protocol change proposal P:bandwidth-optimization Priority: Optimize bandwidth usage labels Jun 13, 2024
@hvanz hvanz added this to CometBFT Jun 13, 2024
@github-project-automation github-project-automation bot moved this to Todo in CometBFT Jun 13, 2024
@hvanz hvanz changed the title teaser: Dynamic Optimal Graph (DOG) gossip protocol Dynamic Optimal Graph (DOG) gossip protocol Jun 14, 2024
@sergio-mena sergio-mena moved this from Todo to In Progress in CometBFT Jun 14, 2024
@Wondertan

JFYI, this proposal is isomorphic to libp2p/specs#548 and libp2p/specs#413

@hvanz
Member Author
hvanz commented Jun 17, 2024

JFYI, this proposal is isomorphic to libp2p/specs#548 and libp2p/specs#413

Not really. The semantics of the messages are different. First, the IDONTWANT message is an improvement to a push/pull protocol, while DOG is a push protocol. The IDONTWANT message is used to tell a peer not to send a specific message ID, even if the node doesn't know whether the peer ever saw that message. On the other hand, our HaveTx message is only sent after receiving a duplicate transaction and is used to tell a peer not to send any more transactions from the same source. In this sense, it resembles the CHOKE message, because both aim to stop receiving duplicate messages/transactions, but a key difference is that we use the sender of the transaction to form a superposition of multiple spanning tree topologies. And a spanning tree is the optimal topology one could have when gossiping.

@hvanz hvanz mentioned this issue Jun 17, 2024
@Wondertan

First, the IDONTWANT message is an improvement to a push/pull protocol, while DOG is a push protocol.

There might be some misunderstanding. GossipSub is not a push/pull protocol but a push protocol in the first place. GossipSub has a pulling mechanism between overlay meshes to protect against a rare case where meshes disconnect, but this mechanism is strictly additive to the core pushing part. If GossipSub were a pull/push protocol, it wouldn't be able to gossip ETH blocks fast enough.

@hvanz
Member Author
hvanz commented Jun 18, 2024

Still, IDONTWANT is only used on the pulling mechanism. I don't see how it's relevant to DOG, which is a purely push protocol.

@Wondertan

Still, IDONTWANT is only used on the pulling mechanism.

That's not the case. The first paragraph from the spec:

When the peer receives the first message instance it immediately broadcasts
(not queue for later piggybacking) IDONTWANT with the messageId to all its mesh peers.

"Broadcast ... to all its mesh peers" implies pushing.

@sergio-mena
Contributor

@Wondertan I see from your comments that you are very interested in DOG, and it's no surprise to me: indeed Celestia is potentially one of the biggest beneficiaries of the huge bandwidth savings that we have observed at scale. If you wish to discuss the details of the algorithm, we will be more than happy to do it in person: please come to our next CometBFT community call (in two weeks' time) where we can discuss those details at length. I don't think a long discussion here is helping other readers of this issue in any meaningful way.

@hvanz
Member Author
hvanz commented Jun 19, 2024

"Broadcast ... to all its mesh pee 8000 rs" implies pushing.

Afaiu, "pushing" in push gossip protocols refers to sending the information that one wants to disseminate. IDONTWANT is a control message; it does not transmit information, only metadata.

@cason
Contributor
cason commented Jun 19, 2024

My impression is that this IDONTWANT message has a very similar behavior to the SeenTx message introduced in the CAT mempool, proposed by Celestia.

The point here is to inform all peers that the node has a message, so that no one sends the same message again to the node.

This is not a pull operation. This is a control message to prevent pushing of messages, which is not the same.

@cason
Contributor
cason commented Jun 19, 2024

So, regarding the differences between the protocols.

First of all, I have to say that when looking at the DOG design the first thing that came to my mind was libp2p's GossipSub. They are protocols essentially trying to solve the same problem: produce a low-cost dissemination sub-graph in a p2p/partially-connected network.

The DOG protocol reacts to the reception of duplicate messages. If I receive m again, I react by sending a HaveTx message back to its sender. In the case of the IDONTWANT message, the node informs that it has received m for the first time and doesn't want to receive m again.

The goal of the HaveTx message is more similar to that of the PRUNE messages in libp2p gossip protocols. Namely, we realize that one of our peers only brings us duplicate messages, so we gently ask the peer to stop forwarding messages of that kind to us.

An important difference, though, is how we define "a kind of message". In the case of libp2p, but correct me if I am wrong, we refer to a topic, as it is a pubsub system. In the case of HaveTx it is more subtle: essentially, we are telling our peer not to forward to us messages coming from the same "path" as the message I am encoding in my HaveTx.

So, I am a node that has received m. Then, I receive m again from peer X. I send peer X a HaveTx(id(m)) message. Peer X had received m from someone before forwarding m to me. What peer X then does is disable the forwarding, to me, of messages first received from the same sender as m.

Again, there is no pulling in either protocol. They only push messages to everyone until they receive some information from peers telling them to chill out and send them fewer or no messages.

@cason
Contributor
cason commented Jun 19, 2024

Anyway, I think that libp2p/specs#413 is relevant and we should follow the outcomes of it.

@Wondertan
Copy link
Wondertan commented Jun 19, 2024

@sergio-mena, thanks for the invite! I agree it is worth discussing this in person, but I can't imagine comparing DOG to other protocols on Comet's Community Call. These calls need to cover multiple topics in a fixed, limited time and involve a lot of parties, while detailed comparisons between protocols seem to be a very low-level and lengthy discussion. The issue with the proposal seems to be like the right place to compare these protocols unless you can propose a better alternative.

@Wondertan

Afaiu, "pushing" in push gossip protocols refers to sending the information that one wants to disseminate. IDONTWANT is a control message; it does not transmit information, only metadata.

@hvanz, this is correct, but the point is that IDONTWANT is sent right before pushing the user payload, and the payload isn't being pulled.

@cason summarised it well here:
This is not a pull operation. This is a control message to prevent pushing of messages, which is not the same.

The idea is to front-run your peer with an IDONTWANT message to prevent it from pushing the payload to you.

@cason
Contributor
cason commented Jun 20, 2024

I just wonder whether HaveTx is a good name for a control message with the goal it has.

If I understood it correctly, this message says: "the messages you are sending me that match tx are duplicates that I am already receiving from another source". Since the matching definition is implicit, only the node that receives HaveTx(tx) knows the "kind" of messages it has to stop forwarding to the peer from which it received that message (namely, messages that the node first receives from the same peer that first sent tx to it).

The typical use of a message with this name matches more closely the libp2p protocol whose spec is drafted in a post above. It means that we prefer to inform all our peers that we already have some data, rather than receive the same data from them and discard it as a duplicate. This is what our consensus protocol does for votes and block parts (HasVote and HasBlockPart).

@cason
Contributor
cason commented Jun 20, 2024

Another thing I wonder about is a way to formalize the protocol using graph theory.

We have an undirected graph, which is the network topology (nodes connected with their peers). And, from this protocol's point of view, we have messages (txs) that we, a node N, receive from a peer P because P has received them from its peer Q and has forwarded them to us.

So essentially, we have a connection from Q to N via P. Q sends txs to P, which is our peer and forwards them to us (N). Notice this is a link, a directed edge. What the protocol does is disable this link. Namely, when we tell P to stop forwarding us something like tx, we are asking it to remove the link between Q and N (us) via P.

The spanning graph is therefore defined in terms of this kind of hyper-graph, where we connect two nodes (Q and N) if there is a third node (P) between them that will forward messages from the first node (Q) to the second (N). It is from this hyper-graph that we are trying to remove cycles, if my reasoning is not wrong.

All this reasoning is, again, to differentiate this solution from the ones adopted by libp2p in GossipSub, which in the end are mostly based on this paper: https://ieeexplore.ieee.org/document/4365705 (or https://asc.di.fct.unl.pt/~jleitao/pdf/srds07-leitao.pdf). That algorithm indeed produces a spanning tree on the "mesh" or p2p-connections graph, by removing (PRUNE) or adding back (GRAFT) links in this graph.


@BrendanChou
Contributor
BrendanChou commented Jun 21, 2024

This would be amazing to have, this is a great idea!

One important point that I don't believe is mentioned above is that non-zero redundancy may actually have performance benefits for the network, depending on topology, and especially if transactions enter the network at random nodes (rather than mostly at a particular node or nodes).

Even in the case where you don't have Byzantine or faulty nodes, my intuition is that a redundancy of 1 or greater may improve performance when:

  • The network topology is more "spread out" (e.g. all validators are not closely peered)
  • Users are sensitive to the time between when 1) the entry-point of the network sees the tx for the first time and 2) the validator of the next block sees the tx

Take for example the simple case of having validators A, B, ..., Z which are connected in a ring from A<->B...Y<->Z and then one more connection from Z<->A. Assuming no faulty or Byzantine nodes, and assuming that tx enter the network at a random validator, you would want to keep the ring connection intact (i.e. redundancy of 1) to improve latency by an average of 50% while only resulting in <4% more network bandwidth.
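Rough back-of-the-envelope for those figures, assuming uniform link latency: a spanning tree over the 26 validators needs 25 link transmissions per transaction, while the intact ring needs 26 (the transaction travels both ways around and the node opposite the entry point receives it twice), i.e. 26/25, which is under 4% extra bandwidth. On the latency side, the intact ring reaches the farthest node in 13 hops, whereas cutting the ring into a line can leave a node up to 25 hops away from the entry point, which is roughly where the ~50% improvement comes from.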

This is making me think that it would be beneficial if the value of redundancy could also be intelligently optimized in real time by the network, rather than being set as a static value in a config. (The config itself could provide min/max/initial guardrails for the value if needed.)

@BrendanChou
Contributor

How does the system handle this case that I think can cause starvation:

Consider a 4-node network where nodes are connected in a cycle. Let us be node Z and we are connected to nodes A and B (which are not connected to each other). Node X (to which we are not connected directly) is connected to nodes A and B.

Z - A
|   |
B - X

Let's say that our redundancy is set at 0. If two transactions, txa and txb enter the network at X, and it sends the transactions txb...txa to B, and transactions txa...txb to A (where ... represents some latency or amount of time between the transactions), then I will receive txb from B first and txa from A first.

Later, I receive txa from B and txb from A. For both of these transactions, if I respond HaveTx, then B and A will both disable the route from X. This results in me never being able to see the transactions that enter the network at X.

How is this solved for?

@hvanz
Member Author
hvanz commented Jun 24, 2024

One important point that I don't believe is mentioned above, is that non-zero redundancy may actually have performance benefits for the network

That's a good point. Redundancy > 0 means that the node is receiving transactions from more than one peer, which could be, for instance, the peers with the lowest latencies. So when the path through one of these peers becomes slower, the closed-loop controller allows the node to reset and pick a new path with lower latency. We still need to run many performance tests, especially to find ideal parameters, which would allow us to analyse how often we want to reset paths or create new ones. As before, it will be a tradeoff between redundancy and latency.

@hvanz
Member Author
hvanz commented Jun 24, 2024

How does the system handle this case that I think can cause starvation: ...

Good observation. This is not mentioned above, but the protocol has the following rule:

  • After sending a HaveTx msg, don't send more HaveTx msgs for some time.

In your example, let's say node Z first replies HaveTx to A. The goal of this rule is to let traffic on the cut route stop gradually, while all traffic is diverted to the other peer. How much time should we wait before sending HaveTx again? At least the time it takes already-dispatched messages from B to arrive at Z. In the meantime Z will receive duplicate transactions from B, but we don't care about that. We don't want to render Z isolated from the traffic coming from X, so we need to be conservative with the time we abstain from sending HaveTx. In the current prototype we measure time in number of transactions received, e.g. 500 txs, which is higher than the saturation point of the network.
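A rough Go sketch of this rule, measuring the cooldown in received transactions as described above (the names and the exact structure are illustrative, not the prototype's code):

```go
// haveTxCooldown suppresses HaveTx messages for a number of received
// transactions after each HaveTx that is sent (e.g. 500 txs).
type haveTxCooldown struct {
	txsPerCooldown int64 // e.g. 500
	remaining      int64 // > 0 means we are still cooling down
}

// onTxReceived is called for every transaction the node receives,
// first-time or duplicate, and counts down the cooldown window.
func (c *haveTxCooldown) onTxReceived() {
	if c.remaining > 0 {
		c.remaining--
	}
}

// trySendHaveTx reports whether a HaveTx may be sent now; if it may,
// it also starts a new cooldown window.
func (c *haveTxCooldown) trySendHaveTx() bool {
	if c.remaining > 0 {
		return false
	}
	c.remaining = c.txsPerCooldown
	return true
}
```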

@melekes melekes pinned this issue Sep 6, 2024
@jmalicevic jmalicevic added this to the 2024-Q4 milestone Oct 21, 2024
@jmalicevic jmalicevic self-assigned this Oct 21, 2024
@hvanz hvanz mentioned this issue Oct 22, 2024
@hvanz
Member Author
hvanz commented Dec 10, 2024

Closing issue: the first version of the protocol was implemented in #3297.
