How to speed up diagnosis of chain execution problems when the root cause is inconsistent config #1093
Replies: 4 comments 1 reply
-
I like this approach much more than the current opaque approach where the parameters are configured in `config.toml`. The consensus-param-based approach seems problematic because, if a network ends up unable to come to agreement under a specific set of consensus-param-based timing parameters, there's no way for the network to progress to the next height in order to update those timing parameters short of a hard fork (perhaps that's acceptable for such important parameters?).
In order for the network to be troubleshot effectively, one needs to be able to trace the validator(s) with the aberrant timing parameters. Ideally it should be easy to identify and punish a cabal of validators whose timing parameters are substantially off from what the network requires (which somewhat strengthens the case for making these parameters consensus parameters).
It would be interesting to understand why an operator would not want these parameters exposed - I can only think of nefarious reasons at this point.
-
We actually need a new channel and a (new?) reactor responsible for processing messages from that channel. The p2p layer itself does not process any messages.
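For illustration, here is a minimal sketch of what such a reactor could look like. Everything in it is hypothetical (the channel ID, the reactor name, the message handling), and it assumes the classic `Receive(chID, peer, msgBytes)` reactor signature; newer CometBFT releases use an Envelope-based `Receive` instead, so the exact shape would differ by version:

```go
// Package configsync is a hypothetical reactor that receives config-related
// messages on a dedicated p2p channel. Sketch only, not part of CometBFT.
package configsync

import (
	"github.com/cometbft/cometbft/p2p"
)

// ConfigChannel is a made-up channel ID; a real one must not clash with
// the IDs already used by the consensus, mempool, evidence, etc. reactors.
const ConfigChannel = byte(0x70)

type Reactor struct {
	p2p.BaseReactor
}

func NewReactor() *Reactor {
	r := &Reactor{}
	r.BaseReactor = *p2p.NewBaseReactor("ConfigSync", r)
	return r
}

// GetChannels registers the new channel with the p2p switch.
func (r *Reactor) GetChannels() []*p2p.ChannelDescriptor {
	return []*p2p.ChannelDescriptor{{
		ID:                  ConfigChannel,
		Priority:            1,
		SendQueueCapacity:   10,
		RecvMessageCapacity: 1024,
	}}
}

// Receive is where the config-related messages would actually be processed;
// the p2p layer only delivers the raw bytes to the reactor.
func (r *Reactor) Receive(chID byte, src p2p.Peer, msgBytes []byte) {
	// Decode msgBytes and record/compare the peer's reported config here.
	r.Logger.Info("received config message", "peer", src.ID(), "len", len(msgBytes))
}
```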
-
While I like the idea of reaching "consensus" over network configuration, it also increases the amount of data sent between peers and potentially poses security risks (another DoS attack vector?). An alternative idea: a CLI sub-command that checks the node's config against a recommended config.
The recommended config could come from the network's forum, GitHub, one of the validators' websites, or any semi-trusted source. How often do configs change? If that never happens, or happens very rarely, then the CLI command might be a good fit (a rough sketch of such a sub-command is included below). Wdyt?
Pros:
Cons:
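To make the idea concrete, here is a rough, standalone sketch of what such a check could look like. The `check-config` name is hypothetical, it only handles TOML, and it uses the `BurntSushi/toml` library for simplicity; a real version would presumably be a `cometbft` sub-command reusing its own config types:

```go
// check-config: compare a node's TOML config against a recommended one.
package main

import (
	"fmt"
	"os"
	"reflect"

	"github.com/BurntSushi/toml"
)

// flatten turns nested TOML tables into dotted keys, e.g. "consensus.timeout_commit".
func flatten(prefix string, in map[string]interface{}, out map[string]interface{}) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if sub, ok := v.(map[string]interface{}); ok {
			flatten(key, sub, out)
		} else {
			out[key] = v
		}
	}
}

// load reads a TOML file and returns its contents as a flat key/value map.
func load(path string) (map[string]interface{}, error) {
	raw := map[string]interface{}{}
	if _, err := toml.DecodeFile(path, &raw); err != nil {
		return nil, err
	}
	flat := map[string]interface{}{}
	flatten("", raw, flat)
	return flat, nil
}

// Usage: check-config <node-config.toml> <recommended-config.toml>
func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: check-config <node-config.toml> <recommended-config.toml>")
		os.Exit(1)
	}
	node, err := load(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "node config:", err)
		os.Exit(1)
	}
	rec, err := load(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, "recommended config:", err)
		os.Exit(1)
	}
	diffs := 0
	for k, want := range rec {
		if got, ok := node[k]; !ok || !reflect.DeepEqual(got, want) {
			fmt.Printf("mismatch: %s: have %v, recommended %v\n", k, got, want)
			diffs++
		}
	}
	if diffs == 0 {
		fmt.Println("config matches the recommended settings")
	}
}
```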
-
I think the CLI idea is a good one. Still, the main objective of the original idea is to help with troubleshooting when a problem arises in a real network and we suspect an inconsistency between nodes' configs. Currently it is very difficult (or impossible) to obtain evidence of such a misconfiguration in the field. So I'd say both ideas are valid and tackle different aspects of the same problem: the original one would help with live troubleshooting, while @melekes's would help cautious operators avoid the problem in the first place.
-
As part of their responsibilities, the CometBFT team has participated in troubleshooting a number of problems in chains (testnet and production).
Over the last months, many of those problems were root-caused to inconsistent configuration among full nodes, including validators (we've had both `config.toml` and `app.toml` inconsistencies).
Leaving aside the discussion on whether a given parameter should live in `config.toml` or be a Consensus Param, this discussion describes an idea that can help DevRels more easily detect inconsistent configs when troubleshooting chain issues. Here is a concise description of the idea (refinements welcome!):
One advantage of this idea is that it would be done completely out-of-chain: we'd just need:
As a result, the upgrade path for existing chains should be very easy (I think we could even support an uncoordinated upgrade, but I'm not 100% sure).
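Purely as a hypothetical illustration of what an out-of-chain check could boil down to (this is not necessarily the mechanism proposed here, and every name below is made up): nodes could expose a deterministic fingerprint of the config fields that must match, so a DevRel can compare short strings across nodes instead of collecting whole config files.

```go
// Package configcheck sketches a deterministic config fingerprint.
// Illustrative only; all names are hypothetical.
package configcheck

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Fingerprint hashes a canonical "key=value" rendering of the selected
// config fields. Two nodes with identical settings produce identical
// fingerprints, so a single string comparison spots the odd node out.
func Fingerprint(fields map[string]string) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys) // canonical order, so the hash is deterministic
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s\n", k, fields[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}
```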