How to speed up diagnosis of chain execution problems when the root cause is inconsistent config #1093
Replies: 4 comments 1 reply
-
I like this approach much more than the current opaque approach where the parameters are configured in `config.toml`. The consensus-param-based approach seems problematic because, if a network ends up unable to come to agreement under a specific set of consensus-param-based timing parameters, there's no way for the network to progress to the next height in order to update those timing parameters short of a hard fork (perhaps that's acceptable for such important parameters?).
In order for the network to be troubleshot effectively, one needs to be able to trace the validator(s) with the aberrant timing parameters. Ideally it should be easy to identify and punish a cabal of validators whose timing parameters are substantially off from what the network requires (which somewhat strengthens the case for making these parameters consensus parameters).
It would be interesting to understand why an operator would not want these parameters exposed - I can only think of nefarious reasons at this point.
-
We actually need a new channel and a (new?) reactor responsible for processing messages from that channel. The p2p layer itself does not process any messages.
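For illustration, here is a minimal sketch of what such a reactor could look like. Everything in it is hypothetical (the channel ID, the reactor name, the message handling), and it assumes the classic `Receive(chID, peer, msgBytes)` reactor signature; newer CometBFT releases use an Envelope-based `Receive` instead, so the exact shape would differ by version:

```go
// Package configsync is a hypothetical reactor that receives config-related
// messages on a dedicated p2p channel. Sketch only, not part of CometBFT.
package configsync

import (
	"github.com/cometbft/cometbft/p2p"
)

// ConfigChannel is a made-up channel ID; a real one must not clash with
// the IDs already used by the consensus, mempool, evidence, etc. reactors.
const ConfigChannel = byte(0x70)

type Reactor struct {
	p2p.BaseReactor
}

func NewReactor() *Reactor {
	r := &Reactor{}
	r.BaseReactor = *p2p.NewBaseReactor("ConfigSync", r)
	return r
}

// GetChannels registers the new channel with the p2p switch.
func (r *Reactor) GetChannels() []*p2p.ChannelDescriptor {
	return []*p2p.ChannelDescriptor{{
		ID:                  ConfigChannel,
		Priority:            1,
		SendQueueCapacity:   10,
		RecvMessageCapacity: 1024,
	}}
}

// Receive is where the config-related messages would actually be processed;
// the p2p layer only delivers the raw bytes to the reactor.
func (r *Reactor) Receive(chID byte, src p2p.Peer, msgBytes []byte) {
	// Decode msgBytes and record/compare the peer's reported config here.
	r.Logger.Info("received config message", "peer", src.ID(), "len", len(msgBytes))
}
```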
-
While I like the idea of reaching "consensus" over network configuration, it also increases the amount of data sent between peers and potentially poses security risks (another DoS attack vector?). An alternative idea: a CLI sub-command that checks the node's config against a recommended config.
The recommended config could come from the network's forum, GitHub, one of the validators' websites, or any semi-trusted source. How often do configs change? If that never happens, or happens very rarely, then the CLI command might be a good fit (a rough sketch of such a sub-command is included below). Wdyt?
Pros:
Cons:
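To make the idea concrete, here is a rough, standalone sketch of what such a check could look like. The `check-config` name is hypothetical, it only handles TOML, and it uses the `BurntSushi/toml` library for simplicity; a real version would presumably be a `cometbft` sub-command reusing its own config types:

```go
// check-config: compare a node's TOML config against a recommended one.
package main

import (
	"fmt"
	"os"
	"reflect"

	"github.com/BurntSushi/toml"
)

// flatten turns nested TOML tables into dotted keys, e.g. "consensus.timeout_commit".
func flatten(prefix string, in map[string]interface{}, out map[string]interface{}) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if sub, ok := v.(map[string]interface{}); ok {
			flatten(key, sub, out)
		} else {
			out[key] = v
		}
	}
}

// load reads a TOML file and returns its contents as a flat key/value map.
func load(path string) (map[string]interface{}, error) {
	raw := map[string]interface{}{}
	if _, err := toml.DecodeFile(path, &raw); err != nil {
		return nil, err
	}
	flat := map[string]interface{}{}
	flatten("", raw, flat)
	return flat, nil
}

// Usage: check-config <node-config.toml> <recommended-config.toml>
func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: check-config <node-config.toml> <recommended-config.toml>")
		os.Exit(1)
	}
	node, err := load(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "node config:", err)
		os.Exit(1)
	}
	rec, err := load(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, "recommended config:", err)
		os.Exit(1)
	}
	diffs := 0
	for k, want := range rec {
		if got, ok := node[k]; !ok || !reflect.DeepEqual(got, want) {
			fmt.Printf("mismatch: %s: have %v, recommended %v\n", k, got, want)
			diffs++
		}
	}
	if diffs == 0 {
		fmt.Println("config matches the recommended settings")
	}
}
```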
-
I think the CLI idea is a good one. Still, the main objective of the original idea is to help with troubleshooting when a problem arises in a real network and we suspect an inconsistency between nodes' configs. Currently it is very difficult (or impossible) to obtain evidence of such a misconfiguration in the field. So I'd say both ideas are valid and tackle different aspects of the same problem: the original one would help with live troubleshooting, while @melekes's would help cautious operators avoid the problem in the first place.
-
As part of their responsibilities, the CometBFT team has participated in troubleshooting a number of problems in chains (testnet and production).
Over the last months, many of those problems were root-caused to inconsistent configuration among full nodes, including validators (we've had both `config.toml` and `app.toml` inconsistencies).
Leaving aside the discussion on whether a given parameter should live in `config.toml` or be a Consensus Param, this discussion describes an idea that can help DevRels more easily detect inconsistent configs when troubleshooting chain issues. Here is a concise description of the idea (refinements welcome!):
One advantage of this idea is that it would be done completely out-of-chain: we'd just need:
As a result, the upgrade path for existing chains should be very easy (I think we could even support an uncoordinated upgrade, but I'm not 100% sure).
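Purely as a hypothetical illustration of what an out-of-chain check could boil down to (this is not necessarily the mechanism proposed here, and every name below is made up): nodes could expose a deterministic fingerprint of the config fields that must match, so a DevRel can compare short strings across nodes instead of collecting whole config files.

```go
// Package configcheck sketches a deterministic config fingerprint.
// Illustrative only; all names are hypothetical.
package configcheck

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Fingerprint hashes a canonical "key=value" rendering of the selected
// config fields. Two nodes with identical settings produce identical
// fingerprints, so a single string comparison spots the odd node out.
func Fingerprint(fields map[string]string) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys) // canonical order, so the hash is deterministic
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s\n", k, fields[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}
```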