Clarify the operation of a recovering node · Issue #493 · cometbft/cometbft · GitHub

Clarify the operation of a recovering node #493


Open
2 tasks
cason opened this issue Mar 10, 2023 · 0 comments
Labels
spec Specification-related

Comments

@cason
Contributor
cason commented Mar 10, 2023

CometBFT considers the crash-recovery failure model, meaning that nodes may crash and then recover, rejoining the distributed computation in a consistent state. For this to happen, nodes should persist relevant information and state changes during their regular operation, so that during recovery they are able to restore the state they had just before crashing.

Recovering the state of a node after a crash is a tricky operation. Several modules of CometBFT persist information that they are expected to recover after a crash. The consensus protocol keeps a Write-Ahead Log (WAL) to persist crucial information. The block store, the state store, the evidence reactor, the transaction indexer, and the address book persist data to their own DBs. And the application itself should adhere to the crash-recovery failure model, implementing a persistence strategy.

Among these modules, the recovery procedure for ABCI applications is probably the best documented. The consensus WAL is covered only superficially, while the other DBs are essentially undocumented. In any case, the assumptions regarding the persisted state and its recovery are not documented.

It is worth noting that when the state persistence is delegated to a database, the recovery procedure tends to be straightforward, as it is provided by the database implementation. As far as I know, consensus is the only module that adopts transactional semantics for persisted data, based on a WAL. The recovery of the consensus WAL is particularly tricky and undocumented.
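To make the write-ahead idea concrete, here is a minimal, generic sketch in Go. It is not CometBFT's consensus WAL: the record strings, the file layout, and the function names (appendRecord, replay, runDemo) are all hypothetical and chosen only for illustration. It assumes newline-delimited records, syncs each record to disk before the caller acts on it, and on recovery discards a trailing record that was only partially written when the crash happened. A real WAL would typically also protect records with lengths and checksums, which this sketch omits.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// appendRecord writes one newline-terminated record and forces it to stable
// storage before returning. The caller should act on the record only after
// this function succeeds (the "write-ahead" rule).
func appendRecord(f *os.File, rec string) error {
	if _, err := f.WriteString(rec + "\n"); err != nil {
		return err
	}
	return f.Sync()
}

// replay returns every complete record found in the log at path.
// A trailing fragment without a newline is treated as a torn write
// left behind by a crash and is dropped.
func replay(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, nil // no log yet: nothing to recover
	}
	if err != nil {
		return nil, err
	}
	parts := strings.Split(string(data), "\n")
	// The last element is either "" (log ends with a newline) or a torn
	// fragment; neither is a complete record.
	return parts[:len(parts)-1], nil
}

// runDemo logs two records, simulates a crash in the middle of a third
// write, and then recovers from the log. Record contents are hypothetical.
func runDemo() ([]string, error) {
	dir, err := os.MkdirTemp("", "wal-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "wal.log")
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return nil, err
	}
	for _, rec := range []string{"height=1 step=propose", "height=1 step=prevote"} {
		if err := appendRecord(f, rec); err != nil {
			f.Close()
			return nil, err
		}
	}
	// Simulated crash: a partial record reaches the file without its newline.
	if _, err := f.WriteString("height=1 step=prec"); err != nil {
		f.Close()
		return nil, err
	}
	if err := f.Close(); err != nil {
		return nil, err
	}
	return replay(path)
}

func main() {
	recs, err := runDemo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, r := range recs {
		fmt.Println("recovered:", r)
	}
}
```

The point of the sketch is the ordering: a record is durable before the node acts on it, so after a restart the node can re-derive the state it had just before crashing by reading the log from the start. Which records the consensus WAL actually holds, and how replay interacts with the ABCI application, is exactly what this issue asks to document.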

Definition of Done:

  • List all databases adopted by CometBFT modules, summarize the persistence assumptions, and document, where applicable, the relevant aspects of the recovery procedures.
  • Document the consensus Write-Ahead Log and the operation of the consensus protocol during recovery. This should include the interaction between consensus and the ABCI application, covered only on the application side in the existing documentation.
@cason moved this to Todo in CometBFT 2023 on Mar 10, 2023
@cason added the documentation and spec labels and removed the documentation label on Mar 10, 2023
Status: Todo