Clarify the operation of a recovering node #493

cason · 2023-03-10T09:40:17Z

CometBFT considers the crash-recovery failure model, meaning that nodes may crash and then recover, rejoining the distributed computation in a consistent state. For this to happen, nodes should persist relevant information and state changes during their regular operation, so that during recovery they are able to restore the state they had just before crashing.

Recovering the state of a node after a crash is a tricky operation. Several modules of CometBFT persist information that they are expected to recover after a crash. The consensus protocol keeps a Write-Ahead Log (WAL) to persist crucial information. The block store, the state store, the evidence reactor, the transaction indexer, and the address book persist data to their own DBs. And the application itself should adhere to the crash-recovery failure model, implementing a persistence strategy.

Among the mentioned modules, the recovery procedure for ABCI applications is probably the best documented. The consensus WAL is covered only superficially, while the other DBs are essentially not documented. In any case, the assumptions regarding the persisted state and its recovery are not documented.

It is worth noting that when the state persistence is delegated to a database, the recovery procedure tends to be straightforward, as it is provided by the database implementation. As far as I know, consensus is the only module that adopts transactional semantics for persisted data, based on a WAL. The recovery of the consensus WAL is particularly tricky and undocumented.

Definition of Done:

List all databases adopted by CometBFT modules, summarize the persistence assumptions, and document, where applicable, the relevant aspects of the recovery procedures.
Document the consensus Write-Ahead Log and the operation of the consensus protocol during recovery. This should include the interaction between consensus and the ABCI application, covered only on the application side in the existing documentation.

cason added this to CometBFT 2023 Mar 10, 2023

cason moved this to Todo in CometBFT 2023 Mar 10, 2023

cason added documentation, spec and removed documentation labels Mar 10, 2023

cason mentioned this issue Mar 10, 2023

spec:abci2.0 - clarify crash recovery mechanism #469

Merged

cason mentioned this issue Sep 7, 2023

Option to reduce cs.wal max size #1233

Open

cason mentioned this issue Jan 15, 2024

perf(internal/state): avoid double-saving FinalizeBlockResponse #2017

Merged

3 tasks
