8000 perf: De-duplicate writes in SaveABCIResponses / SaveFinalizeBlockResponse · Issue #1976 · cometbft/cometbft · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

perf: De-duplicate writes in SaveABCIResponses / SaveFinalizeBlockResponse #1976

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ValarDragon opened this issue Jan 6, 2024 · 2 comments · Fixed by #2017
Closed

perf: De-duplicate writes in SaveABCIResponses / SaveFinalizeBlockResponse #1976

ValarDragon opened this issue Jan 6, 2024 · 2 comments · Fixed by #2017
Assignees
Labels
block-sync enhancement New feature or request
Milestone

Comments

@ValarDragon
Copy link
Contributor
ValarDragon commented Jan 6, 2024

Feature Request

SaveABCIResponses is called on every block during consensus execution + block sync, and is blocking consensus from proceeding.

We see that for nodes which want to serve the block_results query, we pay a 2x write amplification over the data we want to serve.

The 2x write amplification in SaveABCIResponses can get extremely expensive for nodes, when there are many events. This occurs during a current problematic mainnet scenario detailed below, that currently affects all cosmos chains. Would be great if we can get a reduction at the comet layer soon for its part of the slowdown in this process!

Summary

Over a 100 block sync, where only a fraction of blocks have spam txs as detailed below, we see SaveABCIResponses take a large amount of time. (10 seconds! The state machine exec + commit time for this is 55 seconds. For a more "normal" 100 block range, we see this take .23 seconds)
image

Problem Definition / Context

When people send IBC packets with large amounts of data (400kb), nodes begin falling behind. (This is the "bananaking" attacks)
The biggest culprit of problems for nodes falling behind comes from SaveABCIResponse where we suffer a 2x write overhead: https://github.com/cometbft/cometbft/blob/main/internal/state/store.go#L561-L601 . We see nodes having to write an extra 100mb/s when these txs come in. We also see 10+ of these getting into a block. (Which matches the expected 100MB/s overhead being measured)

During this time, we observe all RPC nodes have a very hard time keeping up, network block times slow down, and validators who do not run DiscardABCIResponses miss blocks.

There are other issues that go into having problems here, but this particular comet location definitely seems like a strong mitigation here going forward. The situation is that this msg, if x bytes large, has ABCIResponses of size 13x larger than x. This lead to a 26x write overhead in Comet here. CPU overheads in the SDK + IBC + comet call stack are present for this pattern. The gas per byte is already underpriced, but this makes it more costly. CometBFT block gossip also gets quite expensive on big blocks. (Talking to other teams to reduce the ABCI response size overhead + CPU times)

Without knowing this can become a system bottleneck, the reason for the 2x write overhead is quite reasonable imo!

Proposal

Notice that in https://github.com/cometbft/cometbft/blob/main/internal/state/store.go#L561-L601 we are setting the same data twice.

First we set at key block_results/{HEIGHT} value {data}. Then we SetSync at key latest_block_results value {data, height}, and we overwrite the latter every block.

I think we should instead be able to, at better performance, just only set SetSync at key block_results/{HEIGHT} , and then:

  • SetSync a key that points to what is the latest block result height
  • If "DiscardABCIResponses" (or some equivalent) flag is set, start a new go routine to delete the value at key block_results/{HEIGHT - 1}

Please let me know thoughts! I think this would really alleviate this attack vector thats being demonstrated in prod right now.

8000 @ValarDragon ValarDragon added enhancement New feature or request needs-triage This issue/PR has not yet been triaged by the team. labels Jan 6, 2024
@faddat
Copy link
Contributor
faddat commented Jan 6, 2024

will give it a go

faddat added a commit to faddat/cometbft that referenced this issue Jan 6, 2024
@melekes melekes added block-sync and removed needs-triage This issue/PR has not yet been triaged by the team. labels Jan 9, 2024
@melekes
Copy link
Contributor
melekes commented Jan 9, 2024

Thanks @ValarDragon ❤️ for putting in the effort.

@melekes melekes self-assigned this Jan 11, 2024
@melekes melekes added this to CometBFT Jan 11, 2024
@github-project-automation github-project-automation bot moved this to Todo in CometBFT Jan 11, 2024
melekes added a commit that referenced this issue Jan 11, 2024
@adizere adizere added this to the 2024-Q1 milestone Jan 12, 2024
melekes added a commit that referenced this issue Jan 29, 2024
github-merge-queue bot pushed a commit that referenced this issue Feb 21, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in CometBFT Feb 21, 2024
mergify bot pushed a commit that referenced this issue Feb 21, 2024
for performance reasons.
Closes #1976

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments

(cherry picked from commit f2c7466)

# Conflicts:
#	internal/state/store.go
mergify bot pushed a commit that referenced this issue Feb 21, 2024
for performance reasons.
Closes #1976

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments

(cherry picked from commit f2c7466)

# Conflicts:
#	state/store.go
#	state/store_test.go
greg-szabo pushed a commit that referenced this issue Feb 22, 2024
for performance reasons.
Closes #1976

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
melekes added a commit that referenced this issue Feb 22, 2024
for performance reasons.
Closes #1976

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
block-sync enhancement New feature or request
Projects
No open projects
Status: Done
Status: Todo
4 participants
0