Add lru cache for loading commit from blockstore #2844
Closed · ValarDragon opened this issue on Apr 18, 2024 · 2 comments · Fixed by #3003 or osmosis-labs/cometbft#76
Labels: consensus, enhancement, needs-triage, storage
Comments
Definitely. This LRU cache can be implemented at the store, in my opinion, so no need for code changes in the reactor.
github-merge-queue bot pushed a commit that referenced this issue on May 15, 2024:
perf(blockstore): Add LRU caches to blockstore operations used in consensus (#3003)

Closes #2844.

We are seeing that the blockstore loading operations get used in hot loops within gossip routines and queryMaj23 routines. This PR reduces that overhead using an LRU cache. The LRU cache does take a mutex on every get, but the code with the LRU cache is 9x faster than without (before even adding in DB overheads), due to the proto unmarshalling saved. We could imagine a setup that avoided a lock there entirely. I don't think that is worth it right now, since the new code is 9x faster and these loads mostly appear in catchup code, which should not be highly contended across peers at the same time.

With the new benchmark I added:

OLD:
```
BenchmarkRepeatedLoadSeenCommit-12    24447    54691 ns/op    46495 B/op    319 allocs/op
```

NEW:
```
BenchmarkRepeatedLoadSeenCommit-12    224131    6401 ns/op    8320 B/op    2 allocs/op
```

It turns out these gossip routines don't need mutative copies, so we could optimize out the large allocation in the future if we want.

One-hour CPU profile showing this appearing in production:



For context, the state machine execution time here is 92 seconds, so this is adding up in system load, and in GC: the GC load is 52GB out of the 200GB for the entire trace, with other parts being optimized down by recent PRs.

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [ ] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

Co-authored-by: Daniel <daniel.cason@informal.systems>
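The pattern is small enough to sketch. Below is an illustrative, self-contained version of the idea only, using hashicorp's golang-lru/v2 as the cache (a common choice, not necessarily the library the PR uses); the real cometbft store, `Commit` type, DB layer, and decode path all differ, and `decodeCommit` here stands in for the proto unmarshal:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru/v2"
)

// Commit is a stand-in for cometbft's types.Commit.
type Commit struct {
	Height int64
}

// BlockStore fronts a raw byte store with a bounded, thread-safe LRU of
// already-decoded commits, so hot paths skip both the DB read and the
// proto unmarshal on repeated loads of the same height.
type BlockStore struct {
	db          map[int64][]byte           // stand-in for the on-disk blockstore DB
	commitCache *lru.Cache[int64, *Commit] // the library takes a mutex on every Get
}

func NewBlockStore() *BlockStore {
	cache, _ := lru.New[int64, *Commit](1024) // errors only on a non-positive size
	return &BlockStore{db: make(map[int64][]byte), commitCache: cache}
}

// LoadCommit returns the decoded commit for height, consulting the cache first.
func (bs *BlockStore) LoadCommit(height int64) *Commit {
	if c, ok := bs.commitCache.Get(height); ok {
		// Callers must treat this as read-only: it is a shared pointer, which
		// is fine for the gossip routines since they need no mutative copies.
		return c
	}
	bz, ok := bs.db[height]
	if !ok {
		return nil
	}
	c := decodeCommit(bz)
	bs.commitCache.Add(height, c)
	return c
}

// decodeCommit stands in for the proto.Unmarshal the real store performs.
func decodeCommit(bz []byte) *Commit {
	return &Commit{Height: int64(bz[0])}
}

func main() {
	bs := NewBlockStore()
	bs.db[5] = []byte{5}
	fmt.Println(bs.LoadCommit(5)) // DB read + decode, then cached
	fmt.Println(bs.LoadCommit(5)) // cache hit: no read, no decode
}
```

The second `LoadCommit` call never touches the DB or the decoder, which is exactly the work the benchmark numbers above show being saved.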
mergify bot pushed a commit that referenced this issue on May 15, 2024:
Same change as above, cherry picked from commit 46e2484.
mergify bot pushed a commit that referenced this issue on May 15, 2024:
Same change as above, cherry picked from commit 46e2484; this backport had conflicts in `.golangci.yml`, `go.mod`, `go.sum`, `store/store.go`, `store/store_test.go`, and `types/block.go`.
mergify bot pushed a commit that referenced this issue on May 15, 2024:
Same change as above, cherry picked from commit 46e2484; conflicts in `.golangci.yml`, `go.mod`, `go.sum`, and `store/store.go`.
melekes pushed a commit that referenced this issue on May 15, 2024:
Automatic Mergify backport of #3003, merged as #3081; same description as above.
melekes added a commit that referenced this issue on May 15, 2024:
Automatic Mergify backport of #3003, merged as #3083; same description as above.
melekes added a commit that referenced this issue on May 15, 2024:
Automatic Mergify backport of #3003, merged as #3082; same description as above.
ValarDragon added a commit to osmosis-labs/cometbft that referenced this issue on May 25, 2024:
Backport of cometbft#3003 (cometbft#3083); same description as above.
mergify bot added a commit to osmosis-labs/cometbft that referenced this issue on May 28, 2024:
Same backport as above, cherry picked from commit 0c10bd5.
PaddyMc added a commit to osmosis-labs/cometbft that referenced this issue on May 28, 2024:
perf(blockstore): Add LRU caches to blockstore operations used in consensus (backport #76) (#81): same description as above, cherry picked from commits 0c10bd5 and 4594f29, plus a changelog entry; conflicts in `CHANGELOG.md`.
Feature Request
Summary
Whenever a peer is lagging by more than one block, we load the commit for the height they are on and send them the whole commit, per the code comment (admittedly I haven't looked long, but I don't see how it currently sends the whole commit rather than just a single vote from it). (source)
Loading the commit requires a DB read and a proto unmarshal (source). In benchmarks we see the proto unmarshal accounting for a large number of heap allocations, and this is likely duplicate work: we keep unmarshalling the same block commit.
We should keep a small LRU cache mapping height to unmarshalled commit to reduce the memory pressure here. I suspect the same may be helpful for block parts as well.
In v0.37 we saw that this was around 70GB of heap allocation on a query-serving node over 1-2 hours. (Will paste the memprofile later.) I do not know the total system impact, but 70GB of allocation is very high.
Proposal
Add a simple LRU cache in the blockstore for loading commits and block parts; a hedged benchmark sketch of the hot path follows.
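To make the expected win concrete, here is a sketch in the spirit of the `BenchmarkRepeatedLoadSeenCommit` that PR #3003 later added. It reuses the illustrative `BlockStore` from the sketch earlier in this thread rather than cometbft's real store, and would live in a `_test.go` file in the same package:

```go
package main

import "testing"

func BenchmarkRepeatedLoadCommit(b *testing.B) {
	bs := NewBlockStore()
	bs.db[10] = []byte{10}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// With the cache, every iteration after the first is a cache hit
		// (a lookup behind the LRU's internal mutex). Without it, each
		// iteration would re-read and re-decode the commit, which is where
		// the 319 allocs/op in the upstream numbers came from.
		if bs.LoadCommit(10) == nil {
			b.Fatal("missing commit")
		}
	}
}
```

Run with `go test -bench=RepeatedLoadCommit -benchmem`; with the cache in place, allocations per op should collapse, mirroring the 319 to 2 allocs/op drop reported upstream.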