rework secondary index generation by jeffwashington · Pull Request #959 · anza-xyz/agave · GitHub

rework secondary index generation #959


Merged
merged 1 commit into anza-xyz:master on Apr 22, 2024

Conversation

jeffwashington

Problem

We are going to stop mmapping storage files. Once we do, we can no longer iterate a storage and return StoredAccountMeta for many accounts at a time. The current loop for secondary index generation doesn't easily support this rework.

Summary of Changes

Always generate the primary index the same way, using scan_index. Then, if we're building secondary indexes, update the secondary index using scan_accounts. By far the largest cost of secondary indexes is the generation step; we were already calling this update once per account. We do scan the entire storage a second time, but performance is dominated by secondary index generation, so the extra scan should be negligible.

Fixes #

@jeffwashington jeffwashington marked this pull request as ready for review April 22, 2024 14:54
@brooksprumo brooksprumo left a comment


The code looks good, and the intuition about perf impacts makes sense too. If this were not RPC-specific, we'd want before and after timing numbers to compare. It would be nice to have that comparison here too.

I also understand that the secondary index generation must change to accommodate no-mmaps. Does the new scan_accounts() not give enough info to create the items that are needed by insert_new_if_missing_into_primary_index()?

@jeffwashington
Author

Does the new scan_accounts() not give enough info to create the items that are needed by insert_new_if_missing_into_primary_index()?

insert_new_if_missing_into_primary_index takes an iterator of items: impl Iterator<Item = (Pubkey, T)>.
We could scan using scan_accounts once and get the index info too, but that would reintroduce the code duplication this PR gets rid of. In the new hot storage format, scanning for the index requires less I/O than scanning full accounts.
The choices are basically duplicate code paths for insertion into the primary index, or this solution.
What do you think?
I think the perf number changes will be irrelevant.

@brooksprumo brooksprumo left a comment


The choices are basically duplicate code paths for insertion into primary index or this solution. What do you think? I think the perf number changes will be irrelevant.

I definitely prefer this PR since it removes the code duplication. I'd prefer to have perf numbers to feel more comfortable. I do trust your judgements here too, so :shipit:

@jeffwashington jeffwashington merged commit 7389371 into anza-xyz:master Apr 22, 2024
@jeffwashington
Author

@brooksprumo I ran ledger tool on a mnb validator. I ran 3 things:

  1. master
  2. an additional scan_accounts call per storage, just like this PR would do
  3. replacing the scan_index call with a scan_accounts call

#1 is fastest but varies greatly. #2 and #3 take about the same time and are generally slower than even the worst #1 run. At worst, #2 and #3 are 12% slower than the fastest #1 and 6% slower than the average #1 (over 15 tries). This indicates that scan_accounts is slower than scan_index, which we'd expect, and it will be worse with hot storage. Note this is with no secondary indexing occurring; that is the very slow part.

Conclusion: this second iteration of accounts is insignificant.
