index: make startup more efficient by furszy · Pull Request #27607 · bitcoin/bitcoin · GitHub

index: make startup more efficient #27607


Merged
merged 9 commits into bitcoin:master from furszy:2023_index_decouple_has_data_checks on Jul 10, 2023

Conversation

furszy
Member
@furszy furszy commented May 9, 2023

Simplifies index startup code, eliminating the g_indexes_ready_to_sync variable,
deduplicating code and moving the prune violation check out of the BaseIndex class.

Also makes startup more efficient by running the prune violation check once for all indexes
instead of once per index, and by delaying the prune violation check and moving it off
the main thread, so the node can start up faster and perform the block data availability
verification even when the "-reindex" or "-reindex-chainstate" flags are enabled (which
hasn't been possible so far).
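The saving from the "once for all indexes instead of once per index" change can be sketched in miniature (function names here are illustrative, not Bitcoin Core's actual API): instead of walking the chain once per index, take the lowest best-block height across all indexes and walk the chain a single time from there. Both toy functions return the number of blocks visited, to show the duplicated work that disappears.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Old scheme: one chain walk per index, each from that index's best block.
int BlocksVisitedPerIndex(int tip_height, const std::vector<int>& best_heights)
{
    int visited = 0;
    for (int h : best_heights) {
        for (int i = h; i <= tip_height; ++i) ++visited; // one walk per index
    }
    return visited;
}

// New scheme: a single walk starting from the oldest index best block.
int BlocksVisitedOnce(int tip_height, const std::vector<int>& best_heights)
{
    int start = tip_height;
    for (int h : best_heights) start = std::min(start, h); // oldest best block
    int visited = 0;
    for (int i = start; i <= tip_height; ++i) ++visited;   // single shared walk
    return visited;
}
```

With three unsynced indexes on a 100-block chain, the old scheme visits roughly three times as many blocks as the new one.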

@DrahtBot
Contributor
DrahtBot commented May 9, 2023
edited

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK ryanofsky, TheCharlatan
Stale ACK mzumsande

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #28053 (refactor: Move stopafterblockimport option out of blockstorage by TheCharlatan)
  • #28051 (Get rid of shutdown.cpp/shutdown.h, use SignalInterrupt directly by ryanofsky)
  • #27850 (test: Add unit & functional test coverage for blockstore by pinheadmz)
  • #26762 (bugfix: Make CCheckQueue RAII-styled by hebasto)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Contributor
@ryanofsky ryanofsky left a comment

Reviewed 86752e0 and it mostly looks good, but it does have a null pointer dereference currently (see comments and suggested fix). I like this PR because it will simplify #24230 and could potentially improve performance. I'm not actually sure how much it would improve performance in practice though, so I'm curious whether that was the original motivation here or whether this is related to one of your other changes. I'm also curious about the todo:

Pending todo: Fix remaining test by using the block index map instead of the active chain. The active chain is not available at the point where the indexers are started.

Unsure if this is just referring to the null pointer problem, or to something else

@furszy
Member Author
furszy commented May 11, 2023

Thanks for the deep review @ryanofsky!

Reviewed 86752e0 and it mostly looks good, but it does have a null pointer dereference currently (see comments and suggested fix). I like this PR because it will simplify #24230 and could potentially improve performance. I'm not actually sure how much it would improve performance in practice though, so I'm curious whether that was the original motivation here or whether this is related to one of your other changes.

Yeah, the motivation came from a mix of thoughts actually.

I was reviewing #24230 and wasn't completely sold on the hasDataFromTipDown() call inside the attachChain function. It seems odd to call a chain data verification function from an event registering method.
Then I went deeper into the function, noticed the work duplication, and took that as an argument to simplify #24230 a bit.

We could also move the entire hasDataFromTipDown from the interface to somewhere else this way. So there is no CBlockIndex dependency in the interface, neither here nor in the #24230 intermediate commits.

Plus, it plays well with #25193.

And.. I'm also thinking that after #24230 and the parallel sync work (#26966), we could have one initial sync thread with a worker pool for all the indexers instead of the current per-indexer sync thread.
That should speed up the sync process quite a lot. We are currently reading the entire chain, block by block, from disk on every index thread. Instead, we could read blocks just once, then dispatch events to the indexers, making indexers purely listeners with no associated thread.
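The "read blocks once, dispatch to every index" idea can be sketched roughly as follows (all names are hypothetical, not Bitcoin Core's API): a single sync driver reads each block from disk one time and fans it out to every registered listener, instead of each per-index thread re-reading the entire chain on its own.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Block { int height{0}; }; // stand-in for a block read from disk

using BlockListener = std::function<void(const Block&)>;

// One driver walks the chain a single time and notifies every index,
// so indexers become pure listeners with no sync thread of their own.
void SyncAllIndexes(const std::vector<Block>& chain,
                    const std::vector<BlockListener>& listeners)
{
    for (const Block& block : chain) {                       // one disk read...
        for (const auto& notify : listeners) notify(block);  // ...many listeners
    }
}
```

In this toy model, adding a second index costs only an extra callback per block rather than an extra full chain read.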

I'm also curious about the todo:

Pending todo: Fix remaining test by using the block index map instead of the active chain. The active chain is not available at the point where the indexers are started.

Unsure if this is just referring to the null pointer problem, or to something else

Yeah ok. That comment is inaccurate and I forgot to update it.

I initially thought that the issue was due to an initialization ordering problem, that the active chain wasn't activated at that point (I thought that was done only after the load-blk thread completes).
So I implemented 4738a1d and 837acfd instead of the current version. That is better than what we have here currently, but it's not the solution for the null pointer dereference.

The issue is that empty indexers don't set the best block to the genesis CBlockIndex; they just leave it null. So the index summary returns a valid height=0 with an empty block hash, which crashes on the block hash assertion.

So, the fix is easy. But I haven't done it because I found another possible index sync "fatal error" in master and wanted to fix it before continuing to move forward here.

Essentially, we are not checking whether the node has the post-fork-point blocks on disk prior to starting the index:

The pruning violation check verifies that we have blocks from the fork point up to the active chain tip. But it does not check whether the node has blocks from the fork point up to the forked chain's tip, which are required by coinstatsindex to reverse its state during the reorg.

E.g.

Active chain
A -> B -> C -> D -> E

Index chain
A -> B -> C -> G -> F -> H

The failure will happen when G or F are not on disk or were pruned.

So, if this happens, it causes a "fatal error" during the coinstatsindex reorg process due to the impossibility of reading the block from disk.
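The gap can be shown with a toy model (block "hashes" are single chars here, and all names are illustrative): the existing pruning check walks the active chain from the fork point to the tip, but a reorging index also needs the blocks on its own stale branch to undo its state.

```cpp
#include <cassert>
#include <set>
#include <vector>

// True only if every listed block is still available on disk.
bool HaveAllBlocks(const std::vector<char>& blocks, const std::set<char>& on_disk)
{
    for (char b : blocks) {
        if (on_disk.count(b) == 0) return false; // block missing or pruned
    }
    return true;
}
```

With the chains from the example above and G, F pruned, the active-chain check from the fork point passes while the index's own branch fails, which is exactly the unreported case.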

@ryanofsky
Contributor

re: #27607 (comment)

We could also move the entire hasDataFromTipDown from the interface to somewhere else in this way. So there is no CBlockIndex dependency in the interface neither here nor in the #24230 intermediate commits.

Yes, I don't think it makes sense to expose that method as part of the Chain interface, and even if it did make sense, it wouldn't make sense to put the implementation in the ChainImpl class, because that class is mostly meant to hold glue code, not complicated functions. That's the reason for the suggestion to make it a standalone function bool ChainDataFromTipDown(ChainstateManager& chainman, const CBlockIndex& start_block) in #27607 (comment)

And.. I'm also thinking that after #24230 and the parallel sync work (#26966), we could have one initial sync thread with a workers pool for all the indexers instead of the current per indexer sync thread. Which should speedup the sync process quite a lot. We are currently reading the entire chain block by block from disk on every index thread. So instead, we could read blocks just once, then dispatch events to the indexers. Making indexers purely listeners with no associated thread.

Yes, letting indexes just receive notifications to get in sync, and not have to implement sync logic, is the goal of #24230. And the threading issue should be orthogonal after that PR. Indexes (and wallets) could read and process blocks in a single thread, or read blocks in a single thread and process them in parallel, or read and process blocks in parallel as happens currently, or use some other form of scheduling. But regardless of what ordering is used, the interface an individual index uses to get in sync should not have to change, and indexes shouldn't have to create sync threads or deal with race conditions between the notification threads and sync threads.

I initially thought that the issue was due an initialization ordering, that we weren’t having the active chain activated at that point (thought that was done only post load-blk thread completion).

Oh, I see. That's also what I concluded from seeing the failure, but I guess it is not the complete picture.

The issue is that empty indexers don’t set the best block to the genesis CBlockIndex, they just leave it null. So, the index summary returns a valid height=0 with an empty block hash.. which crashes on the block hash assertion.

So, the fix is easy.

I didn't know about this. I guess the fix would be to treat the genesis as the fork point in this case, and to consider the index already synced if there is no chainstate and the genesis pointer is null.

But.. I haven’t done it because have found another possible index sync "fatal error" in master and wanted to fix it prior continue moving forward here.

It seems like it would be a good thing to fix this. But this sounds like something that was already broken, so I'm not sure the fix has to be bundled here, necessarily. I think the "has data from tip down" check is a useful check that can run early and warn if any block data is missing. But if the check isn't perfect and doesn't catch a missing block, it shouldn't be the worst thing, because the missing blocks will just be reported later rather than earlier.

Contributor
@mzumsande mzumsande left a comment

I can't see how the current approach could work, even if the problems discussed above were solved:

With this PR, the newly introduced function update_indexes_start_block is executed right after creating the index, but before executing Start(). But the index is basically just an empty object at this point, because its constructor doesn't interact with the database stored on disk at all. So calling GetSummary() can't possibly give any meaningful data about the best block at this stage. The load from disk is being done in Init()/CustomInit(), but that happens only later within Start().

I think the necessary order would be to

  1. create all indexes and read their best block / other data from disk
  2. determine the oldest block for all indexes
  3. Do the pruning check once

@furszy furszy force-pushed the 2023_index_decouple_has_data_checks branch from 86752e0 to 3af2301 Compare May 16, 2023 21:40
@ryanofsky
Contributor

I can't see how the current approach could work, even if the problems discussed above were solved:

Good catch noticing Init() was not called! It doesn't seem like it should be that hard to fix, though. The PR was already moving most of the code out of Init() anyway, so now a little more code needs to move. I didn't look very deeply, but I would probably make Init() a public method and call it after constructing the index. Also, stop calling Init() from Start() and move the RegisterValidationInterface() call from Start() to Init().
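The suggested restructuring has roughly this shape (heavily simplified, names illustrative; the real BaseIndex carries far more state): Init() becomes a public method called right after construction, so the best block is loaded and the summary is meaningful before Start(), which now only starts syncing.

```cpp
#include <cassert>

class ToyIndex {
    bool m_init{false};
    int m_best_height{-1}; // placeholder until Init() loads it "from disk"
public:
    void Init()
    {
        m_best_height = 0; // real code would read the index DB here,
        m_init = true;     // then register the validation interface
    }
    bool Start()
    {
        if (!m_init) return false; // Start() no longer calls Init() itself
        // real code would spawn the background sync thread here
        return true;
    }
    int SummaryHeight() const { return m_best_height; }
};
```

This ordering is what lets the single pruning check read every index's best block before any sync thread exists.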

@furszy
Member Author
furszy commented May 16, 2023

Thanks for the review @mzumsande.

Funny that I pushed a small update at the same time that you were commenting.

I have a few more changes in the pipeline that I will be pushing soon. E.g. the entire hasDataFromTipDown function can be written in two lines; I just need to re-purpose the GetFirstStoredBlock function a bit :).

I think the necessary order would be to

create all indexes and read their best block / other data from disk
determine the oldest block for all indexes
Do the pruning check once

Yeah.. I was thinking about the other issue and forgot that Init() isn't being called in the constructor.. :face_palm:. Will re-order it, thanks!

@ryanofsky
Contributor
ryanofsky commented May 17, 2023

Looking at the current version of the PR, 6dfec1b, it seems to have changed a lot, and the other thing it is doing now is delaying startup of indexes until blocks are loaded, so there is no longer a conflict with reindex-chainstate. So it is basically reimplementing #25193 in a different way that doesn't require a sleep_for(std::chrono::milliseconds(500)) wait loop. I think this approach is cleaner, but also that #25193 is a smaller, more targeted change with a test, so I would probably prefer to see #25193 merged first, with this cleanup and rewrite merged later.

#25193 also has two ACKs, so I can re-ack and merge it soon. EDIT: Never mind, just noticed it needs rebase currently

On this PR, I like the approach and think the code looks pretty good overall. The only thing I don't like is all the complexity added to AppInitMain(). I think that complexity could go away if you got rid of the std::set<BaseIndex*> indexers local variable there and added a std::vector<BaseIndex*> indexes member to NodeContext instead. Then there would be no need for the func_start_indexes lambda; it could be replaced with a regular StartIndexes function.

@furszy
Member Author
furszy commented May 17, 2023

Was about to send the updates comment hehe. I was too tired last night to write it. Thanks for the review ryanofsky!

Updates list:

  • Decoupled index Init() from Start(), so indexers can be initialized without spawning the sync thread.

  • Simplified the pruning violation code by re-purposing the GetFirstStoredBlock function. Now called IsBlockDataAvailable.

  • Fixed a small race where we set the index m_synced flag (which enables BlockConnected events) before calling the child class init function. So, for example, the block filter index could theoretically process a block before initializing the next filter position field and end up overwriting the first stored filter.
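The race in the last bullet can be sketched in a toy model (heavily simplified, names illustrative): if m_synced is flipped before the child class finishes its init, a BlockConnected notification can run against uninitialized state, e.g. the block filter index's next-filter-position field.

```cpp
#include <cassert>

class ToyFilterIndex {
    bool m_custom_init_done{false};
public:
    bool m_synced{false};                            // enables BlockConnected processing
    void CustomInit() { m_custom_init_done = true; } // sets the next filter position
    // Returns true only when the event is processed against valid state.
    bool BlockConnected() const
    {
        if (!m_synced) return false; // event ignored while still syncing
        return m_custom_init_done;   // false here models the overwrite bug
    }
};
```

The fix is simply ordering: run CustomInit() before setting m_synced, so no event can land in between.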


Feedback:

Looking at current version of the PR 6dfec1b it seems to have changed a lot, and the other thing it is doing now is delaying startup of indexes until blocks are loaded, so there is no longer a conflict with reindex-chainstate. So it is basically reimplementing #25193 in a different way that doesn't require a sleep_for(std::chrono::milliseconds(500)) waitloop. I think this approach is cleaner, but also that #25193 is a smaller more targeted change with a test, so I would probably prefer to see #25193 merged first and with this cleanup and rewrite merged later.

Yeah, I updated the PR description last night stating that this is now built on top of #25193 (but my slightly modified version of it #25193 (review)).

@mzumsande said that he was going to give them a look and probably take them. So, all good if they get squashed there or here; either way is fine for me. Happy to re-review #25193 whenever it is ready to go again.

On this PR, I like the approach and think the code looks pretty good overall. The only thing I don't like is all the complexity added to AppInitMain(). I think that complexity could go away if you got rid of the std::set<BaseIndex*> indexers local variable there and added a std::vector<BaseIndex*> indexes member to NodeContext instead. Then there would be no need for the func_start_indexes lambda; it could be replaced with a regular StartIndexes function.

Sounds great. I was also thinking about moving the lambda to a standalone function but couldn't find the right place and didn't want to create a new file only for this.

@furszy furszy force-pushed the 2023_index_decouple_has_data_checks branch 2 times, most recently from d03d6dd to e38d09a Compare May 17, 2023 15:49
@furszy furszy force-pushed the 2023_index_decouple_has_data_checks branch from e38d09a to e5b418f Compare May 17, 2023 19:16
@furszy
Member Author
furszy commented May 17, 2023

Rebased post #25193 merge. Conflicts solved.

The only change from the last push is in the first commit, ca30419, where the index threads' active wait and the global flag are replaced by a postponed indexes start call.

@furszy furszy changed the title init: verify blocks data existence only once for all the indexers index: improve initialization and pruning violation checks May 17, 2023
@furszy furszy changed the title index: improve initialization and pruning violation checks index: improve initialization and pruning violation check May 17, 2023
@furszy
Member Author
furszy commented Jul 7, 2023

Note: the branch tests are all passing locally. What is failing is the CI compiling it on top of master. Fixing it..

furszy and others added 4 commits July 7, 2023 19:31
The mempool load can take a while, and it is not
needed for the indexes' synchronization.

Also, having the mempool load function call
inside 'blockstorage.cpp' wasn't structurally
correct.
The thread does not only load blocks; it also loads the mempool and,
in a future commit, will start the indexes as well.

Also renamed the 'ThreadImport' function to 'ImportBlocks'
and the 'm_load_block' class member to 'm_thread_load'.

-BEGIN VERIFY SCRIPT-

sed -i "s/ThreadImport/ImportBlocks/g" $(git grep -l ThreadImport -- ':!/doc/')
sed -i "s/loadblk/initload/g" $(git grep -l loadblk -- ':!/doc/release-notes/')
sed -i "s/m_load_block/m_thread_load/g" $(git grep -l m_load_block)

-END VERIFY SCRIPT-
No behavior change.

The goal here is to group indexes, so we can perform the same
initialization and verification process equally for all of them.

The checks performed inside `StartIndexes` will be expanded
in the subsequent commits.
@furszy furszy force-pushed the 2023_index_decouple_has_data_checks branch from 988f3f6 to 4223d80 Compare July 7, 2023 22:37
Contributor
@ryanofsky ryanofsky left a comment

Code review ACK 988f3f6, just suggested StartIndexes simplification and comment changes since last review

Member Author
@furszy furszy left a comment

Rebased on master due to a silent conflict. Only had to adapt an AbortNode() line (which is now under the fatalError() alias).

Should be easy to re-check with a git range-diff 988f3f6...4223d80.

Contributor
@TheCharlatan TheCharlatan left a comment

Nice, ACK 4223d80

Nit can be ignored.

@DrahtBot DrahtBot requested a review from ryanofsky July 9, 2023 11:06
Member
@maflcko maflcko left a comment

Looks like the tests fail

furszy and others added 5 commits July 10, 2023 10:47
So indexes can be initialized without spawning
the sync thread.

This makes asynchronous index startup
possible in the following commits.
And transfer the responsibility of verifying whether 'start_block'
has data or not to the caller.

This is because the 'GetFirstStoredBlock' function's responsibility
is to return the first block containing data, and the current
implementation can return 'start_block' when it has no data, which
is misleading at the least.

Edge case behavior change:
Previously, if the block tip lacked data but all preceding blocks
contained data, there was no prune violation. And now, such
scenario will result in a prune violation.
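The edge case above can be captured in a toy version of the check (blocks modeled as bools, true = data on disk; only the function name comes from the commit, the rest is a sketch): the caller now verifies every block from 'start' to 'tip' inclusive, so a data-less tip is reported as a prune violation.

```cpp
#include <cassert>
#include <vector>

// Every block from 'start' to 'tip' (inclusive) must have data on disk.
bool CheckBlockDataAvailability(const std::vector<bool>& has_data, int start, int tip)
{
    for (int i = start; i <= tip; ++i) {
        if (!has_data[i]) return false; // missing data: prune violation
    }
    return true;
}
```

Under the previous behavior described above, a chain whose tip lacked data but whose preceding blocks all had data would not have been flagged.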
By generalizing 'GetFirstStoredBlock' and implementing
'CheckBlockDataAvailability' we can dedup code and
avoid repeating work when multiple indexes are enabled.
E.g. get the oldest block across all indexes and
perform the pruning violation check from that point
up to the tip only once (this feature is being introduced
in a follow-up commit).

This commit shouldn't change behavior in any way.

Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
By moving the 'StartIndexes()' call into the 'initload'
thread, we can remove the threads' active wait, making
better use of the available resources.

The only difference from the current state is that the
index threads will now be started only when they can
process work, and not before.
At present, during init, we traverse the chain (once per index)
to confirm that all necessary blocks to sync each index up to
the current tip are present.

To make the process more efficient, we can fetch the oldest block
from the indexers and perform the chain data existence check from
that point only once.

This also moves the pruning violation check to the end of the
'initload' thread, which is where the reindex, block loading and
chain activation processes happen.

This makes the node's startup process faster, allows us to remove
the global g_indexes_ready_to_sync flag, and enables the
execution of the pruning violation verification even when the
reindex or reindex-chainstate flags are enabled (which has been
skipped so far).
@furszy furszy force-pushed the 2023_index_decouple_has_data_checks branch from 4223d80 to ca91c24 Compare July 10, 2023 13:52
Member Author
@furszy furszy left a comment

Updated per feedback, thanks Marco and TheCharlatan.

Changes:

  • Fixed bug in one of the intermediate commits (per comment).
  • Inlined an if/else in one of the intermediate commits (per comment).

Contributor
@ryanofsky ryanofsky left a comment

Code review ACK ca91c24. Just rebase and suggested changes since last review (Start return check, and code simplification)

@DrahtBot DrahtBot requested a review from TheCharlatan July 10, 2023 14:28
Contributor
@TheCharlatan TheCharlatan left a comment

re-ACK ca91c24

@ryanofsky ryanofsky merged commit ef29d5d into bitcoin:master Jul 10, 2023
ryanofsky added a commit that referenced this pull request Jul 11, 2023
…storage

462390c refactor: Move stopafterblockimport handling out of blockstorage (TheCharlatan)

Pull request description:

  This has the benefit of moving this StartShutdown call out of the blockstorage file and thus out of the kernel's responsibility. The user can now decide if he wants to start shutdown / interrupt after a block import or not.

  This also simplifies #28048, making it one fewer shutdown call to handle.

ACKs for top commit:
  MarcoFalke:
    lgtm ACK 462390c  🗝
  ryanofsky:
    Code review ACK 462390c. Just has been rebased and is a simpler change after #27607

Tree-SHA512: 84e58256b1c61f10e7ec5ecf32916f40a2ab1ea7cce703de0fa1c61ee2be94bd45ed32718bc99903b6eff3e6d3d5b506470bf567ddbb444a58232913918e8ab8
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jul 12, 2023
@furszy furszy deleted the 2023_index_decouple_has_data_checks branch July 20, 2023 23:00
@furszy furszy mentioned this pull request Jan 4, 2024
@bitcoin bitcoin locked and limited conversation to collaborators Jul 19, 2024