8000 kernel: Remove shutdown globals from kernel library by TheCharlatan · Pull Request #27711 · bitcoin/bitcoin · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

kernel: Remove shutdown globals from kernel library #27711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

TheCharlatan
Copy link
Contributor
@TheCharlatan TheCharlatan commented May 21, 2023

This pull request is part of the libbitcoinkernel project #27587 The libbitcoinkernel Project and more specifically its "Step 2: Decouple most non-consensus code from libbitcoinkernel".


This removes the shutdown files from the kernel library. As a standalone library the libbitcoinkernel should not have to rely on code that accesses shutdown state through a global. Instead, replace it with an interrupt class that is owned by the kernel context. Additionally this moves the remaining usages of the uiInterface out of the kernel code. The process of moving the uiInterface out of the kernel was started in #27636.

The approach taken here uses the notification interface to externalize the existing shutdown code from the kernel library. The SignalInterrupt kernel context member is passed by reference to the ChainstateManager to allow for the termination of long-running kernel functions and sending interrupt signals.

@DrahtBot
Copy link
Contributor
DrahtBot commented May 21, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK theuni, hebasto

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #27823 (init: return error when block index is non-contiguous, fix feature_init.py file perturbation by mzumsande)
  • #27746 (Draft: rework validation logic for assumeutxo by sdaftuar)
  • #27708 (Return EXIT_FAILURE on post-init fatal errors by furszy)
  • #27607 (index: improve initialization and pruning violation check by furszy)
  • #27596 (assumeutxo (2) by jamesob)
  • #27576 (kernel: Remove args, settings, chainparams, chainparamsbase from kernel library by TheCharlatan)
  • #26966 (index: blockfilter initial sync speedup, parallelize process by furszy)
  • #26762 (refactor: Make CCheckQueue RAII-styled by hebasto)
  • #26045 (rpc: Optimize serialization disk space of dumptxoutset by aureleoules)
  • #10443 (Add fee_est tool for debugging fee estimation code by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@fanquake
Copy link
Member

cc @theuni

@theuni
Copy link
Member
theuni commented May 22, 2023

Nice! Huge concept ACK. Haven't looked at the approach/code yet but will review.

@theuni
Copy link
Member
theuni commented May 25, 2023

I spent some time reviving my clang-tidy transform plugin which is able to force bubble-ups of fatal errors and decorations of functions which may call them. I'm not suggesting that we use this approach (I mentioned at core-dev that it works but I find it pretty ugly), though it's useful to point out our current shutdown behavior.

Note that the plugin is self-written, so it's not flawless. It does produce compiling code though, so I have a reasonable amount of confidence.

See here for the transform of the current master branch.

A quick guide:

  • Any functions/assignments/declarations which may call shutdown is wrapped in a macro like MAYBE_EXIT, EXIT_OR_IF, etc.
  • Any function containing one of those gets a return type of MaybeEarlyExit<real_return_type> in order to bubble it up the call-stack.
  • This repeats recursively until the topmost functions have had their return type modified.

This provides an easy way to see whether or not a high-level function may call shutdown from some deeply nested function.

It also provides an easy way to see when calls to shutdown are made without returning to the calling function.

I haven't looked at the actual result yet at all. I suspect it will point out several things for us to fix.

@maflcko
Copy link
Member
maflcko commented May 26, 2023

It also provides an easy way to see when calls to shutdown are made without returning to the calling function.

Can you explain this? Am I correct when guessing that any added EXIT* macro, and not the BUBBLE* macros are indicating a missing return?

@theuni
Copy link
Member
theuni commented May 30, 2023

It also provides an easy way to see when calls to shutdown are made without returning to the calling function.

Can you explain this? Am I correct when guessing that any added EXIT* macro, and not the BUBBLE* macros are indicating a missing return?

@MarcoFalke Right, the BUBBLE_UP is kinda like a cast, it's not so interesting. The others introduce conditional returns to pass any errors back up the stack. So the problem cases are ones where we don't already have that behavior.

The main complication comes from the fact that the callstack might look like:

void foo()
  int bar()
    std::string baz()

Where baz() returns a fatal error. The solution implemented here is to wrap everything in a variant:

MaybeEarlyExit<void> foo()
  MaybeEarlyExit<int> bar()
    MaybeEarlyExit<std::string> baz()

Where: MaybeEarlyExit : std::variant<T, FatalError, UserInterrupted> (and those are simple enum classes)

So basically if the call resulted in an error, pass an error type back up immediately rather than a real return value. These macros are defined here and do that as necessary (you can largely think of them of different notations of try/catch/rethrow). In fact, this model forces that invariant: you get a return value or a fatal error but never both.


BlockManager::WriteUndoDataForBlock is a good example of correct/incorrect:

  • FindUndoPos: potential error caught and passed back up (what callers do with it is a different question)
  • UndoWriteToDisk: aborts and returns immediately as needed
  • FlushUndoFile: this one is (I believe) an example of a missed return in master that could cause problems with changes like these. Not necessarily this one in particular, but it's clearly not alone. Edit: to be clear, the problem here is that FlushUndoFile may call AbortNode, but we don't react to that in WriteUndoDataForBlock.

I'm not sure how to proceed. @ryanofsky mentioned maybe fixing them up as part of this shutdown callback effort?

@maflcko
Copy link
8000
Member
maflcko commented May 30, 2023

Edit: to be clear, the problem here is that FlushUndoFile may call AbortNode, but we don't react to that in WriteUndoDataForBlock.

Do we have to react there? AbortNode starts a shutdown, and the shutdown is later checked in ABC:

if (ShutdownRequested()) break;

Sure, better annotations are better, but then why not use the existing Result class over the MaybeEarlyExit type?

@theuni
Copy link
Member
theuni commented May 30, 2023

Edit: to be clear, the problem here is that FlushUndoFile may call AbortNode, but we don't react to that in WriteUndoDataForBlock.

Do we have to react there? AbortNode starts a shutdown, and the shutdown is later checked in ABC:

if (ShutdownRequested()) break;

Yes, looks like I picked this example too hastily. You're correct that this one is caught higher up.

Sure, better annotations are better, but then why not use the existing Result class over the MaybeEarlyExit type?

Again, I'm not suggesting that we use MaybeEarlyExit. Only suggesting analyzing the above transformation to see if there are true bugs hiding.

@ryanofsky
Copy link
Contributor
10000 ryanofsky commented Jun 1, 2023

Is there really a qualitative difference between the errors?

I didn't dig into the actual errors, but the qualitative difference here seems to be that the other AbortNode errors actually need to be fatal and should try to shut down to prevent getting into a worse state. But an error flushing state in memory to disk does not need to be fatal, as long the state in memory is kept and can be saved later. There are lots of ways an application could handle a failed flush notification, and I think it would be better to avoid using fatal error callback for any notification that does not have to be fatal.

The choice of whether to use separate methods or enums is basically just aesthetic. And I didn't dig into whether these particular flush errors need to fatal or not. I think you can decide what to do here or whether to do anything at all. My reason for suggesting "flushFailed" as an alternative to "nodiscard bool" was that latter change might be more invasive and require changing more sensitive code.

This change helps generalize shutdown code so an interrupt can be
provided to libbitcoinkernel callers. This may also be useful to
eventually de-globalize all of the shutdown code.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Co-authored-by: TheCharlatan <seb.kung@gmail.com>
@hebasto
Copy link
Member
hebasto commented Jun 5, 2023

Concept ACK.

@hebasto
Copy link
Member
hebasto commented Jun 5, 2023

#27711 (comment):

I think the approach in the branch is better for two reasons...

I agree with both of them.

TheCharlatan and others added 6 commits June 7, 2023 10:13
This and the following commit seek to decouple the libbitcoinkernel
library from the shutdown code. As a library, it should it should have
its own flexible interrupt infrastructure without relying on node-wide
globals.

The commit takes the first step towards this goal by de-globalising
`ShutdownRequested` calls in kernel code.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Co-authored-by: TheCharlatan <seb.kung@gmail.com>
By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we
ensure that no further write operations take place that may lead to an
inconsistent database when crashing.

Prior to this patch `FlushBlockFile` may have failed without returning
in `Chainstate::FlushStateToDisk`, leading to a potential write from
`WriteBlockIndexDB` that may refer to a block that is not fully
flushed to disk yet.

Functions that have either `FlushUndoFile` or `FlushBlockFile` in their
call stack, need to handle these extra abort cases properly. Since
`Chainstate::FlushStateToDisk` already produces an abort error in case
of `WriteBlockIndexDB` failing, no extra logic for functions calling
`Chainstate::FlushStateToDisk` is required.

Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called
by `FindBlockPos`, which also already produces other potential abort
failures. FlushUndoFile is only called by `FlushBlockFile` and
`WriteUndoDataForBlock`. Both already produce their own abort errors, so
again no extra handling of these cases in calling code is required.

Add [[nodiscard]] annotations to these cases such that they are not
ignored in future.
This is done in preparation for the next commit where a new FatalError
function is introduced. FatalErrorf follows common convention to append
'f' for functions accepting format arguments.

-BEGIN VERIFY SCRIPT-
sed -i 's/FatalError/FatalErrorf/g' $( git grep -l 'FatalError')
-END VERIFY SCRIPT-
FatalError replaces what previously was the AbortNode function in
shutdown.cpp.

This commit is part of the libbitcoinkernel project and removes the
shutdown's and, more generally, the kernel library's dependency on
interface_ui with a kernel notification method. By removing interface_ui
from the kernel library, its dependency on boost is reduced to just
boost::multi_index. At the same time it also takes a step towards
de-globalising the interrupt infrastructure.

Also add an interrupt switch in chainman and blockman. This simplifies
testing of validation methods that may normally trigger an interrupt on a
fatal error by leaving out the interrupt invocation.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Co-authored-by: TheCharlatan <seb.kung@gmail.com>
This is an error and we return on it, so treat it like one and don't
call the normal shutdown procedure.
This decouples the libbitcoinkernel library from the shutdown routines.
As a library, it should it should have its own flexible interrupt
infrastructure without relying on node-wide globals.

This commit should achieve the final step towards this goal by replacing
calls to shutdown in kernel code with calls to interrupt. Since some
kernel library user might not want to interrupt, but still receive the
notification, add a bool to control whether an interrupt is triggered
upon reaching one of the trigger conditions.


8000
While touching the lines around StartShutdown, it became apparent that
there was no logging indicating the reason for a call to StartShutdown.
Add an InterruptReason enum to the interrupt notification to communicate
this.
@TheCharlatan
Copy link
Contributor Author

Thank you @theuni and @ryanofsky for the great discussion so far.

Updated a6a3c32 -> f13880e (rmKernelShutdown_0 -> rmKernelShutdown_1, compare).

This update reworks the existing logic in this pull request by integrating @ryanofsky's suggested branch.

Instead of getting rid of the shutdown logic in the kernel, the core logic is now retained, but de-globalized and wrapped into a more user-friendly and self-descriptive class (SignalInterrupt).

As an alternative to @ryanofsky's suggestion for dropping the commits handling startShutdown I added boolean options that can be set by users of the kernel to indicate whether it should terminate if either a FatalError or Interrupt should trigger an actual interrupt.

Also adds two new commits, one for making a failure connecting the best block a FatalError and another for making FlushBlockFile and FlushUndoFile return on calls to FatalError. Rationale for each of these is provided in the commit messages.

These patches now went through quite some evolution in terms of their scope and intention. I am pushing this patch set to show how I'd currently unify the feeback received so far. In terms of next steps I'd like to lay out the following options:

  1. Get the current patch set through review, completing the goal of removing the ui_interface and some of the boost headers from the kernel. Complete step 2 of this phase of the kernel project. Add another step in this phase of the kernel project for strictly typing any functions that may cause an interrupt. My current thinking is to use @ryanofsky's refactor: Add util::Result failure values, multiple error and warning messages #25665 extension of the Result type to return either a result or these fatal errors. Each instance of an interrupt call will be replaced piecemeal in separate PR's to allow for clean review. I found that some of these cases get quite complicated, meaning simply returning on a fatal error may lead to bugs if surrounding code is not also adapted, and catching all nuances in a single, sprawling PR is probably too burdensome. Eventually this should lead to code where we'd only have to trigger a fatal interrupt at the edges of the library, and otherwise just push fatal errors to code using it.

  2. Revert to a more minimal patch set, basically containing the new interrupt class + the FatalError wrapper to remove the ui_interface. Then proceed as in step 1.

  3. Close this PR, then proceed as in step 1. Once all fatal interrupt cases are handled, circle back to the kernel interrupt handling for long-running functions and conditional interrupts (in other words those lines of code that currently call StartShutdown).

Option 1 is what seems pragmatic to me currently, option 2 would be closest to the goals as laid out in the kernel tracking issue (first cleave unwanted functionality / dependencies, then improve interfaces and fix weirdness discovered during development). Option 3 is where my heart is, even if it might block progress for some time and would depend on a bunch of other improvements to be completed first.

@TheCharlatan TheCharlatan changed the title kernel: Remove shutdown from kernel library kernel: Remove shutdown globals from kernel library Jun 7, 2023
@ryanofsky
Copy link
Contributor

Thanks for following up and incorporating the feedback!

On the topic of how to move forward with these changes, I think we should see if we can find a way to avoid making temporary changes to validation code that are not desirable for the long term. Specifically:

  • Having interrupt_on_fatal_error and interrupt_on_condition ChainStateManager options which are complicated (and very confusing even to me!)
  • Giving ChainStateManager a mutable reference to the interrupt object instead of a const reference, and using the mutable reference to have it send interrupt signals to itself.

If temporary kludges are necessary to avoid having a PR that is too big to review, maybe that's a good tradeoff. But it would be better if we could see what the final code looks like before making that tradeoff. If this isn't possible, I think we should at least be able to add comments / documentation in the code saying which things are temporary and how they should be used in the meantime.

If you want to take time to iterate more on this code more, I think we could easily pull out parts of this PR right now that are pure improvements and could be reviewed and merged quickly. If you took the first commit here 8b3ac7d and the last two commits a4034e7
and bbfbcd0 that were previously part of #27636, I think we could quickly review and merge these in a separate PR as they are pure refactorings which do not change behavior or add temporary kludges to validation code.

After that PR or alongside it commits 60f5461 and 79b549a from this PR also seem like good candidates for splitting off and being in their own PR so they could get more careful review as they do change behavior subtly.

Great to see more work here, though. I think there are still some tricky details to work out but most of the changes seem very good.

@DrahtBot
Copy link
Contributor
DrahtBot commented Jun 9, 2023

🐙 This pull request conflicts with the target branch and needs rebase.

@TheCharlatan
Copy link
Contributor Author

Re #27711 (comment)

If you want to take time to iterate more on this code more, I think we could easily pull out parts of this PR right now that are pure improvements and could be reviewed and merged quickly.

Ok, I will put this PR back to draft, and open two new ones as you suggested. I'll continue using this draft to stage my further improvements.

@TheCharlatan
Copy link
Contributor Author

I opened #27861 to introduce the interrupt class and implement the fatalError notification. Also opened #27866 to start tackling proper error return types when a function calls abort.

ryanofsky added a commit that referenced this pull request Jul 6, 2023
…ion code.

6eb33bd kernel: Add fatalError method to notifications (TheCharlatan)
7320db9 kernel: Add flushError method to notifications (TheCharlatan)
3fa9094 scripted-diff: Rename FatalError to FatalErrorf (TheCharlatan)
edb55e2 kernel: Pass interrupt reference to chainman (TheCharlatan)
e2d680a util: Add SignalInterrupt class and use in shutdown.cpp (TheCharlatan)

Pull request description:

  Get rid of all `ShutdownRequested` calls in validation code by introducing an interrupt object that applications can use to cancel long-running kernel operations.

  Replace all `AbortNode` calls in validation code with new fatal error and flush error notifications so kernel applications can be notified about failures and choose how to handle them.

  ---

  This pull request is part of the `libbitcoinkernel` project #27587 https://github.com/orgs/bitcoin/projects/3 and more specifically its "Step 2: Decouple most non-consensus code from libbitcoinkernel".

  The pull request mostly allows dropping the kernel dependency on shutdown.cpp. The only dependency left after this is a `StartShutdown` call which will be removed in followup PR #27711. This PR also drops the last reference to the `uiInterface` global in kernel code. The process of moving the `uiInterface` out of the kernel was started in #27636.

  This pull request contains a subset of patches originally proposed in #27711. It will be part of a series of changes required to make handling of interrupts (or in other words the current shutdown procedure) in the kernel library more transparent and less reliable on global mutable state. The set of patches contained here was originally proposed by @ryanofsky [here](#27711 (comment)).

ACKs for top commit:
  achow101:
    light ACK 6eb33bd
  hebasto:
    ACK 6eb33bd, I have reviewed the code and it looks OK.
  ryanofsky:
    Code review ACK 6eb33bd. No changes since last review other than rebase.

Tree-SHA512: 7d2d05fa4805428a09466d43c11ae32946cbb25aa5e741b1eec9cd142e4de4bb311e13ebf1bb125ae490c9d08274f2d56c93314e10f3d69e7fec7445e504987c
@TheCharlatan
Copy link
Contributor Author

Closing this PR in favor of:

ryanofsky added a commit that referenced this pull request Sep 29, 2023
d8041d4 blockstorage: Return on fatal undo file flush error (TheCharlatan)
f0207e0 blockstorage: Return on fatal block file flush error (TheCharlatan)
5671c15 blockstorage: Mark FindBlockPos as nodiscard (TheCharlatan)

Pull request description:

  The goal of this PR is to establish that fatal blockstorage flush errors should be treated as errors at their call site.

  Prior to this patch `FlushBlockFile` may have failed without returning in `Chainstate::FlushStateToDisk`, leading to a potential write from `WriteBlockIndexDB` that may refer to a block that is not fully flushed to disk yet. By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Add `[[nodiscard]]` annotations to them such that they are not ignored in future.

  Functions that call either `FlushUndoFile` or `FlushBlockFile`, need to handle these extra abort cases properly. Since `Chainstate::FlushStateToDisk` already produces an abort error in case of `WriteBlockIndexDB` failing, no extra logic for functions calling `Chainstate::FlushStateToDisk` is required.

  Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called by `FindBlockPos`, while `FlushUndoFile` is only called by `FlushBlockFile` and `WriteUndoDataForBlock`. For both these cases, the flush error is not further bubbled up. Instead, the error is logged and a comment is provided why bubbling up an error would be less desirable in these cases.

  ---

  This pull request is part of a larger effort towards improving the shutdown / abort / fatal error handling in validation code. It is a first step towards implementing proper fatal error return type enforcement similar as proposed by theuni in this pull request [comment](#27711 (comment)). For ease of review of these critical changes, a first step would be checking that `AbortNode` leads to early and error-conveying returns at its call site. Further work for enforcing returns when `AbortNode` is called is done in #27862.

ACKs for top commit:
  stickies-v:
    re-ACK d8041d4
  ryanofsky:
    Code review ACK d8041d4

Tree-SHA512: 47ade9b873b15e567c8f60ca538d5a0daf32163e1031be3212a3a45eb492b866664b225f2787c9e40f3e0c089140157d8fd1039abc00c7bdfeec1b52ecd7e219
Frank-GER pushed a commit to syscoin/syscoin that referenced this pull request Oct 5, 2023
d8041d4 blockstorage: Return on fatal undo file flush error (TheCharlatan)
f0207e0 blockstorage: Return on fatal block file flush error (TheCharlatan)
5671c15 blockstorage: Mark FindBlockPos as nodiscard (TheCharlatan)

Pull request description:

  The goal of this PR is to establish that fatal blockstorage flush errors should be treated as errors at their call site.

  Prior to this patch `FlushBlockFile` may have failed without returning in `Chainstate::FlushStateToDisk`, leading to a potential write from `WriteBlockIndexDB` that may refer to a block that is not fully flushed to disk yet. By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Add `[[nodiscard]]` annotations to them such that they are not ignored in future.

  Functions that call either `FlushUndoFile` or `FlushBlockFile`, need to handle these extra abort cases properly. Since `Chainstate::FlushStateToDisk` already produces an abort error in case of `WriteBlockIndexDB` failing, no extra logic for functions calling `Chainstate::FlushStateToDisk` is required.

  Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called by `FindBlockPos`, while `FlushUndoFile` is only called by `FlushBlockFile` and `WriteUndoDataForBlock`. For both these cases, the flush error is not further bubbled up. Instead, the error is logged and a comment is provided why bubbling up an error would be less desirable in these cases.

  ---

  This pull request is part of a larger effort towards improving the shutdown / abort / fatal error handling in validation code. It is a first step towards implementing proper fatal error return type enforcement similar as proposed by theuni in this pull request [comment](bitcoin#27711 (comment)). For ease of review of these critical changes, a first step would be checking that `AbortNode` leads to early and error-conveying returns at its call site. Further work for enforcing returns when `AbortNode` is called is done in bitcoin#27862.

ACKs for top commit:
  stickies-v:
    re-ACK d8041d4
  ryanofsky:
    Code review ACK d8041d4

Tree-SHA512: 47ade9b873b15e567c8f60ca538d5a0daf32163e1031be3212a3a45eb492b866664b225f2787c9e40f3e0c089140157d8fd1039abc00c7bdfeec1b52ecd7e219
@bitcoin bitcoin locked and limited conversation to collaborators Jul 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Status: Done or Closed or Rethinking
Development

Successfully merging this pull request may close these issues.

7 participants
0