-
Notifications
You must be signed in to change notification settings - Fork 37.4k
kernel: Remove shutdown globals from kernel library #27711
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
cc @theuni |
Nice! Huge concept ACK. Haven't looked at the approach/code yet but will review. |
edc5483
to
fecdaf0
Compare
fecdaf0
to
4ea100c
Compare
I spent some time reviving my clang-tidy transform plugin which is able to force bubble-ups of fatal errors and decorations of functions which may call them. I'm not suggesting that we use this approach (I mentioned at core-dev that it works but I find it pretty ugly), though it's useful to point out our current shutdown behavior. Note that the plugin is self-written, so it's not flawless. It does produce compiling code though, so I have a reasonable amount of confidence. See here for the transform of the current master branch. A quick guide:
This provides an easy way to see whether or not a high-level function may call shutdown from some deeply nested function. It also provides an easy way to see when calls to shutdown are made without returning to the calling function. I haven't looked at the actual result yet at all. I suspect it will point out several things for us to fix. |
Can you explain this? Am I correct when guessing that any added EXIT* macro, and not the BUBBLE* macros are indicating a missing return? |
@MarcoFalke Right, the The main complication comes from the fact that the callstack might look like:
Where
Where: So basically if the call resulted in an error, pass an error type back up immediately rather than a real return value. These macros are defined here and do that as necessary (you can largely think of them of different notations of try/catch/rethrow). In fact, this model forces that invariant: you get a return value or a fatal error but never both. BlockManager::WriteUndoDataForBlock is a good example of correct/incorrect:
I'm not sure how to proceed. @ryanofsky mentioned maybe fixing them up as part of this shutdown callback effort? |
Do we have to react there? Line 3193 in 7130048
Sure, better annotations are better, but then why not use the existing |
4ea100c
to
b0de447
Compare
Yes, looks like I picked this example too hastily. You're correct that this one is caught higher up.
Again, I'm not suggesting that we use |
b0de447
to
a6a3c32
Compare
I didn't dig into the actual errors, but the qualitative difference here seems to be that the other AbortNode errors actually need to be fatal and should try to shut down to prevent getting into a worse state. But an error flushing state in memory to disk does not need to be fatal, as long the state in memory is kept and can be saved later. There are lots of ways an application could handle a failed flush notification, and I think it would be better to avoid using fatal error callback for any notification that does not have to be fatal. The choice of whether to use separate methods or enums is basically just aesthetic. And I didn't dig into whether these particular flush errors need to fatal or not. I think you can decide what to do here or whether to do anything at all. My reason for suggesting "flushFailed" as an alternative to "nodiscard bool" was that latter change might be more invasive and require changing more sensitive code. |
This change helps generalize shutdown code so an interrupt can be provided to libbitcoinkernel callers. This may also be useful to eventually de-globalize all of the shutdown code. Co-authored-by: Russell Yanofsky <russ@yanofsky.org> Co-authored-by: TheCharlatan <seb.kung@gmail.com>
Concept ACK. |
I agree with both of them. |
This and the following commit seek to decouple the libbitcoinkernel library from the shutdown code. As a library, it should it should have its own flexible interrupt infrastructure without relying on node-wide globals. The commit takes the first step towards this goal by de-globalising `ShutdownRequested` calls in kernel code. Co-authored-by: Russell Yanofsky <russ@yanofsky.org> Co-authored-by: TheCharlatan <seb.kung@gmail.com>
By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Prior to this patch `FlushBlockFile` may have failed without returning in `Chainstate::FlushStateToDisk`, leading to a potential write from `WriteBlockIndexDB` that may refer to a block that is not fully flushed to disk yet. Functions that have either `FlushUndoFile` or `FlushBlockFile` in their call stack, need to handle these extra abort cases properly. Since `Chainstate::FlushStateToDisk` already produces an abort error in case of `WriteBlockIndexDB` failing, no extra logic for functions calling `Chainstate::FlushStateToDisk` is required. Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called by `FindBlockPos`, which also already produces other potential abort failures. FlushUndoFile is only called by `FlushBlockFile` and `WriteUndoDataForBlock`. Both already produce their own abort errors, so again no extra handling of these cases in calling code is required. Add [[nodiscard]] annotations to these cases such that they are not ignored in future.
This is done in preparation for the next commit where a new FatalError function is introduced. FatalErrorf follows common convention to append 'f' for functions accepting format arguments. -BEGIN VERIFY SCRIPT- sed -i 's/FatalError/FatalErrorf/g' $( git grep -l 'FatalError') -END VERIFY SCRIPT-
FatalError replaces what previously was the AbortNode function in shutdown.cpp. This commit is part of the libbitcoinkernel project and removes the shutdown's and, more generally, the kernel library's dependency on interface_ui with a kernel notification method. By removing interface_ui from the kernel library, its dependency on boost is reduced to just boost::multi_index. At the same time it also takes a step towards de-globalising the interrupt infrastructure. Also add an interrupt switch in chainman and blockman. This simplifies testing of validation methods that may normally trigger an interrupt on a fatal error by leaving out the interrupt invocation. Co-authored-by: Russell Yanofsky <russ@yanofsky.org> Co-authored-by: TheCharlatan <seb.kung@gmail.com>
This is an error and we return on it, so treat it like one and don't call the normal shutdown procedure.
This decouples the libbitcoinkernel library from the shutdown routines. As a library, it should it should have its own flexible interrupt infrastructure without relying on node-wide globals. This commit should achieve the final step towards this goal by replacing calls to shutdown in kernel code with calls to interrupt. Since some kernel library user might not want to interrupt, but still receive the notification, add a bool to control whether an interrupt is triggered upon reaching one of the trigger conditions. 8000 While touching the lines around StartShutdown, it became apparent that there was no logging indicating the reason for a call to StartShutdown. Add an InterruptReason enum to the interrupt notification to communicate this.
a6a3c32
to
f13880e
Compare
Thank you @theuni and @ryanofsky for the great discussion so far. Updated a6a3c32 -> f13880e (rmKernelShutdown_0 -> rmKernelShutdown_1, compare). This update reworks the existing logic in this pull request by integrating @ryanofsky's suggested branch. Instead of getting rid of the shutdown logic in the kernel, the core logic is now retained, but de-globalized and wrapped into a more user-friendly and self-descriptive class ( As an alternative to @ryanofsky's suggestion for dropping the commits handling Also adds two new commits, one for making a failure connecting the best block a These patches now went through quite some evolution in terms of their scope and intention. I am pushing this patch set to show how I'd currently unify the feeback received so far. In terms of next steps I'd like to lay out the following options:
Option 1 is what seems pragmatic to me currently, option 2 would be closest to the goals as laid out in the kernel tracking issue (first cleave unwanted functionality / dependencies, then improve interfaces and fix weirdness discovered during development). Option 3 is where my heart is, even if it might block progress for some time and would depend on a bunch of other improvements to be completed first. |
Thanks for following up and incorporating the feedback! On the topic of how to move forward with these changes, I think we should see if we can find a way to avoid making temporary changes to validation code that are not desirable for the long term. Specifically:
If temporary kludges are necessary to avoid having a PR that is too big to review, maybe that's a good tradeoff. But it would be better if we could see what the final code looks like before making that tradeoff. If this isn't possible, I think we should at least be able to add comments / documentation in the code saying which things are temporary and how they should be used in the meantime. If you want to take time to iterate more on this code more, I think we could easily pull out parts of this PR right now that are pure improvements and could be reviewed and merged quickly. If you took the first commit here 8b3ac7d and the last two commits a4034e7 After that PR or alongside it commits 60f5461 and 79b549a from this PR also seem like good candidates for splitting off and being in their own PR so they could get more careful review as they do change behavior subtly. Great to see more work here, though. I think there are still some tricky details to work out but most of the changes seem very good. |
🐙 This pull request conflicts with the target branch and needs rebase. |
Ok, I will put this PR back to draft, and open two new ones as you suggested. I'll continue using this draft to stage my further improvements. |
…ion code. 6eb33bd kernel: Add fatalError method to notifications (TheCharlatan) 7320db9 kernel: Add flushError method to notifications (TheCharlatan) 3fa9094 scripted-diff: Rename FatalError to FatalErrorf (TheCharlatan) edb55e2 kernel: Pass interrupt reference to chainman (TheCharlatan) e2d680a util: Add SignalInterrupt class and use in shutdown.cpp (TheCharlatan) Pull request description: Get rid of all `ShutdownRequested` calls in validation code by introducing an interrupt object that applications can use to cancel long-running kernel operations. Replace all `AbortNode` calls in validation code with new fatal error and flush error notifications so kernel applications can be notified about failures and choose how to handle them. --- This pull request is part of the `libbitcoinkernel` project #27587 https://github.com/orgs/bitcoin/projects/3 and more specifically its "Step 2: Decouple most non-consensus code from libbitcoinkernel". The pull request mostly allows dropping the kernel dependency on shutdown.cpp. The only dependency left after this is a `StartShutdown` call which will be removed in followup PR #27711. This PR also drops the last reference to the `uiInterface` global in kernel code. The process of moving the `uiInterface` out of the kernel was started in #27636. This pull request contains a subset of patches originally proposed in #27711. It will be part of a series of changes required to make handling of interrupts (or in other words the current shutdown procedure) in the kernel library more transparent and less reliable on global mutable state. The set of patches contained here was originally proposed by @ryanofsky [here](#27711 (comment)). ACKs for top commit: achow101: light ACK 6eb33bd hebasto: ACK 6eb33bd, I have reviewed the code and it looks OK. ryanofsky: Code review ACK 6eb33bd. No changes since last review other than rebase. Tree-SHA512: 7d2d05fa4805428a09466d43c11ae32946cbb25aa5e741b1eec9cd142e4de4bb311e13ebf1bb125ae490c9d08274f2d56c93314e10f3d69e7fec7445e504987c
Closing this PR in favor of:
|
d8041d4 blockstorage: Return on fatal undo file flush error (TheCharlatan) f0207e0 blockstorage: Return on fatal block file flush error (TheCharlatan) 5671c15 blockstorage: Mark FindBlockPos as nodiscard (TheCharlatan) Pull request description: The goal of this PR is to establish that fatal blockstorage flush errors should be treated as errors at their call site. Prior to this patch `FlushBlockFile` may have failed without returning in `Chainstate::FlushStateToDisk`, leading to a potential write from `WriteBlockIndexDB` that may refer to a block that is not fully flushed to disk yet. By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Add `[[nodiscard]]` annotations to them such that they are not ignored in future. Functions that call either `FlushUndoFile` or `FlushBlockFile`, need to handle these extra abort cases properly. Since `Chainstate::FlushStateToDisk` already produces an abort error in case of `WriteBlockIndexDB` failing, no extra logic for functions calling `Chainstate::FlushStateToDisk` is required. Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called by `FindBlockPos`, while `FlushUndoFile` is only called by `FlushBlockFile` and `WriteUndoDataForBlock`. For both these cases, the flush error is not further bubbled up. Instead, the error is logged and a comment is provided why bubbling up an error would be less desirable in these cases. --- This pull request is part of a larger effort towards improving the shutdown / abort / fatal error handling in validation code. It is a first step towards implementing proper fatal error return type enforcement similar as proposed by theuni in this pull request [comment](#27711 (comment)). For ease of review of these critical changes, a first step would be checking that `AbortNode` leads to early and error-conveying returns at its call site. Further work for enforcing returns when `AbortNode` is called is done in #27862. ACKs for top commit: stickies-v: re-ACK d8041d4 ryanofsky: Code review ACK d8041d4 Tree-SHA512: 47ade9b873b15e567c8f60ca538d5a0daf32163e1031be3212a3a45eb492b866664b225f2787c9e40f3e0c089140157d8fd1039abc00c7bdfeec1b52ecd7e219
d8041d4 blockstorage: Return on fatal undo file flush error (TheCharlatan) f0207e0 blockstorage: Return on fatal block file flush error (TheCharlatan) 5671c15 blockstorage: Mark FindBlockPos as nodiscard (TheCharlatan) Pull request description: The goal of this PR is to establish that fatal blockstorage flush errors should be treated as errors at their call site. Prior to this patch `FlushBlockFile` may have failed without returning in `Chainstate::FlushStateToDisk`, leading to a potential write from `WriteBlockIndexDB` that may refer to a block that is not fully flushed to disk yet. By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Add `[[nodiscard]]` annotations to them such that they are not ignored in future. Functions that call either `FlushUndoFile` or `FlushBlockFile`, need to handle these extra abort cases properly. Since `Chainstate::FlushStateToDisk` already produces an abort error in case of `WriteBlockIndexDB` failing, no extra logic for functions calling `Chainstate::FlushStateToDisk` is required. Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called by `FindBlockPos`, while `FlushUndoFile` is only called by `FlushBlockFile` and `WriteUndoDataForBlock`. For both these cases, the flush error is not further bubbled up. Instead, the error is logged and a comment is provided why bubbling up an error would be less desirable in these cases. --- This pull request is part of a larger effort towards improving the shutdown / abort / fatal error handling in validation code. It is a first step towards implementing proper fatal error return type enforcement similar as proposed by theuni in this pull request [comment](bitcoin#27711 (comment)). For ease of review of these critical changes, a first step would be checking that `AbortNode` leads to early and error-conveying returns at its call site. Further work for enforcing returns when `AbortNode` is called is done in bitcoin#27862. ACKs for top commit: stickies-v: re-ACK d8041d4 ryanofsky: Code review ACK d8041d4 Tree-SHA512: 47ade9b873b15e567c8f60ca538d5a0daf32163e1031be3212a3a45eb492b866664b225f2787c9e40f3e0c089140157d8fd1039abc00c7bdfeec1b52ecd7e219
This pull request is part of the
libbitcoinkernel
project #27587 The libbitcoinkernel Project and more specifically its "Step 2: Decouple most non-consensus code from libbitcoinkernel".This removes the shutdown files from the kernel library. As a standalone library the libbitcoinkernel should not have to rely on code that accesses shutdown state through a global. Instead, replace it with an interrupt class that is owned by the kernel context. Additionally this moves the remaining usages of the
uiInterface
out of the kernel code. The process of moving theuiInterface
out of the kernel was started in #27636.The approach taken here uses the notification interface to externalize the existing shutdown code from the kernel library. The
SignalInterrupt
kernel context member is passed by reference to theChainstateManager
to allow for the termination of long-running kernel functions and sending interrupt signals.