8000 blockstorage: Return on fatal flush errors by TheCharlatan · Pull Request #27866 · bitcoin/bitcoin · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

blockstorage: Return on fatal flush errors #27866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 29, 2023

Conversation

TheCharlatan
Copy link
Contributor
@TheCharlatan TheCharlatan commented Jun 12, 2023

The goal of this PR is to establish that fatal blockstorage flush errors should be treated as errors at their call site.

Prior to this patch FlushBlockFile may have failed without returning in Chainstate::FlushStateToDisk, leading to a potential write from WriteBlockIndexDB that may refer to a block that is not fully flushed to disk yet. By returning if either FlushUndoFile or FlushBlockFile fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Add [[nodiscard]] annotations to them such that they are not ignored in future.

Functions that call either FlushUndoFile or FlushBlockFile, need to handle these extra abort cases properly. Since Chainstate::FlushStateToDisk already produces an abort error in case of WriteBlockIndexDB failing, no extra logic for functions calling Chainstate::FlushStateToDisk is required.

Besides Chainstate::FlushStateToDisk, FlushBlockFile is also called by FindBlockPos, while FlushUndoFile is only called by FlushBlockFile and WriteUndoDataForBlock. For both these cases, the flush error is not further bubbled up. Instead, the error is logged and a comment is provided why bubbling up an error would be less desirable in these cases.


This pull request is part of a larger effort towards improving the shutdown / abort / fatal error handling in validation code. It is a first step towards implementing proper fatal error return type enforcement similar as proposed by theuni in this pull request comment. For ease of review of these critical changes, a first step would be checking that AbortNode leads to early and error-conveying returns at its call site. Further work for enforcing returns when AbortNode is called is done in #27862.

@DrahtBot
8000 Copy link
Contributor
DrahtBot commented Jun 12, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK stickies-v, ryanofsky
Concept ACK hebasto, naumenkogs
Stale ACK dergoegge

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #28516 (validation: assumeutxo params for testnet and signet by Sjors)
  • #27596 (assumeutxo (2) by jamesob)
  • #27039 (blockstorage: do not flush block to disk if it is already there by pinheadmz)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Contributor
@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about all the changes in this PR because it seems like some validation code is intentionally written to ignore flush errors, and now they are returning early and skipping other work. It also seems reasonable that some failed flushes like periodic flushes might not need to be fatal errors.

But adding return values and nodiscard attributes could at least make it clear if errors are being ignored intentionally or not.

@TheCharlatan TheCharlatan force-pushed the kernelReturnOnAbort branch from 360fe5a to 1fe06a8 Compare June 12, 2023 20:02
@TheCharlatan TheCharlatan changed the title validation: Return on abort blockstorage: Return on fatal flush errors Jun 12, 2023
@TheCharlatan
Copy link
Contributor Author

Re #27866 (review)

It also seems reasonable that some failed flushes like periodic flushes might not need to be fatal errors.

This is a good point, and I did not think of this before. A flush may fail once due to some sporadic error, but may succeed again at a later point in time. I'm still not sure though if it is really correct to ignore the flush error. If we fail to flush, but succeed in writing the block index, and then crash, we might have indexed a block, that has not been synced to disk. It also feels weird that we are currently okay with the block files failing to flush, while the call to CoinsTip().Flush() returns early on failure. Can you comment on this rationale?

@hebasto
Copy link
Member
hebasto commented Jun 15, 2023

A flush may fail once due to some sporadic error, but may succeed again at a later point in time. I'm still not sure though if it is really correct to ignore the flush error.

At least, such a flush error can be logged.

@ryanofsky
Copy link
Contributor

Can you comment on this rationale?

I'm not familiar enough with existing code to know why it's written the way it is. It could have a great rationale or just be weird like it appears to be.

I'm less concerned about this PR than I am about #27861, which is treating flush errors as fatal errors when they are not inherently fatal. If code writing to disk fails it could either mean (1) state on disk is corrupt or invalid and can't be recovered from or (2) state on disk is valid but out of date and requires more work to catch up. These cases are inherently different and I think kernel API would benefit off having separate "flush error" and "fatal error" notifications to distinguish them, regardless of implementation details of current code.

Copy link
Contributor
@stickies-v stickies-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

The blockstorage changes seem pretty straightforward, the validation.cpp one I'll need to dive deeper into since it's got quite a few callsites.

Copy link
Contributor
@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed in IRC libbitcoinkernel updates:

<achow101> it looks like the current pr is #27866, although #27861 has also been getting review

IMO, this PR could be an improvement, but right now seems like a draft change. I think the main thing that's missing is a clear description of how it changes behavior, and why the behavior change is safe.

@TheCharlatan
Copy link
Contributor Author

Copy link
Contributor
@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review d1c92b5. I think the extra error reporting here is great, and some of the changes in behavior seem good too. But I think there is too much happening in one commit. There are 3-4 separate changes in behavior that could be implemented in separate commits, and I think would be more clearly understood and evaluated that way.

@@ -644,7 +650,9 @@ bool BlockManager::FindBlockPos(FlatFilePos& pos, unsigned int nAddSize, unsigne
if (!fKnown) {
LogPrint(BCLog::BLOCKSTORAGE, "Leaving block file %i: %s\n", m_last_blockfile, m_blockfile_info[m_last_blockfile].ToString());
}
FlushBlockFile(!fKnown, finalize_undo);
if (!FlushBlockFile(!fKnown, finalize_undo)) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "blockstorage: Return on fatal flush errors" (d1c92b5)

Let me know if I understand this correctly: The new return false avoids writing block data to a new block file if the previous block file is full and cannot be synced and trimmed. This seems to be safe because this FindBlockPos function only has one caller, SaveBlockToDisk, which only has two callers, Chainstate::AcceptBlock and Chainstate::LoadGenesisBlock which both seem to handle this error by stopping what they are doing and writing "Failed to find position to write new block" log messages. I did not look up AcceptBlock and LoadGenesisBlock callers but assume they are handling this case reasonably.

I think existing log messages are vague / misleading, so would suggest adding a new log message here like:

LogPrintf("Failed to flush previous block file %05i (finalize=%i, finalize_undo=%i) before opening new block file %05i", m_last_blockfile, !fKnown, finalize_undo, nFile)`.

I'm not sure what the benefits are of returning false here. I think unlike the other changes in this commit, this change doesn't help avoid any "potential write from WriteBlockIndexDB that may refer to a block that is not fully flushed to disk yet", because this flush call is happening after previous block data has already been successfully written and the database has already been updated.

I think the clearest and most conservative thing to do would be to drop "return false" line in this commit, and only log an error message here. If you think returning false is a good idea, just do it in a followup commit decoupled from the other changes here with a commit message that explains why that is safe / desirable.

}
assert(static_cast<int>(m_blockfile_info.size()) > m_last_blockfile);

FlatFilePos block_pos_old(m_last_blockfile, m_blockfile_info[m_last_blockfile].nSize);
if (!BlockFileSeq().Flush(block_pos_old, fFinalize)) {
m_opts.notifications.flushError("Flushing block file to disk failed. This is likely the result of an I/O error.");
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "blockstorage: Return on fatal flush errors" (d1c92b5)

This skips trying to flush the undo file if flushing the block file fails. It think it would be safer and more conservative to keep previous behavior of flushing both files in sequence.

You can still return false if either flush call fails, but I don't see a benefit in changing behavior here and potentially making code less robust, when it's not necessary just to add a return code.

I think it would also make review simpler to get rid of this unnecessary change, since the other changes in this commit are already pretty complex.

@@ -734,7 +742,9 @@ bool BlockManager::WriteUndoDataForBlock(const CBlockUndo& blockundo, BlockValid
// with the block writes (usually when a synced up node is getting newly mined blocks) -- this case is caught in
// the FindBlockPos function
if (_pos.nFile < m_last_blockfile && static_cast<uint32_t>(block.nHeight) == m_blockfile_info[_pos.nFile].nHeightLast) {
FlushUndoFile(_pos.nFile, true);
if (!FlushUndoFile(_pos.nFile, true)) {
return false;
Copy link
Contributor
@ryanofsky ryanofsky Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "blockstorage: Return on fatal flush errors" (d1c92b5)

This change seems fragile to me, and it's unclear what the benefits are. I think it would be great to add a log print here, but if this this code is going to be changed to return false, I think that should happen in a separate followup commit, not combined with other changes here, and have a clear explanation and justification.

IIUC, this change introduces an inconsistency. For all block data, and for vast majority of undo data, the data is considered to be successfully written if the write call succeeeds regardless of whether syncing and trimming succeeds later. But now with this change, for some small fraction of undo block data, the data will be considered not written if the syncing or trimming fails. And the fraction of blocks this will be true for will be essentially random due to the undo flush heuristic (described in a new comment from #27746)

I understand that your goal here is to have libbitcoinkernel functions return more complete error information. I understand that goal is easier to achieve by changing behavior of validation and block storage code and returning early in certain cases. But fundamentally it is not necessary to change behavior of any existing code to have it return more error information. Without returning early, you can always accumulate errors and just return the error code at the end.

If returning early makes sense, I think it should be justified in its own terms and done carefully in smaller commits.

@@ -2505,7 +2505,9 @@ bool Chainstate::FlushStateToDisk(
LOG_TIME_MILLIS_WITH_CATEGORY("write block and undo data to disk", BCLog::BENCH);

// First make sure all block and undo data is flushed to disk.
m_blockman.FlushBlockFile();
if (!m_blockman.FlushBlockFile()) {
return false;
Copy link
Contributor
@ryanofsky ryanofsky Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "blockstorage: Return on fatal flush errors" (d1c92b5)

This line appears to implement the major change in behavior which is avoiding the "write from
WriteBlockIndexDB that may refer to a block that is not fully flushed to disk yet."

I wonder what this failure looks like in practice, though. Could we write a test for it? I think it would at least make sense to have a log print that explains the problem that's happening.

There are also a lot of callers to FlushStateToDisk and I haven't looked at all of them to determine if they are handling this correctly. Can you describe what they should be doing, and do you know if they actually are doing it? The commit message says "no extra logic for functions calling Chainstate::FlushStateToDisk is required" mentioning an "abort error", but there is no abort error here, only a "return false" so I'm not sure how this is supposed to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what this failure looks like in practice, though. Could we write a test for it?

I could not come up with a way to test this with our existing code. If anybody has any pointers for how to achieve this, I'd happy to give it a shot at implementing the test.

Can you describe what they should be doing, and do you know if they actually are doing it?

Callers of FlushStateToDisk already need to handle failures of the block index writing, so I don't think they require extra logic for handling block file flushing failures. We already have places in the code where we ignore flushing errors (including the block index writing), so I think in places where we are fine with that failure, we should also be fine with the block file flushing failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #27866 (comment)

I think this return false should be removed and replaced by a log print and comment, at least in this PR. Maybe a followup PR could try to change behavior in a bigger way, but it's not clear here that the intended change of returning false is here actually good, and that there aren't other unintended effects.

If a FlatFileSeq::Flush call fails, effectively meaning that an fflush or ftruncate call failed on block file (or undo file after the next commit), it's not clear to me that the new behavior of shutting down right away without trying to save state to disk is better than the old behavior of updating the block index, pruning old block files (which may free up disk space), and updating the coins database. This appears to be the intended change, and it does not seem to be a clear win. In general if there is some low level filesystem error which might be transient or spurious, it seems better to me for an application to handle it by at least trying to save its state than to just give up immediately and throw away data that's still in memory.

I'm also concerned about possible unintended effects of this change because FlushStateToDisk is called so many places, that it's not clear what other effects are outside of this function if it now returns false instead of true.

I like all the other changes in this PR of adding more return information and adding comments explaining error how errors are handled. And I wouldn't be even be surprised if it turns out the new behavior here is actually better than the old behavior in relevant cases. But I think if behavior is going to change, it should really be motivated by something specific like a bug report or test case, and ideally it should happen in a separate PR that doesn't mostly consist of code cleanup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw I think it is fine to change the behavior here but also don't mind if it's done in a follow-up. To me this just seems like a weird quirk given that we do return a fatal error for other flush operations.

I think, a failed flush opens the node up to be crashed in any case, e.g.:

bitcoin/src/net_processing.cpp

Lines 2217 to 2219 in 5aa67eb

if (!m_chainman.m_blockman.ReadBlockFromDisk(*pblockRead, *pindex)) {
assert(!"cannot load block from disk");
}

Cleanly shutting down seems preferable.

@hebasto
Copy link
Member
hebasto commented Jul 14, 2023

Concept ACK.

@TheCharlatan
Copy link
Contributor Author

Putting this back to draft until I've invested the time to properly address @ryanofsky's comments.

@TheCharlatan TheCharlatan marked this pull request as draft July 18, 2023 21:23
@TheCharlatan TheCharlatan force-pushed the kernelReturnOnAbort branch from d1c92b5 to f7072be Compare July 25, 2023 15:25
@TheCharlatan
Copy link
Contributor Author
TheCharlatan commented Jul 25, 2023

Thank you for the excellent review @ryanofsky.

Updated d1c92b5 -> 5720741 (kernelReturnOnAbort_1 -> kernelReturnOnAbort_2, compare)

  • Split the changes up into three commits
  • Addressed @ryanofsky's comment, ignoring FlushBlockFile failure in FindBlockPos. Also added a comment explaining why the return type is ignored and added the suggested log line.
  • Addressed @ryanofsky's comment, always attempting FlushUndoFile in FlushBlockFIle, even when the block file flushing fails and return a success code only at the end of the function.
  • Addressed @ryanofsky's comment, ignoring the FlushUndoFile failure in WriteUndoDataForBlock and instead added a comment and a log message in case of flush failure.

@TheCharlatan
Copy link
Contributor Author

@DrahtBot DrahtBot mentioned this pull request Aug 9, 2023
Copy link
Contributor
@stickies-v stickies-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 7721ab9

The changes in blockstorage.cpp and blockstorage.h are strict improvements, increasing visibility in flushing failures. I found the change in validation.cpp much more difficult to review, given the wide variety of ways in which FlushStateToDisk() is used. I've found that:

  • it improves behaviour in some ways. For example, if FlushBlockFile() fails and the WriteBlockIndexDB() call still succeeds, we can end up in a situation where we expect to have a block, but don't actually have the data on disk. I found this quite easy to replicate by modifying FlushBlockFile() to a no-op, pkill bitcoind during IBD, and then call getblock on the latest synced block. This PR fixes that.
  • I don't see how it can degrade behaviour:
    • logically, it doesn't make sense to flush the block index and chainstate if the block data may be missing
    • we already return early in FlushStateToDisk() for flushing failures, just not for FlushBlockFile()
    • I don't think this can affect consensus, e.g. the BlockValidationState &state out-parameter is unaffected; AcceptBlock() is unaffected
    • this code path is only reached when fDoFullFlush || fPeriodicWrite, which pretty much happens only every hour, when we shutdown, or when we exceed cache sizes.

(thanks @dergoegge for helping me understand the latter two points)

{
bool success = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can be declared closer to where it's used

E54B
// a reindex. A flush error might also leave some of the data files
// untrimmed.
if (!FlushBlockFile(!fKnown, finalize_undo)) {
LogPrintf("Failed to flush previous block file %05i (finalize=%i, finalize_undo=%i) before opening new block file %05i\n", m_last_blockfile, !fKnown, finalize_undo, nFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: long line

Copy link
Contributor
@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review 7721ab9. Thanks for splitting up the commits and adding helpful log prints and comments. I like everything in this PR except the new return false after FlushBlockFile() fails in Chainstate::FlushStateToDisk. I think stickies-v makes the best possible case for it here: #27866 (review), but the test mentioned there of commenting out the block flush code entirely seems not very realistic, and none of the other cleanup in the PR requires this change in behavior, so I would really like to see it moved to another PR and justified on its own terms if it is a change worth making.

@@ -2505,7 +2505,9 @@ bool Chainstate::FlushStateToDisk(
LOG_TIME_MILLIS_WITH_CATEGORY("write block and undo data to disk", BCLog::BENCH);

// First make sure all block and undo data is flushed to disk.
m_blockman.FlushBlockFile();
if (!m_blockman.FlushBlockFile()) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #27866 (comment)

I think this return false should be removed and replaced by a log print and comment, at least in this PR. Maybe a followup PR could try to change behavior in a bigger way, but it's not clear here that the intended change of returning false is here actually good, and that there aren't other unintended effects.

If a FlatFileSeq::Flush call fails, effectively meaning that an fflush or ftruncate call failed on block file (or undo file after the next commit), it's not clear to me that the new behavior of shutting down right away without trying to save state to disk is better than the old behavior of updating the block index, pruning old block files (which may free up disk space), and updating the coins database. This appears to be the intended change, and it does not seem to be a clear win. In general if there is some low level filesystem error which might be transient or spurious, it seems better to me for an application to handle it by at least trying to save its state than to just give up immediately and throw away data that's still in memory.

I'm also concerned about possible unintended effects of this change because FlushStateToDisk is called so many places, that it's not clear what other effects are outside of this function if it now returns false instead of true.

I like all the other changes in this PR of adding more return information and adding comments explaining error how errors are handled. And I wouldn't be even be surprised if it turns out the new behavior here is actually better than the old behavior in relevant cases. But I think if behavior is going to change, it should really be motivated by something specific like a bug report or test case, and ideally it should happen in a separate PR that doesn't mostly consist of code cleanup.

Copy link
Member
@dergoegge dergoegge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 7721ab9

@@ -2505,7 +2505,9 @@ bool Chainstate::FlushStateToDisk(
LOG_TIME_MILLIS_WITH_CATEGORY("write block and undo data to disk", BCLog::BENCH);

// First make sure all block and undo data is flushed to disk.
m_blockman.FlushBlockFile();
if (!m_blockman.FlushBlockFile()) {
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw I think it is fine to change the behavior here but also don't mind if it's done in a follow-up. To me this just seems like a weird quirk given that we do return a fatal error for other flush operations.

I think, a failed flush opens the node up to be crashed in any case, e.g.:

bitcoin/src/net_processing.cpp

Lines 2217 to 2219 in 5aa67eb

if (!m_chainman.m_blockman.ReadBlockFromDisk(*pblockRead, *pindex)) {
assert(!"cannot load block from disk");
}

Cleanly shutting down seems preferable.

@TheCharlatan
Copy link
Contributor Author

Putting this back in draft while I address the change requests here.

@TheCharlatan TheCharlatan marked this pull request as draft August 29, 2023 19:37
A false return value indicates a fatal error (disk space being too low),
so make sure we always consume this error code.

This commit is part of an ongoing process for making the handling of
fatal errors more transparent and easier to understand.
By returning an error code if `FlushBlockFile` fails, the caller now has
to explicitly handle block file flushing errors. Before this change
such errors were non-explicitly ignored without a clear rationale.

Prior to this patch `FlushBlockFile` may have failed silently in
`Chainstate::FlushStateToDisk`. Improve this with a log line. Also add a
TODO comment to flesh out whether returning early in the case of an
error is appropriate or not. Returning early might be appropriate to
prohibit `WriteBlockIndexDB` from writing a block index entry that does
not refer to a fully flushed block.

Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called
by `FindBlockPos`. Don't change the abort behavior there, since we don't
want to fail the function if the flushing of already written blocks
fails. Instead, just document it.
By returning an error code if either `FlushUndoFile` or `FlushBlockFile`
fail, the caller now has to explicitly handle block undo file flushing
errors. Before this change such errors were non-explicitly ignored
without a clear rationale.

Besides the call to `FlushUndoFile` in `FlushBlockFile`, ignore its
return code at its call site in `WriteUndoDataForBlock`. There, a failed
flush of the undo data should not be indicative of a failed write.

Add [[nodiscard]] annotations to `FlushUndoFile` such that its return
value is not just ignored in the future.
@TheCharlatan
Copy link
Contributor Author

Updated 7721ab9 -> d8041d4 (kernelReturnOnAbort_3 -> kernelReturnOnAbort_4, compare)

  • Removed early return in FlushStateToDisk, replaced it with a TODO comment and a log line.
  • Addressed @jonatack's comment, added severity-based logging.
  • Addressed @stickies-v's comment, splitting up long log line.

@TheCharlatan TheCharlatan marked this pull request as ready for review September 1, 2023 07:22
@stickies-v
Copy link
Contributor

re-ACK d8041d4

I was okay with the validation logic change that we had before, but as I don't have a lot of familiarity with the complexities of FlushStateToDisk I am definitely okay with ryanofsky's suggestion to keep that for a later, more focused improvement and making this pretty much a no-brainer PR.

@DrahtBot DrahtBot requested a review from dergoegge September 1, 2023 11:39
@naumenkogs
Copy link
Member

Concept ACK

Copy link
Contributor
@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK d8041d4. Thanks for making suggested changes. This is now an easier PR to review since it is only adding new return codes, not changing existing ones, and no longer changing the way flush errors are handled other than by adding log prints.

If I could make one more suggestion, it would be to switch the order of the second and third commits so the FlushBlockFile return value doesn't get added with some initial behavior in the second commit (f0207e0) which is later modified in the third commit (d8041d4). If the commit order were switched, the FlushBlockFile return value could just be determined one way with no code churn.

It may be a good idea to follow up this PR with more changes that change error handling beyond adding log prints and return codes, but since FlushStateToDisk is called so many places, I think changes like these are harder to review, so it would be better for them to happen in a dedicated PR.

Copy link
Contributor
@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK d8041d4

There should be no change in behavior other than logging here. Just to remind my self why, the reason is that it mostly adds new return values and doesn't change existing ones. The only case where it does change a return value is in FlushBlockFile in the third commit. But there is only one call to FlushBlockFile, and if that fails it just triggers a log print.

@DrahtBot DrahtBot requested review from ryanofsky and removed request for ryanofsky September 29, 2023 16:51
@ryanofsky ryanofsky merged commit f562856 into bitcoin:master Sep 29, 2023
Frank-GER pushed a commit to syscoin/syscoin that referenced this pull request Oct 5, 2023
d8041d4 blockstorage: Return on fatal undo file flush error (TheCharlatan)
f0207e0 blockstorage: Return on fatal block file flush error (TheCharlatan)
5671c15 blockstorage: Mark FindBlockPos as nodiscard (TheCharlatan)

Pull request description:

  The goal of this PR is to establish that fatal blockstorage flush errors should be treated as errors at their call site.

  Prior to this patch `FlushBlockFile` may have failed without returning in `Chainstate::FlushStateToDisk`, leading to a potential write from `WriteBlockIndexDB` that may refer to a block that is not fully flushed to disk yet. By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Add `[[nodiscard]]` annotations to them such that they are not ignored in future.

  Functions that call either `FlushUndoFile` or `FlushBlockFile`, need to handle these extra abort cases properly. Since `Chainstate::FlushStateToDisk` already produces an abort error in case of `WriteBlockIndexDB` failing, no extra logic for functions calling `Chainstate::FlushStateToDisk` is required.

  Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called by `FindBlockPos`, while `FlushUndoFile` is only called by `FlushBlockFile` and `WriteUndoDataForBlock`. For both these cases, the flush error is not further bubbled up. Instead, the error is logged and a comment is provided why bubbling up an error would be less desirable in these cases.

  ---

  This pull request is part of a larger effort towards improving the shutdown / abort / fatal error handling in validation code. It is a first step towards implementing proper fatal error return type enforcement similar as proposed by theuni in this pull request [comment](bitcoin#27711 (comment)). For ease of review of these critical changes, a first step would be checking that `AbortNode` leads to early and error-conveying returns at its call site. Further work for enforcing returns when `AbortNode` is called is done in bitcoin#27862.

ACKs for top commit:
  stickies-v:
    re-ACK d8041d4
  ryanofsky:
    Code review ACK d8041d4

Tree-SHA512: 47ade9b873b15e567c8f60ca538d5a0daf32163e1031be3212a3a45eb492b866664b225f2787c9e40f3e0c089140157d8fd1039abc00c7bdfeec1b52ecd7e219
@bitcoin bitcoin locked and limited conversation to collaborators Sep 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Status: Done or Closed or Rethinking
Development

Successfully merging this pull request may close these issues.

9 participants
0