-
Notifications
You must be signed in to change notification settings - Fork 9k
HDFS-16942. Send error to datanode if FBR is rejected due to bad lease #5460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💔 -1 overall
This message was automatically generated. |
Not sure what to do about this checkstyle error:
If I add the file:
I get an enforcer error:
|
@@ -791,6 +792,9 @@ private void offerService() throws Exception { | |||
shouldServiceRun = false; | |||
return; | |||
} | |||
if (InvalidBlockReportLeaseException.class.getName().equals(reClass)) { | |||
fullBlockReportLeaseId = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sodonnel Thanks for your works here. Do we also need set forceFullBlockReport
to true here? Otherwise, datanode will send block report at next 6 hour by default, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At line 717, we can see where it attempts to get a lease from the heartbeat if the lease in the DN == 0:
boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
scheduler.isBlockReportDue(startTime);
So its the isBlockReportDue that controls this. Then later, if we have a non zero lease, it will try to create the block report:
boolean forceFullBr =
scheduler.forceFullBlockReport.getAndSet(false);
if (forceFullBr) {
LOG.info("Forcing a full block report to " + nnAddr);
}
if ((fullBlockReportLeaseId != 0) || forceFullBr) {
cmds = blockReport(fullBlockReportLeaseId);
fullBlockReportLeaseId = 0;
}
Its really the isBlockReportDue()
method that controls whether a new one should be sent of not, and that is based on time since the last one. The the blockReport()
, it updates the time after a successful block report, but if it gets an exception, like this change causes, it will not update the time and so it will try again on the next heartbeat if it gets a new lease.
I think forceFullBlockReport
is only for tests, or the command to force a DN block from the CLI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do you have any idea about fixing the checkstyle issue? As I mentioned above, trying to add a package-info.java file broke my compile locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we don't expect to reach here frequently (hopefully datanode is able to acquire lease successfully most of the times), perhaps we can add one liner log to indicate that the particular FBR went through this trouble (i.e. log report id and lease id)? (just in case if it helps debug further)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe reportId
and leaseId
can be added as constructor argument to InvalidBlockReportLeaseException
. This way RemoteException in offerService
log will likely print them anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its really the isBlockReportDue() method that controls whether a new one should be sent of not, and that is based on time since the last one. The the blockReport(), it updates the time after a successful block report, but if it gets an exception, like this change causes, it will not update the time and so it will try again on the next heartbeat if it gets a new lease.
Thanks for the detailed explain. Make sense to me.
perhaps we can add one liner log to indicate that the particular FBR went through this trouble (i.e. log report id and lease id)
+1 from my side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the report and lease ID into the exception message, so it will get logged as part of the exception.
Still stuck on that checkstyle violation :-(
🎊 +1 overall
This message was automatically generated. |
I believe this could be avoided by excluding it from either of the poms:
For instance, the above patch might help. I haven't tested this but I guess it might work. |
@sodonnel also I am curious, was it just a specific log (only one case e.g. the lease was expired) or combination of logs from I wonder if |
In the examples I saw, its was expired leases that caused the problem. However the namenode was under significant pressure when it happened. In one example, it was actually the SBNN which was rejecting the reports. Tailing the edits was taking frequent long locks (over 300 seconds at time) which was beyond the lease expiry. In another example, it was the ANN after startup. I am not sure, but I think the system perhaps out of safemode with many block reports still outstanding, and then between under replication and IBRs, contention on the NN lock seemed to block the FBRs until the lease expired. |
@virajjasani This isn't working. I added what you suggested, but still the same error. I also don't understand why this style warning is appearing, as its not a new package folder, and it has plenty of other classes in it already. |
I think it's appearing because it was not resolved earlier or it was ignored. For instance, I see the last time the module was touched was by this PR #4252 and in the last QA result, I see checkstyle issues, but of course since the PR is quite old, we can't see what the error was (it reports 404 for the checkstyle report page).
Strange, I think this one should hopefully work:
|
💔 -1 overall
This message was automatically generated. |
@virajjasani Thanks! That has fixed the compile. I just pushed the change, so hopefully we will get a green build this time. |
💔 -1 overall
This message was automatically generated. |
The reported failures are just known flakes, not related. +1 (non-binding) from my side, thank you @sodonnel @Hexiaoqiao |
@Hexiaoqiao Are you happy with the latest revision to give a +1 so I can commit? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1 from my side. The failed unit tests seems not related with this PR.
Thanks @sodonnel and @virajjasani for your works.
This wasn't the fix: #5460 (comment), rather it broke the javadoc build.
Now the original problem
And the fix went in the direction this enforcer has gone crazy, lets filter this file itself, but that poor fellow wasn't doing anything wrong :) Check the file-1 Now your added File-2 One was already there in hdfs-client, now this got added for the same package on hdfs as well. Why same package on client and hdfs jar, I think all of us here know those reasons, so not getting into that... @sodonnel can you delete this new package-info.java file. And we can fix the build post writing "Checkstyle warnings is irrelevant/unavoidable" |
Or, we just need to remove |
That way in future also, if anyone makes changes in hadoop-hdfs module's hdfs/server/protocol, then jenkins won't given any redundant warning? |
Do you want just a followup PR to remove the file, or is just removing the |
Just remove the file and the filter. no need to have dupes. many package-info must be having those IA.private.
that's what all of us do :) |
Ah, looks like we have IA annotations in some and not in others. Looks like removing might be better. Thanks! I wonder how do we deal with it in future though? When someone makes changes only in hadoop-hdfs module's protocol package, the checkstyle failure will unnecessarily get attention? |
But will the checkstyle not affect all future PRs? I think I have seen that happen before, where a new checkstyle issue comes in, and then future PRs are impacted by it, but I may be wrong. I will remove the file and see what it does. |
Correct, that's what I was wondering :) |
How? You folks mean the new PR will also show this checstyle warning? So, our checkstyle works by computing diff b/w the trunk and the patch and flags when the number changes. if you check the warning above:
There were already 79, the PR had 2 more, so that is what it flags @sodonnel Ozone operates in different way to figure out checkstyle, may be you would have seen there, it would be true there most probably |
Correct, I am only worried about the amount of time folks will end up wasting in future for such warning (everytime new file is added or perhaps even existing file is modified under the package?):
As was the case with this PR:
|
Posted #5478 to address the problem |
Description of PR
When a datanode sends a FBR to the namenode, it requires a lease to send it. On a couple of busy clusters, we have seen an issue where the DN is somehow delayed in sending the FBR after requesting the least. Then the NN rejects the FBR and logs a message to that effect, but from the Datanodes point of view, it thinks the report was successful and does not try to send another report until the 6 hour default interval has passed.
If this happens to a few DNs, there can be missing and under replicated blocks, further adding to the cluster load. Even worse, I have see the DNs join the cluster with zero blocks, so it is not obvious the under replication is caused by lost a FBR, as all DNs appear to be up and running.
I believe we should propagate an error back to the DN if the FBR is rejected, that way, the DN can request a new lease and try again.
How was this patch tested?
Modified a test and added a new test. Also validated manually by changing the code to always throw the InvalidLease exception to ensure the DN handled it, reset the least ID and retried.