healing corrupted chunks ("wrong header") · Issue #352 · moosefs/moosefs · GitHub

healing corrupted chunks ("wrong header") #352

Closed
onlyjob opened this issue Mar 27, 2020 · 11 comments
Labels
confirmed bug · data safety · documentation · fix committed / please test

Comments

@onlyjob
Contributor
onlyjob commented Mar 27, 2020

Something strange happened. A chunkserver started to report a large number of errors, all being chunk_readcrc: ... - wrong header. There are no other errors whatsoever, and the affected chunk files are present on all local HDDs (on different file systems, e.g. ext4 and f2fs). All damaged chunk files are empty (filled with 0x00), but their size varies, the smallest being 73728 bytes.

I've identified a few thousand corrupted files, all empty, all created within 10 minutes, between 16:10 and 16:20 on March 25 (the chunkserver was on 3.0.111 at the time). I'm investigating the hardware but so far could not find anything to blame on the affected machine. I suspect the incident could have been caused by networking hardware (which I'm guessing would be unlikely).

Regardless of how those chunk files became corrupted, the problem seems to be that they are not being repaired or removed. Even some time after yet another corrupted file is reported (they are regularly discovered by background testing), the files remain there, still empty and not even removed.

I suspect that auto-repair of damaged chunk files might be broken. Could you investigate, please?
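
For illustration (not part of the original report), a minimal Python sketch of how such zero-filled chunk files could be located, assuming the standard /var/lib/mfs data directory and the 10-minute window mentioned above:

#!/usr/bin/env python3
# Illustrative sketch (not from the original report): walk a chunkserver data
# directory and list chunk_*.mfs files that are entirely zero-filled and were
# modified within the 10-minute window mentioned above. The data directory
# path and the window are assumptions.
import fnmatch
import os
from datetime import datetime

DATA_DIR = "/var/lib/mfs"                          # assumed chunkserver data root
START = datetime(2020, 3, 25, 16, 10).timestamp()  # window taken from the report
END = datetime(2020, 3, 25, 16, 20).timestamp()

def is_zero_filled(path, blocksize=1 << 20):
    # Return True if every byte of the file is 0x00 (also True for empty files).
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                return True
            if block.count(0) != len(block):
                return False

suspects = []
for root, _dirs, files in os.walk(DATA_DIR):
    for name in fnmatch.filter(files, "chunk_*.mfs"):
        path = os.path.join(root, name)
        st = os.stat(path)
        if START <= st.st_mtime <= END and is_zero_filled(path):
            suspects.append((path, st.st_size))

for path, size in sorted(suspects):
    print(f"{size:>10}  {path}")
print(f"{len(suspects)} zero-filled chunk files found")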

@oszafraniec

This is strange and reminds me of issue #106. Wrong header, file empty AFTER repair, all caused by HW problems...

@onlyjob
Contributor Author
onlyjob commented Mar 27, 2020

I don't see a similarity with #106. I'm not sure if repair is happening. Healthy copies are available from other chunkservers and no missing chunks are reported.

@chogata
Member
chogata commented Mar 27, 2020

The time needed to scan all chunks on a single chunkserver is measured in days or even weeks if you have a lot of chunks (with the default configuration). It's a slow process, because it is meant to be a lowest-priority background maintenance process, "just in case". In normal circumstances, chunks that just "lie there" on a disk should not get corrupted.

Physical chunk copies that are found to be bad are replicated: the faulty copy is marked as such, and from that moment the chunk is considered undergoal and is replicated via the regular MooseFS mechanisms. The faulty copies are deleted after the replication is finished.

Of course, if you try to read from a chunk that has a not-yet-discovered bad copy, it will be discovered immediately, "out of order" with respect to this regular maintenance check.
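
To get a feel for the time scale involved, a back-of-envelope estimate (an illustration only; the HDD_TEST_FREQ parameter name in mfschunkserver.cfg and the one-chunk-per-interval-per-disk interpretation are assumptions and may differ between MooseFS versions):

# Back-of-envelope estimate of a full background scan of all chunks on one
# chunkserver. HDD_TEST_FREQ (seconds between tests of a single chunk per
# disk) is an assumption about mfschunkserver.cfg; the name and semantics
# may differ between MooseFS versions.
chunks_on_server = 5_000_000   # hypothetical chunk count
disks = 10                     # hypothetical number of HDDs
hdd_test_freq = 10             # assumed seconds between chunk tests per disk

seconds = chunks_on_server / disks * hdd_test_freq
print(f"full scan would take roughly {seconds / 86400:.1f} days")   # ~57.9 days here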

@onlyjob
Contributor Author
onlyjob commented Mar 27, 2020

Yes, I understand how it should work. But 8 hours have passed, with a new corrupted chunk discovered every few minutes, yet none of the files have been fixed or removed. I don't see undergoal chunks in the CGI either...
Why are corrupted chunk files not removed immediately? How can I tell that repair is working? Which command can I use to get the number of replicas for a chunk if all I know about it is its file name?
Thanks.
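
Side note (not an answer given in the thread): for a known MooseFS path, mfsfileinfo lists its chunks and which chunkservers hold copies. Going the other way, from a chunk file on disk to its chunk ID, the ID and version are hex-encoded in the file name, as this small Python sketch (the naming pattern is inferred from the log lines later in this thread) shows:

# Minimal sketch: extract the chunk ID and version from a chunk file name
# such as chunk_000000001F6E7ABF_00000001.mfs. The 16-hex-digit ID and
# 8-hex-digit version pattern is inferred from the log lines in this thread.
import re

CHUNK_NAME = re.compile(r"chunk_([0-9A-F]{16})_([0-9A-F]{8})\.mfs$")

def parse_chunk_name(name):
    # Return (chunk_id, version) as integers, or None if the name doesn't match.
    m = CHUNK_NAME.search(name)
    if not m:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)

cid, ver = parse_chunk_name("chunk_000000001F6E7ABF_00000001.mfs")
print(f"chunk id 0x{cid:016X}, version {ver}")   # chunk id 0x000000001F6E7ABF, version 1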

@onlyjob
Contributor Author
onlyjob commented Mar 27, 2020

As far as I'm concerned, not removing corrupted chunk files during testing is a bug. Why delay removal if it is already confirmed that the chunk file is invalid?

@chogata
Member
chogata commented Mar 27, 2020

As far as I'm concerned, not removing corrupted chunk files during testing is a bug. Why delay removal if it is already confirmed that the chunk file is invalid?

Because "invalid" may mean that the whole chunk is zeroed or that two bytes are wrong. In the former case the file is of course useless; in the latter case, if another copy of this chunk is also suddenly found missing or invalid, this copy might yet be repaired manually and some very, very important data might be saved. And, before you ask, no, we won't employ any algorithms to judge "how big" the damage is - this is a file system that is supposed to be fast.

Now, how long this may take from discovery to deletion: first, the copy is marked as invalid and an undergoal situation appears. This will be dealt with quickly if your system is not extremely busy otherwise, because undergoal replications have high priority (and endangered ones even higher, which will be the case if you normally keep stuff in 2 copies). So unless you sit there reloading the CGI, or have some script checking via the CLI, you may not notice the undergoal state before it is replicated.

But the invalid copy is still there and will be marked for deletion in the next general check loop in the master. If you have a big installation, this loop may be quite long (see CGI, info tab, check loop start time / check loop end time). After it is marked for removal, it will be deleted subject to deletion limits, of course, so if your system deletes a lot of files, that may also take a while. So, on a big installation, 8 hours from discovery to deletion might not be enough.
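
Restating the lifecycle described above as a rough model (an illustration only; the state names are invented for clarity and do not mirror the actual master internals):

# Rough model of the copy lifecycle described above; the state names are
# invented for clarity and do not mirror the actual master internals.
from enum import Enum, auto

class CopyState(Enum):
    VALID = auto()
    MARKED_INVALID = auto()        # bad copy reported by the chunkserver
    AWAITING_REPLICATION = auto()  # chunk is undergoal; replication has priority
    MARKED_FOR_DELETION = auto()   # flagged during the next master check loop
    DELETED = auto()               # removed, subject to deletion limits

LIFECYCLE = [CopyState.MARKED_INVALID, CopyState.AWAITING_REPLICATION,
             CopyState.MARKED_FOR_DELETION, CopyState.DELETED]

for state in LIFECYCLE:
    print(state.name)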

@onlyjob
Contributor Author
onlyjob commented Mar 28, 2020

Of course I would not ask to "judge how big is the damage", that would be silly.

But I'm asking to remove corrupted files immediately, which is how LizardFS does it, and that is the right way to handle corrupted files. You could implement a config option for those who want to leave junk/corrupted data around.

It's been over 24 hours, yet there have still been no removals or repairs as far as I can see.
No undergoal chunks are visible in the CGI, which is most certainly wrong.

@onlyjob
Contributor Author
onlyjob commented Mar 30, 2020

Three days have passed; the corrupted files are still there, unrepaired, in the same quantity.

The chunkserver logs the same errors over and over:

Mar 27 22:31:44 deblab mfschunkserver[92098]: chunk_readcrc: file:/var/lib/mfs/04/f2fs//53/chunk_000000001F6E7ABF_00000001.mfs - wrong header
Mar 27 22:31:44 deblab mfschunkserver[92098]: hdd_io_begin: file:/var/lib/mfs/04/f2fs//53/chunk_000000001F6E7ABF_00000001.mfs - read error: Success (errno=0)

Mar 29 05:21:44 deblab mfschunkserver[92098]: chunk_readcrc: file:/var/lib/mfs/04/f2fs//53/chunk_000000001F6E7ABF_00000001.mfs - wrong header
Mar 29 05:21:44 deblab mfschunkserver[92098]: hdd_io_begin: file:/var/lib/mfs/04/f2fs//53/chunk_000000001F6E7ABF_00000001.mfs - read error: Success (errno=0)

Mar 30 12:15:57 deblab mfschunkserver[92098]: chunk_readcrc: file:/var/lib/mfs/04/f2fs//53/chunk_000000001F6E7ABF_00000001.mfs - wrong header
Mar 30 12:15:57 deblab mfschunkserver[92098]: hdd_io_begin: file:/var/lib/mfs/04/f2fs//53/chunk_000000001F6E7ABF_00000001.mfs - read error: Success (errno=0)

Repair is definitely not working.
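
For reference, a small Python sketch (not from the thread) that counts how many distinct chunk files keep reappearing with this error, assuming the syslog format shown above and a default syslog location:

# Illustrative sketch: count distinct chunk files repeatedly hitting the
# "wrong header" error in the chunkserver log. The log location and line
# format are assumptions based on the excerpts above.
import re
from collections import Counter

LOG = "/var/log/syslog"   # assumed log location
PATTERN = re.compile(r"chunk_readcrc: file:(\S+) - wrong header")

hits = Counter()
with open(LOG, errors="replace") as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            hits[m.group(1)] += 1

for path, count in hits.most_common(10):
    print(f"{count:>4}  {path}")
print(f"{len(hits)} distinct chunk files reported")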

@acid-maker
Member

But I'm asking to remove corrupted files immediately, which is how LizardFS does it, and that is the right way to handle corrupted files. You could implement a config option for those who want to leave junk/corrupted data around.

The simple answer is no. I don't want to remove them immediately (and Lizard doesn't do that either - I know, because basically everything it does, it does because I programmed it like that - they didn't change such things). The correct way is to inform the master about the corruption and let it do the rest. Usually removing the data is exactly what the master will do (in something like 99.99% of cases), but if ALL copies are corrupt then it is better to leave them around than to remove everything like a stupid monkey.

The problem you observed is just the result of a stupid bug that I'm going to fix very soon.

BTW I've just checked the Lizard code and they have the same bug, so please never again tell me that Lizard does anything better than MooseFS.

@acid-maker added the confirmed bug, data safety, documentation and fix committed / please test labels on Mar 30, 2020
@onlyjob
Contributor Author
onlyjob commented Mar 30, 2020

I'm glad to be wrong about LizardFS and it is awesome that you've found the problem, despite initial disbelief. I'll test soon. Thank you, Jakub.
I agree with you that immediate removal may be unnecessary, as long as repair works correctly and efficiently.

@onlyjob
Contributor Author
onlyjob commented Mar 30, 2020

1d62a41 appears to work well: corrupted chunk files are being removed only minutes after discovery. Thank you for the quick response, @acid-maker.

@onlyjob onlyjob closed this as completed Mar 30, 2020