
Client does not wait for unavailable chunks, not respecting mfsioretries [serious] [incident] #322

Closed
onlyjob opened this issue Jan 9, 2020 · 4 comments
Comments

@onlyjob
Contributor
onlyjob commented Jan 9, 2020

I had a serious outage on several MooseFS mounts today. Due to emergency power maintenance in a rack (replacement of an automatic transfer switch, ATS), I had to gracefully stop two chunkserver nodes at the same time, temporarily losing availability of some data.

The chunkservers were down for only a few minutes (and they were in temporary maintenance mode), yet even hours later the clients had not recovered.

My FUSE3 mounts are configured with mfsioretries=444, which gives plenty of time to handle such situations. Unfortunately, MooseFS just logged several lines like the following:

mfsmount[1147]: file: 5759521, index: 0, chunk: 578318316, version: 1 - there are no valid copies

and gave up(!), so even an hour later applications were still frozen and unresponsive.
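
For reference, mfsioretries is a client-side mfsmount option. A minimal sketch of how my mounts are set up (the master host name and mount point below are placeholders, not my actual values):

    # mfsioretries is a standard mfsmount option (number of retries of I/O
    # operations before the client gives up); host and path are placeholders.
    mfsmount -H mfsmaster.example.net -o mfsioretries=444 /mnt/mfs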

It is especially frustrating to me because LizardFS handles such situations gracefully, retrying up to the configured mfsioretries limit (with adequate logging), with complete recovery after a brief unavailability of data.

This is a very serious issue that could easily be triggered by a temporary disruption of connectivity between clients and chunkservers (e.g. a reboot of a switch).
I believe I had exactly that kind of incident before but did not realise the nature of the problem at the time.

I recommend investigating this issue with the utmost urgency. Thanks.

@onlyjob
Contributor Author
onlyjob commented Jan 9, 2020

More interesting details about this incident:
Impatiently, I rebooted one computer and found that applications using data on MooseFS were unresponsive, this time without logging anything at all, which suggests that the mfsioretries logging is broken.

At the time, one chunkserver was still scanning one HDD. After 20 minutes, while I was scratching my head wondering what could have paralyzed the entire cluster, the scanning finished and everything returned to normal at once.

Interestingly, the HDD that is slow to scan holds a large number of chunks (12 to 20 million, so naturally scanning takes hours) and is assigned exclusively to an archival chunkserver, isolated from the rest of the cluster by a label and a storage class that segregate active data (mostly sitting on SSDs, with some fast rotational storage) from archival data. As you can imagine, nothing in the cluster was using data from the archival chunkserver, and that data was available from another active chunkserver anyway due to replication level 2. Apparently the SSD-based chunkservers were blocked by the scanning of an HDD on a chunkserver that is completely unrelated and unnecessary for all affected applications and storage classes.
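
To illustrate the segregation (the label name and file path are illustrative, not my real configuration): the archival node carries a label in mfschunkserver.cfg, and a storage class created with mfsscadmin keeps archival chunks only on servers with that label, so the active (SSD) data should never depend on that node.

    # /etc/mfs/mfschunkserver.cfg on the archival node only;
    # LABELS is a standard chunkserver option, the label name is illustrative.
    LABELS = ARCHIVE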

The problem may have something to do with blocking re-connection of chunks by the master.

@onlyjob
Contributor Author
onlyjob commented Jan 12, 2020

This is a reproducible issue. On a perfectly responsive cluster, restarting one chunkserver with a slow-scanning HDD causes some clients/applications to freeze on I/O for the duration of the initial scanning. There are two crucial observations:

  • Unrelated data is blocked. The chunkserver with the slow-scanning HDD holds only archived chunks for unrelated storage classes. There is no reason to block I/O everywhere.

  • Only the initial scanning is blocking. No I/O blocking happens during scanning when the chunkserver is started with the slow HDD commented out in mfshdd.cfg and then un-commented/re-loaded (see the sketch below). This behaviour clearly exposes a bug that should be easy enough to fix.
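
For anyone hitting this before a fix is released, a rough sketch of that workaround (the disk path is a placeholder; you can of course edit mfshdd.cfg by hand instead):

    # Start the chunkserver with the slow disk commented out in mfshdd.cfg,
    # then re-enable the disk and reload so the rescan runs in the background.
    sed -i 's|^/mnt/archive-hdd|#/mnt/archive-hdd|' /etc/mfs/mfshdd.cfg
    mfschunkserver start
    sed -i 's|^#/mnt/archive-hdd|/mnt/archive-hdd|' /etc/mfs/mfshdd.cfg
    mfschunkserver reload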

@acid-maker
Member

Yes, thanks a lot. We were able to reproduce this and are working on a fix.

acid-maker added a commit that referenced this issue Jan 22, 2020
@borkd added the confirmed bug label Jan 29, 2020
@onlyjob
Contributor Author
onlyjob commented Mar 30, 2020

All good, thanks. Looks like the problem is fixed. Closing...
