
Client does not wait for unavailable chunks, not respecting mfsioretries [serious] [incident] #322

Closed
onlyjob opened this issue Jan 9, 2020 · 4 comments
Comments

@onlyjob
Contributor
onlyjob commented Jan 9, 2020

I had a serious outage on several MooseFS mounts today. Due to emergency power maintenance in a rack (replacement of an automatic transfer switch, ATS), I had to gracefully stop two chunkserver nodes at the same time, temporarily losing availability of some data.

The chunkservers were down for only a few minutes (and they were in temporary maintenance mode), yet even hours later the clients had not recovered.

My FUSE3 mounts are configured with mfsioretries=444, which gives plenty of time to handle such situations. Unfortunately, MooseFS just logged several lines like the following:

mfsmount[1147]: file: 5759521, index: 0, chunk: 578318316, version: 1 - there are no valid copies

and gave up(!), so even an hour later applications were still frozen and unresponsive.
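
For reference, mfsioretries is a client-side mfsmount option. A minimal sketch of how my mounts are set up (the master host name and mount point below are placeholders, not my actual values):

    # mfsioretries is a standard mfsmount option (number of retries of I/O
    # operations before the client gives up); host and path are placeholders.
    mfsmount -H mfsmaster.example.net -o mfsioretries=444 /mnt/mfs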

It is especially frustrating to me because LizardFS handles such situations gracefully, retrying up to the configured mfsioretries limit (with adequate logging), with complete recovery after a brief unavailability of data.

This is a very serious issue that could easily be triggered by a temporary disruption of connectivity between clients and chunkservers (e.g. a reboot of a switch).
I believe I had exactly that kind of incident before but did not realise the nature of the problem at the time.

I recommend investigating this issue with the utmost urgency. Thanks.

@onlyjob
Contributor Author
onlyjob commented Jan 9, 2020

More interesting details about this incident:
Impatiently, I rebooted one computer and found that applications using data on MooseFS were unresponsive, this time without logging anything at all, which suggests that the mfsioretries logging is broken.

At the time, one chunkserver was still scanning one HDD. After 20 minutes, while I was scratching my head wondering what could have paralyzed the entire cluster, the scanning finished and everything returned to normal at once.

Interestingly, the HDD that is slow to scan holds a large number of chunks (12 to 20 million, so naturally scanning takes hours) and is assigned exclusively to an archival chunkserver, isolated from the rest of the cluster by a label and a storage class that segregate active data (mostly sitting on SSDs, with some fast rotational storage) from archival data. As you can imagine, nothing in the cluster was using data from the archival chunkserver, and that data was available from another active chunkserver anyway due to replication level 2. Apparently the SSD-based chunkservers were blocked by the scanning of an HDD on a chunkserver that is completely unrelated and unnecessary for all affected applications and storage classes.
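
To illustrate the segregation (the label name and file path are illustrative, not my real configuration): the archival node carries a label in mfschunkserver.cfg, and a storage class created with mfsscadmin keeps archival chunks only on servers with that label, so the active (SSD) data should never depend on that node.

    # /etc/mfs/mfschunkserver.cfg on the archival node only;
    # LABELS is a standard chunkserver option, the label name is illustrative.
    LABELS = ARCHIVE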

The problem may have something to do with blocking re-connection of chunks by the master.

@onlyjob
Contributor Author
onlyjob commented Jan 12, 2020

This is a reproducible issue. On a perfectly responsive cluster, restarting one chunkserver with a slow-scanning HDD causes some clients/applications to freeze on I/O for the duration of the initial scanning. There are two crucial observations:

  • Unrelated data is blocked. The chunkserver with the slow-scanning HDD holds only archived chunks for unrelated storage classes. There is no reason to block I/O everywhere.

  • Only the initial scanning is blocking. No I/O blocking happens during scanning when the chunkserver is started with the slow HDD commented out in mfshdd.cfg and then un-commented/re-loaded (see the sketch below). This behaviour clearly exposes a bug that should be easy enough to fix.
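
For anyone hitting this before a fix is released, a rough sketch of that workaround (the disk path is a placeholder; you can of course edit mfshdd.cfg by hand instead):

    # Start the chunkserver with the slow disk commented out in mfshdd.cfg,
    # then re-enable the disk and reload so the rescan runs in the background.
    sed -i 's|^/mnt/archive-hdd|#/mnt/archive-hdd|' /etc/mfs/mfshdd.cfg
    mfschunkserver start
    sed -i 's|^#/mnt/archive-hdd|/mnt/archive-hdd|' /etc/mfs/mfshdd.cfg
    mfschunkserver reload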

@acid-maker
Member

Yes, thanks a lot. We were able to reproduce this and are working on a fix.

acid-maker added a commit that referenced this issue Jan 22, 2020
@borkd added the confirmed bug label Jan 29, 2020
@onlyjob
Contributor Author
onlyjob commented Mar 30, 2020

All good, thanks. Looks like the problem is fixed. Closing...
