Client does not wait for unavailable chunks, not respecting mfsioretries
[serious] [incident]
#322
Comments
More interesting details about this incident: at the time, one chunkserver was still scanning one HDD. For 20 minutes I was scratching my head, wondering what could have paralyzed the entire cluster; then the scan finished and everything returned to normal at once. Interestingly, the HDD that is slow to scan holds a large number of chunks (12-20 million, so naturally scanning takes hours), and it is assigned exclusively to an archival chunkserver isolated from the rest of the cluster by label and by a storage class that segregates active data (mostly sitting on SSDs, with some fast rotational storage) from archival data. As you can imagine, nothing in the cluster was using data from the archival chunkserver, and that data was available from another active chunkserver anyway due to replication level 2. Apparently the SSD-based chunkservers were blocked by the scanning of an HDD on a chunkserver that is completely unrelated and unnecessary for all affected applications and storage classes. The problem may have something to do with the master blocking re-connection of chunks.
This is a reproducible issue. On a perfectly responsive cluster, restarting one chunkserver with a slow-scanning HDD causes some clients/applications to freeze on I/O for the duration of the initial scan. There are two crucial observations:
Yes. Thanks a lot - we were able to reproduce this. We are working on a fix.
…ailable chunks or no space status (issue #322)
All good, thanks. Looks like the problem is fixed. Closing...
I had a serious outage on several MooseFS mounts today. Due to emergency power maintenance in a rack (replacement of an automatic transfer switch, ATS), I had to gracefully stop two chunkserver nodes at the same time, temporarily losing availability of some data.
Chunkservers were down for only a few minutes (and they were in temporary maintenance mode), yet even hours later clients had not recovered.
My FUSE3 mounts are configured with mfsioretries=444, which gives plenty of time to handle such situations. Unfortunately, MooseFS just logged several lines like the following and gave up(!), so even an hour later applications were still frozen and unresponsive.
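For context, a minimal sketch of the kind of mount described here; the mount point and master host name are assumed placeholders, and only the mfsioretries value comes from this report:

    # hypothetical mount point and master host; mfsioretries taken from the report above
    mfsmount /mnt/mfs -H mfsmaster -o mfsioretries=444

The expectation with such a high retry limit is that the client keeps retrying through a brief chunkserver outage instead of returning an I/O error to the application.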
It is especially frustrating to me because LizardFS handles such situations gracefully, retrying up to the configured mfsioretries limit (with adequate logging), with complete recovery after a brief unavailability of data. This is a very serious issue that could easily have been triggered by a temporary disruption of connectivity between clients and chunkservers (e.g. a switch reboot).
I believe I had exactly that kind of incident before but did not realise the nature of the problem at the time.
I recommend investigating this issue with the highest priority. Thanks.