chunkserver: abnormal/disproportional distribution of chunk files #326
I will reiterate: we ran a lot of tests, and not only artificial ones: we introduced this algorithm on several working instances of MooseFS (big and busy ones), and in every case the cumulated distribution (10000 per directory) behaved much better than fair distribution (1 per directory, AKA even distribution). We actually ran those tests at users' requests, because the even distribution was not optimal. Nevertheless, it's an easy change to make the 10000 an option in the chunkserver configuration, so we will do that. Setting it to 1 (which we absolutely do not recommend) will give you even distribution of newly created chunks. This will be introduced together with the bugfix for the distribution value going too far over the 10000 maximum.
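For intuition, here is a toy Python sketch of the two policies as described above; `limit` plays the role of the proposed configuration option (10000 for the cumulated behaviour, 1 for even distribution). This is an illustration of the idea, not the actual chunkserver code.

```python
# Toy model of the two placement policies discussed above -- NOT the actual
# MooseFS chunkserver code. `counts` maps each of the 256 subdirectories
# ("00".."FF") to its current number of chunk files.

def pick_directory(counts: dict[str, int], limit: int = 10000) -> str:
    """Pick the subdirectory for the next newly created chunk.

    With limit > 1 (cumulated policy) the same directory keeps receiving
    chunks until it holds `limit` more than the emptiest one, so writes
    cluster together. With limit == 1 this degenerates to always choosing
    an emptiest directory, i.e. even distribution.
    """
    emptiest = min(counts.values())
    for name in sorted(counts):
        if counts[name] < emptiest + limit:
            return name
    # unreachable when limit >= 1, kept as a safe fallback
    return min(counts, key=counts.get)

# Example: with limit=2 the first directory fills ahead of the others
counts = {"00": 0, "01": 0, "02": 0}
for _ in range(5):
    counts[pick_directory(counts, limit=2)] += 1
print(counts)  # {'00': 2, '01': 2, '02': 1}
```

With `limit=1` every new chunk goes to an emptiest directory (even distribution); with a large `limit`, new chunks cluster in one directory before the next one starts filling.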
Thanks.
I took the liberty of running the check on some of my chunkservers.
In the cluster there are only 107 chunks waiting for deletion.
Maybe there should be a "sane" upper bound? 6-9M chunks in one directory and <10k in all others seems a bit excessive.
Unfortunately this bug is not fixed. On 3.0.112, on a freshly added disk, directory [...] On some other (randomly checked) disks, directory [...] Please reopen.
If you added the new disk to a chunk server that already had an HDD affected by this bug, this may have been "copied", because this condition is not checked when copying chunks between disks in the internal rebalance: a chunk goes to the "same" directory on the new disk when it's moved from the old one. The algorithm only works for chunks created via the MooseFS file system, not for internal copying. I remember you were hit particularly strongly by this bug, so unless you re-distributed all badly distributed chunks manually, this will still persist. Also, remember that duplicates don't count. After we found this bug we created tests that were able to replicate it, and those tests show the new version works as intended. So you need to be able to confirm 100% that you don't propagate the uneven distribution from the old bug and that you don't have duplicates in your system.
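A toy illustration of that propagation: because a disk-to-disk move preserves the subdirectory name, an overfull folder on the old disk produces an equally overfull folder on the new one. The helper below is hypothetical; only the 00..FF subfolder layout follows the MooseFS on-disk convention.

```python
import shutil
from pathlib import Path

def move_chunk(old_disk: Path, new_disk: Path, subdir: str, chunk: str) -> None:
    # The chunk lands in the *same* "00".."FF" folder on the destination,
    # so the destination disk inherits the source disk's skew.
    dst_dir = new_disk / subdir
    dst_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(old_disk / subdir / chunk), str(dst_dir / chunk))
```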
Thanks. Most, if not all, of my chunkservers' HDDs are affected by disproportional distribution of chunk files. Since this bug is considered "fixed", I was surprised to find the same condition on a new HDD... No duplicates involved. What would be the best way to equalise the distribution of chunk files among folders on a chunkserver with many HDDs, all of which are affected by the problem?
We introduced a change for the internal rebalance into the code, so it's gonna start working in the next release and will also distribute chunks evenly when copying disk-to-disk. However, this would require you to insert new disks into all your chunk servers and mark the old ones to be emptied, wait for them to copy data, pull the old disks and clear (format) them to re-use them, etc., until the whole system eventually balances itself. It's gonna be a long process and will require spare disks, although, if you have lots of those lying around, you can do it on several chunk servers at the same time. Much quicker would be to just do it manually (I thought you already did that, but it seems you didn't?): shut down chunk servers one by one, re-distribute chunks manually, and delete .chunkdb before starting up again. For anybody with the same problem, hints on how to do it properly (repeat for each disk):
[...]
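A minimal Python sketch of what such a per-disk pass could look like, assuming the standard layout of 256 subfolders 00..FF under the disk's data path and chunk files named `chunk_*.mfs`; the path argument and the even-share target are assumptions, and this is an illustration, not the maintainers' procedure:

```python
#!/usr/bin/env python3
# Illustrative re-distribution of chunk files across the 256 subfolders of
# ONE disk. Run only while the chunkserver is down. Not an official script.
import shutil
import sys
from pathlib import Path

disk = Path(sys.argv[1])                       # e.g. /mnt/chunks1 (placeholder)
subdirs = [f"{i:02X}" for i in range(256)]
files = {d: sorted((disk / d).glob("chunk_*.mfs")) for d in subdirs}

total = sum(len(v) for v in files.values())
target = total // len(subdirs)                 # even share per directory

surplus = []                                   # files above the target share
for d in subdirs:
    surplus.extend(files[d][target:])

for d in subdirs:                              # top up the underfull folders
    for _ in range(max(0, target - len(files[d]))):
        src = surplus.pop()
        shutil.move(str(src), str(disk / d / src.name))
```

As described above, run it only while the chunkserver is stopped, and delete .chunkdb before starting it again so the disk is rescanned.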
Manual re-distribution takes a long time and requires stopping the chunkserver. I do have a proof-of-concept script, but if all HDDs have to be processed at once (because processing one disk at a time does not work, due to propagation of the problem from other disks) then the task becomes very tedious... So far I've only done manual re-distribution on the two chunkservers with the fewest HDDs (and not even on all of their HDDs yet), where the problem was most prominent. Incorporating the correction into the internal distribution algorithm is much appreciated. Thanks.
The chunkserver is not spreading chunk files among directories fairly, evenly, or even randomly.
Consider the following assessment of one chunkserver HDD:
[listing: seconds to walk :: directory :: number of files]
Note the abnormally large directory `BB` - 80 times heavier than most, with a walk time 123 times slower. Several directories have 40...50 times more files than most, and walk time on those large directories is slow. Probabilistically, heavy directories get more hits and overall performance suffers.
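For reference, a short Python sketch of one way such an assessment can be gathered (the data path is a placeholder; walk times are wall-clock and depend on the page cache):

```python
import sys
import time
from pathlib import Path

disk = Path(sys.argv[1])                      # e.g. /mnt/chunks1 (placeholder)
for d in sorted(p for p in disk.iterdir() if p.is_dir()):
    start = time.monotonic()
    count = sum(1 for _ in d.iterdir())       # one full walk of the directory
    print(f"{time.monotonic() - start:10.3f} :: {d.name} :: {count}")
```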
As far as I'm concerned, disproportional distribution of chunk files among directories is a bug.
In #319 (comment) @chogata wrote: [...]
Apparently the distribution algorithm is flawed, but IMHO the presence of such logic is flawed too, as it cannot be predicted which chunks will be accessed more frequently.
Please consider making it optional (configurable) to use a fair (or completely random) distribution algorithm.
I recommend removing the assumption-based distribution logic in favour of a completely random, bias-free distribution.
Random distribution of chunk files among chunkserver directories is always better, because spreading files fairly reduces access concentration. More data in one place means more hits, more congestion, and less benefit from distribution. Imagine one chunkserver holding 40...50 times more data than any other -- guess which chunkserver would be the busiest? The same goes for the number of files per directory.