Feature/fuse improvements #88
Conversation
Also added explicitly_pinned to the list of known Xattributes
The default MaxReadahead is 4 kB. By changing it to 128 kB, fuse Read speed goes from 12.5 MB/s to 33 MB/s. Nice win!!!
I might be delusional, but logging every call to fuse Read has a throughput penalty. Even with a maximum read size of 128 kB it is called quite often for large files. With the default MaxReadahead=4 kB that logging leads to a quite catastrophic speed drop.
This is because Write now happens in much larger chunks, 128 kB instead of 8 kB, and the kernel slurps in the file to be written. Note: to get 50 MB/s I need to disable some write operation logging; it takes quite a lot of CPU overhead. See follow-up patches.
Apparently, when WritebackCache is enabled, the OS is quite spammy with Getxattr requests for the attribute "security.capability". The old code used to call cfs.Stat even when the requested attribute is not one we can provide. The call costs time and decreases throughput by about 10 MB/s. Now we check whether we can provide an attribute at all, and only then fetch the relevant info.
…rougput
As I mentioned before, Write with WritebackCache enabled asks for Getxattr way too often. If we report every call, our throughput is lower by about 10 MB/s.
This buffers files at the OS level, so subsequent reads are instantaneous (as long as they are cached).
…he required position
Simplifies the writing infrastructure by getting rid of the internal write buffer. Everything is done with the stream overlay.
Enabling fuse optimizations at mount: cache and larger buffer sizes. This gives quite large speed improvements; I even had to think about disabling debug output, since it takes time comparable to disk IO.
Yep, logging is always surprisingly expensive.
The main slowdown for using the overlay was a useless Seek at write: we write to the memory overlay but ask the backend to Seek for no reason. This problem is solved with the WriteAt method.
I think the Seek was pointless indeed. Probably a leftover from pre-overlay times. I think it could also be factored out of the normal handle.Write() method. On the other hand, we don't use that otherwise, so it might be removed completely too. Also, the old write did some metadata staging of the file size, which was very CPU intensive, so it was not only the Seek.
@sahib, please pay attention to how the overlay limits are handled. I am not sure that I fully grasped what they are used for. The structure field had no comment :(
It's for the case that a file gets extended or truncated. Imagine a file that has 4 kB but then gets overwritten by a 1 MB file. The file size in catfs would still be 4 kB before Flush() was called, but the fuse layer/overlay already knows that things got bigger. The same goes for the opposite, truncating a file (you can technically also make a file bigger by truncating, but that's a confusing term). The limit controls how big the file currently is. I left a comment about the file size (which was wrong before too), which we definitely need to check.
Otherwise very good. If overlay now learns to swap out excess intervals to disk over a certain memory limit, this would be a great improvement and would make the filesystem ready for day-to-day use.
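The extend/truncate behavior described above can be sketched as a tiny standalone model (all names here are assumptions for illustration, not brig's actual overlay code): the overlay tracks the highest offset written through it, and the effective size is that limit once it exceeds (or truncation overrides) the stale backend size.

```go
package main

import "fmt"

// overlay is a hypothetical stand-in for brig's overlay layer:
// backendSize is what catfs still reports before Flush(),
// limit is the highest byte written through the overlay.
type overlay struct {
	backendSize int64
	limit       int64
	truncated   bool // set once truncate() ran, so limit overrides backendSize
}

func (o *overlay) writeAt(buf []byte, off int64) {
	if end := off + int64(len(buf)); end > o.limit {
		o.limit = end
	}
}

func (o *overlay) truncate(size int64) {
	o.limit = size
	o.truncated = true
}

// size is what fuse's getattr should report: the overlay's view wins
// whenever it extended or truncated the file.
func (o *overlay) size() int64 {
	if o.truncated || o.limit > o.backendSize {
		return o.limit
	}
	return o.backendSize
}

func main() {
	o := &overlay{backendSize: 4096}
	fmt.Println(o.size()) // 4096: nothing written yet
	o.writeAt(make([]byte, 1024), 1047552) // final chunk of a 1 MiB overwrite
	fmt.Println(o.size()) // 1048576: bigger than catfs thinks
	o.truncate(0)
	fmt.Println(o.size()) // 0: truncation must override the stale backend size
}
```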
// Mimics `WriteAt` from `io` package https://golang.org/pkg/io/#WriterAt
func (l *Layer) WriteAt(buf []byte, off int64) (n int, err error) {
	// Copy the buffer, since we cannot rely on it being valid forever.
	modBuf := make([]byte, len(buf))
Potential improvement here: use a sync.Pool to save a few allocations when writing many buffers. This also applies to the regular Write method and is probably more relevant there. Care must be taken that the buffer from pool.Get() has a suitable length.
Not sure that I follow here. modBuf is somewhat permanent (it sits in the overlay until flushed). Pool implies that the data would soon be deleted. Besides, the manual says: "Any item stored in the Pool may be removed automatically at any time without notification." This sounds very scary. I guess I do not know the internals of the GC well enough to really see the difference.
That documentation is a little misleading. Nothing gets garbage collected while you still use it (i.e. while you hold a reference to it). modBuf can be Put() back to the pool after flush, and on L235 above you can do a pool.Get() to fetch a buffer that was either used before or is newly allocated. Go is free to do whatever it likes in the time between Put and Get.
I see now. Can we move it to a separate issue? From a quick reading, pool is not suited for objects of different sizes. So if len(buf) differs from the result of a Get, we would have to put it back. But I see no way to iterate over the pool to check whether there is a suitable candidate. Besides, it would require a full rethinking of how overlay works.
You would just get an item from the pool and check if it's reasonably sized (if larger, cut it); if not, allocate a new one the normal way and use that. But sure, we can do it once overlay is touched in general.
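The get-check-cut-or-allocate scheme described here can be sketched roughly as follows (a minimal illustration with assumed names, not brig's code; the 128 kB capacity matches the largest fuse write block mentioned in this PR):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable write buffers. New pre-sizes the capacity to
// 128 kB, the maximum fuse write block discussed in this PR.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 128*1024) },
}

// getBuf returns a buffer of exactly n bytes: a pooled one cut to length
// when its capacity suffices, otherwise a normal fresh allocation.
func getBuf(n int) []byte {
	b := bufPool.Get().([]byte)
	if cap(b) >= n {
		return b[:n] // cut the pooled buffer to the requested length
	}
	bufPool.Put(b) // too small for this request; hand it back
	return make([]byte, n)
}

// putBuf returns a buffer to the pool, e.g. once the overlay has flushed it.
func putBuf(b []byte) {
	bufPool.Put(b[:0])
}

func main() {
	b := getBuf(4096)
	fmt.Println(len(b), cap(b) >= 4096) // 4096 true
	putBuf(b)
}
```

Note the caveat from the thread still applies: the buffer may only be Put() back after the overlay no longer references it, and Go may drop pooled items between Put and Get at any time.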
	// very likely a subsequent read/write request;
	// no need to seek since we are already pointing at the right position
	return l.pos, nil
}
👍
fuse/handle.go
Outdated
@@ -113,32 +65,47 @@ const maxInt = int(^uint(0) >> 1)

// Write is called to write a block of data at a certain offset.
func (hd *Handle) Write(ctx context.Context, req *fuse.WriteRequest, resp *fuse.WriteResponse) error {
	start := time.Now()
	// Note: even when the underlying process makes sequential writes,
	// the kernel might send blocks out of order!!!
Can you elaborate a bit on this? Especially what the consequences for us are.
Can you elaborate a bit on this? Especially what the consequences for us are.
I will add the comment, but the full description is the following: let's say at the OS level you do (cat or dd) largeFile > /brig-fuse-mount/file. You would naively expect the Write requests to be in FIFO order with steadily growing offsets (I called it consequent), and they mostly are. What I saw is that occasionally the OS layer asks to write a block with a higher offset and then another with a lower offset.
I noticed this when I was tracking the offset position: with FIFO order I would not have to do an additional Seek if the position matches the offset.
It does not concern our current Write much. But one should not assume that Write can be streamed with an automatically positioned cursor.
Otherwise, it would be super easy to merge overlays into one big modification, which might help with speed.
But it also makes the aforementioned swapping to disk hard, since the overlay might have a hole. So every block would be of a different size.
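The consequence of out-of-order delivery can be shown in a few lines (a hypothetical toy, not brig's code): an offset-honoring writeAt produces the correct file regardless of arrival order, while a cursor-based stream would not.

```go
package main

import "fmt"

// file is a toy in-memory file that honors explicit write offsets,
// growing its backing slice as needed.
type file struct{ data []byte }

func (f *file) writeAt(buf []byte, off int64) {
	if need := off + int64(len(buf)); need > int64(len(f.data)) {
		grown := make([]byte, need)
		copy(grown, f.data)
		f.data = grown
	}
	copy(f.data[off:], buf)
}

func main() {
	f := &file{}
	// Blocks arrive out of order, as observed in the fuse logs:
	f.writeAt([]byte("world"), 5) // higher offset first...
	f.writeAt([]byte("hello"), 0) // ...then a lower one
	fmt.Printf("%s\n", f.data) // helloworld
}
```

A stream writer appending in arrival order would have produced "worldhello", which is exactly why the overlay cannot assume an automatically positioned cursor.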
addressed in d3aad7a
fuse/handle.go
Outdated
// Mimics `WriteAt` from `io` package https://golang.org/pkg/io/#WriterAt
// Main idea: do not bother with the Seek pointer, since the underlying `overlay` works
// with intervals in memory and we do not need to `Seek` the backend, which is very time expensive.
func (hd *Handle) WriteAt(buf []byte, off int64) (n int, err error) {
I would recommend lowercasing the writeAt method here. I was confused at first because I thought that fuse expects this interface and would call it directly. I used lowercase method names to also indicate that they don't do locking.
done in 03e48e6
// enabling MaxReadahead doubles or even triples Read throughput: 12 MB/s -> 25 or 33 MB/s
fuse.MaxReadahead(128 * 1024), // kernel uses at most 128 kB = 131072 B
// enabling WritebackCache doubles write speed to the buffer: 12 MB/s -> 24 MB/s
fuse.WritebackCache(), // writes will happen in much larger blocks: 128 kB instead of 8 kB
Nice nice nice.
fuse/util.go
Outdated
		}
	}
	return false
}
I think this one could be faster. Calling getXattr often will always incur the cost of calling listXattr (allocation heavy) and of linearly checking for the attribute. Instead we should just use a map, since we know and control the possible xattrs:
var xattrMap = map[string]bool{
	"user.brig.hash": true,
	// ....
}

// use xattrMap in listXattr (iterate over keys) and getXattr (check if a key exists)
Alternatively we could even think of doing something like this:
type xattrGetter func(cfs *catfs.FS, info catfs.StatInfo) ([]byte, error)

var xattrMap = map[string]xattrGetter{
	"user.brig.hash": func(cfs *catfs.FS, info catfs.StatInfo) ([]byte, error) {
		return []byte(info.TreeHash.B58String()), nil
	},
	// ....
}

// Then the implementation of getXattr would be trivial:
func getXattr(cfs *catfs.FS, name, path string, size uint32) ([]byte, error) {
	// todo: not sure if it comes with a null byte or not.
	handler, ok := xattrMap[name]
	if !ok {
		return nil, fuse.ErrNoXattr
	}

	info, err := cfs.Stat(path)
	if err != nil {
		return nil, errorize("getxattr", err)
	}

	return handler(cfs, info)
}

// listxattr would be simpler too.
The advantage of this approach: the lookup time of a map is constant, and it validates the name in one go. Also, the xattr names would be in one central place.
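A self-contained version of the dispatch-map idea can be tested in isolation (StatInfo, the error value, and the hash string below are stubs standing in for catfs's real types): unknown attributes such as the spammy "security.capability" are rejected by the map lookup before any Stat call happens.

```go
package main

import (
	"errors"
	"fmt"
)

// StatInfo is a stub for catfs.StatInfo, reduced to what the example needs.
type StatInfo struct{ TreeHash string }

var errNoXattr = errors.New("no such xattr")

// xattrGetter produces the value of one extended attribute from stat info.
type xattrGetter func(info *StatInfo) ([]byte, error)

var xattrMap = map[string]xattrGetter{
	"user.brig.hash": func(info *StatInfo) ([]byte, error) {
		return []byte(info.TreeHash), nil
	},
}

// getXattr validates the attribute name and dispatches in one map lookup,
// so unsupported names never trigger an expensive Stat.
func getXattr(info *StatInfo, name string) ([]byte, error) {
	handler, ok := xattrMap[name]
	if !ok {
		return nil, errNoXattr
	}
	return handler(info)
}

func main() {
	info := &StatInfo{TreeHash: "QmExampleHash"}
	v, _ := getXattr(info, "user.brig.hash")
	fmt.Println(string(v)) // QmExampleHash
	_, err := getXattr(info, "security.capability")
	fmt.Println(err != nil) // true: unknown attrs are rejected up front
}
```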
I like it, mostly for the unified tracking of available attributes. Speed-wise it will probably gain very little with only 3 attributes. Preparing a patch now.
The actual speed-up would only be noticeable when doing a lot of xattr fetching. But this would improve both the implementation and the speed, so a win-win.
} else {
	attr.Size = info.Size
}
attr.Size = info.Size
I'm not entirely sure, but info.Size might not be correct. When we truncate or extend a file, catfs still remembers the old size (it gets updated only after Stage(), i.e. on Flush()). We might need to use the limits of the overlay to determine the size at that point. At least that's what I remember; in any case we should check the file size with this test:
# file should have a size of 1MB.
$ dd if=/dev/zero of=/fuse/file.txt bs=1M count=1
# file size should increase with each write happening.
# (probably easier to test in a unit test where you can
# control when the file is closed).
$ dd if=/dev/zero of=/fuse/file.txt bs=1M count=20
This might also explain why your test produced that much metadata: every fuse write called catfs.Handle.Write(), and it stages the file to update the file size. That's actually a very dumb thing to do, and we should retrieve the current file size from the handle and not from catfs (where it will be changed on flush).
I looked at other loggers, Zerolog and Zap. It seems that they get their speed from extremely rigid formatting, i.e. you cannot say
The confusing part is brig/catfs/mio/overlay/overlay.go (line 194 in c293485): there are limit and fileSize. From your description it sounds like they should be the same.
I know that it is on your needed-features list. But after thinking about it, I think it should be pushed to the future or even abandoned. I am thinking about the use cases:
I would argue that case 3 is not. Case 1 would greatly benefit if there were a flag from the OS saying that what is coming is a single "stage", but I see no such thing in the manuals.
No, as described above they're not. But in your PR you removed the part where, on every write, the file was re-staged with the up-to-date size. While this removal is good, the file size reported by fuse needs to stay up to date on every Write. You are going to need to expose it from the overlay to the handle to fuse's getattr.
Your point 1 is already the reason why we need this. Without it we can only copy files up to the size of our memory. The limit at which we start swapping to disk can be set high enough that it does not happen for normal usage. I don't expect anybody to run a database on a fuse, but people might still edit large files for whatever reason. This may not be a typical use case and we don't need to be fast here, but we still need to support it to a certain extent. EDIT: Clarification: we can of course postpone it and it shouldn't be part of this PR anyway.
Let's postpone this particular point. The OS keeps track of the file size internally; since it does the writes, it knows better what the file size is. How I check, I
OK. Now I see your point. Big files on small-memory machines. Yep, this should be addressed.
But I suggest moving it to a different issue.
My particular PR does not change the current memory demand, so it just maintains the status quo.
What is wrong with Travis? All tests pass on my machine.
That's unexpected. My understanding is that it still makes a call to
Sure, that needs another PR. I would keep #49 open since it's a long-running issue with some more things to do.
Good question.
Oops, accidentally hit the close button.
I would greatly appreciate it. I think I understand what it should do, but the internal machinery does not fit into my brain yet.
I see logs, not sure that I have
Pretty much the 1st call always comes with req.Size=0. Our job is to respond with all we have. If the response resp.Xattr does not fit into the specified buffer req.Size, fuse will internally truncate it (see https://github.com/bazil/fuse/blob/fb710f7dfd05053a3bc9516dd5a7a8f84ead8aab/fuse.go#L1656 and similarly https://github.com/bazil/fuse/blob/fb710f7dfd05053a3bc9516dd5a7a8f84ead8aab/fuse.go#L1618) and relay it to the caller with an ERANGE error (buffer is too small). The caller should then ask again with a larger buffer. On the other hand, truncation to zero is a disaster, since the caller assumes that there are no xattributes. Note that apparently, even if the call comes with an allowed req.Size=0, internally it has some small buffer. So our first reply often fits and the call ends without error.
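The size protocol described here can be modeled with a small standalone helper (a hypothetical sketch, not bazil/fuse's actual code; the attribute names and errRange are assumptions for illustration): a size-0 probe gets the full packed list, a too-small buffer yields ERANGE instead of a silently truncated (and thus misleading) reply.

```go
package main

import (
	"errors"
	"fmt"
)

// errRange stands in for the ERANGE error fuse relays to the caller.
var errRange = errors.New("ERANGE: buffer too small")

// listXattrs packs the attribute names as NUL-terminated strings.
// size==0 means "probe": respond with everything we have so the caller
// learns the required buffer size. A nonzero size that is too small must
// produce ERANGE, never an empty/truncated list.
func listXattrs(names []string, size uint32) ([]byte, error) {
	var packed []byte
	for _, n := range names {
		packed = append(packed, n...)
		packed = append(packed, 0)
	}
	if size != 0 && uint32(len(packed)) > size {
		// Truncating would make the caller believe some attributes do not
		// exist; reporting ERANGE lets it retry with a larger buffer.
		return nil, errRange
	}
	return packed, nil
}

func main() {
	names := []string{"user.brig.hash", "user.brig.explicitly_pinned"}
	full, _ := listXattrs(names, 0)
	fmt.Println(len(full)) // the buffer size the caller needs
	_, err := listXattrs(names, 8)
	fmt.Println(err != nil) // true: 8 bytes is too small
}
```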
Looks good to me.
Addresses #49
Enabling fuse optimizations at mount: cache and larger buffer sizes. This gives quite large speed improvements; I even had to think about disabling debug outputs, since it takes time comparable to disk IO.

Also, removed the internal write buffer in favor of the overlay system. The main slowdown for using the overlay was a useless Seek at write: we write to the memory overlay but ask the backend to Seek for no reason. This problem is solved with the WriteAt method.

@sahib, please pay attention to how the overlay limits are handled. I am not sure that I fully grasped what they are used for. The structure field had no comment :(