Feature/fuse improvements by evgmik · Pull Request #88 · sahib/brig · GitHub

Feature/fuse improvements #88


Merged
merged 22 commits into develop from feature/fuse-improvements on Jan 24, 2021

Conversation

Collaborator
@evgmik evgmik commented Jan 23, 2021

Addresses #49
Enabling FUSE optimizations at mount: caching and larger buffer sizes. This gives quite large speed improvements; I even had to think about disabling debug output, since it takes time comparable to the disk I/O.

Also removed the internal write buffer and switched writes back to the overlay system.

The main slowdown when using the overlay was a useless Seek on write: we write to the in-memory overlay but ask the backend to Seek for no reason. This is solved with a WriteAt method.

@sahib, please pay attention to how the overlay limits are handled. I am not sure I fully grasped what they are used for. The struct field had no comment :(

evgmik added 16 commits January 17, 2021 22:47
Also added explicitly_pinned to the list of known Xattributes
The default MaxReadahead is 4 kB. By changing it to 128 kB,
we get fuse Read speed from 12.5 MB/s to 33 MB/s. Nice win!
I might be delusional, but logging every call to fuse Read
has a throughput penalty. Even with a maximum read size of 128 kB,
it gets called quite often for large files.

With the default MaxReadahead=4 kB, that logging leads to a quite catastrophic speed drop.
This is because Write now happens in much larger chunks, 128 kB instead of
8 kB, and the kernel pulls in the file to be written.

Note: to get 50 MB/s I need to disable some write-operation logging;
it takes quite a lot of CPU overhead. See follow-up patches.
Apparently, when WritebackCache is enabled, the OS is quite spammy with
Getxattr requests for the attribute "security.capability". The old code used
to call cfs.Stat even if the requested attribute is not one we can
provide. The call costs time and decreases throughput by about 10 MB/s.

Now we check whether we can even provide an attribute, and only then fetch
the relevant info.
…roughput

As I mentioned before, Write with WritebackCache enabled asks for
Getxattr way too often. If we report every call, our throughput is
lower by about 10 MB/s.
This buffers files at the OS level, so subsequent reads are
instantaneous (as long as they are cached).
Simplifies the writing infrastructure by getting rid of the internal
write buffer. Everything is done with the stream overlay.
@evgmik evgmik linked an issue Jan 23, 2021 that may be closed by this pull request
Owner
@sahib sahib left a comment


Enabling FUSE optimizations at mount: caching and larger buffer sizes. This gives quite large speed improvements; I even had to think about disabling debug output, since it takes time comparable to the disk I/O.

Yep, logging is always surprisingly expensive.

The main slowdown when using the overlay was a useless Seek on write: we write to the in-memory overlay but ask the backend to Seek for no reason. This is solved with a WriteAt method.

I think the Seek was pointless indeed, probably a leftover from pre-overlay times. I think it could also be factored out of the normal handle.Write() method. On the other hand, we don't use that otherwise, so it might be removed completely too. Also, the old write did some metadata staging of the file size, which was very CPU intensive, so it was not only the Seek.

@sahib, please pay attention to how the overlay limits are handled. I am not sure I fully grasped what they are used for. The struct field had no comment :(

It's for the case that a file gets extended or truncated. Imagine a file that has 4 KB but then gets overwritten by a 1 MB file. The file size in catfs would still be 4 KB before Flush() was called, but the fuse layer/overlay already knows that things got bigger. The same goes for the opposite, truncating a file (you can technically also make a file bigger by truncating, but that's a confusing term). The limit controls how big the file currently is. I left a comment about the file size (which was wrong before too), which we definitely need to check.
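Conceptually, the idea reads like the following sketch. It is purely illustrative (the type and method names are made up, not brig's actual overlay code): the limit tracks the file's current size as fuse sees it, while catfs only learns the new size on Flush().

package overlaysketch

// layerSketch is a stand-in for the real overlay layer, only to show how
// a limit field could track the current size before Flush().
type layerSketch struct {
	limit int64 // current size of the file as the fuse layer sees it
}

// growTo would be called from writeAt: a write past the old end of file
// extends the limit immediately, while catfs still reports the old size.
func (l *layerSketch) growTo(off int64, n int) {
	if end := off + int64(n); end > l.limit {
		l.limit = end
	}
}

// truncateTo would be called on truncate: shrinking (or extending) via
// truncate also moves the limit; catfs learns about it only on Flush().
func (l *layerSketch) truncateTo(size int64) {
	l.limit = size
}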


Otherwise very good. If the overlay now learns to swap out excess intervals to disk over a certain memory limit, then this would be a great improvement and would make the filesystem ready for day-to-day use.

// Mimics `WriteAt` from `io` package https://golang.org/pkg/io/#WriterAt
func (l *Layer) WriteAt(buf []byte, off int64) (n int, err error) {
	// Copy the buffer, since we cannot rely on it being valid forever.
	modBuf := make([]byte, len(buf))
Owner

Potential improvement here: use a sync.Pool to save a few allocations when writing many buffers. This also applies to the regular Write method and is probably more relevant there. Care must be taken that the buffer from pool.Get() has a suitable length.

Collaborator Author

Not sure that I follow here. modBuf is somewhat permanent (it sits in the overlay until flushed). Pool implies that the data would soon be deleted. Besides, the manual says "Any item stored in the Pool may be removed automatically at any time without notification." This sounds very scary.

I guess I do not know the internals of the GC well enough to really see the difference.

Owner

That documentation is a little misleading. Nothing gets garbage collected while you still use it (i.e. while you hold a reference to it). modBuf can be Put() back to the pool after flush, and on L235 above you can do a pool.Get() to fetch a buffer that was either used before or is newly allocated. Go is free to do whatever it likes in the time between Put and Get.

Collaborator Author

I see now. Can we move it to a separate issue?
From a quick reading, a pool is not suited for objects of different sizes. So if len(buf) differs from the result of a Get, we would have to put it back. But I see no way to iterate over the pool to see if there is a suitable candidate. Besides, it would require fully rethinking how the overlay works.

Owner

You would just get an item from the pool and check if it's reasonably sized (if it's larger, cut it down); if not, allocate a new one the normal way and use that. But sure, we can do it once the overlay is touched in general.
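For the record, a minimal sketch of that approach (the package name, bufPool, and the 128 kB default are illustrative assumptions, not brig's actual code): fetch a buffer from the pool, fall back to a normal allocation if it is too small, and Put() it back once the overlay has been flushed.

package overlaysketch

import "sync"

// bufPool hands out reusable byte slices so that not every fuse Write
// has to allocate a fresh copy buffer.
var bufPool = sync.Pool{
	New: func() interface{} {
		// 128 kB matches the largest write chunk expected with
		// WritebackCache enabled (an assumption for this sketch).
		return make([]byte, 128*1024)
	},
}

// copyForOverlay returns a private copy of buf, reusing pooled memory
// when the pooled buffer is large enough.
func copyForOverlay(buf []byte) []byte {
	modBuf := bufPool.Get().([]byte)
	if cap(modBuf) < len(buf) {
		// Pooled buffer is too small: allocate the normal way instead.
		modBuf = make([]byte, len(buf))
	}
	modBuf = modBuf[:len(buf)]
	copy(modBuf, buf)
	return modBuf
}

// releaseBuf returns a buffer to the pool, e.g. after Flush() has written
// the overlay intervals to the backend.
func releaseBuf(buf []byte) {
	bufPool.Put(buf[:cap(buf)])
}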

	// very likely a sequential read/write request;
	// no need to seek since we are already pointing to the right position
	return l.pos, nil
}
Owner

👍

fuse/handle.go Outdated
@@ -113,32 +65,47 @@ const maxInt = int(^uint(0) >> 1)

// Write is called to write a block of data at a certain offset.
func (hd *Handle) Write(ctx context.Context, req *fuse.WriteRequest, resp *fuse.WriteResponse) error {
	start := time.Now()
	// Note: even when the underlying process makes consecutive writes,
	// the kernel might send blocks out of order!
Owner

Can you elaborate a bit on this? Especially what the consequences for us are.

Collaborator Author

Can you elaborate a bit on this? Especially what the consequences for us are.

I will add a comment, but the full description is the following:
let's say at the OS level you do (cat or dd) largeFile > /brig-fuse-mount/file. You would naively expect the Write requests to be in FIFO order with steadily growing offsets (I called that consecutive), and they mostly are. What I saw is that occasionally the OS layer asks to write a block with a higher offset and then another with a lower offset.

I noticed this when I was tracking the offset position: with FIFO order I would not have to do an additional Seek if the position matches the offset.

It does not concern our current Write much. But one should not assume that Write can be streamed with an automatically positioned cursor.
Otherwise, it would be super easy to merge the overlays into one big modification, which might help with speed.

But it also makes the aforementioned swapping to disk hard, since the overlay might have a hole. So every block would be of a different size.

Collaborator Author
@evgmik evgmik Jan 23, 2021

addressed in d3aad7a

fuse/handle.go Outdated
// Mimics `WriteAt` from the `io` package: https://golang.org/pkg/io/#WriterAt
// The main idea is to not bother with a Seek pointer, since the underlying `overlay` works
// with intervals in memory and we do not need to `Seek` the backend, which is very time-expensive.
func (hd *Handle) WriteAt(buf []byte, off int64) (n int, err error) {
Owner

I would recommend lowercasing the writeAt method here. I was confused at first because I thought that fuse expects this interface and would call it directly. I used lowercase method names to also indicate that they don't do locking.

Collaborator Author

done in 03e48e6

// enabling MaxReadahead doubles or even triples Read throughput: 12 MB/s -> 25 or 33 MB/s
fuse.MaxReadahead(128 * 1024), // kernel uses at most 128 kB = 131072 B
// enabling WritebackCache doubles write speed to the buffer: 12 MB/s -> 24 MB/s
fuse.WritebackCache(), // writes will happen in much larger blocks, 128 kB instead of 8 kB
Owner

Nice nice nice.
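For context, a minimal sketch (assuming the bazil.org/fuse API; mountWithCaching and filesys are illustrative names, not brig's actual mount code) of how these two options end up in the mount call:

package fusesketch

import (
	"bazil.org/fuse"
	"bazil.org/fuse/fs"
)

// mountWithCaching mounts filesys at mountpoint with the two options
// discussed above and serves requests until the filesystem is unmounted.
func mountWithCaching(mountpoint string, filesys fs.FS) error {
	conn, err := fuse.Mount(
		mountpoint,
		fuse.MaxReadahead(128*1024), // the kernel caps readahead at 128 kB anyway
		fuse.WritebackCache(),       // lets the kernel batch writes into larger chunks
	)
	if err != nil {
		return err
	}
	defer conn.Close()

	return fs.Serve(conn, filesys)
}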

fuse/util.go Outdated
}
}
return false
}
Owner

I think this one could be faster. Calling getXattr often will always have the cost of calling listXattr (allocation heavy) and linearly checking for the attribute. Instead we should just use a map since we know and control the possible xattrs:

var xattrMap = map[string]bool{
    "user.brig.hash": true,
    // ...
}

// use xattrMap in listXattr (iterate over keys) and getXattr (checking if a key exists)

Alternatively, we could even think of doing something like this:

type xattrGetter func(cfs *catfs.FS, info catfs.StatInfo) ([]byte, error)

var xattrMap = map[string]xattrGetter{
    "user.brig.hash": func(cfs *catfs.FS, info catfs.StatInfo) ([]byte, error) {
        return []byte(info.TreeHash.B58String()), nil
    },
    // ....
}


// Then the implementation of getXattr would be trivial:
func getXattr(cfs *catfs.FS, name, path string, size uint32) ([]byte, error) {
    // todo: not sure if it comes with a null byte or not.
    handler, ok := xattrMap[name]
    if !ok {
        return nil, fuse.ErrNoXattr
    }

    info, err := cfs.Stat(path)
    if err != nil {
        return nil, errorize("getxattr", err)
    }

    return handler(cfs, info)
}

// listxattr would be simpler too.

The advantage of this approach would be: the lookup time of a map is constant, and validation happens in one go. Also, the xattr names would be in one central place.

Collaborator Author

I like it, mostly for the unified tracking of available attributes. Speed-wise, it will probably gain very little with only 3 attributes. Preparing a patch now.

Owner

The actual speed-up would only be noticeable when doing a lot of xattr fetching. But this would improve both the implementation and the speed, so it's a win-win.

Collaborator Author

done with c940779. I also fixed an old unnoticed bug in ce915ce and added a test case in 05b5737.

I think we are ready to merge.

} else {
	attr.Size = info.Size
}
attr.Size = info.Size
Owner

I'm not entirely sure, but info.Size might not be correct. When we truncate or extend a file, catfs still remembers the old size (it gets updated only after Stage(), i.e. on Flush()). We might need to use the limits of the overlay to determine the size at that point. At least that's what I remember; in any case we should check the file size with this test:

# file should have a size of 1MB.
$ dd if=/dev/zero of=/fuse/file.txt bs=1M count=1
# file size should increase with each write happening.
# (probably easier to test in a unit test where you can
#  control when the file is closed). 
$ dd if=/dev/zero of=/fuse/file.txt bs=1M count=20
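A hedged sketch of that unit-test idea (the mountpoint path and chunk sizes are illustrative; a real test would mount a temporary brig FUSE directory first): write the file in several chunks without closing it and check that the size reported via stat keeps growing.

package fusesketch

import (
	"os"
	"testing"
)

// TestSizeGrowsDuringWrite writes 20 chunks of 1 MB without closing the file
// and checks after each chunk that the size reported by stat keeps growing.
func TestSizeGrowsDuringWrite(t *testing.T) {
	path := "/tmp/brig-fuse-mount/file.txt" // assumed FUSE mountpoint

	fd, err := os.Create(path)
	if err != nil {
		t.Fatal(err)
	}
	defer fd.Close()

	chunk := make([]byte, 1<<20) // 1 MB of zeroes, like dd if=/dev/zero bs=1M
	for i := 1; i <= 20; i++ {
		if _, err := fd.Write(chunk); err != nil {
			t.Fatal(err)
		}
		// Force the writeback cache to flush so the fuse layer sees the data
		// before we stat the file.
		if err := fd.Sync(); err != nil {
			t.Fatal(err)
		}
		info, err := os.Stat(path)
		if err != nil {
			t.Fatal(err)
		}
		if want := int64(i) * int64(len(chunk)); info.Size() < want {
			t.Fatalf("after chunk %d: size %d, want at least %d", i, info.Size(), want)
		}
	}
}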

Owner

This might also explain why your test produced that much metadata: every fuse write called catfs.Handle.Write(), which stages the file to update the file size. That's actually a very dumb thing to do, and we should retrieve the current file size from the handle and not from catfs (where it is only changed on flush).

Collaborator Author
evgmik commented Jan 23, 2021

Yep, logging is always surprisingly expensive.
Well, everyone who has worked with microcontrollers knows that a printf brings everything down to dead-snail speed :)
I was surprised to see that it is still true.

I looked at other loggers, Zerolog and Zap. It seems that they get their speed from extremely rigid formatting, i.e. you cannot say Printf("%v", value), you have to spell out the type.
Which makes them painful for sprinkling debugging info.
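To illustrate the "spell out the type" point, here is a hedged sketch of zap's structured API (the field names and values are made up, not taken from this PR):

package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Instead of a printf-style log.Printf("wrote %v bytes in %v", n, took),
	// zap's core API wants every field spelled out with its type:
	n, took := 131072, 3*time.Millisecond
	logger.Info("fuse write",
		zap.Int("bytes", n),
		zap.Duration("took", took),
	)
}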

@sahib, please pay attention to how the overlay limits are handled. I am not sure I fully grasped what they are used for. The struct field had no comment :(

It's for the case that a file gets extended or truncated. Imagine a file that has 4 KB but then gets overwritten by a 1 MB file. The file size in catfs would still be 4 KB before Flush() was called, but the fuse layer/overlay already knows that things got bigger. The same goes for the opposite, truncating a file (you can technically also make a file bigger by truncating, but that's a confusing term). The limit controls how big the file currently is. I left a comment about the file size (which was wrong before too), which we definitely need to check.

The confusing part is

limit int64

there are both limit and fileSize. From your description it sounds like they should be the same.

Collaborator Author
evgmik commented Jan 23, 2021

If the overlay now learns to swap out excess intervals to disk over a certain memory limit, then this would be a great improvement and would make the filesystem ready for day-to-day use.

I know that it is on your list of needed features. But after thinking about it, I think it should be pushed to the future or even abandoned.

I am thinking about the use cases:

  1. brig is a syncer.
  • A user copies a file (maybe a very large one) in a single command. Yes, we ask for the size of the file to be allocated in memory (and about the same amount in the OS cache).
    This is a transient memory demand; once we have flushed the overlay, the memory is free. Moving to swap would cost us speed.
  2. A user edits a small file directly in brig-fuse. Small files are not a big deal.
  3. brig-fuse is the real file system.
  • I.e. someone might run a database with random-access reads/writes. I see no efficient way to do that which would not cause heavy I/O to the hard drive for swapping.
  • The only easy way to do it is to make the backend block-based, instead of the current mode: one file, one hash. I guess there is a reason why real filesystems split files over multiple inodes.

I would argue that case 3 is not a brig use case. Doing an efficient FS is hard work (I still remember the sqlite speed drop on the transition from ext2 to ext3 due to some journaling). It is potentially doable: the internal streamed data is prepared in chunks (archived and encrypted).
But then we are reinventing ipfs file storage.

Case 1 would greatly benefit if there were a flag from the OS saying that what is coming is a single "stage", but I see no such thing in the manuals.

Owner
sahib commented Jan 23, 2021

there are both limit and fileSize. From your description it sounds like they should be the same.

No, as described above they're not. limit should be the actual size of the file; fileSize is the size of the original file before it was opened in fuse. Disclaimer: I didn't re-read the code, that's just my faulty memory, so reality might be different or buggy.

But in your PR you removed the part where on every write the file was re-staged with the up-to-date size. While this removal is good, the file size reported by fuse needs to stay up to date on every Write. You are going to need to expose it from the overlay through the handle to fuse's getattr.

I know that it is on your list of needed features. But after thinking about it, I think it should be pushed to the future or even abandoned.

Your point 1 is already the reason why we need this. Without it we can only copy files up to the size of our memory. The limit at which we start swapping to disk can be set high enough that it is not hit during normal usage.

I don't expect anybody to run a database on a fuse mount, but people might still edit large files for whatever reason. This may not be a typical use case and we don't need to be fast here, but we still need to support it to a certain extent.

EDIT: Clarification: we can of course postpone it, and it shouldn't be part of this PR anyway.

Collaborator Author
evgmik commented Jan 23, 2021

there are both limit and fileSize. From your description it sounds like they should be the same.

No, as described above they're not. limit should be the actual size of the file; fileSize is the size of the original file before it was opened in fuse. Disclaimer: I didn't re-read the code, that's just my faulty memory, so reality might be different or buggy.

But in your PR you removed the part where on every write the file was re-staged with the up-to-date size. While this removal is good, the file size reported by fuse needs to stay up to date on every Write. You are going to need to expose it from the overlay through the handle to fuse's getattr.

I know that it is on your list of needed features. But after thinking about it, I think it should be pushed to the future or even abandoned.

Your point 1 is already the reason why we need this. Without it we can only copy files up to the size of our memory. The limit at which we start swapping to disk can be set high enough that it is not hit during normal usage.

I don't expect anybody to run a database on a fuse mount, but people might still edit large files for whatever reason. This may not be a typical use case and we don't need to be fast here, but we still need to support it to a certain extent.

EDIT: Clarification: we can of course postpone it, and it shouldn't be part of this PR anyway.

Let's postpone this particular point. The OS keeps track of the file size internally; since it does the writes, it knows best what the file size is. How I checked: I dd'ed a 1 GB file to brig-fuse, which takes a while, and I ran ls fuse/file during the dd. I saw that the file size was growing in sync with the write requests.

Collaborator Author
@evgmik evgmik left a comment

OK, now I see your point: big files on machines with little memory. Yep, this should be addressed.
But I suggest moving it to a different issue.

My particular PR does not change the current memory demand, so it just maintains the status quo.

Collaborator Author
evgmik commented Jan 23, 2021

What is wrong with travis? All tests pass on my machine.

Owner
sahib commented Jan 23, 2021

How I checked: I dd'ed a 1 GB file to brig-fuse, which takes a while, and I ran ls fuse/file during the dd. I saw that the file size was growing in sync with the write requests.

That's unexpected. My understanding is that it still makes a call to getattr() to get the size. Will double-check this tomorrow.

[...] But I suggest moving it to a different issue.

Sure, that needs another PR. I would keep #49 open since it's a long-running issue with some more things to do.
If you want, I can continue improving the overlay - since I brought that beast into life anyway.

What is wrong with travis? All tests pass on my machine.

Good question. TestClientFetchPatch broke for some reason. Maybe a flaky test. I re-ran it and it's green now.
Can you see the logs here? Can you also re-trigger the build with the "build now" button?

@sahib sahib closed this Jan 23, 2021
@sahib sahib reopened this Jan 23, 2021
Owner
sahib commented Jan 23, 2021

Oops, accidentally hit the close button.

Collaborator Author
evgmik commented Jan 24, 2021

Sure, that needs another PR. I would keep #49 open since it's a long-running issue with some more things to do.
If you want, I can continue improving the overlay - since I brought that beast into life anyway.

I would greatly appreciate it. I think I understand what it should do, but the internal machinery does not fit into my brain yet.

What is wrong with travis? All tests pass on my machine.

Good question. TestClientFetchPatch broke for some reason. Maybe a flaky test. I re-ran it and it's green now.
Can you see the logs here? Can you also re-trigger the build with the "build now" button?

I see the logs, but I'm not sure I have a rebuild button. It passed the checks, so it is probably disabled.

Pretty much every first call comes with req.Size=0. Our job is to respond with all we have.
If the response resp.Xattr does not fit into the specified buffer req.Size, fuse will internally
truncate it (see
https://github.com/bazil/fuse/blob/fb710f7dfd05053a3bc9516dd5a7a8f84ead8aab/fuse.go#L1656
and similarly
https://github.com/bazil/fuse/blob/fb710f7dfd05053a3bc9516dd5a7a8f84ead8aab/fuse.go#L1618)
and relay it to the
caller with an ERANGE error (buffer too small). The caller should then ask
again with a larger buffer.

On the other hand, truncation to zero is a disaster, since the caller
assumes that there are no xattributes.

Note that apparently, even if the call comes with req.Size=0, internally it
has some small buffer. So our first reply often fits and the call ends
without error.
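For illustration, a hedged sketch of such a Listxattr handler with the bazil.org/fuse API (the Entry type and the xattr names are placeholders, not the exact code that was merged): the handler simply appends everything it knows, and bazil takes care of truncating to req.Size and signalling ERANGE.

package fusesketch

import (
	"context"

	"bazil.org/fuse"
)

// Entry is a stand-in for brig's fuse node type.
type Entry struct{}

// xattrNames lists the attributes we can answer for (illustrative values).
var xattrNames = []string{
	"user.brig.hash",
	"user.brig.explicitly_pinned",
}

// Listxattr reports every xattr name we support; bazil itself truncates the
// reply to req.Size and answers ERANGE if the caller's buffer is too small.
func (e *Entry) Listxattr(ctx context.Context, req *fuse.ListxattrRequest, resp *fuse.ListxattrResponse) error {
	for _, name := range xattrNames {
		resp.Append(name)
	}
	return nil
}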
Owner
@sahib sahib left a comment

Looks good to me.

@evgmik evgmik merged commit 6173448 into develop Jan 24, 2021
@evgmik evgmik deleted the feature/fuse-improvements branch January 24, 2021 14:36
Development

Successfully merging this pull request may close these issues.

fix: stabilize & improve the FUSE layer