Feature/fuse improvements by evgmik · Pull Request #88 · sahib/brig · GitHub

Feature/fuse improvements #88


Merged
merged 22 commits into develop from feature/fuse-improvements on Jan 24, 2021

Conversation

Collaborator
@evgmik evgmik commented Jan 23, 2021

Addresses #49
Enabling FUSE optimizations at mount: caching and larger buffer sizes. This gives quite large speed improvements; I even had to think about disabling debug output, since it takes time comparable to the disk I/O.

Also removed the internal write buffer and switched writes back to the overlay system.

The main slowdown when using the overlay was a useless Seek on write: we write to the in-memory overlay but ask the backend to Seek for no reason. This is solved with a WriteAt method.

@sahib, please pay attention to how the overlay limits are handled. I am not sure I fully grasped what they are used for. The struct field had no comment :(

evgmik added 16 commits January 17, 2021 22:47
Also added explicitly_pinned to the list of known Xattributes
The default MaxReadahead is 4 kB. By changing it to 128 kB,
we get fuse Read speed from 12.5 MB/s to 33 MB/s. Nice win!
I might be delusional, but logging every call to fuse Read
has a throughput penalty. Even with a maximum read size of 128 kB,
it gets called quite often for large files.

With the default MaxReadahead=4 kB, that logging leads to a quite catastrophic speed drop.
This is because Write now happens in much larger chunks, 128 kB instead of
8 kB, and the kernel pulls in the file to be written.

Note: to get 50 MB/s I need to disable some write-operation logging;
it takes quite a lot of CPU overhead. See follow-up patches.
Apparently, when WritebackCache is enabled, the OS is quite spammy with
Getxattr requests for the attribute "security.capability". The old code used
to call cfs.Stat even if the requested attribute is not one we can
provide. The call costs time and decreases throughput by about 10 MB/s.

Now we check whether we can even provide an attribute, and only then fetch
the relevant info.
…roughput

As I mentioned before, Write with WritebackCache enabled asks for
Getxattr way too often. If we report every call, our throughput is
lower by about 10 MB/s.
This buffers files at the OS level, so subsequent reads are
instantaneous (as long as they are cached).
Simplifies the writing infrastructure by getting rid of the internal
write buffer. Everything is done with the stream overlay.
@evgmik evgmik linked an issue Jan 23, 2021 that may be closed by this pull request
Owner
@sahib sahib left a comment


Enabling FUSE optimizations at mount: caching and larger buffer sizes. This gives quite large speed improvements; I even had to think about disabling debug output, since it takes time comparable to the disk I/O.

Yep, logging is always surprisingly expensive.

The main slowdown when using the overlay was a useless Seek on write: we write to the in-memory overlay but ask the backend to Seek for no reason. This is solved with a WriteAt method.

I think the Seek was pointless indeed, probably a leftover from pre-overlay times. I think it could also be factored out of the normal handle.Write() method. On the other hand, we don't use that otherwise, so it might be removed completely too. Also, the old write did some metadata staging of the file size, which was very CPU intensive, so it was not only the Seek.

@sahib, please pay attention to how the overlay limits are handled. I am not sure I fully grasped what they are used for. The struct field had no comment :(

It's for the case that a file gets extended or truncated. Imagine a file that has 4 KB but then gets overwritten by a 1 MB file. The file size in catfs would still be 4 KB before Flush() was called, but the fuse layer/overlay already knows that things got bigger. The same goes for the opposite, truncating a file (you can technically also make a file bigger by truncating, but that's a confusing term). The limit controls how big the file currently is. I left a comment about the file size (which was wrong before too), which we definitely need to check.
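Conceptually, the idea reads like the following sketch. It is purely illustrative (the type and method names are made up, not brig's actual overlay code): the limit tracks the file's current size as fuse sees it, while catfs only learns the new size on Flush().

package overlaysketch

// layerSketch is a stand-in for the real overlay layer, only to show how
// a limit field could track the current size before Flush().
type layerSketch struct {
	limit int64 // current size of the file as the fuse layer sees it
}

// growTo would be called from writeAt: a write past the old end of file
// extends the limit immediately, while catfs still reports the old size.
func (l *layerSketch) growTo(off int64, n int) {
	if end := off + int64(n); end > l.limit {
		l.limit = end
	}
}

// truncateTo would be called on truncate: shrinking (or extending) via
// truncate also moves the limit; catfs learns about it only on Flush().
func (l *layerSketch) truncateTo(size int64) {
	l.limit = size
}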


Otherwise very good. If the overlay now learns to swap out excess intervals to disk over a certain memory limit, then this would be a great improvement and would make the filesystem ready for day-to-day use.

// Mimics `WriteAt` from `io` package https://golang.org/pkg/io/#WriterAt
func (l *Layer) WriteAt(buf []byte, off int64) (n int, err error) {
	// Copy the buffer, since we cannot rely on it being valid forever.
	modBuf := make([]byte, len(buf))
Owner

Potential improvement here: use a sync.Pool to save a few allocations when writing many buffers. This also applies to the regular Write method and is probably more relevant there. Care must be taken that the buffer from pool.Get() has a suitable length.

Collaborator Author

Not sure that I follow here. modBuf is somewhat permanent (it sits in the overlay until flushed). Pool implies that the data would soon be deleted. Besides, the manual says "Any item stored in the Pool may be removed automatically at any time without notification." This sounds very scary.

I guess I do not know the internals of the GC well enough to really see the difference.

Owner

That documentation is a little misleading. Nothing gets garbage collected while you still use it (i.e. while you hold a reference to it). modBuf can be Put() back to the pool after flush, and on L235 above you can do a pool.Get() to fetch a buffer that was either used before or is newly allocated. Go is free to do whatever it likes in the time between Put and Get.

Collaborator Author

I see now. Can we move it to a separate issue?
From a quick reading, a pool is not suited for objects of different sizes. So if len(buf) differs from the result of a Get, we would have to put it back. But I see no way to iterate over the pool to see if there is a suitable candidate. Besides, it would require fully rethinking how the overlay works.

Owner

You would just get an item from the pool and check if it's reasonably sized (if it's larger, cut it down); if not, allocate a new one the normal way and use that. But sure, we can do it once the overlay is touched in general.
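For the record, a minimal sketch of that approach (the package name, bufPool, and the 128 kB default are illustrative assumptions, not brig's actual code): fetch a buffer from the pool, fall back to a normal allocation if it is too small, and Put() it back once the overlay has been flushed.

package overlaysketch

import "sync"

// bufPool hands out reusable byte slices so that not every fuse Write
// has to allocate a fresh copy buffer.
var bufPool = sync.Pool{
	New: func() interface{} {
		// 128 kB matches the largest write chunk expected with
		// WritebackCache enabled (an assumption for this sketch).
		return make([]byte, 128*1024)
	},
}

// copyForOverlay returns a private copy of buf, reusing pooled memory
// when the pooled buffer is large enough.
func copyForOverlay(buf []byte) []byte {
	modBuf := bufPool.Get().([]byte)
	if cap(modBuf) < len(buf) {
		// Pooled buffer is too small: allocate the normal way instead.
		modBuf = make([]byte, len(buf))
	}
	modBuf = modBuf[:len(buf)]
	copy(modBuf, buf)
	return modBuf
}

// releaseBuf returns a buffer to the pool, e.g. after Flush() has written
// the overlay intervals to the backend.
func releaseBuf(buf []byte) {
	bufPool.Put(buf[:cap(buf)])
}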

	// very likely a sequential read/write request;
	// no need to seek since we are already pointing to the right position
	return l.pos, nil
}
Owner

👍

fuse/handle.go Outdated
@@ -113,32 +65,47 @@ const maxInt = int(^uint(0) >> 1)

// Write is called to write a block of data at a certain offset.
func (hd *Handle) Write(ctx context.Context, req *fuse.WriteRequest, resp *fuse.WriteResponse) error {
	start := time.Now()
	// Note: even when the underlying process makes consecutive writes,
	// the kernel might send blocks out of order!
Owner

Can you elaborate a bit on this? Especially what the consequences for us are.

Collaborator Author

Can you elaborate a bit on this? Especially what the consequences for us are.

I will add a comment, but the full description is the following:
let's say at the OS level you do (cat or dd) largeFile > /brig-fuse-mount/file. You would naively expect the Write requests to be in FIFO order with steadily growing offsets (I called that consecutive), and they mostly are. What I saw is that occasionally the OS layer asks to write a block with a higher offset and then another with a lower offset.

I noticed this when I was tracking the offset position: with FIFO order I would not have to do an additional Seek if the position matches the offset.

It does not concern our current Write much. But one should not assume that Write can be streamed with an automatically positioned cursor.
Otherwise, it would be super easy to merge the overlays into one big modification, which might help with speed.

But it also makes the aforementioned swapping to disk hard, since the overlay might have a hole. So every block would be of a different size.

Collaborator Author
@evgmik evgmik Jan 23, 2021

addressed in d3aad7a

fuse/handle.go Outdated
// Mimics `WriteAt` from the `io` package: https://golang.org/pkg/io/#WriterAt
// The main idea is to not bother with a Seek pointer, since the underlying `overlay` works
// with intervals in memory and we do not need to `Seek` the backend, which is very time-expensive.
func (hd *Handle) WriteAt(buf []byte, off int64) (n int, err error) {
Owner

I would recommend lowercasing the writeAt method here. I was confused at first because I thought that fuse expects this interface and would call it directly. I used lowercase method names to also indicate that they don't do locking.

Collaborator Author

done in 03e48e6

// enabling MaxReadahead doubles or even triples Read throughput: 12 MB/s -> 25 or 33 MB/s
fuse.MaxReadahead(128 * 1024), // kernel uses at most 128 kB = 131072 B
// enabling WritebackCache doubles write speed to the buffer: 12 MB/s -> 24 MB/s
fuse.WritebackCache(), // writes will happen in much larger blocks, 128 kB instead of 8 kB
Owner

Nice nice nice.
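For context, a minimal sketch (assuming the bazil.org/fuse API; mountWithCaching and filesys are illustrative names, not brig's actual mount code) of how these two options end up in the mount call:

package fusesketch

import (
	"bazil.org/fuse"
	"bazil.org/fuse/fs"
)

// mountWithCaching mounts filesys at mountpoint with the two options
// discussed above and serves requests until the filesystem is unmounted.
func mountWithCaching(mountpoint string, filesys fs.FS) error {
	conn, err := fuse.Mount(
		mountpoint,
		fuse.MaxReadahead(128*1024), // the kernel caps readahead at 128 kB anyway
		fuse.WritebackCache(),       // lets the kernel batch writes into larger chunks
	)
	if err != nil {
		return err
	}
	defer conn.Close()

	return fs.Serve(conn, filesys)
}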

fuse/util.go Outdated
}
}
return false
}
Owner

I think this one could be faster. Calling getXattr often will always have the cost of calling listXattr (allocation heavy) and linearly checking for the attribute. Instead we should just use a map since we know and control the possible xattrs:

var xattrMap = map[string]bool{
    "user.brig.hash": true,
    // ...
}

// use xattrMap in listXattr (iterate over keys) and getXattr (checking if a key exists)

Alternatively, we could even think of doing something like this:

type xattrGetter func(cfs *catfs.FS, info catfs.StatInfo) ([]byte, error)

var xattrMap = map[string]xattrGetter{
    "user.brig.hash": func(cfs *catfs.FS, info catfs.StatInfo) ([]byte, error) {
        return []byte(info.TreeHash.B58String()), nil
    },
    // ....
}


// Then the implementation of getXattr would be trivial:
func getXattr(cfs *catfs.FS, name, path string, size uint32) ([]byte, error) {
    // todo: not sure if it comes with a null byte or not.
    handler, ok := xattrMap[name]
    if !ok {
        return nil, fuse.ErrNoXattr
    }

    info, err := cfs.Stat(path)
    if err != nil {
        return nil, errorize("getxattr", err)
    }

    return handler(cfs, info)
}

// listxattr would be simpler too.

The advantage of this approach would be: the lookup time of a map is constant, and validation happens in one go. Also, the xattr names would be in one central place.

Collaborator Author

I like it, mostly for the unified tracking of available attributes. Speed-wise, it will probably gain very little with only 3 attributes. Preparing a patch now.

Owner

The actual speed-up would only be noticeable when doing a lot of xattr fetching. But this would improve both the implementation and the speed, so it's a win-win.

Collaborator Author

done with c940779. I also fixed an old unnoticed bug in ce915ce and added a test case in 05b5737.

I think we are ready to merge.

} else {
	attr.Size = info.Size
}
attr.Size = info.Size
Owner

I'm not entirely sure, but info.Size might not be correct. When we truncate or extend a file, catfs still remembers the old size (it gets updated only after Stage(), i.e. on Flush()). We might need to use the limits of the overlay to determine the size at that point. At least that's what I remember; in any case we should check the file size with this test:

# file should have a size of 1MB.
$ dd if=/dev/zero of=/fuse/file.txt bs=1M count=1
# file size should increase with each write happening.
# (probably easier to test in a unit test where you can
#  control when the file is closed). 
$ dd if=/dev/zero of=/fuse/file.txt bs=1M count=20
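A hedged sketch of that unit-test idea (the mountpoint path and chunk sizes are illustrative; a real test would mount a temporary brig FUSE directory first): write the file in several chunks without closing it and check that the size reported via stat keeps growing.

package fusesketch

import (
	"os"
	"testing"
)

// TestSizeGrowsDuringWrite writes 20 chunks of 1 MB without closing the file
// and checks after each chunk that the size reported by stat keeps growing.
func TestSizeGrowsDuringWrite(t *testing.T) {
	path := "/tmp/brig-fuse-mount/file.txt" // assumed FUSE mountpoint

	fd, err := os.Create(path)
	if err != nil {
		t.Fatal(err)
	}
	defer fd.Close()

	chunk := make([]byte, 1<<20) // 1 MB of zeroes, like dd if=/dev/zero bs=1M
	for i := 1; i <= 20; i++ {
		if _, err := fd.Write(chunk); err != nil {
			t.Fatal(err)
		}
		// Force the writeback cache to flush so the fuse layer sees the data
		// before we stat the file.
		if err := fd.Sync(); err != nil {
			t.Fatal(err)
		}
		info, err := os.Stat(path)
		if err != nil {
			t.Fatal(err)
		}
		if want := int64(i) * int64(len(chunk)); info.Size() < want {
			t.Fatalf("after chunk %d: size %d, want at least %d", i, info.Size(), want)
		}
	}
}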

Owner

This might also explain why your test produced that much metadata: every fuse write called catfs.Handle.Write(), which stages the file to update the file size. That's actually a very dumb thing to do, and we should retrieve the current file size from the handle and not from catfs (where it is only changed on flush).

Collaborator Author
evgmik commented Jan 23, 2021

Yep, logging is always surprisingly expensive.
Well, everyone who has worked with microcontrollers knows that a printf brings everything down to dead-snail speed :)
I was surprised to see that it is still true.

I looked at other loggers, Zerolog and Zap. It seems that they get their speed from extremely rigid formatting, i.e. you cannot say Printf("%v", value), you have to spell out the type.
Which makes them painful for sprinkling debugging info.
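To illustrate the "spell out the type" point, here is a hedged sketch of zap's structured API (the field names and values are made up, not taken from this PR):

package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Instead of a printf-style log.Printf("wrote %v bytes in %v", n, took),
	// zap's core API wants every field spelled out with its type:
	n, took := 131072, 3*time.Millisecond
	logger.Info("fuse write",
		zap.Int("bytes", n),
		zap.Duration("took", took),
	)
}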

@sahib, please pay attention to how the overlay limits are handled. I am not sure I fully grasped what they are used for. The struct field had no comment :(

It's for the case that a file gets extended or truncated. Imagine a file that has 4 KB but then gets overwritten by a 1 MB file. The file size in catfs would still be 4 KB before Flush() was called, but the fuse layer/overlay already knows that things got bigger. The same goes for the opposite, truncating a file (you can technically also make a file bigger by truncating, but that's a confusing term). The limit controls how big the file currently is. I left a comment about the file size (which was wrong before too), which we definitely need to check.

The confusing part is

limit int64

there are both limit and fileSize. From your description it sounds like they should be the same.

Collaborator Author
evgmik commented Jan 23, 2021

If the overlay now learns to swap out excess intervals to disk over a certain memory limit, then this would be a great improvement and would make the filesystem ready for day-to-day use.

I know that it is on your list of needed features. But after thinking about it, I think it should be pushed to the future or even abandoned.

I am thinking about the use cases:

  1. brig is a syncer.
  • A user copies a file (maybe a very large one) in a single command. Yes, we ask for the size of the file to be allocated in memory (and about the same amount in the OS cache).
    This is a transient memory demand; once we have flushed the overlay, the memory is free. Moving to swap would cost us speed.
  2. A user edits a small file directly in brig-fuse. Small files are not a big deal.
  3. brig-fuse is the real file system.
  • I.e. someone might run a database with random-access reads/writes. I see no efficient way to do that which would not cause heavy I/O to the hard drive for swapping.
  • The only easy way to do it is to make the backend block-based, instead of the current mode: one file, one hash. I guess there is a reason why real filesystems split files over multiple inodes.

I would argue that case 3 is not a brig use case. Doing an efficient FS is hard work (I still remember the sqlite speed drop on the transition from ext2 to ext3 due to some journaling). It is potentially doable: the internal streamed data is prepared in chunks (archived and encrypted).
But then we are reinventing ipfs file storage.

Case 1 would greatly benefit if there were a flag from the OS saying that what is coming is a single "stage", but I see no such thing in the manuals.

Owner
sahib commented Jan 23, 2021

there are both limit and fileSize. From your description it sounds like they should be the same.

No, as described above they're not. limit should be the actual size of the file; fileSize is the size of the original file before it was opened in fuse. Disclaimer: I didn't re-read the code, that's just my faulty memory, so reality might be different or buggy.

But in your PR you removed the part where on every write the file was re-staged with the up-to-date size. While this removal is good, the file size reported by fuse needs to stay up to date on every Write. You are going to need to expose it from the overlay through the handle to fuse's getattr.

I know that it is on your list of needed features. But after thinking about it, I think it should be pushed to the future or even abandoned.

Your point 1 is already the reason why we need this. Without it we can only copy files up to the size of our memory. The limit at which we start swapping to disk can be set high enough that it is not hit during normal usage.

I don't expect anybody to run a database on a fuse mount, but people might still edit large files for whatever reason. This may not be a typical use case and we don't need to be fast here, but we still need to support it to a certain extent.

EDIT: Clarification: we can of course postpone it, and it shouldn't be part of this PR anyway.

Collaborator Author
evgmik commented Jan 23, 2021

there are both limit and fileSize. From your description it sounds like they should be the same.

No, as described above they're not. limit should be the actual size of the file; fileSize is the size of the original file before it was opened in fuse. Disclaimer: I didn't re-read the code, that's just my faulty memory, so reality might be different or buggy.

But in your PR you removed the part where on every write the file was re-staged with the up-to-date size. While this removal is good, the file size reported by fuse needs to stay up to date on every Write. You are going to need to expose it from the overlay through the handle to fuse's getattr.

I know that it is on your list of needed features. But after thinking about it, I think it should be pushed to the future or even abandoned.

Your point 1 is already the reason why we need this. Without it we can only copy files up to the size of our memory. The limit at which we start swapping to disk can be set high enough that it is not hit during normal usage.

I don't expect anybody to run a database on a fuse mount, but people might still edit large files for whatever reason. This may not be a typical use case and we don't need to be fast here, but we still need to support it to a certain extent.

EDIT: Clarification: we can of course postpone it, and it shouldn't be part of this PR anyway.

Let's postpone this particular point. The OS keeps track of the file size internally; since it does the writes, it knows best what the file size is. How I checked: I dd'ed a 1 GB file to brig-fuse, which takes a while, and I ran ls fuse/file during the dd. I saw that the file size was growing in sync with the write requests.

Collaborator Author
@evgmik evgmik left a comment

OK, now I see your point: big files on machines with little memory. Yep, this should be addressed.
But I suggest moving it to a different issue.

My particular PR does not change the current memory demand, so it just maintains the status quo.

Collaborator Author
evgmik commented Jan 23, 2021

What is wrong with travis? All tests pass on my machine.

Owner
sahib commented Jan 23, 2021

How I checked: I dd'ed a 1 GB file to brig-fuse, which takes a while, and I ran ls fuse/file during the dd. I saw that the file size was growing in sync with the write requests.

That's unexpected. My understanding is that it still makes a call to getattr() to get the size. Will double-check this tomorrow.

[...] But I suggest moving it to a different issue.

Sure, that needs another PR. I would keep #49 open since it's a long-running issue with some more things to do.
If you want, I can continue improving the overlay - since I brought that beast into life anyway.

What is wrong with travis? All tests pass on my machine.

Good question. TestClientFetchPatch broke for some reason. Maybe a flaky test. I re-ran it and it's green now.
Can you see the logs here? Can you also re-trigger the build with the "build now" button?

@sahib sahib closed this Jan 23, 2021
@sahib sahib reopened this Jan 23, 2021
Owner
sahib commented Jan 23, 2021

Oops, accidentally hit the close button.

Collaborator Author
evgmik commented Jan 24, 2021

Sure, that needs another PR. I would keep #49 open since it's a long-running issue with some more things to do.
If you want, I can continue improving the overlay - since I brought that beast into life anyway.

I would greatly appreciate it. I think I understand what it should do, but the internal machinery does not fit into my brain yet.

What is wrong with travis? All tests pass on my machine.

Good question. TestClientFetchPatch broke for some reason. Maybe a flaky test. I re-ran it and it's green now.
Can you see the logs here? Can you also re-trigger the build with the "build now" button?

I see the logs, but I'm not sure I have a rebuild button. It passed the checks, so it is probably disabled.

Pretty much every first call comes with req.Size=0. Our job is to respond with all we have.
If the response resp.Xattr does not fit into the specified buffer req.Size, fuse will internally
truncate it (see
https://github.com/bazil/fuse/blob/fb710f7dfd05053a3bc9516dd5a7a8f84ead8aab/fuse.go#L1656
and similarly
https://github.com/bazil/fuse/blob/fb710f7dfd05053a3bc9516dd5a7a8f84ead8aab/fuse.go#L1618)
and relay it to the
caller with an ERANGE error (buffer too small). The caller should then ask
again with a larger buffer.

On the other hand, truncation to zero is a disaster, since the caller
assumes that there are no xattributes.

Note that apparently, even if the call comes with req.Size=0, internally it
has some small buffer. So our first reply often fits and the call ends
without error.
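For illustration, a hedged sketch of such a Listxattr handler with the bazil.org/fuse API (the Entry type and the xattr names are placeholders, not the exact code that was merged): the handler simply appends everything it knows, and bazil takes care of truncating to req.Size and signalling ERANGE.

package fusesketch

import (
	"context"

	"bazil.org/fuse"
)

// Entry is a stand-in for brig's fuse node type.
type Entry struct{}

// xattrNames lists the attributes we can answer for (illustrative values).
var xattrNames = []string{
	"user.brig.hash",
	"user.brig.explicitly_pinned",
}

// Listxattr reports every xattr name we support; bazil itself truncates the
// reply to req.Size and answers ERANGE if the caller's buffer is too small.
func (e *Entry) Listxattr(ctx context.Context, req *fuse.ListxattrRequest, resp *fuse.ListxattrResponse) error {
	for _, name := range xattrNames {
		resp.Append(name)
	}
	return nil
}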
Owner
@sahib sahib left a comment

Looks good to me.

@evgmik evgmik merged commit 6173448 into develop Jan 24, 2021
@evgmik evgmik deleted the feature/fuse-improvements branch January 24, 2021 14:36
Development

Successfully merging this pull request may close these issues.

fix: stabilize & improve the FUSE layer