Build status for tebako fork: Ubuntu · MacOS · Alpine · Windows-MSys

DwarFS

The Deduplicating Warp-speed Advanced Read-only File System.

A fast, high-compression read-only file system for Linux and Windows.

Overview

[Windows Screen Capture]

[Linux Screen Capture]

DwarFS is a read-only file system with a focus on achieving very high compression ratios, in particular for very redundant data.

This probably doesn't sound very exciting, because if it's redundant, it should compress well. However, I found that other read-only, compressed file systems don't do a very good job at making use of this redundancy. See here for a comparison with other compressed file systems.

DwarFS also doesn't compromise on speed, and for my use cases I've found it to perform on par with, or better than, SquashFS. For my primary use case, DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS, and it uses fewer CPU resources.

To give you an idea of what DwarFS is capable of, here's a quick comparison of DwarFS and SquashFS on a set of video files with a total size of 39 GiB. The twist is that each unique video file has two sibling files with a different set of audio streams (I didn't make this up; this is an actual use case). So there's redundancy in both the video and audio data, but as the streams are interleaved and identical blocks are typically very far apart, it's quite challenging to make use of that redundancy for compression. SquashFS essentially fails to compress the source data at all, whereas DwarFS is able to reduce the size by almost a factor of 3, which is close to the theoretical maximum:

$ du -hs dwarfs-video-test
39G     dwarfs-video-test
$ ls -lh dwarfs-video-test.*fs
-rw-r--r-- 1 mhx users 14G Jul  2 13:01 dwarfs-video-test.dwarfs
-rw-r--r-- 1 mhx users 39G Jul 12 09:41 dwarfs-video-test.squashfs
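
To illustrate why that's hard, and how it can still be done, here's a minimal, self-contained sketch of duplicate-segment detection with a rolling hash. It is in the spirit of DwarFS's segmentation, but it is not the actual implementation; the window size, hash constants, and toy input are all made up for the example:

// A minimal sketch (not DwarFS's actual implementation) of finding
// duplicate segments with a rolling hash. Window size, hash constants
// and the toy input are invented for this example.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Polynomial rolling hash over a fixed-size window. Sliding the window by
// one byte is an O(1) update, which keeps the scan linear in the input size.
class RollingHash {
 public:
  explicit RollingHash(size_t window) {
    for (size_t i = 1; i < window; ++i) pow_ *= kBase;
  }
  void push(uint8_t in) { hash_ = hash_ * kBase + in; }
  void roll(uint8_t out, uint8_t in) { hash_ = (hash_ - out * pow_) * kBase + in; }
  uint64_t value() const { return hash_; }

 private:
  static constexpr uint64_t kBase = 1099511628211ULL;  // arbitrary odd multiplier
  uint64_t hash_ = 0;
  uint64_t pow_ = 1;
};

// Report windows of `window` bytes that are byte-identical to an earlier
// occurrence -- no matter how far back that occurrence is.
void find_duplicate_segments(const std::vector<uint8_t>& data, size_t window) {
  if (data.size() < window) return;
  RollingHash rh(window);
  for (size_t i = 0; i < window; ++i) rh.push(data[i]);
  std::unordered_map<uint64_t, size_t> seen{{rh.value(), 0}};  // hash -> offset
  for (size_t pos = 1; pos + window <= data.size(); ++pos) {
    rh.roll(data[pos - 1], data[pos + window - 1]);
    auto [it, inserted] = seen.emplace(rh.value(), pos);
    if (!inserted && std::equal(data.begin() + it->second,
                                data.begin() + it->second + window,
                                data.begin() + pos)) {  // rule out hash collisions
      std::cout << "segment at " << pos << " duplicates offset " << it->second
                << " (distance " << pos - it->second << " bytes)\n";
      pos += window - 1;  // skip the match; a real segmenter would extend it
      if (pos + window > data.size()) break;
      RollingHash fresh(window);  // re-prime the hash after the jump
      for (size_t i = pos; i < pos + window; ++i) fresh.push(data[i]);
      rh = fresh;
      seen.emplace(rh.value(), pos);
    }
  }
}

int main() {
  // 1 MiB of random bytes with an identical chunk planted twice, roughly
  // 900 KiB apart -- a stand-in for interleaved duplicate streams.
  std::vector<uint8_t> data(1 << 20);
  std::mt19937 rng(42);
  for (auto& b : data) b = uint8_t(rng());
  std::string chunk = "identical stream chunk that appears twice, very far apart...";
  std::copy(chunk.begin(), chunk.end(), data.begin() + 100);
  std::copy(chunk.begin(), chunk.end(), data.begin() + 900000);
  find_duplicate_segments(data, 48);
}

Because the lookup goes through a hash map, the distance between the two copies is irrelevant; contrast this with compressors that can only exploit redundancy within a limited window.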

While this compression result is already impressive, it gets even better. When mounting the SquashFS image and performing a random-read throughput test using fio-3.34, both squashfuse and squashfuse_ll top out at around 230 MiB/s:

$ fio --readonly --rw=randread --name=randread --bs=64k --direct=1 \
      --opendir=mnt --numjobs=4 --ioengine=libaio --iodepth=32 \
      --group_reporting --runtime=60 --time_based
[...]
   READ: bw=230MiB/s (241MB/s), 230MiB/s-230MiB/s (241MB/s-241MB/s), io=13.5GiB (14.5GB), run=60004-60004msec

DwarFS, however, manages to sustain random read rates of 20 GiB/s:

  READ: bw=20.2GiB/s (21.7GB/s), 20.2GiB/s-20.2GiB/s (21.7GB/s-21.7GB/s), io=1212GiB (1301GB), run=60001-60001msec

Distinct features of DwarFS are:

  • Clustering of files by similarity using a similarity hash function. This makes it easier to exploit the redundancy across file boundaries (a toy version of this ordering is sketched after this list).

  • Segmentation analysis across file system blocks in order to reduce the size of the uncompressed file system. This saves memory when using the compressed file system and thus potentially allows for higher cache hit rates as more data can be kept in the cache.

  • Highly multi-threaded implementation. Both the file system creation tool and the FUSE driver are able to make good use of the many cores of your system.
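
To make the first point a bit more concrete, here is a minimal sketch of similarity-based file ordering. The signature below is a toy SimHash-style construction, not DwarFS's actual similarity hash, and the file names and contents are invented; the idea is just that ordering input files by such a signature places similar files next to each other before segmentation and compression see them:

// A toy SimHash-style similarity signature -- not DwarFS's actual
// similarity hash. Files with similar content end up with close (here:
// identical) signatures, so sorting by signature clusters them.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

uint64_t similarity_hash(const std::vector<uint8_t>& data) {
  int counter[64] = {0};
  for (size_t i = 0; i + 4 <= data.size(); ++i) {
    // FNV-1a over each 4-byte window; every window votes on all 64 bits.
    uint64_t h = 14695981039346656037ULL;
    for (size_t j = 0; j < 4; ++j) { h ^= data[i + j]; h *= 1099511628211ULL; }
    for (int b = 0; b < 64; ++b) counter[b] += ((h >> b) & 1) ? 1 : -1;
  }
  uint64_t sig = 0;
  for (int b = 0; b < 64; ++b)
    if (counter[b] > 0) sig |= uint64_t(1) << b;
  return sig;
}

int main() {
  auto bytes = [](const std::string& s) {
    return std::vector<uint8_t>(s.begin(), s.end());
  };
  // Two near-duplicate "files" and one unrelated one (hypothetical names).
  std::vector<std::pair<std::string, std::vector<uint8_t>>> files = {
      {"video1_en.mkv", bytes(std::string(1000, 'v') + "english audio")},
      {"readme.txt", bytes("completely unrelated text, nothing in common")},
      {"video1_de.mkv", bytes(std::string(1000, 'v') + "german audio")},
  };
  // Order files by signature so similar files become adjacent in the image.
  // (A real image builder would compute each signature once, not per compare.)
  std::stable_sort(files.begin(), files.end(), [](const auto& a, const auto& b) {
    return similarity_hash(a.second) < similarity_hash(b.second);
  });
  for (const auto& f : files) std::cout << f.first << '\n';
}

Running this prints the two near-duplicate video files next to each other, which is exactly the property that lets the segmenter exploit redundancy across file boundaries.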

History

I started working on DwarFS in 2013. My main use case and major motivation was that I had several hundred different versions of Perl that were taking up around 30 gigabytes of disk space, and I was unwilling to spend more than 10% of my hard drive on keeping them around for when I happened to need them.

Up until then, I had been using Cromfs for squeezing them into a manageable size. However, I was getting more and more annoyed by the time it took to build the filesystem image and, to make things worse, more often than not it was crashing after about an hour or so.

I had obviously also looked into SquashFS, but never got anywhere close to the compression rates of Cromfs.

This alone wouldn't have been enough to get me to write DwarFS, but at around the same time, I was pretty obsessed with the recent developments and features of newer C++ standards and really wanted a C++ hobby project to work on. Also, I'd wanted to do something with FUSE for quite some time. Last but not least, I had been thinking about the problem of compressed file systems for a bit and had some ideas that I definitely wanted to try.

The majority of the code was written in 2013; after that, I did a couple of cleanups, bugfixes, and refactors every once in a while, but I never really got it to a state where I would feel happy releasing it. It was too awkward to build with its dependency on Facebook's (quite awesome) folly library, and it didn't have any documentation.

When I dug out the project again this year, things didn't look as grim as they used to. Folly now builds with CMake, so I just pulled it in as a submodule. Most other dependencies can be satisfied from packages that should be widely available. And I've written some rudimentary docs as well.

Building and Installing