Snappy, previously known as Zippy, is a lossless data compression algorithm developed by Google and used primarily on text. The algorithm focuses on time-efficient compression rather than peak compression ratio. This means it achieves a modest compression ratio of 1.5x to 1.7x for plain text and 2x-4x for HTML, but can compress and decompress much faster than most other algorithms. Snappy is designed around efficient 64-bit operations on modern consumer processors, boasting speeds upward of 250 MB/s for compression and 500 MB/s for decompression on a single core of a circa-2015 Core i7. Part of this speed advantage comes from the fact that Snappy doesn't use an entropy coder, like a Huffman tree or arithmetic encoder.
We're used to compression algorithms focusing almost entirely on compression ratio, but it turns out there are some intuitive applications for a high-throughput algorithm. Google uses Snappy in several of its own projects, such as BigTable, MapReduce, and its internal RPC system, and it has been adopted by third-party systems like MongoDB. A keen reader may recognize many of these as database tools, which makes sense when you consider what is going on under the hood. A database receives a query and then returns all matching results. For a small business this may be hundreds of lines of text, but at the scale of a tech giant, a result set could easily run to hundreds of thousands of lines. Since compute resources are generally more expensive than storage resources, Google can use Snappy to make these queries very fast, at the (comparatively low) cost of extra storage.
Snappy defines both a core block format and a framing (stream) format; the framing format just chunks large inputs into blocks, which doesn't add much to an introduction, so we will focus on the block format itself. Let's start off with a string that we'd like to encode:
"Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project."
Snappy turns this string into an array of bytes, because it is byte-oriented rather than bit-oriented. We start by writing the length of the uncompressed data (as a little-endian varint) at the beginning of the compressed stream. Once compression has started, we have four different element types to consider (a sketch of parsing their tag bytes follows the list):
- Literal, or uncompressed data. Stored with a 00 tag byte; the upper 6 bits store the length of the literal minus one. Literals longer than 60 bytes instead store the length in the bytes following the tag byte, with a field value of 60 meaning 1 length byte, 61 meaning 2, 62 meaning 3, and 63 meaning 4.
- A copy with tag byte 01. The length (4 to 11 bytes, stored as length-4) occupies 3 bits of the tag byte, and the offset is 11 bits: 3 bits in the tag byte plus the byte that follows it.
- A copy with tag byte 10. The length is stored as 6 bits in the tag byte and the offset as a 2-byte little-endian integer after the tag byte.
- A copy with tag byte 11. The length is stored as 6 bits in the tag byte and the offset as a 4-byte little-endian integer after the tag byte.
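To make that layout concrete, here's a minimal sketch in Python of how a decoder might classify a tag byte and pull out the raw fields described above (parse_tag is just an illustrative name, not part of any Snappy library):

```python
def parse_tag(tag: int):
    """Split a Snappy tag byte into its element type and raw fields.

    The low 2 bits select the element kind (0 = literal, 1/2/3 = copies
    with 1-, 2-, or 4-byte offsets); the high 6 bits are interpreted
    differently depending on that kind.
    """
    element_type = tag & 0b11   # low two bits: which of the four elements
    field = tag >> 2            # remaining six bits

    if element_type == 0:
        # Literal: field is length-1, or 60..63 meaning "length follows in 1-4 bytes"
        return ("literal", field)
    if element_type == 1:
        length = (field & 0b111) + 4   # 3 length bits, stored as length-4
        offset_high = field >> 3       # top 3 bits of the 11-bit offset
        return ("copy-1-byte-offset", length, offset_high)
    if element_type == 2:
        return ("copy-2-byte-offset", field + 1)  # 6 length bits, stored as length-1
    return ("copy-4-byte-offset", field + 1)      # 6 length bits, stored as length-1

# The two tag bytes we'll meet in the example stream below:
print(parse_tag(0xF0))  # ('literal', 60)              -> length lives in the next byte
print(parse_tag(0x09))  # ('copy-1-byte-offset', 6, 0) -> copy 6 bytes, offset high bits 0
```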
A literal, as mentioned before, is just the uncompressed data fed directly as a byte array into the compressed stream. A copy, on the other hand, is essentially an indicator that whatever is encoded there has already appeared earlier in the stream. In our example, both "Wikipedia" and "encyclopedia" contain the substring "pedia". A copy at "encyclopedia" stores an offset telling the decoder how many bytes to look back in the data it has produced so far, so it can reuse whatever is at that location instead of storing it literally.
We'll run through the example with actual data now. When we encode our stream and represent it as hex, we get the following.
ca02 f042 5769 6b69 7065 6469 6120 6973
2061 2066 7265 652c 2077 6562 2d62 6173
6564 2c20 636f 6c6c 6162 6f72 6174 6976
652c 206d 756c 7469 6c69 6e67 7561 6c20
656e 6379 636c 6f09 3ff0 1470 726f 6a65
6374 2e00 0000 0000 0000 0000 0000 0000
For now, let's focus on the first line:
ca02 f042 5769 6b69 7065 6469 6120 6973 ...BWikipedia is
On the left-hand side we have the compressed stream in hex, and the right shows what it decodes to. Bytes that belong to a format element rather than the text itself are shown as periods (the "B" above is the length byte 0x42, which just happens to be printable ASCII).
Remember we said the stream starts with the length of the uncompressed data, stored as a little-endian varint. In our case, those are the bytes 0xca02. Decoded into binary we get the following:
11001010 00000010
The varint encoding Snappy uses stores the length seven bits at a time: the lower seven bits of each byte carry length data, and the high bit is a continuation flag, set to 1 when another length byte follows and 0 on the final byte. Because the encoding is little-endian, the first byte holds the least significant seven bits.
Our first byte, 0xca = 11001010, has its high bit set, so at least one more length byte follows. The next byte, 0x02 = 00000010, has its high bit clear, so it is the final length byte. Concatenating the seven payload bits of each byte, with the later byte's bits in front, gives:
00000101001010
Or 330 in decimal, which is exactly the length in bytes of the uncompressed data.
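That decode is simple enough to sketch in a few lines of Python; read_uvarint is my own name for it, not an official API:

```python
def read_uvarint(buf: bytes, pos: int = 0):
    """Decode a little-endian varint starting at buf[pos].

    Each byte contributes its low 7 bits, least significant group first;
    a set high bit means another length byte follows. Returns the value
    and the position of the first byte after the varint.
    """
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        value |= (byte & 0x7F) << shift   # place this byte's 7 payload bits
        pos += 1
        if byte & 0x80 == 0:              # high bit clear: this was the last byte
            return value, pos
        shift += 7

# The preamble of our example stream:
print(read_uvarint(bytes([0xCA, 0x02])))  # (330, 2)
```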
Moving on to our next element in the stream, 0xf042. Converting to binary gives us:
11110000 01000010
Since the element type is encoded in the lower two bits of the tag byte, we know the element here is 00, or a literal. Recall that the upper 6 bits of a literal tag either encode the length directly (as length-1, for field values under 60) or say how many of the following bytes store the length (for field values 60 through 63). In our case, 111100 is 60 in decimal, which means 1 following byte stores the length of the literal. Decoding that byte gives us 66 in decimal (note that the Snappy format encodes the length of the literal as length-1), so our final literal length is 67. That means the following 67 bytes are not compressed and can be read literally as text. Sure enough:
ca02 f042 5769 6b69 7065 6469 6120 6973
2061 2066 7265 652c 2077 6562 2d62 6173
6564 2c20 636f 6c6c 6162 6f72 6174 6976
652c 206d 756c 7469 6c69 6e67 7561 6c20
656e 6379 636c 6f09 3ff0 1470 726f 6a65
6374 2e00 0000 0000 0000 0000 0000 0000
the 67 bytes after the tag and length bytes translate directly from hex to ASCII, giving the following output:
Wikipedia is a free, web-based, collaborative, multilingual encyclo
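Here's a rough sketch, under the assumptions above, of how a decoder might handle a literal element like this one (read_literal is an illustrative name; real implementations are more careful about bounds checking):

```python
def read_literal(buf: bytes, pos: int):
    """Read one literal element whose tag byte sits at buf[pos].

    Returns (literal_bytes, next_position). The 6-bit field is length-1,
    unless it is 60-63, in which case length-1 is stored in that many
    extra bytes (1-4) after the tag, little-endian.
    """
    field = buf[pos] >> 2
    pos += 1
    if field < 60:
        length = field + 1                     # short literal: length in the tag itself
    else:
        extra = field - 59                     # 60 -> 1 byte, 61 -> 2, 62 -> 3, 63 -> 4
        length = int.from_bytes(buf[pos:pos + extra], "little") + 1
        pos += extra
    return buf[pos:pos + length], pos + length

# Tag 0xf0 (field = 60) plus length byte 0x42 (66) gives a 67-byte literal:
stream = bytes([0xF0, 0x42]) + b"Wikipedia is a free, web-based, collaborative, multilingual encyclo"
text, next_pos = read_literal(stream, 0)
print(len(text), text.decode())  # 67 Wikipedia is a free, ... encyclo
```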
The byte directly afterwards is 0x09, or in binary:
00001001
Thus, this element type is 01, a copy where the length is stored as 3 bits in the tag byte and the offset is stored as 11 bits: the remaining 3 bits of the tag byte plus the 8 bits of the following byte. Our full copy element is therefore:
00001001 00111111
The length field is 2, and since this copy type stores the length as length-4, the copy is 6 bytes long; the offset is 63. If we move back 63 bytes, we reach the part of the literal corresponding to the "pedia " in "Wikipedia ". Thus, this copy expands to "pedia ". This makes sense, as we are currently at the "pedia " in "encyclopedia " in the input stream.
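A small sketch of resolving this kind of copy, mirroring the bit layout we just walked through (apply_copy1 is my own illustrative helper, not Snappy's API):

```python
def apply_copy1(tag: int, offset_byte: int, output: bytearray):
    """Resolve a '01' copy: append `length` bytes found `offset` bytes
    back in the output decoded so far."""
    length = ((tag >> 2) & 0b111) + 4         # 3 bits, stored as length-4 -> 6 here
    offset = ((tag >> 5) << 8) | offset_byte  # 3 high bits from the tag + 8 from the next byte
    start = len(output) - offset
    for i in range(length):                   # byte by byte, so overlapping copies also work
        output.append(output[start + i])

out = bytearray(b"Wikipedia is a free, web-based, collaborative, multilingual encyclo")
apply_copy1(0x09, 0x3F, out)
print(out.decode())  # ...multilingual encyclopedia  (the copy contributed "pedia ")
```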
Finally, 0xf014 represents a literal with a length of 21 bytes. This gives us the completed encoding:
ca02 f042 5769 6b69 7065 6469 6120 6973 ...BWikipedia is
2061 2066 7265 652c 2077 6562 2d62 6173 a free, web-bas
6564 2c20 636f 6c6c 6162 6f72 6174 6976 ed, collaborativ
652c 206d 756c 7469 6c69 6e67 7561 6c20 e, multilingual
656e 6379 636c 6f09 3ff0 1470 726f 6a65 encyclo.?..proje
6374 2e00 0000 0000 0000 0000 0000 0000 ct.
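If you'd like to reproduce something like this yourself, the python-snappy bindings expose compress and uncompress functions; keep in mind the exact bytes you get back can differ between Snappy implementations and versions, since encoders have some freedom in how they emit literals and copies:

```python
import snappy  # pip install python-snappy

data = b"Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project."
compressed = snappy.compress(data)

print(compressed.hex())                       # inspect the varint, tag bytes, and literals yourself
assert snappy.uncompress(compressed) == data  # round-trips back to the original text
```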