Google, Bigtable, Compression, Zippy and BMDiff

A few months ago, when I was heads-down finalizing the distributed database in Spinn3r, I was exceedingly curious about what other DBs were using for compression.

GZip seems to be the obvious choice, but its compression speed isn’t very good compared to LZO.

Your disks are almost certainly going to be bottlenecked on IO (if you have a good DB design), so compressing the data means you can trade CPU (which will almost certainly be idle) for reduced IO.

I remembered some notes about compression in the original Bigtable paper and decided to dig a bit deeper.

Apparently, there isn’t much information about what Google uses for compression in Bigtable, GFS, etc.

These notes were compiled from Jeff Dean’s talk in 2005 but I haven’t seen anything else referencing the subject.

Skip to 46:00 in the Dean talk to see the compression notes.

Andrew Hitchcock also took some notes on the talk:

There is a lot of redundant data in their system (especially through time), so they make heavy use of compression. He went kind of fast and I only followed part of it, so I’m just going to give an overview. Their compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s). Zippy is similar to LZW. It doesn’t compress as highly as LZW or gzip, but it is much faster. He gave an example of a web crawl they compressed with the system. The crawl contained 2.1B pages and the rows were named in the following form: “com.cnn.www/index.html:http”. The size of the uncompressed web pages was 45.1 TB and the compressed size was 4.2 TB, yielding a compressed size of only 9.2%. The links data compressed to 13.9% and the anchors data compressed to 12.7% of the original size.

Google is using an algorithm named BMDiff, referenced in Bentley and McIlroy’s DCC ’99 paper, “Data Compression Using Long Common Strings.”

They use BMDiff to compute a dictionary diff between all columns in a column family. This way, common strings between columns can be stored in a compressed dictionary to avoid duplicate storage.

This also helps to diff between previous versions of a page across compactions. A page stored in your index will probably have a LOT in common with the same page stored a month ago.
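To make the idea concrete, here’s a minimal sketch of the Bentley-McIlroy long-common-strings scheme in Python (my own illustration, not Google’s code; the op format, function names, and block size are arbitrary): fingerprint the input at fixed block boundaries, then emit copy references whenever a later window matches an earlier block.

```python
def bm_compress(data: bytes, b: int = 32):
    """Sketch of Bentley-McIlroy long common strings.

    Returns a list of ops: ('lit', bytes) or ('copy', offset, length).
    Fingerprints are stored only for blocks starting at multiples of b,
    so the table stays proportional to len(data) / b.
    """
    table = {}                  # fingerprint -> offset of an earlier block
    ops, lit_start, i = [], 0, 0
    while i + b <= len(data):
        fp = hash(data[i:i + b])            # a rolling hash in a real implementation
        cand = table.get(fp)
        if cand is not None and data[cand:cand + b] == data[i:i + b]:
            if lit_start < i:               # flush pending literals
                ops.append(('lit', data[lit_start:i]))
            length = b                      # extend the match as far as it goes
            while i + length < len(data) and data[cand + length] == data[i + length]:
                length += 1
            ops.append(('copy', cand, length))
            i = lit_start = i + length
        else:
            if i % b == 0:                  # only remember block-aligned fingerprints
                table[fp] = i
            i += 1
    if lit_start < len(data):
        ops.append(('lit', data[lit_start:]))
    return ops


def bm_decompress(ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == 'lit':
            out += op[1]
        else:
            _, off, length = op
            for k in range(length):         # byte-by-byte so overlapping copies work
                out.append(out[off + k])
    return bytes(out)
```

Keeping fingerprints only at block boundaries is what makes the pass cheap and biases it toward long matches, which is exactly the redundancy you get from the same page crawled a month apart.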

They then run the BMDiff output through Zippy (another compression algorithm they wrote). Apparently, it’s an LZO-like compressor tuned for speed.
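Putting the two passes together looks roughly like this (again only a sketch: Zippy isn’t public, so zlib at its fastest level stands in for the second pass, the op serialization is ad hoc, and it reuses the bm_compress/bm_decompress functions sketched above):

```python
import pickle
import zlib


def two_pass_compress(data: bytes) -> bytes:
    ops = bm_compress(data)                       # pass 1: long-range dedup
    return zlib.compress(pickle.dumps(ops), 1)    # pass 2: fast byte-level compressor


def two_pass_decompress(blob: bytes) -> bytes:
    return bm_decompress(pickle.loads(zlib.decompress(blob)))
```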

I’d like to see MySQL/Drizzle support more of these higher-level DB primitives directly, rather than having to build support for them above the DB level.

The zlib COMPRESS/UNCOMPRESS support in MySQL is horrible (the output isn’t a plain zlib stream, so the binary data is not directly compatible with other zlib implementations).
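To interoperate, you end up peeling off MySQL’s header yourself. A rough sketch, assuming the documented COMPRESS() layout (a 4-byte little-endian uncompressed length followed by a standard zlib stream, with empty strings stored as empty strings):

```python
import struct
import zlib


def mysql_uncompress(blob: bytes) -> bytes:
    """Decode a value produced by MySQL's COMPRESS().

    Assumes the documented layout: 4-byte little-endian uncompressed
    length, then a standard zlib stream.
    """
    if not blob:
        return b""
    (expected_len,) = struct.unpack("<I", blob[:4])
    data = zlib.decompress(blob[4:])
    assert len(data) == expected_len
    return data


def mysql_compress(data: bytes) -> bytes:
    """Produce a value in the same layout, so UNCOMPRESS() should accept it."""
    if not data:
        return b""
    return struct.pack("<I", len(data)) + zlib.compress(data)
```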

Supporting bmdiff, LZO, bloom filters, etc. in DBs is going to be necessary if Drizzle is to support larger distributed databases.
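For a sense of how small a primitive like a bloom filter is, here’s a toy sketch (bit-array size, hash count, and the use of md5 are arbitrary choices of mine, not anything from Bigtable):

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k hashed probes into an m-bit array.

    Lookups can return false positives but never false negatives,
    which is what lets a store skip disk reads for missing keys.
    """

    def __init__(self, m_bits: int = 1 << 20, k: int = 5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: bytes):
        for i in range(self.k):
            digest = hashlib.md5(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, key: bytes) -> None:
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))
```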

There are a few UDFs I want to write, so maybe I’ll take these on at the same time.

Come to think of it, crypto support isn’t that hot in MySQL either.

  1. There is an open source implementation of a bmdiff/zippy-like library in Hypertable, called bmz. It’s written in pure ANSI C and can be easily embedded in any project (that was a goal I had in mind, so it’s written in C instead of C++ like the rest of Hypertable). The performance is similar to Google’s published numbers for small blocks (the size of an SSTable block, 64KB-128KB). The speed will go down dramatically (to about 40MB/s) when the block size is big enough to thrash the processor cache, as the algorithm uses a hashtable.

    I wrote it (the bmdiff part; lzo is used for the optional final pass) mostly for experiments, without much optimization (so there is room for improvement). The Rabin-Karp-like hash functions can be easily plugged in for experiments.

    Feel free to give it a try and ping me on how to use the library (documentation is in bmz.h).

  2. Hi.

    Look at what the Hadoop/HBase guys use for compression. Compression is key in those systems.

  3. Steven Roussey

    The quick and dirty implementation, depending on your app, is to use gzip for a column, but only compress/decompress when the data is used — usually on a client webserver. Thus the compression cost is moved into a more scalable layer.

    Ideally, this would get built into MySQL/Drizzle such that decompression happens when the column’s data is accessed, and if that doesn’t happen on the server, then the flag is passed on to the client. So the client/server protocol would need to be updated, as well as both the client and the server.

  4. Hey, Kevin,

    Why would you delete my comment on an existing easily reusable open source implementation of bmdiff?

    It’s not like we’re promoting a commercial product. It’s open source and free for all to use.

    __Luke

  5. elhoim

    Did you find the paper explaining the algorithm?
    Could you point me in the right direction please?

  6. Which paper? On Zippy? I don’t think there is one.

  7. elhoim

    On BMDiff, I found the ’99 DCC paper on IEEE Xplore, but nothing more…

  8. Hey Luke.

    Sorry about the confusion. I didn’t actually delete your post… I just (accidentally) didn’t approve it.

    It looks like Gmail is starting to mark the ‘new comment’ posts from WordPress as spam, so I didn’t see that your comment was pending moderation.

    Cool that you posted your implementation.

    Kevin

  9. elhoim

    Thank you for your info Luke!