Google, Bigtable, Compression, Zippy and BMDiff
A few months ago, when I was heads down finalizing the distributed database in Spinn3r, I was exceedingly curious about what other DBs are using for compression.
GZip seems to be the obvious choice but its compression speed isn’t very good when compared to LZO.
Your disks are almost certainly going to be bottlenecked on IO (if you have a good DB design) so compressing the data means you can trade CPU (which will almost certainly idle)
I remembered some notes about compression in the original Bigtable paper and decided to dig a bit deeper.
Apparently, there isn’t much information about what Google uses for compression in Bigtable, GFS, etc.
These notes were compiled from Jeff Dean’s talk in 2005 but I haven’t seen anything else referencing the subject.
Skip to 46:00 in the Dean talk to see the compression notes.
There is a lot of redundant data in their system (especially through time), so they make heavy use of compression. He went kind of fast and I only followed part of it, so I’m just going to give an overview. Their compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s). Zippy is similar to LZW. It doesn’t compresses as highly as LZW or gzip, but it is much faster. He gave an example of a web crawl they compressed with the system. The crawl contained 2.1B pages and the rows were named in the following form: “com.cnn.www/index.html:http”. The size of the uncompressed web pages was 45.1 TB and the compressed size was 4.2 TB, yielding a compressed size of only 9.2%. The links data compressed to 13.9% and the anchors data compressed to 12.7% the original size.
Google is using an algorithm named BMDiff referenced in Bentley McIlroy DCC ’99 Data Compression Using Long Common Strings
The use BMDiff to compute a dictionary diff between all columns in a column family. This way common strings between columns can be stored in a compressed dictionary to avoid duplicate storage.
This also helps to diff between previous versions of a page across compactions. A page stored in your index will probably have a LOT in common with the same page stored a month ago.
They then run the bmdiff through zippy (another compression algorithm they wrote). Apparently, it’s a tuned version of LZO.
I’d like to see MySQL/Drizzle support more higher level DB primitives directly rather than having to build support for these above the DB level.
The zlib compress/uncompress support in MySQL is horrible (binary data is not compatible with other zlib implementations).
Supporting bmdiff, lzo, bloom filters, etc in DBs is going to be necessary to have drizzle support larger distributed databases.
There are a few UDFs I want to write so maybe I’ll take these on at the same time..
Come to think of it, crypto support isn’t that hot in MySQL either.