Quantcast’s 700 Node KFS Cluster

Yesterday I had lunch with Sriram Rao, lead developer of the KFS project and compared notes about distributed databases.

KFS is basically a GFS-style clone developed at Kosmix and then released as open source.

Basically, a no-BS distributed filesystem implemented from the ground up in C++ to scale and actually get real work done. KFS is running on a 200 node cluster within Kosmix. I just found out that it’s deployed in a 700 node cluster at Quantcast.

What’s really cool is that he just left Kosmix (great and smart guys btw) to work on KFS full time at Quantcast extending KFS.

This is a win for both companies. The guy leaves Kosmix and can literally keep working on the same source code! OSS FTW!

Quantcast is obviously doing log analysis with this data so using a simple distributed filesystem is ideal for this task. Writing the content sequentially into a cluster and then map-reducing the data to compute resulting statistics is basically the ideal architecture.

At 700 boxes and three replicas this is about 500TB… not too bad. Though that’s a LOT of storage nodes which means they probably either need a LOT more storage per box or a lot of cores… (or both).

I also learned that KFS can’t handle total cluster-wide power failure. However, in their usage it’s not really considered a problem since it would only mean about 128k worth of missing logs per box. In the grand scheme of things this is probably less than %0.00001 percent of the total statistics in their index.

They could bypass this if they want by just doing an fsync() for each write. If they used hardware RAID controllers with BBUs they could do this without a performance hit but the cost of their hardware will increase dramatically.

Speaking of which. I still want disks that cache by using a 32MB DRAM buffer and flash with a BBU. While I’m not a hardware engineer it would seem possible to build such a device and use a cheap battery or super-capacitor to build persistence without the cost of expensive and complex RAID controllers.

Though, to be honest, the LSI controllers we’re using delivered a 10x performance boost in our OLTP load.

We spent a few hours last week crash testing MySQL and basically came to the conclusion that replication is totally broken. If you lose power in your cluster you basically have to re-sync all of your MySQL nodes to the master (mysqldump and re-import your data) and then setup replication again. Pathetic. This needs to be fixed.

  1. Check out maatkit for a faster way to sync slaves.

%d bloggers like this: