The Benefits of a Solid Distributed Compute Infrastructure

For about the last year (since my trip to Thailand) Tailrank has had a distributed compute infrastructure similar to MapReduce named TaskQueue.

The model is simple. We have a centralized but redundant master which handles giving work out to compute nodes. The nodes accept the work and execute it across hosts. It’s a GREAT way to distribute computation across nodes because it’s loosely coupled.

We could build MapReduce on top of TaskQueue as it’s a strict subset of the functionality we already provide. We have a distributed locking service similar to chubby (which we’ll be Open Sourcing btw) which allows for coordination among hosts. We use it for automated master queue promotion for example.

If any of our robots crash units of work it was given are simply re-executed on other nodes. In practice though our robots are very reliable. Once they’re started they pretty much stay online and functional.

Here’s an example of our idle/available CPU across 20 compute nodes. Most of these are dual or multi-core boxes so we have about 50 total cores.

Since we’re IO bound we end up having a lot of spare CPU.


%d bloggers like this: