Distributed System Design is Difficult

Rich writes up a “stream of consciousness” post about why distributed systems design is difficult:

We take the simplicity of i=n++ or counting lines for granted. It all begins with a single CPU and we know that model. In fact, we know that model so deeply that we think in it, in the same way that language shapes what we can think about. The von Neumann architecture defines our perception of what is easy and what is hard.

It isn’t that we know the von Neumann architecture so much as that we understand discrete mathematics, and von Neumann machines can model mathematical problems with analogs close to their mathematical constructs.

If I’m computing a Fibonacci sequence as F(n) = F(n-1) + F(n-2) on a local machine I don’t have to worry about 50% of my machines crashing during the computation. On a local von Neumann machine this is easy to compute because your working set is in memory.
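To make the local case concrete, here is a minimal Python sketch of that recurrence. The base cases F(0) = 0 and F(1) = 1 are an assumption (the post does not state them), and `fib` is just an illustrative name. The whole working set is two integers in memory, which is why nothing here has to plan for a machine disappearing mid-computation.

```python
def fib(n: int) -> int:
    """Return F(n) for F(n) = F(n-1) + F(n-2).

    Assumes the conventional base cases F(0) = 0 and F(1) = 1,
    which the original post does not spell out.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1  # the entire working set: F(k) and F(k+1)
    for _ in range(n):
        a, b = b, a + b
    return a
```

A distributed version of even this trivial loop would have to decide where `a` and `b` live and what happens when the node holding them crashes.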

In fact we’re starting to see distributed systems problems within single-image hardware. Newer multi-core machines from AMD and Intel have dedicated L1 caches, and code that is not lock-free can slow down the entire system (even threads running on other cores).

In fact, we spent the last two weeks tuning our code and removing locks because we were only using about 50% of our CPUs after our quad-core Opteron migration.
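The post doesn’t show what the lock removal looked like, so the following is only a rough sketch of one common shape of that change: replacing a lock taken on every update with per-thread state that is merged once at the end. The function names are hypothetical, and because of CPython’s global interpreter lock this will not reproduce the CPU utilization numbers above; it only illustrates the structure.

```python
import threading


def count_with_shared_lock(num_threads: int, increments: int) -> int:
    """Every increment takes the same lock, so all threads contend on it."""
    total = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal total
        for _ in range(increments):
            with lock:  # hot lock: acquired num_threads * increments times
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_with_local_totals(num_threads: int, increments: int) -> int:
    """Each thread counts privately and publishes its result once."""
    totals = [0] * num_threads  # one slot per thread, no sharing while counting

    def worker(slot: int) -> None:
        local = 0
        for _ in range(increments):
            local += 1  # no lock needed: only this thread touches `local`
        totals[slot] = local  # single write to a slot owned by this thread

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(totals)
```

Both functions return the same count; the second simply avoids touching shared state in the inner loop, which is the part that hurts on multi-core hardware.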

The fallacies of distributed computing tell us that latency is not zero and bandwidth is not infinite. This is true of local von Neumann architectures as well, and in recent years it has become more apparent. Reading memory is not a zero-latency operation, nor does memory have infinite bandwidth.

The tools we have now are primitive. Sawzall, Bigtable, DB sharding, and distributed compute frameworks can help make the problems easier but if you need to scale your code you’re not in for a fun time of it just yet.

Even at the network and datacenter layer this stuff is hard. So you’ve solved your whole distributed compute framework. Great. What happens when you lose power to that datacenter? What if you do something stupid like put the master and its backup/standby on the same switch? Same power? Same rack?
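Those questions amount to checking which failure domains the master and its standby share. As a small sketch of that idea, assuming a hypothetical `Placement` record whose fields (`datacenter`, `power_feed`, `switch`, `rack`) are made up for illustration, a deployment check might look like this:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Placement:
    """Where a node physically lives. All field names are hypothetical."""
    datacenter: str
    power_feed: str
    switch: str
    rack: str


def shared_failure_domains(master: Placement, standby: Placement) -> List[str]:
    """Return the names of the failure domains both nodes depend on.

    Assumes identifiers are globally unique (for example, rack names are
    not reused across datacenters).
    """
    domains = ("datacenter", "power_feed", "switch", "rack")
    return [d for d in domains if getattr(master, d) == getattr(standby, d)]
```

An empty result means no single datacenter, power feed, switch, or rack failure takes out both nodes; anything else names a single point of failure worth fixing.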

It’s not easy but at least it’s a competitive advantage once you have it solved!
