Unconventional Distributed Systems Failures

Greg points out a section which he found interesting in the Google Bigtable paper:

For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance.

The partition asymmetry issue should be obvious. If one section of your cluster is too slow you’re going to see throughput fall. Hung machines is another. The distributed filesystem I wrote last year (and will eventually opensource once I get time to finish it) handled hung machine situations.

Here are a few more I can think of.

Memory corruption without ECC memory. If memory is the new disk how can you be sure that your data is accurate? ECC is expensive but you could get away with it by using hashcodes on your content. The problem is that computing the hash of 5G of content isn’t exactly fast. You could buy ECC memory but it could end up being a lot of money.

Rumor has it this actually happened a little while ago at Google. Apparently they had a corruption in their pagerank graph in memory due to non-ECC systems. This yielded the dreaded subdomain spammer issue that they took some negative press on…

Large clock skews. I’ve seen the large clock skew issue. I hate this one honestly. It’s hard to diagnose. One solution would be to have each node connect to other nodes and assert that they are all within a margin of error. You could of course run XNTP and this corrects a lot of issues in practice but I find all the NTP implementations less than perfect.

Slow and overloaded node issues. If a node is taking too long to respond but is functioning normally you still need to flag it and take it offline (or reduce load). One way to solve this directly in your IO layer is to use async IO and allow for a max response time for a client. If the client is taking too long to respond you can notify a monitoring system and then request IO from another node.

%d bloggers like this: