Byzantine Fault Tolerance and Code Burn-In
Looks like the guys over at Amazon had fun this week:
We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
This is classic byzantine fault tolerance (or lack thereof) and might be a good candidate for future textbooks.
Byzantine refers to the Byzantine Generals’ Problem, an agreement problem in which generals of the Byzantine Empire’s army must decide unanimously whether to attack some enemy army. The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals. These traitors can act arbitrarily in order to achieve the following aims: trick some generals into attacking; force a decision that is not consistent with the generals’ desires, e.g. forcing an attack when no general wished to attack; or confusing some generals to the point that they are unable to make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory.
I blogged about a similar problem the other day with the MySQL replication protocol:
The MySQL protocol doesn’t seem to have a checksum or hashcode:
If a frame were corrupted in a way that didn’t break the SQL, then it would be possible to INSERT/UPDATE/DELETE incorrect data.
Granted, adding MD5 or SHA1 would burn a bit of CPU but it might be worth it in a lot of situations.
A number of developers jumped in to note similar concerns.:
We had at least one scary instance of this at Technorati – an UPDATE writing to the blogs table got terminated before its WHERE clause, thereby writing the same blog title to an entire database shard (at the time, a quarter of the blogs in the index).
The key takeaway here is that you should NOT assume your messages on your network won’t be corrupted in transit. TCP checksums aren’t going to save you and most protocols by default don’t include hashcodes.
A secondary aspect hers is that burning your code in over years and avoiding the urge to rewrite can REALLY save the day. Spinn3r is a very solid code base because I’ve been evolving it rather than rewriting components from scratch.