Detecting Corrupt Data in MySQL Protocol

This has me thinking:

When Amazon S3 receives a PUT request with the Content-MD5 header, Amazon S3 computes the MD5 of the object received and returns a 400 error if it doesn’t match the MD5 sent in the header. Looking at our service logs from the period between 6/20 11:54pm PDT and 6/22 5:12am PDT, we do see a modest increase in the number of 400 errors. This may indicate that there were elevated network transmission errors somewhere between the customer and Amazon S3. We are continuing to investigate and will post an update when we have further information.

The MySQL protocol doesn’t seem to have a checksum or hashcode:

If a frame were corrupted in a way that didn’t break the SQL, then it would be possible to INSERT/UPDATE/DELETE incorrect data.

Granted, adding MD5 or SHA1 would burn a bit of CPU but it might be worth it in a lot of situations.

… of course the TCP checksum can come into play here but it’s just a weak checksum.

When you’re pushing LOTS of data this could be seen in production environments (not just in theory).

  1. Harrison

    I suspect that since the MySQL protocol is normally used over the local LAN it would be much less common than something like S3 (since it is always over a WAN).

    I wonder how much enabling compression or SSL would help to check for this sort of thing.

  2. And it does come into play…

    It will be a whole lot worse with row based replication, since simple bit flips may still leave the replication events still syntactically correct.

  3. Harrison,

    I agree it’s less common but becoming more common as switches go commodity.

    Also, some of this stuff isn’t zero copy so memory corruption can come into play….

    A hashcode would nail this stuff but only if it wasn’t in the last memcpy in the target MySQL box……


  4. There should be a CRC at over protocol blocks. CRC-32 would work fine and doesn’t require much computation. CRC is specifically designed to pick up bitflips, missing bytes and the like. It simply checks whether a single block was transmitted intact.

    MD5/SHA1 is complete overkill for this purpose. It was designed to allow signatures to be compared against eachother, equivalent to the original datasets. That’s a whole different issue.

  5. I’ve been bitten so badly by this problem (I suspect, but can’t prove — see the bug report mentioned above) that I’d almost settle for a magic number every so many bytes. 0xDEADBEEF, anyone? The fact that the protocol is unchecked has probably cost a lot of money over the years.

  6. We had at least one scary instance of this at Technorati – an UPDATE writing to the blogs table got terminated before its WHERE clause, thereby writing the same blog title to an entire database shard (at the time, a quarter of the blogs in the index). That the blog title was something like “Huggable, kissable, loveable” made it amusing, as every results page showed that for a quarter of the results. We managed to propagate real blog titles back from memcached, slaves and other caches, but it was a very stressful day.

  7. Kevin,

    Holy crap…. that’s a crazy story.

    I’ve heard similar horror stories before about binlog corruption from other companies.

    From the number of comments this post is getting so quickly it sounds like a lot of people are thinking about this….

    Anyone have any other horror stories?

  8. We (and many others) suffered from the recently fixed bug that could generate corruption in relay log events. In a few cases, the corrupt relay log events were executable. We will soon add checksums to binlog events to prevent that problem.

    We have had at least one instance where the SQL sent from client to server was corrupted by the network hardware and TCP checksums did not catch the problem. Checksums in the client-server protocol are a great idea, but I am not sure how extensible the client-server protocol is. It is much easier for me to change server-server protocols than client-server protocol as there are at least two implementations of client-server (C library, JDBC).

%d bloggers like this: