On Write Caching Controllers and Distributed Database Failure Models

The more databases you have in your cluster, the greater the probability that one of them is going to fail.

The expected time until one of these boxes crashes is basically MTBF/N. If you have 10k machines, expect multiple failures per day.
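
Back-of-the-envelope math (assuming, say, a 3-year MTBF per box, which is a made-up but plausible number):

```python
# Expected failures per day across a cluster, assuming independent
# failures and a per-machine MTBF of ~3 years (illustrative, not measured).
MTBF_HOURS = 3 * 365 * 24    # ~26,280 hours per machine
N = 10_000                   # machines in the cluster

failures_per_hour = N / MTBF_HOURS
print("expected failures/day: %.1f" % (failures_per_hour * 24))
# -> roughly 9 failures per day at 10k machines
```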

If you’re performing realtime writes to these databases you can lose a DB in the middle of a write.

Now what? Do you lose transactions?

Most people using InnoDB/MyISAM have used write caching controllers to solve this problem.

You buy this expensive card with 128MB of memory which temporarily caches writes. The card has a battery, so if you lose power the cached writes survive until you can spin the disks back up and get your data onto them.

But as I mentioned before, these devices are expensive. Expect them to add another 20% to the price of your cluster.

This doesn’t sound like a ton of cash if you have one or two machines but if you’re buying 50-1000 it’s a significant chunk of change.

Another way around this is to make sure your write hits at least two nodes (three if you’re paranoid).

If you’re using a sharded database you might have 2-3 replicas per shard.

What you do then is make sure your transaction made it to at least one more node before you acknowledge the write.

This is essentially what Google does. They have a distributed write ahead log for an entire shard.

If one of their replicas in a shard fails, the transaction is still on two more boxes.
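
Here’s a minimal sketch of the idea (the Replica class is a hypothetical stand-in, not Google’s actual code): don’t tell the client the write succeeded until it’s on at least two boxes.

```python
# Sketch: only ACK a write back to the client once it lives on at
# least two boxes. Replica and append() are hypothetical stand-ins
# for the shard's real replication transport.
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []     # this replica's copy of the write-ahead log
        self.alive = True

    def append(self, entry):
        if not self.alive:
            return False  # dead box: no ACK
        self.log.append(entry)
        return True       # ACK

def write(replicas, entry, min_acks=2):
    acks = sum(1 for r in replicas if r.append(entry))
    if acks < min_acks:
        raise RuntimeError("not durable: only %d ACKs" % acks)
    return True  # on >= min_acks boxes; safe to ACK the client

shard = [Replica("a"), Replica("b"), Replica("c")]
shard[1].alive = False                  # lose a box mid-write
write(shard, ("txn-42", "INSERT ..."))  # still durable: 2 of 3 ACKed
```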

For Google, saving that extra 20% at their scale is a LOT of cash.

Hopefully the semi-sync patch is integrated soon, because then you can use the same fault recovery mechanism. It doesn’t matter if a master fails as long as you have semi-sync, the transaction data is already on another box, and you can bring a new replica online quickly.
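
As I understand the semantics, the flow looks roughly like this (a toy sketch, not the actual patch):

```python
# Toy sketch of semi-sync commit: commit locally on the master, then
# block the client until at least one slave ACKs the binlog events.
# Master/Slave here are stand-ins, not the real server internals.
class Slave:
    def __init__(self):
        self.relay_log = []

    def receive(self, events):
        self.relay_log.append(events)
        return True  # ACK

class Master:
    def __init__(self, slaves):
        self.slaves = slaves
        self.committed = []

    def commit(self, txn):
        self.committed.append(txn)      # commit to InnoDB + write binlog
        for s in self.slaves:           # wait for one slave to ACK;
            if s.receive(txn):          # the real patch falls back to
                return                  # async replication on timeout
        raise TimeoutError("no slave ACKed")

m = Master([Slave(), Slave()])
m.commit("txn-42")  # returns only after a slave has the events
```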

Now we can shave these write caching controllers out of our budget.

The main question is what type of performance hit will we take?

Fortunately, InnoDB has a few variables you can set.

If you set innodb_flush_log_at_trx_commit to zero then InnoDB only flushes its log to disk about once per second instead of at every commit. Meaning more data is buffered in memory (but remember it’s already replicated to a slave).

Another is sync_binlog. You should set this to zero, which leaves flushing of the binary log up to the operating system instead of fsyncing it at every commit.
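
For reference, here’s roughly what I mean (a sketch using mysql-connector-python; the host and credentials are made up, and you could just as well put these in my.cnf):

```python
# Sketch: relax local flushing when durability comes from replication
# instead of the local disk. Assumes mysql-connector-python and a user
# with SUPER privilege; host/credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="db1", user="admin", password="...")
cur = conn.cursor()

# Flush the InnoDB log ~once per second instead of at every commit.
cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 0")

# Let the OS decide when to flush the binary log (no fsync per commit).
cur.execute("SET GLOBAL sync_binlog = 0")

conn.close()
```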

The main question is which is faster:

– InnoDB with innodb_flush_log_at_trx_commit=1 and sync_binlog=1 WITH a write caching controller

or

– InnoDB with innodb_flush_log_at_trx_commit=0 and sync_binlog=0 WITHOUT a write caching controller but with semi-sync.

I haven’t had time to run the benchmark but I suspect that the write caching controller is faster. By how much is my main question.
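
A crude starting point is to measure raw fsync throughput on each setup, since that’s what these settings are really trading away. Something like this sketch (the file path is made up):

```python
# Crude microbenchmark: how many fsyncs/second can one thread do?
# A BBU controller ACKs the fsync from its battery-backed cache, so it
# should post a much higher number here than a bare disk.
import os, time

N = 1000
fd = os.open("/tmp/fsync_test", os.O_WRONLY | os.O_CREAT, 0o644)
start = time.time()
for _ in range(N):
    os.write(fd, b"x" * 512)  # stand-in for a log record
    os.fsync(fd)              # force it to stable storage
elapsed = time.time() - start
os.close(fd)
print("%.0f fsyncs/sec" % (N / elapsed))
```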

Of course you’re going to need the global transaction ID patch too. Without this you can’t really run three replicas per shard.

Update:

Some more thoughts.

Yes. You can lose an entire datacenter and then lose transactions. That’s why you need more than one datacenter. We’ve lost power three times this year (it isn’t fun).

While InnoDB would probably perform faster on BBU hardware, a Bigtable-style or log-structured DB without fsync would write at almost the full disk head write speed (80-120MB/s).
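
The reason is that a log-structured store turns every write into a buffered sequential append, with no per-record seek or fsync. A toy sketch of the idea:

```python
# Toy append-only log: buffered sequential writes, no per-record fsync.
# Durability comes from replication (as above), not the local disk, so
# the disk head just streams at its sequential write speed.
class AppendLog:
    def __init__(self, path):
        self.f = open(path, "ab", buffering=1024 * 1024)  # 1MB buffer

    def append(self, record):
        self.f.write(len(record).to_bytes(4, "big"))  # length prefix
        self.f.write(record)                          # no seek, no fsync

    def close(self):
        self.f.close()

log = AppendLog("/tmp/toy.log")
for i in range(100000):
    log.append(b"row-%d" % i)
log.close()
```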

I still want a hybrid flash/supercapacitor-based controller. Hopefully on the drives. Then if the drives lose power, the DRAM cache on the drives is quickly written to the flash.

Does this exist yet? HDDs are going to be around until at least 2010 for bulk storage (TB-sized arrays).

They should only add $20-50 to the cost of your disks, so they’re probably worth the investment at that price.


  1. Kevin,

    With standard 1Gbit Ethernet, BBU will be faster. Usually you can do 10,000-20,000 fsyncs per second from a single thread to the NVRAM; over Ethernet it is lower.

    Speaking about price, I think Google uses BBU for MySQL boxes but not for search boxes, which is where crap hardware is used.

    If you’re looking at a crappy box the RAID is often expensive, in particular because vendors charge a large premium on it. If you have a better box the added cost is much lower. For example, getting a $5000 Dell (with 8 HDDs etc.) the cost of the RAID controller will be $200 or so, which is closer to 4% than 20%.

    Though it is a bit hard to judge, as cost at the component level may be really different from the whole system. And if you’re dealing with large volumes it is the component cost that becomes important in figuring out what price you can get from a vendor.

  2. Kevin,

    The semi sync patch, while very nice, and even nicer in the refactored version done by the MySQL replication team, might not do all that you expect.

    It defers return to the user from commit until at least one slave ACKs the transaction’s binlog events. That is, commit occurs on the master (commit to InnoDB, write events to the binlog) and then the user’s connection waits while threads copy the binlog events to slaves and wait for an ACK to wake the user’s connection. While the user’s connection is waiting other client connections may observe the committed data.

    This still has useful properties. For example, it rate limits a busy client to prevent it from making the master get far ahead of the slaves. N busy clients can make the master get N transactions ahead of the slaves. But they can’t make the master get N+1 transactions ahead of the slaves — a client can create at most 1 transaction on the master that has not been ACKd by a slave.

    This is a step in the right direction but it isn’t sync replication. Fortunately, MySQL hired 2 experts on sync replication into the MySQL replication team so we may get some interesting features in the future.

  3. Hey Mark.

    I didn’t see that the MySQL replication team had a refactored patch. Is that anywhere public?

    I’m aware that the writer waits for the entire replication chain before being returned.

    In our situation we do bulk inserts with INSERT … ON DUPLICATE KEY UPDATE of 50-100 records. So our throughput will pretty much be the same. Plus I pretty much get group commit.
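
    For the curious, the pattern looks something like this (table and columns are hypothetical):

    ```python
    # Sketch of the bulk-upsert pattern: one statement carries ~100 rows,
    # so a single commit (one semi-sync round trip) amortizes across the
    # whole batch. Table/columns are made up; assumes mysql-connector-python.
    import mysql.connector

    conn = mysql.connector.connect(host="db1", user="crawler",
                                   password="...", database="crawl")
    cur = conn.cursor()

    batch = [("http://example.com/%d" % i, "2008-12-01") for i in range(100)]
    placeholders = ", ".join(["(%s, %s)"] * len(batch))
    sql = ("INSERT INTO pages (url, crawled_at) VALUES " + placeholders +
           " ON DUPLICATE KEY UPDATE crawled_at = VALUES(crawled_at)")
    cur.execute(sql, [v for row in batch for v in row])
    conn.commit()
    conn.close()
    ```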

    I get to cheat on this stuff when writing crawlers :)

    I didn’t realize that reads from the master can see the uncommitted data. Makes sense though….. Interesting.

    “With standard 1Gbit Ethernet, BBU will be faster. Usually you can do 10,000-20,000 fsyncs per second from a single thread to the NVRAM; over Ethernet it is lower.”

    For InnoDB I think you’re right…. I’m trying to figure out by how much.

    “Speaking about price, I think Google uses BBU for MySQL boxes but not for search boxes, which is where crap hardware is used.”

    Yes… this is my assessment as well. This is why I think they’re using distributed WAL replication and multiple datacenters so that they can buy ultra-cheap hardware.

    “If you’re looking at a crappy box the RAID is often expensive, in particular because vendors charge a large premium on it. If you have a better box the added cost is much lower. For example, getting a $5000 Dell (with 8 HDDs etc.) the cost of the RAID controller will be $200 or so, which is closer to 4% than 20%.”

    I guess it depends on the card. The MegaRAID cards we’re looking at are $500 or so, so that’s 10% of the box.

    For the actual machines we’re thinking of getting, it’s 15%. We’re not buying crappy boxes, but they’re commodity, which is why they’re so cheap. GREAT boxes though… I know two startups running clusters of > 500 boxes on their hardware. Basically Supermicro with Seagate HDDs.

  5. Kevin,
    The data is committed on the master. The protocol is:
    1) commit on master (innodb and binlog)
    2) wait for slave to ACK
    3) return to client

    During 2) the data is committed on the master, so other connections can run and see the results of the commit while the committing client waits for the ACK.

    If you want better guarantees, then you need to use sync commit and that doesn’t exist within MySQL today.

    The referenced paper is about sync commit protocols that don’t block when some servers fail. This is great functionality that should eventually be supported by MySQL. You don’t want to get paged every time the master hardware fails and a new master must be elected, do you?

  6. Mark Leith

    Hey Kevin,

    The new semi-sync stuff is now in 6.0.8 onwards. There was a feature preview on launchpad:

    https://code.launchpad.net/%7Ehezx/mysql-server/semi-sync-replication

    It’s all fully documented now:

    http://dev.mysql.com/doc/refman/6.0/en/replication-semisync.html

    It’s all plugin enabled now as well. :)





