On Write Caching Controllers and Distributed Database Failure Models
The more databases you have in your cluster, the greater the probability that one of them is going to fail.
The expected time until one of these boxes crashes is basically MTBF/N. If you have 10k machines, expect multiple failures per day.
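To put numbers on that, here's a back-of-the-envelope calculation in Python (the three-year per-machine MTBF is just an assumed figure for illustration):

    # Rough failure math, assuming ~3 years MTBF per machine (an assumption).
    MTBF_HOURS = 3 * 365 * 24          # about 26,280 hours per box
    MACHINES = 10000

    hours_between_failures = MTBF_HOURS / MACHINES    # ~2.6 hours
    failures_per_day = 24.0 / hours_between_failures  # ~9 failures per day
    print(hours_between_failures, failures_per_day)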
If you’re performing realtime writes to these databases you can lose a DB in the middle of a write.
Now what? Do you lose transactions?
Most people using InnoDB/MyISAM have used write caching controllers to solve this problem.
You buy this expensive card with 128MB of memory which temporarily caches writes. The card has a battery, so if you lose power the cached writes survive until the disks spin back up and the data can be flushed out to them.
But as I mentioned before, these devices are expensive. Expect them to add another 20% to the price of your cluster.
This doesn’t sound like a ton of cash if you have one or two machines but if you’re buying 50-1000 it’s a significant chunk of change.
Another way around this is to make sure your write hits at least two nodes (three if you’re paranoid).
If you’re using a sharded database you might have 2-3 replicas per shard.
What you do then is make sure your transaction made it to at least one more node.
This is essentially what Google does. They have a distributed write-ahead log for an entire shard.
If one of their replicas in a shard fails, the transaction is still on two more boxes.
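Here's a toy sketch of that acknowledgment logic (this is not Google's or MySQL's actual code, just the shape of the idea; the Replica class and commit function are made up for illustration):

    # Toy sketch: a commit is only acknowledged once at least `min_acks`
    # replicas confirm they hold the transaction, so losing one box
    # mid-write doesn't lose the transaction. Networking, timeouts and
    # failover are all hand-waved.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.log = []                # stand-in for the replica's relay log

        def send(self, txn):
            self.log.append(txn)         # pretend: network round trip + ack
            return True

    def commit(txn, local_log, replicas, min_acks=1):
        local_log.append(txn)            # local write-ahead log entry
        acks = sum(1 for r in replicas if r.send(txn))
        if acks < min_acks:
            raise RuntimeError("not enough replica acks; refuse to confirm")
        return True                      # durable on at least min_acks + 1 boxes

    replicas = [Replica("r1"), Replica("r2")]
    commit({"id": 42, "sql": "INSERT ..."}, [], replicas, min_acks=2)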
For Google, saving that extra 20% at their scale is a LOT of cash.
Hopefully the semi-sync patch is isolated into a standalone patch soon, because it gives you the same fault recovery mechanism: it doesn’t matter if a master fails as long as you’re running semi-sync, the transaction data is already on another box, and you can bring a new replica online quickly.
Now we can shave these write caching controllers out of our budget.
The main question is what type of performance hit will we take?
Fortunately, InnoDB has a few variables you can set.
If you set innodb_flush_log_at_trx_commit to zero, InnoDB flushes its log roughly once a second instead of at every commit. Meaning more data is buffered in memory (but remember it’s already replicated to a slave).
Another is sync_binlog. You should set this to zero so MySQL doesn’t fsync the binary log itself and leaves it to the OS.
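Put together, the relaxed configuration looks something like this in my.cnf (the exact semi-sync settings depend on which patch or plugin you’re running, so those are left as a comment):

    [mysqld]
    # Relax local durability; we rely on semi-sync replication rather than
    # a battery-backed controller to survive a dead master.
    innodb_flush_log_at_trx_commit = 0   # flush the InnoDB log ~once a second
    sync_binlog = 0                      # let the OS decide when to sync the binlog
    # ...plus whatever options your semi-sync patch/plugin requires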
So the real question is which is faster:
– InnoDB with innodb_flush_log_at_trx_commit=1 and sync_binlog=1 WITH a write caching controller
– InnoDB with innodb_flush_log_at_trx_commit=0 and sync_binlog=0 WITHOUT a write caching controller but with semi-sync.
I haven’t had time to run the benchmark but I suspect that the write caching controller is faster. By how much is my main question.
Of course you’re going to need the global transaction ID patch too. Without it you can’t really run two or three replicas per shard and promote one cleanly when the master dies.
Some more thoughts.
Yes. You can lose an entire datacenter and then lose transactions. That’s why you need more than one datacenter. We’ve lost power three times this year (it isn’t fun).
While InnoDB would probably perform faster on BBU hardware, a Bigtable-style or log-structured DB without fsync would write at almost the full sequential disk write speed (80-120MB/s).
I still want a hybrid flash/supercapacitor based controller, ideally on the drives themselves. Then if power is lost, the DRAM cache on the drives gets written quickly to the flash.
Does this exist yet? HDDs are going to be around until at least 2010 for bulk storage (TB-sized arrays), and this should only add $20-50 to the cost of each disk, so it’s probably worth the investment.