Yes Jeremy, RAID Really Is Dying

Jeremy retorts that RAID is alive and well in the real world:

Kevin Burton wrote a sort-of-reply to my call for action in getting LSI to open source their CLI tool for the LSI MegaRAID SAS aka Dell PERC 5/i, where he asserted that “RAID is dying”. I’d like to assert otherwise. In my world, RAID is quite alive and well. Why?

I should note that I said:

I’d like to assert that in 3-5 years RAID will be a thing of the past.

I’m not saying it’s dead now – but I do think it’s dying.

RAID is cheap. Contrary to popular opinion, RAID isn’t really that expensive. The controller is cheap (only $299 for Dell’s PERC 5/i, with BBWC, if you pay full retail).

… that’s the price of one HDD. You’ve just lost some IO there. Granted, this isn’t a major issue but it all adds up.

The “2x” disk usage in RAID 10 is really quite debatable, since those disks aren’t just wasting space, they are also improving read (and subsequently write) performance.

… the problem is you never see the same performance as the individual disks so you end up wasting a lot of money on lost IO.
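As a rough back-of-the-envelope: four disks in a RAID 10 set can serve reads from either side of each mirror, so read capacity scales with all four spindles, but every write hits two disks, so write capacity is roughly that of two. Four independently sharded disks can, in principle, deliver the full read and write capacity of all four, provided redundancy is handled at a higher layer.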

I think there’s a better way (keep reading)

Latency. The battery-backed write cache is a necessity. If you want to safely store data quickly, you need a place to stash it that is reliable¹. This is one of the main reasons (or only reasons, even) for using hardware RAID controllers.

Yes. I agree that this is still an issue but I think the performance boost you’d see by ditching RAID across your entire cluster would easily make up for the loss of the write cache.

* Disks fail. Often. If anything, we should have learned that from Google. Automatic RAID rebuild is a proven and effective way to manage this without sinking a huge amount of time and/or resources into managing disk failures. RAID turns a disk failure into a non-event instead of a crisis.

* Hot swap ability. If you forgo hardware RAID, but make use of multiple disks in the machine, there’s a very good chance you will not be able to hot swap a failed disk. Most hot-swappable disk controllers are RAID controllers. So, if you want to hot-swap your disks, you likely end up paying the cost for the controller anyway.

You’re thinking too low level. Who cares if a disk fails? The entire shard is set up for high availability. Each server is redundant with 1-2 other boxes (depending on the number of replicas). If you have automated master promotion you’ll never notice any downtime. All the disks can fail in the server and a slave will be promoted to become the new master.

Monitoring then catches that you have a failed server and you have operations repair it and put it back into production as a new slave.

During this process your entire cluster is still online. No RAID was used other than a redundant array of inexpensive servers.

Granted, this isn’t 100% there yet for stock MySQL but we’re working on building this into lbpool based on some of the same algorithms that Google uses internally (and which have been published for about 10 years).
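To make that concrete, here is a minimal, hypothetical sketch of that kind of automated master promotion. The shard map, hostnames, and health check are made up for illustration, and it skips the hard parts a real implementation has to deal with (fencing the failed master, picking the slave with the most replication progress):

```python
# Hypothetical sketch of automated master promotion for one shard.
# The shard map, hosts, and monitoring user are placeholders.
import MySQLdb

SHARD = {"master": "db1", "slaves": ["db2", "db3"]}

def is_alive(host):
    """Cheap health check: can we connect and run a trivial query?"""
    try:
        conn = MySQLdb.connect(host=host, user="monitor", connect_timeout=2)
        conn.cursor().execute("SELECT 1")
        conn.close()
        return True
    except MySQLdb.Error:
        return False

def promote_if_needed(shard):
    if is_alive(shard["master"]):
        return shard["master"]
    # Take the first healthy slave; a real system would pick the one
    # with the most replication progress and fence the dead master.
    for slave in shard["slaves"]:
        if is_alive(slave):
            shard["slaves"].remove(slave)
            shard["master"] = slave
            # Connection pools get repointed at the new master; the dead
            # box is re-imaged later and rejoins the shard as a slave.
            return slave
    raise RuntimeError("no healthy replica available for promotion")
```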

I don’t think it’s fair for anyone to say “Google doesn’t use RAID”. For a few reasons:

1. I would be willing to bet there are a number of hardware RAIDs spread across Google (feel free to correct me if I’m wrong, Googlers, but I very much doubt I am). Google has many applications. Many applications with different needs.

The biggest MySQL install within Google right now (Adwords) runs InnoDB on Linux software RAID 0 with a 1MB stripe size (at least as of 6 months ago).

No RAID controller because they like to keep things cheap. (their words – not mine)

If they used the same setup with two MySQL daemons they’d be able to get more effective throughput to their disks and reduce software complexity.

2. As pointed out by a commenter on Kevin’s entry, Google is, in many ways, its own RAID. So even in applications where they don’t use real RAID, they are sort of a special case.

I had already noted this in my post. Most large scale-out shops should probably be using a redundant array of inexpensive servers. When I said that “Google doesn’t use RAID” I was specifically talking about hardware RAID controllers. Adwords is even a one-off within Google. Their compute nodes don’t use RAID at all from what I can gather (and I follow them about as closely as anyone).

In the latter half of his entry, Kevin mentions some crazy examples using single disks running multiple MySQL daemons, etc., to avoid RAID. He seems fixated on “performance” and talks about MBps, which is, in most databases, just about the least important aspect of “performance”. What his solution does not address, and in fact where it makes matters worse, is latency. Running four MySQL servers against four disks individually is going to make absolutely terrible use of those disks in the normal case.

This means you want to make the best use (in terms of seek capacity) of your disks possible, and minimize downtime, in order to make the best use of the immutable overhead.

Actually, I’m very well aware of the trade-off between transactions per second and disk bandwidth.

We’re a bit different than a lot of shops in that we try to use INNODB to buffer the entire database. This means that we’re not seek bound for a good portion of our application. This allows us to do one checkpoint and write the database with one pass of the head.

Which means that seeks (for a good portion of our application) are non-existent.

The iostat output for our application looks really nice. We perform about 20-40 seeks per second with about 50MBps of throughput to the disks.
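As a back-of-the-envelope check: 50MBps spread over 20-40 seeks per second works out to roughly 1.25-2.5MB transferred per head movement, which is another way of saying the workload is dominated by large sequential writes rather than random I/O.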

INNODB can’t do much better than that though due to its use of fuzzy checkpointing (we’re thinking about fixing that problem but I need to find time to dive into the code).

Now, for some of our application we’re at a much lower memory-to-data ratio, which means we start to see more seeks. Even if you ARE seeing more seeks, ditching RAID and going with application-level sharding is smarter because you can dynamically repartition your data if you see hotspots in your usage patterns. You also get more effective use of your disks because they’re only storing data rather than also handling redundancy.
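As a rough illustration of what that application-level routing looks like (the bucket-to-daemon map and the key hashing here are hypothetical, not anything we actually ship):

```python
# Sketch of application-level sharding: route each key to one of several
# small MySQL instances, with a bucket map that can be edited to move hot
# ranges. The map and hostnames are hypothetical.
import zlib

# Virtual buckets -> physical mysqld instances; repartitioning a hotspot
# is an edit to this map plus a data copy, with no RAID involved.
SHARD_MAP = {
    0: "db1:3306",
    1: "db1:3307",   # second daemon on the same box, separate disk
    2: "db2:3306",
    3: "db2:3307",
}

def shard_for(key):
    """Map a logical key (user id, URL, etc.) to a mysqld instance."""
    bucket = zlib.crc32(key) % len(SHARD_MAP)
    return SHARD_MAP[bucket]

# Every query for this key lands on the same daemon.
print(shard_for(b"http://example.com/feed"))
```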

RAID 10 helps in this case by making the best use of the available spindles, spreading IO across the disks so that as long as there is work to be done, in theory, no disk is underutilized.

You can spread the work the same way with a sharded database. RAID 10 isn’t required.

This is exactly something you cannot accomplish using single disks and crazy multiple-daemon setups. In addition, in your crazy setup, you will waste untold amounts of memory and CPU by handling the same logical connection multiple times. Again, more overhead.

You mean like what Livejournal, Flickr, and Adwords are doing?

What do I think is the future, if RAID is not dying? Better RAID, faster disks (20k anyone? 30k? Bring it on!), bigger battery-backed write caches, and non-spinning storage, such as flash.

I agree that flash is looking really hot. The pricing should really fall in 2008.

¹ There’s a lot to be said for treating the network as “reliable”, for instance with Google’s semi-synchronous replication, but that is not available at this time, and isn’t really a viable option for most applications. Nonetheless, I would still assert that RAID is cheap compared to the cost (in terms of time, wasted effort, blips, etc.) of rebuilding an entire machine/daemon due to a single failed disk.

Rebuilding the entire machine is fairly easy for us. We have our hosting provider (Serverbeach – I’d highly recommend them if you need hardware) automatically re-deploy a new image on the server. We then take about 15 minutes to install our OS changes on top, and then mysqlslavesync it from a production box. It’s all pretty simple actually.
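mysqlslavesync itself isn’t released yet (see the comments below), so here is a rough, hypothetical sketch of the same flow using only stock tools: dump from an existing production slave, load the snapshot, and point replication at the master.

```python
# Hypothetical sketch of re-provisioning a freshly imaged box as a slave
# using stock tools (this is not mysqlslavesync's actual interface).
import subprocess

def clone_slave(source_host, master_host, log_file, log_pos):
    # 1. Take a consistent snapshot from an existing production slave.
    subprocess.check_call(
        "mysqldump -h %s --all-databases --single-transaction > snapshot.sql"
        % source_host, shell=True)

    # 2. Load the snapshot into the local, freshly installed mysqld.
    subprocess.check_call("mysql < snapshot.sql", shell=True)

    # 3. Point replication at the master and start it. The binlog
    #    coordinates would normally come from the snapshot itself
    #    (e.g. mysqldump --master-data); they are passed in here.
    change = ("CHANGE MASTER TO MASTER_HOST='%s', "
              "MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d; START SLAVE;"
              % (master_host, log_file, log_pos))
    subprocess.check_call(["mysql", "-e", change])
```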

Anyway. I hope this clarifies what I was talking about before.


  1. Where’s this mysqlslavesync?

  2. Hey Kris…….. I need to release it….. I’ve been too busy with work.

    What I need is a general script repository……

    Kevin

  3. James Day

    For many people, transaction commit seek load remains an issue. It’s pretty much the only reason I have to recommend hardware RAID unless someone has a broken architecture. As flash or battery-backed RAM solutions start to be accepted for this duty, the reasons for the RAID controller will go away for those with a web mindset. Drives with sufficiently large and intelligent write caches that survive long power outages without data loss could also do the job.

    For telcos and some larger corporates SANs and such are part of their historic “high availability” culture and that cultural reason is likely to remain more significant than the “failure happens, deal with it” approach of the web side. Big and cheap flash might be what converts these people. Probably still stuck in an expensive box with a familiar and comforting brand name.

    At Wikipedia a couple of years ago we switched to RAID 0 with a write-caching controller as soon as we had enough database slaves to handle failures and could abandon RAID 10 in each box. We still kept the controller to get the write caching for transaction commits. Today… time to play with flash and battery-backed RAM disks and throw out the controllers.





