MySQL and the Death of RAID

RAID is dying. Shocked? The prediction might sound premature to some folks. It’s still conventional wisdom in some circles that RAID is the conservative way to scale your IO.

I’d like to assert that in 3-5 years RAID will be a thing of the past.

Want some evidence? Google doesn’t use RAID. They’ve built a database infrastructure that avoids expensive and proprietary hardware controllers.

You could call it a redundant array of inexpensive servers.

Other scale-out shops without access to such toys have built out sharded MySQL installations: LiveJournal, Flickr, Facebook. These shops still use RAID in some situations, but only because of MySQL’s scalability limitations.

For example, most MySQL shops don’t have failover master replication setups, so they invest in more expensive RAID 10 controllers for their databases to reduce master downtime.

Imagine for a moment that you had stable, automated master promotion. A lot of people are playing with this now (we think we have it solved as well) and hopefully it will become commonplace.

If your master fails you just promote a slave.

So why do you need RAID 10? It’s twice as expensive!

If you wanted to, you could use software RAID 0. It’s just as fast at less than half the price, and probably a third of the price once you factor in the cost of the RAID card. Now you can buy another server!

Why stop there? Just ditch the RAID setup altogether. No software RAID.

But how do you utilize all those disks, you ask? Good question.

Run parallel MySQL installs!

More and more MySQL implementations are rolling their own sharded database technologies.

Imagine you had an Opteron box with 16G of memory, four HDDs, and four cores.

Instead of running only ONE MySQL instance, you could run four on this box: one per HDD and core, each with 4G of memory.

This setup has a lot of desirable properties.

Each instance is a member of a shard within a larger federation. You don’t have to worry about idiosyncrasies like RAID chunk-size tuning. If a single disk dies you don’t lose the whole box, only that member of a shard. So one dead disk costs you just 1/4th of 1/Nth of the capacity in your cluster.

The key win, though, is that you can get significantly higher disk throughput. The binary log, the InnoDB write-ahead log, and the data can all share one dedicated disk per instance. In our benchmarks, taking two 100MBps disks and running them in RAID 0 only gives us about a 50% performance boost.

In this setup we scale linearly with the number of disks. We can now get 400MBps of IO with four disks. Not too shabby. If we were to run RAID 0 we wouldn’t see anywhere near 400MBps.

One difficulty lies in configuration. You’d have to run four MySQL processes. The easy solution here is to run one per port (3306-3309).
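To make that concrete, here is a minimal sketch of how an application might route queries to the right local instance. Everything in it is an assumption for illustration (the MySQLdb client library, the port range, the modulo-hash scheme, and the table and credentials), not a prescription:

```python
# Sketch: route a query to one of four mysqld instances on the same box,
# one per disk/core, listening on ports 3306-3309.
# Library, credentials, and schema are illustrative assumptions.
import MySQLdb

SHARD_PORTS = [3306, 3307, 3308, 3309]   # one mysqld per port/disk/core

def connection_for_key(shard_key):
    """Hash the shard key (e.g. a user id) to one local instance."""
    port = SHARD_PORTS[hash(shard_key) % len(SHARD_PORTS)]
    return MySQLdb.connect(host="127.0.0.1", port=port,
                           user="app", passwd="secret", db="shard")

def fetch_user(user_id):
    conn = connection_for_key(user_id)
    try:
        cur = conn.cursor()
        cur.execute("SELECT name, email FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()
    finally:
        conn.close()
```

In a real federation you’d want the key-to-instance mapping in a lookup service rather than a hard-coded modulo, so shards can be moved between boxes without rewriting the application.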

Another potential solution would be to run virtualization software like KVM or Xen. This would increase your complexity, but you’d avoid the multi-instance configuration hassle. You’d also be relying on the performance of your virtualization software.

One area where this does fall over, though, is battery-backed write-caching controllers. It would be interesting to benchmark RAID 0 with a caching controller against four independent MySQL instances across four disks.

Thoughts?


  1. One spool per database?! A lot of people keep indices and tables on separate spools, so they would need more spools. I wish our current crop of open-source databases would stop punting on replication. I loved how in Bigtable you treated your database as one large system and the client dealt with tablet servers behind the scenes. I wish I could do that in MySQL/Postgres transparently.

  2. Yeah…… with InnoDB you can’t really separate your indices and data. You can sort of split it up with partitioning in 5.1.

    Replication is supported in 5.1 … the truth is that MySQL doesn’t scale unless you implement federated databases.

    One other thing I should point out.

    Partitioning in MySQL 5.1 doesn’t necessarily solve the performance problem either, because they’ve punted on concurrent disk IO for now. It’s a BIG feature though, so it’s pretty sad they didn’t implement it….

  3. We use RAID 1 as a way to keep from losing a node due to a drive failure. We have about one drive failure every 4 months. Because the servers are all RAID 1, there is no service loss. We don’t use RAID 5 anymore, for sure. It’s not really useful. We still use RAID for our 1TB of storage, of course. But that is a different subject. And with 1TB drives hitting the market, you can never tell. But, again, losing one drive loses 1TB of data. Ewww. You don’t want to have to back up your backup. It’s a little cyclical at that point.

  4. doughboy….

    The point I was trying to make was that you could ditch your RAID 1 …. double your storage capacity, and remain redundant at the server level.

    Of course you’d need more infrastructure work.

  5. The other point is, the more disks you have, the more IO throughput you get. At least with a single server instance, and you could use LVM to aggregate the disks.

  6. Pardon my ignorance, I’m not a MySQL expert. Google is not really a realistic example, as they are a read-many, write-very-little type setup, not typical of most setups.

    Using MySQL, how would you set up a system with roughly 100GB of data, a very high I/O load, and 99.999% uptime requirements? Including data recovery between sites that are 30-50km apart? The only way I know how is to use very expensive storage devices (HP, EMC, etc.) that replicate data synchronously, so as to tolerate a total loss of one data center.

    I’d love to dump the expensive storage arrays, but is that realistic?

  7. @Mike –

    Interesting, I wouldn’t have considered Google a “low write” example considering how often they crawl sites these days. True, I’m sure those writes don’t match the extremely high read load that their search engine produces, but I have to think that with the increase in result “freshness” that they are still doing a significant amount of writes.

    I’m all for a movement away from RAID for sure if it is feasible. Just one question – what do you think the power and energy consumption differences are between an inexpensive server array and a RAID array?

  8. Michael

    Google isn’t “low write” as much as “append-optimized write”. They write quite often, but GoogleFS is nifty in that an append operation can be parallelized with reads. Fascinating stuff ;)

  9. Hmm…looking more and more like Beowulf clusters will become a reality in business someday.

    ;)

    I know, I know, not really related to what you’re talking about here, but I thought it was a good plug spot for Beowulf!

    I am now going to have to try the ideas you’ve outlined here in a dev environment. Here’s a question for you though: how would this work in a VMware-type environment, do you think?

  10. In Google’s case, they are fine with data center A and data center B not matching. At their size, they have no other choice.

    As for doubling the servers, that requires more power, more space, and more system administration.

  11. tazspaz……

    It would work the same way that any virtualization would work (Xen, KVM, etc.)

  12. “I’m all for a movement away from RAID for sure if it is feasible. Just one question – what do you think the power and energy consumption differences are between an inexpensive server array and a RAID array?”

    Probably about half, if you need the same compute power and move from RAID 10 to RAID 0……

  13. “Using MySQL, how would you set up a system with roughly 100GB of data, a very high I/O load, and 99.999% uptime requirements? Including data recovery between sites that are 30-50km apart? The only way I know how is to use very expensive storage devices (HP, EMC, etc.) that replicate data synchronously, so as to tolerate a total loss of one data center.”

    …. Use MySQL sharding… then replicate between sites. If you NEED to make sure data is written, just block until it reaches one slave.

    The secondary data center is also interesting. If it lags too much you’d have to take it out of production.
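    Going back to the “block until it reaches one slave” part: one way to sketch it with stock MySQL replication (semi-synchronous replication patches do this more directly; the connection handling and the 10-second timeout below are illustrative assumptions) is to read the master’s binlog position after the commit and wait on it from a slave connection:

    ```python
    # Sketch: confirm a committed write has reached at least one slave.
    # Uses SHOW MASTER STATUS on the master and MASTER_POS_WAIT() on a slave;
    # connection setup and the timeout are illustrative assumptions.
    import MySQLdb

    def write_and_wait_for_slave(master_conn, slave_conn, sql, params):
        cur = master_conn.cursor()
        cur.execute(sql, params)
        master_conn.commit()

        # Where did that commit land in the master's binary log?
        cur.execute("SHOW MASTER STATUS")
        binlog_file, binlog_pos = cur.fetchone()[:2]

        # Block until the slave's SQL thread has applied past that point.
        scur = slave_conn.cursor()
        scur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                     (binlog_file, binlog_pos, 10))
        if scur.fetchone()[0] in (None, -1):
            raise RuntimeError("slave did not catch up within the timeout")
    ```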

  14. This isn’t really tied to MySQL, just more people properly partitioning their databases. We do similar stuff with BerkeleyDB and all-proprietary stuff.

  15. True……. I certainly agree it’s not tied to MySQL. This is why I mentioned Bigtable.

    Also, the last part *is* sort of MySQL specific – mentioning running one MySQL daemon per disk/core…

    Kevin

  16. You talk yourself in circles. From the definition that you yourself provided, RAID “divide[s] and/or replicate[s] data among multiple hard drives,” and this definition doesn’t specify how this replication occurs or within what domains or systems. Google IS RAID, not something like RAID; they have the same inexpensive disks we have and they have far greater redundancy than we could ever count. The database software they run to power this RAID is ancillary to the fact that Google is the most redundant array of inexpensive disks on the planet.

    RAID 1 provides the most scalable read access on the planet. All those people screaming bloody murder against mutable shared state variables? They may or may not know it, but they are RAID 1 fanatics.

    You talk about RAID 0, but RAID 0 IS NOT RAID, not even in name. RAID: redundant array of inexpensive disks. Where is the redundancy in striping? Of course its scalability is crap, there’s no redundancy.

  17. Eric

    @rektide,

    While I agree that there are a few weak points in Kevin’s argument (no flaws, but weak points); it’s fairly obvious to me that you’ve actually never seen a large system running, let alone built one.

    It’s funny that you mention Google. While I don’t work there, I work at a competitor and have several colleagues @ the Googleplex, and happen to know that the overall architecture of the system I work on (yet another search system) is quite similar to Google’s architecture for search. Google is NOT RAID. Google is an example of partitioning, at a MASSIVE scale, with intelligence to aggregate data coming from various partitions.

    You further mention the enormous number of disks that Google has. I’m sure the number is absolutely mind-boggling. The system that I work on pales in comparison, I’m sure. It only has ~3k servers, each with an average of 4 disks – but a few hundred are connected to a very large, even more expensive SAN for mass storage (don’t blame the SAN on me or my team, the “infrastructure architects” insisted on it). Do you have any clue what kind of failures you see on a weekly basis with so many machines?

    Let’s talk about failures with large clusters. An average “server” disk nowadays has an MTBF of 500k hours. When you have (~3000 machines * 4 disks/machine) 12,000 disks, you will statistically have a hard drive fail every 41.6 hours; in practice you’ll see disks going tits up about once a week, some weeks a bunch of disks die, some weeks none die. I have no numbers for MTBF of CPUs or RAM, but hell if I haven’t seen servers die monthly with dead CPUs; damned if I haven’t seen our indexing processes crash “randomly”, and after a restart the IO stats on the machine go nuts (it’s paging), do a prstat (it’s Solaris) and see 12G when I expect 16G. Shit happens. It’s crappy, but the _very_ best thing you can do to deal with that shit is to partition and replicate.

  18. Hey Eric…. thanks for the response. I basically agree with your commentary RE rektide.

    But yeah… When you have a LARGE server install with THAT many boxes failures are the NORM not the exception. You have to design for failures and have your infrastructure smart enough to handle failed hardware without taking the system down…

    Onward and great commentary!

  19. Eric

    “big” server farms will suck the life from you if you write software like a newb (see Twitter).

    I glossed over a point in my prior comment, but it deserves some more words. Aggregation is the key to making partitioning sane. The code base I work on (for the next two weeks anyway :)) has some remarkable stats – near real-time updates of the index, ~500msgs/sec of updates, an average of 1,200 queries per second per machine (yes, per machine – with peaks of 2500), and an index that is mostly new every week. But by far the single most complex chunk of code in the system is the code that does the aggregation.

    When you partition data – think RAIC (redundant array of inexpensive computers – where “inexpensive” is in the eye of the beholder); intelligent aggregation is a must. It’s a damn hard problem. I know how to do it for a search system (i.e. no RDBMS, a well-controlled unambiguous query syntax, and _deep_ knowledge of how data gets distributed); how do you do it with MySQL shards (excuse my ignorance)?

  20. domweb

  20. I work for a high-tech manufacturer that uses computer control extensively for everything from automation process control to SCADA to overall production control and monitoring. We are a 1000-person company and we probably have 300 computers devoted to production machinery (never mind all the office and worker PC workstations, etc.)

    RAID is integral to all our production computers, and they have nothing to do with MySQL. The RAID standard is an excellent and inexpensive method of maintaining data integrity, which is absolutely vital in our operations.

    I have been actively involved in manufacturing for about fifteen years and I can tell you that RAID will not die out until a better/cheaper standard is created for the manufacturing sector. Keep in mind that manufacturing technology does not upgrade at the rate of Moore’s Law. I have seen machines running on 1997 hardware/software controls and they will never be upgraded because of cost to migrate the custom written apps to the latest Windows operating system.

    So, I respectfully disagree that RAID will die soon. My industry buys it constantly and will have a demand for it for several years to come.

  21. Also ignores the fact that hardware’s cheap compared to complex database replication schemes.

    You want a really resilient DB, use a big IBM zSeries mainframe.

  22. Boyd Hemphill

    Can someone explain “sharding”? I am unfamiliar with the term.

    In terms of the 100GB database etc. from above, why not implement a MySQL Cluster?

    Further, a cluster can completely eliminate the need for RAID if you are OK with losing a node completely while a drive failure is recognized and remedied. The cost is in ensuring you have enough redundant nodes to keep 100GB of memory available at all times.

    So, in a cluster, one must weigh the cost of RAID against the cost of a server, rack space, power etc.

    That said, I respectfully submit that prognosticating the death of RAID is a bit like prognosticating my own death. Yep, I will die. So, bully for you, you got it right. RAID is much too useful, affordable and speedy to be gone in the near or even medium term, but the ever-popular Moore’s law tells us it will eventually be replaced by something better, faster and even more stable.

    Peace
    Boyd

  23. @Eric

    You say:

    intelligent aggregation is a must. It’s a damn hard problem

    What do you mean by “intelligent aggregation”? What data is being aggregated? Queries, application data… Are you trying to avoid dispatching queries to nodes that don’t have the data? Sounds like an interesting insight, but I am not sure what you mean.

  24. Eric

    @Jonathan

    First off, I want to point you to an interesting presentation – though it probably gives away more information about me than I’d like :) – http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

    The type of information contained in this document is rarely seen in the wild (most companies don’t like publishing this kind of data) and quite relevant to this discussion.

    I’ll explain what I mean by intelligent aggregation based on what I work on – a search engine. In general, you can think of a search engine as an RDBMS with one table and an implicit “SELECT ?FIELDS? FROM index WHERE ?FILTERS?” for every query – where ?FIELDS? is replaced by the list of index fields that the query wants to see, and ?FILTERS? is the list of implicit predicates that allows the search engine to filter the document set.

    Aggregation: imagine you have data stored in a bunch of nodes, partitioned by “X” (X, in my experience, is best if it splits the data essentially randomly, to avoid hotspots). E.g. 1B rows of, say, customer records, split amongst 10 logical partitions, with several replicas (say 20) for each partition.

    Your aggregator looks like a normal search engine to every client – you send queries to the agg(s), you get a dataset back – it exposes the exact same interface. Your aggregator parses and does basic evaluation of the query, then sends it to one replica in every partition.

    Here’s where the aggregation comes in… the aggregator collects the results that it can (sometimes queries time out, sometimes a node has no results for the query, sometimes nodes are down – based on policy the agg can retry failed queries or give partial results), and generates a properly formed dataset as a response to the original client. The intelligence comes into play here (and this is where things get _really_ ugly): if the query does sorting, or grouping (think GROUP BY), or calls aggregate functions (MAX, MIN, AVG, etc., just like SQL), the aggregator has to redo those operations itself. For instance, if you sort by, say, customer name, the agg has to do a merge operation; or if your query uses the aggregate function MAX, MAX has to be re-evaluated by the agg, etc. I haven’t even mentioned caching, or keeping the sort stable for pagination, etc.

    I’m simplifying this _way_ too much (it would take a fairly long paper to explain it thoroughly), but I hope you get the general idea.
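    To make the scatter-gather idea concrete, here is a toy sketch of that flow: fan a query out to one replica per partition, tolerate partial failures, then redo the sort and limit at the aggregation layer. The partition endpoints, thread pool, and query shape are made up for illustration, not the actual system described above:

    ```python
    # Toy scatter-gather aggregator: fan a query out to one replica per
    # partition, collect what comes back, and redo the sort/limit at the
    # aggregation layer.  Endpoints and query shape are illustrative only.
    from concurrent.futures import ThreadPoolExecutor

    def query_partition(endpoint, query, timeout):
        """Send the query to a single partition replica (stubbed out here)."""
        raise NotImplementedError("wire protocol goes here")

    def aggregate(partitions, query, sort_key, limit, timeout=0.5):
        rows, failed = [], []
        with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
            futures = {pool.submit(query_partition, p, query, timeout): p
                       for p in partitions}
            for fut, endpoint in futures.items():
                try:
                    rows.extend(fut.result(timeout=timeout * 2))
                except Exception:
                    failed.append(endpoint)   # policy: retry or return partial

        # The "intelligent" part: sorting, grouping, MAX/MIN/AVG, pagination
        # all have to be re-done here over the merged partial results.
        rows.sort(key=lambda r: r[sort_key])
        return rows[:limit], failed
    ```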

  25. Eric

    @Jonathan

    I just noticed (after looking at your blog) that you work with Kevin; so if you have any questions, ask him to grab my email addr from comment log (if it’s available).

  26. A mid-sized enterprise. 11pm on Saturday. Disk failure. There are two options: either your software is written to tolerate disk failures, or an administrator has to cope with the problem. And if it is a service that is used 24/7, he should be quick.

    For an enterprise, some $400 for a RAID card and some $1000 (or so) for more disks is a price worth paying for reliability. Taking Google as an example is, at this point, a little bit ridiculous. Most of the data Google presents (especially the search index) can be recomputed. If you run an online shop this is not the way, also because you cannot put a thousand computers with redundant information in your rack. Not to mention the software enhancements you have to pay for if you want automated switching in case a database fails.

    And going even further: imagine you’ve got a site where users upload pictures. Surely something like rsync may do, but replication is expensive (in terms of time); we had a client where the rsync command (to the backup servers) took minutes. Losing pictures while this process was running? No way. RAID is not dying. It is a convenient way to ensure a service level at a (fairly) low cost nowadays. It may evolve, it may even be substituted, but that’s a whole other part of the story.

    Just my 2 cents, Georgi

  27. “A mid-sized enterprise. 11pm on Saturday. Disk failure. There are two options:
    either your software is written to tolerate disk failures, or an
    administrator has to cope with the problem. And if it is a service that is used
    24/7, he should be quick.”

    I think you’re making the assumption that I’m suggesting ditching RAID in favor
    of an architecture that is NOT fault tolerant.

    It’s just the opposite. I’m saying that with the correct cluster architecture
    you can scale and be fault tolerant at the same time without hacks like RAID.

    I’m also asserting that this trend will be 3-5 years in the FUTURE. NOT NOW!

  28. “… I’m saying that with the correct cluster architecture
    you can scale and be fault tolerant at the same time without hacks like RAID. …”

    Okay, then let me rephrase the question: what is the correct cluster architecture? By eliminating RAID you will have many more problems to solve in other places. Assuming you run MySQL on two machines and one fails, the software has to be aware of this and switch over. Speaking of file servers, you will need the drives of two machines to mirror the files anyway. An enterprise will have to pay for that (software, more maintenance, administration overhead).

    “… I’m also asserting that this trend will be 3-5 years in the FUTURE. NOT NOW! …”

    The deeper the “system” you want to substitute is wired into your hard/software, the more complex it will be to replace it. And my 2 cents say that the filesystem is deep enough so that 3-5 years is very, very, very optimistic ;-)

  29. James Day

    Georgi, it depends partly on how big you are. If you are so small that one hard drive and CPU can do the job, RAID looks good because disk failures may be dominant and a second disk is quite cheap.

    Once you get big enough to have lots of CPU and RAM and chipset and power supply failures, the whole system becomes unreliable and RAID isn’t enough to fix it. With big numbers 2 computers isn’t enough redundancy. Even 4 isn’t enough if you’re very heavily loaded and can see two failures before you’ve had a comfortable time (not overnight!) to fix the first broken one.

    As you get big the relative cost of hardware and software also changes and software can become cheap compared to hardware.

    Even when you’re small, you have to be able to handle switching over when a computer fails if you really want 24/7/365 operation. You can make it less common with RAID but you’re still designing for failure if you rely on RAID and a human switchover. Many places do rely on that, of course. It’s often good enough and the odds mean that you’ve a good chance of staying lucky for long enough. You can’t stay lucky long enough as the computer count increases, so you have to deal with complete computer failure, dual drive failure and so on.

    To remove RAID in the small case, look at flash drives, possibly with software RAID0 (oops, well, you’ll forgive me if it’s just to get the volume size you need, I hope…. :) ). Those will probably make storage failures less common than power supply or other problems. If you change jobs every three years and only need a couple of computers you may have moved or the system may have been upgraded before there’s time for this setup to fail.





