Final Thoughts on SSD and MySQL AKA Battleship Spinn3r
I need to give it a bit more thought but it looks like we’re going forward with deploying Spinn3r on SSD. Specifically, machines with 3 SSDs on Linux software RAID.
The performance of SSDs is nothing short of astounding. When tuned correctly these drives were nearly 10x the performance of the same box running a conventional HDD RAID. Further, since the drives are much larger than memory, you get on the order of 10x the capacity of an in-memory database.
It’s not perfect though. An InnoDB database that fits entirely in memory could be 50-150% faster than a similar DB running on SSD. If you need massive transaction throughput it might still make more sense to use a hardware RAID controller and keep your entire DB in memory (that is, if you don’t become CPU bound first).
There are a few tricks to using SSD. A naive SSD + MySQL setup, deployed out of the box without any tuning, is going to be a dog in terms of performance.
On Linux, here’s what you should do to get optimum performance:
- Use the noop IO scheduler. Dynamically re-prioritizing IO doesn’t make much sense on an SSD. I should note that a hardware RAID controller will almost certainly do the wrong thing with SSD, because it usually runs an internal IO scheduler that assumes you’re using an HDD. The noop scheduler was an order of magnitude faster in our tests than the deadline scheduler.
The one exception to this rule is a controller or an SSD that implements a log structured filesystem. These don’t really exist on the market just yet but I expect they’re going to be available around Q2 or Q3.
- Disable write caching unless you have a battery backed RAID controller (but these are expensive). All my benchmarks show that disabling write caching on these drives doesn’t have much of a performance impact.
- Run XFS (though ext3 is fine as well, just slightly slower). XFS showed somewhat faster sequential write performance.
- Use the noatime mount option. This is critical, as atime updates on every read will really hurt your performance.
- 4k stripe size for software RAID.
- Disable read ahead on all drives. Doesn’t make much sense in the SSD world.
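Taken together, the list above boils down to a handful of commands. This is just a sketch: the device names (sda/sdb/sdc), the RAID-0 level, and the mount point are illustrative assumptions, not our exact config, so adjust them for your layout:

```shell
# Build the software RAID array with a 4k chunk size
# (RAID-0 here purely for illustration)
mdadm --create /dev/md0 --level=0 --chunk=4 --raid-devices=3 \
    /dev/sda /dev/sdb /dev/sdc

# noop scheduler on each SSD (not persistent across reboots)
for d in sda sdb sdc; do
  echo noop > /sys/block/$d/queue/scheduler
done

# Disable read-ahead on the drives and on the array itself
for d in sda sdb sdc md0; do
  blockdev --setra 0 /dev/$d
done

# Disable the on-drive write cache (no battery-backed controller here)
for d in sda sdb sdc; do
  hdparm -W0 /dev/$d
done

# XFS, mounted noatime
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /var/lib/mysql
```

Note that the scheduler and read-ahead settings reset on reboot, so you’ll want them in an init script.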
If you’re on MyISAM you’re in an even better position. While MyISAM is slightly slower and uses more CPU during INSERTs, it’s going to be about 2x faster than InnoDB on SELECTs, since it can use a 4k page size and doesn’t have much CPU overhead.
InnoDB could be faster but it’s going to take a bit more CPU tuning.
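On the MySQL side, the corresponding knobs might look something like this my.cnf fragment. Every value here is illustrative, not the actual Spinn3r config; O_DIRECT just keeps InnoDB from double-buffering pages through the OS cache:

```ini
[mysqld]
# InnoDB: bypass the OS page cache (illustrative sizes below)
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 4G

# MyISAM: index cache
key_buffer_size = 1G
```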
I’m optimistic that I could squeeze another 10x performance out of SSD. Additional things I need to look at include:
- Find out why InnoDB is such a CPU hog under this load. MyISAM can handle the same transaction load in half the time with hardly any CPU, while InnoDB takes twice as long and pegs the CPU at 100%. I need to load it into kcachegrind, but a quick strace showed it continually calling futex(), which might be a source of the problem. On the other hand, even running with just 10 threads (where lock contention should be low) sent the CPU to 100%, so futex() contention might not be the whole story.
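The syscall check is easy to reproduce. A sketch, assuming mysqld is running and you have permission to attach to it:

```shell
# Attach to mysqld for 30 seconds, then detach and print a per-syscall
# summary table. A profile dominated by futex() time points at mutex
# contention inside InnoDB rather than at the IO path.
timeout 30 strace -c -f -p "$(pidof mysqld)"
```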
- While EasyTech’s MFT or other SSD cards seem interesting, InnoDB won’t get any faster just by making the IO subsystem more efficient. I first need to figure out why it’s burning so much CPU. Faster IO subsystems are attractive, but only once InnoDB can drive them without pegging the CPU.
- Under the current load, while performing SELECTs with sysbench, the drives were at nearly 80% saturation. Even if I could shrink InnoDB’s IO footprint, the drives are already close to saturated, so there’s only a modest additional gain to be had on the IO side.
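The back-of-envelope version of that headroom argument: at ~80% drive utilization, the most that shrinking the IO footprint can buy is the remaining idle fraction of the drives:

```shell
util=80   # observed drive utilization, percent
# Best-case speedup if IO were the only bottleneck: 100/util,
# expressed as a percentage gain over the current throughput.
gain=$(( 100 * 100 / util - 100 ))
echo "max additional gain: ${gain}%"   # → max additional gain: 25%
```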
- Play with InnoDB with an 8k page size. The benchmarks didn’t seem too impressive here. I’m also not tempted to deploy this in production because I’ve seen a deadlock in MySQL and a bug with innodb_file_per_table. Also, it looks like 4k pages in InnoDB don’t work (at least in my tests).
- Investigate random write performance on SSDs. The sysbench random write benchmarks showed one drive hitting 100% utilization while the other drives weren’t doing much IO. My theory is that ext3 was putting all the data blocks for the file in a single block group, so the filesystem kept updating that group’s metadata for every data block written. That concentrates writes on one drive and forces the same erase blocks to be rewritten over and over (with the wear leveling overhead that implies), which isn’t very fast.
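The lopsided-utilization symptom is straightforward to reproduce. A sketch using sysbench’s fileio mode while watching per-drive utilization with iostat (thread count and file size here are illustrative):

```shell
# Prepare test files totalling 8GB on the SSD filesystem
sysbench --test=fileio --num-threads=8 --file-total-size=8G \
    --file-test-mode=rndwr prepare

# In another terminal: extended per-device stats at 1s intervals.
# One drive pegged near 100 %util while the others sit idle is the
# signature of the blocks landing in a single block group.
iostat -x 1

# Run the random write workload
sysbench --test=fileio --num-threads=8 --file-total-size=8G \
    --file-test-mode=rndwr run
```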
I think that if I could nail all these problems one could put InnoDB on an 8 disk RAID subsystem and get 720MB/s throughput. MyISAM would work with this right now since it’s not CPU bound for SELECTs.
What’s sad here is that these SSDs could be 100x faster rather than just 10x faster. The solution seems to be improving InnoDB’s CPU efficiency (or moving to Maria). The next generation of SSDs is also going to need to use log structured filesystems internally, which should be a big performance boost for them. Even if I could get my hands on one of these STEC or Fusion IO devices, I’d have to buy the latest quad core boxes to even remotely take advantage of the additional IO.
Also, if you’re using something like Bigtable via Hypertable or HBase, your performance should be stellar, since these are append-only databases and SSDs do very well with sequential reads and writes.