Archive for the ‘clustering’ Category

I’ve been looking at replacing Munin with our own higher level proprietary monitoring system for keeping track of cluster-wide statistics.

This is needed for a new feature we’re trying to ship with Spinn3r so that we can expose some of our internal statistics to our customers.

We weren’t able to do this before because our internal monitoring was a bit of a hack due to the lack of quality in the open source monitoring tool chain.

About 80% of the work in a performance monitoring tool is in the charting component and I’m glad to see that the Google chart API basically rocks.

It’s very well done. The only real complaint I have is that you can’t submit a URL longer than 2000 bytes. That seems like a limitation in their own internal tooling, as Firefox, Safari, etc. support URLs up to 65k.

This limitation is more pathological than you might think because you can’t place two metrics on top of each other in the same graph. They have to move into two separate graphs, or you’ll generate an overlong URL and Google will return an HTTP 40x response.
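
For example, here’s a minimal sketch of the workaround: rather than encoding both metrics into one oversized URL, render each one as its own chart. The chart parameters (cht for type, chs for size, chd for data) are from the Chart API docs; the data values are made up:

BASE="http://chart.apis.google.com/chart"

# One chart per metric keeps each URL comfortably under the 2000 byte cap.
curl -o cpu.png  "$BASE?cht=lc&chs=600x200&chd=t:10,20,40,35,60"
curl -o disk.png "$BASE?cht=lc&chs=600x200&chd=t:5,15,25,20,30"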

The other cool thing is the API is FAST. Check out the render time of this chart, which is 300k pixels…

I can now render graphs in about 0.09 seconds, which is an order of magnitude faster than Munin with rrdgraph on a quad core Opteron at 2.6GHz.

I hope they do something about exposing the component used with Google Finance. It is pretty sweet as well.

Spinn3r is hiring for an experienced Senior Systems Administrator with solid Linux and MySQL skills and a passion for building scalable and high performance infrastructure.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog data.

We crawl the entire blogosphere in realtime, remove spam, rank and classify blogs, and provide this information to our customers.

Spinn3r is rare in the startup world in that we’re actually profitable. We’ve proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We’ve also been smart and haven’t raised a dime of external VC funding, which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

Overview:

In this role you’ll be responsible for maintaining performance and availability of our cluster as well as future architecture design.

You’re going to need to have a high level overview of our architecture but shouldn’t be shy about diving into MySQL and/or Linux internals.

This is a great opportunity for the right candidate. You’re going to be working in a very challenging environment with a lot of fun toys.

You’re also going to be a core member of the team and will be given a great deal of responsibility.

We have a number of unique scalability challenges including high write throughput and massive backend database requirements.

We’re also testing some cutting edge technology including SSD storage, distributed database technology and distributed crawler design.

Responsibilities:

  • Maintaining 24 x 7 x 365 operation of our cluster
  • Tuning our MySQL/InnoDB database environment
  • Maintaining our current crawler operations
  • Monitoring application availability and historical performance tracking
  • Maintaining our hardware and linux environment
  • Maintaining backups, testing failure scenarios, suggesting database changes

Requirements:

  • Experience in managing servers in large scale environments
  • Advanced understanding of Linux (preferably Debian). You need to grok the kernel, filesystem layout, memory model, swap, tuning, etc.
  • Advanced understanding of MySQL including replication and the InnoDB storage engine
  • Knowledge of scripting languages (Bash and PHP are desirable)
  • Experience maintaining software configuration across a large cluster of servers
  • Network protocols including HTTP, SSH, and DNS
  • BS in Computer Science (or comparable experience)


It dawned on me that if I were working for Twitter, I would just assume the service is down unless told otherwise.

This led me to the conclusion that one should invert monitoring and send off a notification when Twitter is online.

Seriously. I like those guys but this is getting kind of embarrassing.

As someone interested in distributed, scalable, and reliable web services, I think I might stop using it out of protest.

Things could be worse though – they could be using Hadoop! :-)

You can see a picture of Twitter’s main database server below:

[Image: Twitter’s main database server]

I’m going to be migrating to using ZooKeeper within Spinn3r for a myriad of reasons but this one is especially powerful.

One could use ZooKeeper to configure external monitoring systems like Munin and Ganglia.

ZooKeeper enables this with its support for ephemeral files.

If you have an external process like a webserver, database, robot, etc. you can have it create an ephemeral file which registers its services and presence in the cluster.

For example:

/services/www/www32.mydomain.com/80

Would represent a machine named www32.mydomain.com.

You can then have Munin connect to ZooKeeper, enumerate the files in /services/www, and have a cron script continually regenerate a munin.conf file.

The great thing is that if you shut down Apache, the change will automatically be reflected in your Munin config.
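
A rough sketch of what that cron script could look like, assuming the stock zkCli.sh client and a ZooKeeper server at zk1:2181 (both assumptions; parsing zkCli.sh output like this is brittle, but it shows the idea):

#!/bin/bash
# Regenerate munin.conf from the hosts currently registered in ZooKeeper.
# zkCli.sh prints the child list as its last line, e.g. [www32.mydomain.com].
HOSTS=$(zkCli.sh -server zk1:2181 ls /services/www 2>/dev/null \
            | tail -1 | tr -d '[],')

{
    echo "# generated from ZooKeeper - do not edit by hand"
    for host in $HOSTS; do
        echo "[$host]"
        echo "    address $host"
        echo "    use_node_name no"
    done
} > /etc/munin/munin.conf.new \
  && mv /etc/munin/munin.conf.new /etc/munin/munin.conf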

This of course implies that you have ZooKeeper integration in your init scripts.

This is becoming easier with the presence of HTTP gateways for ZK. I haven’t looked at them too much but as long as PUT and GET are supported this is about 80% of the functionality needed to implement the above.

One issue is enumerating files in directories over HTTP. I assume a proprietary XML protocol is used.
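
Assuming a gateway with a REST-ish mapping of paths onto znodes (the host, port, and URL layout here are pure guesses on my part), the two sides could be as simple as:

# On service startup: register presence. Note a true ephemeral node needs
# the gateway to tie node lifetime to a session or heartbeat of some kind.
curl -X PUT "http://zk-gateway:8080/services/www/www32.mydomain.com/80"

# On the monitoring side: enumerate the registered hosts.
curl -X GET "http://zk-gateway:8080/services/www/"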

We’ve had our SSDs in production for more than 72 hours now. We’ve had them in a slave role for nearly a week but they’ve now replaced existing hardware including the master.

The drives are FAST. In our production roles they’re reading at about 45MB/s and writing at about 15MB/s while sitting at only about 22% disk utilization.

Not too bad.

We also have about 70GB free on these drives so that leaves plenty of room to grow.

There was one small problem that I didn’t anticipate.

When we were running our entire database out of memory we would only use one or two indexes per column. We had a set of reporting tasks which ran some queries once every 5 minutes.

The columns these queries were using didn’t have any indexes so InnoDB CPU would spike for a moment and continue.

Modern machines have memory bandwidth of about 15GB/s so these queries were mostly CPU bound but completed in a few seconds.

When we switched over to SSDs all of a sudden these queries needed to perform full table scans and were reading at about 100MB/s for two minutes at a time.

Fortunately, an ALTER TABLE later and a few more indexes fixed the problem.
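
The fix amounted to nothing more exotic than this (the table and column names are hypothetical):

mysql spinn3r -e "ALTER TABLE report_stats ADD INDEX (source), ADD INDEX (posted)"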

We had dropped those indexes when we were running everything out of memory because the queries could be resolved so quickly. Now that the data was on disk again we had to revert to the olde school way of doing things.

I just finished watching the Disk is the New RAM video which a number of bloggers have been talking about.

If you’re lazy like me you can just read this blog post to get the theory in a nutshell.

FYI, I transcoded it in my scalecast podcast if you want to watch it on your iPod.

There are some interesting concepts here. I’m a bit frustrated that I didn’t blog about this earlier as I proposed a similar architecture for use in Spinn3r about 1.5 years ago. However, I identified a number of both implementation and theoretical issues that prevented us from actually spending any significant amount of time in this area.

For certain data structures this type of technique will work very well.

This is how I discovered this idea in the first place. We were evaluating improving the performance of our crawl queue in Spinn3r. It’s actually a difficult problem, and one that I won’t go into in much detail here, for brevity.

In fact, it works out so well that I’m surprised that they didn’t bring up queueing as an example. If you need a queue it actually might be beneficial to use this type of data structure regardless of whether you’re using a RAM based approach in other areas.

To review, a queue basically needs to perform two operations:

– enqueue items
– fetch items at the head of the queue.

Both are essentially sequential operations. Fetching items from the queue is a sequential read. Writing items to the end of the queue is a sequential write.

My proposal involved reading large chunks from the head of the queue and buffering the results in RAM to avoid disk seeks. Writes to the queue would simply go to disk. If you organized the file into 100MB extents then you could just clean up the queue after you’ve processed the items by deleting the extent files once all the work in them has been completed.
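
Here’s a minimal sketch of that design. The paths, the 100MB extent size, and the function names are mine, and ‘process’ stands in for the real per-item work:

QDIR=/var/spool/crawlq
EXTENT_BYTES=$((100 * 1024 * 1024))
mkdir -p $QDIR

enqueue() {
    # Sequential write: append to the newest extent, rolling over at ~100MB.
    local extent=$(ls $QDIR | sort -n | tail -1)
    [ -z "$extent" ] && extent=0
    if [ $(stat -c %s $QDIR/$extent 2>/dev/null || echo 0) -ge $EXTENT_BYTES ]; then
        extent=$((extent + 1))
    fi
    echo "$1" >> $QDIR/$extent
}

drain_extent() {
    # Sequential read: process the oldest extent in one pass, then delete it.
    local extent=$(ls $QDIR | sort -n | head -1)
    [ -z "$extent" ] && return 1
    while read -r item; do
        process "$item"
    done < $QDIR/$extent
    rm $QDIR/$extent
}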

This wouldn’t work for priority queues since the internal results would have to be sorted. You can get around this by separating each priority level into its own queue instance and giving processing priority to the higher level queues. You’d move on to the next queue only after the first is exhausted.

Other types of algorithms (hashtables are a good example) involve lots of random seeks. This can be mitigated by using a ‘read ahead log’ to speed up disk seeks.

This is analogous to the ‘write ahead log’ used in some databases to speed up random IO. In a write ahead log all writes are buffered, sorted, and then applied to disk in the hope of making every write sequential.

The worst case scenario of this algorithm is O(N) (fully random reads) but I suspect the average case is much better.
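
As a toy illustration of the read side: imagine the pending lookups accumulating in a file as “offset length” pairs. Sorting them before the sweep turns a pile of random seeks into one mostly forward pass (a real implementation would read page-aligned chunks in-process rather than shelling out to dd):

# offsets.pending holds one "offset length" pair per buffered lookup.
sort -n offsets.pending | while read offset length; do
    dd if=/var/db/table.dat bs=1 skip=$offset count=$length 2>/dev/null
done > results.bin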

There are just a few problems here:

– How long do you buffer reads? If you buffer them long enough you can approach an optimal read scenario, but the throughput of your cluster will be a bit slower.

– Do you buffer key to offset mapping data in memory? For large data/key ratios this probably makes sense. However, if you’re only storing one bit per key it might make sense to pre-allocate your hashtable and store the keys lexicographically on disk. This way you know the order simply by the key and can sort IO this way.

– Page size becomes a significant issue. If the underlying filesystem is created with a page size of 4096 bytes you should optimize your algorithms to perform contiguous reads on these sections. If you don’t buffer long enough your reads will still be contiguous, but a lot of your IO will be wasted since you throw out 90% of each page as data you don’t actually need.

I’ve given this issue a lot of thought and there are more areas that could be improved by the design of a smart on-disk read-ahead-log based hashtable. It probably makes sense to just implement your own until a more general optimized approach is found.

Further, data structures built on read ahead logs will probably need a number of solid implementations before this technique can really be used by general practitioners. Custom code can be written by experts, but regular database users will need a MySQL-like tool set.

Fully RAM based databases are being used in more and more places. For a lot of use cases throwing ALL of your data into memory will have a major performance benefit.

But when should you use RAM vs SSD?

RAM is about $100/GB. SSD is about $30/GB.

SSDs have a finite read performance of about 100MB/s. The only time you should throw money at an all-RAM scenario is when you need more performance than a SATA disk can give you.

Of course, this is obviously going to depend on your application.

Update:

Another thought: 100MB/s is close to 800Mbit/s, which is close to 1Gbit/s. Gigabit ethernet is pretty much the only networking option at the moment.

Even if you DID have a box that could saturate the gigabit link you’re not going to get any better performance by putting it all in RAM.

Interesting.

One could use dual ports, but that’s only going to allow you to use two SSDs.

You could use 10GbE but that’s going to cost you an arm and a leg!

We’re going to deploy with 3 SSDs per box. This gives us a cheaper footprint since we can run with a larger DB; however, I’ll never come anywhere near the full throughput of the drives if I’m on GbE.

Simple Process Snipers

Process snipers are used to kill errant processes.

I’ve used process snipers and watchdogs to handle realistic process management in large clusters in the past but never felt any of them were very elegant in terms of code simplicity.

It dawned on me the other day that this would work out perfectly:

# Kill any munin-update process older than 15 minutes.
for proc in `find /proc -maxdepth 1 -iregex "/proc/[0-9]+" -cmin +15`; do

    # /proc/PID/cmdline is NUL separated; convert NULs to spaces before matching.
    cmdline=$(tr '\0' ' ' < $proc/cmdline)

    if [ "$(echo "$cmdline" | grep -E '^/usr/share/munin/munin-update')" != "" ]; then
        pid=$(basename $proc)
        echo "Killing process: $pid with: $cmdline"
        kill -9 $pid
    fi

done

This fixes a bug in Munin where the background update task never finishes and ends up breaking our stats.

I need to give it a bit more thought but it looks like we’re going forward with deploying Spinn3r on SSD. Specifically, machines with 3 SSDs on Linux software RAID.

The performance of SSDs is nothing short of astounding. When tuned correctly these drives were nearly 10x the performance of the same box running hard disk RAID. Further, since the disks are larger than memory, you get 10x the capacity in terms of database size.

It’s not perfect though. If you have an InnoDB database that fits in memory it could be 50-150% faster than a similar DB running on SSD. If you need massive transaction throughput it still might make more sense to use a hardware RAID controller and put your entire DB in memory (that is, if you don’t become CPU bound first).

There are a few tricks to using SSD though. A naive SSD + MySQL setup without any tuning is going to be a dog in terms of performance out of the box.

On Linux, here’s what you should do to get optimum performance (a command sketch follows the list):

– Use the noop IO scheduler. Any sort of dynamic and re-prioritized IO on an SSD doesn’t make much sense. I should note that a hardware RAID controller will almost certainly do the wrong thing with SSD because they usually use an IO scheduler internally that assumes you’re using an HDD. The noop scheduler was an order of magnitude faster in our tests than the deadline scheduler.

The one exception to this rule is a controller or an SSD that implements a log structured filesystem. These don’t really exist on the market just yet but I expect they’re going to be available around Q2 or Q3.

– Disable write caching unless you have a battery backed RAID controller (but these are expensive). All my benchmarks show that disabling write caching on these drives doesn’t have much of a performance impact.

– Run XFS (though ext3 is fine as well but slightly slower). XFS showed a bit faster sequential write performance.

– Use the noatime mount option. This is critical, as atime updates will really hurt your system performance.

– 4k stripe size for software RAID.

– Disable read ahead on all drives. It doesn’t make much sense in the SSD world.
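
Here’s the command sketch promised above, assuming three SSDs at /dev/sd[abc] and an array at /dev/md0 (adjust device names and the mount point to taste):

# noop scheduler and no read ahead on each SSD; write cache off since
# there's no battery backed controller in this setup.
for d in sda sdb sdc; do
    echo noop > /sys/block/$d/queue/scheduler
    blockdev --setra 0 /dev/$d
    hdparm -W0 /dev/$d
done

# Software RAID 0 with a 4k chunk (stripe) size.
mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=4 \
      /dev/sda /dev/sdb /dev/sdc

# XFS, mounted noatime.
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /var/lib/mysql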

If you’re on MyISAM you’re in an even better position. While MyISAM is slightly slower and uses more CPU during INSERTs it’s going to be about 2x faster than InnoDB on SELECTs since it can use a 4k page size and doesn’t have much CPU overhead.

InnoDB could be faster but it’s going to take a bit more CPU tuning.

I’m optimistic that I could squeeze another 10x performance out of SSD. Additional things I need to look at involve:

– Find out why InnoDB is such a CPU hog under this load. MyISAM can perform the same transaction load in half the time with hardly any CPU, while InnoDB takes twice as long and pegs the CPU at 100%. I need to load it in kcachegrind, but a quick strace showed it continually calling futex(), which might be the source of the problem (see the strace sketch after this list). Then again, even running with just 10 threads pushed the CPU to 100%, so futex() might not be the problem.

– While EasyCo’s MFT or other SSD cards seem interesting, InnoDB won’t get any faster just by making the IO subsystem more efficient. I first need to figure out why it’s taking up so much CPU. Faster IO subsystems are attractive, but only once I figure out how to get InnoDB to perform without burning all that CPU.

– Under the current load, while performing SELECTs with sysbench, the drives were at nearly 80% saturation. Even if I could fix the IO footprint in InnoDB I’m only going to get an additional 35% or so in performance since the drives are already at about 65% utilization.

– Play with InnoDB with an 8k page size. The benchmarks didn’t seem too impressive here. I’m also not tempted to deploy this in production because I’ve seen a deadlock in MySQL and a bug with innodb_file_per_table. Also, it looks like 4k pages in InnoDB don’t work (at least in my tests).

– Investigate random write performance on SSDs. The sysbench random write benchmarks were showing that one drive would hit 100% utilization while the other drives weren’t doing much IO. My theory is that ext3 was putting all the data blocks for this file in one block group. The filesystem would then continually update that group’s metadata block for every data block that was updated. This would end up causing a lot of wear leveling on the disk and forcing erase blocks to be re-written, which isn’t very fast.
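
For the CPU investigation in the first item, this is the quick syscall profile I have in mind (a kcachegrind run would need oprofile or callgrind on top of this):

# Attach to mysqld, follow all threads, and count syscalls for 30 seconds.
strace -c -f -p $(pidof mysqld) &
STRACE_PID=$!
sleep 30
kill -INT $STRACE_PID    # strace prints its per-syscall summary on exit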

I think that if I could nail all these problems one could put InnoDB on an 8 disk RAID subsystem and get 720MB/s throughput. MyISAM would work with this right now since it’s not CPU bound for SELECTs.

What’s sad here is that these SSDs could be 100x faster rather than just 10x faster. The solution seems to be improving InnoDB’s CPU efficiency (or moving to Maria). The next generation of SSDs are going to need to internally use log structured filesystems, which will be a big performance boost for them. Even if I could get my hands on one of these STEC or Fusion IO devices I’d have to buy the latest quad core boxes to even remotely take advantage of the additional IO.

Also, if you’re using something like Bigtable via Hypertable or HBase your performance should be stellar since these are append-only databases and SSDs do very well with sequential reads and writes.

Apparently, the entire computing industry is stumped by the multi-core problem. Specifically, scaling single threaded code to run across multiple cores:

Both AMD and Intel have said they will ship processors using a mix of X86 and graphics cores as early as next year, with core counts quickly rising to eight or more per chip. But software developers are still stuck with a mainly serial programming model that cannot easily take advantage of the new hardware.

Thus, there’s little doubt the computer industry needs a new parallel-programming model to support these multicore processors. But just what that model will be, and when and how it will arrive, are still up in the air.

One such model already exists – MapReduce. More recently a paper was presented for running MapReduce on multi-core systems.

This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data-centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system.

We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes.

There are various other models here as well. You can run a producer consumer model which works very well.

Spinn3r runs on a task/queue model which is essentially producer consumer. Our CPU execution works VERY well across multiple cores. We have about 80 cores right now and our code executes across them in parallel without complication.

Here’s the problem: they’re trying to shove everything into the von Neumann architecture.

The von Neumann architecture is a computer design model that uses a processing unit and a single separate storage structure to hold both instructions and data. It is named after mathematician and early computer scientist John von Neumann. Such a computer implements a universal Turing machine, and the common “referential model” of specifying sequential architectures, in contrast with parallel architectures.

This just plain won’t work as it violates a number of the distributed computing fallacies including:

* Transport cost is zero.
* Latency is zero.
* Bandwidth is infinite.

Assuming that multiple cores can easily be accessed as one is false. Latency is not zero. There are L1 and L2 caches to consider. Cache coherency is also a problem.

Instead, why not yield to distributed computing fallacy #9:

A physical machine with multiple cores and multiple disks should not be treated as one single node.

Instead of treating an 8 core 32G box as one single node, treat it as 8 boxes with 4G of memory per core. Your IO subsystem should either use SSD (which is somewhat immune to disk seeking problems) or 8 HDDs.
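
As a sketch of what “treat it as 8 boxes” can look like in practice (shard-server, its flags, and the per-core disk layout are all hypothetical):

# Launch 8 shard processes on an 8 core box: one per core, each with
# its own disk and its own ~4G slice of memory.
for core in $(seq 0 7); do
    taskset -c $core ./shard-server \
        --data-dir=/disk$core/data \
        --memory-limit=4G \
        --port=$((9000 + core)) &
done
wait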

Problem solved.

I just spent a few hours today setting up RAID on SSD to see what the performance boost would look like.

I’m very happy with the results but they’re not perfect.

Here are the results comparing a single Mtron SSD vs a RAID 0 array running with a 64k chunk size.


Figure 1: Mtron SSD performance advantages with RAID.

The clear loser here is rndwr (random writes).

I’m pretty sure this has to do with the 64k stripe/chunk size. I’m willing to bet the RAID controller is writing the entire chunk for small updates, which would really hurt SSD performance since the ideal block size is 4k.

This is a MegaRAID controller so I need to figure out how to rebuild the array with a 4k stripe size.

I suspect I’ll see a 300% performance boost over a single Mtron drive but not much more. The random read numbers show a 550% boost, but I suspect that has to do with the buffer on the drives since we now have 3x the on-drive memory buffer.

It looks like Fusion IO has published more numbers about their SSD flash storage devices.

For starters, the price is totally reasonable: $2400 for an 80G device. This is $30/GB, which puts it at roughly 2x the price of the Mtron in $/GB. Not too bad.

The raw performance numbers seem amazing:

[Figure: Fusion IO’s published performance numbers]

However, these can be highly misleading. They don’t cite random write numbers and only quote 8k requests.

Here’s how you break apart these numbers.

Sequential here just means they’re doing ‘random’ 8k requests back to back. At 600MB/s this works out to 73K random IOPS. The request size can be anywhere from 512B up to 8k, so let’s assume we’re FULLY random and doing 512B writes.

This puts us at only 35MB/s.
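
The arithmetic, for anyone checking my math:

echo $((600000000 / 8192))        # 600MB/s of 8k requests: ~73K IOPS
echo $((73242 * 512 / 1000000))   # the same IOPS at 512B each: ~37MB/s
                                  # (call it 35MB/s in round numbers)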

Still pretty impressive. The Mtron can only do 7.5MB/s of random writes, which makes the Fusion IO 4.6x faster for only 2x the cost.

The REAL question is how you access the device. I assume they wrote their own kernel driver and it exposes itself as some type of MTD, memory, or block device.

I’d want that driver to be OSS though.

The papers from WSDM 2008 aren’t yet available on the Internet so I took the liberty to upload them.

There’s some good stuff here. I’m going to blog my notes for a few of these talks once I have the time to review the official papers.

I wonder if I should charge $5 a copy like the ACM. (joke)

Big DBA Head has run some independent MySQL benchmarks with the Mtron SSD drives that I’ve been playing with.

Great to see that we’re coming to the same conclusions. It’s nice to have your research validated.

Run time dropped from 1979 seconds on a single raptor drive to 379 Seconds on the standalone Mtron drive. An improvement of over 5X. Based on the generic disk benchmarks I would have thought I would have seen slightly lower runtimes.

I think he might be missing one advantage though. If you’re doing more than 50% writes you’d probably see lower throughput, but you can have a MUCH larger database, and when you NEED to do random reads you can get at your data quickly.

An HDD prevents this since you’re saturating the disk IOPS with writes, which keeps you from doing any reads.

This allows you to take the money you’re paying for memory and get 10x more for a slight performance hit.

I suspect these problems will be resolved in the next six months.

There are a few possibilities to solving this issue:

* Someone will write a Linux block driver that implements a log structured filesystem. You could then export the SSD as an LSFS and re-mount it as XFS. The random write performance would then soar and you’d get XFS features like constant time snapshots.

* Log structured databases like PBXT will be tuned for SSD which will increase the performance.

* Someone could patch InnoDB to handle 4k pages. I tried to compile InnoDB with 8k pages and it just dumped core. I think the performance of InnoDB will really shine on SSD once this has been fixed. One other potential problem is that during my tests InnoDB became CPU bound when working with the buffer pool. I’m not sure what the problem here is though.

* SSD vendors will implement native LSFS on the drives themselves. This will also help with wear leveling and help negate the problems with the flash translation layers in the drives. I suspect STEC is already doing this.

* No Flash Translation Layer. Instead of a block device they could be exported as MTD devices. This could boost performance with filesystems that were MTD aware.

* Raw random write IOPS performance upgrades on the drives themselves. Instead of only 180 random write IOPS we could see drives doing 1k random write IOPS in Q2.

Hypertable is out.

Hypertable is out and in the wild:

Hypertable is an open source project based on published best practices and our own experience in solving large-scale data-intensive tasks. Our goal is to bring the benefits of new levels of both performance and scale to many data-driven businesses who are currently limited by previous-generation platforms. Our goal is nothing less than that Hypertable become the world’s most massively parallel high performance database platform.

A couple of thoughts.

This already looks better than HBase.

It can run on top of KFS and HDFS. I’m not entirely happy with either of them (although KFS seems to be superior).

Once I finish up our sharded/federated database we’re running on top of MySQL I’m really tempted to look at implementing a single kernel image GFS+Bigtable clone from the ground up.

I think there’s a major advantage to implementing something in C and starting from a fresh code base.

However, I’m going to audit KFS and Hypertable as these seem like really solid starting points.

For all I know they could already solve my problem with just a few changes (or none at all).

MySQL is far from ideal for large databases and I think it’s going to really start showing its age in the next few years.

I spent some more time today comparing InnoDB and MyISAM on SSD.

I increased the data/cache ratio by about 5X. I allocated 1G of memory for MyISAM or InnoDB (restarting MySQL after each test). The resulting on-disk images are 6G for MyISAM and 7G for InnoDB.

This is on a 30M row DB which stores ints and char data.

I’m primarily testing the theory that cheap SSDs can deliver near in-memory performance since they can do lots of random reads.

MyISAM would clearly outpace InnoDB if I performed the initial ‘prepare’ (bulk insert) stage in multiple processes. MyISAM became CPU bottlenecked, which ended up slowing the write rate.

InnoDB on the other hand had the SSD at 100% utilization. I’m not sure why. It could either be an issue with the 16k page size or the write ahead log.

A 3x performance boost is more than acceptable here especially when you consider you can create a 96G SSD RAID array for the same price as 8G of RAM.

I might end up recompiling MySQL with an 8k and 4k page size for InnoDB just to see if it makes any difference.

Further, I might spend some time trying to figure out why InnoDB is so slow at performing the initial INSERTs.


Figure 1: MySQL performance time for sysbench for inserting and performing OLTP queries on the data. Times are in minutes (lower is better)

I’ve now had about 24 hours to play with the Mtron SSDs and had some time to benchmark them.

The good news is that the benchmarks look really solid and the drive is very competitive in terms of performance. I’m seeing about 100MB/s sequential read throughput and 80MB/s sequential write throughput.

The bad news is that they can only do about 180 random writes per second. Here are the raw performance numbers from Mtron’s data sheet:

[Figure: Mtron data sheet performance numbers]

I spent a lot of time reviewing this drive and didn’t notice this before.

The Battleship Mtron review went over this as well but didn’t spend much time on it:

Although they do perform astounding in random read operation, random write is still very sub-par on flash technology. Even though we are not benchmarking random write IOP’s I will give you some quick insight. Write performance is not yet a perfect and refined process using NAND flash and you will not have a drive that is going to write file operations as well as a high end U320 15K SCSI or SATA 10K setup. There is a company that I have been talking with directly about this NAND flash write issue called EasyCo in PA, USA. They are working on a process called MFT technology and they offer a simple MFT driver that is claiming to increase random write IOP’s on a single drive up to 15,000 IOP’s. Doug Dumitru had explained to me this technology will take your standard Mtron 16GB Professional drive and turn it into an enterprise behemoth.

I spent some time to see what EasyCo was up to and came across their Managed Flash Technology:

Managed Flash Technology (MFT) is a patent pending invention that accelerates the random write performance of both Flash Disks and Hard Disks by as much as a thousand fold.

It does this by converting random writes into chained linear writes. These writes are then done at the composite linear write speed of all the drives present in the file volume, subject only to the bandwidth limits of the disk control mechanism. In practice, even with as few as three drives present, this can result in the writing of as many as 75,000 4-kilobyte blocks a second.

As a result, MFT can dramatically improve the real-time performance of asymmetric storage devices such as Flash disks by making reads and writes symmetrical. Here, flash disk performance is typically improved 10 to 30 times, making some of these 60 times as fast as the fastest hard disk. Finally, it is possible to make clusters of as few as 20 flash drives run collectively as fast as RAM does but with a much larger storage space than RAM can practically have.

The question is what are they doing to get such substantial performance?

Here’s what I think is happening.

From what I’ve read, they take a normal Mtron drive and install a custom Linux kernel module which they use to interface with the drive. They then use a normal write ahead log, keeping data in memory (probably something like a 500M buffer) along with a binary tree of the block offsets. When the buffer fills, they sort the buffered data by offset and apply it to disk sequentially.

If the box crashes they have an on-disk log to replay, probably when the drive is first mounted.

Basically a flash aware write ahead log.
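
A toy version of that write path, just to make the mechanism concrete (the log location, its format, and the target device are all my guesses):

LOG=/var/log/mft-wal.log    # "offset datafile" pairs, appended in arrival order

apply_log() {
    # Sort the buffered writes by target offset, then apply them to the
    # flash device in one mostly sequential sweep.
    sort -n $LOG | while read offset datafile; do
        dd if=$datafile of=/dev/sdb bs=512 seek=$((offset / 512)) \
           conv=notrunc 2>/dev/null
    done
    > $LOG                  # truncate once the batch has been applied
}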

Fortunately, InnoDB has a write ahead log internally so this should save us from needing to run a custom kernel module. Any database with a write ahead log should be more than competitive.

I wrote a simple benchmarking utility (see Figure 1 below) to simulate an InnoDB box performing thousands of random reads and one sequential write.

The benchmark consists of 3500 dd processes running in the background, reading from the SSD and writing to /dev/null. I then have one sequential write running in the foreground, writing out about 5G of data to the SSD.
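
Reconstructed roughly, it looks like this (device and file paths are stand-ins; RANDOM just scatters the read offsets):

# 3500 background readers doing scattered 1MB reads off the SSD...
for i in $(seq 1 3500); do
    dd if=/dev/sdb of=/dev/null bs=16k skip=$((RANDOM * 32)) count=64 2>/dev/null &
done

# ...plus one ~5G sequential write in the foreground.
dd if=/dev/zero of=/mnt/ssd/bigfile bs=1M count=5000
wait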

The HDD holds up really well compared to the SSD, which should have an unfair advantage. So much so that I think the Linux scheduler is interfering with my benchmark. I think what’s happening is that the first few dd’s start reading in parallel and block the remaining processes. This continues with 5-10 concurrent readers until the entire chain of 3500 completes.

I’m going to rewrite the benchmark to create one large 10G file and randomly read 10G from misc locations.

As you can see, the SSD is fast, but it’s only about 2.5x faster than the HDD. I’d expect it to be about 20-40x faster.


Figure 1. Performance of SSD vs HDD (measured in seconds)

I’ve been reviewing our InnoDB settings prior to testing our new SSD drives later this week.

Here are some initial thoughts (a config sketch follows the list):

* sync_binlog and innodb_flush_log_at_trx_commit should both be enabled. The extra seeks required aren’t really an issue on SSD and the extra reliability is worth the slight performance hit.

* Read ahead for InnoDB and the underlying block device should probably be disabled. There’s no sense reading another 512 blocks on SSD. You can get the IO quickly enough, so why slow yourself down potentially reading content you don’t need? InnoDB uses a heuristic algorithm for read ahead, but the best it can do is match the performance of the SSD itself. At the very minimum, disabling disk based read ahead is probably a good idea.

* If your database is primarily small fixed size rows it might make sense to recompile using a smaller page size. SSD performance seems to be a function of write size: if you constantly need to write 16k where 90% is re-written content, you’re going to see an effective 4x slowdown. Jeremy Cole mentioned that changing this would bloat the DB, so I’ll have to experiment. I’m also going to have to figure out whether O_DIRECT can be used with page sizes under 16k. I don’t think it can.

* The new thread concurrency stuff in MySQL 5.1 is probably going to be very important. There’s no reason multiple concurrent threads shouldn’t be able to mutate and access the DB in parallel since we’re no longer really bound by disk seeks. Letting the DB go full out seems like a big win. This is going to require MySQL 5.1 though, which should be available any year now (*cough*).
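
Here’s the config sketch mentioned above. Only the first item translates directly into my.cnf settings; the device name for the read ahead tweak is an assumption:

cat >> /etc/mysql/my.cnf <<'EOF'
[mysqld]
sync_binlog                    = 1
innodb_flush_log_at_trx_commit = 1
EOF

# Disable block layer read ahead on the SSD.
blockdev --setra 0 /dev/sdb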

Given all this, I think performance will still be outstanding for InnoDB on SSD, though probably with a good deal of variability.

It looks like there’s another competitive SSD on the market. The Stec Zeus IOPS.

I foolishly dismissed this drive before because I thought they weren’t disclosing their write rate (which is what all the other vendors do to lie about their performance).

Turns out they’re claiming 200MB/s read with 100MB/s write throughput. If these numbers are accurate then this would be 2x faster than the Mtron SSDs.

Storage Mojo has additional commentary. They compare these drives to the RamSan, which is not a fair comparison since that’s a DRAM based SAN device. The RamSan-500 should trounce everything on the market but the pricing is astronomical.

The key win for SSDs is that they’re cheap and will soon be commodity. By mid-2008 I imagine 20% of the laptop market will be using SSDs and vendors like Toshiba, Samsung, Stec, and Mtron will be feverishly attacking each other in the enterprise market.

The key with the Zeus will be price per GB. The Mtrons are about $15 per GB, which is the price point I’m looking at for real world horizontal/diagonal scaled applications.

Prediction.

In mid-2008 (or early 2009), one of the major hardware vendors (IBM, Dell, Sun, etc.) will ship a blade server with all SSD.

The key will be that the blade itself will be a LOT smaller (by about 20-30%) since power consumption and heat dissipation are reduced. They’re also going to take advantage of the small form factor of SSD to fit more disks per box, probably in the 4-8 disk range.

The system itself will have less memory than conventional systems since SSD is so fast.

This is going to further dampen demand for the 1U form factor.

These boxes might also be half depth (a la Rackable) to allow back to back mounting.