Disk is the New RAM – In a Nutshell
I just finished watching the Disk is the New RAM video which a number of bloggers have been talking about.
If you’re lazy like me you can just read this blog post to get this theory of computing in a nutshell.
FYI, I transcoded it in my scalecast podcast if you want to watch it on your iPod.
There are some interesting concepts here. I’m a bit frustrated that I didn’t blog about this earlier as I proposed a similar architecture for use in Spinn3r about 1.5 years ago. However, I identified a number of both implementation and theoretical issues that prevented us from actually spending any significant amount of time in this area.
For certain data structures this type of technique will work very well.
This is how I discovered this idea in the first place. We were evaluating improving the performance of our crawl queue in Spinn3r. It’s a difficult problem actually – one that I won’t go into in too much detail here for purposes of brevity.
In fact, it works out so well that I’m surprised that they didn’t bring up queueing as an example. If you need a queue it actually might be beneficial to use this type of data structure regardless of whether you’re using a RAM based approach in other areas.
To review, a queue basically needs to perform two operations:
– enqueue items
– fetch items at the head of the queue.
Both are essentially sequential operations. Fetching items from the queue is a sequential read. Writing items to the end of the queue is a sequential write.
My proposal involved reading large chunks from the head of the queue and buffering the results in RAM to avoid disk seeks. Writes to the queue would simply go to disk. If you organized the file into 100MB extents then you could just clean up the queue after you’ve processed the items by deleting the extent files once all the work in them has been completed.
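Here’s a minimal sketch of that design in Python. The class name and the line-per-item format are my own, and I’ve used an item count per extent rather than 100MB byte extents to keep the sketch short:

```python
import os

class ExtentQueue:
    """Hypothetical sketch of an on-disk FIFO queue stored as numbered
    extent files. Writes append sequentially to the newest extent; reads
    buffer a whole extent into RAM at once, then delete the file after
    it has been consumed (the cleanup step described above)."""

    def __init__(self, directory, max_items_per_extent=1000):
        self.dir = directory
        self.max_items = max_items_per_extent
        os.makedirs(directory, exist_ok=True)
        self.write_extent = 0    # extent currently being appended to
        self.items_in_write = 0
        self.read_extent = 0     # next extent to load into RAM
        self.read_buffer = []    # items buffered in RAM

    def _path(self, n):
        return os.path.join(self.dir, "extent-%06d" % n)

    def enqueue(self, item):
        # Sequential write: append one line to the current extent file.
        with open(self._path(self.write_extent), "a") as f:
            f.write(item + "\n")
        self.items_in_write += 1
        if self.items_in_write >= self.max_items:
            self.write_extent += 1
            self.items_in_write = 0

    def dequeue(self):
        # Sequential read: buffer a whole extent, then delete the file.
        while not self.read_buffer:
            path = self._path(self.read_extent)
            if not os.path.exists(path):
                return None  # queue is empty
            with open(path) as f:
                self.read_buffer = f.read().splitlines()
            os.remove(path)
            self.read_extent += 1
            if self.read_extent > self.write_extent:
                # The reader consumed a partially full extent; start
                # the next writes in a fresh extent file.
                self.write_extent = self.read_extent
                self.items_in_write = 0
        return self.read_buffer.pop(0)
```

Both the enqueue and dequeue paths touch the disk only sequentially, which is the whole point of the design.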
This wouldn’t work for priority queues since the internal results would have to be sorted. You can get around this problem by separating out each priority level into a separate queue instance and then prioritizing processing higher level queue items. You’d move on to the next queue only after the first queue is exhausted.
Other types of algorithms (hashtables are a good example) involve lots of random seeks. This can be mitigated by using a ‘read ahead log’ to speed up disk seeks.
This is analogous to a ‘write ahead log’ used in some databases to speed up random IO. In a write ahead log all writes are buffered, sorted, and then applied to disk in the hope of performing only sequential writes.
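To make the read-side analogue concrete, here is a sketch of the buffering-and-sorting step. The function name and the (offset, length) batch interface are my own inventions:

```python
def batched_reads(f, requests):
    """Sketch of the 'read ahead log' idea: take a buffered batch of
    (offset, length) read requests, sort them by file offset, and issue
    them in ascending order so the disk head sweeps in one direction.
    Results come back in the original request order."""
    results = {}
    order = sorted(range(len(requests)), key=lambda i: requests[i][0])
    for i in order:
        offset, length = requests[i]
        f.seek(offset)
        results[i] = f.read(length)
    return [results[i] for i in range(len(requests))]
```

The larger the batch you accumulate before sorting, the closer the seek pattern gets to a single sequential pass, which is exactly the buffering trade-off discussed below.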
The worst case scenario of this algorithm is O(N) (fully random reads) but I suspect that the average case is much better.
There are just a few problems here:
– How long do you buffer reads? If you buffer them long enough you can approach an optimal read scenario, but the throughput of your cluster will be a bit slower.
– Do you buffer key-to-offset mapping data in memory? For large data/key ratios this probably makes sense. However, if you’re only storing one bit per key it might make sense to pre-allocate your hashtable and store the keys lexicographically on disk. That way you know the order simply from the key and can sort IO accordingly.
– Page size becomes a significant issue. If the underlying filesystem is created with a page size of 4096 bytes you should optimize your algorithms to perform contiguous reads on these sections. If you don’t buffer long enough your reads will still be contiguous, but a lot of your IO will be wasted since you throw out 90% of each page as data you don’t actually need.
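A sketch of what page-aligned reads look like under that assumption (the 4096-byte page size comes from the text; the helper name is mine):

```python
PAGE = 4096  # assumed filesystem page size, per the text

def page_aligned_read(f, offset, length, page=PAGE):
    """Sketch: round a read down and up to page boundaries so the
    filesystem fetches whole pages, then slice out the bytes that were
    actually requested. The wasted tail of each page is the IO overhead
    discussed above."""
    start = (offset // page) * page                        # round down
    end = ((offset + length + page - 1) // page) * page    # round up
    f.seek(start)
    data = f.read(end - start)
    return data[offset - start : offset - start + length]
```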
I’ve given this issue a lot of thought and there are more areas that could be improved by the design of a smart on disk read ahead log based hashtable. It probably makes sense to just implement your own until a more general optimized approach is found.
Further, data structures on read ahead logs will probably need a number of solid implementations before this technique can really be used by general practitioners. Custom code can be written by experts, but regular database users will need a MySQL-like tool set.