Disk is the New RAM – In a Nutshell

I just finished watching the Disk is the New RAM video which a number of bloggers have been talking about.

If you’re lazy like me, you can just read this blog post to get the gist of this approach to computing.

FYI, I transcoded it in my scalecast podcast if you want to watch it on your iPod.

There are some interesting concepts here. I’m a bit frustrated that I didn’t blog about this earlier, as I proposed a similar architecture for use in Spinn3r about 1.5 years ago. However, I identified a number of implementation and theoretical issues that kept us from spending any significant amount of time in this area.

For certain data structures this type of technique will work very well.

This is how I discovered the idea in the first place. We were evaluating ways to improve the performance of our crawl queue in Spinn3r. It’s actually a difficult problem – one that I won’t go into in much detail here, for brevity.

In fact, it works out so well that I’m surprised that they didn’t bring up queueing as an example. If you need a queue it actually might be beneficial to use this type of data structure regardless of whether you’re using a RAM based approach in other areas.

To review, a queue basically needs to perform two operations:

– enqueue items
– fetch items at the head of the queue.

Both are essentially sequential operations. Fetching items from the queue is a sequential read. Writing items to the end of the queue is a sequential write.

My proposal involved reading large chunks from the head of the queue and buffering the results in RAM to avoid disk seeks. Writes to the queue would simply go to disk. If you organized the file into 100MB extents then you could just clean up the queue after you’ve processed the items by deleting the extent files once all the work in them has been completed.

This wouldn’t work for priority queues since the internal results would have to be sorted. You can get around this problem by separating out each priority level into a separate queue instance and then prioritizing processing higher level queue items. You’d move on to the next queue only after the first queue is exhausted.
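A sketch of that workaround, with plain in-memory deques standing in for what would each be a separate on-disk queue instance (the class and method names are hypothetical):

```python
from collections import deque

class TieredPriorityQueue:
    """One FIFO per priority level instead of a single sorted structure.

    Each level could be backed by its own on-disk queue; plain deques
    stand in here. Priority 0 is the highest.
    """

    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, item, priority):
        self.queues[priority].append(item)

    def dequeue(self):
        # Drain higher-priority queues before touching lower ones.
        for q in self.queues:
            if q:
                return q.popleft()
        return None
```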

Other types of algorithms (hashtables are a good example) involve lots of random seeks. This can be mitigated by using a ‘read ahead log’ to speed up disk seeks.

This is analogous to the ‘write ahead log’ used in some databases to speed up random IO. In a write ahead log all writes are buffered, sorted, and then applied to disk in the hope of performing only sequential writes.

The worst case for this algorithm is O(N) seeks (fully random reads), but I suspect that the average case is much better.

There are just a couple of problems here:

– How long do you buffer reads? If you buffer them long enough you can approach an optimal read pattern, but the throughput of your cluster will be a bit slower.

– Do you buffer key to offset mapping data in memory? For large data/key ratios this probably makes sense. However, if you’re only storing one bit per key it might make sense to pre-allocate your hashtable and store the keys lexicographically on disk. This way you know the on-disk order simply from the key and can sort your IO accordingly.

– Page size becomes a significant issue. If the underlying filesystem is created with a page size of 4096 bytes you should optimize your algorithms to perform contiguous reads on these sections. If you don’t buffer long enough your reads will be contiguous, but a lot of your IO will be wasted since you throw out 90% of the page as data you don’t actually need.

I’ve given this issue a lot of thought and there are more areas that could be improved by the design of a smart on-disk, read-ahead-log-based hashtable. It probably makes sense to just implement your own until a more general optimized approach is found.

Further, data structures built on read ahead logs will probably need a number of solid implementations before this technique can really be used by general practitioners. Custom code can be written by experts, but for regular database users a MySQL-like tool set will be needed.

  1. This is not exactly new if you are loading up data in RAM block by block. The Unix command “sort” essentially sorts using only sequential writes and reads. That is why it is so damn fast. (The underlying algorithm was described by Knuth, so it is quite standard.)

    But that’s not what this new paradigm is about, is it? You are supposed to forget about the RAM altogether and work with several disks with all of your data on disk.

    I do not buy it. There might be fringe cases where you can do away with your RAM, but…

    Even row-based databases require RAM. Oh! You can load up the columns in streaming mode using almost no RAM (except to buffer the stream), but how do you construct the database? You will find that for each attribute you need some RAM and if you have many attributes, you need a lot of RAM.

    Maybe I am missing the point, mind you.

  2. Hey Daniel.

    The author didn’t argue that you should use *no* RAM, just that this was a space/time tradeoff that you can make.

    In fact, his hashtable implementation uses some RAM to keep the routing table based on key prefix (if I understood correctly).

    That said, I’m not sure I buy it either as a general framework. For some applications it’s fine, for others, maybe not.


  3. DAR

    Seems like this oversimplifies things, and so misses a critical piece.

    “My proposal involved reading large chunks from the head of the queue and buffering the results in RAM to avoid disk seeks. Writes to the queue would simply go to disk. If you organized the file into 100MB extents then you could just clean up the queue after you’ve processed the items by deleting the extent files once all the work in them has been completed.”

    Not necessarily true. This completely ignores the possibility of failure in the midst of processing this 100MB extent file. Making this scenario able to deal with failures requires one of the following:

    * a pointer into the extent would need to be constantly updated, indicating the location of last item processed from the queue. (Otherwise on restart you’d wind up duplicating the processing of a potentially large number of items from the beginning of the queue until the failure location.) Repeatedly updating a pointer like this would cause large numbers of disk seeks.

    * the processing of every queue item must be idempotent, so that it can be re-processed when failure occurs. This could be a problematic restriction to place on an application.

    * on restart the application would need to have some way to look up each queue item to see whether it had already been processed, and if so, ignore it

    So not as simple as it sounds, IMO.

  4. Hey DAR.

    Agreed that I overlooked implementation details.

    However, while a queue entry is being processed you have to keep track of its state and verify that it completed.

    As long as an extent is immutable you can then clean extents up once the processing window has passed through them.
