Transparent Batch and Stream Operations in Distributed Systems

One of the things I don’t see much discussion on in distributed system research is the advantage that using batching and streaming can have on scalability.

Batching

Lets cover batching first. Say you have 1000 objects you need to fetch from the database. Now lets say this time is instantaneous on the database side (which is seldom the case). If you fetch all 1000 items one at a time this will end up killing your performance. Each operation will take about 1-2 ms which isn’t very long for an individual fetch but it all adds up. If this was a page load on behalf of an HTTP client it would load in the 1-2 second range with is pathetic.

If you could somehow batch these up into one operation you’d see a 1000x performance boost. Not bad. This isn’t a theoretical situation btw. Memcached has a getMulti method for just this reason. Unfortunately, there is no putMulti which means if you have to store a lot of objects you’re going to take a performance hit (though I think you could write an implementation that did a large socket write and then parse the results.)

If you know what’s happening under the covers it all makes sense. Each socket write would require one ethernet frame and at least one system call – both of which are somewhat expensive.

Databases have the same performance gains. If you can write multiple records at once you’re going to see a huge boost in efficiency. MySQL is a good example here. It can avoid multiple binary log writes and optimize the write by scanning the table once and updating the binary tree in batch. I wrote a post about using ON DUPLICATE KEY UPDATE a while back which you might find interesting.

Again. Not theoretical. I recently implemented this for Tailrank and you can see our performance gain here (smaller is better).

200608231411

I did a bunch of optimization on Rojo’s query system by building a parallel prejoin mechanism. This batched up our SQL calls and resulted in a 8-15x performance gain. Same theory here of course.

Streaming

Streaming your results instead of returning objects can also have a big scalability gain.

If you’re using a single threaded event-IO system and using large ethernet frames you really only need a 16k buffer which can be flushed on a per-client basis. This is amazingly scalable. You can approach a solution to the C10K problem (supporting 10k concurrent connections).

This is one significant advantage of memcached using libevent.

Systems like XMLRPC which have to allocate memory during a request can quickly blow up. If you’re using an evil system like Java (my fav language btw) which controls memory allocation you have to be careful not to run out of memory while processing a request.

Lets say you have 2G of memory and need to handle 1000 clients. This means each client can only allocate 2M of memory. All of a sudden you get Digged or Slashdotted or Tailranked (like how I snuck that one in?) and your webapp explodes because you run out of memory.

A streaming implementation wouldn’t even choke. You might start burning some additional CPU but at least it would scale linearly.

This is one HUGE advantage of REST over SOAP/XMLRPC. It’s a streaming protocol. With SOAP you’d have to allocate all your memory on the client prior to returning the result. With REST you can just call write() on your resulting data.

I’m working on a distributed filesystem based on async IO that does exactly this. It builds a distributed cluster of storage nodes and calls read() on the filesystem and then writes a 16k buffer across the network without any memory allocation. It’s using epoll on Linux and kqueue on BSD so it should scale fairly well and support as many connections as the CPU can support. It also supports batching so we can support efficient disk fetching.

Transparency

The biggest problem with batching and streaming APIs are that they’re not very pretty. If you’re used to dealing with APIs and nice object systems you’re not going to be a happy camper.

I don’t think there are any simple solutions here. Distributed systems aren’t simple to implement. If you want to scale you’re going to have to get your hands dirty and start thinking about the kernel, filesystem, IO, etc.

One thing I’ve learned is that if you’re working with a team that isn’t comfortable with working with these systems they need to be aware of the decisions as they’re made. These aren’t theoretical performance boosts and they’re worth the increase in performance.


  1. Good post Kevin,

    You’re quite right and saving queries is important, but I would be careful on counting performance by number of queries… going from 20 queries on a page to 10 does not mean you’ve doubled efficiency, if queries do same work as previously 20.

    Also sometimes it is actually more effifient to increase number of queries, if MySQL does not handle big query well. The question in this case would be what is better to have – two queries, or one whch does 100 times more work ?

    Also you need to make sure you do not make things to be too ugly. Ie I’ve seen people using UNION to combine result sets of completely different queries (doing appropriate padding so all results can be accomodated).

    Also you might check multiple statement execution API in MySQL 4.1+ – kind of similat ot multi-get in memcache.

  2. James Day

    You’re right about the benefit of batching but it woud be good to say how big the batches were. As Peter hinted, batches that are too big can hold locks for so long that they cause troubling concurrency problems, so there is a best size batch for each situation. That might vary as the application becomes more popular as well, since greater popularity will probably lead to more concurrency pain.

    Besides, I’m curious about how big your batches needed to be to get the improvement. :)






%d bloggers like this: