Slides from Spinn3r Architecture Talk at 2008 MySQL Users Conference

Here’s a copy of the slides from the talk I just gave about the architecture of Spinn3r at the 2008 MySQL Users Conference:

We present the backend architecture behind Spinn3r – our scalable web and blog crawler.

Most existing work in scaling MySQL has been around high read throughput environments similar to web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.

We have achieved this through our ground up development of a fault tolerant distributed database and compute infrastructure all built on top of cheap commodity hardware.

Spinn3R Architecture Talk - 2008 Mysql Users Conference

  1. Steve Yen

    Nice bumping into you last night at the hypertable talk.

    Thanks for posting these slides. Sorry I missed your presentation.

    Some questions..

    With your hashing design, does that mean you’re avoiding a master server that knows where data lives? That is, do clients just talk directly to where they think data lives?

    On “Query Limitations”, it seems the ID needs to part of every query, like using mysql like a big distributed hash table. So, no range queries, right? If I got that right, then what do you mean by “Range Routing” (slide 5)?

    Are you using consistent hashing? If not, and you’re instead something like server-list modulus, for example, what happens when a node goes down?

    minor typo on ssd stuff: “historically avoided due to [low] MTBF”?

    Again, thanks for posting these!

  2. Hey Steve.

    Good talking to you last night.

    Yes. Our hashing design avoids constantly talking to a master server. We use a rendezvous design where the range routes are given out at periodic intervals.

    This is a pseudo master server but not in the sense that every INSERT needs to be coordinated.

    The clients themselves just perform insertions into shards directly after this…

    You can do ORDER BY queries but results need to be post processed on the client.

    IDs should be used when performing client side joins or fetching unique rows.

    We avoid using consistent hashing by using the range routing. Consistent hashing in a cache system works when a whole node goes down but in a sharded system we can’t operate if we lose an entire shard. We therefore keep N copies (right now 2 but I’d like to move to 3).



%d bloggers like this: