Migrating Tailrank Image Hosting to S3

I’m strongly considering migrating Tailrank image and thumbnail hosting to S3.

Right now we’re using a single server to host our images, but this doesn’t really scale. I could of course build my own distributed filesystem or use something like MogileFS or GFS, but I just don’t see the point in administering it myself. The more I can offload to third parties the better.

What’s cool is that I could use S3 to avoid any sort of coordination between robots on where and how they sync their images. I’d just do a PUT into S3 and be done.
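As a rough sketch of why no coordination is needed: if every robot derives the object key from the image's source URL, they all write to the same place without talking to each other. The names `thumbnail_key` and `store_thumbnail`, the `thumb/` prefix, and the `.jpg` suffix are made up for illustration, and the actual S3 PUT is left as a callable you would supply.

```python
import hashlib


def thumbnail_key(source_url, size="thumb"):
    """Derive a stable object key from the image's source URL (hypothetical scheme)."""
    digest = hashlib.sha1(source_url.encode("utf-8")).hexdigest()
    return f"{size}/{digest}.jpg"


def store_thumbnail(source_url, data, put):
    """Upload one thumbnail; `put(key, data)` stands in for the real S3 PUT."""
    key = thumbnail_key(source_url)
    put(key, data)
    return key


# A plain dict plays the role of the bucket here (assumption, not real S3).
fake_bucket = {}
key_a = store_thumbnail("http://example.com/a.png", b"image-bytes", fake_bucket.__setitem__)
key_b = store_thumbnail("http://example.com/a.png", b"image-bytes", fake_bucket.__setitem__)
```

Two robots that fetch the same image end up overwriting the same key, so the worst case is a redundant upload rather than a conflict.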

Two small things that concern me:

1. I don’t like clients having to do unnecessary DNS requests.

2. If S3 goes down for maintenance or (more likely) due to a network split, I still want Tailrank to be functional.

I think the solution to this could be to use Squid as a reverse proxy in front of S3. I’d have to pay for double the bandwidth in theory but my bandwidth right now is pretty cheap. My S3 bill would of course be a lot smaller because I think about 95% of the images would be served from my end.
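To make the trade-off concrete, here is a back-of-the-envelope sketch. The function name and all the numbers in the example call are placeholders, not real prices or real Tailrank traffic; only the 95% hit rate comes from my guess above.

```python
def monthly_transfer_cost(total_gb, hit_rate, s3_price_per_gb, own_price_per_gb):
    """Compare serving straight from S3 with serving through a caching proxy.

    Returns (direct_cost, proxied_cost). With the proxy, every byte leaves
    my network once, and only cache misses are also billed by S3.
    """
    miss_gb = total_gb * (1.0 - hit_rate)
    direct_cost = total_gb * s3_price_per_gb
    proxied_cost = miss_gb * s3_price_per_gb + total_gb * own_price_per_gb
    return direct_cost, proxied_cost


# Placeholder inputs: 1000 GB/month, 95% cache hits, made-up per-GB prices.
direct, proxied = monthly_transfer_cost(1000, 0.95, 0.20, 0.05)
```

The proxy only wins if my own bandwidth is cheaper per GB than S3's by more than the miss traffic costs, which is why the hit rate matters so much.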

It’s an interesting idea but I’m not sure how much of a win it would be. I’m pretty sure Amazon would be almost always online but it could be painful if they were to go down.

  1. Keeping your most common and/or most recent images in, say, a memcached daemon would let you keep serving everything but old images during a temporary S3 outage. It could also allow faster serving of hot images, while removing the need to keep a local copy (or copies, with multiple web servers) of a large history of images on hand. Consider this a rolling-window approach to local versus archived content.

    Whether this makes sense for you depends on another question: are you looking to reduce bandwidth costs, or to eliminate the need to keep a huge cache of images stored in a highly available format?

    Based on your comments about clustered filesystems, I would assume that your focus is less on transfer and more on disk space. Therefore this would be viable.

    Something else to take into consideration is the turnover rate of your images. If you add images at a very fast rate, memcached and its automagic expiring of data become less useful, and a local disk store plus a scheduled cleanup script would serve you better. The advantage of memcached is the ability to define a fixed amount of space in RAM for the data you’re housing (say 512 MB or 1 GB); once that fills up, the least recently used data is evicted from the server without any specific interaction on your part. In my experience memcached is also fast enough to serve images in very rapid succession.

    YMMV, just another POV
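For what it's worth, the fixed-size behaviour described above can be sketched in a few lines. This is not memcached itself, just a toy in-process stand-in with a made-up class name (`BoundedImageCache`) that evicts the least recently used entries once a byte budget is exceeded.

```python
from collections import OrderedDict


class BoundedImageCache:
    """Toy byte-bounded cache with least-recently-used eviction (illustrative only)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # oldest entry first

    def get(self, key):
        data = self._items.get(key)
        if data is not None:
            self._items.move_to_end(key)  # mark as recently used
        return data

    def put(self, key, data):
        if len(data) > self.max_bytes:
            return  # larger than the whole budget, so never cached
        old = self._items.pop(key, None)
        if old is not None:
            self.used_bytes -= len(old)
        self._items[key] = data
        self.used_bytes += len(data)
        # Drop the least recently used entries until we fit again.
        while self.used_bytes > self.max_bytes:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)


cache = BoundedImageCache(max_bytes=10)
cache.put("a", b"aaaa")
cache.put("b", b"bbbb")
cache.get("a")           # touch "a" so that "b" becomes the oldest entry
cache.put("c", b"cccc")  # 12 bytes would exceed the budget, so "b" is dropped
```

With a high turnover of new images, hot entries get pushed out like "b" here, which is the case where a disk store would hold up better.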

  2. The #1 thing I’m trying to solve is disk seeking. With 500k to 4M thumbnails I’m really going to start to burn disk performance. Most filesystems just can’t handle that many small files.

    The second issue is store coherence. How do I make sure all my files are replicated and so forth?

    S3 looks like it would easily solve all these problems…

    Memcached wouldn’t do a good job here… it isn’t really a good file storage and serving engine. Squid with disk + memory would be perfect. :)


  3. I’d need to look into this again, but one thing I seem to recall doing in the past is creating a special proxy that can send back local data but, in certain cases, send back just a Location: header pointing to data located elsewhere. As far as I recall, clients supported this and then fetched the images from the specified URL, which means you never have the 2x bandwidth issue.
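
A minimal sketch of that decision logic, assuming a dict-like local store and a hypothetical bucket URL (`S3_BASE_URL` and `respond` are invented names, not part of any real proxy):

```python
from http import HTTPStatus

# Hypothetical bucket address, for illustration only.
S3_BASE_URL = "http://example-bucket.s3.amazonaws.com"


def respond(path, local_store):
    """Return (status, headers, body) for an image request.

    Serve the bytes if we hold them locally; otherwise answer with a
    redirect so the client fetches the image from S3 directly.
    """
    data = local_store.get(path)
    if data is not None:
        headers = {"Content-Type": "image/jpeg", "Content-Length": str(len(data))}
        return HTTPStatus.OK, headers, data
    return HTTPStatus.FOUND, {"Location": S3_BASE_URL + path}, b""


local = {"/thumb/abc.jpg": b"jpeg-bytes"}
hit = respond("/thumb/abc.jpg", local)
miss = respond("/thumb/zzz.jpg", local)
```

The catch is that a redirected client makes a second request and has to resolve the S3 hostname, which is exactly the extra DNS lookup the post was hoping to avoid, and it does nothing to help if S3 itself is unreachable.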
