Tailrank is now Running Squid and Apache

2006-09-16 12:21

Last night I finally had the time to migrate Tailrank to a new caching and load balancing infrastructure.

Thanks for all the help, guys.

I decided to stick with a software load balancer using Apache 2.2 and just ditch mod_cache and go with Squid 2.6.

Apache and mod_proxy_balancer seem to work fairly well so we’ll stick with what works. This also means I can still use mod_rewrite which is some crazy powerful voodoo.

Apache doesn’t yet support heartbeat monitoring of backend servers but I’m going to write a quick crontab entry to take care of that. A quick wget wrapper to check the health of my servers and then do an Apache graceful restart should work fine.
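Here’s roughly what that cron job could look like, written in Python instead of a wget wrapper. This is only a sketch under some assumptions: the backend URLs, the include file path, and the idea of regenerating a file of BalancerMember lines are all made up for illustration, and nothing here is the script Tailrank actually runs.

```python
import subprocess
import urllib.request

# Hypothetical backend URLs; the real hosts are not given in the post.
BACKENDS = ["http://10.0.0.11", "http://10.0.0.12"]
# Hypothetical file that the balancer's <Proxy> block would Include.
MEMBERS_FILE = "/etc/apache2/balancer-members.conf"


def is_healthy(url, timeout=5.0):
    """Return True if the backend answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return False


def healthy_backends(urls, probe=is_healthy):
    """Keep only the backends that pass the probe, preserving order."""
    return [u for u in urls if probe(u)]


def needs_reload(previous, current):
    """Only bother Apache when the set of live backends has changed."""
    return set(previous) != set(current)


def render_members(urls):
    """Render one BalancerMember line per live backend."""
    return "".join(f"BalancerMember {u}\n" for u in urls)


def run_once(previous, probe=is_healthy):
    """One cron tick: probe, rewrite the include file, reload if needed."""
    current = healthy_backends(BACKENDS, probe)
    if needs_reload(previous, current):
        with open(MEMBERS_FILE, "w") as fh:
            fh.write(render_members(current))
        # A graceful restart lets in-flight requests finish first.
        subprocess.run(["apachectl", "graceful"], check=True)
    return current
```

Cron would call run_once() with the list saved from the previous run. Restarting only when the list changes keeps Apache from being reloaded every minute for no reason.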

So far Squid has been pretty decent. Much better than Apache and mod_cache since it can actually use memory for cache and not just disk. I’m buffering about 200M in memory and then an additional 2G on disk.

Apache 2.2 has mod_mem_cache but it’s fairly buggy. It ends up corrupting documents and then breaking Tailrank. A webserver serving corrupt HTML is a bit like a database losing records. Not cool.

Tailrank generally sees about 95% of its page views on fewer than 2k documents, so it’s very easy to cache this content rather than building it in PHP/MySQL for every page view. Serving content from memory is about 10000x faster than generating the page dynamically.
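A quick back-of-the-envelope calculation shows why the hit ratio matters more than the raw speed of a cache hit. The 10000x figure is the rough number from above; the hit ratios passed in are just illustrative inputs, not measurements.

```python
def mean_cost(hit_ratio, hit_cost, miss_cost):
    """Average cost per request for a given cache hit ratio."""
    return hit_ratio * hit_cost + (1.0 - hit_ratio) * miss_cost


def speedup(hit_ratio, hit_advantage=10_000):
    """Overall speedup versus no cache, with a miss costing 1 unit."""
    miss_cost = 1.0
    hit_cost = miss_cost / hit_advantage
    return miss_cost / mean_cost(hit_ratio, hit_cost, miss_cost)


for ratio in (0.50, 0.95, 0.99):
    print(f"hit ratio {ratio:.0%}: about {speedup(ratio):.1f}x faster overall")
```

Even with hits that are nearly free, the misses dominate the average: a 50% hit ratio gives close to 2x overall and 95% gives close to 20x, so pushing the hit ratio up is where the real gains are.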

Since Squid has been around longer and really only does caching it’s fairly advanced. For example, it supports a negative_ttl to cache objects that are returning 404s. It just re-checks them every minute so that I can fix the 404 yet still see a performance boost.
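To make the idea concrete, here’s a tiny sketch of negative caching in Python. It is not how Squid implements negative_ttl internally; the class name, the fetch callback, and the injectable clock are all assumptions made for the example.

```python
import time


class NegativeCache:
    """Remember 404s for a short time so the origin is not asked again."""

    def __init__(self, fetch, negative_ttl=60.0, clock=time.monotonic):
        self._fetch = fetch          # callable(url) -> (status, body)
        self._ttl = negative_ttl     # seconds to remember a 404
        self._clock = clock          # injectable so tests can control time
        self._expiry = {}            # url -> time the cached 404 expires

    def get(self, url):
        expires_at = self._expiry.get(url)
        if expires_at is not None and self._clock() < expires_at:
            return 404, None         # answered from the negative cache
        status, body = self._fetch(url)
        if status == 404:
            self._expiry[url] = self._clock() + self._ttl
        else:
            self._expiry.pop(url, None)  # the page exists again
        return status, body
```

With a 60 second TTL, a broken link that gets hammered costs the backend at most one request per minute, and a fixed page shows up again within a minute.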

One nice innovation is that it can use POSIX threads to write the disk index. If a client generates a cache miss, the content is fetched from an origin server and returned to the client immediately, and then the disk IO is handed off to a separate thread. It also supports a separate process model for this but we’re running NPTL so using a thread is the way to go.
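The general pattern (answer the client first, let a worker thread do the slow write) looks something like the following in Python. This is only an illustration of the idea, not Squid’s code; the class and callback names are invented for the example.

```python
import queue
import threading


class WriteBehindCache:
    """Return a fetched body right away; store it on a background thread."""

    def __init__(self, fetch, store):
        self._fetch = fetch    # callable(key) -> body, talks to the origin
        self._store = store    # callable(key, body), the slow disk write
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # Runs on the worker thread: pull queued writes until told to stop.
        while True:
            item = self._queue.get()
            try:
                if item is None:   # sentinel pushed by close()
                    return
                self._store(*item)
            finally:
                self._queue.task_done()

    def handle_miss(self, key):
        body = self._fetch(key)
        self._queue.put((key, body))  # hand the write to the worker
        return body                   # the client is not kept waiting on disk

    def close(self):
        # Flush the pending writes and stop the worker.
        self._queue.put(None)
        self._queue.join()
        self._worker.join()
```

A real cache would also need to handle a failing store() call, which this sketch lets kill the worker thread.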

Using Squid is initially a bit shocking since it uses a 60 page configuration file (I’m not exaggerating). The majority of this (95%) is just documentation which, if you can get past the sheer size, is a bit refreshing.

Once you get past the learning curve and get Squid up and running you’re probably using the best tool for the job…

  1. I was hoping you’d put some thoughts online about the caching infrastructure. Detailing the actual chosen setup is even better, so thanks.

    It’s still not clear though (at least to me) where Squid stands in your scheme. Is it in front of the load balancer, caching everything? Is there a squid for each web server, between them and the load balancer?

    Also, there is an issue with putting one load balancer in front of a lot of servers, as it could lead to having a single point of failure… have you dealt with this?

    Thanks again!

  2. Very much looking forward to hearing how it goes.

    Woof out!

  3. Hey. I had a long comment written but it looks like I forgot to post it.


    The setup is:

    Squid (cache) -> Apache 2.2 (balancer) -> Apache 2.0 (php, mpm_prefork)

    We don’t have any sort of redundancy in place but I’m thinking of using DNS for this and setting the TTL very low. This way I can just have five minutes of network outage if necessary.

    So far Squid is working very well. I have munin monitoring it and 50% of my traffic is resolved from cache. I need to see if I can improve this but right now it’s pretty decent… VERY fast.

  4. That sounds cool. I don’t have multiple servers yet, but Squid should definitely speed up a single server (probably even more visibly than multiple ones). Thanks for the info!
