Robot Yield

This morning I was thinking about robot blocks, prompted by Rich’s post about Cuill being blocked on 10k hosts.

So let’s say you write a web-scale crawler and accidentally push a bug. It’s a big mistake: you hurt a few hosts and end up being blocked.

A month passes, and you’ve implemented a fix along with a number of other features in your cluster that make crawling easier on hosts.

… basically, you want another chance to crawl these sites. The problem is that you now have to wait an eternity for them to remove the robots.txt block against you.

Now what?

Do you ignore the block? That’s probably not right.

Do you create a new User-Agent so that you can slide through the robot block? Possibly. That might work. However, what if you’re blocked because people don’t like you, and it’s not just a politeness issue?

I assume that if it’s simply a non-crawlable directory, they’re just going to use User-Agent: * rather than singling out your bot.
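As an aside, here’s a minimal sketch using Python’s standard urllib.robotparser (OldBot, NewBot, and example.com are made-up names for illustration) showing why a rule aimed at one User-Agent never touches a renamed bot, while a User-agent: * rule catches everyone:

```python
# Sketch: contrast a robots.txt block aimed at a specific crawler with a
# blanket User-agent: * rule. OldBot/NewBot are hypothetical UA tokens.
from urllib.robotparser import RobotFileParser

# Block aimed at one named crawler only.
targeted = [
    "User-agent: OldBot",
    "Disallow: /",
]

# Blanket rule: a directory no crawler should touch.
blanket = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(targeted)
print(rp.can_fetch("OldBot", "http://example.com/page.html"))  # False -- the named bot is blocked
print(rp.can_fetch("NewBot", "http://example.com/page.html"))  # True  -- a renamed bot slides through

rp = RobotFileParser()
rp.parse(blanket)
print(rp.can_fetch("NewBot", "http://example.com/private/x"))  # False -- * catches every User-Agent
```

Mechanically, that’s all a User-Agent swap buys you, which is exactly why it feels like cheating when the block was aimed at you on purpose.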

One could extend robots.txt with additional syntax to handle situations like this, but honestly, how many webmasters are going to use that extension?
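To make that concrete, here is the kind of thing such an extension might look like. The Block-expires directive below is entirely made up for this sketch and is not part of any robots.txt standard:

```
User-agent: OldBot
Disallow: /
# Hypothetical directive, not in the standard: lift the block automatically
# after this date, giving a reformed crawler a second chance.
Block-expires: 2009-06-01
```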

They could always just remove the disallow rules…


  1. I always figure that if a bot is bad enough to deserve blocking in my robots.txt file, I should simply firewall off its IP addresses (see the firewall sketch after these comments). That way it can’t change its User-Agent to get around the block. And a bot whose web site says nothing about what it does with the info it’s spidering gets two strikes right off the bat.

  2. Yeah…. I agree that IP blocks can be valuable.

    Further, a bot should be VERY clear about what the data is used for.

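Along the lines of the first comment, here is a minimal sketch of an IP-level block, assuming a Linux box with iptables and using a documentation address range (192.0.2.0/24) to stand in for the bot’s published IPs:

```
# Drop all traffic from the crawler's address range at the firewall,
# so changing the User-Agent string won't help it get back in.
iptables -I INPUT -s 192.0.2.0/24 -j DROP
```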





