The robot exclusion protocol (REP) is starting to show its age.
At Spinn3r we frequently deal with the chaos revolving around robots.txt, so I thought I would throw a few thoughts out there about the complexity of the issues involved.
REP is not a EULA
This is part of the confusion around robots.txt. It’s not really clear that just because you can fetch a URL that the website will be happy with your use of that content.
It also isn’t clear that just because you fetched a URL that it means that you AGREED to a EULA limiting your rights.
In fact, I would argue that it does not limit your rights.
There is no forced click-through, and simply posting your terms somewhere on your website, where a robot wouldn’t be able to read them, doesn’t mean that the company that originated the request has agreed to your EULA.
Limited amount of content
Just because you’re allowed to fetch pages via robots.txt doesn’t mean that the website owner will be happy with you spidering their ENTIRE website and using EVERY single page.
Various social networks have routinely been upset and threatened lawsuits when bulk numbers of pages were downloaded from their sites, even when a permissive robots.txt was in place.
This seems like a reasonable restriction.
An extension that specifies a ‘limit’ on the number of URLs used for indexing purposes might make sense.
However, it’s a bit more complicated than that. What if you just use the URLs to compute rank for the top pages on that site and then discard the old ones?
The website would have no way to verify that you in fact discarded them.
Second, how long does the limit apply for? What if the limit is 1000 pages, and you index them, build your inverted index for full text search, then discard them entirely, and fetch another 1000. At any point in time you only have 1000 documents stored but you clearly have other secondary indexes built from these documents including link graphs, inverted indexes, etc.
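To make that objection concrete, here’s a minimal sketch (my own illustration, in Python, of a hypothetical ‘Limit: N’ directive that is not part of the real REP) of a crawler that honors the cap on stored documents while still cycling through the entire site:

```python
class PageBudget:
    """Sketch of crawler-side enforcement of a hypothetical 'Limit: N'
    directive (NOT part of the real REP): never retain more than
    `limit` raw documents from a single site."""

    def __init__(self, limit):
        self.limit = limit
        self.pages = {}  # url -> raw document, in insertion order

    def add(self, url, doc):
        # Honor the cap by evicting the oldest stored page... but note
        # the objection above: any link graph or inverted index already
        # built from the evicted page survives, and the website has no
        # way to verify the raw copy was really discarded.
        if len(self.pages) >= self.limit:
            oldest_url = next(iter(self.pages))
            del self.pages[oldest_url]
        self.pages[url] = doc
```

At any instant the crawler holds at most `limit` raw documents, yet nothing stops it from having fetched and processed every page on the site.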
Throttling is another complicated problem, one which robots.txt tries to solve but doesn’t go far enough on.
The de facto specification includes a ‘Crawl-delay’ directive for pauses between requests, but this doesn’t actually solve the problem.
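As an aside, Python’s standard library already parses the nonstandard Crawl-delay directive, so crawlers have little excuse for ignoring it. A minimal sketch (the hostname and user agent are made up):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.modified()  # mark the rules as freshly fetched (crawl_delay needs this)
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("Spinn3r")  # 5 -- seconds to wait between requests
allowed = rp.can_fetch("Spinn3r", "http://example.com/private/page")  # False
```

The delay still only expresses a fixed serial pacing, which is exactly the limitation discussed below.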
Serializing requests isn’t necessarily required to throttle access to a website.
Fetching a maximum of 10 requests per second, even if they’re overlapping, should be fine for most websites.
Google implements a latency based throttle which measures HTTP response time and backs off when it starts to rise.
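Here’s a rough sketch of what a latency-based throttle could look like. This is my own illustration of the idea, not Google’s actual algorithm, and all the constants are made up:

```python
import time


class LatencyThrottle:
    """Adaptive throttle sketch: track the site's typical response time
    and back off when latency starts to rise."""

    def __init__(self, base_delay=0.1, max_delay=30.0, backoff=2.0):
        self.base_delay = base_delay  # delay between requests when healthy
        self.max_delay = max_delay    # never wait longer than this
        self.backoff = backoff        # multiplier applied when latency rises
        self.delay = base_delay
        self.baseline = None          # smoothed "normal" response time

    def record(self, response_time):
        """Feed in the HTTP response time of the last request."""
        if self.baseline is None:
            self.baseline = response_time
        else:
            # exponential moving average of typical latency
            self.baseline = 0.9 * self.baseline + 0.1 * response_time
        if response_time > 2 * self.baseline:
            # server looks like it's struggling: slow down
            self.delay = min(self.delay * self.backoff, self.max_delay)
        else:
            # recover gradually back toward the base rate
            self.delay = max(self.delay / self.backoff, self.base_delay)

    def wait(self):
        time.sleep(self.delay)
```

The nice property is that the throttle adapts to the site’s actual capacity instead of relying on a fixed delay the webmaster guessed at.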
Destruction of URLs and expensive URLs
One problem with REP is that it’s unclear WHY a URL was disallowed. What happens if someone implements a GET URL that actually deletes resources or otherwise mangles databases?
These things have happened before and they will happen again.
Also, what about URLs that are just expensive to load and place a heavy burden on the database?
Implementing a better throttle vocabulary would help fix this problem, of course, but having ALL the robots fetch such a URL, even throttled, could still DOS your website because the robots don’t actually coordinate their throttling.
Google, Bing, and Ask aren’t the only crawlers
This is another problem we frequently see… because website owners feel uncomfortable sharing their content with “just anybody” they will often mark their website content available for only search engines.
The problem is that Google, Bing, and Ask aren’t the only search engines in town.
This has a chilling effect on sites like Blekko (and a number of Spinn3r customers) because they have to go to every website that blocks them and beg for access.
This makes it harder for search engine startups because securing access at this scale would be very expensive.
One potential solution is to support profile-based user agents in robots.txt, with strict definitions in the REP.
If you’re a public search engine, whether Google, Microsoft, or a small or stealth startup, you could then be allowed (or disallowed) access to the content without the rules benefiting just the big guys.
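To make the idea concrete, a profile-aware robots.txt might look something like this. To be clear, the ‘User-profile’ directive is hypothetical and does not exist in the real REP; this is just a sketch of the proposal:

```
# Hypothetical syntax -- 'User-profile' is NOT part of the current REP
User-profile: public-search-engine
Allow: /

User-agent: *
Disallow: /
```

Any crawler that truthfully declares the public-search-engine profile would get in, while unclassified robots stay blocked, without the site having to whitelist Google and Bing by name.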
What’s a robot? What about RSS?
There is also the issue of what exactly a robot is.
We’ve seen sites that have RSS feeds and a public API available, but then have a Disallow for all user agents.
How does that make sense? So if you’re a robot you can’t index their RSS feed?
Is NetNewsWire a robot? What about Firefox with their RSS bookmarks feature?
What about Google Reader? Is that a robot?
Privacy is complicated
One problem that various social networks have run into is that they allow users to be ‘private’ by requiring the user to have an account on the website before showing their profile.
However, they then turn around and publish a snippet of their profile via unauthenticated HTTP.
If you’re clever you can build a search engine around this data and publish it in aggregate.
This can often frighten users, because they never intended their profile data to be used in this manner.
With facial recognition software it’s then possible to tie profiles together or use other textual signals to merge data from various social networks.
It’s fair to say that this could really alarm users who are not up to date with the Internet’s power to de-anonymize people.
This is a fair concern, but some of the pressure is on the social networks to clarify privacy for their users and not enable these awkward situations in the first place.
… and at the end of the day, REP can be used by a website JUST because they don’t like you or don’t understand what you’re up to.
This is the major problem as I see it. Solving it will require websites to clearly explain why and how their content may be used.
It doesn’t help the situation that some websites are only allowing the big boys by default and blocking everyone else.
This is a dangerous situation because startups now need to beg for permission across thousands of websites (which is very expensive).
The way forward
I think the way forward is not to attempt to solve all the problems with REP right now, but to focus on the few things that are clearly broken and try to fix them.
Perfection always seems to be the enemy of progress, and maybe REP will never be perfect, but for now it’s all we have.