Tailrank’s Open Source Web Thumbnail Backend

We’ve open sourced the web thumbnail backend that we use within Tailrank.

It needs some work, but if you’re ready to get your hands dirty, webthumb will get you 80% of the way to a scalable thumbnail backend.

The API is pretty simple. You make a REST call to webthumb with the URL you want a thumbnail of, and it handles everything for you. Errors are reported as HTTP 500 status codes, and a successful request returns an HTTP 302 redirect to a static file.
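As a rough sketch of what a client for this could look like: the code below makes a GET request and maps the response onto the two outcomes described above (302 with a redirect target, or 500 on error). The endpoint path and the `url` query parameter name are assumptions made up for illustration, not webthumb's documented interface, so check the source before relying on them.

```python
import http.client
from urllib.parse import urlencode, urlsplit


def interpret_response(status, location):
    """Map a webthumb-style response to a thumbnail location, or raise."""
    if status == 302 and location:
        # Success: the Location header points at the static thumbnail file.
        return location
    if status == 500:
        raise RuntimeError("webthumb reported an error (HTTP 500)")
    raise RuntimeError(f"unexpected response: HTTP {status}")


def request_thumbnail(service_url, page_url, timeout=30):
    """Ask a webthumb instance for a thumbnail of page_url.

    service_url and the "url" parameter name are hypothetical.
    http.client is used because it does not follow redirects on its own,
    so the 302 is visible to the caller.
    """
    parts = urlsplit(service_url)
    query = urlencode({"url": page_url})
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("GET", f"{parts.path or '/'}?{query}")
        resp = conn.getresponse()
        return interpret_response(resp.status, resp.getheader("Location"))
    finally:
        conn.close()
```

With a local instance running, usage might look like `request_thumbnail("http://localhost:8080/webthumb", "http://example.com/")`, where the host, port, and path are again placeholders.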

There are a few major reasons we’ve decided to open this up:

– Web thumbnails are no longer a competitive feature for Tailrank.

– A lot of people are doing this now and it makes sense to use an OSS framework.

– I want to extend this platform for use in malware detection, including detection of doorway pages and JavaScript redirects.

– I want to extend the backend to support virtualization. This way a webthumb instance can be started, used to test for malware, and then have its image destroyed and restarted. This keeps any exploited browser vulnerability from compromising later malware detection or thumbnail generation.

The way we integrate with Mozilla is a bit of a hack. You can specify a debug mode in Mozilla, and it then logs URL status to a file. This works well, but it would be nice to have a real API call that is synchronous and returns a status to the caller. This could be done with a browser extension, but that hasn’t been written yet.

One problem with this model that we hit early on is that the browser can pop up dialog boxes, which can then accidentally end up in the resulting thumbnail. Errors like ‘referenced font is not available’ showed up in early versions of Tailrank until we found ways to disable them.

I’d also like to extend the platform to support a REST API for crawlers to integrate with the browser directly. It would be nice for a crawler to give a URL to a browser instance, render it, and then get back the resulting DOM within the crawler. This way you know what the resulting

Update: Integrating this with jssh would be hot!

JSSh is a Mozilla C++ extension module that allows other programs (such as telnet) to establish JavaScript shell connections to a running Mozilla process via TCP/IP. This functionality is useful for interactive debugging/development of Mozilla applications, remotely controlling Mozilla, or for automated testing purposes.
