Reliable CMS Generator Detection

I’ve been playing around with microformat and nanoformat[1] parsing today using real-world HTML. One feature I’d like is the ability to reliably detect the CMS version a website is running.

For example, the Movable Type site is running some version of Movable Type (probably not Typepad), but which version?

They’ve stripped the generator meta element from their HTML (I’m pretty sure it’s in the default MT templates). I can’t check the RSS feed either (the generator is sometimes there) because they’re rewriting it via FeedBurner.

A number of CMS systems out there are nice enough to include a generator meta element, but it often excludes any specific version number.

GigaOm is nice enough to include one but it doesn’t include any versioning information.

PhotoMatt was nice enough to include a generator AND version – “WordPress 2.4-bleeding” – whatever that means. I assume it means 2.4 from version control?
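
To make the generator check concrete, here is a minimal sketch in Python of how a robot might pull the generator meta element out of a page and split off a version when one is present. The names `GeneratorMetaParser`, `split_generator` and `detect_generator` are made up for illustration, and the version regex is only a guess at common layouts such as “WordPress 2.4-bleeding”, not a complete grammar.

```python
import re
from html.parser import HTMLParser


class GeneratorMetaParser(HTMLParser):
    """Collects the content of every <meta name="generator"> element."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").strip().lower() == "generator":
            self.generators.append(attr.get("content", "").strip())


def split_generator(value):
    """Split 'WordPress 2.4-bleeding' into ('WordPress', '2.4-bleeding').

    Returns (value, None) when no trailing version-like token is found,
    which is the case for a bare name or a bare URL.
    """
    match = re.match(r"^(.*?)[\s/]+v?(\d[\w.\-]*)$", value)
    if match:
        return match.group(1).strip(), match.group(2)
    return value, None


def detect_generator(html):
    """Return a list of (name, version_or_None) pairs found in the HTML."""
    parser = GeneratorMetaParser()
    parser.feed(html)
    return [split_generator(g) for g in parser.generators if g]
```

A page whose template has had the element stripped simply yields an empty list, which is exactly the failure case described here.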

However, at present a robot is at the mercy of the author/designer to preserve the generator information. It’s possible to accidentally strip it, which leaves a robot confused and could possibly hurt the SEO of the blog’s owner without their knowledge.

Ideally there would be some type of generator discovery protocol whereby a robot could easily discover the generator, and which wasn’t vulnerable to these types of flaws.

A straw man proposal would be to have a fixed URL (/generator.xml) which would return this metainfo. It could even be a static file.

Again. Straw man proposal. I don’t really know the solution right now – just identifying the problem.

Of course, maybe the best solution is to just have CMS vendors include the generator, and add a comment in the HTML saying DO NOT REMOVE.

1. Nanoformat parsing is indexing semantic HTML using the real-world templates deployed by the major CMS platforms like WordPress, Typepad, etc.

Update: Of the top 100 highest-ranking Movable Type blogs in our index, 57% of them just had a generator of http://www.movabletype.org/. This isn’t very helpful if you need to know the exact version of MT. At the very minimum it would be nice to have this for computing statistics.


  1. 1. Autodiscover the feed.
     2. Extract the generator from there.

    It even works on MT and Typepad. Almost no one removes it from their feed templates. The only one I know of without a generator in there is Drupal.

    :)

  2. That’s not reliable either.

    feedblog.org/feed (no version info)

    Though that might be a bad example because all WordPress blogs run the same version.

  3. Why is the version number important?

  4. I’m working on comment extraction for Spinn3r from the raw HTML.

    With WordPress it’s not TOO bad because there’s wfw:comments, but with other CMS systems, having the version will allow me to figure out what type of parser I need to use.

    Also, with this data I can compute stats on CMS version distribution.

    Last time we only published our stats on 50k sites but I’d like to expand on this to show off WordPress world domination :-P
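
The feed-based approach suggested in the first comment (autodiscover the feed, then read its generator) could be sketched roughly like this in Python. This is only an illustration under some assumptions: the function and class names are invented, fetching the page and the feed is left out (something like `urllib.request.urlopen` could do it), and the code only looks for a `generator` element as used in RSS 2.0 and Atom. As the second comment points out, the value found there may still carry no version.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used in feed autodiscovery <link> elements.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects href values of <link rel="alternate"> feed elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {k: (v or "") for k, v in attrs}
        rels = attr.get("rel", "").lower().split()
        feed_type = attr.get("type", "").strip().lower()
        if "alternate" in rels and feed_type in FEED_TYPES and attr.get("href"):
            self.hrefs.append(attr["href"])


def autodiscover_feeds(html, base_url):
    """Step 1: return absolute feed URLs advertised by the page."""
    parser = FeedLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def feed_generator(feed_xml):
    """Step 2: return (text, version, uri) of the first generator element.

    RSS 2.0 usually carries everything in the element text, while Atom may
    use 'version' and 'uri' attributes. Returns None if there is no
    generator element at all.
    """
    root = ET.fromstring(feed_xml)
    for element in root.iter():
        # Strip any XML namespace, e.g. '{http://www.w3.org/2005/Atom}generator'.
        if element.tag.rsplit("}", 1)[-1] == "generator":
            text = (element.text or "").strip()
            return text, element.get("version"), element.get("uri")
    return None
```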





