Reliable CMS Generator Detection
I’ve been playing around with microformat and nanoformat parsing today using real-world HTML. One feature I’d like is the ability to reliably detect the CMS version a website is running.
For example, the Movable Type site is running some version of Movable Type (probably not TypePad), but which version?
They’ve stripped the generator meta element from their HTML (I’m pretty sure it’s in the default MT templates). I can’t check the RSS feed either (the generator is sometimes there) because they’re rewriting it via FeedBurner.
A number of CMS platforms out there are nice enough to include a generator meta element, but it often excludes any specific version number.
GigaOm is nice enough to include one but it doesn’t include any versioning information.
PhotoMatt was nice enough to include a generator AND version – “WordPress 2.4-bleeding” – whatever that means. I assume it means 2.4 from version control?
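To make the detection step concrete, here is a minimal sketch of how a robot might pull the generator out of a page and split off a version when one is present. It uses only the Python standard library. The names `GeneratorFinder`, `split_generator`, and `detect_generator` are made up for illustration, and the "version is the last token and starts with a digit" rule is my assumption, not something any CMS promises.

```python
import re
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the content of every <meta name="generator"> element."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None, so guard before lowercasing.
        if (attr.get("name") or "").lower() == "generator" and attr.get("content"):
            self.generators.append(attr["content"].strip())


def split_generator(value):
    """Split 'WordPress 2.4-bleeding' into ('WordPress', '2.4-bleeding').

    Assumption: the version, if any, is the last whitespace-separated
    token and starts with a digit. Returns (value, None) otherwise.
    """
    match = re.match(r"^(.*?)\s+v?(\d\S*)\s*$", value)
    if match:
        return match.group(1), match.group(2)
    return value, None


def detect_generator(html):
    """Return a list of (name, version) pairs found in the HTML."""
    finder = GeneratorFinder()
    finder.feed(html)
    return [split_generator(g) for g in finder.generators]


if __name__ == "__main__":
    page = '<head><meta name="generator" content="WordPress 2.4-bleeding" /></head>'
    print(detect_generator(page))
```

A page with the element stripped simply yields an empty list, which is exactly the case the robot can't do anything about.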
However, at present a robot is at the mercy of the author/designer to preserve the generator information. It’s possible to accidentally strip it, which leaves a robot confused and could possibly hurt the SEO of the blog’s owner without their knowledge.
Ideally there would be some type of generator discovery protocol whereby a robot could easily discover the generator, and which wasn’t vulnerable to these types of flaws.
A straw man proposal would be to have a fixed URL (/generator.xml) which would return this metainfo. It could even be a static file.
Again. Straw man proposal. I don’t really know the solution right now – just identifying the problem.
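Purely to illustrate the straw man, here is a rough sketch of what the robot side could look like. The only thing taken from the proposal is the fixed /generator.xml path. The XML layout (a `generator` root with `name` and `version` children) is entirely hypothetical, as are the function names; nothing like this exists today.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen

GENERATOR_PATH = "/generator.xml"  # fixed URL from the straw man proposal


def generator_url(site_url):
    """Resolve the fixed generator path against any URL on the site."""
    return urljoin(site_url, GENERATOR_PATH)


def parse_generator_xml(text):
    """Parse a HYPOTHETICAL document of the form:

        <generator><name>...</name><version>...</version></generator>

    Returns (name, version); either may be None if the element is missing.
    """
    root = ET.fromstring(text)
    return root.findtext("name"), root.findtext("version")


def fetch_generator(site_url, timeout=10):
    """Fetch and parse the generator file for a site (makes a network call)."""
    with urlopen(generator_url(site_url), timeout=timeout) as response:
        return parse_generator_xml(response.read())
```

Because the file lives outside the templates, a designer editing the HTML would have no reason to touch it, which is the whole point of the idea.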
Of course, maybe the best solution is to just have CMS vendors include the generator, and add a comment in the HTML saying DO NOT REMOVE.
1. Nanoformat parsing is indexing semantic HTML using the real-world deployed templates of the major CMS platforms like WordPress, TypePad, etc.
Update: Of the top 100 high-ranking Movable Type blogs in our index, 57% just had a generator of