Archive for the ‘wordpress’ Category

I bumped into Toni Schneider today at Crossroads and was reminded that we haven’t yet published our CMS breakdown.

We’re going to be publishing these live on spinn3r.com ….

This is based on raw posts per hour, with spam removed.

Statistics tend to give you a limited perspective on what’s under the covers. For example, I suspect that the average TypePad blog post has a higher rank than the average WordPress post.

Our new stats system allows us to ship a LOT more stats now (and scale them). For example, we should ship a ranking breakdown for WordPress, language post rate for WordPress, or even run these across all of the blog hosts.

Pretty good timing though considering WordCamp is right around the corner.

Technorati published more information on the WordPress blog spam cancer that’s spreading around the Internet.

If you’re running a version of WordPress less than 2.5 you need to stop what you’re doing NOW and upgrade! Don’t wait until your blog is compromised.

The blogosphere has had its share of maladies before. Comment spam, trackback spam, splogs and link trading schemes are the colds and flus that we’ve come to know and groan about. But lately, a cancer has afflicted the ecosystem that has led us at Technorati to take some drastic measures. Thousands of WordPress installations out in the wilds of the web are vulnerable to security compromises, they are being actively exploited and we’re not going to index them until they’re fixed.

We know about them at Technorati because part of what we do is count links. Compromised blogs have been coming to our attention because they have unusually high outbound links to spam destinations. The blog authors are usually unaware that they’ve been p0wned because the links are hidden with style attributes to obscure their visibility. Some bloggers only find out when they’ve been dropped by Google, this WordPress user wrote

I’ve reached out to Ian Kallen to offer collaboration on fixing this issue.

We’re going to push out a point release of Spinn3r to block blogs that exhibit this spam problem.
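A crawler-side check for this kind of compromise could look something like the sketch below. To be clear, this is a rough heuristic and not what Spinn3r or Technorati actually run: it just collects outbound links that sit inside an element hidden with an inline style. The list of hidden-style patterns and the names `HiddenLinkCounter` and `hidden_outbound_links` are my own assumptions for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Inline styles commonly used to hide injected links (an assumed, incomplete list).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenLinkCounter(HTMLParser):
    """Collects outbound links that are nested inside a hidden element."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.stack = []            # one (tag, is_hidden) entry per open element
        self.hidden_outbound = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent_hidden = bool(self.stack) and self.stack[-1][1]
        own_hidden = bool(HIDDEN_STYLE.search(attrs.get("style") or ""))
        hidden = parent_hidden or own_hidden
        if tag == "a" and hidden:
            host = urlparse(attrs.get("href") or "").netloc
            if host and host != self.own_host:
                self.hidden_outbound.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy real-world HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def hidden_outbound_links(html, own_host):
    """Return the hrefs of hidden links that point away from own_host."""
    parser = HiddenLinkCounter(own_host)
    parser.feed(html)
    return parser.hidden_outbound
```

A blog with more than a handful of these is worth a closer look. Links hidden through an external stylesheet or JavaScript would slip past this, so treat it as a first-pass filter only.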

It’s such a rare event to have hundreds of thousands of weblogs compromised in a systematic manner.

OK gang. I need help.

I’m SLAMMED with work but I’d really like to find SOME way to alert these bloggers who have had their blogs compromised by this Trojan/Worm.

These people are going to be REALLY harmed by this since Google’s probably just going to zero out their pagerank for a while.

I need someone that can either contact these people manually or write code that can automatically post a comment to their most recent blog post.

It doesn’t look like WordPress would block these because it won’t be picked up by Akismet. That and it’s not spam since we’re trying to help these people.

I’ll give you a list of 200-300 blogs and you do the rest.

I figured it was at least worth asking…

Hopefully, this large WordPress 2.5 release will cause a few people to upgrade.

We just found another 200 or so sites that have been compromised today.

Not fun.

The zombie/trojan spam blogs are at it again tonight. I just caught another 5k stories published to Tailrank because of this recent blog spam torrent.

There is clearly some unknown vulnerability that the attacker must be exploiting. I’ve only done sample-based auditing of about 20% of the links and they’re nearly 100% WordPress blogs ranging from versions 1.5-2.x.

What’s the most efficient way to alert 200-300 WordPress bloggers that they’ve been owned?

I could write an automated script to post a comment to their most recent blog entry. Of course I wouldn’t be able to get through the captcha barrier. I could create a dedicated blog post linking to every single blog and hope they check with Technorati or Google Blog Search for their mentions.

That might actually be a good idea. I think I might do that tomorrow. It would be nice to re-enable these blogs at some point.

This is a good reason to subscribe to Spinn3r btw. If you need a crawler it doesn’t make a lot of sense to have your Engineering staff constantly chase down spam. Let us do it for you.

I’ve been playing around with microformat and nanoformat[1] parsing today using real world HTML. One feature I’d like is the ability to reliably detect the CMS version a website is running.

For example, the Movable Type site is running some version of Movable Type (probably not TypePad) but which version?

They’ve stripped the generator meta element from their HTML (I’m pretty sure it’s in the default MT). I can’t check the RSS feed either (the generator is there sometimes) because they’re rewriting it via FeedBurner.

A number of CMS systems out there are nice enough to include a generator meta element but it often excludes any specific version number.

GigaOm is nice enough to include one but it doesn’t include any versioning information.

PhotoMatt was nice enough to include a generator AND version – “WordPress 2.4-bleeding” – whatever that means. I assume it means 2.4 from version control?

However, at present a robot is at the mercy of the author/designer to preserve the generator information. It’s possible to accidentally strip it, which leaves a robot confused and could possibly hurt the SEO of the blog’s owner without their knowledge.
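For reference, here is roughly what a robot has to do today. This is a minimal sketch using only the Python standard library; the names `GeneratorParser` and `parse_generator` are mine, and the rule that a version is a trailing token starting with a digit is an assumption that won’t hold for every CMS.

```python
import re
from html.parser import HTMLParser


class GeneratorParser(HTMLParser):
    """Collects the content of <meta name="generator"> elements."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "generator":
            content = (attrs.get("content") or "").strip()
            if content:
                self.generators.append(content)


# Assumption: a version is the last token and starts with a digit,
# e.g. "2.4-bleeding". Anything else is treated as "no version".
VERSION = re.compile(r"^(?P<name>.*?)\s+v?(?P<version>\d[\w.\-]*)$")


def parse_generator(html):
    """Return (name, version) for the first generator, or None if absent.

    version is None when the generator carries no recognizable version.
    """
    parser = GeneratorParser()
    parser.feed(html)
    if not parser.generators:
        return None
    value = parser.generators[0]
    match = VERSION.match(value)
    if match:
        return match.group("name"), match.group("version")
    return value, None
```

Fed a page with `content="WordPress 2.4-bleeding"` this returns `("WordPress", "2.4-bleeding")`. Fed a page whose generator is just `http://www.movabletype.org/` it returns that URL with a version of `None`, and a page with the element stripped returns `None`, which is exactly the problem.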

Ideally there would be some type of generator discovery protocol whereby a robot could easily discover the generator, and which wasn’t vulnerable to these types of flaws.

A straw man proposal would be to have a fixed URL (/generator.xml) which would return this metainfo. It would even be a static file.

Again. Straw man proposal. I don’t really know the solution right now – just identifying the problem.
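To make the straw man a little more concrete, here is a sketch of the robot side. Only the `/generator.xml` path comes from the proposal above. The file layout (a `generator` root with `name` and `version` children) and the function names are entirely hypothetical, since no such format exists.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def generator_url(blog_url):
    """Build the fixed discovery URL for a blog from any of its page URLs."""
    return urljoin(blog_url, "/generator.xml")


def parse_generator_xml(document):
    """Parse the hypothetical <generator><name/><version/></generator> layout.

    Returns a dict with "name" and "version" (either may be None), or None
    when the root element is not <generator>.
    """
    root = ET.fromstring(document)
    if root.tag != "generator":
        return None
    return {
        "name": (root.findtext("name") or "").strip() or None,
        "version": (root.findtext("version") or "").strip() or None,
    }
```

The appeal is that the file lives outside the template, so a redesign can’t accidentally drop it.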

Of course, maybe the best solution is to just have CMS vendors include the generator, and add a comment in the HTML saying DO NOT REMOVE.

1. Nanoformat parsing is indexing semantic HTML with real world deployed templates used in the major CMS platforms like WordPress, Typepad, etc.

Update: Of the top 100 high ranking Movable Type blogs in our index, 57% of them just had a generator of http://www.movabletype.org/. This isn’t very helpful if you need to know the exact version of MT. At the very minimum it would be nice to have this for computing statistics.

I’m currently getting HIT with a comment storm on WordPress.

I don’t think Akismet is handling one-word comments; it doesn’t realize that these comments should be sent to /dev/null.

About 50 comments in the last 72 hours.

Not too bad but kind of annoying.

Update:

OK. This is becoming a problem. I’ve already had another dozen emails today.

Here’s one bug which can have a negative SEO impact on TypePad migrations.

Older URLs return an HTTP 302, not an HTTP 301.

For example fetching this:

http://www.feedblog.org/2007/04/slack_and_mysql.html

returns a 302 redirect to:

http://feedblog.org/2007/04/27/slack-and-mysql/

It should return a 301 ….
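If you want to check your own migration, something like the following sketch would do it. It uses Python’s `http.client`, which does not follow redirects on its own, so you see the raw status code. The function names are mine, and the comments about how crawlers treat each status describe the usual convention, not any specific search engine’s documented behavior.

```python
import http.client
from urllib.parse import urlsplit


def classify_redirect(status):
    """Describe how a crawler is likely to treat a redirect status code."""
    if status in (301, 308):
        return "permanent"       # the old URL should be replaced by the new one
    if status in (302, 303, 307):
        return "temporary"       # the old URL is expected to come back
    return "not a redirect"


def fetch_redirect(url, timeout=10):
    """Issue a HEAD request and return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling `fetch_redirect` on an old permalink and passing the status to `classify_redirect` tells you whether the move is being advertised as permanent.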

We enabled a feature in our crawler to record the ‘generator’ meta tag to get a rough idea of real world CMS deployment.

This is based on a sample of 50k weblogs with generators. This isn’t across our entire weblog index as the crawler is still executing.

There’s one major disclaimer here. A good percentage of weblogs probably don’t have a generator specified. Often when people change their default template they drop the generator meta tag.
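The tally itself is simple. Here is a minimal sketch of the idea, not our actual crawler code: it takes one raw generator string (or `None`) per weblog, skips the missing ones, and buckets the rest by the first word so that different versions of the same CMS are counted together. The function name and the first-word bucketing are assumptions, and the bucketing is crude (a two-word product name ends up under its first word).

```python
from collections import Counter


def generator_breakdown(generators):
    """Return [(cms, count, percent)] sorted by count, ignoring missing values.

    `generators` holds one raw generator string (or None) per weblog.
    Percentages are relative to weblogs that actually have a generator.
    """
    # Reduce e.g. "WordPress 2.1" to "wordpress" so versions share a bucket.
    names = [g.split()[0].lower() for g in generators if g and g.strip()]
    total = len(names)
    counts = Counter(names)
    return [(name, n, round(100.0 * n / total, 1))
            for name, n in counts.most_common()]
```

Note that the percentages only cover blogs that kept their generator, which is the disclaimer above expressed in code.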

Also, while WordPress seems to be doing very well here compared to TypePad, Six Apart gets about $20 per month per TypePad account. I’d rather be in TypePad’s shoes! :-)

I also think Movable Type’s numbers are probably depressed a bit since it’s really easy to hack your template and potentially remove your generator field.

Scoble puts forth his definition of blogs which I find far too narrow.

First. If you’re a blog you apparently have to send pings:

I would go as far as saying that a site that does not ping a pingserver, like weblogs.com, is NOT a blog (private Web sites don’t ping weblogs.com and are NOT discoverable by search engines).

That’s not a very healthy requirement. Were people blogging before ping servers? Yes. If I disable pinging am I still a blogger?

The key issue for me is the fact that 80% of pings are spam and garbage. For the most part they’re useless. Someone really needs to spend some time here and clean this up a bit.

If you’re thinking of writing a new blog search engine or RSS aggregator I’d recommend totally and completely ignoring pings. Tailrank only really uses pings to re-prioritize and we index often enough that it really doesn’t matter that much.

Syndicatable. I can use a news aggregator to read your content, which lets me read a lot more blogs. (I can’t do that with private spaces).

This is a requirement? Technorati doesn’t care if you have an RSS feed. Tailrank doesn’t. Google doesn’t. Having a feed is a great idea but I don’t see this as being a requirement.

Blogging is a social phenomenon not a technical one. Robert should know this. What was new and innovative about blogs was the permalink. The fact that people could link to their ideas and easily post a response.

Whether you send pings or support RSS is totally optional. Ask every 16 year old MySpace user if they blog and they’ll respond with a resounding yes. Ask them if they know anything about RSS or pings and they’ll just stare at you…

I’m going to call it like I see it. Vox is a WordPress and MySpace killer.

Add me as a friend!

At WordCamp one of the ‘sessions’ was a musical interlude. Tantek and I took notes:

	<burtonator_>	we should all sing along on IRC
	<Cybo_>	haha
	<burtonator_>	lend me your eears and I'll sing you a song
	<burtonator_>	I will try not to sing out of key
	<burtonator_>	(this is for the transcript)
	<StaliN>	lalalalla
	<burtonator_>	I get high with a little help from my friends
	<bytee_>	StaliN: no, i believe you hjaven't authenticated
	<tantek>	oh i get by with a little help from my friends
	<Cybo_>	:)
	<tantek>	what do you feel when your love is away
	<tantek>	are you sad because you are on your own
	<tantek>	what do you feel at the end of the day
	<burtonator_>	what do you feel at the end of the day
	<burtonator_>	I get by with a little help from my friends

Thanks Eric Haller!

Here’s an interesting problem. How do you get a handle on all WordPress or TypePad blogs? Right now you can’t. You could accept pings but Six Apart doesn’t send pings anymore. Most of the ping traffic is filled with spam anyway so you’ll end up wasting a ton of CPU time. TypePad also supports domain masking where the blog URL is feedblog.org and not feedblog.typepad.com. This means a lot more work is required to verify the ping is actually from TypePad.

If these guys were to simply push a static XML dump of all their blog URLs this would be 95% of the way there. It would make writing tools much easier for developers and I think yield a space for innovation. For example you could write a tool which shows hot new WordPress blogs. Or you could write a tool that was the Six Degrees of Six Apart similar to the Six Degrees of Wikipedia hack.

The XML format doesn’t matter. It could be as simple as:


echo "SELECT URL FROM WEBLOGS" | mysql --xml

That would get us 95% of the way there…
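Consuming a dump like that takes only a few lines. The sketch below assumes the dump uses the layout the mysql command-line client emits with its `--xml` option (a `resultset` root, one `row` per record, and `field` elements carrying a `name` attribute). The function name `blog_urls` and the sample URLs are just for illustration.

```python
import xml.etree.ElementTree as ET


def blog_urls(dump):
    """Extract blog URLs from a <resultset><row><field name="URL"> dump."""
    root = ET.fromstring(dump)
    urls = []
    for field in root.iter("field"):
        # Column names are matched case-insensitively; empty fields are skipped.
        if field.get("name", "").upper() == "URL" and field.text:
            urls.append(field.text.strip())
    return urls


sample = """<?xml version="1.0"?>
<resultset statement="SELECT URL FROM WEBLOGS">
  <row><field name="URL">http://feedblog.org/</field></row>
  <row><field name="URL">http://feedblog.typepad.com/</field></row>
</resultset>"""

for url in blog_urls(sample):
    print(url)
```

For a dump with millions of rows you would want a streaming parser rather than loading the whole document, but the idea is the same.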

Looks like there’s a new weblog company in town. Automattic is a new corporate entity which now employs a number of core WordPress developers including Matt Mullenweg himself.

The company has been around for a while it seems and hopefully is making enough money to support the lead developers. Their jobs page implies that they’re looking for other developers so this appears to be the case.

I assume the revenue model is consulting services, since it’s generally a bit difficult to license Open Source software.

Apparently Zawodny thinks trackback is dead:

I’m convinced now more than ever. But I’ll spare you all long rant about why it’s dead, since others have already written this for me:

I’ve always felt that the case for trackback was a difficult one to make. One problem is that trackbacks are essentially automated comments.

Why not just provide a web comment API? No one has been working on this really and it’s a bit shocking that it’s 2005 and we still don’t have a way to syndicate comments (granted WordPress has a strong implementation of wfw:comments but that’s only one provider). Given a permalink why can’t I get a list of comments? Why can’t I post a comment through NetNewsWire or Ecto? I can send a trackback via Ecto but only with reduced metadata (the extract, not the full text).

I still don’t feel that Pubsub, Technorati, IceRocket, or Feedster are the answer. This just adds a brittle centralized man in the middle which can break when you least expect it.

Neither of these will be fixed anytime soon though. Link tracking services will still be fragile and comment APIs won’t take off (I’m obviously jaded).

Update:

The RSS Weblog has some thoughts:

Personally, I think TrackBacks foster self-promotion more than dialogue. Spam aside, linking offsite to follow the conversation is just too disruptive.