Archive for the ‘RSS’ Category

Matt just announced that WordPress will support the new RSS cloud protocol.

This ping model has already existed with Ping-o-Matic, of course (which Matt/WordPress have been running since the blog epoch), and Spinn3r customers already benefit from this. In fact, we’ve been realtime for a long time now.

WordPress.com has always supported update pings through Ping-o-Matic so folks like Google Reader can get your posts as soon as they’re posted, but getting every ping in the world is a lot of work, so not that many people subscribe to Ping-o-Matic. RSS Cloud effectively allows any client to register to get pings for only the stuff they’re interested in.

We haven’t announced this yet, but we pushed a new filtering API to Spinn3r in the last release. We developed a domain-specific language for filtering web content in real time.

A number of our customers have already started using this in production.

It’s nice that more people are pushing realtime content, but I’m starting to worry about the proliferation of protocols here. XML-RPC pings are the old-school way of handling things; now there’s also PubSubHubbub, the Twitter streaming API, SUP, and so on.
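For reference, the old-school ping really is just one XML-RPC method call. A minimal sketch in Python against Ping-o-Matic’s public endpoint (the blog name and URL are placeholders):

import xmlrpc.client

# The classic weblogUpdates.ping(blog_name, blog_url) call that
# Ping-o-Matic and friends accept.
server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")
result = server.weblogUpdates.ping("Example Blog", "http://blog.example.com/")
print(result)  # typically a struct like {'flerror': False, 'message': '...'}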

However, I’ve played with most of these and think that they are all lacking in some area. One major problem is relaying messages when nodes fail and then come back online. For example, with XMLRPC pings, or the Twitter stream API, if my Internet connection fails, I’ve lost these messages forever.

The Spinn3r protocol doesn’t have this problem: it supports resume. You just start off from where you last requested data, and since we keep infinite archives, nothing is ever lost.
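The client-side contract is simple: persist your position, and resume from it after a failure. Here’s a sketch of the idea (the endpoint and parameter names are hypothetical, not our actual API):

import json
import urllib.request

STATE_FILE = "resume-position.json"
API = "http://api.example.com/stream?offset=%d"  # hypothetical endpoint

def load_offset():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["offset"]
    except (FileNotFoundError, ValueError):
        return 0  # first run: start from the beginning of the archive

def save_offset(offset):
    with open(STATE_FILE, "w") as f:
        json.dump({"offset": offset}, f)

# If the process or the network dies, we just re-request from the last
# offset we durably recorded. Nothing is dropped.
offset = load_offset()
with urllib.request.urlopen(API % offset) as resp:
    batch = json.loads(resp.read())
for item in batch["items"]:
    print(item["url"])  # ... process the item ...
save_offset(batch["next_offset"])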

I don’t think most sites can support this much data (it’s expensive) but certainly a few hours of buffer, held in memory, seems reasonable to handle a transient outage.

ReadWriteWeb has more on this and is leading with a somewhat sensational title that would imply that these blogs were not real time in the past.

TechCrunch has more, as does Scobleizer.

One big issue with these protocols is spam. If it’s an open cloud, any spammer can send messages into it (which is the case with Ping-o-Matic, which receives 90% spam). And of course spammers can receive messages from the cloud to train their own classifiers and find spam targets.

Spinn3r’s AUP prevents this usage. We also remove spam from the feed to begin with, which is nice for our customers and lets them build algorithms without having to worry about these attacks.

It looks like FriendFeed is proposing a new update protocol for RSS that avoids the thundering herd problem inherent in RSS polling.

When you add a web site like Flickr or Google Reader to FriendFeed, FriendFeed’s servers constantly download your feed from the service to get your updates as quickly as possible. FriendFeed’s user base has grown quite a bit since launch, and our servers now download millions of feeds from over 43 services every hour.

VentureBeat has more on the subject (as does Tech Confidential):

It looks like the rapid fire site updates are about to start again for the social content conversation site FriendFeed. Just a few days after the launch of its new “beta” area, FriendFeed is finalizing a new technology that could help pull content into the site at a much faster rate.

The technology, called Simple Update Protocol (SUP), will process updates from the various services that FriendFeed imports faster than it currently does using traditional Really Simple Syndication (RSS) feeds, FriendFeed co-founder Paul Buchheit told Tech Confidential.

Spinn3r has a similar problem, of course, but we have 17.5M sources to consider.

The requirements are straightforward:

* Simple to implement. Most sites can add support with only a few lines of code if their database already stores timestamps.
* Works over HTTP, so it’s very easy to publish and consume.
* Cacheable. A SUP feed can be generated by a cron job and served from a static text file or from memcached.
* Compact. Updates can be about 21 bytes each. (8 bytes with gzip encoding)
* Does not expose usernames or secret feed urls (such as Google Reader Shared Items feeds)

Sites wishing to produce a SUP feed must do two things:

* Add a special <link> tag to their SUP enabled Atom or RSS feeds. This tag includes the feed’s SUP-ID and the URL of the appropriate SUP feed.

* Periodically generate a SUP feed which lists the SUP-IDs of all of the feeds that have updated recently.
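The producer side really can be a few lines if your database already stores timestamps. Here’s a sketch of a cron-able SUP feed generator (the table layout and JSON field names are illustrative, so check them against the actual spec):

import json
import time

PERIOD = 60  # seconds of history each SUP snapshot covers

def generate_sup_feed(recent_updates):
    """recent_updates: (sup_id, updated_at) pairs, e.g. from
    SELECT sup_id, updated_at FROM feeds WHERE updated_at > NOW() - 60."""
    cutoff = time.time() - PERIOD
    return json.dumps({
        "period": PERIOD,
        "updated_time": int(time.time()),
        # one short opaque token per changed feed: no usernames, no feed URLs
        "updates": sorted(sup_id for sup_id, ts in recent_updates if ts >= cutoff),
    })

# Run from cron and write somewhere static so serving it never touches the app:
if __name__ == "__main__":
    rows = [("a1b2c3", time.time() - 10), ("d4e5f6", time.time() - 300)]
    with open("sup.json", "w") as out:
        out.write(generate_sup_feed(rows))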

It’s interesting that this is seeing attention again, because Dave proposed this back in RSS 2.0:

<cloud> is an optional sub-element of <channel>.

It specifies a web service that supports the rssCloud interface which can be implemented in HTTP-POST, XML-RPC or SOAP 1.1.

Its purpose is to allow processes to register with a cloud to be notified of updates to the channel, implementing a lightweight publish-subscribe protocol for RSS feeds.

<cloud domain="radio.xmlstoragesystem.com" port="80" path="/RPC2" registerProcedure="xmlStorageSystem.rssPleaseNotify" protocol="xml-rpc" />

In this example, to request notification on the channel it appears in, you would send an XML-RPC message to radio.xmlstoragesystem.com on port 80, with a path of /RPC2. The procedure to call is xmlStorageSystem.rssPleaseNotify.

However, SUP is not XML-RPC (which is probably good, since I’m a REST fan).

By using SUP-IDs instead of feed urls, we avoid having to expose the feed url, avoid URL canonicalization issues, and produce a more compact update feed (because SUP-IDs can be a database id or some other short token assigned by the service).

This could be avoided by just using the unique source URL. The feed is irrelevant; just map the source URL to the feed URL on your end.

Because it is still possible to miss updates due to server errors or other malfunctions, SUP does not completely eliminate the need for polling. However, when using SUP, feed consumers can reduce polling frequency while simultaneously reducing update latency. For example, if a site such as FriendFeed switched from polling feeds every 30 minutes to polling every 300 minutes (5 hours), and also monitored the appropriate SUP feed every 3 minutes, the total amount of feed polling would be reduced by about 90%, and new updates would typically appear 10 times as fast.
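To make the quoted math concrete, the consumer loop looks something like this (a sketch; fetch_sup_updates and fetch_feed are stubs you’d fill in):

import time

SUP_INTERVAL = 3 * 60          # check the compact SUP feed every 3 minutes
FULL_POLL_INTERVAL = 300 * 60  # backstop: poll everything every 5 hours

subscriptions = {}  # sup_id -> feed URL

def fetch_sup_updates():
    return []  # stub: fetch and parse the SUP feed here

def fetch_feed(url):
    pass  # stub: ordinary conditional GET plus RSS parse

last_full_poll = 0.0
while True:
    # Cheap: one small request tells us which subscribed feeds changed,
    # so we only re-fetch those.
    for sup_id in fetch_sup_updates():
        if sup_id in subscriptions:
            fetch_feed(subscriptions[sup_id])
    # Expensive: the full crawl stays, because SUP doesn't guarantee
    # delivery; it just lets you run the full poll ten times less often.
    if time.time() - last_full_poll > FULL_POLL_INTERVAL:
        for url in subscriptions.values():
            fetch_feed(url)
        last_full_poll = time.time()
    time.sleep(SUP_INTERVAL)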

Spinn3r performs a hybrid: we index pinged sources once per week, but we also index them right when they ping us. Best of both worlds, basically.

The current ping space is all over the map, though.

There’s XML-RPC, plain XML, the Six Apart update stream, and now JSON:

This doesn’t seem too different from Changes.xml…

Witness http://blogsearch.google.com/changes.xml vs http://friendfeed.com/api/sup.json

I’m not sure what the solution is here but it’s clear we need some standardization in this area.

One suggestion for SUP is not to use a JSON-only protocol. Having an alternative REST/XML version would be advantageous for people who don’t want to put a second parser framework into production.

I’m not sure why REST needs defending, but apparently it does.

Dare steps in and provides a solid background, and Tim follows up.

What is really interesting about REST from my perspective (and not everyone will agree) is that you can actually solve real problems without getting permission from a standards body.

It’s pretty easy to distrust standards bodies. Especially new standards bodies. The RSS wars were a joke. Atom took far too long to become standardized.

At WordCamp this weekend one of the developers was complaining about brain damage in XML-RPC – god only knows why we’re still using this train wreck.

I noted that WordPress should abandon XML-RPC and just use REST. They seem to be headed in that direction anyway (by their own admission).

Their feedback was that they didn’t want to go through the Atom standardization process to extend the format to their specific needs.

You know what? You don’t need permission. Just write a documented protocol taking into consideration current REST best practices and ship it.

If people are using the spec and find value then it will eventually become a standard.

I’m a bit biased of course because Spinn3r is based on REST.

We burn about 40-50Mbit 24/7 indexing RSS and HTML via GET. We have tuned our own protocol stack within our crawler to be fast as hell and just as efficient.

Could I do this with SOAP? No way! I’ve had to gut and rewrite our HTTP and XML support a number of times now, and while it’s not fun, at least with REST it’s possible.

REST is complicated enough as it is… UTF-8 is not as straightforward as you would like, and XML encoding issues do arise in production when you’re trying to squeeze maximum performance out of your code.

… and while REST is easy, we STILL have customers who have problems getting up and running with Spinn3r.

We had to ship a reference client about six months ago that implements REST for our customers directly.

These guys are smart, too… if they’re having problems with REST, then SOAP/XML-RPC would be impossible.

Center Networks thinks it might be a good idea to charge for feeds:

What if blogs and journals offered a full feed for $1 per month with no ads, mobile access, etc. Would you subscribe for a buck? What I am proposing is the following forms of monetization: standard Web site with ads, partial feed with no ads, and a full feed with no ads for $1/month.

So many of the people I speak with daily subscribe to a ton of full feeds and never visit a site after picking up the feed. Some say that feeds strengthen the interactivity with a site because when they read the post, they are more likely to come to the site to comment. Sure, it’s easy to jam an advertisement into a feed, but what if there was another way to provide a revenue stream for a blogger to live off of and for the consumer to enjoy the media knowing they are supporting the content they enjoy?

This is an exceedingly complicated topic.

There’s really no yes or no answer here – I just want to make a few comments.

First. Ads in RSS. If you run ads in your RSS feed, don’t expect to be raking in the cash. The CTR for feed ads is pathetically low. Why? They’re obviously ads, and the community using RSS is very averse to clicking on advertising.

Second. The people you ‘speak with daily’ are not representative of the vast majority of your users. I have about 1500 RSS subscribers, but I think I have about 10k additional users who come to my site via search, Digg, Tailrank, Reddit, etc.

Third. Maybe you could rephrase the question: why not just charge flat-out subscriptions? Other companies like Salon have experimented with this model.

Fourth. What about robots? Humans aren’t the only ones reading your feeds. Robots (like Spinn3r) are used by search engines, analytics companies, etc. If you cut off robots from your full RSS, you’re hurting your SEO and reducing your reach.

Fifth. What about hAtom? What about aggregators that don’t necessarily need RSS?

Sixth. Security. So I pay $1 for a feed. What happens when I add it to Google Reader or Bloglines? I suspect that there might be a bit of information leakage, and other users could accidentally search for and subscribe to the feed without realizing that they’re not paying.

Interesting proposal but it opens Pandora’s box on a number of issues.

Update: Josh chimes in noting that getting your users to pay anything is the biggest challenge.

Haloscan RSS Comments

Haloscan supports RSS for new comments posted to your blog. Great! The only problem is that, apparently, the only person who can use them is the blog owner.

To access the RSS Feed URL:

1. On the main HaloScan site, from the menu on the left, select Manage Comments.
2. On the centre of the page, select the RSS tab (the one with the orange graphic).
3. The RSS Feed URL is now displayed in the address bar of your browser. Copy and paste this into your preferred newsreader.

Well that’s dumb. How does an ordinary human read them?

When Winer Attacks

Winer is at it again, attacking people who don’t necessarily agree with his worldview.

The first time I introduced myself to Dave he attacked me (harshly) for contributing to RSS 1.0.

I think his words were:

“How dare you! Who the hell are you to question me about security?!”

He’s basically done it to everyone.

Everyone in the RSS community has been attacked by Dave. Same thing with XMLRPC/SOAP. Same thing with Blogger. It’s really a bit shocking.

So now Calacanis has a Winer number of 1. If only one could sell their Winer number on eBay.

Attensa sent me a press release via email today which I found a bit upsetting:

Attensa Opens Free Public Beta of First Attention Driven RSS Reader

The new beta version builds on the integrated RSS capabilities in the new version of Microsoft Outlook 2007 by optimizing Outlook performance and adding a complete set of tools for receiving and managing critical business information while helping to cut through information overload with AttentionStream processing. Attensa AttentionStream processing continuously analyzes behaviors as RSS subscriptions are scanned and articles are read. It automatically pulls the articles and feeds the reader finds to be most important to the top.

(emphasis mine)

This is just not true.

NewsMonster was the first RSS reader to add attention- and reputation-based article reprioritization. Rojo had this technology as well, and so did SearchFox and Findory.

I’m sure it’s a cool product and I wish them all the best, but saying it’s the first is overstating things a bit.

Update: Ah. OK. Scott in the comments noted that it’s the first attention-driven RSS reader for MS Outlook. Makes sense.

The subject line in my inbox was:

“Attensa Opens Free Public Beta of First Attention Driven RSS Reader”

but the press release title was:

“Attensa Opens Free Public Beta of First Attention Driven RSS Reader for Microsoft® Outlook 2007”

hm.

Why is it that Technorati has bugs which have been known for years, yet they never seem to get around to fixing them?

It’s the only search engine I know of that’s nondeterministic. How’d they pull that off? Search one minute and you’ll get “no results”; search 30 seconds later and you get 150 results. Fun.

I just did a link cosmos search for my blog and I get this (see below), which is a clear example of the problems they have. Two posts show up twice (for a total of four items). Duplicate content penalty, guys!

Update:

Both of these blogs use the rel-bookmark microformat. Ouch.

[screenshot: duplicated Technorati link cosmos results]

About two weeks ago I posted a story with a new and unique word that I invented which would allow me to find a specific post being re-aggregated and indexed within Google. This would lead to a duplicate content penalty and hurt my SEO.

The results are in, and the problem is far worse than I expected.

For starters, I don’t think the term “theft” is anywhere near appropriate but I can’t think of a better term. I have no problem syndicating my content but I don’t want these sites republishing the full story which would then be indexed by Google. They need to use robots.txt to block these pages.
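The fix on their end is two lines of robots.txt. If a reaggregator republishes items under some path (the path here is hypothetical), this keeps the copies out of Google’s index:

User-agent: *
Disallow: /reader/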

I’d really appreciate it if you guys could help me come up with a better term.

So who are the big troublemakers? Let’s see…

Rojo (ouch) is number one. Findory is next, followed by my category feed for the ‘Google’ tag on my blog (I’m going to have to fix that). Next is Jordo Media (never heard of these guys), followed by Kinja, myFeedz, feeds4all, Findory Blogs, and then Informatory.

Oddly enough the permalink for my post doesn’t show up at all. It might be pushed out by my category feed.

It would be nice to have a noindex meta tag that I could use within my full-content feed but this isn’t possible since it has to go in the head of the generated page.
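For reference, the standard tag the republishers themselves would have to emit in the head of each generated page is:

<meta name="robots" content="noindex" />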

I’ve attached a full copy of the Google result for historical purposes.

[screenshot: Google results for the experiment]

Is it possible that the portalsphere could flatten by 2010?

… we believe that the Internet is moving away from big centralized portals, which have gathered the lion’s share of Internet traffic, towards a pattern where traffic is generally much flatter. The mountains, if you will, continue to exist. But the foothills advance and take up more of the overall pie. Fred Wilson had a post earlier this week about the de-portalization of the Internet which is essentially making the same point when seen from the point of view of Yahoo.

I certainly believe the space will flatten out a great deal. I think the big opportunity is for smaller publishers to flatten the space. I think this is what Keith is probably focusing on. His quote of $180k per month seems to come from the publisher angle.

I’m not sure I agree that the portals will vanish. We’ve always seen a power law curve, and I believe this is a natural state of the Internet. It may become less severe over time (and certainly this is a good thing), but even in the blogosphere (which is much more open) we’re seeing a power law distribution.

I sat down with Robert Scoble about a week ago to talk about Tailrank for the Scoble Show:

Kevin Burton is a talented developer who has worked on a variety of startups already including Rojo, and now TailRank which he started to be able to see what bloggers were talking about. Here I sit down with him for an interesting conversation in the lobby of San Francisco’s Palace Hotel.

I think the interview turned out pretty well. The only mistake I made was that I left my cell phone on which is a slight problem. Luckily no one else called during the interview (sorry Robert).

I also gave a demo of Tailrank. Unfortunately, the realtime IM delivery feature actually worked right after they shut off the camera. It was pretty amazing, actually. Our crawler found a post on that topic right after I subscribed to the meme.

OK guys, time to invent a new word. Ready? Infinitepossum. Cool, huh?

An infinitepossum is a little animal that helps you find sites which steal/borrow your content and turn around and have it indexed by Google. Most of the major RSS aggregators are nice enough to set up robots.txt so that you don’t get hit with a duplicate content penalty.

Just go ahead and search for infinitepossum and in theory this should be the only post shown.

Some of these sites are clearly stealing. Bitacle is one example. These guys are evil: they stole an old version of the Netvibes code and are now stealing people’s content as well. But a lot of these sites are just legit feed readers that don’t realize this has become a problem. Hopefully this little experiment of mine will help them correct the error of their ways.

The only problem is that the infinitepossum only works once and then dies. Kind of like a butterfly, I guess.

Also… if you want to link to this post PLEASE DON’T USE THE WORD INFINITEPOSSUM so that we don’t taint the results.

This should be fun!

We just finished releasing Tailrank 2.0. Man, this has been a lot of work. Seriously, the last six months were pretty heads-down.

I’m probably going to have to organize a party now since we’ve never really had one.

Niall told me tonight that the CIA recommends using ROME for RSS/Atom parsing. I doubt they even had a chance to evaluate Feedparser since I’ve moved it to a new home and have generally done a horrible job at being a good project manager.

That said, I imagine they’re using ROME to parse RSS feeds they get from AT&T for spying on our phone calls (thanks AT&T).

Good thing they don’t run Feedparser. I might be tempted to add a backdoor so I can do some spying on them for a change! (ha)

I’ve been a big fan of Six Apart for a few years now. Not only do they have a great blogging service (and Vox seems poised to take over the world) but they just acquired Rojo as well.

Six Apart will be issuing a press release on the subject and I’ll let them give you all the juicy details once that’s available.

In the meantime, Om Malik notes:

Blogging company Six Apart will soon announce it has purchased Rojo, the web-based feed reader, for undisclosed terms.

Six Apart won’t be adding an aggregator based on Rojo, but instead incorporating some elements of the technology into its existing products, according to Six Apart CEO Barak Berkowitz. Rojo CEO Chris Alden will run Six Apart’s Movable Type group.

Niall Kennedy comments:

Blogging company Six Apart has acquired online feed aggregator Rojo Networks. Rojo will be integrated with the Vox blogging tool allowing users to browse updated content and create more blog posts. Rojo CEO Chris Alden will be the new head of Movable Type according to a GigaOm report.

I helped co-found Rojo almost three years ago to build a killer online RSS aggregation service. Literally. Before we had a name for Rojo we called it the KSA (Killer Server-side Aggregator). Rojo led the RSS space in a number of key areas, including mobile support, feed search, and integrated social networking.

For the last year I’ve been independent (working on Tailrank actually) but still remained involved in an advisory capacity.

In hindsight, I don’t think Rojo was ever given the credit it deserved. Feed search in particular. In fact, earlier this year when Ask/Bloglines released their feed search, it was pointed out that Rojo had been doing the same thing for months.

Six Apart has big plans for Rojo. They’re going to take Rojo’s RSS infrastructure and build it into LiveJournal and Vox which sounds pretty interesting. You can bet I’ll be paying attention…

Luckily, Rojo was located in blogger gulch (AKA SOMA) in San Francisco which is also the home of Technorati and Feedster. The employees literally only have two extra blocks to commute to their new offices.

Best of luck on the new gig guys!

Update:

TechCrunch has a few notes:

Terms of the deal were not disclosed, but our assumption was that this was a less than $5 million deal. Six Apart is not planning on continuing to build out the core Rojo products. In the press release (sorry, no link available yet), Six Apart says “Six Apart intends to sell a majority interest in Rojo’s newsreader services in the coming months,” meaning they will become a minority stockholder of the service. Rojo founder and CEO Chris Alden and CTO Aaron Emigh will join Six Apart’s executive team.

… and so does ValleyWag:

We hear GigaOM founder Om Malik heard about this deal when he saw Alden and 6A CEO Barak Berkowitz outside 6A’s office.

Update 2:

Six Apart finally issues a press release:

San Francisco, CA —September 6, 2006—Six Apart, the world leader in blogging software and services, today announced that it had acquired Rojo Networks for an undisclosed sum. Rojo senior executives Chris Alden and Aaron Emigh joined the Six Apart team as executive vice president and general manager of Movable Type, and executive vice president and general manager of core technologies, respectively. Six Apart intends to sell a majority interest in Rojo’s newsreader services in the coming months.

Update 3:

You can follow this over on Tailrank… For some reason it picked up Valleywag twice. I’m going to have to fix that.

OK lazyweb – I need some advice.

I want to configure Tailrank so that when an RSS aggregator fetches http://rss.tech.tailrank.com it is actually internally proxied to http://feeds.feedburner.com/TailrankTopStories.

I thought the following mod_rewrite rule would work:


RewriteCond %{SERVER_NAME} ^rss.tech.tailrank.com$
RewriteCond %{HTTP_USER_AGENT} !^FeedBurner.*
RewriteRule ^/$ http://feeds.feedburner.com/TailrankTopStories [P]

This basically says: “if anyone but FeedBurner fetches rss.tech.tailrank.com make them internally use a proxy server as a reverse proxy to fetch the content.”

I was thinking that this way all my existing clients would just transparently start using the new Feedburner URL and I’d still get to control my URL space.

The only problem is that it doesn’t work.

What it does is use feeds.feedburner.com as an HTTP proxy and then issues

GET http://rss.tech.tailrank.com HTTP/1.0

which isn’t what I wanted. That URL of course will 404…

Is there any way to do this or is it just impossible?
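My best guess, for anyone else trying this: the [P] flag only behaves as a reverse proxy when mod_proxy and mod_proxy_http are both loaded and forward proxying is switched off. An untested sketch:

# Assumes mod_rewrite, mod_proxy, and mod_proxy_http are all loaded.
ProxyRequests Off

RewriteEngine On
RewriteCond %{HTTP_HOST} ^rss\.tech\.tailrank\.com$ [NC]
RewriteCond %{HTTP_USER_AGENT} !^FeedBurner
RewriteRule ^/?$ http://feeds.feedburner.com/TailrankTopStories [P,L]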

RSS Won’t Go Mainstream

Scoble thinks RSS is going to explode:

Next year IE 7 ships with an RSS aggregator. Last week Maryam started using RSS for the first time.

Why is RSS usage going to continue to double? Influencers are doing it. As long as the cool kids who go to FOOcamp keep using RSS the rest of us will start catching on and doing it too. Just watch.

I don’t think so. RSS is great and all, but I just don’t think very many people subscribe to as many news sources as the alpha geeks do. I think they want products like Google News, CNN, or memetrackers like Tailrank to tell them what to read.

Want to see the way forward? It isn’t RSS. Take a look at Vox. The aggregated view page is huge! River of news for your friends. That’s what people are going to be using in the future.

Either that or a direct delivery service like feedcrier.

I’d love to be proven wrong of course…

Well I had a productive evening!

Check this out: I finally fixed a long-standing bug with HTTP caching in Tailrank.

It turns out that Mozilla/Firefox sometimes specifies Cache-Control: no-cache on HTTP requests. Normally this would force the server to refresh the page on behalf of the client. The only problem is that this was kicking in for all HTTP requests and never returning from the cache.

This is obviously a bug in Apache 2.2.
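Bug or by design, if this is mod_cache there’s a directive aimed at exactly this case. It tells the cache to keep serving cached content even when the client sends no-cache:

# Apache 2.2 mod_cache: don't let a client's Cache-Control: no-cache bypass the cache
CacheIgnoreCacheControl On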

I also took the liberty of updating the HTML page to load the right sidebar after the main content, to give a slight perception of faster loading on the client. Eakes came up with this technique while we were at Rojo. You don’t actually load the page faster – you just make it seem faster by loading the primary content first.

The results below speak for themselves (thanks, GrabPerf!)

[GrabPerf latency graph]

Update:

I’m still confused by the results, though. There shouldn’t be any latency for pages served from the cache; these results are showing 200-600ms when it should be near 0ms.

It turns out there’s another bug where the cache can only store one type of encoding. If GrabPerf isn’t requesting gzip encoding, it will trigger a cache flush (same with Nagios).

I want to migrate the main Tailrank RSS feeds to use Feedburner. Not only do they rule (hey guys!) but I can make a few bucks running CPM ads in my feeds.

The only problem is that I can’t use my feeds because of this error message:

Your feed filesize is larger than 256K. You need to reduce its size in order for FeedBurner to process it. Tips for controlling feed file size with Blogger can be found in Tech Tips on FeedBurner Forums, our support site.

My RSS feed is 384K, but only 56K once gzip compression kicks in. I really think they ought to double this value; 256K is pretty small.

I could reduce the size of my feed, but I need a window of about 30 items because of the way the ranking of posts changes within Tailrank each day. I could drop the clustering in the RSS feed, but that loses a lot of the advantage of Tailrank. Hm.

Tailrank’s custom memetracker support does a pretty good job of helping me filter down my reading list.

I’m also using NetNewsWire’s “sort by attention” feature, which is pretty sweet as well.

I’m still struggling with the number of RSS feeds I have to read.

Today it dawned on me that not only could I “sort by attention”, I could also sort by lack of attention. There are feeds that I’m constantly just marking as read because none of the subjects grab me. They just keep posting stories that I decide not to read.

In reputation design this would be called a negative implicit certification. They’re generally hard to record, but in NetNewsWire a “mark all read” would accomplish the same thing.
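Here’s a sketch of how a reader could record both signals (the data model is entirely hypothetical; one score per feed):

from collections import defaultdict

# Positive implicit certification: you actually read an item.
# Negative implicit certification: you hit "mark all read" on a pile of
# items you never opened.
scores = defaultdict(float)

def on_item_read(feed_id):
    scores[feed_id] += 1.0

def on_mark_all_read(feed_id, unread_count):
    scores[feed_id] -= 0.1 * unread_count

def feeds_by_lack_of_attention(feed_ids):
    # Ascending score: the feeds you keep skipping float to the top,
    # i.e. the candidates for unsubscribing.
    return sorted(feed_ids, key=lambda f: scores[f])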