Persai’s RSS Crawl and Topix

Looks like Rich was playing with the Persai tar.gz web crawl they posted the other day.

I got a sinking feeling as I read this. I had curl’d over the corpus already to eyeball it …yeah that’s a list of feeds all right… but hadn’t tallied the domains…

$ sed -e 's/^http:..//' -e 's/\/.*$//' persai_feedcorpus | count | head

35695 rss.topix.net
14613 izynews.de
2831 feeds.feedburner.com
1869 p.moreover.com
1314 www.livejournal.com
1241 rss.groups.yahoo.com
1191 www.discountwatcher.com
1096 news.bbc.co.uk
1072 www.alibaba.com
882 xml.newsisfree.com
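
For anyone following along without a `count` helper on their path, here is a rough equivalent built from standard tools. It is only a sketch: it assumes `count` tallies duplicate lines and sorts them by frequency, that the corpus has one feed URL per line, and `tally_domains` is a made-up name for illustration.

```shell
# Tally feed URLs by host, most frequent first.
# Assumption: input file has one URL per line.
# "tally_domains" is a hypothetical helper, not part of any tool.
tally_domains() {
  # Strip the scheme, then everything from the first slash onward,
  # leaving just the host name.
  sed -e 's|^https\{0,1\}://||' -e 's|/.*$||' "$1" |
    sort | uniq -c | sort -rn |
    # Normalize the padded output of uniq -c to "count host".
    awk '{ print $1, $2 }'
}
```

Running `tally_domains persai_feedcorpus | head` should give a top-ten list in the same shape as the one above.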

Anyone reading my blog know the guys over at Persai?

Update:

Sam Ruby posts an HTTP response code analysis of the corpus.

Of course, we have internal response code stats, but broken feeds don’t make it into Spinn3r.


