Spinn3r Sponsors 2009 International Conference for Weblogs and Social Data Challenge

Spinn3r is sponsoring the International Conference for Weblogs and Social Media this year with a snapshot of our index.

The data set is designed for researchers who want to build cool and interesting applications on top of it.

Good research topics might include…

  • link analysis
  • social network extraction
  • tracing the evolution of news
  • blog search and filtering
  • psychological, sociological, ethnographic, or personality-based studies
  • analysis of influence among bloggers
  • blog summarization and discourse analysis

A number of researchers at top universities already use us. Textmap (which presented at ICWSM last year) just migrated to Spinn3r, and Blogs Cascades has been using us for a while now.

The data set is pretty large: 142GB uncompressed (27GB compressed). But you need a solid chunk of data to perform interesting research.

The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog’s homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers that approximate, to some degree, search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).

This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; …) as well as everything else you might expect to find posted to blogs.

To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form has been processed, you will be sent a URL and password where you can download the collection.

Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
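With a collection this size you probably don't want to load whole files into memory. Here is a rough sketch of how you might stream through one file a post at a time in Python. The element names (`item`, `title`, `link`) and the tiny inline sample are assumptions made up for illustration, not the actual Spinn3r schema, so check the format documentation and adjust the tag names before using it.

```python
import io
import xml.etree.ElementTree as ET


def iter_posts(source, post_tag="item"):
    """Yield one dict per post element, releasing each element after use.

    `post_tag` is a hypothetical element name; swap in whatever the
    real format uses for a single blog post.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == post_tag:
            # Collect the direct children (title, link, ...) as plain text.
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop the element's children so memory use stays flat.
            elem.clear()


# Made-up sample standing in for one file of the collection.
SAMPLE = b"""<dataset>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</dataset>"""

posts = list(iter_posts(io.BytesIO(SAMPLE)))
for post in posts:
    print(post["title"], post["link"])
```

For a real file, pass an open file object (or a path) to `iter_posts` instead of the in-memory sample.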

  1. The post includes the text as syndicated

    Are you not correlating the feed with the web version for full content extraction? Or are you purposely limiting the data set to content in a web feed?
