Papers using Spinn3r Data from ICWSM 2009

Additional papers based on the Spinn3r/ICWSM dataset have been published. It seems I have a lot of reading to do!

Flash Floods and Ripples: The Spread of Media Content through the Blogosphere

This paper is based on the Spinn3r data set (ICWSM 2009), which consists of web feeds collected during a two month period in 2008. The data set includes posts from blogs as well as other data sources like news feeds. We discuss our methodology for cleaning up the data and extracting posts of popular blog domains for the study. Because the Spinn3r data set spans multiple blog domains and language groups, this gives us a unique opportunity to study the link structure and the content sharing patterns across multiple blog domains. For a representative type of content that is shared in the blogosphere, we focus on videos of the popular web-based broadcast media site, YouTube.

Our analysis, based on 8.7 million blog posts by 1.1 million blogs across 15 major blog hosting sites, reveals a number of interesting findings. First, the network structure of blogs shows a heavy-tailed degree distribution, low reciprocity, and low density. Although the majority of the blogs connect only to a few others, certain blogs connect to thousands of other blogs. These high-degree blogs are often content aggregators, recommenders, and reputed content producers. In contrast to other online social networks, most links are unidirectional and the network is sparse in the blogosphere. This is because links in social networks represent friendship where reciprocity and mutual friends are expected, while blog links are used to reference information from other data sources.

Identifying Personal Stories in Millions of Weblog Entries

Stories of people’s everyday experiences have long been the focus of psychology and sociology research, and are increasingly being used in innovative knowledge-based technologies. However, continued research in this area is hindered by the lack of standard corpora of sufficient size and by the costs of creating one from scratch. In this paper, we describe our efforts to develop a standard corpus for researchers in this area by identifying personal stories in the tens of millions of blog posts in the ICWSM 2009 Spinn3r Dataset. Our approach was to employ statistical text classification technology on the content of blog entries, which required the creation of a sufficiently large set of annotated training examples. We describe the development and evaluation of this classification technology and how it was applied to the dataset in order to identify nearly a million personal stories.

In this paper, we describe our efforts to overcome the limitations of our previous story collection research using new technologies and by capitalizing on the availability of a new weblog dataset. In 2009, the 3rd International AAAI Conference on Weblogs and Social Media sponsored the ICWSM 2009 Data Challenge to spur new research in the area of weblog analysis. A large dataset was released as part of this challenge, the ICWSM 2009 Spinn3r Dataset (ICWSM, 2009), consisting of tens of millions of weblog entries collected and processed by, a company that indexes, interprets, filters, and cleanses weblog entries for use in downstream applications. Available to all researchers who agree to a dataset license, this corpus consists of a comprehensive snapshot of weblog activity between August 1, 2008 and October 1, 2008. Although this dataset was described as containing 44 million weblog entries when it was originally released, the final release of this dataset actually consists of 62 million entries in’s XML format.

SentiSearch: Exploring Mood on the Web

Given an accurate mood classification system, one might imagine it to be simple to configure the classifier as a search filter, thus creating a mood-based retrieval system. However, the challenge lies in the fact that in order to classify the mood for a potential result, the entire content of that page must be downloaded and analyzed. Much like a typical web-based retrieval system, to avoid this cost, pages could be crawled and their mood indexed along with the representation stored for search indexing. Alternatively, the presence of a massive dataset from enabled the ESSE system to be built, performing mood classification and result filtering on the fly (Burton et al. 2009). Because the dataset (including textual content), search system, and mood classification system all exist on the same server, the filtering retrieval system was made possible. The dataset not only allows access to the content of a blog post (beyond the summary and title typically made available through search APIs) but the closed nature of the dataset allows for experimentation while still being vast enough to provide breadth and depth of topical coverage.

Event Intensity Tracking in Weblog Collections

The data provided for ICWSM 2009 came from a weblog indexing service Spinn3r ( This included 60 million postings spanned over August and September 2008. Some meta-data is provided by Spinn3r.

Each post comes with Spinn3r’s pre-determined language tag. Around 24 million posts are in English, 20 million more are labeled as ‘U’, and the remaining 16 million are comprised of 27 other languages (Fig. 3). The languages are encoded in ISO 639 two-letter codes (ISO 639 Codes, 2009). Other popular languages include Japanese (2.9 million), Chinese/Japanese/Korean (2.7 million) and Russian (2.5 million). The second largest label is U unknown. This data could potentially hold posts in languages not yet seen or posts in several languages. Our present work, including additional dataset analysis presented next, is limited to the English posts unless otherwise specified. In future work we plan to also consider other languages represented in the dataset.

Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network

Our research is the first attempt to give an accurate measure for the level of information propagation. This paper presents ‘SugarCube’, a model designed to tackle part of this problem by offering a mathematically precise solution for the quantification of the level of topic propagation. The paper also covers the application of SugarCube in the analysis of the social network structure of the ICWSM/Spinn3r dataset (ICWSM 2009). It presents threshold values for the communities found within the collection, and paves the way for the measurement of topic propagation within those communities. Not only can SugarCube quantify the proliferation level of topics, but it also helps to identify ‘heavily-propagated’ or Global topics. This novel approach is inspired by Percolation Theory and its application in Physics (Efros 1986).

%d bloggers like this: