Researchers Using Spinn3r

We’ve been providing researchers with access to Spinn3r for more than two years, and the results are really starting to land.

We’re sponsoring ICWSM this year with a 400GB snapshot. It is being used by more than 100 research groups, about 500 researchers in total.

There should be a few dozen more papers published in the next few weeks, but I wanted to highlight these now as they are already live.

Specifications and Architectures of Federated Event-Driven Systems

Specifying the Personal Information Broker Data Acquisition: Data can be acquired from multiple sources – currently we use Spinn3r, later we will also acquire IEM, Twitter, Technorati, etc. Each of these acquisitions is specified differently. Acquisition of Spinn3r data, referenced in Figure 3 step 1, is achieved through changing URL arguments in a manner defined by Spinn3r. Thus, the specification is unique to Spinn3r. While that particular specification cannot be reused, using the compositional approach, exchanging Spinn3r for Twitter, a news feed, or an instant messaging account while maintaining the integrity of the composition is trivial. The specifications for all of these information interfaces are very different; a notation that allows the description of composite applications must account for this.

Blogs as Predictors of Movie Success

In this work, we attempt to assess if blog data is useful for prediction of movie sales and user/critics ratings. Here are our main contributions:

• We evaluate a comprehensive list of features that deal with movie references in blogs (a total of 120 features) using the full blog data set for 12 months.

• We find that aggregate counts of movie references in blogs are highly predictive of movie sales but not predictive of user and critics ratings.

• We identify the most useful features for making movie sales predictions using correlation and KL divergence as metrics and use clustering to find similarity between the features.

• We show, using time series analysis as in (Gruhl, D. et al. 2005), that blog references generally precede movie sales by a week and thus weekly sales can be predicted from blog references in the preceding weeks.

• We confirm low correlation between blog references and first week movie sales reported by (Mishne, G. et al. 2006) but we find that (a) budget is a better predictor for the first week; (b) subsequent weeks are much more predictive from blogs (with up to 0.86 correlation).
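The claim that blog references lead sales by about a week can be illustrated with a lagged correlation. The sketch below is not the paper's method or data: the function names and the weekly counts are made up for illustration, and plain Pearson correlation stands in for whatever time series analysis the authors actually ran.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(blog_refs, sales, lag):
    """Correlate blog references in week t with sales in week t + lag."""
    if lag < 0 or lag >= len(sales):
        raise ValueError("lag must be between 0 and len(sales) - 1")
    if lag == 0:
        return pearson(blog_refs, sales)
    return pearson(blog_refs[:-lag], sales[lag:])


def best_lag(blog_refs, sales, max_lag):
    """Return the lag (in weeks) with the highest correlation, plus all scores."""
    scores = {lag: lagged_correlation(blog_refs, sales, lag)
              for lag in range(max_lag + 1)}
    return max(scores, key=scores.get), scores


# Hypothetical weekly counts: sales roughly follow blog references one week later.
weekly_blog_refs = [10, 40, 90, 60, 30, 20, 15, 10]
weekly_sales = [5, 12, 41, 88, 63, 29, 22, 14]
lag, scores = best_lag(weekly_blog_refs, weekly_sales, max_lag=2)
```

With these invented numbers the one-week lag scores highest, which is the pattern the bullet above describes; real data would be far noisier.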

Data and Features

The data set we used for this paper is the blog data set from Nov. 2007 until Nov. 2008. This data set includes practically all the blog posts published on the web in this period (approximately 1.5 TB of compressed XML).

Blogvox2: A modular domain independent sentiment analysis system

Bloggers make a huge impact on society by representing and influencing the people. Blogging by nature is about expressing and listening to opinion. Good sentiment detection tools for blogs and other social media, tailored to politics, can be useful for today’s society. With the elections around the corner, political blogs are vital to exerting and keeping political influence over society. Currently, no sentiment analysis framework tailored to political blogs exists. Hence, a modular framework built with replicable modules for the analysis of sentiment in political blogs is justified.

Spinn3r provided us with a live, spam-resistant, high-performance spider dataset. We tested our framework on this dataset because it consisted of live feeds and we wanted to measure the performance of our sentiment analysis on it. We periodically pinged the online API for the current dataset of all the RSS feeds. Although different domains were provided to us, we chose the political domain for consistency with our other results.

Meme-tracking and the Dynamics of the News Cycle

Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days — the time scale at which we perceive news and events.

Dataset description. Our dataset covers three months of online mainstream and social media activity from August 1 to October 31, 2008, with about 1 million documents per day. In total it consists of 90 million documents (blog posts and news articles) from 1.65 million different sites that we obtained through the Spinn3r API [27]. The total dataset size is 390GB and essentially includes complete online media coverage: we have all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites. From the dataset we extracted a total of 112 million quotes and discarded those with L < 4, M < 10, and those that fail our single-domain test with ε = .25. This left us with 47 million phrases out of which 22 million were distinct. Clustering the phrases took 9 hours and produced a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together included 94,700 nodes (phrases).
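The filtering step in that excerpt can be sketched roughly as follows. The excerpt does not define its symbols, so this assumes L is the number of words in a phrase, M is the number of times the phrase occurs, and the single-domain test discards a phrase when at least a fraction ε of its occurrences come from one site. The function and parameter names are hypothetical, not taken from the paper.

```python
from collections import Counter


def keep_phrase(phrase, domains, min_words=4, min_mentions=10, eps=0.25):
    """Decide whether a quoted phrase survives the three filters.

    phrase  -- the quoted text
    domains -- one site domain per occurrence of the phrase
    Assumed reading of the excerpt: drop phrases shorter than min_words (L),
    seen fewer than min_mentions times (M), or with at least an eps share
    of their occurrences on a single domain.
    """
    if len(phrase.split()) < min_words:
        return False
    if not domains or len(domains) < min_mentions:
        return False
    top_count = Counter(domains).most_common(1)[0][1]
    return top_count / len(domains) < eps


# Hypothetical occurrences: 12 mentions spread over 12 different sites.
spread_out = [f"site{i}.example" for i in range(12)]
# Hypothetical occurrences: 12 mentions, half of them from one site.
concentrated = ["one.example"] * 6 + [f"site{i}.example" for i in range(6)]

kept = keep_phrase("lipstick on a pig", spread_out)
dropped = keep_phrase("lipstick on a pig", concentrated)
```

The defaults mirror the thresholds quoted in the excerpt (4, 10 and .25), so a phrase needs at least four words, at least ten mentions, and no single site accounting for a quarter or more of them.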
