Researchers Using Spinn3r
We’ve been providing researchers with access to Spinn3r for more than two years now, and the results are really starting to land.
We’re sponsoring ICWSM this year with a 400GB snapshot. This is being used by more than 100 research groups of about 500 total researchers.
There should be a few dozen more papers published in the next few weeks but I wanted to highlight these now as they are already live.
Specifying the Personal Information Broker Data Acquisition: Data can be acquired from multiple sources – currently we use Spinn3r; later we will also acquire IEM, Twitter, Technorati, etc. Each of these acquisitions is specified differently. Acquisition of Spinn3r data, referenced in Figure 3 step 1, is achieved through changing URL arguments in a manner defined by Spinn3r. Thus, the specification is unique to Spinn3r. While that particular specification cannot be reused, using the compositional approach, exchanging Spinn3r for Twitter, a news feed, or an instant-messaging account while maintaining the integrity of the composition is trivial. The specifications for all of these information interfaces are very different; a notation that allows the description of composite applications must account for this.
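To make the “changing URL arguments” idea concrete, here’s a rough sketch of what such an acquisition step looks like. This is not the actual Spinn3r API specification — the endpoint and parameter names (`vendor`, `after`, `limit`) are illustrative assumptions standing in for whatever a given source defines:

```python
# Hypothetical sketch of URL-argument data acquisition.
# Endpoint and parameter names are assumptions, not the real Spinn3r spec.
from urllib.parse import urlencode

def build_request_url(base_url, api_key, after_timestamp, limit=100):
    """Build a request URL by changing query arguments, as the paper describes."""
    params = {
        "vendor": api_key,         # assumed authentication parameter
        "after": after_timestamp,  # assumed pagination cursor
        "limit": limit,            # assumed page size
    }
    return base_url + "?" + urlencode(params)

url = build_request_url("http://api.example.com/feed", "MY_KEY", 1196467200)
print(url)  # → http://api.example.com/feed?vendor=MY_KEY&after=1196467200&limit=100
```

The point the paper makes is that this specification is source-specific: swapping in Twitter or a news feed means swapping this one function, while the rest of the composition stays intact.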
In this work, we assess whether blog data is useful for predicting movie sales and user/critic ratings. Our main contributions are:
• We evaluate a comprehensive list of features that deal with movie references in blogs (a total of 120 features) using the full spinn3r.com blog data set for 12 months.
• We find that aggregate counts of movie references in blogs are highly predictive of movie sales but not predictive of user and critics ratings.
• We identify the most useful features for making movie sales predictions using correlation and KL divergence as metrics and use clustering to find similarity between the features.
• We show, using time series analysis as in (Gruhl, D. et al. 2005), that blog references generally precede movie sales by a week and thus weekly sales can be predicted from blog references in the preceding weeks.
• We confirm the low correlation between blog references and first-week movie sales reported by (Mishne, G. et al. 2006), but we find that (a) budget is a better predictor for the first week; (b) subsequent weeks are much more predictable from blogs (with up to 0.86 correlation).
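The lagged-correlation finding is easy to illustrate. Here’s a minimal sketch — not the paper’s code, and the weekly counts below are made up — of the kind of check it describes: do blog references in week t correlate with sales in week t+1?

```python
# Illustrative lagged-correlation check; the data is invented for the example.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def lagged_correlation(blog_refs, sales, lag):
    """Correlate blog references at week t with sales at week t + lag."""
    if lag > 0:
        return pearson(blog_refs[:-lag], sales[lag:])
    return pearson(blog_refs, sales)

blog_refs = [120, 340, 800, 650, 300, 150]   # hypothetical weekly reference counts
sales     = [1.0, 2.5, 6.0, 9.0, 7.0, 3.5]   # hypothetical weekly sales ($M)

print(lagged_correlation(blog_refs, sales, lag=1))
```

Sweeping the lag and picking the one with the highest correlation is the usual way such a one-week lead would be detected.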
Data and Features
The data set we used for this paper is the spinn3r.com blog data set from Nov. 2007 until Nov. 2008. This data set includes practically all the blog posts published on the web in this period (approximately 1.5 TB of compressed XML).
Bloggers make a huge impact on society by representing and influencing the people. Blogging is by nature about expressing and listening to opinion. Good sentiment-detection tools for blogs and other social media, tailored to politics, can be useful for today’s society. With elections around the corner, political blogs are vital to exerting and keeping political influence over society. Currently, no sentiment analysis framework tailored to political blogs exists; a modular framework built with replicable modules for the analysis of sentiment in political blogs is thus justified.
Spinn3r (http://tailrank.com) provided us with a live, spam-resistant, high-performance spider dataset. We tested our framework on this dataset because it consists of live feeds, which let us evaluate the performance of our sentiment analysis. We periodically pinged the online API for the current set of all the RSS feeds. Although several domains were available to us, we chose the political domain for consistency with our other results.
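The periodic “pinging” they describe is essentially a polling loop with de-duplication. Here’s an illustrative sketch — not the authors’ code, and the `fetch` callable and post fields are assumptions — of that pattern:

```python
# Illustrative polling loop with de-duplication by permalink.
# The fetch() callable and the post dict shape are assumptions.
import time

def poll_feed(fetch, seen, interval_s=60, rounds=3):
    """Repeatedly call fetch() and keep only posts not seen before."""
    new_posts = []
    for _ in range(rounds):
        for post in fetch():
            if post["permalink"] not in seen:
                seen.add(post["permalink"])
                new_posts.append(post)
        time.sleep(interval_s)
    return new_posts

# Simulated feed: consecutive polls overlap, as live feeds typically do.
batches = iter([
    [{"permalink": "a"}],
    [{"permalink": "a"}, {"permalink": "b"}],
    [{"permalink": "b"}],
])
posts = poll_feed(lambda: next(batches), seen=set(), interval_s=0)
print(len(posts))  # → 2 (each permalink is returned once)
```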
Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days — the time scale at which we perceive news and events.
Dataset description. Our dataset covers three months of online mainstream and social media activity from August 1 to October 31, 2008, with about 1 million documents per day. In total it consists of 90 million documents (blog posts and news articles) from 1.65 million different sites that we obtained through the Spinn3r API. The total dataset size is 390GB and essentially includes complete online media coverage: we have all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites. From the dataset we extracted 112 million quotes in total and discarded those with L < 4, M < 10, and those that fail our single-domain test with ε = .25. This left us with 47 million phrases, of which 22 million were distinct. Clustering the phrases took 9 hours and produced a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together included 94,700 nodes (phrases).
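The filtering step they describe is simple to sketch. This is not the paper’s code — in particular, the single-domain test below is one plausible reading (discard a phrase if more than an ε fraction of its occurrences come from a single site, which suggests spam or boilerplate):

```python
# Sketch of the phrase-filtering thresholds described above:
# word count L >= 4, occurrence count M >= 10, single-domain share <= 0.25.
# The interpretation of the single-domain test is an assumption.
def keep_phrase(phrase, occurrences, eps=0.25, min_words=4, min_count=10):
    """occurrences: list of domains where the phrase appeared, one per occurrence."""
    L = len(phrase.split())
    M = len(occurrences)
    if L < min_words or M < min_count:
        return False
    top_share = max(occurrences.count(d) for d in set(occurrences)) / M
    return top_share <= eps

occ = ["a.com"] * 2 + ["b.com", "c.com", "d.com", "e.com", "f.com",
       "g.com", "h.com", "i.com"]  # 10 occurrences, largest share 0.2
print(keep_phrase("our main man in washington today", occ))  # → True
print(keep_phrase("too short", occ))                         # → False (L < 4)
```

Applied across 112 million quotes, filters like these are what cut the set down to the 47 million phrases the paper then clusters.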