Archive for the ‘spinn3r’ Category

This is pretty nice. Google released Snappy (known internally as Zippy) as open source:

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems. (Snappy has previously been referred to as “Zippy” in some presentations and the likes.)

This means that along with open-vcdiff it is possible to use the full Google compression tool chain.
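For Java folks, using Snappy is basically a one-liner in each direction. A quick sketch, assuming the org.xerial snappy-java binding (a separate project from the C++ library Google released):

    import org.xerial.snappy.Snappy;

    public class SnappyExample {
        public static void main(String[] args) throws Exception {
            byte[] input = "some blog post content...".getBytes("UTF-8");

            // compress() trades compression ratio for speed; output is larger
            // than zlib's, but both directions are very fast.
            byte[] compressed = Snappy.compress(input);

            // Round-trip back to the original bytes.
            byte[] restored = Snappy.uncompress(compressed);

            System.out.println(input.length + " -> " + compressed.length
                + " bytes; round-trip matches: "
                + new String(restored, "UTF-8").equals(new String(input, "UTF-8")));
        }
    }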

There’s a middle path here. You can go with someone like Softlayer or Rackspace and have your cake and eat it too.

Softlayer is a bit closer to being the cloud. We love them. Great company. Major partner for us… we’re going to be doubling down on servers this year and they’re going to get another big order from us.

This was the best decision I’ve made regarding Spinn3r, I think. We gave this decision a lot of thought and were going to colo, but at the last minute I felt that colo was just a bad call and we went with Softlayer instead.

Win!

 

Facebook CTO Bret Taylor says buying servers was a mistake. A very big mistake. At the time, he was chief executive at FriendFeed, which eventually was sold to Facebook for the tidy sum of a reported $50 million. But these were the early days. He and his team needed to decide between buying servers or using Amazon Web Services. They bought the servers.

[From Facebook CTO Bret Taylor’s Biggest Mistake? Buying Servers – ReadWriteCloud]

 

We’re hiring an API Software Engineer to join the team over at Spinn3r.

We’re probably going to be hiring 2-3 engineers in the next month or so but don’t want to grow too fast. We want to focus on one position at a time so we can bring in the best potential hires.

This is a fun time to work in a startup though!

Job Description

Responsibilities:

  • Interact with customers both in the early sales cycle and in a support role to answer technical questions about our technology (crawling, ranking, etc).
  • Work with our API to understand throughput issues and protocol challenges, and optimize it for new issues as they arise.
  • Develop new versions of our API as it evolves (more throughput, additional features, etc).
  • Monitor our crawler stats to enable understanding of operation and detect operational anomalies, monitor statistics, implement new features, etc.
  • Work on the Java implementation of various new Spinn3r features as well as fix bugs in our current product. You will also be working on infrastructure in this position and be responsible for various backend Java components of our architecture.
  • General passion and interest in technology (distributed systems, open content, Web 2.0, etc).

I should stress that while you’ll be interacting with customers, and providing support, our customers are exceedingly brilliant and amazingly knowledgeable about our space. They’re a major asset and staying in sync with them is very important for the company.

[From API Software Engineer at Spinn3r in San Francisco | LinkedIn]

At Spinn3r we frequently deal with the chaos revolving around robots.txt, so I thought I would throw a few thoughts out there about the complexity of the issues involved here.

REP is not a EULA

This is part of the confusion around robots.txt. It’s not really clear that just because you can fetch a URL that the website will be happy with your use of that content.

It also isn’t clear that just because you fetched a URL that it means that you AGREED to a EULA limiting your rights.

In fact, I would argue that it does not limit your rights.

There is no forced click-through, and simply presenting your EULA somewhere on your website, where a robot wouldn’t be able to read it, doesn’t mean that the company that originated the request has agreed to your EULA.

Limited amount of content

Just because you’re allowed to fetch pages via robots.txt doesn’t mean that the website owner will be happy with you spidering their ENTIRE website and using EVERY single page.

Various social networks have routinely been upset and have threatened lawsuits when bulk numbers of pages were downloaded from their site, even when a permissive robots.txt was in place.

This seems like a reasonable restriction.

An extension to propose a ‘limit’ on the number of URLs used for indexing purposes might make sense.

However, it’s a bit more complicated than that. What if you just use the URLs to compute rank for the top pages on that site and then discard the old ones?

The website would have no way to verify that you in fact discarded them.

Second, how long does the limit apply for? What if the limit is 1000 pages, and you index them, build your inverted index for full text search, then discard them entirely, and fetch another 1000. At any point in time you only have 1000 documents stored but you clearly have other secondary indexes built from these documents including link graphs, inverted indexes, etc.

Throttling

Throttling is another complicated problem, one which robots.txt tries to solve but doesn’t go far enough on.

There’s a de facto ‘Crawl-delay’ option for delays between requests, but this doesn’t actually solve the problem.

Serializing requests isn’t necessarily required to throttle access to a website.

Fetching a maximum of 10 requests per second, even if they’re overlapping, should be fine for most websites.

Google implements a latency based throttle which measures HTTP response time and backs off when it starts to rise.
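To make that concrete, here’s a minimal, hypothetical sketch of a latency-based throttle (not Google’s implementation, and not ours): keep a moving average of response times and grow the inter-request delay whenever the average climbs above a healthy baseline.

    // Hypothetical latency-based throttle: back off when response times rise.
    public class LatencyThrottle {
        private double avgLatencyMs = -1;       // exponential moving average
        private long delayMs = 100;             // current inter-request delay
        private final double alpha = 0.2;       // EMA smoothing factor
        private final double baselineMs = 250;  // latency we consider "healthy"

        // Call after every request with the observed response time.
        public synchronized void record(long latencyMs) {
            avgLatencyMs = (avgLatencyMs < 0)
                ? latencyMs
                : alpha * latencyMs + (1 - alpha) * avgLatencyMs;

            if (avgLatencyMs > baselineMs) {
                // Server looks loaded: double the delay, capped at 30 seconds.
                delayMs = Math.min(delayMs * 2, 30_000);
            } else {
                // Server looks healthy: decay the delay back toward the floor.
                delayMs = Math.max(delayMs / 2, 100);
            }
        }

        // Call before issuing the next request.
        public synchronized void await() throws InterruptedException {
            Thread.sleep(delayMs);
        }
    }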

Destruction of URLs and expensive URLs

One problem with REP is that it’s unclear WHY a URL was disallowed. What happens if someone implements a GET URL that actually deletes resources or otherwise mangles databases?

These things have happened before and they will happen again.

Also, what about URLs that are just expensive to load and use high database resources?

Implementing a better throttle vocabulary would help fix this problem of course, but having ALL the robots fetch this URL, even though it’s throttled, could still DOS your website because the robots don’t actually coordinate their throttling.

Google, Bing, and Ask aren’t the only crawlers

This is another problem we frequently see… because website owners feel uncomfortable sharing their content with “just anybody” they will often mark their website content as available only to search engines.

The problem is that Google, Bing, and Ask aren’t the only search engines in town.

This has a chilling effect on sites like Blekko (and a number of Spinn3r customers) because they have to go to every website that blocks them and beg for access.

This makes it harder for search engine startups because the resources required here would be very expensive.

One potential solution is to have profile-based user agents supported in robots.txt, with strict definitions in the REP.

If you’re a public search engine (Google, Microsoft, or any small or stealth startup) then you can access (or be disallowed from accessing) the content without the policy benefiting just the big guys.

What’s a robot? What about RSS?

There is also the issue of what exactly a robot is.

We’ve seen sites that have RSS feeds and a public API available, but then have a Disallow for all user agents.

How does that make sense? So if you’re a robot you can’t index their RSS feed?

Is NetNewsWire a robot? What about Firefox with their RSS bookmarks feature?

What about Google Reader? Is that a robot?

It would seem that an RSS feed or an API (with documentation) is an ‘open for business’ sign inviting anyone to use the website (under the terms of use of your API, of course), but your robots.txt just blocked them, so your intentions are unclear.

Privacy is complicated

One problem that various social networks have run into is that they allow users to be ‘private’ by requiring the user to have an account on the website before showing their profile.

However, they then turn around and publish a snippet of their profile via unauthenticated HTTP.

If you’re clever you can build a search engine around this data and publish it in aggregate.

This can often frighten the user because they never intended their profile data to be used in this manner.

With facial recognition software it’s then possible to tie profiles together or use other textual signals to merge data from various social networks.

It’s fair to say that this could really alarm some users who are not up to date with the power of the Internet for de-anonymizing users.

This is a fair concern but some of the pressure is on the Social Networks to clarify privacy for their users and not to enable these awkward situations in the first place.

Politics

… and at the end of the day, REP can be used by a website to block you JUST because they don’t like you or don’t understand what you’re up to.

This is the major problem as I see it. It will require crawlers to clearly explain to websites why and how their content is being used.

It doesn’t help the situation that some websites are only allowing the big boys by default and blocking everyone else.

This is a dangerous situation because startups now need to beg for permission across thousands of websites (which is very expensive).

The way forward

I think the way forward here is to not attempt to solve all the problems with REP right now. Focus on a few things that are clearly broken and try to fix them.

Perfection seems to always be the enemy of progress and maybe REP will never be perfect but for now it’s all we have.

This really isn’t news but it’s nice to see more people talk about this problem:

Cha called her paper, “The Million Follower Fallacy,” a term that comes from work by Adi Avnit. Avnit posited that the number of followers of a Tweeter is largely meaningless, and Cha, after looking at data from all 52 million Twitter accounts (and, more closely, at the 6 million “active users”) seems to have proven Avnit right. “Popular users who have a high indegree [number of followers] are not necessarily influential in terms of spawning retweets or mentions,” she writes.

You can see this in the Spinn3r Social Media Rank that we released a while back.

Spinn3r is growing fast. We’ve had an exceptional month (an exceptional year actually). Closing new deals. Releasing new features for our customers. Working on new backend architecture changes, and generally having a lot of fun in the process.

We’ve been posting to Craigslist like mad in the last few weeks but I wanted to take the time to post to our blog.

We’re hiring five new Engineers to join the team with us here in San Francisco.

This is in addition to the two new Engineers we’ve hired in the last couple months.

We’re hiring two Crawl Engineers, an Operations Engineer, a Support and QA Engineer, and a Java Engineer.

Spinn3r is a great place to work. Smart people. Huge amounts of data. Great customers. New offices in SOMA (we’re in an awesome 103 year old building) and plenty of interesting problems to work on…

Matt just announced that WordPress will support the new RSS Cloud protocol.

This ping model has already existed with Ping-o-Matic of course (which Matt/Wordpress have been running since the blog epoch) and Spinn3r customers already benefit from this. In fact, we’ve been realtime for a long time now.

WordPress.com has always supported update pings through Ping-o-Matic so folks like Google Reader can get your posts as soon as they’re posted, but getting every ping in the world is a lot of work so not that many people subscribe to Ping-o-Matic. RSS Cloud effectively allows any client to register to get pings for only the stuff they’re interested in.

We haven’t announced this yet but we pushed a new filtering API in Spinn3r in the last release. We developed a domain specific language for filtering web content in real time.

A number of our customers have already started using this in production.

It’s nice that more people are pushing realtime content but I’m starting to worry about the proliferation of protocols here. XMLRPC pings are the old school way of handling things; now there’s PubSubHubbub, the Twitter stream API, SUP, etc.

However, I’ve played with most of these and think that they are all lacking in some area. One major problem is relaying messages when nodes fail and then come back online. For example, with XMLRPC pings, or the Twitter stream API, if my Internet connection fails, I’ve lost these messages forever.

The Spinn3r protocol doesn’t have this problem and supports resume. You just start off from where you last requested data and nothing is lost. We keep infinite archives so nothing is ever lost.

I don’t think most sites can support this much data (it’s expensive) but certainly a few hours of buffer, held in memory, seems reasonable to handle a transient outage.
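Whatever protocol wins, the client-side pattern for resume is the same: persist a cursor and pick up from it after an outage. Here’s a hypothetical sketch (the endpoint and the offset parameter are made up for illustration; this is not the actual Spinn3r API):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical resumable polling loop with a persisted checkpoint.
    public class ResumablePoller {
        private static final Path CHECKPOINT = Path.of("last_offset.txt");

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Recover the last processed position after a crash or outage.
            long offset = Files.exists(CHECKPOINT)
                ? Long.parseLong(Files.readString(CHECKPOINT).trim())
                : 0L;

            while (true) {
                HttpRequest request = HttpRequest.newBuilder(
                    URI.create("https://api.example.com/feed?offset=" + offset)).build();
                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

                offset = process(response.body(), offset);

                // Persist the checkpoint so a restart resumes exactly here.
                Files.writeString(CHECKPOINT, Long.toString(offset));

                Thread.sleep(10_000); // be polite between polls
            }
        }

        // Stub: parse the payload and return the next offset to request.
        static long process(String payload, long current) {
            return current + 1;
        }
    }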

ReadWriteWeb has more on this and is leading with a somewhat sensational title that would imply that these blogs were not real time in the past.

TechCrunch has more, as does Scobleizer.

One big issue with these protocols is spam. If it’s an open cloud, any spammer can send messages into the cloud (which is the case with Ping-o-Matic, which receives 90% spam). And of course spammers can receive messages from the cloud to train their own classifiers and find spam targets.

We have an AUP with Spinn3r that prevents this usage. We’ve removed spam from the feed to begin with which is nice for our customers and allows them to build algorithms without having to worry about any attacks.

Spinn3r is growing fast. Time to hire another engineer. Actually, we’re hiring for like four people right now so I’ll probably be blogging more on this topic.

My older post on this subject still applies for requirements.

If you’re a Linux or MySQL geek we’d love to have your help.

Did I mention we just moved to an awesome office on 2nd and Howard in downtown SF?

We’re hiring a Support Engineer at Spinn3r. This is a key hire (and will take a lot of work off my shoulders) so we plan on taking our time to find the right candidate.

That said, this is an awesome opportunity to get in and work on a rapidly growing startup.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytics companies, and generally anyone who needs access to high quality weblog and social media data.

We crawl the entire blogosphere in real-time, ranking and classifying blogs as well as removing spam. We then provide this information to our customers in a clean format for use within IR applications.

Spinn3r is rare in the startup world in that we’re actually profitable. We’ve proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We’ve also been smart and haven’t raised a dime of external VC funding, which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

For more information please visit our website.

Responsibilities:

  • Interact with customers both in the early sales cycle and in a support role to answer technical questions about our technology (crawling, ranking, etc) (20%)
  • Monitor our crawler stats to enable understanding of operation and detect operational anomalies, monitor statistics, implement new features, etc. (20%)
  • Work on the Java implementation of various new Spinn3r features as well as fix bugs in our current product. You will also be working on infrastructure in this position and be responsible for various backend Java components of our architecture. (60%)
  • General passion and interest in technology (distributed systems, open content, Web 2.0, etc).

I should stress that while you’ll be interacting with customers, and providing support, our customers are exceedingly brilliant and amazingly knowledgeable about our space. They’re a major asset and staying in sync with them is very important for the company.

Requirements and Experience:

  • Java (though Python, C, C++, etc would work fine).
  • Ability to understand customer needs and prioritize feature requests.
  • Friendly, patient, and excellent people skills when interacting with customers.
  • Understanding of HTTP
  • Databases (MySQL, etc).
  • Ability (and appreciation) for working in a Startup environment.
  • Must like cats :)

I didn’t have time to blog about this when it was originally posted, but the NYTimes has a great piece on the cool work done by Jure Leskovec and Jon Kleinberg on Memetracker (which is powered by Spinn3r).

For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours, according to a new computer analysis of news articles and commentary on the Web during the last three months of the 2008 presidential campaign.

The finding was one of several in a study that Internet experts say is the first time the Web has been used to track — and try to measure — the news cycle, the process by which information becomes news, competes for attention and fades.

Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs. Some 90 million articles and blog posts, which appeared from August through October, were scrutinized with their phrase-finding software.

Microformats are four years old now and will have a birthday party this Friday to celebrate.

Some of us from the Spinn3r team will be there as well. Unfortunately, due to a timing error, our Spinn3r 3.0 launch dinner is that night so I need to take off around 8pm.

Additional papers based on the Spinn3r/ICWSM dataset have been published. It seems I have a lot of reading to do!

Flash Floods and Ripples: The Spread of Media Content through the Blogosphere

This paper is based on the Spinn3r data set (ICWSM 2009), which consists of web feeds collected during a two month period in 2008. The data set includes posts from blogs as well as other data sources like news feeds. We discuss our methodology for cleaning up the data and extracting posts of popular blog domains for the study. Because the Spinn3r data set spans multiple blog domains and language groups, this gives us a unique opportunity to study the link structure and the content sharing patterns across multiple blog domains. For a representative type of content that is shared in the blogosphere, we focus on videos of the popular web-based broadcast media site, YouTube.

Our analysis, based on 8.7 million blog posts by 1.1 million blogs across 15 major blog hosting sites, reveals a number of interesting findings. First, the network structure of blogs shows a heavy-tailed degree distribution, low reciprocity, and low density. Although the majority of the blogs connect only to a few others, certain blogs connect to thousands of other blogs. These high-degree blogs are often content aggregators, recommenders, and reputed content producers. In contrast to other online social networks, most links are unidirectional and the network is sparse in the blogosphere. This is because links in social networks represent friendship where reciprocity and mutual friends are expected, while blog links are used to reference information from other data sources.

Identifying Personal Stories in Millions of Weblog Entries

Stories of people’s everyday experiences have long been the focus of psychology and sociology research, and are increasingly being used in innovative knowledge-based technologies. However, continued research in this area is hindered by the lack of standard corpora of sufficient size and by the costs of creating one from scratch. In this paper, we describe our efforts to develop a standard corpus for researchers in this area by identifying personal stories in the tens of millions of blog posts in the ICWSM 2009 Spinn3r Dataset. Our approach was to employ statistical text classification technology on the content of blog entries, which required the creation of a sufficiently large set of annotated training examples. We describe the development and evaluation of this classification technology and how it was applied to the dataset in order to identify nearly a million personal stories.

In this paper, we describe our efforts to overcome the limitations of our previous story collection research using new technologies and by capitalizing on the availability of a new weblog dataset. In 2009, the 3rd International AAAI Conference on Weblogs and Social Media sponsored the ICWSM 2009 Data Challenge to spur new research in the area of weblog analysis. A large dataset was released as part of this challenge, the ICWSM 2009 Spinn3r Dataset (ICWSM, 2009), consisting of tens of millions of weblog entries collected and processed by Spinn3r.com, a company that indexes, interprets, filters, and cleanses weblog entries for use in downstream applications. Available to all researchers who agree to a dataset license, this corpus consists of a comprehensive snapshot of weblog activity between August 1, 2008 and October 1, 2008. Although this dataset was described as containing 44 million weblog entries when it was originally released, the final release of this dataset actually consists of 62 million entries in Spinn3r.com’s XML format.

SentiSearch: Exploring Mood on the Web

Given an accurate mood classification system, one might imagine it to be simple to configure the classifier as a search filter, thus creating a mood-based retrieval system. However, the challenge lies in the fact that in order to classify the mood for a potential result, the entire content of that page must be downloaded and analyzed. Much like a typical web-based retrieval system, to avoid this cost, pages could be crawled and their mood indexed along with the representation stored for search indexing. Alternatively, the presence of a massive dataset from http://www.spinn3r.com enabled the ESSE system to be built, performing mood classification and result filtering on the fly (Burton et al. 2009). Because the dataset (including textual content), search system, and mood classification system all exist on the same server, the filtering retrieval system was made possible. The dataset not only allows access to the content of a blog post (beyond the summary and title typically made available through search APIs) but the closed nature of the dataset allows for experimentation while still being vast enough to provide breadth and depth of topical coverage.

Event Intensity Tracking in Weblog Collections

The data provided for ICWSM 2009 came from a weblog indexing service Spinn3r (http://spinn3r.com). This included 60 million postings spanned over August and September 2008. Some meta-data is provided by Spinn3r.

Each post comes with Spinn3r’s pre-determined language tag. Around 24 million posts are in English, 20 million more are labeled as ‘U’, and the remaining 16 million are comprised of 27 other languages (Fig. 3). The languages are encoded in ISO 639 two-letter codes (ISO 639 Codes, 2009). Other popular languages include Japanese (2.9 million), Chinese/Japanese/Korean (2.7 million) and Russian (2.5 million). The second largest label is U unknown. This data could potentially hold posts in languages not yet seen or posts in several languages. Our present work, including additional dataset analysis presented next, is limited to the English posts unless otherwise specified. In future work we plan to also consider other languages represented in the dataset.

Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network

Our research is the first attempt to give an accurate measure for the level of information propagation. This paper presents ‘SugarCube’, a model designed to tackle part of this problem by offering a mathematically precise solution for the quantification of the level of topic propagation. The paper also covers the application of SugarCube in the analysis of the social network structure of the ICWSM/Spinn3r dataset (ICWSM 2009). It presents threshold values for the communities found within the collection, and paves the way for the measurement of topic propagation within those communities. Not only can SugarCube quantify the proliferation level of topics, but it also helps to identify ‘heavily-propagated’ or Global topics. This novel approach is inspired by Percolation Theory and its application in Physics (Efros 1986).

I’m proud to announce that we have just released Spinn3r 3.0 after more than a year of development.

This has been quite a lot of work based on feedback from our customer base and ships with some really awesome functionality.

Most of this time has been spent on architecture but a good deal has been spent implementing features for our rapidly growing user base.

When you outsource a major component of your infrastructure, like crawling, you tend to lean on it heavily and push it to the very edge.

Spinn3r has benefited significantly from our user base as they have suggested a number of excellent features. This has dramatically increased our reliability, performance, and feature set.

A good deal of work here has been spent on scalability, performance, and optimizations, including serious improvements to our core backend infrastructure.

There’s quite a lot that’s new in this release so I’ll just dive in.

Read the full post on the Spinn3r blog…

Want to track Swine Flu outbreaks? Just use Spinn3r!

Courtney Corley and Jorge Reyes are two University of North Texas graduate students who have been using Spinn3r under our research program to mine data about the recent Swine Flu outbreak.

The Denton Record-Chronicle has the story:

“We’re looking at what people write in blogs, Web [sites] and social media like Facebook, YouTube, etc. But, in particular, we’re just using blogs,” Corley said. “We have a service that allows us access to all blogs written in whatever language.”

The service is called Spinn3r, and allows them to pull together all media across the Internet that contains the keywords they search for.

“It’s a really rich resource to use for public health to see what people are writing about,” he said. “It’s a massive amount of data. Jorge and I for the past week have been looking at all the blogs that talk about swine flu. There are many words in Spanish for swine flu, so Jorge has been able to navigate that.”

Reyes, who is from Mexico, said he was motivated to work on the project because his family was in the country where the virus originated.

“All my family was there, I was worried,” Reyes said. “We were like, ‘what could we do with the tools we have?’”

Photo caption: Courtney Corley and Jorge Reyes, who are tracking the spread of swine flu in the United States and Mexico, are shown Wednesday on campus.

We’ve been providing researchers with access to Spinn3r for more than two years now. The results are really starting to land now.

We’re sponsoring ICWSM this year with a 400GB snapshot. This is being used by more than 100 research groups of about 500 total researchers.

There should be a few dozen more papers published in the next few weeks but I wanted to highlight these now as they are already live.

Specifications and Architectures of Federated Event-Driven Systems

Specifying the Personal Information Broker Data Acquisition: Data can be acquired from multiple sources – currently we use Spinn3r, later we will also acquire IEM, Twitter, Technorati, etc. Each of these acquisitions is specified differently. Acquisition of Spinn3r data, referenced in Figure 3 step 1, is achieved through changing URL arguments in a manner defined by Spinn3r. Thus, the specification is unique to Spinn3r. While that particular specification cannot be reused, using the compositional approach, exchanging Spinn3r for Twitter, a news feed, or an instant messaging account while maintaining the integrity of the composition is trivial. The specifications for all of these information interfaces are very different; a notation that allows the description of composite applications must account for this.

Blogs as Predictors of Movie Success

In this work, we attempt to assess if blog data is useful for prediction of movie sales and user/critics ratings. Here are our main contributions:

• We evaluate a comprehensive list of features that deal with movie references in blogs (a total of 120 features) using the full spinn3r.com blog data set for 12 months.

• We find that aggregate counts of movie references in blogs are highly predictive of movie sales but not predictive of user and critics ratings.

• We identify the most useful features for making movie sales predictions using correlation and KL divergence as metrics and use clustering to find similarity between the features.

• We show, using time series analysis as in (Gruhl, D. et. al. 2005), that blog references generally precede movie sales by a week and thus weekly sales can be predicted from blog references in the preceding weeks.

• We confirm low correlation between blog references and first week movie sales reported by (Mishne, G. et. al. 2006) but we find that (a) budget is a better predictor for the first week; (b) subsequent weeks are much more predictive from blogs (with up to 0.86 correlation).

Data and Features

The data set we used for this paper is the spinn3r.com blog data set from Nov. 2007 until Nov. 2008. This data set includes practically all the blog posts published on the web in this period (approximately 1.5 TB of compressed XML).

Blogvox2: A modular domain independent sentiment analysis system

Bloggers make a huge impact on society by representing and influencing the people. Blogging by nature is about expressing and listening to opinion. Good sentiment detection tools, for blogs and other social media, tailored to politics can be a useful tool for today’s society. With the elections around the corner, political blogs are vital to exerting and keeping political influence over society. Currently, no sentiment analysis framework that is tailored to Political Blogs exist. Hence, a modular framework built with replicable modules for the analysis of sentiment in blogs tailored to political blogs is thus justified.

Spinn3r (http://tailrank.com) provided live spam-resistant and high performance spider dataset to us. We tested our framework on this dataset since it was live feeds and we wanted to test our performance of sentiment analysis on these dataset for performance analysis and testing. We periodically pinged the online api for the current dataset of all the rss feeds. Although we had different domains that were provided to us, we chose the political domain for consistency with our other results.

Meme-tracking and the Dynamics of the News Cycle

Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days — the time scale at which we perceive news and events.

Dataset description. Our dataset covers three months of online mainstream and social media activity from August 1 to October 31 2008 with about 1 million documents per day. In total it consists of 90 million documents (blog posts and news articles) from 1.65 million different sites that we obtained through the Spinn3r API [27]. The total dataset size is 390GB and essentially includes complete online media coverage: we have all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites. From the dataset we extracted the total 112 million quotes and discarded those with L < 4, M < 10, and those that fail our single-domain test with ε = .25. This left us with 47 million phrases out of which 22 million were distinct. Clustering the phrases took 9 hours and produced a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together included 94,700 nodes (phrases).

The New York Times has a great piece today on crawling the deep web – the portion of the web that isn’t easily accessible to normal web crawlers.

The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available — if only the search engines knew how to find them.

Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results — it may ultimately reshape the way many companies do business online.

Oddly enough, Google published a paper on this topic at VLDB 2008 entitled “Google’s Deep-Web Crawl”.

This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index.

Our objective is to select queries for millions of diverse forms such that we are able to achieve good (but perhaps incomplete) coverage through a small number of submissions per site and the surfaced pages are good candidates for selection into a search engine’s index.

We adopt an iterative probing approach to identify the candidate keywords for a [generic] text box. At a high level, we assign an initial seed set of words as values for the text box … [and then] extract additional keywords from the resulting documents … We repeat the process until we are unable to extract further keywords or have reached an alternate stopping condition.

A typed text box will produce reasonable result pages only with type-appropriate values. We use … [sampling of] known values for popular types … e.g. zip codes … state abbreviations … city … date … [and] price.
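Stripped of the type handling, the iterative probing loop they describe reduces to something like the following sketch (the form submission and keyword extraction steps are hypothetical stubs; the real system issues form submissions and parses the resulting HTML):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of iterative probing for surfacing form results.
    public class IterativeProber {
        public Set<String> probe(Set<String> seedWords, int maxRounds) {
            Set<String> keywords = new HashSet<>(seedWords);
            Set<String> frontier = new HashSet<>(seedWords);

            for (int round = 0; round < maxRounds && !frontier.isEmpty(); round++) {
                Set<String> discovered = new HashSet<>();
                for (String word : frontier) {
                    // Submit the form with this word and mine the result pages
                    // for additional candidate keywords.
                    for (String candidate : extractKeywords(submitForm(word))) {
                        if (keywords.add(candidate)) {
                            discovered.add(candidate);
                        }
                    }
                }
                // Stop early when a round yields nothing new.
                frontier = discovered;
            }
            return keywords;
        }

        // Hypothetical: submit the form with the given value, return result HTML.
        String submitForm(String value) { return ""; }

        // Hypothetical: pull candidate keywords out of a result page.
        List<String> extractKeywords(String html) { return List.of(); }
    }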

The smart guys over at Kosmix were also quoted:

“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.

“Most search engines try to help you find a needle in a haystack,” Mr. Rajaraman said, “but what we’re trying to do is help you explore the haystack.”

In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.

Interesting. I’m going to have to reach out to them to offer our help with Spinn3r (more on our research program shortly).

Google Blog Search shipped with an update a few months back to index the full HTML of each new blog post.

The only problem is that they indexed the full HTML and not the article content:

I wanted to give everyone a brief end-of-the-year update on the blogroll problem. When we switched blogsearch to indexing the full text of posts, we started seeing a lot more results where the only matches for a query were from the blogroll or other parts of the page that frame the actual post. (There’s been a lot of discussion of the problem. You can search for [google blogsearch] using Google Blogsearch.)

We’re in the midst of deploying a solution for this problem. The basic approach is to analyze each blog to look for text and markup that is common to all of the posts. Usually, these common elements include the blogroll, any navigational elements, and other parts of the page that aren’t part of the post. This approach works well for a lot of blogs, but we’re continuing to improve the algorithm. The search results should ignore matches that only come from these common elements. The indexing change to implement it is deployed almost everywhere now.

Spinn3r customers have had a solution for this problem for nearly a year now.

The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds but they only have a one line description instead of the full content of the post.

This has always been a problem with RSS search engines such as Feedster or Google Blog Search – what’s the point of using a search engine that’s not indexing 80% of potential content?

We’re also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth increases significantly every month with each new user. The more traffic a blog gets the greater the probability that they’ll enable partial RSS feeds in order to reduce their bandwidth costs and increase click through rates.

Spinn3r 2.1 adds a new feature which can extract the ‘content’ of a post and eliminate sidebar chrome and other navigational items.

It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what’s potentially a navigation item.

See the yellow text in the image on the right? That was identified algorithmically and isolated from the body of the post.
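The actual implementation is a statistical content probability model, but to make the general idea concrete, here’s a much-simplified sketch of a link-density heuristic (using jsoup purely for illustration; this isn’t the code Spinn3r ships):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Simplified link-density heuristic for separating article text from
    // navigation chrome. An illustration only, not Spinn3r's actual model.
    public class ContentExtractor {
        public static String extract(String html) {
            Document doc = Jsoup.parse(html);
            StringBuilder content = new StringBuilder();

            for (Element block : doc.body().select("p, div")) {
                String text = block.ownText();
                if (text.length() < 40) {
                    continue; // too short to be article body
                }
                // Fraction of the block's characters that live inside links:
                // sidebars and blogrolls are almost all links, articles aren't.
                int linkChars = block.select("a").text().length();
                double linkDensity = (double) linkChars / (text.length() + linkChars);
                if (linkDensity < 0.3) {
                    content.append(text).append("\n\n");
                }
            }
            return content.toString();
        }
    }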

To be fair it’s a difficult problem but I’ve had a few years to think about it.

We have a number of other pending announcements of researchers building cool applications with Spinn3r but this one was just too awesome to hold back.

Researchers at Cornell have developed a new memetracker (cleverly named MemeTracker) powered by Spinn3r.

Jure Leskovec, Lars Backstrom and Jon Kleinberg (author of the HITS algorithm, among other things) built MemeTracker by tracking the hottest quotes from throughout the blogosphere and rendering a graph by grouping quotes and then tracking the number of quote references.

MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories per day from 1 million online sources, ranging from mass media to personal blogs.

We track the quotes and phrases that appear most frequently over time across this entire spectrum. This makes it possible to see how different stories compete for news coverage each day, and how certain stories persist while others fade quickly.

The plot above shows the frequency of the top 100 quotes in the news over time, for roughly the past two months.

Here’s a screenshot but you should definitely play with MemeTracker to see how it works:

[Screenshot of MemeTracker]

We’ve been thinking of shipping a new API for tracking quotes across the blogosphere. Our new change tracking algorithm for finding duplicate content also does an excellent job of finding quotes.

Tracking duplicate content turns out to be very important in spam prevention and ranking. It just so happens that there’s a number of overlapping features and technologies that these things can provide.
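I won’t describe our change tracking algorithm here, but the textbook building block for this kind of duplicate and quote detection is w-shingling: hash overlapping word windows from each document and compare the resulting sets. A generic sketch (the standard technique, not our production algorithm):

    import java.util.HashSet;
    import java.util.Set;

    // Generic w-shingling sketch for near-duplicate / quote detection.
    public class Shingler {
        // Hash every window of `w` consecutive words.
        public static Set<Integer> shingles(String text, int w) {
            String[] words = text.toLowerCase().split("\\s+");
            Set<Integer> out = new HashSet<>();
            for (int i = 0; i + w <= words.length; i++) {
                int h = 17;
                for (int j = i; j < i + w; j++) {
                    h = 31 * h + words[j].hashCode();
                }
                out.add(h);
            }
            return out;
        }

        // Jaccard similarity of two shingle sets; near 1.0 means near-duplicate,
        // while a smaller but non-trivial overlap often indicates a shared quote.
        public static double jaccard(Set<Integer> a, Set<Integer> b) {
            Set<Integer> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<Integer> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }
    }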

We’re not ready to ship it just yet because the backend requires about 2TB of random access data. This isn’t exactly cheap so we’ve been experimenting with some new algorithms and hardware to bring down the pricing. I think we’ll be able to ship something along these lines once we get our next big release out the door.

Spinn3r is sponsoring the International Conference for Weblogs and Social Media this year with a snapshot of our index.

The data set was designed for use by researchers to build cool and interesting applications with the data.

Good research topics might include…

  • link analysis
  • social network extraction
  • tracing the evolution of news
  • blog search and filtering
  • psychological, sociological, ethnographic, or personality-based studies
  • analysis of influence among bloggers
  • blog summarization and discourse analysis

We’re already used by a number of researchers in top universities. Textmap (which presented at ICWSM last year) just migrated to using Spinn3r and Blogs Cascades has been using us for a while now.

The data set is pretty large, 142GB uncompressed (27GB compressed), but you need a solid chunk of data to perform interesting research.

The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The posts include the text as syndicated, as well as metadata such as the blog’s homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).

This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; …) as well as everything else you might expect to find posted to blogs.

To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form has been processed, you will be sent a URL and password where you can download the collection.

Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
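If you do grab the collection, plan on a streaming parser; a dataset this size isn’t going to fit in a DOM. A rough sketch using StAX (the element name below is a placeholder, so check the format documentation on the Spinn3r website for the real schema):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Streaming (StAX) pass over a large XML dump without loading it in memory.
    public class DatasetReader {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream(args[0]));

            int posts = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) { // placeholder name
                    posts++;
                }
            }
            reader.close();
            System.out.println("posts seen: " + posts);
        }
    }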

Looks like Cuil might be hitting websites too hard with their crawler:

“I don’t know what spawned it, but when Cuil attempts to index a site, it does so by completely hammering it with traffic,” the tipster wrote. “So much, that it completely brings the site down. We’re 24 hours into this “index” of the site, and I’ve had to restrict traffic to the site down to 2 packets per second, while discarding the rest, or otherwise it makes the site unusable.”

The Admin Zone forums are abuzz over Cuil’s overzealous method for indexing. Countless posters on the site have said that their websites have been brought down because of the Twiceler robot and one user said it “leeched enormous amounts of bandwidth — nearly 2GB this month until it was blocked. It visited nearly 70,000 times!”

One of the benefits to Spinn3r is that we crawl for multiple customers (and we’re super polite).

Cuil might be a bit more aggressive than normal crawlers for now in that they have to backfill archives. Google has the benefit of being around for a while and doesn’t need to drink from the firehose as much as Cuil.