Ignoring Blogroll and Sidebar Content in Search

Google Blog Search shipped an update a few months back to index the full HTML of each new blog post.

The only problem is that they indexed the full HTML and not the article content:

I wanted to give everyone a brief end-of-the-year update on the blogroll problem. When we switched blogsearch to indexing the full text of posts, we started seeing a lot more results where the only matches for a query were from the blogroll or other parts of the page that frame the actual post. (There’s been a lot of discussion of the problem. You can search for [google blogsearch] using Google Blogsearch.)

We’re in the midst of deploying a solution for this problem. The basic approach is to analyze each blog to look for text and markup that is common to all of the posts. Usually, these common elements include the blogroll, any navigational elements, and other parts of the page that aren’t part of the post. This approach works well for a lot of blogs, but we’re continuing to improve the algorithm. The search results should ignore matches that only come from these common elements. The indexing change to implement it is deployed almost everywhere now.
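The approach Google describes can be sketched roughly as follows. This is a minimal illustration of the idea, not Google's actual implementation: split each post's page into text blocks, then treat any block that recurs across most of the blog's posts as boilerplate to exclude from indexing. The function names and the 80% threshold are my own assumptions.

```python
from collections import Counter

def find_common_blocks(posts, threshold=0.8):
    """Identify text blocks (blogroll, nav, etc.) repeated across posts.

    `posts` is a list of pages from one blog, each given as a list of
    extracted text blocks. Any block appearing in at least `threshold`
    of the posts is treated as page chrome, not article content.
    (Illustrative sketch -- not Google's actual algorithm.)
    """
    counts = Counter()
    for blocks in posts:
        counts.update(set(blocks))  # count each block once per post
    cutoff = threshold * len(posts)
    return {block for block, n in counts.items() if n >= cutoff}

def post_specific_text(blocks, common):
    """Keep only the blocks unique to this post -- the actual article."""
    return [b for b in blocks if b not in common]
```

Search matches that land only in the `common` set would then be ignored at query time.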

Spinn3r customers have had a solution for this problem for nearly a year now.

The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds, but they contain only a one-line description instead of the full content of each post.

This has always been a problem with RSS search engines such as Feedster or Google Blog Search – what’s the point of using a search engine that’s not indexing 80% of potential content?

We’re also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth costs grow significantly every month with each new user. The more traffic a blog gets, the greater the probability that it will switch to partial RSS feeds in order to reduce bandwidth costs and increase click-through rates.

Spinn3r 2.1 adds a new feature that can extract the ‘content’ of a post, eliminating sidebar chrome and other navigational items.

It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what’s potentially a navigation item.
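One classic heuristic in this family, sketched below, scores each HTML block by its text length discounted by link density: navigation and blogrolls are mostly links, while article content is mostly running text. This is an assumption-laden illustration of the general technique, not Spinn3r's actual probability model, and the function names are my own.

```python
import re

def score_block(html_block):
    """Score an HTML block: long runs of text score high, link-heavy
    blocks (nav, blogrolls) score low. Illustrative heuristic only --
    not Spinn3r's internal content probability model."""
    text = re.sub(r"<[^>]+>", "", html_block)
    link_text = "".join(re.findall(r"<a\b[^>]*>(.*?)</a>", html_block, re.S))
    if not text.strip():
        return 0.0
    link_density = len(link_text) / max(len(text), 1)
    return len(text) * (1.0 - link_density)

def extract_content(blocks):
    """Return the block most likely to be the post body."""
    return max(blocks, key=score_block)
```

A real extractor would also fold in markup cues (tag names, class attributes) and, as described below, priors learned from a source’s previous posts.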

See the yellow text in the image on the right? That was identified algorithmically and isolated from the body of the post.

To be fair, it’s a difficult problem, but I’ve had a few years to think about it.


  1. Kevin,
    That’s great to hear. Two immediate thoughts/questions:

    Since it’s a probabilistic model, does it mean it’s applicable to non-blog content (i.e. *any* web page)? How much training is required for every previously unseen “page structure”?

    Have you done any evaluation to see how many false positives/negatives your model yields?

    Also, does Spinn3r fetch any non-blog content, and does Spinn3r let customers specify completely custom sets of content sources? If so, does that imply that for every new type of content that Spinn3r hasn’t seen and been trained for before, one (you? customer?) would have to train Spinn3r to recognize the meat of the page from the rest?

    Thanks.

  2. Hey Otis….

    It can work on any web page…. it just needs a good run of text.

    Training isn’t actually required, but it’s preferred. If we have previous HTML we can score a bit higher.

    It’s about 85%–95% accurate in our tests. Some really pathological HTML can be confusing, but for the most part it works.

    We have large customers who have deployed this in production, and they’ve only come back with a few somewhat trivial recommendations.

    Spinn3r can fetch non-blog content if it has an RSS feed. We can also crawl a site if necessary. We have crawls set up for mainstream media sites that have poor RSS.

    If it’s a new source, we’re still pretty accurate but we can sometimes miss the first or last sentence in a post.

    Since most of our customers are training for NLP or bag-of-words classification, this isn’t much of a problem.

    If we can train for a bit longer with previous posts we’re able to extract the meat of the post a bit better.

    One trick here is to just ignore the first 5-10 posts from a source and just use the RSS content.

    We now have a grand unified API that allows our customers to fetch the RSS, HTML, or content extract of a post with a single call.

  3. Fredrik

    Open Source?

  4. Not open source but we have licensing for research organizations.

    Kevin





