NYTimes on Crawling the Deep Web

The New York Times has a great piece today on crawling the deep web – the portion of the web that isn’t easily accessible to normal web crawlers.

The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available — if only the search engines knew how to find them.

Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results — it may ultimately reshape the way many companies do business online.

Oddly enough, Google published a paper on this topic at VLDB 2008 entitled “Google’s Deep-Web Crawl”.

This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index.

Our objective is to select queries for millions of diverse forms such that we are able to achieve good (but perhaps incomplete) coverage through a small number of submissions per site and the surfaced pages are good candidates for selection into a search engine’s index.

We adopt an iterative probing approach to identify the candidate keywords for a [generic] text box. At a high level, we assign an initial seed set of words as values for the text box … [and then] extract additional keywords from the resulting documents … We repeat the process until we are unable to extract further keywords or have reached an alternate stopping condition.

A typed text box will produce reasonable result pages only with type-appropriate values. We use … [sampling of] known values for popular types … e.g. zip codes … state abbreviations … city … date … [and] price.
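The iterative probing loop the paper describes can be sketched in a few lines. This is a hypothetical illustration, not Google's implementation: `submit_form` stands in for the whole form-submission and page-fetching layer, and the keyword extractor here is just a word-frequency heuristic.

```python
import re
from collections import Counter

def extract_keywords(page_text, top_n=5):
    """Pull frequent words out of a result page as new probe candidates
    (a crude stand-in for the paper's keyword-extraction step)."""
    words = re.findall(r"[a-z]{4,}", page_text.lower())
    return [w for w, _ in Counter(words).most_common(top_n)]

def iterative_probe(submit_form, seeds, max_rounds=3, max_keywords=50):
    """Probe a generic text box: submit seed words, harvest new keywords
    from the result pages, and repeat until no new keywords appear or a
    cap is hit. `submit_form(keyword)` -> result-page text (assumed)."""
    known = set(seeds)
    frontier = list(seeds)
    for _ in range(max_rounds):
        new = []
        for kw in frontier:
            for cand in extract_keywords(submit_form(kw)):
                if cand not in known and len(known) < max_keywords:
                    known.add(cand)
                    new.append(cand)
        if not new:  # stopping condition: no further keywords extracted
            break
        frontier = new
    return known
```

For typed text boxes (zip codes, states, dates, prices), the seed set would instead be drawn from known values of the detected type rather than generic keywords.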

The smart guys over at Kosmix were also quoted:

“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.

“Most search engines try to help you find a needle in a haystack,” Mr. Rajaraman said, “but what we’re trying to do is help you explore the haystack.”

In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.

Interesting. I’m going to have to reach out to them to offer our help with Spinn3r (more on our research program shortly).

  1. Great to see another shot at tackling the Deep Web. The problem is far from new, though. Vast.com (a company I founded) did this at large scale in 2003, with world-class vertical results, which got us funded in 2005. Clearly, automated Deep Web crawling is quite doable, and I am glad Google is (finally) getting around to doing it. OTOH, their effort so far seems very limited in both technology and the scope of the targeted deep content.

    IMHO the next wave of Deep Web crawling is much more powerful, and it comes out very naturally, almost for free, in our (my new startup's) approach to large-scale Web search. Unfortunately, I can't go into much detail, but the key is in user-driven aspects.

  2. Hey Borislav.

    Let me know when you can talk… Perhaps we could grab coffee in SOMA and talk search…


  3. Hi Kevin,

    Thanks, that would be great. I live in Portland but am often in the Valley. My next trip will be March 3rd-4th; we could meet sometime on the 3rd? Let me know if that works, thanks.


  4. Good article, Borislav.

    I’ll be taking a look at Spinn3r. At Deep Web Technologies (www.deepwebtech.com), we’ve been making deep web websites available for years. Check out http://www.science.gov, http://www.scitopia.org, http://www.mednar.com and http://www.biznar.com, among others.

    Crawling and indexing will go a long way toward making the deep web more accessible. Much of the content in the deep web, though, is hidden for a reason: it is proprietary, requires a subscription, or the publisher doesn't necessarily want the information made available publicly.

    The trick then, in making access to the deep web truly ubiquitous, is to provide the tools that enable public search of proprietary sources without actually exposing the underlying information to the public (i.e. managing credentials on a user-by-user basis). Also, many deep web researchers care about narrow topics within the deep web, and therefore want to constrain their deep web search to specific databases or sources.

    I didn’t think the NYT article did justice to the issues and solutions required to make access to the deep web effective and efficient.

