Monitoring Specific Blogs – Blog Search is Broken

Blog search is just plain broken. Technorati is falling apart. Between spam, non-deterministic results, and a buggy feature set, it’s just unusable. How can so many cool and smart people get together and build such a sad product?

That said, there’s no real alternative. Google Blogsearch finds comment feeds, for example, and its depth of coverage is just sad, especially compared to the main Google search.

I just can’t use blog search anymore. What I need is reliable searching on a subset of blogs. I could add 100 or so feeds from my aggregator (I’m using both Google Reader and Bloglines now) and then monitor them for any mention of Tailrank. The only problem is that none of the blog search services implement this functionality.

Last night I evaluated Technorati, Sphere, Feedster, Icerocket, Google Blogsearch, Ask Blog Search, BlogPulse, Bloglines search, Yahoo search, and Google search and none of them worked for me. Am I missing any?

Yahoo search comes the closest I think but they don’t provide a way to sort by reverse chronology.

Technorati’s crawler is eight days behind and misses a number of key items. If you hit reload it will switch between being eight days old and 30 minutes old. OK. How’s that possible? I’m willing to bet they have a federated backend infrastructure, but keeping it this inconsistent is not a good idea.

Bloglines search showed the most promise, with the ability to add the blogs I wanted and search just within that subset, but it was missing a number of stories which should have been in the results.

If I had infinite time I’d just sit down for a weekend and write my own blog search engine that only crawled a few blogs and focused on precision and indexing rate. Spam wouldn’t really be an issue since we would only crawl blogs on behalf of a given user.
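
As a rough sketch of that idea (not anything Tailrank actually runs): take a short list of feeds and keep only the entries that mention a term. The feed XML is inlined and the names `matching_items` and `monitor` are made up for illustration. It assumes RSS 2.0 `item` elements only and uses Python's standard library instead of a real feed parser, so Atom feeds and malformed XML are out of scope.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real monitor would fetch each feed URL
# on a schedule instead of inlining the XML.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Trying out Tailrank</title>
      <link>http://example.com/tailrank</link>
      <description>Some thoughts on Tailrank clusters.</description>
    </item>
    <item>
      <title>Unrelated post</title>
      <link>http://example.com/other</link>
      <description>Nothing to see here.</description>
    </item>
  </channel>
</rss>"""


def matching_items(rss_text, term):
    """Return (title, link) pairs for RSS 2.0 items that mention term."""
    needle = term.lower()
    root = ET.fromstring(rss_text)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        body = item.findtext("description", default="")
        # Plain case-insensitive substring match: no ranking, no stemming.
        if needle in (title + " " + body).lower():
            hits.append((title, link))
    return hits


def monitor(feeds, term):
    """Scan a dict of {feed_name: rss_text} and collect matches per feed."""
    return {name: matching_items(text, term) for name, text in feeds.items()}


if __name__ == "__main__":
    print(monitor({"example": SAMPLE_RSS}, "tailrank"))
```

Because only hand-picked feeds are scanned, there is no global index for spam to leak into. The trade-off is that summary-only feeds can hide a mention that appears in the full post.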

Anyone have any suggestions on a solution so I don’t have to spend next weekend coding a new service?


  1. You mean, you want to monitor a specific subset of blogs, for specific terms?

    …. I would just pull them down once an hour with the Universal Feed Parser and then do a strstr() and then dump the results into an Atom feed…..

    Traditional ‘search’ just isn’t good at what you are asking. There are lots of places and reasons things go ‘missing’… and it mostly comes down to most people retrofitting engines designed for whole-web crawling, where it “wouldn’t matter”.

    If you have a specific query/site, I can check and see if there is a cause for something missing from Ask/Bloglines Search…..

  2. Yeah…. I sort of agree. I think that implementing it on top of MySQL fulltext search would be just as easy as using a manual parser. This way you could support ad hoc queries without having to recompile your indexer.

    I think the issue with Ask Blog Search is that I couldn’t search a specific blog.

    The reason I wanted to go down this path was to remove any chance of global spam hurting my results….

    Bloglines search would have been perfect though…. It seems like it should do exactly what I want. In this case it was missing results from some main blogs. Any idea why? Maybe the index isn’t rebuilt on a regular basis?

    Kevin

  3. The other issue is that, for the most part, using RSS might not work because a lot of the A-listers use summary feeds. :-/

    The one specific case I have isn’t an issue though……

  4. The index has new content every ~5 minutes from the crawler — there isn’t really a concept of… rebuilding in how it works, I’m not sure how much I can describe how it works on a public blog though :)

    However, as an example, you can use the burl: operator to restrict to a specific base url:
    http://www.bloglines.com/search?q=burl%3Atechcrunch.com+tailrank&ql=en&s=f&pop=n&news=m

    burl:techcrunch.com tailrank

    But even this exposes at least one problem with someone’s GUIDs. And it is missing items compared to:
    http://www.techcrunch.com/tag/Tailrank/

    So, it looks like the problem you are hitting here, is that we have something called ‘Domain Collapse’ turned on, and it will cull out results from the same domain. (Because in ‘normal’ web search, you don’t want to be flooded with results from the same domain; See: Blog Search Engines coming from Web Search backgrounds.)

    One thing that does work differently inside Bloglines is what we call ‘saved search’. For example, if you go to this URL in Bloglines:
    http://www.bloglines.com/sub/http://www.bloglines.com/search?q=bcite%3Atailrank.com&ql=en&s=f&pop=n&news=m&n=100&format=rss

    It will create an internal savedsearch, and any new items linking to tailrank will show up… And it should be more complete than the historical search, for new items at least….

  5. This url better shows the effects of domain collapse:
    http://www.bloglines.com/search?q=tailrank&ql=en&s=f&pop=n&news=m&sitelist=3026855

    (sitelist is kind of a magical parameter to make blogsearch only search a specific Bloglines Site ID. 3026855 is techcrunch).

    For example, this searches just techcrunch and feedblog:
    http://www.bloglines.com/search?q=tailrank&ql=en&s=f&pop=n&news=m&sitelist=3026855,2516162

    (feedblog is 2516162).

    I’ll try to see if we can tune some of the domain collapse stuff when searching a specific site when I’m back at work in January.

  6. I would invite you to check out Blogdigger; using the site: operator, you can specify a set of blogs to search, so the query:

    tailrank site:(feedblog.org OR techcrunch.com)

    would give you results from just those blogs. In addition to our regular blog search, you can use Blogdigger Groups (http://groups.blogdigger.com) to create a group of feeds of your choosing, which includes full text search functionality (the UI needs a good deal of work, but it does work). If I can be of any help, let me know.

  7. What about Topix.net? http://search.topix.net/

    The blogs in the Topix.net crawl are all editorially selected, so the quality is very high and spam is virtually non-existent. If splogs do get in somehow, we will take them out when we are alerted to their presence.

    Here is a sample search for tailrank:
    http://www.topix.net/search/?blogs=1&q=tailrank&ts-t=Search+Blogs

    If we’re missing anything, let us know and we’ll add it.

    (yes, I work at Topix)

  8. Favorite your blog subset in Technorati, and use the ‘search within favorites’ feature.

  9. Kevin….

    Hm… thanks. That seems to work. I’m going to watch the results more. The normal Technorati stuff seems to be nondeterministic but for a first pass the favorites search seemed to work.

    The problem I’m having is that with Tailrank I’m reading less and less of my RSS but I *do* want to see any mention of Tailrank. I have to avoid the spam though so I can’t really use Technorati as a lot of people will just link to our clusters but aren’t providing feedback on Tailrank.

    I’ll play with this for a while and report back. I imported my OPML into Technorati and have search feeds set up, so we’ll see how this works.

    Kevin

  10. You may see false positives from the ‘tailrank’ buttons people add to posts. Filtering those out better is on the to-do list.

  11. Hi Kevin,

    I see some ideas to tackle your problem, but spam coders will be as smart as those ideas I think.

    To identify spam is not only about identifying the source and the title, but also the contents of that spam: the semantics of the message.

    Therefore I think search engines will need to capture the semantics in the (near) future. Technologies like those used in ‘Semantic Web’ alternatives are an option to bring meaning to the content of the message, so you can search more specifically on the content and not on how it is being presented.
    And therefore identify spam or duplicate messages more easily.

    I think, for Tailrank, it is good to know what the new technologies are to index news and maybe in the (near) future you can extend Tailrank with search?

    So, I think you will have to write a new service, while spam-coders will be as smart (message as spam nowadays) as us providing new solutions to tackle these items.

    Sorry :p

    Joey :D

  12. I’m an engineer on the Google BlogSearch team. I was wondering what specific concerns you had about depth of coverage. If you had a few examples, I’d be interested to find out why they aren’t in our index.

    We do have many comment feeds in the index, but for many blogs the comments have interesting content. I’d think they improve our coverage.
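
The Bloglines query URLs quoted in comments 4 and 5 follow a regular pattern, so they can be assembled programmatically. The sketch below is only an illustration: the base URL, the `burl:` operator, and the `ql`, `s`, `pop`, `news`, and `sitelist` parameters are copied verbatim from those comments, while the helper name `bloglines_search_url` is made up here. It builds the strings and does not check whether the service still answers them.

```python
from urllib.parse import urlencode

# Base URL and parameter names are copied from the URLs quoted in the
# comments; nothing here is taken from official Bloglines documentation.
BLOGLINES_SEARCH = "http://www.bloglines.com/search"


def bloglines_search_url(query, site_ids=None):
    """Build a Bloglines search URL, optionally limited to given site IDs."""
    params = {"q": query, "ql": "en", "s": "f", "pop": "n", "news": "m"}
    if site_ids:
        # sitelist takes comma-separated Bloglines Site IDs, as in comment 5.
        params["sitelist"] = ",".join(str(s) for s in site_ids)
    # safe="," leaves the commas in sitelist unescaped, matching the quoted URLs.
    return BLOGLINES_SEARCH + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    # Restrict by base URL with the burl: operator.
    print(bloglines_search_url("burl:techcrunch.com tailrank"))
    # Restrict to the two site IDs given in the comments.
    print(bloglines_search_url("tailrank", [3026855, 2516162]))
```

Both calls reproduce the URLs quoted above character for character, which is a quick way to confirm the encoding (`:` becomes `%3A`, a space becomes `+`).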





