Archive for the ‘search’ Category

One of the things that has always bothered me about replication is that the binary logs are written to disk on the slave and then read back from disk.

There are two threads which are, for the most part, unaware of each other.

One thread (the I/O thread) reads the binary logs from the master and writes them to disk; the other (the SQL thread) reads them back from disk and executes them.

While the Linux page cache CAN buffer these logs, the initial write still causes additional disk load.

One strategy, which could seriously boost performance in some situations, would be to pre-read say 10-50MB of data and just keep it in memory.

If a slave is catching up, it could fetch GIGABYTES of binary log data from the master and write it all to disk. By the time the SQL thread gets around to reading the older portions they will have been pushed out of the page cache, so these reads would NOT come from cache.

Simply using a small in-memory buffer between the two threads could solve this problem.
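
A rough sketch of what such a buffer might look like, assuming a simple producer/consumer hand-off between the two threads (the class name and chunk size are made up for illustration, not MySQL internals):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PreReadBuffer {
        // roughly 50MB of 64KB chunks held in memory between the two threads
        private final BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(800);

        // called by the thread fetching the master's binary log;
        // blocks when the buffer is full, which throttles the fetch
        public void offerChunk(byte[] chunk) throws InterruptedException {
            buffer.put(chunk);
        }

        // called by the thread executing events; never touches disk
        public byte[] takeChunk() throws InterruptedException {
            return buffer.take();
        }
    }

The fetching thread could still append each chunk to the on-disk log for durability; the point is simply that a slave which is keeping up never needs to read that data back off disk.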

One HACK would be to use a ram drive or tmpfs for logs. I assume that the log thread will block if the disk fills up… if it does so intelligently, one could just create a 50MB tmpfs to store binary logs. MySQL would then read these off tmpfs, and execute them.

50MB-250MB should be fine for a pre-read buffer. Once one of the log files is executed and removed, the fetching thread would unblock and continue reading data from the master.

Spinn3r is hiring for an experienced Senior Systems Administrator with solid Linux and MySQL skills and a passion for building scalable and high performance infrastructure.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog data.

We crawl the entire blogosphere in realtime, remove spam, rank and classify blogs, and provide this information to our customers.

Spinn3r is rare in the startup world in that we’re actually profitable. We’ve proven our business model, which gives us a significant advantage in designing future products and expanding our current customer base and feature set.

We’ve also been smart and haven’t raised a dime of external VC funding, which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

Overview:

In this role you’ll be responsible for maintaining performance and availability of our cluster as well as future architecture design.

You’re going to need to have a high level overview of our architecture but shouldn’t be shy about diving into MySQL and/or Linux internals.

This is a great opportunity for the right candidate. You’re going to be working in a very challenging environment with a lot of fun toys.

You’re also going to be a core member of the team and will be given a great deal of responsibility.

We have a number of unique scalability challenges including high write throughput and massive backend database requirements.

We’re also testing some cutting edge technology including SSD storage, distributed database technology and distributed crawler design.

Responsibilities:

  • Maintaining 24 x 7 x 365 operation of our cluster
  • Tuning our MySQL/InnoDB database environment
  • Maintaining our current crawler operations
  • Monitoring application availability and historical performance tracking
  • Maintaining our hardware and Linux environment
  • Maintaining backups, testing failure scenarios, suggesting database changes

Requirements:

  • Experience in managing servers in large scale environments
  • Advanced understanding of Linux (preferably Debian). You need to grok the kernel, filesystem layout, memory model, swap, tuning, etc.
  • Advanced understanding of MySQL including replication and the InnoDB storage engine
  • Knowledge of scripting languages (Bash and PHP are desirable)
  • Experience maintaining software configuration within a large cluster of servers
  • Network protocols including HTTP, SSH, and DNS
  • BS in Computer Science (or comparable experience)


It dawned on me that if I were working for Twitter I would just assume the service is down unless told otherwise.

This led to the conclusion that one should invert the monitoring and send off a notification when Twitter is online.
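
A minimal sketch of that inverted check, using nothing but java.net (the class name is made up and the alert is just a println; hook it up to whatever pager you like):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TwitterUpMonitor {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL("http://twitter.com/").openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            try {
                if (conn.getResponseCode() == 200) {
                    // the unusual event worth paging someone about
                    System.out.println("ALERT: Twitter is up!");
                }
            } catch (java.io.IOException e) {
                // down, as assumed; no notification needed
            } finally {
                conn.disconnect();
            }
        }
    }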

Seriously. I like those guys but this is getting kind of embarrassing.

As someone interested in distributed, scalable, and reliable web services, I think I might stop using it out of protest.

Things could be worse though – they could be using Hadoop! :-)

You can see a picture of Twitter’s main database server below:

[Image: Twitter’s main database server]

This press release on Mtron’s site is interesting:

In the second half of 2008, Intel will release both high-performance SSDs for use in servers and storage, and exclusive SSD models for consumer electronics; the industry is thus watching whether Google will introduce SSD systems.

Google reportedly will be supplied with Intel’s SSD-embedded storage devices at the end of the second quarter to be applied to its search systems.

There’s been more activity in the distributed consensus space recently.

At the Hypertable talk yesterday Doug mentioned Hyperspace, their Chubby-style distributed lock manager, though I think it’s missing the ‘distributed’ part for now.

To provide some level of high availability, Hypertable needs something akin to Chubby. We’ve decided to call this service Hyperspace. Initially we plan to implement this service as a single server. This single server implementation will later be replaced with a replicated version based on Paxos or the Spread toolkit.

ZooKeeper seems to be making some progress as well.

Check out this recent video presentation (which I probably can’t embed so here’s the link).

In 2006 we were building distributed applications that needed a master, aka coordinator, aka controller to manage the sub processes of the applications. It was a scenario that we had encountered before and something that we saw repeated over and over again inside and outside of Yahoo!.

For example, we have an application that consists of a bunch of processes. Each process needs to be aware of other processes in the system. The processes need to know how requests are partitioned among the processes. They need to be aware of configuration changes and failures. Generally an application-specific central control process manages these needs, but generally these control programs are specific to applications and thus represent a recurring development cost for each distributed application. Because each control program is rewritten it doesn’t get the investment of development time to become truly robust, making it an unreliable single point of failure.

We developed ZooKeeper to be a generic coordination service that can be used in a variety of applications. The API consists of less than a dozen functions and mimics the familiar file system API. Because it is used by many applications we can spend time making it robust and resilient to server failures. We also designed it to have good performance so that it can be used extensively by applications to do fine-grained coordination.

We use a lock coordinator in Spinn3r and are very happy with the results. It’s a very simple system, so it provides a LOT of functionality without much pain or maintenance.
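
For anyone who hasn’t played with ZooKeeper, a coarse lock really is about as simple as creating an ephemeral node. This is only a sketch against the stock ZooKeeper Java API; the node path and the surrounding retry logic are illustrative, not how our coordinator is actually written:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class CrawlerLock {
        // assumes a persistent /locks parent node already exists
        public static boolean tryLock(ZooKeeper zk)
                throws KeeperException, InterruptedException {
            try {
                // the ephemeral node disappears automatically if this process dies
                zk.create("/locks/crawler-master", new byte[0],
                          Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;                  // we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                return false;                 // someone else holds it; watch and retry
            }
        }
    }

Because the node is ephemeral, a crashed holder’s session expiry releases the lock on its own, with no janitor process required.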

Paxos Made Live is out as well (I haven’t had time to read it yet).

Robot Yield

This morning I was thinking about robot blocks after reading Rich’s post about Cuill being blocked on 10k hosts.

So let’s say you write a web-scale crawler and accidentally push a bug. It’s a huge mistake: you hurt a few hosts and end up being blocked.

A month passes and you’ve implemented a fix and a number of other features which make crawling easier on hosts in your cluster.

… basically you want another chance to crawl these sites. The problem is that you now need to wait an eternity until they remove your robot block.

Now what?

Do you ignore the block? That’s probably not right.

Do you create a new User-Agent so that you can slide through the robot block? Possibly. That might work. However, what if you’re blocked because people don’t like you (and it’s not a politeness issue).

I assume if it’s a non-crawlable directory they’re just going to use User-Agent: *.
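
For illustration, the two cases look roughly like this in robots.txt (the bot name is hypothetical):

    # politeness block: keep every crawler out of one directory
    User-agent: *
    Disallow: /private/

    # targeted block: this particular crawler is banned from the whole site
    User-agent: ExampleBot
    Disallow: /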

One could extend robots.txt with additional syntax to handle situations like this, but honestly, how many users are going to adopt that extension?

They could always just remove the disallow rules…

We’ve been covering a massive blog spam epidemic thanks to a nasty/evil spammer who’s exploiting an XMLRPC bug in WordPress 2.2.

This issue is FINALLY getting the attention it deserves:

I had a closer look at many of the blogs concerned that had spammy content — pages promoting credit cards, pharmaceuticals and the like, and I realized that if you go to the root domain they are all legitimate blogs. Not scraper blogs that were being auto-generated with adsense / affiliate links, which was extremely curious, and actually reminiscent of something that hit home a few months ago.

A few months ago, this blog got hacked — but in a sneaky way. Not only did the hackers insert “invisible” code into my template, so that I was getting listed in Google for all manner of sneaky (and NSFW) terms, so that people could click on those links with the hacker getting the affiliate cash — but *actually*, said hackers also inserted fake templates into my WordPress theme.

Center Networks is also covering this issue…

Oddly enough Tailrank picks up on this spam because of our clustering algorithm. We cluster common links and terms via our blog index and promote these stories to our front page.

Since we ‘trust’ stories based on past behavior, when major A-list blogs like ZDNet get owned we believe their links are legitimate.

If we had a smaller index this might be a bit easier to handle, but we’re indexing 12M blogs within Tailrank and on Spinn3r.

Another way around this, of course, would be to blacklist every blog running WordPress 2.2 or earlier, but we’re talking millions of blogs here and we don’t want to unfairly harm anyone.

To date our approach has been to wait until Tailrank has identified the spam, and then blacklist any blogs that have been compromised.

Unfortunately this is a war of attrition with the spammer just spending a few more days and hacking another dozen or so sites.

The only positive aspect of this is that it’s encouraging people to upgrade to WordPress 2.5.

We’re also working on some secondary algorithms to catch this a bit sooner and we’ll probably ship these in Spinn3r 2.5 which is due shortly.

Award Me Stars

It’s been known for a while that many SEOs are using link bait to attract links to help them manipulate search engine rankings:

His non-operating, do-nothing program won 16 awards. Various sites labeled it “Certified 5-Star,” “Editor’s Pick,” and “Cool Discovery.” All of them, obviously, from sites that didn’t even bother to note the blatant name of the program, nor try to run it even once.

What’s going on? Brice surmises that the software sites award their top rating to everything submitted, in hopes that the software authors will boast of the awards on their own sites and link back to the aggregator sites — thus, raising the aggregator site’s rankings in search engines.

It looks like Odeo has acquired BlogDigger:

It’s worth noting that they’ve never raised VC:

I admire the way Blogdigger has diversified through the years and consistently sought out new niches within blog search. The digital media part of its business is what, in the end, differentiated Blogdigger from the crowd. It’s worth reading Greg’s story in full, below, as it provides an informative glimpse into how a small, unfunded startup has battled through 5 years and finally had a successful conclusion (well, we hope the price paid can be deemed a success). And remember that Blogdigger, like this blog, launched well before the web 2.0 hype began.

So while competitors such as Feedster have closed up shop, and Technorati flounders, BlogDigger doesn’t have to worry about these problems and they can exit on their own terms.

Specifically, they never overestimated the market, and instead won their revenue at a tortoise’s pace rather than crashing into the wall at a hare’s pace.

I have to admit that if you can use VC correctly, it’s a valuable tool. But all too often entrepreneurs overestimate their own self-worth and the VCs are too easily convinced that investing in a lemon is a good idea.

Of course, the Founder/CEO has nothing to lose. As long as they have one in ten YouTube-style exits they’re more than breaking even. This explains why serial entrepreneurs can raise round after round while continuing to fail.

I’m expecting Technorati to close up shop anytime now. Either that or they’ll have to recap the whole company and start with a fresh round of VC.

The guys over at Slaant were nice enough to write an Open Source Perl driver for Spinn3r.

They did all the work here and we’re immensely grateful that they decided to release it as Open Source.

This is 100% native and uses Expat for XML parsing.

As part of this release I also wrote some notes on client design guidelines. It turns out that 80% of the problems are produced by common implementation issues. Things like using read and connect timeouts, correct DNS caching, UTF-8 encoding, etc.

WWW::Spinn3r is an iterative interface to the Spinn3r API. The Spinn3r API is implemented over REST and XML and documented thoroughly at `http://spinn3r.com/documentation’. This document makes many references to the online doc and the reader is advised to study the Spinn3r documentation before proceeding further.

This module gives you a Perl hash interface to the API. You’ll need just two functions from this module: `new()’ and `next()’. `new()’ creates a new instance of the API and `next()’ returns the next item from the Spinn3r feed.

Looks like Yahoo is releasing more details about web standards, RDF, and microformat support in their search platform:

While there has been remarkable progress made toward understanding the semantics of web content, the benefits of a data web have not reached the mainstream consumer. Without a killer semantic web app for consumers, site owners have been reluctant to support standards like RDF, or even microformats. We believe that app can be web search.

By supporting semantic web standards, Yahoo! Search and site owners can bring a far richer and more useful search experience to consumers. For example, by marking up its profile pages with microformats, LinkedIn can allow Yahoo! Search and others to understand the semantic content and the relationships of the many components of its site. With a richer understanding of LinkedIn’s structured data included in our index, we will be able to present users with more compelling and useful search results for their site. The benefit to LinkedIn is, of course, increased traffic quality and quantity from sites like Yahoo! Search that utilize its structured data.

… and of course a rising tide lifts all boats. I expect this will help out Spinn3r as this just means more structured content for us to index.

They’re using the right vocabulary though:

In the coming weeks, we’ll be releasing more detailed specifications that will describe our support of semantic web standards. Initially, we plan to support a number of microformats, including hCard, hCalendar, hReview, hAtom, and XFN. Yahoo! Search will work with the web community to evolve the vocabulary framework for embedding structured data. For starters, we plan to support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others based on feedback. And, we will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, we are announcing support for the OpenSearch specification, with extensions for structured queries to deep web data sources.
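
To make that concrete, a microformat like hCard is just ordinary HTML with agreed-upon class names that a crawler can pick out; a minimal (made-up) example:

    <div class="vcard">
      <span class="fn">Jane Doe</span>,
      <span class="org">Example Corp</span>,
      <a class="url" href="http://example.com/janedoe">profile</a>
    </div>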

Techcrunch has more on the subject and generally likes the direction Yahoo is taking.

The signal-to-noise ratio at WSDM was great. It was the only conference I’ve been to in a while where I had to do homework after the talks (which is a good thing).

There were some really good highlights from WSDM which I wanted to point out. I’m going to sit down with the papers and read through them this weekend.

These are the highlights in my opinion. There were some other really great papers here which deserve a lot of respect. It’s just that these specific papers apply to my direct work.

Crawl Ordering by Search Impact

We study how to prioritize the fetching of new pages under the objective of maximizing the quality of search results. In particular, our objective is to fetch new pages that have the most impact, where the impact of a page is equal to the number of times the page appears in the top K search results for queries, for some constant K, e.g., K = 10. Since the impact of a page depends on its relevance score for queries, which in turn depends on the page content, the main difficulty lies in estimating the impact of the page before actually fetching it. Hence, impact must be estimated based on the limited information that is available prior to fetching page content, e.g., the URL string, number of in-links, referring anchortext.
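
Stripped of the estimation machinery, the idea reduces to ordering the fetch queue by estimated impact instead of by discovery order. A rough sketch, where the scoring is a stand-in for the paper’s pre-fetch estimators (URL string, in-links, anchor text):

    import java.util.Comparator;
    import java.util.PriorityQueue;

    public class ImpactOrderedFrontier {
        public record Candidate(String url, double estimatedImpact) {}

        // highest estimated impact comes out first
        private final PriorityQueue<Candidate> frontier = new PriorityQueue<>(
            Comparator.comparingDouble(Candidate::estimatedImpact).reversed());

        public void discovered(String url, int inLinks, boolean anchorMatchesQueries) {
            // stand-in estimate built only from pre-fetch signals
            double estimate = Math.log1p(inLinks) + (anchorMatchesQueries ? 1.0 : 0.0);
            frontier.add(new Candidate(url, estimate));
        }

        public Candidate nextToFetch() {
            return frontier.poll();
        }
    }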

Ranking Web Sites with Real User Traffic

This was the highlight of the conference for me.

We analyze the traffic-weighted Web host graph obtained from a large sample of real Web users over about seven months. A number of interesting structural properties are revealed by this complex dynamic network, some in line with the well-studied boolean link host graph and others pointing to important differences. We find that while search is directly involved in a surprisingly small fraction of user clicks, it leads to a much larger fraction of all sites visited. The temporal traffic patterns display strong regularities, with a large portion of future requests being statistically predictable by past ones. Given the importance of topological measures such as PageRank in modeling user navigation, as well as their role in ranking sites for Web search, we use the traffic data to validate the PageRank random surfing model. The ranking obtained by the actual frequency with which a site is visited by users differs significantly from that approximated by the uniform surfing/teleportation behavior modeled by PageRank, especially for the most important sites. To interpret this finding, we consider each of the fundamental assumptions underlying PageRank and show how each is violated by actual user behavior.
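
For context, the uniform surfing/teleportation model being tested is the standard PageRank iteration. A compact sketch over an adjacency list (the damping factor and dangling-page handling are shown explicitly; this is the textbook model, not the paper’s code):

    import java.util.Arrays;
    import java.util.List;

    public class PageRank {
        // graph.get(i) lists the pages that page i links to; d is the damping factor
        public static double[] rank(List<List<Integer>> graph, double d, int iterations) {
            int n = graph.size();
            double[] pr = new double[n];
            Arrays.fill(pr, 1.0 / n);
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - d) / n);          // uniform teleportation
                for (int i = 0; i < n; i++) {
                    int out = graph.get(i).size();
                    if (out == 0) {                        // dangling page: spread evenly
                        for (int j = 0; j < n; j++) next[j] += d * pr[i] / n;
                    } else {
                        for (int j : graph.get(i)) next[j] += d * pr[i] / out;
                    }
                }
                pr = next;
            }
            return pr;
        }
    }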

A Scalable Pattern Mining Approach to Web Graph Compression with Communities

This paper seems interesting but the setup overhead necessary for smaller graphs might not justify the implementation costs. Huffman coding and Beta codes seem to go a long way towards an efficient graph storage representation.

A link server is a system designed to support efficient implementations of graph computations on the web graph. In this work, we present a compression scheme for the web graph specifically designed to accommodate community queries and other random access algorithms on link servers. We use a frequent pattern mining approach to extract meaningful connectivity formations. Our Virtual Node Miner achieves graph compression without sacrificing random access by generating virtual nodes from frequent itemsets in vertex adjacency lists. The mining phase guarantees scalability by bounding the pattern mining complexity to O(E log E). We facilitate global mining, relaxing the requirement for the graph to be sorted by URL, enabling discovery for both inter-domain as well as intra-domain patterns. As a consequence, the approach allows incremental graph updates. Further, it not only facilitates but can also expedite graph computations such as PageRank and local random walks by implementing them directly on the compressed graph. We demonstrate the effectiveness of the proposed approach on several publicly available large web graph data sets. Experimental results indicate that the proposed algorithm achieves a 10- to 15-fold compression on most real world web graph data sets.

Preferential Behavior in Online Groups

Online communities in the form of message boards, listservs, and newsgroups continue to represent a considerable amount of the social activity on the Internet. Every year thousands of groups flourish while others decline into relative obscurity; likewise, millions of members join a new community every year, some of whom will come to manage or moderate the conversation while others simply sit by the sidelines and observe. These processes of group formation, growth, and dissolution are central in social science, and in an online venue they have ramifications for the design and development of community software.

On Ranking Controversies in Wikipedia: Models and Evaluation

Wikipedia is a very large and successful Web 2.0 example. As the number of Wikipedia articles and contributors grows at a very fast pace, there are also increasing disputes occurring among the contributors. Disputes often happen in articles with controversial content. They also occur frequently among contributors who are “aggressive” or controversial in their personalities. In this paper, we aim to identify controversial articles in Wikipedia. We propose three models, namely the Basic model and two Controversy Rank (CR) models. These models draw clues from collaboration and edit history instead of interpreting the actual articles or edited content. While the Basic model only considers the amount of disputes within an article, the two Controversy Rank models extend the former by considering the relationships between articles and contributors. We also derived enhanced versions of these models by considering the age of articles. Our experiments on a collection of 19,456 Wikipedia articles shows that the Controversy Rank models can more effectively determine controversial articles compared to the Basic and other baseline models.

Finding High-Quality Content in Social Media

The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions—social media sites—becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content.

As an experiment I added the Spinn3r reference client to both Koders and Google Code Search to see how quickly they would index the source code.

Boy am I disappointed. It’s been nearly 1.5 months and neither Koders nor Google Code Search has indexed my content.

Sad sad sad.

Koders promised me they would add it 3 weeks ago. Nothing yet.

You’d think Google could at least get search right.

The papers from WSDM 2008 aren’t yet available on the Internet so I took the liberty of uploading them.

There’s some good stuff here. I’m going to blog my notes for a few of these talks once I have the time to review the official papers.

I wonder if I should charge $5 a copy like the ACM. (joke)

It appears Feedster is no more. Off into the Internet nether-world alongside Kozmo, Pets.com, and Webvan.

I’ve been monitoring the situation for a week and they haven’t been responding to ICMP packets the entire time.

At the very minimum you’d THINK they’d give people a heads up. Put up a ‘we’re going away’ page at least…

What’s even more confusing is that the normal sites (GigaOM, Techcrunch, etc) haven’t yet picked up on the story.

Update: I forgot to mention. A number of bloggers have approached me about this offline and have noticed the same thing.

The question is whether they’re going to go to Technorati or just go right to Google Blog Search.

Update 2: More confirmation. It’s been removed from Ping-O-Matic.

Update 3: After more digging my trusty (but anonymous) source is telling me that someone is trying to acquire Feedster. This might be interesting. It’s only one source for the moment.

If Feedster is REALLY on its last legs then it makes sense that someone might try to pick up the technology on the cheap.

Update 4: BTW… if you were working with Feedster for blog data and now are high and dry we’d LOVE to help you out over at Spinn3r.

Om is asserting that Google’s infrastructure is a huge competitive advantage:

To sum it up, Google’s gigantic infrastructure is the big barrier to entry for its rivals, and will remain so, as long as the company keeps spending billions on it. That said, there’s another thing Google could learn from Dell: Maintain the quality of your search results — customers will only put up with shoddiness for so long.

I think this is the biggest problem for all the startups in this space. They need to scale. Time and time again when I talk to startups their biggest issue is scaling their backend.

Ethan seems to agree:

A huge portion of Google’s opex is people, and many of those people are the systems guys who built fundamental software infrastructure like UNIX, C, and TCP/IP. Those guys aren’t there for their halo effect – they’re there, despite Google’s youth bias, to build software infrastructure.

A lot of the design of Spinn3r has been around scalability. We index a LOT of data and if we can’t process it fast enough we won’t be able to serve our customers.

Our infrastructure is starting to REALLY come into its own now and we’re starting to think about other things we can accomplish with the platform. I can only imagine what Google’s thinking.

Google – The End is Nigh

How perfect is that!?


Bigtable and C

There’s been a lot of activity in the distributed database space in the last few weeks.

First was KFS (Kosmos FS) and now Powerset brings us Hadoop.

I’ve been thinking about this a lot recently, and I think Java is the wrong language in which to design distributed databases (or any database in general).

I’m specifically talking about the on-disk persistence engine.

The main problems are the implementations of sendfile, async and event IO, memory management, and implementation details such as access to mlock.

Java’s VM is one problematic area. Once the VM allocates memory it doesn’t want to let it go. Then there’s the problem that there’s no implementation of mlockall for Java. One could write an implementation in JNI but then you run into other problems with lack of access to other JNI libraries.

C just isn’t that hard. For a small and tight database implementation like GFS or Bigtable it seems to just make more sense to implement it in C.

Memcached and lighttpd are great examples of what I’m talking about. They’re small, thin, and get the job done.

This is a big day for us. We’re announcing new versions of both Tailrank and Spinn3r.

The first big announcement is Spinn3r 2.0:

After nearly a year in development, I’m pleased to announce the release of Spinn3r 2.0.

We’ve also been heads down working on Tailrank and are announcing Tailrank 2.5 today as well.

All of this has been possible due to the sheer amount of work we’ve invested into our software and hardware infrastructure. We’re pretty ambitious and now that we’ve completed the majority of our infrastructure work we can ship more applications at a faster rate.

And of course Tailrank 2.5:

Not only are we announcing Spinn3r 2.0 today but we’re announcing that a new version of Tailrank is being released as well.

If you’ve been a regular reader of Tailrank over the last few months you might have noticed a number of incremental improvements. Tailrank 2.5 is far more evolutionary than revolutionary.

We’ve spent a lot of time focusing on the accuracy of Tailrank’s core internal algorithms. What works for one blog, or even 1M blogs in our index, tends to fail from time to time when working on 12M blogs.

You should read both blog posts because there’s a lot more detail here.

We’re going to follow up with a Spinn3r 2.1 release in a couple of weeks. There were a few more features we wanted to integrate but weren’t able to ship at the last minute.

Update:

Another point I forgot to mention. We’re making Spinn3r available 100% free of charge for researchers! Should be interesting to see what happens here!

Search Engines at SIMS

Check out these super smart dudes talking search engines at SIMS.

I’m going to have to watch this one about web spam: