Archive for the ‘google’ Category

What’s up with this? This makes me crazy :-P

Google releases the Android Open Accessory API but fails to ship ANT+ support?

ONLY USB and no Bluetooth for now.

This is amazingly LAME.

Sony can ship ANT+ for Android but Google can’t?

The demo they gave on screen, with an Android game monitoring the pace of the bike, could have been done directly over ANT+, an existing open wireless standard.

In fact, my new Trek Madone 5.5 ACTUALLY HAS AN INTEGRATED ANT+ ALREADY.

I could download the existing game and throw my bike on a trainer and actually play the game via existing hardware using an open wireless standard with existing technology.

So now my bike needs to have a USB port? Lame.

For Google to adopt a new API but ignore existing open wireless standards is amazingly lame.

This is pretty nice. Google released Zippy as open source, under the name Snappy:

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems. (Snappy has previously been referred to as “Zippy” in some presentations and the likes.)

This means that along with open-vcdiff it is possible to use the full Google compression tool chain.

The War for Engineers

There is definitely a war in Silicon Valley (and SF) for top talent.

Todd McKinnon, the co-founder and chief executive of San Francisco software start-up Okta Inc., calls the competition for top talent “a war.” His company, which this year raised $10 million in Series A funding from venture firm Andreessen Horowitz and angel investors, plans to spend 80% of its new capital on salaries, mostly for engineers.

The best strategy (other than cash) is to have a great company with perks for engineers including creativity, working on cool projects, etc. This can go a LONG way to attracting top talent.

CR48 Thoughts and Feedback

I’m going to keep a running blog post on CR48 thoughts as I use this for my day to day work (or at least try to):

– as a software engineer it’s hard to use fully as I live in Emacs.

– There needs to be some solution for running something like Pidgin in the background. Isn’t this a cloud app? Google Talk isn’t the only IM provider. It’s ironic that the Google Talk client only runs on Windows.

– The trackpad is a bit funky to get used to.

– I don’t like the lower case keys on the keyboard. I’m used to >= 15 years of upper case keys and my brain doesn’t recognize them.

– I like the keyboard and feel of the case. It’s nice and soft.

– The power charger is hugely retro and ugly. I want this thing to have an Apple mag connector but I guess Apple has patents on that :-(

– I want site specific browsers, not just tabs. I need to really easily switch to gmail and not just search for it among a sea of tabs.

– Gtalk only works inside Gmail (again) and of course does not support OTR, so when I log in to Gmail on my CR48 all of my contacts start communicating with me over encrypted IM, and that is FAIL.

– For a secondary machine it’s actually pretty decent. This is how I’m using it now. Right now I have a movie I downloaded playing on my MBP and I’m watching it in the background while I work on the CR48.

We have a number of other pending announcements of researchers building cool applications with Spinn3r but this one was just too awesome to hold back.

Researchers at Cornell have developed a new memetracker (cleverly named MemeTracker) powered by Spinn3r.

Jure Leskovec, Lars Backstrom and Jon Kleinberg (author of the HITS algorithm, among other things) built MemeTracker by tracking the hottest quotes from throughout the blogosphere, grouping the quotes, and rendering a graph of the number of references to each group over time.

MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories per day from 1 million online sources, ranging from mass media to personal blogs.

We track the quotes and phrases that appear most frequently over time across this entire spectrum. This makes it possible to see how different stories compete for news coverage each day, and how certain stories persist while others fade quickly.

The plot above shows the frequency of the top 100 quotes in the news over time, for roughly the past two months.

Here’s a screenshot but you should definitely play with MemeTracker to see how it works:


We’ve been thinking of shipping a new API for tracking quotes across the blogosphere. Our new change tracking algorithm for finding duplicate content also does an excellent job of finding quotes.

Tracking duplicate content turns out to be very important in spam prevention and ranking. It just so happens that there’s a number of overlapping features and technologies that these things can provide.
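Here’s a rough sketch of the quote-tracking idea in Python: pull quoted phrases out of each post, normalize them, and count how many posts mention each one. The regex, the 20-character minimum, and the function names are my own assumptions for illustration. This is not MemeTracker’s algorithm and not the Spinn3r change tracker.

```python
import re
from collections import Counter

# Matches text between straight or curly double quotes.
# The 20-character minimum is an arbitrary cutoff to skip short scare quotes.
QUOTE_RE = re.compile(r'["“]([^"“”]{20,})["”]')


def extract_quotes(text):
    """Return the quoted phrases in text, lowercased with whitespace collapsed."""
    return [" ".join(match.lower().split()) for match in QUOTE_RE.findall(text)]


def count_quote_mentions(documents):
    """Count how many documents mention each quote (at most once per document)."""
    counts = Counter()
    for doc in documents:
        counts.update(set(extract_quotes(doc)))
    return counts


if __name__ == "__main__":
    posts = [
        'The CEO said "we will ship the new release next week" on stage.',
        "Others repeated it: “We will ship the  new release next week”.",
        'A short "no" does not count.',
    ]
    for quote, mentions in count_quote_mentions(posts).most_common():
        print(mentions, quote)
```

A real system would also have to group near-duplicate variants of the same quote, which exact matching like this cannot do.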

We’re not ready to ship it just yet because the backend requires about 2TB of random access data. This isn’t exactly cheap so we’ve been experimenting with some new algorithms and hardware to bring down the pricing. I think we’ll be able to ship something along these lines once we get our next big release out the door.

A few months ago, when I was heads down finalizing the distributed database in Spinn3r, I was exceedingly curious about what other DBs are using for compression.

GZip seems to be the obvious choice but its compression speed isn’t very good when compared to LZO.

Your disks are almost certainly going to be bottlenecked on IO (if you have a good DB design), so compressing the data means you can trade CPU (which will almost certainly be idle) for less disk IO.
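To make the CPU-for-IO trade concrete, here’s a minimal Python sketch that measures compression ratio and time at two zlib levels. zlib is only a stand-in here (it ships with Python, LZO does not), and the sample data is made up, so treat whatever numbers it prints as illustrative rather than as a benchmark.

```python
import time
import zlib

# Made-up, repetitive sample data standing in for database rows.
SAMPLE = b"".join(
    b"row:%d|host=www.example.com|path=/index.html|status=200\n" % i
    for i in range(20000)
)


def measure(data, level):
    """Compress data at the given zlib level; return (ratio, seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Sanity check: the data must survive a round trip.
    assert zlib.decompress(packed) == data
    return len(packed) / len(data), elapsed


if __name__ == "__main__":
    for level in (1, 9):
        ratio, seconds = measure(SAMPLE, level)
        print(f"level {level}: {ratio:.1%} of original size in {seconds * 1000:.1f} ms")
```

The lower level spends less CPU per byte and usually gives a worse ratio. Whether that is a good deal depends on how idle your CPUs really are.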

I remembered some notes about compression in the original Bigtable paper and decided to dig a bit deeper.

Apparently, there isn’t much information about what Google uses for compression in Bigtable, GFS, etc.

These notes were compiled from Jeff Dean’s talk in 2005 but I haven’t seen anything else referencing the subject.

Skip to 46:00 in the Dean talk to see the compression notes.

Andrew Hitchcock also took some notes on the talk:

There is a lot of redundant data in their system (especially through time), so they make heavy use of compression. He went kind of fast and I only followed part of it, so I’m just going to give an overview. Their compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s). Zippy is similar to LZW. It doesn’t compresses as highly as LZW or gzip, but it is much faster. He gave an example of a web crawl they compressed with the system. The crawl contained 2.1B pages and the rows were named in the following form: “com.cnn.www/index.html:http”. The size of the uncompressed web pages was 45.1 TB and the compressed size was 4.2 TB, yielding a compressed size of only 9.2%. The links data compressed to 13.9% and the anchors data compressed to 12.7% the original size.

Google is using an algorithm named BMDiff, described in Bentley and McIlroy’s DCC ’99 paper “Data Compression Using Long Common Strings”.

They use BMDiff to compute a dictionary diff between all columns in a column family. This way common strings between columns can be stored in a compressed dictionary to avoid duplicate storage.

This also helps to diff between previous versions of a page across compactions. A page stored in your index will probably have a LOT in common with the same page stored a month ago.

They then run the BMDiff output through Zippy (another compression algorithm they wrote). Apparently, it’s a tuned version of LZO.
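The two-stage shape (diff against long common strings first, then a fast general-purpose compressor) can be sketched in Python. To be clear about the assumptions: this uses difflib as a stand-in for BMDiff and zlib as a stand-in for Zippy, and the copy/insert byte format is something I made up for the example. It is not Google’s implementation.

```python
import struct
import zlib
from difflib import SequenceMatcher


def delta_encode(previous: bytes, current: bytes) -> bytes:
    """Encode current as copy/insert operations against previous."""
    out = bytearray()
    matcher = SequenceMatcher(None, previous, current, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Copy a run that already exists in the previous version.
            out += b"C" + struct.pack(">II", i1, i2 - i1)
        elif tag in ("replace", "insert"):
            # Store bytes that only exist in the current version.
            literal = current[j1:j2]
            out += b"I" + struct.pack(">I", len(literal)) + literal
        # "delete" needs no operation: those bytes are simply never copied.
    return bytes(out)


def delta_decode(previous: bytes, delta: bytes) -> bytes:
    """Rebuild the current version from the previous version and a delta."""
    out = bytearray()
    pos = 0
    while pos < len(delta):
        op = delta[pos:pos + 1]
        pos += 1
        if op == b"C":
            start, length = struct.unpack_from(">II", delta, pos)
            pos += 8
            out += previous[start:start + length]
        else:
            (length,) = struct.unpack_from(">I", delta, pos)
            pos += 4
            out += delta[pos:pos + length]
            pos += length
    return bytes(out)


def store(previous: bytes, current: bytes) -> bytes:
    """Stage 1: diff against the older version. Stage 2: general compression."""
    return zlib.compress(delta_encode(previous, current))


def load(previous: bytes, stored: bytes) -> bytes:
    """Undo both stages."""
    return delta_decode(previous, zlib.decompress(stored))
```

For a page that barely changed since the last crawl, the stored delta is a handful of copy operations plus the new bytes, which is why diffing across versions pays off so well.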

I’d like to see MySQL/Drizzle support more higher-level DB primitives directly rather than having to build support for these above the DB level.

The zlib compress/uncompress support in MySQL is horrible (binary data is not compatible with other zlib implementations).

Supporting bmdiff, lzo, bloom filters, etc in DBs is going to be necessary to have drizzle support larger distributed databases.

There are a few UDFs I want to write so maybe I’ll take these on at the same time.

Come to think of it, crypto support isn’t that hot in MySQL either.





Google is showing Feedburner redirect URLs in their search results.


They’re using the link:

which is a Feedburner redirect URL which they use in RSS feeds to help in tracking.

Google owns Feedburner so it’s a bit embarrassing that they’re making such an obvious mistake.

This might be distorting the stats for the LA Times. Feedburner may in fact be ignoring these when they see that the HTTP referrer is in fact from Google (and not empty or from a web based reader).

The biggest problem with the implementation that Feedburner is using is that it’s impossible to reconstruct the original URL from the redirect URL once it’s in the wild.

A few years ago (when they were just a newly hatched startup and I was working on Rojo) I proposed that they use a URL that encodes the target URL and only adds about 30 additional characters.

The template would be:$nonce/$site/$path

A search engine like Google or Spinn3r could use the URL found in the wild, decode the correct target URL, and then update their index (and rank) to reflect the actual URL.
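Here’s a minimal sketch of that template in Python. The redirect host (redirect.example.com), the 8-character hex nonce, and the assumption that every target is plain http are mine, not Feedburner’s. The point is only that the target URL can be recovered from the redirect URL without a lookup.

```python
import secrets
from urllib.parse import urlsplit

# Hypothetical redirect host, used only for this example.
REDIRECT_HOST = "redirect.example.com"


def encode_redirect(target_url, nonce=None):
    """Build a tracking URL of the form http://host/$nonce/$site/$path."""
    parts = urlsplit(target_url)
    nonce = nonce or secrets.token_hex(4)  # 8 hex characters
    path = parts.path.lstrip("/")
    if parts.query:
        path += "?" + parts.query
    return f"http://{REDIRECT_HOST}/{nonce}/{parts.netloc}/{path}"


def decode_redirect(redirect_url):
    """Recover the target URL from a redirect URL (assumes an http target)."""
    parts = urlsplit(redirect_url)
    _nonce, site, *rest = parts.path.lstrip("/").split("/", 2)
    target = f"http://{site}/{rest[0] if rest else ''}"
    if parts.query:
        target += "?" + parts.query
    return target


if __name__ == "__main__":
    wrapped = encode_redirect("http://www.example.com/news/story.html?page=2")
    print(wrapped)
    print(decode_redirect(wrapped))
```

With a 20-character host, an 8-character nonce and the extra slashes, the overhead lands at roughly 30 characters.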

Looks like Cuil might be hitting websites too hard with their crawler:

“I don’t know what spawned it, but when Cuil attempts to index a site, it does so by completely hammering it with traffic,” the tipster wrote. “So much, that it completely brings the site down. We’re 24 hours into this “index” of the site, and I’ve had to restrict traffic to the site down to 2 packets per second, while discarding the rest, or otherwise it makes the site unusable.”

The Admin Zone forums are abuzz over Cuil’s overzealous method for indexing. Countless posters on the site have said that their websites have been brought down because of the Twiceler robot and one user said it “leeched enormous amounts of bandwidth — nearly 2GB this month until it was blocked. It visited nearly 70,000 times!”

One of the benefits to Spinn3r is that we crawl for multiple customers (and we’re super polite).

Cuil might be a bit more aggressive than normal crawlers for now in that they have to backfill archives. Google has the benefit of being around for a while and doesn’t need to drink from the firehose as much as Cuil.
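For what “polite” means in practice, here’s a minimal per-host rate limit sketch in Python. The class name and the 10-second default delay are assumptions for the example, and this is not how Spinn3r or Twiceler is actually implemented. A real crawler would also honor robots.txt.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next fetch is allowed."""

    def __init__(self, min_delay_seconds=10.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> timestamp of the next allowed fetch

    def seconds_until_allowed(self, url, now):
        """How long the crawler should wait before fetching url."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Call right after fetching url so the host gets a breather."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


if __name__ == "__main__":
    gate = PolitenessGate(min_delay_seconds=10.0)
    gate.record_fetch("http://blog.example.com/post/1", now=100.0)
    print(gate.seconds_until_allowed("http://blog.example.com/post/2", now=103.0))
    print(gate.seconds_until_allowed("http://other.example.org/", now=103.0))
```

The clock is passed in as `now` so the scheduler stays easy to test. The delay is keyed on host rather than URL, so one site never sees more than one request per interval no matter how many of its pages are queued.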

Ha….. this is what I get if I try to share an item.

I guess they’re using AHAH?


This press release on Mtron’s sites is interesting:

In the second half of 2008, Intel will release both high-performance SSDs for use in servers and storage, and exclusive SSD models for consumer electronics; the industry is thus watching whether Google will introduce SSD systems.

Google reportedly will be supplied with Intel’s SSD-embedded storage devices at the end of the second quarter to be applied to its search systems.

There’s been more activity in the distributed consensus space recently.

At the Hypertable talk yesterday Doug mentioned Hyperspace, their Chubby-style distributed lock manager. Though I think it’s missing the ‘distributed’ part for now.

To provide some level of high availability, Hypertable needs something akin to Chubby. We’ve decided to call this service Hyperspace. Initially we plan to implement this service as a single server. This single server implementation will later be replaced with a replicated version based on Paxos or the Spread toolkit.

ZooKeeper seems to be making some progress as well.

Check out this recent video presentation (which I probably can’t embed so here’s the link).

In 2006 we were building distributed applications that needed a master, aka coordinator, aka controller to manage the sub processes of the applications. It was a scenario that we had encountered before and something that we saw repeated over and over again inside and outside of Yahoo!.

For example, we have an application that consists of a bunch of processes. Each process needs be aware of other processes in the system. The processes need to know how requests are partitioned among the processes. They need to be aware of configuration changes and failures. Generally an application specific central control process manages these needs, but generally these control programs are specific to applications and thus represent a recurring development cost for each distributed application. Because each control program is rewritten it doesn’t get the investment of development time to become truly robust, making it an unreliable single point of failure.

We developed ZooKeeper to be a generic coordination service that can be used in a variety of applications. The API consists of less than a dozen functions and mimics the familiar file system API. Because it is used by many applications we can spend time making robust and resilient to server failures. We also designed it to have good performance so that it can be used extensively by applications to do fine grained coordination.

We use a lock coordinator in Spinn3r and are very happy with the results. It’s a very simple system, so it provides a LOT of functionality without much pain and maintenance.
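To show how little a single-server lock coordinator needs, here’s a minimal lease-based sketch in Python. It is not Spinn3r’s coordinator, Chubby, Hyperspace, or ZooKeeper. The class and method names are made up, and it ignores replication, persistence, and thread safety entirely.

```python
class LockService:
    """Single-process lock table where every lock is a lease that expires."""

    def __init__(self):
        self._leases = {}  # lock name -> (owner, expires_at)

    def acquire(self, name, owner, ttl, now):
        """Try to take (or renew) a lock; returns True on success."""
        holder = self._leases.get(name)
        if holder is not None:
            current_owner, expires_at = holder
            if expires_at > now and current_owner != owner:
                return False  # someone else holds a live lease
        self._leases[name] = (owner, now + ttl)
        return True

    def release(self, name, owner):
        """Release a lock, but only if owner is the current holder."""
        holder = self._leases.get(name)
        if holder is not None and holder[0] == owner:
            del self._leases[name]
            return True
        return False


if __name__ == "__main__":
    locks = LockService()
    print(locks.acquire("master", "node-a", ttl=10, now=0))   # node-a becomes master
    print(locks.acquire("master", "node-b", ttl=10, now=5))   # refused, lease still live
    print(locks.acquire("master", "node-b", ttl=10, now=11))  # lease expired, node-b takes over
```

The leases are what make this usable for master election: if the holder dies, the lock frees itself once the lease runs out.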

Paxos made live is out as well. (I haven’t had time to read it yet).

– It seems the biggest scalability issue in InnoDB has to do with its excessive use of inefficient mutexes to protect data structures. Turning down innodb_thread_concurrency should actually help performance on multi-core boxes.

– The performance problems really start to hit at about eight cores. Four cores is mostly fine but still takes a performance hit.

– Google implemented their mutexes using x86-specific compare and swap. Apparently, Monty is working on a CAS portability library. An audit of the room showed that 99% of the people are running on x86 anyway, so this might not be an issue.

– If you’re having CPU issues, upgrade to > 5.0.30. There’s another fix in > 5.0.54 which is interesting.

– Google has also replaced the innodb malloc heap with a scalable malloc library (tcmalloc). For larger buffer pools this might make a big difference.

– MySQL 6 separates threads from connections. Google will backport this patch… (Awesome)

More notes from others are available as well.

Ouch. So much for upgrading to WordPress 2.5 for a secure version of WordPress.

While the shift is going in the right direction it might not fully fix the problem now that this exploit is known. (thanks to Ian for pointing this out).

WordPress is prone to multiple SQL-injection vulnerabilities because it fails to sufficiently sanitize user-supplied data before using it in an SQL query.

Exploiting these issues could allow an attacker to compromise the application, access or modify data, or exploit latent vulnerabilities in the underlying database.

WordPress 2.5 is vulnerable; other versions may also be affected.

… and check out the affected versions:

WordPress 2.3.1
WordPress 2.2.3
WordPress 2.2.2
WordPress 2.2.1
WordPress 2.1.3
WordPress 2.1.2
WordPress 2.1.1
WordPress 2.0.10
WordPress 2.0.7
WordPress 2.0.6
WordPress 2.0.5
WordPress 2.0.4
WordPress 2.0.3
WordPress 2.0.2
WordPress 2.0.1
WordPress 2.0
WordPress 2.5
WordPress 2.3
WordPress 2.2 Revision 5003
WordPress 2.2 Revision 5002
WordPress 2.2
WordPress 2.1.3-RC2
WordPress 2.1.3-RC1
WordPress 2.1
WordPress 2.0.10-RC2
WordPress 2.0.10-RC1
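The advisory above boils down to user input being pasted into SQL strings. Here’s a minimal illustration of the difference between that and a parameterized query, written in Python with sqlite3 rather than WordPress’s PHP and MySQL. The table, columns, and function names are made up for the example, and this is not the actual WordPress bug.

```python
import sqlite3


def make_db():
    """Create a tiny in-memory database with two posts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
    conn.executemany(
        "INSERT INTO posts (author, title) VALUES (?, ?)",
        [("alice", "Hello"), ("bob", "Draft")],
    )
    return conn


def find_titles_unsafe(conn, author):
    """Vulnerable: user input is concatenated straight into the SQL text."""
    query = "SELECT title FROM posts WHERE author = '" + author + "'"
    return [row[0] for row in conn.execute(query)]


def find_titles_safe(conn, author):
    """Safer: the driver passes the value separately from the SQL text."""
    query = "SELECT title FROM posts WHERE author = ?"
    return [row[0] for row in conn.execute(query, (author,))]


if __name__ == "__main__":
    conn = make_db()
    payload = "nobody' OR '1'='1"
    print(find_titles_unsafe(conn, payload))  # the injected OR clause matches every row
    print(find_titles_safe(conn, payload))    # treated as a literal author name
```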

Technorati published more information on the WordPress blog spam cancer that’s spreading around the Internet.

If you’re running a version of WordPress less than 2.5 you need to stop what you’re doing NOW and upgrade! Don’t wait until your blog is compromised.

The blogosphere has had its share of maladies before. Comment spam, trackback spam, splogs and link trading schemes are the colds and flus that we’ve come to know and groan about. But lately, a cancer has afflicted the ecosystem that has led us at Technorati to take some drastic measures. Thousands of WordPress installations out in the wilds of the web are vulnerable to security compromises, they are being actively exploited and we’re not going to index them until they’re fixed.

We know about them at Technorati because part of what we do is count links. Compromised blogs have been coming to our attention because they have unusually high outbound links to spam destinations. The blog authors are usually unaware that they’ve been p0wned because the links are hidden with style attributes to obscure their visibility. Some bloggers only find out when they’ve been dropped by Google, this WordPress user wrote

I’ve reached out to Ian Kallen to offer collaboration on fixing this issue.

We’re going to push out a point release of Spinn3r to block blogs that exhibit this spam problem.
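One cheap signal for this kind of compromise is a page with lots of links hidden via style attributes. Here’s a rough Python sketch that counts visible and hidden links in a page. The class name and the two style checks are my assumptions for illustration, it assumes reasonably well-formed HTML, and it is not the filter that Spinn3r or Technorati actually ships.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenLinkCounter(HTMLParser):
    """Counts links that sit inside elements hidden with inline styles."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one bool per open element: is it (or an ancestor) hidden?
        self.hidden_links = 0
        self.visible_links = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden_here = "display:none" in style or "visibility:hidden" in style
        hidden = hidden_here or (bool(self.stack) and self.stack[-1])
        if tag == "a" and attrs.get("href"):
            if hidden:
                self.hidden_links += 1
            else:
                self.visible_links += 1
        if tag not in VOID_TAGS:
            self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def count_links(html):
    """Return (visible_links, hidden_links) for an HTML string."""
    parser = HiddenLinkCounter()
    parser.feed(html)
    return parser.visible_links, parser.hidden_links
```

A blog whose pages suddenly show dozens of hidden outbound links is a good candidate for a closer look before it gets indexed.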

It’s such a rare event to have hundreds of thousands of weblogs compromised in a systematic manner.

We’ve been covering a massive blog spam epidemic thanks to a nasty/evil spammer who’s exploiting an XML-RPC bug in WordPress 2.2.

This issue is FINALLY getting the attention it deserves:

I had a closer look at many of the blogs concerned that had spammy content — pages promoting credit cards, pharmaceuticals and the like, and I realized that if you go to the root domain they are all legitimate blogs. Not scraper blogs that were being auto-generated with adsense / affiliate links, which was extremely curious, and actually reminiscient of something that hit home a few months ago.

A few months ago, this blog got hacked — but in a sneaky way. Not only did the hackers insert “invisible” code into my template, so that I was getting listed in Google for all manner of sneaky (and NSFW terms), so that people could click on those links with the hacker getting the affiliate cash — but *actually*, said hackers also inserted fake tempates into my wordpress theme.

Center Networks is also covering this issue…

Oddly enough Tailrank picks up on this spam because of our clustering algorithm. We cluster common links and terms via our blog index and promote these stories to our front page.

Since we ‘trust’ sources based on their past behavior, when major A-list blogs like ZDNet get owned we believe their links are legitimate.

If we had a smaller index this might be a bit easier to handle but we’re indexing 12M blogs within Tailrank and on Spinn3r.

Another way around this of course would be to blacklist every blog running WordPress 2.2 or earlier but we’re talking millions of blogs here and we don’t want to unfairly harm anyone.

To date our approach has been to wait until Tailrank has identified the spam, and then blacklist any blogs that have been compromised.

Unfortunately this is a war of attrition with the spammer just spending a few more days and hacking another dozen or so sites.

The only positive aspect of this is that it’s encouraging people to upgrade to WordPress 2.5.

We’re also working on some secondary algorithms to catch this a bit sooner and we’ll probably ship these in Spinn3r 2.5 which is due shortly.

Is Google going to release Bigtable to the public tomorrow?

That’s the rumor going around:

My guess is that Google will be announcing the launch of web services that will compete head on with those offered by Amazon and others. The anchor for these services, we hear, is their internal database system called BigTable. Google has definitely briefed press on the imminent launch of BigTable as a web service, although as we said last week we haven’t been contacted.

I’ve heard a few rumors from multiple parties today (via the back channel).

It sounds like the press is being briefed on a new service Google is about to release to developers — possibly as early as this week. The service is called BigTable, and it has been proving itself for quite a while as the storage engine behind many Google services.

I called this a few months back when I was at a Google open house event:

An audience member went up to the microphone and asked if Google had plans to provide BigTable, GFS, and MapReduce to the public as a web service. Larry looked RIGHT at Jeff Dean as if to say “if only they knew what we know”. I was in Larry’s direct line of sight so the look was plain as day.

It seems inevitable that Google will provide a similar feature (especially with Amazon doing it) but I think the main issue is a question of time.

… and others seem to agree.

Google has officially jumped the rel=nofollow shark.

Google Sites is live and every page on the site uses rel=nofollow.

I just created a sample site and linked to my blog only to be presented with the following HTML:

<a href="" rel="nofollow"></a>

Is this the future of the web? Every URL is going to have rel=nofollow?

Google Sites isn’t alone. Google Finance uses rel=nofollow as does Google Code.

The rel=nofollow attribute is a cancer that’s destroying the link graph.

Every URL I create is going to be blocked from link based trust metrics like PageRank? That’s just dumb. I’d rather use another wiki system that doesn’t penalize my linking behavior.
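To see what this does to a link graph, here’s a small Python sketch of how a crawler might split a page’s links into followed and nofollowed ones. The class name is made up and this is only my guess at the general shape. It is not how Google or Spinn3r actually processes links.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Separates links a ranking algorithm may count from nofollowed ones."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="external nofollow".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed, nofollowed) link lists for an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.followed, parser.nofollowed
```

On a site where every link carries rel=nofollow, the followed list is always empty, so nothing that page links to gets any credit in a link-based metric.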

I realize that your intention is to fight spam but you should pursue an algorithmic approach. Blacklisting the entire Internet is NOT the solution.

It’s clear by now that Google uses other metrics for page ranking (almost certainly including HTTP traffic monitoring by now) so this isn’t the end of the world.

Linking is the whole point of the Internet! Creating road blocks for EVERY LINK in the system is the antithesis of a free and open web!

Update: I’m not the only blog covering this topic. Techcrunch, Search Engine Land, and Business Week have more.

Update 2: Ross Mayfield agrees:

I’m glad Kevin is saying nofollow is not the web (outside of the blogs and comments it was designed for) and in this world view you can only give Google Juice and it doesn’t give back. Such a view and action only favors those in a dominant network position.

Update 3: Mashable covers this in their Mashable Conversations podcast

Over a week ago I submitted our Open Source Spinn3r client to both Koders and Google Code Search.

The result? Nothing.

Neither one of them has indexed Spinn3r yet, and it doesn’t show up in either search engine.

Not very impressive guys.

I pinged the Koders guys and hopefully I’ll get a response. Of course this is a lot better than Google Code Search. Their ‘discuss’ link at the bottom of the page 404s.

Update: I could possibly use Google Code sitemap extensions but I shouldn’t have to if they have a manual form for specifying the SVN repo.

Update 2: Koders says they do an index push once per month (which seems a bit slow) and that they’ll have it in next week.

I wonder how long it takes Google Code Search to push an index.

Looks like NewsGator might have just shipped a NewsMonster-style distributed reputation system.

From NewsGator’s announcement:

“It’s all about ubiquity,” said Greg Reinacker, NewsGator CTO and founder. “We have more than 100 Fortune 2000 companies using NewsGator Enterprise Server and our client products. In selling to these enterprises, we discovered that thousands of knowledge workers were already using one or more of our client products and we learned that we could drive the relevance of everyone’s experience by using the community’s anonymous content consumption patterns throughout the system. In general, we found that the more people that used our system, the more relevant we could make the product for each user. By making it easier for knowledge workers to use our clients we dramatically increase the size of our user community. Enterprises that then deploy our server can take advantage of the synchronization and increased relevance for every user supported by the system. Likewise, we can extend these capabilities to our online platform, which currently serves well over 1 million consumers and indexes 7 million new articles per day. The result is tremendous value and continued innovation for both consumer and enterprise users.”

From the NewsMonster documentation:

NewsMonster allows for the creation of a trust network of worthy bloggers which is managed by the user. NewsMonster then uses this network to build a popularity index of recent events and aggregated RSS content by reputation.

NewsMonster observes decisions of users and publishes implicit certifications/reputes into the users online profile. These are then shared with other NewsMonster peers and aggregated to form a relevance network and collaborative filter. NewsMonster pays attention to what you are currently interested in and recommends articles for you based on what you should be interested in.

In layman’s terms – there is a lot of relevance information that can be pulled from your subscriptions and shared with your friends. It doesn’t make sense for two NewsMonster users to simultaneously scan through blogs, often duplicating effort, trying to find interesting content. Why not use each other to make finding news easier? Divide and conquer!

NewsMonster used similar shared profile access to share these certifications with other NewsMonster users. We published the data onto servers (thanks Brewster) and other NewsMonster clients then aggregated the data and computed local reputation with a trust metric I developed.

This was all before Spinn3r, Tailrank, Techmeme, Reddit, Digg, Bloggrunner, etc.

Unfortunately, I was never able to finish up the full system. The NewsMonster assets were purchased by Rojo when we started the company. We then switched focus towards and I was never able to find the spare cycles to finish up NewsMonster.

Google Reader is pushing some of these ideas. Nice to see NewsGator moving in this direction as well.

I still think it’s a bit too early. I’m not even sure it’s a mainstream product just yet.

Selling into enterprises is a good idea though.

On Wikia

Wikia launched today. I’m going to hold my tongue. It’s hard to compare anything to Google as they’ve generally done a stellar job to date.

TechCrunch hates it. I’m not sure what Wikia could have done better here.

They did try to manage expectations:

To be fair, CEO Gil Penchina warned me it wouldn’t be a great product at launch. It’s simply a proof of concept of what can be created using open source software and little money, he says. Fair enough. But it’s time for Wales to be quiet, let this thing evolve or not, and eventually let the software do the talking.

Of course there are some obvious problems here. Searches for Viagra, Levitra, Mortgage, etc. all turn up spam.

Which makes sense. They used their crawler Grub (similar to Spinn3r) to aggregate content but they’re only using a simple term weighting algorithm.

Basically it’s search à la 1995.

Perhaps this is just due to the rise in Apple expectations. You can’t release a product until it’s perfectly polished. No more evolutionary adaptation unless it’s in another major rev of the product.

They’re at least direct about it:

We are aware that the quality of the search results is low..

Wikia’s search engine concept is that of trusted user feedback from a community of users acting together in an open, transparent, public way. Of course, before we start, we have no user feedback data. So the results are pretty bad. But we expect them to improve rapidly in coming weeks, so please bookmark the site and return often.

Perhaps Mahalo should recognize this as a shot across their bow and ship their own search engine?

Venture Beat has more:

According to Wales, Search Wikia’s primary innovation will be to tie a user’s social network – that is, information about the user and their friends – into search results. The idea is that a user and their friends share a common set of preferences and that using that information makes search results more personalized as well as more relevant. More on that in a second.

Let’s assume for a moment that Wikia can pull it off. They build this great community and prove their idea works.

Why can’t Google just HIRE these people away and have them improve the rankings of Google?