Engineering Open House At Google – Rumor: Google will Release BigTable/GFS/MapReduce to the Public
I was invited to a Google Engineering Open House yesterday and heard more discussions about Google’s infrastructure from Jeff Dean.
Not to disappoint, they lifted the kimono a little bit and released some more data points on their impressive infrastructure.
For example, Google Translator’s accuracy improves by 0.5% for every doubling of the training corpus size. I assume the curve is asymptotic, so the gain per doubling will start to approach zero after a few more doublings (I’d love to see the graph of the accuracy rate).
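To put rough numbers on that data point, here is a back-of-the-envelope sketch. It rests on two assumptions of mine, not anything shown at the talk: that "0.5%" means 0.5 percentage points, and that the gain per doubling stays constant (which is exactly the part I doubt holds forever). The function names are made up for illustration.

```python
import math

# Assumption: the quoted figure is 0.5 percentage points per doubling,
# and it stays constant (a log-linear model, not a measured curve).
GAIN_PER_DOUBLING = 0.5


def accuracy_gain(size_multiplier: float) -> float:
    """Total gain (in points) from growing the corpus by size_multiplier."""
    return GAIN_PER_DOUBLING * math.log2(size_multiplier)


def gain_per_unit_added(base_size: float, doubling: int) -> float:
    """Gain per unit of new data during the n-th doubling (1-indexed).

    The n-th doubling adds base_size * 2**(n - 1) units of data but still
    buys only GAIN_PER_DOUBLING points, so the payoff per unit halves
    with every doubling.
    """
    data_added = base_size * 2 ** (doubling - 1)
    return GAIN_PER_DOUBLING / data_added


if __name__ == "__main__":
    for multiplier in (2, 16, 1024):
        print(multiplier, accuracy_gain(multiplier))
    for n in (1, 2, 3, 10):
        print(n, gain_per_unit_added(1.0, n))
```

Even under this optimistic constant-gain model, 1024 times more data buys only about 5 points, and the return on each additional unit of data halves with every doubling. If the real curve flattens out, the picture is worse still.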
Towards the end of the talks Larry Page dropped by and talked a bit about the $30M Google/X-prize contest to put a robot on the moon.
Apparently, he was ‘jet lagged’ from flying to LA, Vegas, and then back to Moffett Field.
The most interesting aspect of the whole night was how Jeff Dean and Larry looked when asked a pointed question.
An audience member went up to the microphone and asked if Google had plans to provide BigTable, GFS, and MapReduce to the public as a web service. Larry looked RIGHT at Jeff Dean as if to say “if only they knew what we know”. I was in Larry’s direct line of sight so the look was plain as day.
It seems inevitable that Google will provide a similar feature (especially with Amazon doing it), but I think the main issue is a question of time.
It’s a brilliant strategy, really. It makes for EASY acquisitions. Google has been known to stumble with its acquisitions when it has to port the existing software to Google’s infrastructure. From an outsider’s perspective this was a big stumbling block for Urchin, as they had to port to BigTable. It was a huge dataset, eventually growing to 500T in 2006.
It would be interesting to see Google take this step. Could you launch competing products on their infrastructure? With Spinn3r, we could license our crawler to their competitors, for example.
If I were Powerset I’d stick to Amazon though…
I should note (and just to clarify) that Google probably will NEVER Open Source their infrastructure as it’s too tuned to their own environment. Jeff mentioned (in a similar talk at the Google Scalability conference) that Google’s infrastructure dependencies act as “tentacles” into other components of the system. BigTable is highly tuned to work on GFS for example and provides hints to GFS to obtain additional performance.
Google is spending a LOT of time going around and educating the industry about their infrastructure.
This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.