LinkedIn and Hadoop
Interesting post about LinkedIn’s Data Infrastructure
Much of LinkedIn’s important data is offline – it moves fairly slowly. So they use daily batch processing with Hadoop as an important part of their calculations. For example, they pre-compute data for their “People You May Know” product this way, scoring 120 billion relationships per day in a mapreduce pipeline of 82 Hadoop jobs that requires 16 TB of intermediate data. This job uses a statistical model to predict the probability of two people knowing each other. Interestingly, they use Bloom filters to speed up large joins, yielding a 10x performance improvement.
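The post doesn’t describe LinkedIn’s Bloom filter join in detail, but the general trick is easy to sketch: build a compact filter over the keys of the smaller join side, then use it to discard most non-matching records on the larger side before the expensive join. Here’s a minimal Python illustration (the filter sizes, hash scheme, and the toy data are all my own assumptions, not LinkedIn’s):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions.
    Membership tests can yield false positives, never false negatives."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions by salting the key and hashing.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the smaller join side...
small_side = {"alice", "bob", "carol"}
bf = BloomFilter()
for key in small_side:
    bf.add(key)

# ...then prune the larger side before the real join. In a mapreduce
# job the filter would be shipped to every mapper as side data.
large_side = ["alice", "dave", "bob", "erin", "mallory"]
candidates = [k for k in large_side if bf.might_contain(k)]
# False positives are possible, so the surviving candidates are
# still confirmed against the real (much smaller) key set.
joined = [k for k in candidates if k in small_side]
```

The win comes from the pruning step: most of the large side never reaches the join at all, at the cost of a small, tunable false-positive rate.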
They have two engineers who work on this pipeline, and are able to test five new algorithms per week. To achieve this rate of change, they rely on A/B testing to compare new approaches to old ones, using a “fly by instruments” approach to optimize results. Performance improvements also require operating on data at scale, so they rely on large-scale cluster processing. To get there they moved from custom graph processing code to Hadoop mapreduce code – this required some thoughtful design, since many graph algorithms don’t translate into mapreduce in a straightforward manner.
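To see why graph algorithms need reshaping for mapreduce, consider the core of a “People You May Know”-style computation: counting common connections. Expressed as mapreduce, each user’s adjacency list is mapped to candidate pairs that share that user, and the reduce step sums per pair. This toy in-memory sketch is my own illustration of the pattern, not LinkedIn’s actual pipeline (their real scoring uses a statistical model, not a raw count):

```python
from collections import defaultdict
from itertools import combinations

# Adjacency lists: user -> set of direct connections (toy data).
graph = {
    "ann": {"bob", "cal"},
    "bob": {"ann", "cal", "dee"},
    "cal": {"ann", "bob", "dee"},
    "dee": {"bob", "cal"},
}

def map_phase(user, friends):
    """Emit one record per pair of this user's friends:
    each such pair shares `user` as a common connection."""
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), 1

def reduce_phase(records):
    """Sum common-connection counts per candidate pair, dropping
    pairs that are already directly connected."""
    counts = defaultdict(int)
    for key, value in records:
        counts[key] += value
    return {(a, b): n for (a, b), n in counts.items()
            if b not in graph[a]}

records = [r for user, friends in graph.items()
           for r in map_phase(user, friends)]
scores = reduce_phase(records)
# ann and dee share two connections (bob and cal), so they score 2.
```

Note the inversion: the map phase never traverses the graph, it only re-keys local adjacency data so the shuffle brings each candidate pair together – which is exactly the kind of restructuring that makes the translation non-obvious.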
Workflow management systems are starting to become interesting here, since pipelines with this many steps and this much intermediate data can become confusing quickly.