MapReduce: Simplified Data Processing on Large Clusters

Both High Scalability and Greg Linden linked to a new MapReduce publication from Google, hot off the press:

P107-Dean

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.
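
The paper's running example is word count. As a rough illustration of the programming model (this is not Google's actual API: the helper names and the in-memory shuffle here are my own, and the real runtime distributes each step across a cluster of machines), here is a minimal single-machine sketch in Python:

    from collections import defaultdict

    def map_fn(document):
        # map: emit an intermediate (key, value) pair for every word
        for word in document.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # reduce: merge all intermediate values for one key
        return sum(counts)

    def run_mapreduce(documents):
        # "shuffle": group intermediate values by key; the real runtime
        # does this by partitioning and sorting files across machines
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        # one reduce call per distinct key
        return {key: reduce_fn(key, values) for key, values in groups.items()}

    print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}

The contract the programmer sees is just that pair of functions; fault tolerance, scheduling, and the distributed grouping of intermediate data are the runtime's job.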

Click on the attachment to the right to read the full publication.

I’ll add my thoughts as an update when I read the paper later tonight.


  1. brianaker

    Thanks for linking to that.

    Two observations that jumped out at me:
    1) Reduce looks to be single-threaded in most cases (not a huge surprise; I know this from optimizing GROUP BY operations for parallel execution).

    2) The system is weakly typed.

    My Tangent::Serial class does quite a bit of this, using MySQL as the HA layer behind all of the work. It just goes to show how many of us are pointed in similar directions at any given time.





