MapReduce

MapReduce is the original programming model for distributed data processing on Apache Hadoop. You express a computation that parallelizes across many nodes by structuring it as a map phase that processes pieces independently, followed by a reduce phase that combines partial results. Newer frameworks like Spark have partly displaced it, but the underlying ideas are still everywhere.

The structure has two phases:

Map phase. The input data is split into chunks. Each chunk goes to a mapper, which processes records and emits key-value pairs. Mappers are independent (no mapper knows what any other is doing), so they parallelize trivially across nodes.
Reduce phase. All key-value pairs with the same key are grouped together via a shuffle-and-sort step that moves pairs across the cluster. Each reducer processes all the pairs sharing a key and produces the final output. Reducers are also independent across keys.

The canonical example is counting words in a large document collection:

The mapper takes a chunk of text and emits, for each word it sees, the pair (word, 1), one occurrence of this word.
The shuffle step groups all pairs sharing the same word onto the same reducer.
The reducer takes a word and a stream of 1s and sums them, emitting (word, total_count).

Mapping is embarrassingly parallel: each mapper works on its own chunk with no knowledge of any other, and each reducer works on its own group of keys. A MapReduce program scales from one machine to a thousand without the programmer rewriting it; the framework handles distribution.

A frequent optimization is a combiner, a mini-reducer that runs on each mapper’s output before the shuffle, locally collapsing many (word, 1) pairs into a single (word, partial_count). The combiner cuts the volume of data that has to cross the network during shuffle, usually the slowest part of a MapReduce job. For commutative-and-associative reductions like sum, the combiner can use the reducer’s code unchanged; for non-associative ones it has to be written separately or skipped.

The cost is that MapReduce is restrictive. Not every computation fits naturally into map-then-reduce. Iterative algorithms (Gradient descent, expectation-maximization, the algorithms that drive modern machine learning) are awkward in MapReduce because they require many passes over the data, and each pass writes intermediate results to disk. This is the gap Spark and other later frameworks filled, by keeping intermediate results in memory between iterations.

The lasting point is the shape of the idea: in distributed computing, work is expressed as local processing of independent pieces followed by combination of partial results, with a framework moving the data around behind the scenes. The same shape recurs in Spark and Dask.

MapReduce runs inside Apache Hadoop’s ecosystem, alongside HDFS for storage and YARN for resource scheduling.

Idriss Rami — Notes

Explorer

MapReduce

Graph View

Backlinks