MapReduce is the original programming model for distributed data processing on Apache Hadoop. It’s a way of expressing computations that can be parallelized across many nodes, by structuring them as a map phase that processes pieces independently followed by a reduce phase that combines partial results. Although it’s been partly displaced by newer frameworks like Spark, the underlying ideas are still everywhere.
The structure has two phases:
- Map phase. The input data is split into chunks. Each chunk goes to a mapper, which processes records and emits key-value pairs. Mappers are independent — no mapper knows what any other is doing — so they parallelize trivially across nodes.
- Reduce phase. All key-value pairs with the same key are grouped together (via a shuffle-and-sort step that moves pairs across the cluster). Each reducer processes all the pairs sharing a key and produces the final output. Reducers are also independent across keys.
The canonical example is counting words in a large document collection:
- The mapper takes a chunk of text and emits, for each word it sees, the pair
(word, 1)— one occurrence of this word. - The shuffle step groups all pairs sharing the same word onto the same reducer.
- The reducer takes a word and a stream of 1s and sums them, emitting
(word, total_count).
The beauty of the model is that mapping is embarrassingly parallel (each mapper works on its own chunk with no knowledge of any other) and reducing is also parallel (each reducer works on its own group of keys). A MapReduce program scales from one machine to a thousand without the programmer rewriting it; the framework handles distribution.
A frequent optimization is a combiner — a mini-reducer that runs on each mapper’s output before the shuffle, locally collapsing many (word, 1) pairs into a single (word, partial_count). The combiner cuts the volume of data that has to cross the network during shuffle, which is usually the slowest part of a MapReduce job. For commutative-and-associative reductions like sum, the combiner can use the reducer’s code unchanged; for non-associative ones it has to be written separately or skipped.
The cost is that MapReduce is restrictive. Not every computation fits naturally into map-then-reduce. Iterative algorithms — Gradient descent, expectation-maximization, the algorithms that drive modern machine learning — are awkward in MapReduce because they require many passes over the data, and each pass writes intermediate results to disk. This is the gap that Spark and other later frameworks filled, by keeping intermediate results in memory between iterations.
For our purposes, the point to take from MapReduce is the shape of the idea: in distributed computing, work is expressed as local processing of independent pieces followed by combination of partial results, with a framework moving the data around behind the scenes. The same shape recurs in Spark, in Dask, in the modern data-processing landscape.
MapReduce runs inside Apache Hadoop’s ecosystem, alongside HDFS for storage and YARN for resource scheduling.