MapReduce is a big data processing technique and a model for implementing that technique programmatically. Its goal is to sort and filter massive amounts of data into smaller subsets, then distribute those subsets to computing nodes, which process the filtered data in parallel.

In data analytics, this general approach is called split-apply-combine.
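For example, a grouped aggregation is a split-apply-combine operation: the rows are split into groups by a key, a function is applied to each group, and the per-group results are combined into a single output. The following minimal Python sketch uses the pandas library as one common tool for this pattern (the table and column names are purely illustrative):

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["east", "west", "east", "west"],
        "amount": [100, 200, 150, 50],
    })

    # Split the rows by region, apply a sum to each group,
    # and combine the per-group sums into one result.
    totals = sales.groupby("region")["amount"].sum()
    print(totals)  # east: 250, west: 250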

MapReduce steps

A generic MapReduce procedure has three main steps: map, shuffle, and reduce. Each node in the distributed MapReduce system has local access to a small, arbitrary portion of the large data set. A single-machine sketch of all three steps follows the list below.

  • Map: Each node applies the mapping function to its portion of the data, filtering and sorting it according to the job's parameters. The mapped data is written to temporary storage with a corresponding set of keys (identifying metadata) that indicate how the data should be redistributed.

  • Shuffle: The mapped data is redistributed across the system's nodes, so that each node holds all of the data sharing a given key.

  • Reduce: Each node processes its groups of keyed data in parallel, producing one output per key.
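As a concrete illustration, the following minimal Python sketch simulates all three steps on one machine using word counting, the canonical MapReduce example. It models only the data flow; in a real deployment, the mapped pairs would be partitioned across many nodes rather than grouped in local memory:

    from itertools import groupby
    from operator import itemgetter

    documents = ["the quick brown fox", "the lazy dog", "the fox"]

    # Map: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: sort and group the pairs by key. On a real cluster, this
    # is the step that moves data between nodes over the network.
    mapped.sort(key=itemgetter(0))
    grouped = {key: [count for _, count in pairs]
               for key, pairs in groupby(mapped, key=itemgetter(0))}

    # Reduce: combine each key's list of values into a single result.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}

    print(word_counts)
    # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}

Note that the shuffle is the only step that requires data to move between nodes, which is why the amount of intermediate data the map step emits largely determines a job's cost.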

Implementations

The following software and data systems implement MapReduce:

  • Hadoop — Developed by the Apache Software Foundation. Written in Java, with a language-agnostic API.
  • Spark — Developed by AMPLab at UC Berkeley, with APIs for Python, Java, and Scala.
  • Disco — A MapReduce implementation originally developed by Nokia, written in Python and Erlang.
  • MapReduce-MCI — Developed at Sandia National Laboratories, with bindings for C, C++, and Python.
  • Phoenix — A threaded MapReduce implementation developed at Stanford University, written in C++.

Limitations and alternatives

Because every job shuffles its intermediate data between the map and reduce phases, MapReduce is poorly suited for iterative algorithms, such as those used in machine learning (ML), which repeat that cost on every pass over the data.
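As a hypothetical illustration, the following single-machine Python sketch expresses one-dimensional k-means clustering, a simple iterative algorithm, in MapReduce terms. On a real cluster, every pass through the loop would repeat the network shuffle, and a disk-based system such as Hadoop would also re-read and re-write the data set on disk each time:

    from itertools import groupby
    from operator import itemgetter

    points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
    centers = [points[0], points[-1]]  # initial guesses

    for iteration in range(10):
        # Map: assign each point to its nearest center (key = center index).
        mapped = [(min(range(len(centers)),
                       key=lambda i: abs(p - centers[i])), p)
                  for p in points]
        # Shuffle: group the points by assigned center. On a cluster, this
        # moves every record across the network on every single iteration.
        mapped.sort(key=itemgetter(0))
        groups = {k: [p for _, p in g]
                  for k, g in groupby(mapped, key=itemgetter(0))}
        # Reduce: recompute each center as the mean of its group.
        centers = [sum(groups[k]) / len(groups[k]) for k in sorted(groups)]

    print(centers)  # [1.5, 9.5]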

Alternatives to traditional MapReduce that aim to alleviate these bottlenecks include in-memory processing frameworks, such as Apache Spark, which cache intermediate data in memory between iterations instead of re-reading it from disk.

Related information

  • Database terms.
  • Introduction to MySQL.