MapReduce is a big data processing technique and a model for implementing that technique programmatically. Its goal is to sort and filter massive amounts of data into smaller subsets, then distribute those subsets to computing nodes, which process the filtered data in parallel.

In data analytics, this general approach is called split-apply-combine.
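For example, a grouped aggregation is a split-apply-combine operation: the rows are split into groups by a key, a function is applied to each group, and the per-group results are combined into a single output. The following minimal Python sketch uses the pandas library as one common tool for this pattern (the table and column names are purely illustrative):

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["east", "west", "east", "west"],
        "amount": [100, 200, 150, 50],
    })

    # Split the rows by region, apply a sum to each group,
    # and combine the per-group sums into one result.
    totals = sales.groupby("region")["amount"].sum()
    print(totals)  # east: 250, west: 250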

MapReduce steps

A generic MapReduce procedure has three main steps: map, shuffle, and reduce. Each node in the distributed MapReduce system has local access to a small, arbitrary portion of the large data set. A single-machine sketch of all three steps follows the list below.

  • Map: Each node applies the mapping function to its portion of the data, filtering and sorting it according to the job's parameters. The mapped data is written to temporary storage with a corresponding set of keys (identifying metadata) that indicate how the data should be redistributed.

  • Shuffle: The mapped data is redistributed across the system's nodes, so that each node holds all of the data sharing a given key.

  • Reduce: Each node processes its groups of keyed data in parallel, producing one output per key.
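As a concrete illustration, the following minimal Python sketch simulates all three steps on one machine using word counting, the canonical MapReduce example. It models only the data flow; in a real deployment, the mapped pairs would be partitioned across many nodes rather than grouped in local memory:

    from itertools import groupby
    from operator import itemgetter

    documents = ["the quick brown fox", "the lazy dog", "the fox"]

    # Map: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: sort and group the pairs by key. On a real cluster, this
    # is the step that moves data between nodes over the network.
    mapped.sort(key=itemgetter(0))
    grouped = {key: [count for _, count in pairs]
               for key, pairs in groupby(mapped, key=itemgetter(0))}

    # Reduce: combine each key's list of values into a single result.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}

    print(word_counts)
    # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}

Note that the shuffle is the only step that requires data to move between nodes, which is why the amount of intermediate data the map step emits largely determines a job's cost.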

Implementations

The following software and data systems implement MapReduce:

  • Hadoop — Developed by the Apache Software Foundation. Written in Java, with a language-agnostic API.
  • Spark — Developed by AMPLab at UC Berkeley, with APIs for Python, Java, and Scala.
  • Disco — A MapReduce implementation originally developed by Nokia, written in Python and Erlang.
  • MapReduce-MCI — Developed at Sandia National Laboratories, with bindings for C, C++, and Python.
  • Phoenix — A threaded MapReduce implementation developed at Stanford University, written in C++.

Limitations and alternatives

Because every job shuffles its intermediate data between the map and reduce phases, MapReduce is poorly suited for iterative algorithms, such as those used in machine learning (ML), which repeat that cost on every pass over the data.
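As a hypothetical illustration, the following single-machine Python sketch expresses one-dimensional k-means clustering, a simple iterative algorithm, in MapReduce terms. On a real cluster, every pass through the loop would repeat the network shuffle, and a disk-based system such as Hadoop would also re-read and re-write the data set on disk each time:

    from itertools import groupby
    from operator import itemgetter

    points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
    centers = [points[0], points[-1]]  # initial guesses

    for iteration in range(10):
        # Map: assign each point to its nearest center (key = center index).
        mapped = [(min(range(len(centers)),
                       key=lambda i: abs(p - centers[i])), p)
                  for p in points]
        # Shuffle: group the points by assigned center. On a cluster, this
        # moves every record across the network on every single iteration.
        mapped.sort(key=itemgetter(0))
        groups = {k: [p for _, p in g]
                  for k, g in groupby(mapped, key=itemgetter(0))}
        # Reduce: recompute each center as the mean of its group.
        centers = [sum(groups[k]) / len(groups[k]) for k in sorted(groups)]

    print(centers)  # [1.5, 9.5]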

Alternatives to traditional MapReduce that aim to alleviate these bottlenecks include in-memory processing frameworks, such as Apache Spark, which cache intermediate data in memory between iterations instead of re-reading it from disk.

Related information

  • Database terms.
  • Introduction to MySQL.