When to Use MapReduce with Big Data
MapReduce is a programming model for distributed, parallel computation over big data sets. A MapReduce program contains a map function, which performs filtering and sorting, and a reduce function, which performs a summary operation.
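To make the two functions concrete, here is a minimal, single-process sketch of the model using the classic word-count task. The function names and the dictionary-based shuffle are illustrative stand-ins for what a framework like Hadoop would do across many machines.

```python
from collections import defaultdict

def map_fn(document):
    """Map: filter/tokenize the input and emit a (word, 1) pair per word."""
    for word in document.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: summarize all values seen for one key into a single result."""
    return word, sum(counts)

def map_reduce(documents):
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):   # map phase
            grouped[key].append(value)   # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())  # reduce phase

print(map_reduce(["big data needs big tools"]))
# {'big': 2, 'data': 1, 'needs': 1, 'tools': 1}
```

In a real cluster, the map and reduce calls run in parallel on different nodes, and the shuffle moves intermediate pairs over the network; the programming model stays the same.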
MapReduce is a module in the Apache Hadoop open source ecosystem, and it’s widely used for querying and selecting data in the Hadoop Distributed File System (HDFS). A wide range of queries can be expressed with the many MapReduce algorithms available for selecting data.
MapReduce is suitable for iterative computation involving large quantities of data requiring parallel processing. It represents a data flow rather than a procedure.
It’s also suitable for large-scale graph analysis; in fact, MapReduce was originally developed for determining PageRank of web documents. Graphs consist of vertices (nodes) and edges (links). Edges may be directed, such as hyperlinks on the web, or undirected, such as connections between team members.
In MapReduce, graphs are stored as key-value pairs in which the key is the vertex id (URL) and the value is a complex record called a “tuple,” consisting of a list of neighboring vertices and other attributes of the vertices, such as web page text. A graph is distributed across the cluster, with different portions of the graphs, including adjacent vertices, stored on different nodes. A graph may be processed in parallel using MapReduce.
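The key-value representation described above can be sketched as follows. The URLs and attributes are hypothetical; the point is the record shape: key = vertex id, value = a tuple of (adjacency list, other vertex attributes).

```python
# A tiny web graph stored as key-value pairs:
#   key   = vertex id (here, a URL)
#   value = (list of neighboring vertices, dict of vertex attributes)
graph = {
    "a.com": (["b.com", "c.com"], {"text": "home page"}),
    "b.com": (["c.com"],          {"text": "about us"}),
    "c.com": (["a.com"],          {"text": "contact"}),
}

# Each (key, value) record can be stored on a different node of the
# cluster, so the graph as a whole is partitioned across machines.
neighbors, attributes = graph["a.com"]
print(neighbors)  # ['b.com', 'c.com']
```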
Graph algorithms follow a common pattern across the map, shuffle, and reduce phases. In the map phase, parallel computation is executed on all vertices. In the shuffle phase, partial results of the map phase are transferred along the edges to the neighboring vertices, which may be on different machines in the cluster. In the reduce phase, the vertices compute a new value, also in parallel, based on the incoming values (“messages”).
Taking the PageRank example, in the map phase, the current PageRank of each vertex is computed, and in the shuffle phase, the PageRank is conveyed to neighboring vertices. In the reduce phase, the PageRank of the destination vertices is recomputed as the sum of the incoming PageRank values. The algorithm runs for multiple iterations, each time updating a vertex’s value based on the incoming values from its neighbors. Because PageRank depends only on values from immediate neighbors, it is well suited to MapReduce parallelization.
One of the challenges of graph algorithms is that the graph structure must be available at each iteration. The graph structure is passed along with the PageRank messages from the mappers to the reducers.
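A single iteration of this scheme can be sketched as below. This is a simplified, single-process illustration, not Hadoop code: it omits the damping factor, assumes every vertex has outgoing edges, and simulates the shuffle with a dictionary. Note how the mapper re-emits the adjacency list as a “structure” message so the reducer can rebuild the graph for the next iteration.

```python
from collections import defaultdict

def pr_map(vertex, record):
    rank, neighbors = record
    # Pass the graph structure along with the rank messages.
    yield vertex, ("structure", neighbors)
    # Distribute this vertex's rank evenly along its outgoing edges.
    for n in neighbors:
        yield n, ("rank", rank / len(neighbors))

def pr_reduce(vertex, messages):
    neighbors, incoming = [], 0.0
    for kind, value in messages:
        if kind == "structure":
            neighbors = value          # recover the adjacency list
        else:
            incoming += value          # sum incoming rank messages
    return vertex, (incoming, neighbors)

def pagerank_step(graph):
    grouped = defaultdict(list)
    for vertex, record in graph.items():
        for key, message in pr_map(vertex, record):  # map phase
            grouped[key].append(message)             # shuffle phase
    return dict(pr_reduce(v, msgs) for v, msgs in grouped.items())

# Each value is (current rank, adjacency list).
graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
for _ in range(10):
    graph = pagerank_step(graph)
```

In a real deployment, each `pagerank_step` is a full MapReduce job, and rereading and reshuffling the graph structure on every iteration is exactly the overhead that the next paragraph describes.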
However, MapReduce is not suitable for graph analysis requiring multi-hop message transfers to distant vertices, as intermediate data could fill up the disk storage or network traffic could get bogged down waiting for large quantities of intermediate data to be transferred.
Beyond graph analysis, MapReduce’s strengths in processing big data make it suitable for other kinds of tasks, such as:
Term frequency–inverse document frequency (tf–idf)
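As a sketch of what tf–idf computes: term frequency measures how often a term appears in one document, and inverse document frequency discounts terms that appear in many documents. On Hadoop this is typically a chain of MapReduce jobs (per-document word counts, document lengths, document frequencies); the version below computes the same quantities directly in one process for clarity, using a common tf–idf weighting (raw frequency times log of inverse document frequency).

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({term: (count / len(doc)) * math.log(n / df[term])
                       for term, count in tf.items()})
    return scores

scores = tf_idf(["big data tools", "big graphs"])
# "big" appears in every document, so its idf (and score) is zero;
# rarer terms like "data" receive positive weight.
```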
The more data we generate and collect, the greater the need to process it all and make it usable. MapReduce’s parallel, iterative programming model is a good tool for making sense of big data.