Google’s Map Reduce
Commodity Clusters • Web data sets can be very large • Tens to hundreds of terabytes • Cannot mine on a single server • Standard architecture emerging: • Cluster of commodity Linux nodes • Gigabit Ethernet interconnect • How to organize computations on this architecture?
Cluster Architecture (figure): each rack contains 16-64 commodity nodes, each with its own CPU, memory, and disk, connected through a rack switch; any pair of nodes in a rack communicates at 1 Gbps, and a 2-10 Gbps backbone connects the racks.
Map Reduce • Map-reduce is a high-level programming system that allows database processes to be written simply. • The user writes code for two functions, map and reduce. • A master controller divides the input data into chunks, and assigns different processors to execute the map function on each chunk. • Other processors, perhaps the same ones, are then assigned to perform the reduce function on pieces of the output from the map function.
Data Organization • Data is assumed stored in files. • Typically, the files are very large compared with the files found in conventional systems. • For example, one file might be all the tuples of a very large relation. • Or, the file might be a terabyte of "market baskets." • Or, the file might be the "transition matrix of the Web," a representation of the graph with all Web pages as nodes and hyperlinks as edges. • Files are divided into chunks, which might be complete cylinders of a disk and are typically many megabytes.
The Map Function • Input is thought of as a set of key-value records. • Executed by one or more processes, located at any number of processors. • Each map process is given a chunk of the entire input data on which to work. • Designed to take one key-value pair as input and to produce a list of key-value pairs as output. • The types of keys and values for the output of the map function need not be the same as the types of input keys and values. • The "keys" that are output from the map function are not true keys in the database sense. • That is, there can be many pairs with the same key value. • The result of executing all the map processes is a collection of key-value pairs called the intermediate result. • Each pair appears at the processor that generated it.
Map Example Constructing an Inverted Index • Input is a collection of documents. • Final output (not the output of map itself) is, for each word, a list of the documents that contain that word at least once. Map Function • Input is a set of (i, d) pairs • i is a document ID • d is the corresponding document. • The map function scans d and, for each word w it finds, emits the pair (w, i). • Notice that in the output, the word is the key and the document ID is the associated value. • Output of map is a list of word-ID pairs. • It is not necessary to catch duplicate words in the document; the elimination of duplicates can be done later, in the reduce phase. • The intermediate result is the collection of all word-ID pairs created from all the documents in the input database.
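The map phase above can be sketched in Python. This is a minimal illustration, not Hadoop's or Google's actual API; the returned list of pairs stands in for the framework's emit mechanism, and the function name `map_fn` is an assumption of this sketch:

```python
def map_fn(doc_id, text):
    """Map: for each word w found in the document, emit (w, doc_id).
    Duplicate pairs are allowed; the reduce phase eliminates them later."""
    return [(word, doc_id) for word in text.split()]

# Example: two small documents with IDs 1 and 2
pairs = map_fn(1, "map reduce map") + map_fn(2, "reduce cluster")
# pairs == [("map", 1), ("reduce", 1), ("map", 1), ("reduce", 2), ("cluster", 2)]
```

Note that the duplicate ("map", 1) pair survives the map phase, exactly as the slide describes: deduplication is deferred to reduce.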
Note: the output of a map-reduce algorithm is always a set of key-value pairs. This is useful in applications that compose two or more map-reduce operations.
The Reduce Function • The second user-defined function, reduce, is also executed by one or more processes, located at any number of processors. • Input to reduce is a single key value from the intermediate result, together with the list of all values that appear with this key in the intermediate result. • The reduce function itself combines the list of values associated with a given key k. • The result is k paired with a value of some type.
Reduce Example Constructing an Inverted Index • Input is a collection of documents. • Final output (not the output of map itself) is, for each word, a list of the documents that contain that word at least once. Reduce Function • The intermediate result consists of pairs of the form (w, [i1, i2, …, in]), • where the i's are a list of document IDs, one for each occurrence of word w. • The reduce function takes this list of IDs, eliminates duplicates, and sorts the list of unique IDs.
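Continuing the sketch, the grouping step (the framework's "shuffle") and the reduce function can be simulated in plain Python. The in-memory `defaultdict` grouping is a stand-in for what the framework does between the phases; it is not part of the user's code in a real system:

```python
from collections import defaultdict

def reduce_fn(word, doc_ids):
    """Reduce: eliminate duplicate document IDs and sort the survivors."""
    return (word, sorted(set(doc_ids)))

# Simulated shuffle: group intermediate (word, id) pairs by word
intermediate = [("map", 1), ("reduce", 1), ("map", 1), ("reduce", 2)]
groups = defaultdict(list)
for w, i in intermediate:
    groups[w].append(i)

index = [reduce_fn(w, ids) for w, ids in sorted(groups.items())]
# index == [("map", [1]), ("reduce", [1, 2])]
```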
Parallelism • This organization of the computation makes excellent use of whatever parallelism is available. • The map function works on a single document, so we could have as many processes and processors as there are documents in the database. • The reduce function works on a single word, so we could have as many processes and processors as there are words in the database. • Of course, it is unlikely that we would use so many processors in practice.
Another Example – Word Count Construct a word count. • For each word w that appears at least once in our database of documents, output the pair (w, c), where c is the number of times w appears among all the documents. The map function • Input is a document. • Goes through the document and, each time it encounters a word w, emits the pair (w, 1). • Intermediate result is a list of pairs (w1, 1), (w2, 1), …. The reduce function • Input is a pair (w, [1, 1, …, 1]), with a 1 for each occurrence of word w. • Sums the 1's, producing the count. • Output is the word-count pairs (w, c).
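The whole word-count computation fits in a few lines of Python. Again this is a sequential sketch of the logic, assuming whitespace tokenization; a real deployment would run the map calls and reduce calls on many processors in parallel:

```python
from collections import defaultdict

def wc_map(doc):
    """Map: emit (w, 1) for every occurrence of every word in the document."""
    return [(w, 1) for w in doc.split()]

def wc_reduce(word, ones):
    """Reduce: sum the 1's to obtain the total count for the word."""
    return (word, sum(ones))

docs = ["the cat sat", "the cat ran"]

# Simulated shuffle: group the 1's by word
groups = defaultdict(list)
for doc in docs:
    for w, one in wc_map(doc):
        groups[w].append(one)

counts = dict(wc_reduce(w, ones) for w, ones in groups.items())
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```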
What about Joins? Compute the natural join R(A, B) ⋈ S(B, C). The map function • Input is key-value pairs (X, t), • X is either R or S, • t is a tuple of the relation named by X. • Output is a single pair (b, (R, a)) or (b, (S, c)), depending on X, • where b is the B-value of t, • a is the A-value of t (if X = R), • c is the C-value of t (if X = S). The reduce function • Input is a pair (b, [(R, a), (S, c), …]). • Extracts all the A-values associated with R and all the C-values associated with S. • These are paired in all possible ways, with b in the middle, to form the tuples of the result.
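This join can be sketched in Python the same way. Relation names are carried as the string tags "R" and "S", and tuples are plain Python tuples; as before, the `defaultdict` grouping simulates the framework's shuffle and is an assumption of the sketch:

```python
from collections import defaultdict

def join_map(relation, tup):
    """Map: key each tuple by its B-value, tagged with its relation name."""
    if relation == "R":          # t = (a, b)
        a, b = tup
        return (b, ("R", a))
    else:                        # t = (b, c)
        b, c = tup
        return (b, ("S", c))

def join_reduce(b, tagged):
    """Reduce: pair every R-side a with every S-side c, b in the middle."""
    avals = [v for tag, v in tagged if tag == "R"]
    cvals = [v for tag, v in tagged if tag == "S"]
    return [(a, b, c) for a in avals for c in cvals]

R = [(1, "x"), (2, "y")]          # tuples (a, b)
S = [("x", 10), ("x", 20)]        # tuples (b, c)

groups = defaultdict(list)        # simulated shuffle: group by B-value
for rel, tuples in (("R", R), ("S", S)):
    for t in tuples:
        b, tagged = join_map(rel, t)
        groups[b].append(tagged)

result = [row for b, tagged in groups.items() for row in join_reduce(b, tagged)]
# result == [(1, "x", 10), (1, "x", 20)]
```

The B-value "y" produces no output because it has no matching S-tuples, which is exactly the natural-join semantics.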
Reading • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters http://labs.google.com/papers/mapreduce.html • Hadoop (Apache) – Open Source implementation of MapReduce http://hadoop.apache.org/core