
The Performance of MapReduce: An In-depth Study

This study explores the performance of MapReduce by comparing it with parallel database systems and outlining the possible reasons for its limitations. It also discusses the advantages and disadvantages of MapReduce and provides insights into key areas of concern for optimizing its execution.

Presentation Transcript


  1. The Performance of MapReduce: An In-depth Study Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu

  2. Background • Relevance • Topics covered

  3. MapReduce Features • Why this approach? • Map and Reduce functions • Flexibility • Scalability • Fault tolerance

  4. Previous Work • The large-scale data analysis market is dominated by parallel database systems. • The results showed that the observed performance of a parallel database system is much better than that of a MapReduce-based system. (A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, ACM, June 2009.)

  5. Possible Reasons • Architectural faults in the design of MapReduce. • MR-based systems need to repeatedly parse records, since MapReduce is designed to be independent of the storage system. • This parsing introduces performance overhead. • Or is an efficient MR system still possible?

  6. MR vs. Parallel Database Systems • Advantages of MR over parallel DBMSs • Simplicity: 1) The user writes only two functions, Map and Reduce, to process key/value data pairs. 2) MR does not require that data files adhere to a schema defined using the relational data model; programmers are free to structure their data in any manner, or even to have no structure at all. Parallel DBMSs, by contrast, require data to fit the relational paradigm of rows and columns. • Disadvantage of the schema-free approach: programmers must agree on the structure of the data, and something or someone must ensure that any data added or modified does not violate integrity or other high-level constraints (e.g., employee salaries must be non-negative). Such conditions must be known and explicitly adhered to by all programmers modifying a particular data set.

  7. Comparison • MR programming: one is forced to write algorithms in a low-level language in order to perform record-level manipulation. • Hadoop was always much more CPU-intensive than the parallel DBMS when running equivalent tasks, because it must parse and deserialize the records in the input data at run time, whereas parallel databases do the parsing at load time and can extract attributes from tuples at essentially zero cost.

  8. Execution Overview • MapReduce: Simplified Data Processing on Large Clusters (Jeffrey Dean and Sanjay Ghemawat), OSDI 2004.

  9. Execution of a Single MR Job 1) invoke map functions to read data from a storage system into memory, 2) parse/decode the data into records (the worker performs this step), 3) process the records, 4) sort the intermediate data emitted by the map functions according to keys, 5) shuffle the intermediate data from mappers to reducers, 6) merge the data into groups by intermediate keys, 7) invoke reduce functions and write the resulting records back to the storage system.

  10. Key Areas of Concern Steps 1), 2), and 4) are the factors with the major performance impact: 1) the I/O mode with which the map function reads data from the storage system, 2) the decoding method that transforms raw data into records, and 3) the key-comparison strategy used during sorting for computing aggregations.

  11. I/O Mode • MR is designed to be independent of the underlying storage system. • A map function takes its input from a reader instead of from the storage system directly. • The reader repeatedly reads data from the storage system into a memory buffer (typically 64 KB or 128 KB) for the map function to process, until all the data in the data chunk is exhausted.

  12. Two Modes • Readers have two ways to read data from storage systems: 1) Direct I/O: read data from the local disk directly. 2) Streaming I/O: stream data from the storage system through a communication scheme such as TCP/IP or JDBC. • A benchmark on DFS shows that direct I/O outperforms streaming I/O by 10%–15%.

  13. Indexing • How does MR utilize indexing to speed up data processing? MapReduce can benefit from three kinds of indexes: range indexes, block-level indexes, and database indexed tables. 1) If the input of an MR job is a set of files stored in the distributed file system (DFS) and each file is already sorted on keys, MR can use a range index to prune unnecessary data chunks and direct map functions to scan only the data chunks that store records of interest. 2) If the input DFS files are not sorted but each data chunk in the files is indexed by keys, MapReduce can utilize this chunk index by scheduling a map task on each data chunk and configuring the reader to apply the index when searching for the desired records.

  14. Sorting • MR employs a sort-merge strategy for performing all kinds of data analysis tasks. • Example: calculate the total revenue of each sourceIP in the UserVisits table. To perform this task, the MR framework needs to sort the intermediate records emitted by the map function according to sourceIP. • Each sourceIP is a variable-length string of at most 16 bytes, so comparing two sourceIPs needs up to 16 byte-to-byte comparisons in the worst case. • Instead, when an intermediate record is emitted, store a fingerprint, a 32-bit integer, of the key along with that record. When MR sorts the intermediate records, it first compares the fingerprints of the keys. • This fingerprint-comparison strategy reduces the cost of comparing two keys whose fingerprints differ: only one integer comparison is needed. (Only when two fingerprints are equal must the full keys be compared byte by byte.)

  15. Example • djb2: this algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.
      unsigned long hash(unsigned char *str) {
          unsigned long hash = 5381;
          int c;
          while ((c = *str++))
              hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
          return hash;
      }
  (http://www.cse.yorku.ca/~oz/hash.html)

  16. Parsing: Mutable vs. Immutable Decoding Schemes • Immutable: read-only records whose fields in the value part can be set only once and cannot be changed after the record is created. A new immutable record is created each time the decoder is called by the map function to parse the next input record; thus, parsing four million records produces four million immutable records. • Mutable: one can instead use a mutable decoding scheme. Each time a mutable decoder is called by the map function, it decodes the raw data and fills the fields of the same mutable record with the next record's values. Thus, no matter how many records are decoded, only one mutable record is created.

  17. Results • A benchmark on DFS shows that direct I/O outperforms streaming I/O by 10%–15%. • The range index improves the performance of MR by a factor of 2 in the selection task. • A mutable decoder is faster than an immutable decoder by a factor of 10, and improves the performance of selection by a factor of 2. • Fingerprint-based sort outperforms direct sort by a factor of 4 to 5, and improves the overall performance of the job by 20%–25%.

  18. Conclusion • The results show that, with proper implementation, the performance of MapReduce can be improved by a factor of 2.5 to 3.5, approaching that of parallel databases. This indicates that a system that achieves scalability and flexibility does not necessarily sacrifice performance. These experimental results should therefore be useful for the future development of MapReduce-based data processing systems.

  19. Future Research • To bring the efficiency of MapReduce to the level of parallel databases while maintaining its other desirable properties, such as scalability, flexibility, and fault tolerance.

  20. Thank You!
