
Section 5: Performance


Presentation Transcript


  1. Section 5: Performance Chris Zingraf

  2. Overview: • This section measures the performance of MapReduce on two computations, Grep and Sort. • These programs are representative of a large subset of the real programs that MapReduce users have written.

  3. 5.1 Cluster Configuration The machines: • Cluster of ≈ 1800 machines. • Two 2 GHz Intel Xeon processors with Hyper-Threading. • 4 GB of memory. • Two 160 GB IDE (Integrated Drive Electronics) disks. • Gigabit Ethernet link.

  4. 5.1 Cluster Configuration (continued) • Arranged in a two-level tree-shaped switched network. • ≈ 100-200 Gbps of aggregate bandwidth is available at the root. • Every machine is located in the same hosting facility. • The round-trip time between any pair of machines is less than a millisecond. • Of the 4 GB of memory available, approximately 1-1.5 GB was reserved by other tasks running on the cluster. • The programs were run on a weekend afternoon, when the CPUs, disks, and network were mostly idle.

  5. 5.2 Grep • Grep scans through 10^10 100-byte records. • The program looks for a match to a rare 3-character pattern. • This pattern occurs in 92,337 of the records. • The input is split up into ≈ 64 MB pieces. • The output is stored in a single file.
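
As a rough illustration of slide 5, a Grep-style Map function can be sketched in a few lines of Python. This is not the paper's actual C++ code; the record iterator, the placeholder pattern, and the emit callback are assumptions made here for illustration only.

    # Hypothetical sketch of a Grep-style Map function (not the paper's code).
    PATTERN = b"xyz"  # stand-in for the rare 3-character pattern

    def grep_map(records, emit):
        """Scan fixed-size 100-byte records in one ~64 MB input split and
        emit every record that contains the pattern."""
        for record in records:
            if PATTERN in record:
                emit(PATTERN, record)

    # Grep needs no real reduction: the matching records emitted by all map
    # tasks are simply collected into a single output file.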

  6. 5.2 Grep (continued) • [Figure: input scan rate (MB/s) over time (seconds)] • The Y-axis shows the rate at which the input data is scanned. • The rate picks up as more machines are assigned to the computation. • It peaks at over 30 GB/s when 1764 workers have been assigned. • The entire computation takes about 150 seconds, including roughly a minute of startup overhead.
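
A quick back-of-the-envelope check of the numbers on slide 6 (the peak rate and worker count are taken from the slide; the per-worker figure is derived here):

    total_bytes = 10**10 * 100      # 10^10 records x 100 bytes ~= 1 TB of input
    peak_rate = 30e9                # peak aggregate scan rate, ~30 GB/s
    workers = 1764

    per_worker = peak_rate / workers          # ~17 MB/s scanned per worker
    time_at_peak = total_bytes / peak_rate    # ~33 s if the peak were sustained
    print(f"{per_worker / 1e6:.0f} MB/s per worker, {time_at_peak:.0f} s at peak")

    # The measured ~150 s total is much larger because the rate ramps up as
    # workers are assigned and roughly a minute is spent on startup.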

  7. 5.3 Sort • The sort program sorts 10^10 100-byte records (approximately 1 terabyte of data). • It is modeled after the TeraSort benchmark. • The whole program is less than 50 lines of user code.

  8. How does the program sort the data? • A three-line Map function extracts a 10-byte sorting key from a text line. • It then emits the key and the original text line (this is the intermediate key/value pair). • The built-in Identity function serves as the Reduce operation. • It passes the intermediate key/value pair through unchanged as the output key/value pair. • The final sorted output is written to a set of 2-way replicated Google File System (GFS) files. • I.e., 2 terabytes are written as the output of the program.
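
A minimal Python sketch of the structure described on slide 8. The real Map function is three lines of C++; the emit callback and the exact way the 10-byte key is taken from each line are illustrative assumptions, not the paper's implementation.

    def sort_map(line, emit):
        """Extract a 10-byte sorting key from a text line and emit the key
        together with the original line as the intermediate key/value pair."""
        key = line[:10]
        emit(key, line)

    def identity_reduce(key, values, emit):
        """Identity Reduce: pass every intermediate key/value pair through
        unchanged as the output key/value pair."""
        for value in values:
            emit(key, value)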

  9. 5.3 Sort (continued) • Like Grep, the input for the sort program is split up into 64 MB pieces. • The sorted output is partitioned into 4000 files. • The partitioning function uses the initial bytes of the key to place each record into one of the 4000 pieces.
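
A hedged sketch of a partitioning function in the spirit of slide 9. The slide only says that the initial bytes of the key are used; the two-byte prefix and the assumption of roughly uniform keys below are illustrative, not the actual Google implementation.

    NUM_PARTITIONS = 4000

    def partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
        """Choose one of the 4000 output files from the leading bytes of the
        10-byte key, keeping keys in sorted order across partitions."""
        prefix = int.from_bytes(key[:2], "big")    # first two bytes: 0..65535
        return prefix * num_partitions // 65536    # scaled into 0..3999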

  10. 5.3 Sort (continued) • This figure shows the data transfer rate over time for a normal execution of the Sort program. • The input rate peaks at about 13 GB/s and then dies off quickly, since all of the map tasks finish before the 200-second mark.

  11. 5.3 Sort (continued) • This graph shows the rate at which data is sent over the network from the map tasks to the reduce tasks. • This shuffling starts as soon as the first map task finishes. • The first hump in the graph corresponds to the first batch of approximately 1700 reduce tasks: the entire job was assigned to around 1700 machines, and each machine executes at most one reduce task at a time. • Around 300 seconds into the computation, some of this first batch of reduce tasks finish, and shuffling begins for the remaining reduce tasks. • All of the shuffling is finished about 600 seconds into the computation.

  12. 5.3 Sort (continued) • This figure shows the rate at which sorted data is written to the final output files. • There is a delay between the end of the first batch of shuffling and the start of writing because the machines are busy sorting the intermediate data. • The writes continue at a steadier rate of about 2-4 GB/s, compared to reading the input and shuffling. • The writes finish around 850 seconds into the computation. • Including startup overhead, the entire computation takes a total of 891 seconds. • (This time is similar to the best reported result for the TeraSort benchmark, 1057 seconds.)

  13. 5.4 Effect of Backup Tasks • Figure 3(b) shows an execution of the sort program with backup tasks disabled. • It is similar to Figure 3(a), except for a long tail during which there is hardly any write activity. • After 960 seconds, all but 5 of the reduce tasks are completed. • These last 5 straggler tasks do not finish until 300 seconds later. • The entire computation takes 1283 seconds to complete. • This is a 44% increase in elapsed time over the normal execution.

  14. 5.5 Machine Failures • Figure 3(c) shows an execution of the sort program in which 200 of the 1746 worker processes were intentionally killed several minutes into the computation. • Since the machines themselves were still functioning, the underlying cluster scheduler immediately restarted new worker processes on those machines. • The graph shows a negative input rate at the point where the workers were killed: the input goes negative because previously completed map work was lost and had to be redone. • The entire computation finishes after 933 seconds. • This is only a 5% increase in elapsed time over the normal execution.
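
The two slowdown figures quoted on slides 13 and 14 can be checked directly against the 891-second normal execution:

    normal = 891          # normal execution (seconds)
    no_backup = 1283      # backup tasks disabled
    with_failures = 933   # 200 of 1746 workers killed

    print(f"without backup tasks: {(no_backup - normal) / normal:.0%} slower")    # ~44%
    print(f"with machine failures: {(with_failures - normal) / normal:.0%} slower")  # ~5%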
