
Distributed Cluster Computing Platforms


Presentation Transcript


  1. Distributed Cluster Computing Platforms

  2. Outline • What is the purpose of Data Intensive Super Computing? • MapReduce • Pregel • Dryad • Spark/Shark • Distributed Graph Computing

  3. Why DISC • DISC stands for Data Intensive Super Computing • Many applications: scientific data, web search engines, social networks, economics, GIS • New data are continuously generated, and people want to understand the data • Big Data analysis is now considered a very important method for scientific research.

  4. What are the required features for a platform that handles DISC? • Application specific: it is very difficult, or even impossible, to construct one system that fits them all. One example is the POSIX-compatible file system. Each system should be re-configured or even re-designed for a specific application; think about the motivation for building the Google File System for the Google search engine. • Programmer-friendly interfaces: the application programmer should not have to deal with the infrastructure, such as machines and networks. • Fault tolerant: the platform should handle faulty components automatically, without any special treatment from the application. • Scalable: the platform should run on at least thousands of machines and harness the power of all the components; load balancing should be achieved by the platform instead of the application itself. • Keep these four features in mind during the introduction of the concrete platforms below.

  5. Google MapReduce • Programming Model • Implementation • Refinements • Evaluation • Conclusion

  6. Motivation: large-scale data processing • Process lots of data to produce other derived data • Input: crawled documents, web request logs, etc. • Output: inverted indices, web page graph structure, top queries in a day, etc. • Want to use hundreds or thousands of CPUs • but want to focus only on the functionality • MapReduce hides the messy details in a library: • Parallelization • Data distribution • Fault tolerance • Load balancing

  7. Motivation: Large Scale Data Processing • Want to process lots of data ( > 1 TB) • Want to parallelize across hundreds/thousands of CPUs • … Want to make this easy "Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data." From: http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html

  8. MapReduce • Automatic parallelization & distribution • Fault-tolerant • Provides status and monitoring tools • Clean abstraction for programmers

  9. Programming Model • Borrows from functional programming • Users implement interface of two functions: • map (in_key, in_value) -> (out_key, intermediate_value) list • reduce (out_key, intermediate_value list) -> out_value list
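  A minimal sketch of that contract expressed with Python type hints (illustrative only; the real library is C++ and the names here are made up):

      from typing import Callable, Iterable, Iterator, Tuple

      # map: (in_key, in_value) -> iterator of (out_key, intermediate_value) pairs
      MapFn = Callable[[str, str], Iterator[Tuple[str, str]]]

      # reduce: (out_key, all intermediate values for that key) -> iterator of out_value
      ReduceFn = Callable[[str, Iterable[str]], Iterator[str]]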

  10. map • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line). • map() produces one or more intermediate values along with an output key from the input.

  11. reduce • After the map phase is over, all the intermediate values for a given output key are combined into a list. • reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

  12. Architecture

  13. Parallelism • map() functions run in parallel, creating different intermediate values from different input data sets • reduce() functions also run in parallel, each working on a different output key • All values are processed independently • Bottleneck: reduce phase can’t start until map phase is completely finished.

  14. Example: Count word occurrences
      map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, "1");

      reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // output_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
          result += ParseInt(v);
        Emit(AsString(result));

  15. Example vs. Actual Source Code • Example is written in pseudo-code • Actual implementation is in C++, using a MapReduce library • Bindings for Python and Java exist via interfaces • True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)

  16. Example • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good.

  17. Map output • Worker 1: • (the 1), (weather 1), (is 1), (good 1). • Worker 2: • (today 1), (is 1), (good 1). • Worker 3: • (good 1), (weather 1), (is 1), (good 1).

  18. Reduce Input • Worker 1: • (the 1) • Worker 2: • (is 1), (is 1), (is 1) • Worker 3: • (weather 1), (weather 1) • Worker 4: • (today 1) • Worker 5: • (good 1), (good 1), (good 1), (good 1)

  19. Reduce Output • Worker 1: • (the 1) • Worker 2: • (is 3) • Worker 3: • (weather 2) • Worker 4: • (today 1) • Worker 5: • (good 4)
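  The whole walkthrough (slides 16-19) can be reproduced on one machine; here is a minimal Python sketch of the map, shuffle, and reduce data flow, using integer counts instead of the "1" strings and ignoring distribution entirely:

      from collections import defaultdict

      pages = {
          "Page 1": "the weather is good",
          "Page 2": "today is good",
          "Page 3": "good weather is good",
      }

      def map_fn(doc_name, contents):
          for word in contents.split():
              yield (word, 1)

      def reduce_fn(word, counts):
          yield (word, sum(counts))

      # Map phase: in the example, each page is handled by a different worker.
      intermediate = defaultdict(list)
      for doc_name, contents in pages.items():
          for word, count in map_fn(doc_name, contents):
              intermediate[word].append(count)      # shuffle: group values by key

      # Reduce phase: one reduce call per distinct key, as in slides 18-19.
      for word, counts in intermediate.items():
          for result in reduce_fn(word, counts):
              print(result)   # ('the', 1), ('weather', 2), ('is', 3), ('good', 4), ('today', 1)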

  20. Some Other Real Examples • Term frequencies through the whole Web repository • Count of URL access frequency • Reverse web-link graph

  21. Implementation Overview • Typical cluster: • 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory • Limited bisection bandwidth • Storage is on local IDE disks • GFS: distributed file system manages data (SOSP'03) • Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines • Implementation is a C++ library linked into user programs

  22. Architecture

  23. Execution

  24. Parallel Execution

  25. Task Granularity And Pipelining • Fine granularity tasks: many more map tasks than machines • Minimizes time for fault recovery • Can pipeline shuffling with map execution • Better dynamic load balancing • Often use 200,000 map/5000 reduce tasks w/ 2000 machines

  26. Locality • Master program divvies up tasks based on the location of the data: it asks GFS for the locations of replicas of the input file blocks and tries to schedule map() tasks on the same machine as the physical file data, or at least on the same rack • map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks • Effect: thousands of machines read input at local disk speed • Without this, rack switches would limit the read rate

  27. Fault Tolerance • Master detects worker failures • Re-executes completed & in-progress map() tasks • Re-executes in-progress reduce() tasks • Master notices particular input key/values cause crashes in map(), and skips those values on re-execution. • Effect: Can work around bugs in third-party libraries!

  28. Fault Tolerance • On worker failure: • Detect failure via periodic heartbeats • Re-execute completed and in-progress map tasks • Re-execute in-progress reduce tasks • Task completion committed through master • On master failure: • Could handle, but don't yet (master failure unlikely) • Robust: lost 1600 of 1800 machines once, but finished fine
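  A toy sketch of the master-side bookkeeping this implies (hypothetical names and data structures, not the real scheduler):

      import time

      HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before a worker is presumed dead

      last_heartbeat = {}        # worker_id -> time of last heartbeat
      map_tasks_on = {}          # worker_id -> its map task ids (completed or in progress)
      reduce_tasks_on = {}       # worker_id -> its in-progress reduce task ids
      pending_tasks = []         # tasks waiting to be (re)assigned

      def on_heartbeat(worker_id):
          last_heartbeat[worker_id] = time.time()

      def check_failures():
          now = time.time()
          for worker_id, seen in list(last_heartbeat.items()):
              if now - seen > HEARTBEAT_TIMEOUT:
                  # Map output lives on the failed worker's local disk, so even
                  # completed map tasks must be redone; reduce output goes to GFS,
                  # so only in-progress reduce tasks are rescheduled.
                  pending_tasks.extend(map_tasks_on.pop(worker_id, set()))
                  pending_tasks.extend(reduce_tasks_on.pop(worker_id, set()))
                  del last_heartbeat[worker_id]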

  29. Optimizations • No reduce can start until the map phase is complete: • A single slow disk controller can rate-limit the whole process • Master redundantly executes "slow-moving" map tasks; whichever copy finishes first "wins" and its results are used • Slow workers significantly lengthen completion time: • Other jobs consuming resources on the machine • Bad disks with soft errors transfer data very slowly • Weird things: processor caches disabled (!!) • Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
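  Redundant execution is safe because map tasks are deterministic functions of their input split and completion is committed through the master, so duplicate results are harmless. A hedged sketch with Python threads standing in for worker machines:

      from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

      def run_with_backup(task, executor):
          # Launch the same deterministic task twice and keep whichever copy
          # finishes first; the straggler's result is simply discarded.
          primary = executor.submit(task)
          backup = executor.submit(task)
          done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
          return next(iter(done)).result()

      with ThreadPoolExecutor(max_workers=2) as pool:
          print(run_with_backup(lambda: sum(range(1000)), pool))   # 499500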

  30. Optimizations • "Combiner" functions can run on the same machine as a mapper • This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth • Under what conditions is it sound to use a combiner?
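  For word count, the combiner is just the reducer's summation run locally on the mapper's machine; a small illustrative sketch:

      from collections import Counter

      def map_with_combiner(doc_name, contents):
          # Combiner: pre-sum counts for repeated words before anything is sent,
          # so ("good", 2) crosses the network once instead of ("good", 1) twice.
          for word, count in Counter(contents.split()).items():
              yield (word, count)

  This is sound because addition is associative and commutative; a combiner is unsafe when the reduce function is not (for example, naively averaging raw values).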

  31. Refinements • Sorting guarantees within each reduce partition • Compression of intermediate data • Combiner: useful for saving network bandwidth • Local execution for debugging/testing • User-defined counters

  32. Performance • Two benchmarks: • MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records) • MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark) • Tests run on a cluster of 1800 machines: • 4 GB of memory • Dual-processor 2 GHz Xeons with Hyperthreading • Dual 160 GB IDE disks • Gigabit Ethernet per machine • Bisection bandwidth approximately 100 Gbps

  33. MR_Grep • Locality optimization helps: • 1800 machines read 1 TB of data at peak of ~31 GB/s • Without this, rack switches would limit to 10 GB/s • Startup overhead is significant for short jobs

  34. MR_Sort • (Figure comparing three runs: Normal, No Backup Tasks, 200 processes killed) • Backup tasks reduce job completion time significantly • The system deals well with failures

  35. More and more MapReduce • MapReduce programs in the Google source tree • Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation

  36. Real MapReduce: Rewrite of the Production Indexing System • Rewrote Google's production indexing system using MapReduce • Set of 10, 14, 17, 21, 24 MapReduce operations • New code is simpler, easier to understand • MapReduce takes care of failures, slow machines • Easy to make indexing faster by adding more machines

  37. MapReduce Conclusions • MapReduce has proven to be a useful abstraction • Greatly simplifies large-scale computations at Google • The functional programming paradigm can be applied to large-scale applications • Fun to use: focus on the problem, let the library deal with the messy details

  38. MapReduce Programs • Sorting • Searching • Indexing • Classification • TF-IDF • Breadth-First Search / SSSP • PageRank • Clustering

  39. MapReduce for PageRank

  40. PageRank: Random Walks Over The Web • If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page? • The PageRank of a page captures this notion • More “popular” or “worthwhile” pages get a higher rank

  41. PageRank: Visually

  42. PageRank: Formula Given page A, and pages T1 through Tn linking to A, PageRank is defined as: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) C(P) is the cardinality (out-degree) of page P d is the damping (“random URL”) factor

  43. PageRank: Intuition • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • Calculation is iterative: PR_(i+1) is based on PR_i • Each page distributes its PR_i to all pages it links to • Linkees add up their awarded rank fragments to find their PR_(i+1) • d is a tunable parameter (usually 0.85) encapsulating the "random jump factor"

  44. PageRank: First Implementation • Create two tables, 'current' and 'next', holding the PageRank for each page • Seed 'current' with initial PR values • Iterate over all pages in the graph, distributing PR from 'current' into 'next' of linkees • current := next; next := fresh_table() • Go back to the iteration step, or end if converged
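  A minimal single-machine sketch of that loop, using a hypothetical three-page graph and d = 0.85 (slide 43):

      d = 0.85
      links = {                  # toy adjacency lists: page -> pages it links to
          "A": ["B", "C"],
          "B": ["C"],
          "C": ["A"],
      }

      current = {page: 1.0 for page in links}        # seed 'current' with initial PR values

      for _ in range(50):                            # or loop until the values converge
          next_pr = {page: 1.0 - d for page in links}
          for page, outlinks in links.items():
              share = current[page] / len(outlinks)
              for target in outlinks:
                  next_pr[target] += d * share       # distribute PR from 'current' into 'next'
          current = next_pr                          # current := next

      print(current)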

  45. Distribution of the Algorithm • Key insights allowing parallelization: • The 'next' table depends on 'current', but not on any other rows of 'next' • Individual rows of the adjacency matrix can be processed in parallel • Sparse matrix rows are relatively small

  46. Distribution of the Algorithm • Consequences of insights: • We can map each row of 'current' to a list of PageRank “fragments” to assign to linkees • These fragments can be reduced into a single PageRank value for a page by summing • Graph representation can be even more compact; since each element is simply 0 or 1, only transmit column numbers where it's 1

  47. Phase 1: Parse HTML • Map task takes (URL, page content) pairs and maps them to (URL, (PRinit, list-of-urls)) • PRinit is the “seed” PageRank for URL • list-of-urls contains all pages pointed to by URL • Reduce task is just the identity function
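  A Python sketch of this phase (the regular expression is a crude stand-in for real HTML link extraction, and the seed value is arbitrary):

      import re

      def extract_links(page_content):
          # Crude stand-in for a real HTML parser: pull out href="..." targets.
          return re.findall(r'href="([^"]+)"', page_content)

      def parse_map(url, page_content):
          pr_init = 1.0                              # "seed" PageRank for this URL
          yield (url, (pr_init, extract_links(page_content)))

      def identity_reduce(url, values):              # reduce is just the identity
          for value in values:
              yield (url, value)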

  48. Phase 2: PageRank Distribution PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • Map task takes (URL, (cur_rank, url_list)) • For each u in url_list, emit (u, cur_rank/|url_list|) • Emit (URL, url_list) to carry the points-to list along through iterations

  49. Phase 2: PageRank Distribution PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • Reduce task gets (URL, url_list) and many (URL, val) values • Sum vals and fix up with d • Emit (URL, (new_rank, url_list))
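  Putting slides 48 and 49 together, a Python sketch of the Phase 2 map and reduce (same toy conventions as before, d = 0.85):

      d = 0.85

      def distribute_map(url, value):
          cur_rank, url_list = value
          for u in url_list:                         # emit a rank fragment to each linkee
              yield (u, cur_rank / len(url_list))
          yield (url, url_list)                      # carry the points-to list along

      def distribute_reduce(url, values):
          url_list, total = [], 0.0
          for v in values:
              if isinstance(v, list):                # the carried points-to list
                  url_list = v
              else:                                  # a rank fragment from some linker
                  total += v
          new_rank = (1 - d) + d * total             # sum fragments and "fix up with d"
          yield (url, (new_rank, url_list))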
