Parallel and Distributed Programming Models and Languages
This discussion argues that distributed computation is necessary in the era of big data and highlights the limits of single-node processing, using sorting to illustrate how much data a single machine can handle compared with a cluster. It examines the MapReduce and Dryad programming models, whose runtimes take care of locality optimization, load balancing, and fault tolerance when processing massive datasets. It also covers higher-level abstractions such as Pig Latin and Green-Marl, with applications including web data and graph analysis, and weighs their strengths and trade-offs.
Parallel and Distributed Programming Models and Languages • 15-740/18-740 Computer Architecture • In-Class Discussion • Dong Zhou, Kun Li, Mike Ralph
Why distributed computations? • Buzzword: Big Data • Take sorting as an example • Amount of data that can be sorted in 60 seconds • One computer can read ~60 MB/sec from one disk • 2012 world record • Flat Datacenter Storage by Ed Nightingale et al. • 1470 GB • 256 heterogeneous nodes, 1033 disks • Google indexes 100 billion+ web pages
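A back-of-the-envelope check of the sorting numbers above, as a small Python sketch. The figures are the ones on the slide; the variable names are illustrative. The single-disk estimate is an upper bound, since it counts only one sequential read and ignores writing the sorted output.

```python
# Figures taken from the slide; names are illustrative.
DISK_READ_MB_PER_SEC = 60      # one computer, one disk
WINDOW_SEC = 60                # length of the sorting window
RECORD_GB = 1470               # 2012 record (Flat Datacenter Storage)
RECORD_DISKS = 1033

# Upper bound on what a single disk can even read in the window (decimal GB).
single_disk_gb = DISK_READ_MB_PER_SEC * WINDOW_SEC / 1000

# How many times more data the record run sorted than one disk can read.
ratio = RECORD_GB / single_disk_gb

# Average amount of sorted data per disk per second in the record run.
per_disk_mb_per_sec = RECORD_GB * 1000 / WINDOW_SEC / RECORD_DISKS

print(f"one disk reads at most {single_disk_gb:.1f} GB in {WINDOW_SEC} s")
print(f"the record is about {ratio:.0f}x that amount")
print(f"record run: about {per_disk_mb_per_sec:.1f} MB/s sorted per disk")
```

Under these numbers a single disk cannot even read 4 GB in the window, so a record of this size is only reachable by spreading the data over many disks and nodes.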
Solution: use many nodes • Grid computing • Hundreds of supercomputers connected by a high-speed network • Cluster computing • Thousands or tens of thousands of PCs connected by high-speed LANs • 1000 nodes potentially give 1000x speedup
Distributed computations are difficult to program • Sending data to/from nodes • Coordinating among nodes • Recovering from node failure • Optimizing for locality • Debugging • …
MapReduce • A programming model for large-scale computations • Process large amounts of input, produce output • No side-effects or persistent state • MapReduce is implemented as a runtime library • Automatic parallelization • Load balancing • Locality optimization • Handling of machine failures
MapReduce design • Input data is partitioned into M splits • Map: extract information on each split • Each map produces R partitions • Shuffle and sort • Bring M partitions to the same reducer • Reduce: aggregate, summarize, filter or transform • Output is in R result files
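The flow on this slide can be sketched as a single-process simulation. This is only an illustration under assumptions: `run_mapreduce`, the toy temperature records, and the use of `zlib.crc32` for partitioning are not from the slides, and a real runtime would run the map and reduce tasks on different machines.

```python
from collections import defaultdict
import zlib

def run_mapreduce(splits, map_fn, reduce_fn, num_reducers):
    """Simulate M = len(splits) map tasks and R = num_reducers reduce tasks."""
    # Map phase: each of the M map tasks writes into R partitions.
    map_outputs = []
    for split in splits:
        partitions = [[] for _ in range(num_reducers)]
        for key, value in map_fn(split):
            r = zlib.crc32(key.encode()) % num_reducers
            partitions[r].append((key, value))
        map_outputs.append(partitions)

    # Shuffle and sort: reducer r collects partition r from all M map tasks.
    results = []
    for r in range(num_reducers):
        grouped = defaultdict(list)
        for partitions in map_outputs:
            for key, value in partitions[r]:
                grouped[key].append(value)
        # Reduce phase: one result "file" (here a list) per reducer.
        results.append([(k, reduce_fn(k, grouped[k])) for k in sorted(grouped)])
    return results

def map_fn(split):
    # Extract (city, temperature) from each hypothetical "city,temp" record.
    for line in split:
        city, temp = line.split(",")
        yield city, int(temp)

def reduce_fn(city, temps):
    # Summarize: keep the maximum temperature seen for the city.
    return max(temps)

splits = [["pgh,10", "nyc,12"], ["pgh,15", "sfo,18"], ["nyc,9"]]          # M = 3
output_files = run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2)  # R = 2
merged = dict(kv for part in output_files for kv in part)
print(merged)
```

Because every occurrence of a key hashes to the same partition, each key appears in exactly one of the R result lists.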
More specifically • Programmer specifies two methods • map(k, v) → <k', v'>* • reduce(k', <v'>*) → <k'', v''>* • All v' with same k' are reduced together • Usually also specify: • partition(k', total partitions) → partition for k' • often a simple hash of the key
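A minimal sketch of the partition function described above, assuming string keys. The choice of `zlib.crc32` is an assumption made here for illustration; the slides only say "a simple hash of the key".

```python
import zlib

def partition(key: str, total_partitions: int) -> int:
    """Map an intermediate key k' to a reducer index in [0, total_partitions)."""
    # crc32 gives the same value in every process. Python's built-in hash()
    # of a str is salted per process by default, so different workers could
    # disagree on where a key belongs.
    return zlib.crc32(key.encode("utf-8")) % total_partitions

keys = ["to", "be", "or", "not", "to", "be"]
assignments = [partition(k, 4) for k in keys]
print(list(zip(keys, assignments)))
```

The property that matters is determinism: both occurrences of "to" land in the same partition, which is what lets all v' with the same k' be reduced together.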
MapReduce is widely applicable • Distributed grep • Distributed clustering • Web link graph reversal • Detecting approx. duplicate web pages • …
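As one illustration from the list above, web link graph reversal fits the map/reduce shape directly: map emits <target, source> for each link, and reduce collects all sources for a target. The sketch below is a single-process stand-in with hypothetical page names; the in-memory grouping plays the role of the shuffle.

```python
from collections import defaultdict

def map_links(source_url, outgoing_links):
    # Emit <target, source> for every link found on the source page.
    for target in outgoing_links:
        yield target, source_url

def reduce_links(target_url, sources):
    # All sources that link to the same target are reduced together.
    return target_url, sorted(sources)

pages = {  # hypothetical toy input: page -> pages it links to
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

# Stand-in for the shuffle: group intermediate values by key.
grouped = defaultdict(list)
for url, links in pages.items():
    for target, source in map_links(url, links):
        grouped[target].append(source)

reversed_graph = dict(reduce_links(t, s) for t, s in grouped.items())
print(reversed_graph)
```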
Dryad • Similar goals to MapReduce • Focus on throughput, not latency • Automatic management of scheduling, distribution, fault tolerance • Computations expressed as a graph • Vertices are computations • Edges are communication channels • Each vertex has several input and output edges
Why use a dataflow graph? • Many programs can be represented as a distributed dataflow graph • The programmer may not have to know this • "SQL-like" queries: LINQ • Dryad will run them for you
Runtime • Vertices (V) run arbitrary app code • Vertices exchange data through files, TCP pipes, etc. • Vertices communicate with JM to report status • Daemon process (D) executes vertices • Job Manager (JM) consults name server (NS) to discover available machines • JM maintains job graph and schedules vertices
Job = Directed Acyclic Graph • Figure labels: Inputs, Processing vertices, Channels (file, pipe, shared memory), Outputs
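To make the idea of a job as a DAG concrete, here is a small sketch of how a scheduler could decide which vertices are runnable. It is not Dryad's actual job manager: the vertex names and the `schedule` helper are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job graph: each vertex maps to the set of vertices it reads from.
job_graph = {
    "parse_0": set(),
    "parse_1": set(),
    "join": {"parse_0", "parse_1"},
    "aggregate": {"join"},
}

def schedule(graph):
    """Return 'waves' of vertices; a vertex runs once all its inputs are done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                        # raises CycleError if not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # vertices whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave finished
    return waves

waves = schedule(job_graph)
print(waves)
```

Vertices in the same wave have no dependencies on each other, so they could run in parallel on different machines.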
Advantages of DAG over MapReduce • Big jobs are more efficient with Dryad • MapReduce: a big job runs as more than one MR stage • Reducers of each stage write to replicated storage • Output of reduce: 2 network copies, 3 disks • Dryad: each job is represented as a DAG • Intermediate vertices write to local files • …
Pig Latin • High-level procedural abstraction of MapReduce • Contains SQL-like primitives • Example: good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls) > 10^6; output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank); • Plus user-defined functions (UDFs)
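For readers unfamiliar with the primitives, the example query can be mirrored step by step on in-memory data. This is plain Python, not Pig, and it only shows what each statement computes. The rows and the small `MIN_GROUP_SIZE` threshold are assumptions chosen so the toy data produces a result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "blog", 0.3),
]
MIN_GROUP_SIZE = 1  # the slide's query uses a much larger threshold

# good_urls = FILTER urls BY pagerank > 0.2
good_urls = [row for row in urls if row[2] > 0.2]

# groups = GROUP good_urls BY category
groups = defaultdict(list)
for url, category, pagerank in good_urls:
    groups[category].append(pagerank)

# big_groups = FILTER groups BY COUNT(good_urls) > threshold
big_groups = {c: ranks for c, ranks in groups.items() if len(ranks) > MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
output = {c: mean(ranks) for c, ranks in big_groups.items()}
print(output)
```

Each statement consumes the relation produced by the previous one, which is the step-by-step, procedural style the slide contrasts with declarative SQL.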
Value • Reduces development time • Procedural vs. declarative • Overhead/performance costs worthwhile? • Figure labels: C/C++, Assembly, Pig Latin, MapReduce
Green-Marl • High-level graph analysis language/compiler • Uses basic data types and graph primitives • Built-in graph functions • BFS, RBFS, DFS • Uses domain-specific optimizations • Both non-architecture-specific and architecture-specific • Compiler translates Green-Marl to another high-level language (e.g., C++)
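To show the kind of traversal that such a built-in BFS stands for, here is a breadth-first search written by hand. It is plain Python rather than Green-Marl syntax, and the `bfs_levels` helper and the toy graph are hypothetical.

```python
from collections import deque

def bfs_levels(adjacency, root):
    """Return {vertex: hop distance from root} for every reachable vertex."""
    level = {root: 0}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for w in adjacency.get(v, []):
            if w not in level:              # the first visit fixes the level
                level[w] = level[v] + 1
                frontier.append(w)
    return level

graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}  # hypothetical
levels = bfs_levels(graph, "s")
print(levels)
```

A language that knows the traversal is a BFS can choose how to parallelize each frontier, which is the sort of domain-specific optimization the slide refers to.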
Tradeoffs • Achieves speedup over hand-tuned parallel equivalents • Tested only on a single workstation • Only works with graph representations • Difficulty representing certain data sets and computations • Domain-specific vs. general-purpose languages • Future work: more architectures, user-defined data structures
Example: count word frequencies in web pages • Input is files with one doc per record • Map parses each document into words • key = document URL • value = document contents • Example input record: <"doc1", "to be or not to be"> • Output of map: <"to", "1">, <"be", "1">, <"or", "1">, <"not", "1">, <"to", "1">, <"be", "1">
Example: count word frequencies in web pages • Reduce: computes the sum for each key • key = "be", values = "1", "1" → "2" • key = "not", values = "1" → "1" • key = "or", values = "1" → "1" • key = "to", values = "1", "1" → "2" • Output of reduce is saved: <"to", "2">, <"be", "2">, <"or", "1">, <"not", "1">
Example: Pseudo-code
Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
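The pseudo-code translates almost line for line into runnable Python. In this sketch the in-memory grouping stands in for the shuffle that the MapReduce runtime would perform; `map_fn`, `reduce_fn`, and the driver loop are illustrative names, not part of any MapReduce library.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield word, "1"

def reduce_fn(key, intermediate_values):
    # key: a word; intermediate_values: an iterable of counts as strings
    result = 0
    for v in intermediate_values:
        result += int(v)
    return key, str(result)

documents = {"doc1": "to be or not to be"}

# Stand-in for the shuffle: group every emitted value under its key.
intermediate = defaultdict(list)
for name, contents in documents.items():
    for word, count in map_fn(name, contents):
        intermediate[word].append(count)

counts = dict(reduce_fn(w, vs) for w, vs in intermediate.items())
print(counts)  # {'to': '2', 'be': '2', 'or': '1', 'not': '1'}
```

Counts are kept as strings only to mirror the pseudo-code, which passes "1" and calls ParseInt; an implementation could emit integers directly.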