
Big Data Processing with MapReduce and Spark

Presentation Transcript


  1. Big Data Processing with MapReduce and Spark. Matei Zaharia, UC Berkeley AMPLab, spark-project.org

  2. Outline • The big data problem • MapReduce model • Limitations of MapReduce • Spark model • Future directions

  3. The Big Data Problem • Data is growing faster than computation speeds • Growing data sources: web, mobile, scientific, … • Cheap storage: doubling every 18 months • Stalling CPU speeds: even multicores are not enough

  4. Examples • Facebook’s daily logs: 60 TB • 1000 genomes project: 200 TB • Google web index: 10+ PB • Cost of 1 TB of disk: $50 • Time to read 1 TB from disk: 6 hours (50 MB/s)

  5. The Big Data Problem • Single machine can no longer process or even store all the data! • Only solution is to distribute over large clusters

  6. Google Datacenter How do we program this thing?

  7. Traditional Network Programming • Message-passing between nodes • Really hard to do at scale: • How to split problem across nodes? • Must consider network, data locality • How to deal with failures? • 1 server fails every 3 years => 10K nodes see 10 faults/day • Even worse: stragglers (node is not failed, but slow) Almost nobody does message passing!

  8. Data-Parallel Models • Restrict the programming interface so that the system can do more automatically • “Here’s an operation, run it on all of the data” • I don’t care where it runs (you schedule that) • In fact, feel free to run it twice on different nodes • Biggest example: MapReduce

  9. MapReduce • First widely popular programming model for data-intensive apps on clusters • Published by Google in 2004 • Processes 20 PB of data / day • Popularized by open-source Hadoop project • 40,000 nodes at Yahoo!, 70 PB at Facebook

  10. MapReduce Programming Model • Data type: key-value records • Map function: (K_in, V_in) → list(K_inter, V_inter) • Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

  11. Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
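For illustration only (not part of the slides), here is a minimal, self-contained Python sketch that simulates the map, shuffle/sort, and reduce phases of this word-count job in a single process; the run_mapreduce helper and the sample lines are made up for this example.

from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum all counts for one word.
    yield (key, sum(values))

def run_mapreduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every input record.
    intermediate = defaultdict(list)
    for record in inputs:
        for k, v in mapper(record):
            intermediate[k].append(v)   # "shuffle": group values by key
    # Reduce phase: apply the reducer to each key group, in sorted key order.
    output = []
    for k in sorted(intermediate):
        output.extend(reducer(k, intermediate[k]))
    return output

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
print(run_mapreduce(lines, mapper, reducer))
# e.g. [('ate', 1), ('brown', 2), ..., ('the', 3)]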

  12. Word Count Execution [Diagram: three input lines ("the quick brown fox", "the fox ate the mouse", "how now brown cow") flow through Map, Shuffle & Sort, and Reduce, producing the counts the: 3, brown: 2, fox: 2, quick: 1, ate: 1, mouse: 1, how: 1, now: 1, cow: 1]

  13. MapReduce Execution • Automatically split work into many small tasks • Send map tasks to nodes based on data locality • Load-balance dynamically as tasks finish

  14. Fault Recovery 1. If a task crashes: • Retry on another node • OK for a map because it had no dependencies • OK for reduce because map outputs are on disk • If the same task repeatedly fails, end the job Requires user code to be deterministic

  15. Fault Recovery 2. If a node crashes: • Relaunch its current tasks on other nodes • Relaunch any maps the node previously ran • Necessary because their output files were lost along with the crashed node

  16. Fault Recovery 3. If a task is going slowly (straggler): • Launch second copy of task on another node • Take the output of whichever copy finishes first, and kill the other one

  17. Example Applications

  18. 1. Search • Input: (lineNumber, line) records • Output: lines matching a given pattern • Map: if (line matches pattern): output(line) • Reduce: identity function • Alternative: no reducer (map-only job)
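A small illustrative Python sketch of this map-only job (the pattern, sample records, and helper names are hypothetical):

import re

PATTERN = re.compile(r"ERROR")   # hypothetical pattern to search for

def search_mapper(record):
    line_number, line = record
    if PATTERN.search(line):
        yield (line_number, line)

# Identity reduce, or no reduce at all for a map-only job.
def identity_reducer(key, values):
    for v in values:
        yield (key, v)

records = [(1, "INFO start"), (2, "ERROR disk full"), (3, "INFO done")]
matches = [kv for rec in records for kv in search_mapper(rec)]
print(matches)   # [(2, 'ERROR disk full')]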

  19. 2. Sort • Input: (key, value) records • Output: same records, sorted by key • Map: identity function • Reduce: identity function • Trick: pick a partitioning function p so that k1 < k2 => p(k1) < p(k2) [Diagram: range partitioning, e.g. keys in [A-M] go to one reducer and keys in [N-Z] to another, so the concatenated reducer outputs (aardvark, ant, bee, cow, elephant | pig, sheep, yak, zebra) are globally sorted]
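A minimal illustrative sketch of such a range-partitioning function, assuming string keys and two reducers (the sample keys are made up):

def range_partition(key, num_reducers=2):
    # Keys starting with a-m go to reducer 0, n-z to reducer 1, so the
    # partition order follows the key order across partitions.
    return 0 if key[0].lower() <= "m" else 1

records = ["zebra", "ant", "pig", "bee", "cow", "sheep", "aardvark", "yak", "elephant"]
partitions = {0: [], 1: []}
for key in records:
    partitions[range_partition(key)].append(key)

# Each reducer sorts its own partition; concatenating them yields a global sort.
print(sorted(partitions[0]) + sorted(partitions[1]))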

  20. 3. Inverted Index • Input: (filename, text) records • Output: list of files containing each word • Map: for word in text.split(): output(word, filename) • Reduce: def reduce(word, filenames): output(word, unique(filenames))
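A small, self-contained Python sketch of this job, using the same two documents as the example on the next slide (illustrative only):

from collections import defaultdict

docs = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Map: emit (word, filename) for every word in every document.
pairs = [(word, filename) for filename, text in docs.items()
         for word in text.split()]

# Shuffle + Reduce: group filenames by word and keep unique, sorted file lists.
index = defaultdict(set)
for word, filename in pairs:
    index[word].add(filename)

for word in sorted(index):
    print(word, sorted(index[word]))
# afraid ['12th.txt']
# be ['12th.txt', 'hamlet.txt']
# ...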

  21. Inverted Index Example [Diagram: hamlet.txt ("to be or not to be") and 12th.txt ("be not afraid of greatness") are mapped to (word, filename) pairs and reduced to: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt)]

  22. 4. Most Popular Words • Input: (filename, text) records • Output: the 100 words occurring in the most files • Two-stage solution: • Job 1: create inverted index, giving (word, list(file)) records • Job 2: map each (word, list(file)) to (count, word), then sort these records by count as in the sort job
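An in-process illustrative sketch of the two-stage pipeline (the sample documents are hypothetical; a real deployment would run each stage as a separate MapReduce job):

from collections import defaultdict

docs = {
    "a.txt": "spark makes clusters easy",
    "b.txt": "mapreduce makes clusters simple",
    "c.txt": "spark caches data in memory",
}

# Job 1: inverted index -> (word, set of files)
index = defaultdict(set)
for filename, text in docs.items():
    for word in text.split():
        index[word].add(filename)

# Job 2: map each (word, files) to (count, word), then sort by count.
counts = [(len(files), word) for word, files in index.items()]
top = sorted(counts, reverse=True)[:100]
print(top[:5])   # e.g. [(2, 'spark'), (2, 'makes'), (2, 'clusters'), ...]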

  23. Summary • By providing a data-parallel model, MapReduce greatly simplified cluster programming: • Automatic division of job into tasks • Locality-aware scheduling • Load balancing • Recovery from failures & stragglers • But… the story doesn’t end here!

  24. Outline • The big data problem • MapReduce model • Limitations of MapReduce • Spark model • Future directions

  25. When an Abstraction is Useful… • People want to compose it! • Most real applications require multiple MR steps • Google indexing pipeline: 21 steps • Analytics queries (e.g. sessions, top K): 2-5 steps • Iterative algorithms (e.g. PageRank): 10’s of steps • Problems: programmability & performance

  26. Programmability • Multi-step jobs create spaghetti code • 21 MR steps -> 21 mapper and reducer classes • Lots of boilerplate wrapper code per step • API doesn’t provide type safety

  27. Performance • MR only provides one pass of computation • Must write out data to file system in-between • Expensive for apps that need to reuse data • Multi-step algorithms (e.g. PageRank) • Interactive data mining (many queries on same data) • Users often hand-optimize by merging steps

  28. Spark • Aims to address both problems • Programmability: clean, functional API • Parallel transformations on collections • 5-10x less code than MR • Available in Scala, Java and Python • Performance: • In-memory computing primitives • Automatic optimization across operators

  29. Spark Programmability Google MapReduce WordCount:

#include "mapreduce/mapreduce.h"

// User's map function
class SplitWords : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(SplitWords);

// User's reduce function
class Sum : public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Sum);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  for (int i = 1; i < argc; i++) {
    MapReduceInput* in = spec.add_input();
    in->set_format("text");
    in->set_filepattern(argv[i]);
    in->set_mapper_class("SplitWords");
  }
  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Sum");
  // Do partial sums within map
  out->set_combiner_class("Sum");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}

  30. Spark Programmability Spark WordCount:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.save("out.txt")

  31. Spark Performance • Iterative algorithms: [Chart: running times in seconds]

  32. Spark Concepts • Resilient distributed datasets (RDDs) • Immutable, partitioned collections of objects • May be cached in memory for fast reuse • Operations on RDDs • Transformations (build RDDs), actions (compute results) • Restricted shared variables • Broadcast, accumulators
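An illustrative PySpark sketch of these concepts (not from the slides; it assumes a running SparkContext named sc and uses made-up data):

# Transformations build new RDDs lazily; nothing runs yet.
nums = sc.parallelize(range(1, 1001), 4)        # base RDD with 4 partitions
squares = nums.map(lambda x: x * x)             # transformation
evens = squares.filter(lambda x: x % 2 == 0)    # transformation

evens.cache()                                    # keep this RDD in memory for reuse

# Actions trigger computation and return results to the driver.
print(evens.count())                             # 500
print(evens.take(3))                             # [4, 16, 36]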

  33. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count      // action
messages.filter(_.contains("bar")).count

[Diagram: the driver sends tasks to workers, which read blocks 1-3 of the file and keep the resulting partitions in caches 1-3; later queries hit the cached data]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: search 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
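For readers more familiar with Python, a roughly equivalent PySpark sketch (assumes a SparkContext named sc and the same tab-separated log format as above; the path is left elided as in the slide):

lines = sc.textFile("hdfs://...")                        # base RDD
errors = lines.filter(lambda l: l.startswith("ERROR"))   # transformation
messages = errors.map(lambda l: l.split("\t")[2])        # keep the message field
messages.cache()                                         # reuse across queries

# Each count() is an action; after the first one, data is served from memory.
print(messages.filter(lambda m: "foo" in m).count())
print(messages.filter(lambda m: "bar" in m).count())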

  34. Fault Recovery RDDs track lineage information that can be used to efficiently reconstruct lost partitions

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

[Diagram: lineage graph HDFS File -> filter(func) -> Filtered RDD -> map(func) -> Mapped RDD]

  35. Demo

  36. Example: Logistic Regression Goal: find the best line separating two sets of points [Diagram: a plane of + and – points, with a random initial line converging toward the target separating line]

  37. Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

(w is automatically shipped to the cluster)
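A hedged PySpark/NumPy sketch of the same loop (assumes a SparkContext named sc, NumPy available on the workers, and a hypothetical space-separated input format with the label first; D and parse_point are stand-ins for the slide's D and readPoint):

import numpy as np

ITERATIONS = 10
D = 10                                   # number of features (assumed)

def parse_point(line):
    # Hypothetical format: label (+1/-1) followed by D feature values.
    vals = [float(x) for x in line.split()]
    return (vals[0], np.array(vals[1:]))

data = sc.textFile("hdfs://...").map(parse_point).cache()
w = np.random.rand(D)

for i in range(ITERATIONS):
    # w is captured by the closure and shipped to the cluster each iteration.
    gradient = data.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w:", w)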

  38. Logistic Regression Performance [Chart: 110 s / iteration; with Spark, first iteration 80 s, further iterations 1 s]

  39. Other RDD Operations

  40. Spark in Java and Python

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

Python:
lines = sc.textFile(...)
lines.filter(lambda x: "error" in x).count()

  41. Shared Variables • So far we’ve seen that RDD operations can use variables from outside their scope • By default, each task gets a read-only copy of each variable (no sharing) • Good place to enable other sharing patterns!

  42. Example: Collaborative Filtering Goal: predict users' movie ratings based on past ratings of other movies [Diagram: R is a sparse Users × Movies matrix with a few known ratings (1-5) and many unknown entries marked '?']

  43. Model and Algorithm Model R as the product of user and movie feature matrices A and B of size U×K and M×K: R = A Bᵀ • Alternating Least Squares (ALS) • Start with random A & B • Optimize user vectors (A) based on movies • Optimize movie vectors (B) based on users • Repeat until converged

  44. Serial ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}

((0 until U) and (0 until M) are Range objects)

  45. Naïve Spark ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R is re-sent to all nodes in each iteration

  46. Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as a broadcast variable. Result: 3× performance improvement
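A minimal PySpark sketch of the broadcast pattern itself (assumes a SparkContext named sc; the lookup table and update function are hypothetical stand-ins for R and updateUser):

# Large read-only data is broadcast once instead of being shipped with every task.
ratings = {"user1": 4.0, "user2": 5.0, "user3": 3.0}      # stand-in for R
ratings_bc = sc.broadcast(ratings)

def update_user(user):
    # Workers read the broadcast value; it is not re-sent per task.
    return (user, ratings_bc.value.get(user, 0.0) + 0.1)

result = sc.parallelize(["user1", "user2", "user3"]).map(update_user).collect()
print(result)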

  47. Accumulators • Apart from broadcast, another common sharing pattern is aggregation • Add up multiple statistics about data • Count various events for debugging • Spark’s reduce operation does aggregation, but accumulators are another nice way to express it

  48. Usage

val badRecords = sc.accumulator(0)
val badBytes = sc.accumulator(0.0)

records.filter(r => {
  if (isBad(r)) {
    badRecords += 1
    badBytes += r.size
    false
  } else {
    true
  }
}).save(...)

printf("Total bad records: %d, avg size: %f\n",
       badRecords.value, badBytes.value / badRecords.value)

  49. Accumulator Rules • Create with SparkContext.accumulator(initialVal) • “Add” to the value with += inside tasks • Each task’s effect only counted once • Access with .value, but only on master • Exception if you try it on workers Retains efficiency and fault tolerance!
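An illustrative PySpark sketch of these rules (assumes a SparkContext named sc; the sample records are made up; in Python the update happens inside a function shipped to workers, while .value is read only on the driver):

bad_records = sc.accumulator(0)

def keep_good(record):
    if record.startswith("BAD"):
        bad_records.add(1)      # updated inside tasks; counted once per task
        return False
    return True

data = sc.parallelize(["ok 1", "BAD x", "ok 2", "BAD y"])
good = data.filter(keep_good)
good.count()                     # an action must run before the accumulator updates

print("Bad records:", bad_records.value)   # read .value on the driver only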

  50. Job Scheduler • Captures the RDD dependency graph • Pipelines functions into "stages" • Cache-aware for data reuse & locality • Partitioning-aware to avoid shuffles [Diagram: a DAG of RDDs A-G connected by groupBy, map, join, and union operations, divided into Stages 1-3, with cached partitions marked]
