
Big Data Processing with MapReduce and Spark

Presentation Transcript


  1. Big Data Processing with MapReduce and Spark. Matei Zaharia, UC Berkeley AMPLab, spark-project.org

  2. Outline • The big data problem • MapReduce model • Limitations of MapReduce • Spark model • Future directions

  3. The Big Data Problem • Data is growing faster than computation speeds • Growing data sources: web, mobile, scientific, … • Cheap storage: doubling every 18 months • Stalling CPU speeds: even multicores are not enough

  4. Examples • Facebook’s daily logs: 60 TB • 1000 genomes project: 200 TB • Google web index: 10+ PB • Cost of 1 TB of disk: $50 • Time to read 1 TB from disk: 6 hours (50 MB/s)

  5. The Big Data Problem • Single machine can no longer process or even store all the data! • Only solution is to distribute over large clusters

  6. Google Datacenter How do we program this thing?

  7. Traditional Network Programming • Message-passing between nodes • Really hard to do at scale: • How to split problem across nodes? • Must consider network, data locality • How to deal with failures? • 1 server fails every 3 years => 10K nodes see 10 faults/day • Even worse: stragglers (node is not failed, but slow) Almost nobody does message passing!

  8. Data-Parallel Models • Restrict the programming interface so that the system can do more automatically • “Here’s an operation, run it on all of the data” • I don’t care where it runs (you schedule that) • In fact, feel free to run it twice on different nodes • Biggest example: MapReduce

  9. MapReduce • First widely popular programming model for data-intensive apps on clusters • Published by Google in 2004 • Processes 20 PB of data / day • Popularized by open-source Hadoop project • 40,000 nodes at Yahoo!, 70 PB at Facebook

  10. MapReduce Programming Model • Data type: key-value records • Map function: (K_in, V_in) → list(K_inter, V_inter) • Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

  11. Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
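For illustration only (not part of the slides), here is a minimal, self-contained Python sketch that simulates the map, shuffle/sort, and reduce phases of this word-count job in a single process; the run_mapreduce helper and the sample lines are made up for this example.

from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum all counts for one word.
    yield (key, sum(values))

def run_mapreduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every input record.
    intermediate = defaultdict(list)
    for record in inputs:
        for k, v in mapper(record):
            intermediate[k].append(v)   # "shuffle": group values by key
    # Reduce phase: apply the reducer to each key group, in sorted key order.
    output = []
    for k in sorted(intermediate):
        output.extend(reducer(k, intermediate[k]))
    return output

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
print(run_mapreduce(lines, mapper, reducer))
# e.g. [('ate', 1), ('brown', 2), ..., ('the', 3)]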

  12. Word Count Execution [Diagram: three input lines ("the quick brown fox", "the fox ate the mouse", "how now brown cow") flow through Map, Shuffle & Sort, and Reduce, producing the counts the: 3, brown: 2, fox: 2, quick: 1, ate: 1, mouse: 1, how: 1, now: 1, cow: 1]

  13. MapReduce Execution • Automatically split work into many small tasks • Send map tasks to nodes based on data locality • Load-balance dynamically as tasks finish

  14. Fault Recovery 1. If a task crashes: • Retry on another node • OK for a map because it had no dependencies • OK for reduce because map outputs are on disk • If the same task repeatedly fails, end the job Requires user code to be deterministic

  15. Fault Recovery 2. If a node crashes: • Relaunch its current tasks on other nodes • Relaunch any maps the node previously ran • Necessary because their output files were lost along with the crashed node

  16. Fault Recovery 3. If a task is going slowly (straggler): • Launch second copy of task on another node • Take the output of whichever copy finishes first, and kill the other one

  17. Example Applications

  18. 1. Search • Input: (lineNumber, line) records • Output: lines matching a given pattern • Map: if (line matches pattern): output(line) • Reduce: identity function • Alternative: no reducer (map-only job)
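A small illustrative Python sketch of this map-only job (the pattern, sample records, and helper names are hypothetical):

import re

PATTERN = re.compile(r"ERROR")   # hypothetical pattern to search for

def search_mapper(record):
    line_number, line = record
    if PATTERN.search(line):
        yield (line_number, line)

# Identity reduce, or no reduce at all for a map-only job.
def identity_reducer(key, values):
    for v in values:
        yield (key, v)

records = [(1, "INFO start"), (2, "ERROR disk full"), (3, "INFO done")]
matches = [kv for rec in records for kv in search_mapper(rec)]
print(matches)   # [(2, 'ERROR disk full')]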

  19. 2. Sort • Input: (key, value) records • Output: same records, sorted by key • Map: identity function • Reduce: identity function • Trick: pick a partitioning function p so that k1 < k2 => p(k1) < p(k2) [Diagram: range partitioning, e.g. keys in [A-M] go to one reducer and keys in [N-Z] to another, so the concatenated reducer outputs (aardvark, ant, bee, cow, elephant | pig, sheep, yak, zebra) are globally sorted]
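A minimal illustrative sketch of such a range-partitioning function, assuming string keys and two reducers (the sample keys are made up):

def range_partition(key, num_reducers=2):
    # Keys starting with a-m go to reducer 0, n-z to reducer 1, so the
    # partition order follows the key order across partitions.
    return 0 if key[0].lower() <= "m" else 1

records = ["zebra", "ant", "pig", "bee", "cow", "sheep", "aardvark", "yak", "elephant"]
partitions = {0: [], 1: []}
for key in records:
    partitions[range_partition(key)].append(key)

# Each reducer sorts its own partition; concatenating them yields a global sort.
print(sorted(partitions[0]) + sorted(partitions[1]))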

  20. 3. Inverted Index • Input: (filename, text) records • Output: list of files containing each word • Map: for word in text.split(): output(word, filename) • Reduce: def reduce(word, filenames): output(word, unique(filenames))
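A small, self-contained Python sketch of this job, using the same two documents as the example on the next slide (illustrative only):

from collections import defaultdict

docs = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Map: emit (word, filename) for every word in every document.
pairs = [(word, filename) for filename, text in docs.items()
         for word in text.split()]

# Shuffle + Reduce: group filenames by word and keep unique, sorted file lists.
index = defaultdict(set)
for word, filename in pairs:
    index[word].add(filename)

for word in sorted(index):
    print(word, sorted(index[word]))
# afraid ['12th.txt']
# be ['12th.txt', 'hamlet.txt']
# ...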

  21. Inverted Index Example [Diagram: hamlet.txt ("to be or not to be") and 12th.txt ("be not afraid of greatness") are mapped to (word, filename) pairs and reduced to: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt)]

  22. 4. Most Popular Words • Input: (filename, text) records • Output: the 100 words occurring in the most files • Two-stage solution: • Job 1: create inverted index, giving (word, list(file)) records • Job 2: map each (word, list(file)) to (count, word), then sort these records by count as in the sort job
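An in-process illustrative sketch of the two-stage pipeline (the sample documents are hypothetical; a real deployment would run each stage as a separate MapReduce job):

from collections import defaultdict

docs = {
    "a.txt": "spark makes clusters easy",
    "b.txt": "mapreduce makes clusters simple",
    "c.txt": "spark caches data in memory",
}

# Job 1: inverted index -> (word, set of files)
index = defaultdict(set)
for filename, text in docs.items():
    for word in text.split():
        index[word].add(filename)

# Job 2: map each (word, files) to (count, word), then sort by count.
counts = [(len(files), word) for word, files in index.items()]
top = sorted(counts, reverse=True)[:100]
print(top[:5])   # e.g. [(2, 'spark'), (2, 'makes'), (2, 'clusters'), ...]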

  23. Summary • By providing a data-parallel model, MapReduce greatly simplified cluster programming: • Automatic division of job into tasks • Locality-aware scheduling • Load balancing • Recovery from failures & stragglers • But… the story doesn’t end here!

  24. Outline • The big data problem • MapReduce model • Limitations of MapReduce • Spark model • Future directions

  25. When an Abstraction is Useful… • People want to compose it! • Most real applications require multiple MR steps • Google indexing pipeline: 21 steps • Analytics queries (e.g. sessions, top K): 2-5 steps • Iterative algorithms (e.g. PageRank): 10’s of steps • Problems: programmability & performance

  26. Programmability • Multi-step jobs create spaghetti code • 21 MR steps -> 21 mapper and reducer classes • Lots of boilerplate wrapper code per step • API doesn’t provide type safety

  27. Performance • MR only provides one pass of computation • Must write out data to file system in-between • Expensive for apps that need to reuse data • Multi-step algorithms (e.g. PageRank) • Interactive data mining (many queries on same data) • Users often hand-optimize by merging steps

  28. Spark • Aims to address both problems • Programmability: clean, functional API • Parallel transformations on collections • 5-10x less code than MR • Available in Scala, Java and Python • Performance: • In-memory computing primitives • Automatic optimization across operators

  29. Spark Programmability Google MapReduce WordCount:

#include "mapreduce/mapreduce.h"

// User's map function
class SplitWords : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(SplitWords);

// User's reduce function
class Sum : public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Sum);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  for (int i = 1; i < argc; i++) {
    MapReduceInput* in = spec.add_input();
    in->set_format("text");
    in->set_filepattern(argv[i]);
    in->set_mapper_class("SplitWords");
  }
  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Sum");
  // Do partial sums within map
  out->set_combiner_class("Sum");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}

  30. Spark Programmability Spark WordCount:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.save("out.txt")

  31. Spark Performance • Iterative algorithms: [Chart: running times in seconds]

  32. Spark Concepts • Resilient distributed datasets (RDDs) • Immutable, partitioned collections of objects • May be cached in memory for fast reuse • Operations on RDDs • Transformations (build RDDs), actions (compute results) • Restricted shared variables • Broadcast, accumulators
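An illustrative PySpark sketch of these concepts (not from the slides; it assumes a running SparkContext named sc and uses made-up data):

# Transformations build new RDDs lazily; nothing runs yet.
nums = sc.parallelize(range(1, 1001), 4)        # base RDD with 4 partitions
squares = nums.map(lambda x: x * x)             # transformation
evens = squares.filter(lambda x: x % 2 == 0)    # transformation

evens.cache()                                    # keep this RDD in memory for reuse

# Actions trigger computation and return results to the driver.
print(evens.count())                             # 500
print(evens.take(3))                             # [4, 16, 36]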

  33. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count      // action
messages.filter(_.contains("bar")).count

[Diagram: the driver sends tasks to workers, which read blocks 1-3 of the file and keep the resulting partitions in caches 1-3; later queries hit the cached data]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: search 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
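For readers more familiar with Python, a roughly equivalent PySpark sketch (assumes a SparkContext named sc and the same tab-separated log format as above; the path is left elided as in the slide):

lines = sc.textFile("hdfs://...")                        # base RDD
errors = lines.filter(lambda l: l.startswith("ERROR"))   # transformation
messages = errors.map(lambda l: l.split("\t")[2])        # keep the message field
messages.cache()                                         # reuse across queries

# Each count() is an action; after the first one, data is served from memory.
print(messages.filter(lambda m: "foo" in m).count())
print(messages.filter(lambda m: "bar" in m).count())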

  34. Fault Recovery RDDs track lineage information that can be used to efficiently reconstruct lost partitions

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

[Diagram: lineage graph HDFS File -> filter(func) -> Filtered RDD -> map(func) -> Mapped RDD]

  35. Demo

  36. Example: Logistic Regression Goal: find the best line separating two sets of points [Diagram: a plane of + and – points, with a random initial line converging toward the target separating line]

  37. Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

(w is automatically shipped to the cluster)
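A hedged PySpark/NumPy sketch of the same loop (assumes a SparkContext named sc, NumPy available on the workers, and a hypothetical space-separated input format with the label first; D and parse_point are stand-ins for the slide's D and readPoint):

import numpy as np

ITERATIONS = 10
D = 10                                   # number of features (assumed)

def parse_point(line):
    # Hypothetical format: label (+1/-1) followed by D feature values.
    vals = [float(x) for x in line.split()]
    return (vals[0], np.array(vals[1:]))

data = sc.textFile("hdfs://...").map(parse_point).cache()
w = np.random.rand(D)

for i in range(ITERATIONS):
    # w is captured by the closure and shipped to the cluster each iteration.
    gradient = data.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w:", w)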

  38. Logistic Regression Performance [Chart: 110 s / iteration; with Spark, first iteration 80 s, further iterations 1 s]

  39. Other RDD Operations

  40. Spark in Java and Python

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

Python:
lines = sc.textFile(...)
lines.filter(lambda x: "error" in x).count()

  41. Shared Variables • So far we’ve seen that RDD operations can use variables from outside their scope • By default, each task gets a read-only copy of each variable (no sharing) • Good place to enable other sharing patterns!

  42. Example: Collaborative Filtering Goal: predict users' movie ratings based on past ratings of other movies [Diagram: R is a sparse Users × Movies matrix with a few known ratings (1-5) and many unknown entries marked '?']

  43. Model and Algorithm Model R as the product of user and movie feature matrices A and B of size U×K and M×K: R = A Bᵀ • Alternating Least Squares (ALS) • Start with random A & B • Optimize user vectors (A) based on movies • Optimize movie vectors (B) based on users • Repeat until converged

  44. Serial ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}

((0 until U) and (0 until M) are Range objects)

  45. Naïve Spark ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R is re-sent to all nodes in each iteration

  46. Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as a broadcast variable. Result: 3× performance improvement
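A minimal PySpark sketch of the broadcast pattern itself (assumes a SparkContext named sc; the lookup table and update function are hypothetical stand-ins for R and updateUser):

# Large read-only data is broadcast once instead of being shipped with every task.
ratings = {"user1": 4.0, "user2": 5.0, "user3": 3.0}      # stand-in for R
ratings_bc = sc.broadcast(ratings)

def update_user(user):
    # Workers read the broadcast value; it is not re-sent per task.
    return (user, ratings_bc.value.get(user, 0.0) + 0.1)

result = sc.parallelize(["user1", "user2", "user3"]).map(update_user).collect()
print(result)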

  47. Accumulators • Apart from broadcast, another common sharing pattern is aggregation • Add up multiple statistics about data • Count various events for debugging • Spark’s reduce operation does aggregation, but accumulators are another nice way to express it

  48. Usage

val badRecords = sc.accumulator(0)
val badBytes = sc.accumulator(0.0)

records.filter(r => {
  if (isBad(r)) {
    badRecords += 1
    badBytes += r.size
    false
  } else {
    true
  }
}).save(...)

printf("Total bad records: %d, avg size: %f\n",
       badRecords.value, badBytes.value / badRecords.value)

  49. Accumulator Rules • Create with SparkContext.accumulator(initialVal) • “Add” to the value with += inside tasks • Each task’s effect only counted once • Access with .value, but only on master • Exception if you try it on workers Retains efficiency and fault tolerance!
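An illustrative PySpark sketch of these rules (assumes a SparkContext named sc; the sample records are made up; in Python the update happens inside a function shipped to workers, while .value is read only on the driver):

bad_records = sc.accumulator(0)

def keep_good(record):
    if record.startswith("BAD"):
        bad_records.add(1)      # updated inside tasks; counted once per task
        return False
    return True

data = sc.parallelize(["ok 1", "BAD x", "ok 2", "BAD y"])
good = data.filter(keep_good)
good.count()                     # an action must run before the accumulator updates

print("Bad records:", bad_records.value)   # read .value on the driver only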

  50. Job Scheduler • Captures the RDD dependency graph • Pipelines functions into "stages" • Cache-aware for data reuse & locality • Partitioning-aware to avoid shuffles [Diagram: a DAG of RDDs A-G connected by groupBy, map, join, and union operations, divided into Stages 1-3, with cached partitions marked]
