This lecture explores the concepts of scalability and the MapReduce programming model, which are essential for processing large datasets efficiently. We'll define key terms, including "hapax legomenon" and "heteroskedasticity," and examine the attributes of big data and current challenges in data processing. You'll learn about parallelism, why data must be distributed across resources, and how frameworks like Hadoop provide automated parallelization and fault tolerance for big data operations and analysis in a real-world context.
ENEE 759D | ENEE 459D | CMSC 858Z 4. Scalability and MapReduce Prof. Tudor Dumitraș Assistant Professor, ECE, University of Maryland, College Park http://ter.ps/759d https://www.facebook.com/SDSAtUMD
Today’s Lecture • Where we’ve been • How to say “hapax legomenon” and “heteroskedasticity” • Interpretation of Statistics • Attributes of Big Data • Where we’re going today • Threats to validity • Scalability • MapReduce • Where we’re going next • Machine learning
The IROP Keyboard [Zeller, 2011] To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse
Does this work? What am I measuring? How well does this work in the real world? Will this work tomorrow? [Diagram: reconstructing the lineage of malware samples from the Korgo worm family]
What Am I Measuring: Scalability vs. Latency Can we make use of 1000s of cheap computers? • Analyzing data in parallel • To access 1 TB in 1 min, must distribute data over 20 disks • Parallelism is useful for algorithms where complexity constants matter • N log N operations sequentially => (N log N)/K operations in parallel • Scalability: ability to throw resources at the problem • You can measure scalability • Scaleup (weak scalability): • More resources => solve proportionally bigger problem with same latency • Speedup (strong scalability): • More resources => proportionally lower latency with same problem size
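A minimal formal sketch of these two measures (notation mine, not from the slides), with T(N, K) the time to solve a problem of size N on K machines:

  speedup(K) = T(N, 1) / T(N, K)      (strong scalability; ideal when speedup(K) ≈ K)
  scaleup(K) = T(N, 1) / T(K·N, K)    (weak scalability; ideal when scaleup(K) ≈ 1)

For example (numbers hypothetical), if one machine processes a data set in 60 minutes and 20 machines take 4 minutes on the same data, speedup(20) = 60/4 = 15, i.e., 75% parallel efficiency.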
Some Problems Are Embarrassingly Parallel (1) Task: Convert 405K TIFF images (~4 TB) to PNG Input: many TIFF images Distribute images among K computers f is a function to convert TIFF to PNG; apply it to every item [diagram: a copy of f runs on each computer] Output: a big distributed set of converted images http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
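As an illustrative sketch only (not from the slides): the same "apply f to every item" pattern on a single multi-core machine, using Python's multiprocessing and the Pillow imaging library. The directory layout and pool size are hypothetical, and a cluster framework like Hadoop would replace the local process pool.

# Sketch: embarrassingly parallel conversion; every item is independent.
from multiprocessing import Pool
from pathlib import Path
from PIL import Image   # Pillow

def convert(tiff_path):
    # f: convert one TIFF image to a PNG next to it
    png_path = tiff_path.with_suffix(".png")
    Image.open(tiff_path).save(png_path)   # output format inferred from extension
    return png_path

if __name__ == "__main__":
    tiffs = sorted(Path("images").glob("*.tiff"))
    with Pool(processes=8) as pool:            # K workers instead of K computers
        converted = pool.map(convert, tiffs)   # apply f to every item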
Some Problems Are Embarrassingly Parallel (2) Task: Compute the word frequency of 5M documents Input: millions of documents Distribute documents among K computers For each document, f returns a set of <word, freq> pairs [diagram: a copy of f runs on each computer] Output: a big distributed list of sets of word freqs. Adapted from slides by Bill Howe
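A minimal sketch of one such per-document f (Python; names illustrative): it turns a single document into <word, freq> pairs, so each of the K computers can run it independently on its own share of documents.

from collections import Counter

def word_freq(document_text):
    # f: one document in, a set of <word, freq> pairs out
    words = document_text.lower().split()
    return list(Counter(words).items())   # e.g., [("worm", 3), ("the", 12), ...]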
Some Problems Are Embarrassingly Parallel (3) Task: Compute the word frequency across all documents Input: millions of documents Distribute documents among K computers For each document, f returns a set of <word, freq> pairs [diagram: a copy of f runs on each computer] We don’t want a bunch of little histograms – we want one big histogram Now what?
MapReduce Task: Compute the word frequency across all documents Distribute documents among K computers For each document, f returns a set of <word, freq> pairs [diagram: map runs on each computer] A big distributed list of sets of word freqs. Shuffle <word, freq> pairs so that all the counts for a word are sent to the same host [diagram: reduce runs on each computer] Add the counts of each word Output: the distributed histogram
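A minimal sketch of the shuffle and reduce steps in isolation (Python; in a real MapReduce run the framework does this routing, typically by hashing the key to pick a reducer): group every count emitted for a word so that a single reduce call sees all of them.

from collections import defaultdict

def shuffle(intermediate_pairs):
    # route all <word, freq> pairs for the same word to one place
    groups = defaultdict(list)
    for word, freq in intermediate_pairs:
        groups[word].append(freq)
    return groups

def reduce_counts(word, freqs):
    # add the counts of each word
    return word, sum(freqs)

# e.g., shuffle([("worm", 2), ("the", 5), ("worm", 1)]) -> {"worm": [2, 1], "the": [5]}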
Hadoop on One Slide • MapReduce was invented at Google [Dean & Ghemawat, OSDI’04] • Hadoop = open source implementation • Data stored on HDFS distributed file system • Direct-attached storage • No schema needed on load • Programmers write Map and Reduce functions • Framework provides automated parallelization and fault tolerance • Data replication, restarting failed tasks • Scheduling Map and Reduce tasks on hosts with local copies of input data Source: Huy Vo
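As an illustrative sketch only (not from the lecture): with Hadoop Streaming, the Map and Reduce functions can be plain scripts that read lines from stdin and write tab-separated key/value pairs to stdout, while the framework handles distribution, shuffling, and retries. The file names below are hypothetical.

# mapper.py — emit <word, 1> for every word in the input
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py — counts for the same word arrive together (sorted) after the shuffle
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))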
MapReduce Programming Model • Input & Output: each a set of key/value pairs • Programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) • Processes input key/value pair • Produces set of intermediate pairs reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one) • Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell Slide source: Google
Example: What Does This Do?
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  EmitFinal(output_key, result);
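(It computes the word frequency across all documents.) A minimal, single-machine Python sketch of the same two functions plus a tiny driver, useful for experimenting with the logic before running it on Hadoop; the document contents are made up.

def map_fn(doc_name, doc_contents):
    # emit <word, 1> for every word occurrence in the document
    return [(w, 1) for w in doc_contents.split()]

def reduce_fn(word, intermediate_values):
    # combine all intermediate values for this word
    return word, sum(intermediate_values)

# driver: map, shuffle (group by key), reduce
docs = {"d1": "to be or not to be", "d2": "to do is to be"}
pairs = [p for name, text in docs.items() for p in map_fn(name, text)]
grouped = {}
for w, c in pairs:
    grouped.setdefault(w, []).append(c)
histogram = dict(reduce_fn(w, counts) for w, counts in grouped.items())
print(histogram)   # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}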
Big Data in the Security Industry • Booz Allen Hamilton • Dr. Brian Keller’s colloquium “Innovating with Analytics” • Sponsors the Data Science Bowl, October 5th, 1–5:30 pm, CSIC 2117 & 2120 https://www.datasciencebowl.com/ • Symantec • WINE platform for data analytics in security • Google • Mine user access patterns to mitigate data loss due to stolen credentials • Supplementary to passwords and two-factor authentication • Fuzz testing at scale
Big Data for Security: Benefits and Challenges • Benefits • Ability to analyze data at scale (e.g., information on the 403 million malware variants created in 2011) • MapReduce provides a simple programming model, automated parallelization, and fault tolerance • Commercial parallel DBs (e.g. Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive • Challenges • Lack of ground truth on malware families • Lack of contextual data, e.g., date and time of appearance • Inability to collect some types of data owing to privacy concerns • Sharing data (e.g., malware samples are dangerous, some data sets may include personal information) • These challenges illustrate general threats to validity in experimental cyber security
Threats to Validity • Construct validity: use metrics that model the hypothesis (What am I measuring?) • Internal validity: establish causal connection (Does it work?) • Content validity: include only and all relevant data • External validity: generalize results beyond experimental data (Will it work in the real world? Will it work tomorrow?)
Review of Lecture • What did we learn? • Construct, content, internal, external validity • Programming in MapReduce • Measuring scalability • What’s next? • Paper discussion: ‘Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World’ • Next lecture: Machine learning techniques • Deadline reminder • Pilot project reports due on Wednesday • Post report on Piazza