
An Introduction to Data Intensive Computing Chapter 3: Processing Big Data


Presentation Transcript


  1. An Introduction to Data Intensive Computing, Chapter 3: Processing Big Data. Robert Grossman (University of Chicago, Open Data Group) and Collin Bennett (Open Data Group). November 14, 2011.

  2. Section 3.1: The Origins of Processing Big Data. [Image: a Google production rack of servers from about 1999.]

  3. How do you do analytics over commodity disks and processors? • How do you improve the efficiency of programmers?

  4. The Google Data Stack • The Google File System (2003) • MapReduce: Simplified Data Processing… (2004) • BigTable: A Distributed Storage System… (2006)

  5. Google’s Large Data Cloud (Google’s early data stack) • Applications • Compute Services: Google’s MapReduce • Data Services: Google’s BigTable • Storage Services: Google File System (GFS)

  6. Hadoop’s Large Data Cloud (the open source Hadoop stack) • Applications • Compute Services: Hadoop’s MapReduce • Data Services: NoSQL, e.g. HBase • Storage Services: Hadoop Distributed File System (HDFS)

  7. A very nice recent book by Barroso and Hölzle, The Datacenter as a Computer.

  8. The Amazon Data Stack • “Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.” (From the Dynamo paper, SOSP ’07.)

  9. Amazon-Style Data Cloud. [Diagram: a load balancer in front of many EC2 instances; the Simple Queue Service provides messaging, SimpleDB (SDB) provides data services, and S3 provides storage services.]

  10. Open Source Versions • Eucalyptus: ability to launch VMs; S3-like storage • OpenStack: ability to launch VMs; S3-like storage (Swift) • Cassandra: a key-value store like S3, with columns like BigTable • Many other open source Amazon-style services are available.

  11. How Do You Program A Data Center?

  12. Some Programming Models for Data Centers • Operations over a data center of disks: • MapReduce (“string-based” scans of data) • User-Defined Functions (UDFs) over the data center • Launch VMs that all have access to highly scalable and available disk-based data • SQL and NoSQL over the data center • Operations over a data center of memory: • Grep over distributed memory • UDFs over distributed memory • Launch VMs that all have access to highly scalable and available memory-based data • SQL and NoSQL over distributed memory

  13. Section 3.2: SMP & Pleasantly Parallel Algorithms in the Age of Google and Amazon

  14. Serial & SMP Algorithms. [Diagram: a serial algorithm runs a single task against local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks that share the same local disk and memory.]

  15. Pleasantly (= Embarrassingly) Parallel • Need to partition data, start tasks, collect results. • Often the tasks are organized into a DAG. [Diagram: independent groups of tasks, each working against its own local disk, coordinated e.g. with MPI.]
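
The slides do not include code for this pattern; a minimal single-machine Python sketch of "partition the data, start independent tasks, collect the results" (the function names and the toy word-count task are illustrative, not from the slides) might look like this:

from multiprocessing import Pool

def task(partition):
    # process one partition independently; here we just count words
    return sum(len(line.split()) for line in partition)

def partitioned(data, n_partitions):
    # split a list of lines into roughly equal partitions
    size = max(1, len(data) // n_partitions)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    lines = ["it was the best of times", "it was the worst of times"] * 1000
    with Pool(processes=4) as pool:
        partial_counts = pool.map(task, partitioned(lines, 4))   # run the tasks in parallel
    print(sum(partial_counts))                                   # collect and combine the results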

  16. Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service

  17. Task with a Messaging Service & Use S3 (Variant 1). [Diagram: a control VM launches and tasks the worker VMs through a messaging service (AWS SQS, an AMQP service, etc.); each worker VM runs a task and reads and writes its data in S3.]

  18. Task with a Messaging Service & Use a NoSQL DB (Variant 2). [Diagram: as in Variant 1, but the worker VMs read and write their data in AWS SimpleDB rather than S3.]

  19. Task with a Messaging Service & Use a Clustered FS (Variant 3). [Diagram: as in Variant 1, but the worker VMs read and write their data in a clustered file system such as GlusterFS.]
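
None of these variant slides include code. Purely as a rough illustration of the worker side of Variant 1, here is a hypothetical polling loop written against the modern boto3 SDK (which postdates these 2011 slides); the queue URL, bucket name, and the trivial word-count task are placeholders:

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"   # placeholder
BUCKET = "example-input-bucket"                                             # placeholder

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def process(text):
    # the actual work each task performs; here, a trivial word count
    return str(len(text.split()))

while True:
    # long-poll the queue for the next unit of work
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        key = msg["Body"]                         # the message body names the S3 object to process
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        result = process(obj["Body"].read().decode("utf-8"))
        s3.put_object(Bucket=BUCKET, Key=key + ".out", Body=result.encode("utf-8"))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])   # acknowledge completion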

  20. Section 3.3: Quick Introduction to MapReduce. [Image: the Google 2004 technical report on MapReduce.]

  21. Core Concepts • Data are (key, value) pairs and that’s it. • Partition the data over commodity nodes filling racks in a data center. • Software handles failures, restarts, etc. This is the hard part. • Basic examples: word count, inverted index.

  22. Processing Big Data Pattern 2: MapReduce

  23. [Diagram of the Hadoop MapReduce dataflow: task trackers run map tasks, which read input from HDFS and write intermediate results to local disk; the shuffle & sort phase moves the intermediate data from those local disks to the reduce tasks, which write their output back to HDFS.]

  24. Example: Word Count & Inverted Index • How do you count the words in a million books? • Word count, e.g.: (best, 7) • Inverted index, e.g.: (best; page 1, page 82, …), (worst; page 1, page 12, …) [Image: cover of the serial, Vol. V, 1859, London.]

  25. Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages. • What is a simple parallel programming framework that would support the computation of word counts and inverted indices?

  26. Basic Pattern: Strings 1. Extract words from web pages in parallel. 2. Hash and sort words. 3. Count (or construct inverted index) in parallel.

  27. What about data records? The same three steps apply, with binned field values in place of words: 1. Extract binned field values from data records in parallel. 2. Hash and sort the binned field values. 3. Count (or construct an inverted index) in parallel.

  28. Map-Reduce Example • Input is files with one document per record. • The user specifies the map function: key = document URL, value = document contents. • Input to map: ("doc cdickens two cities", "it was the best of times"). • Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1), …

  29. Example (cont’d) • The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase). • The user-defined reduce function combines all the values associated with the same key. • Input to reduce: key = "it", values = 1, 1; key = "was", values = 1, 1; key = "best", values = 1; key = "worst", values = 1. • Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1).
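
To make the dataflow of slides 28 and 29 concrete, here is a tiny self-contained Python simulation of map, shuffle/sort, and reduce (this is not Hadoop code, and the two document keys are made up so that both "best" and "worst" appear):

from collections import defaultdict

documents = {
    "doc cdickens two cities p1": "it was the best of times",
    "doc cdickens two cities p2": "it was the worst of times",
}

# map: emit (word, 1) for every word of every document
mapped = [(word, 1) for text in documents.values() for word in text.split()]

# shuffle/sort: group all values by key, as the MapReduce library does
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: combine the values associated with each key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # includes ("it", 2), ("was", 2), ("best", 1), ("worst", 1), as on slide 29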

  30. Why Is Word Count Important? • It is one of the most important examples of the type of text processing often done with MapReduce. • There is an important mapping, the inversion: document ↔ data record, words ↔ (field, value) pairs.

  31. Common MapReduce Design Patterns • Word count • Inversion – inverted index • Computing simple statistics • Computing windowed statistics • Sparse matrix (document-term, data record-FieldBinValue, …) • Site-entity statistics • PageRank • Partitioned and ensemble models • EM
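
Of these patterns, only word count is worked out later in the deck. As an added illustration of the inversion pattern, here is a hypothetical Hadoop Streaming style mapper and reducer for a small inverted index; the "doc_id<TAB>text" input format and all of the names are assumptions, not part of the slides:

import sys
from itertools import groupby
from operator import itemgetter

def invert_map(stream):
    # mapper: emit "word<TAB>doc_id" for every word of every document
    for line in stream:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for word in text.split():
            print("%s\t%s" % (word, doc_id))

def invert_reduce(stream):
    # reducer: collect the sorted (word, doc_id) pairs into postings lists
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=itemgetter(0)):
        doc_ids = sorted({doc_id for _, doc_id in group})
        print("%s\t%s" % (word, ",".join(doc_ids)))

if __name__ == "__main__":
    # run with the argument "reduce" for the reduce phase, otherwise it maps
    invert_reduce(sys.stdin) if sys.argv[1:] == ["reduce"] else invert_map(sys.stdin)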

  32. Exercise: Use MapReduce to count k-mers in short reads arising in next gen sequencing.
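
The exercise does not come with a solution; one possible starting point is a Hadoop Streaming style mapper that emits each k-mer with a count of 1, so that the word-count reducer shown later can sum the counts. The value k = 21 and the one-read-per-line input format are assumptions:

import sys

K = 21   # an assumed, typical k-mer length; not specified in the exercise

for line in sys.stdin:
    read = line.strip().upper()
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        if "N" not in kmer:          # skip k-mers containing ambiguous bases
            print("%s\t1" % kmer)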

  33. Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems

  34. Sector/Sphere • Sector/Sphere is a platform for data intensive computing.

  35. Idea 1: Apply User-Defined Functions (UDFs) to Files in a Distributed File System. [Diagram: both the map/shuffle stage and the reduce stage are expressed as UDFs applied to files.] This generalizes Hadoop’s implementation of MapReduce over the Hadoop Distributed File System.
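
Sphere's actual UDF interface is not shown in the slides. Purely to illustrate the idea of applying a user-defined function to every file, here is a generic Python sketch that runs a UDF over the files of a local directory in parallel; the directory path and the line-counting UDF are placeholders:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def udf(path):
    # a user-defined function applied to one file; here, count its lines
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return path.name, sum(1 for _ in f)

def apply_udf_to_files(directory, workers=4):
    files = sorted(Path(directory).glob("*.txt"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(udf, files))

if __name__ == "__main__":
    for name, n_lines in apply_udf_to_files("./data"):   # "./data" is a placeholder path
        print(name, n_lines)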

  36. Idea 2: Add Security From the Start. [Diagram: client, master, security server, and slaves; the connections to the master and the security server are over SSL (AAA), and data flows between the client and the slaves.] • The security server maintains information about users and slaves. • User access control: password and client IP address. • File-level access control. • Messages are encrypted over SSL; a certificate is used for authentication. • Sector is a good basis for HIPAA-compliant applications.

  37. Idea 3: Extend the Stack to Include Network Transport Services. [Diagram comparing the stacks. Google, Hadoop: Compute Services, Data Services, Storage Services. Sector: Compute Services, Data Services, Storage Services, plus Routing & Transport Services.]

  38. Section 3.4: Warm Up: Means and Variances

  39. Warm Up: Partitioned Means • Means and variances cannot be computed naively when the data sits in distributed partitions. • Step 1: Compute the local tuple (Σ xi, Σ xi², ni) in parallel for each partition. • Step 2: Compute the global mean and variance from these tuples.

  40. Trivial Observation 1 • If si = Σ xi is the local sum for the i’th partition and ni is its count, then the global mean is Σ si / Σ ni. • If only the local means for each partition are passed (without the corresponding counts), then there is not enough information to compute the global mean. • The same trick works for the variance, but you need to pass the triples (Σ xi, Σ xi², ni).
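
A minimal Python sketch of the two steps (illustrative only; in a real job Step 1 runs in parallel, one task per partition):

def local_stats(partition):
    # step 1: the per-partition tuple (sum of x, sum of x squared, count)
    return (sum(partition), sum(x * x for x in partition), len(partition))

def global_mean_variance(tuples):
    # step 2: combine the per-partition tuples into a global mean and (population) variance
    total_sum = sum(s for s, _, _ in tuples)
    total_sq = sum(q for _, q, _ in tuples)
    total_n = sum(n for _, _, n in tuples)
    mean = total_sum / total_n
    variance = total_sq / total_n - mean * mean   # E[x^2] - (E[x])^2
    return mean, variance

if __name__ == "__main__":
    partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(global_mean_variance([local_stats(p) for p in partitions]))
    # matches the mean and variance of the pooled data [1, 2, 3, 4, 5, 6]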

  41. Trivial Observation 2 • To reduce the data passed over the network, combine the appropriate statistics as early as possible. • Consider computing the average by IP address. Recall that with MapReduce there are four steps (map, shuffle, sort, and reduce), and the reduce tasks pull their data from the local disks of the nodes that performed the map. • A combine step in MapReduce combines local data before it is pulled for the reduce step. • There are built-in combiners for counts, means, etc.
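
A small sketch of the combine-then-reduce idea for averages by IP (plain Python dictionaries stand in for Hadoop's map-side combiner, and the IP addresses and values are made up):

from collections import defaultdict

def combine(records):
    # map-side combine: collapse (ip, value) pairs into ip -> [sum, count]
    partial = defaultdict(lambda: [0.0, 0])
    for ip, value in records:
        partial[ip][0] += value
        partial[ip][1] += 1
    return partial

def reduce_averages(partials):
    # reduce: merge the per-mapper partials and emit ip -> average
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for ip, (s, n) in partial.items():
            totals[ip][0] += s
            totals[ip][1] += n
    return {ip: s / n for ip, (s, n) in totals.items()}

if __name__ == "__main__":
    mapper1 = [("10.0.0.1", 2.0), ("10.0.0.1", 4.0), ("10.0.0.2", 1.0)]
    mapper2 = [("10.0.0.1", 6.0), ("10.0.0.2", 3.0)]
    print(reduce_averages([combine(mapper1), combine(mapper2)]))   # {'10.0.0.1': 4.0, '10.0.0.2': 2.0}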

  42. Section 3.5 Interface Choices, Language Choices, & Application Choices

  43. Choosing a Hadoop Interface • In addition to the Java API, Hadoop offers: • a Streaming interface for any language that can read from standard input and write to standard output • Pipes for C++ • Why use something besides Java? Direct access (without JNI/NIO) to: • C++ libraries such as Boost and the GNU Scientific Library (GSL) • R modules

  44. Pros and Cons • Java: + best documented, + largest community, − more lines of code per MapReduce job • Python: + efficient memory handling, + programmers can be very efficient, − limited logging / debugging • R: + vast collection of statistical algorithms, − poor error handling and memory handling, − less familiar to developers

  45. Word Count Python Mapper

import sys

def read_input(file):
    # generate the list of words on each line of standard input
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # emit "word<TAB>1" for the shuffle/sort phase
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

  46. Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# split a trimmed line into words on whitespace
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    # emit "word<TAB>1" for each word
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

  47. Word Count Java Mapper

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}

  48. Code Comparison – Word Count Mapper: the Python, R, and Java mappers from the three preceding slides, shown side by side for comparison.

  49. Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    # parse each "word<TAB>count" line produced by the mapper
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    # the shuffle/sort phase guarantees that identical words arrive consecutively
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()
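
A common way to smoke-test streaming mappers and reducers such as these locally is to imitate Hadoop's shuffle/sort with the Unix sort command, for example: cat input.txt | python mapper.py | sort | python reducer.py (the file names here are placeholders). The same two scripts are then passed to Hadoop Streaming as the job's -mapper and -reducer.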
