
Data-Intensive Text Processing with MapReduce

Data-Intensive Text Processing with MapReduce. Tutorial at the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009). Jimmy Lin The iSchool University of Maryland Sunday, July 19, 2009.


Presentation Transcript


  1. Data-Intensive Text Processing with MapReduce Tutorial at the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009) Jimmy Lin The iSchool University of Maryland Sunday, July 19, 2009 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. PageRank slides adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under a Creative Commons Attribution 3.0 License)

  2. Who am I?

  3. Why big data? • Information retrieval is fundamentally: • Experimental and iterative • Concerned with solving real-world problems • “Big data” is a fact of the real world • Relevance of academic IR research hinges on: • The extent to which we can tackle real-world problems • The extent to which our experiments reflect reality

  4. How much data? • Google processes 20 PB a day (2008) • Wayback Machine has 3 PB + 100 TB/month (3/2009) • Facebook has 2.5 PB of user data + 15 TB/day (4/2009) • eBay has 6.5 PB of user data + 50 TB/day (5/2009) • CERN’s LHC will generate 15 PB a year (??) “640K ought to be enough for anybody.”

  5. No data like more data! (Banko and Brill, ACL 2001; Brants et al., EMNLP 2007) [Figure: results from both papers showing accuracy continuing to improve as training data grows] s/knowledge/data/g; How do we get here if we’re not Google?

  6. Academia vs. Industry • “Big data” is a fact of life • Resource gap between academia and industry • Access to computing resources • Access to data • This is changing: • Commoditization of data-intensive cluster computing • Availability of large datasets for researchers

  7. Cheap commodity clusters (or utility computing, e.g., Amazon Web Services) + simple distributed programming models (e.g., MapReduce) + availability of large datasets (e.g., ClueWeb09) = data-intensive IR research for the masses!

  8. ClueWeb09 • NSF-funded project, led by Jamie Callan (CMU/LTI) • It’s big! • 1 billion web pages crawled in Jan./Feb. 2009 • 10 languages, 500 million pages in English • 5 TB compressed, 25 TB uncompressed • It’s available! • Available to the research community • Test collection coming (TREC 2009)

  9. Ivory and SMRF • Collaboration between: • University of Maryland • Yahoo! Research • Reference implementation for a Web-scale IR toolkit • Designed around Hadoop from the ground up • Written specifically for the ClueWeb09 collection • Implements some of the algorithms described in this tutorial • Features SMRF query engine based on Markov Random Fields • Open source • Initial release available now!

  10. Cloud9 • Set of libraries originally developed for teaching MapReduce at the University of Maryland • Demos, exercises, etc. • “Eat your own dog food” • Actively used for a variety of research projects

  11. Topics: Morning Session • Why is this different? • Introduction to MapReduce • Graph algorithms • MapReduce algorithm design • Indexing and retrieval • Case study: statistical machine translation • Case study: DNA sequence alignment • Concluding thoughts

  12. Topics: Afternoon Session • Hadoop “Hello World” • Running Hadoop in “standalone” mode • Running Hadoop in distributed mode • Running Hadoop on EC2 • Hadoop “nuts and bolts” • Hadoop ecosystem tour • Exercises and “office hours”

  13. Why is this different? Introduction to MapReduce Graph algorithms MapReduce algorithm design Indexing and retrieval Case study: statistical machine translation Case study: DNA sequence alignment Concluding thoughts

  14. Divide and Conquer [Figure: the “Work” is partitioned into w1, w2, w3; each piece goes to a “worker” that produces r1, r2, r3; the partial results are combined into the “Result”]

  15. It’s a bit more complex… • Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, … • Architectural issues: Flynn’s taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence • Different programming models: message passing vs. shared memory [Figure: processes P1–P5 exchanging messages vs. sharing a common memory] • Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, … • Common problems: livelock, deadlock, data starvation, priority inversion, …; dining philosophers, sleeping barbers, cigarette smokers, … The reality: the programmer shoulders the burden of managing concurrency…

  16. Source: Ricardo Guimarães Herrmann

  17. Source: MIT Open Courseware

  18. Source: MIT Open Courseware

  19. Source: Harper’s (Feb, 2008)

  20. Why is this different? Introduction to MapReduce Graph algorithms MapReduce algorithm design Indexing and retrieval Case study: statistical machine translation Case study: DNA sequence alignment Concluding thoughts

  21. Typical Large-Data Problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output Map Reduce Key idea:provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)

  22. MapReduce ~ Map + Fold from functional programming! [Figure: a map stage applies f to each input element independently; a fold stage then combines the results with g]
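
The map + fold analogy can be made concrete with ordinary sequential code. A minimal sketch in plain Java (not from the tutorial; the word list and the choice of f and g are illustrative): map applies a function f to every element independently, and fold combines the mapped values with g.

    import java.util.List;

    public class MapFoldSketch {
        public static void main(String[] args) {
            // Hypothetical input "records"
            List<String> words = List.of("map", "reduce", "map", "fold");

            // Map: apply f to every element (here f = string length).
            // Fold: combine the mapped values with g (here g = addition), starting from 0.
            int totalChars = words.stream()
                                  .map(String::length)      // f
                                  .reduce(0, Integer::sum); // fold with g
            System.out.println(totalChars); // 16
        }
    }

MapReduce generalizes this picture by keying the intermediate values, so the fold (reduce) runs once per key and the runtime can parallelize both stages across machines.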

  23. MapReduce • Programmers specify two functions: map (k, v) → <k’, v’>* reduce (k’, v’) → <k’, v’>* • All values with the same key are reduced together • The runtime handles everything else…

  24. [Figure: MapReduce data flow. Four mappers consume (k, v) input pairs and emit intermediate pairs such as (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8); shuffle and sort aggregates values by key: a → {1, 5}, b → {2, 7}, c → {2, 3, 6, 8}; three reducers then produce the final outputs (r1, s1), (r2, s2), (r3, s3)]

  25. MapReduce • Programmers specify two functions: map (k, v) → <k’, v’>* reduce (k’, v’) → <k’, v’>* • All values with the same key are reduced together • The runtime handles everything else… • Not quite…usually, programmers also specify: partition (k’, number of partitions) → partition for k’ • Often a simple hash of the key, e.g., hash(k’) mod n • Divides up key space for parallel reduce operations combine (k’, v’) → <k’, v’>* • Mini-reducers that run in memory after the map phase • Used as an optimization to reduce network traffic
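
In Hadoop, the default partitioner implements exactly this hash-mod-n scheme, and a custom one can be plugged in by subclassing Partitioner. A minimal sketch (the class name is my own; the types assume a Text key and an IntWritable value as in word count):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each key to a reducer via hash(key) mod numPartitions,
    // mirroring the default behavior described above.
    public class HashModPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask off the sign bit so the result is non-negative
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

A combiner, in turn, is usually just a Reducer class registered with job.setCombinerClass(...); for word count the reducer itself can serve as the combiner, because addition is associative and commutative.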

  26. [Figure: the same data flow with combiners and partitioners. Combiners pre-aggregate each mapper’s output (e.g., c: 3 and c: 6 from one mapper become c: 9), partitioners assign each intermediate key to a reducer, and shuffle and sort then aggregates values by key (e.g., c → {2, 9, 8}) before the reduce phase]

  27. MapReduce Runtime • Handles scheduling • Assigns workers to map and reduce tasks • Handles “data distribution” • Moves processes to data • Handles synchronization • Gathers, sorts, and shuffles intermediate data • Handles faults • Detects worker failures and restarts • Everything happens on top of a distributed FS (later)

  28. “Hello World”: Word Count
  Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

  Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
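
For reference, a runnable version of the same program against Hadoop’s Java (mapreduce) API. This follows the conventional WordCount example; the class names and the reducer-as-combiner choice are standard practice rather than anything specific to this tutorial.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (word, 1) for every token in the input value
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner = reducer; addition is associative
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }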

  29. MapReduce Implementations • MapReduce is a programming model • Google has a proprietary implementation in C++ • Bindings in Java, Python • Hadoop is an open-source implementation in Java • Project led by Yahoo, used in production • Rapidly expanding software ecosystem

  30. [Figure: MapReduce execution overview. (1) The user program forks a master and worker processes; (2) the master assigns map and reduce tasks; (3) map workers read their input splits, (4) write intermediate data to local disk, (5) reduce workers remote-read that data, and (6) write the final output files. Redrawn from (Dean and Ghemawat, OSDI 2004)]

  31. How do we get data to the workers? [Figure: compute nodes pulling data over the network from NAS/SAN storage] What’s the problem here?

  32. Distributed File System • Don’t move data to workers… move workers to the data! • Store data on the local disks of nodes in the cluster • Start up the workers on the node that has the data local • Why? • Not enough RAM to hold all the data in memory • Disk access is slow, but disk throughput is reasonable • A distributed file system is the answer • GFS (Google File System) • HDFS for Hadoop (= GFS clone)
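
From client code, HDFS looks much like an ordinary file system; block placement and replica selection happen behind the scenes. A small sketch of reading a file through Hadoop’s FileSystem API (the path is a hypothetical example):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            // The namenode resolves the path to blocks and replica locations;
            // this client code never sees those details.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/data/sample.txt"); // hypothetical path
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

In a MapReduce job you rarely write this kind of code yourself: the framework schedules map tasks on (or near) the nodes that hold the relevant blocks, which is the whole point of moving workers to the data.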

  33. GFS: Assumptions • Commodity hardware over “exotic” hardware • Scale out, not up • High component failure rates • Inexpensive commodity components fail all the time • “Modest” number of HUGE files • Files are write-once, mostly appended to • Perhaps concurrently • Large streaming reads over random access • High sustained throughput over low latency GFS slides adapted from material by (Ghemawat et al., SOSP 2003)

  34. GFS: Design Decisions • Files stored as chunks • Fixed size (64MB) • Reliability through replication • Each chunk replicated across 3+ chunkservers • Single master to coordinate access, keep metadata • Simple centralized management • No data caching • Little benefit due to large datasets, streaming reads • Simplify the API • Push some of the issues onto the client

  35. [Figure: GFS architecture. The application calls the GFS client with a (file name, chunk index); the client asks the GFS master, which keeps the file namespace, and gets back a (chunk handle, chunk location); the master also sends instructions to the chunkservers and receives chunkserver state. The client then requests a (chunk handle, byte range) directly from a GFS chunkserver, which returns chunk data (e.g., chunk 2ef0) stored on its local Linux file system. Redrawn from (Ghemawat et al., SOSP 2003)]

  36. Master’s Responsibilities • Metadata storage • Namespace management/locking • Periodic communication with chunkservers • Chunk creation, re-replication, rebalancing • Garbage collection

  37. Questions?

  38. Why is this different? Introduction to MapReduce Graph Algorithms MapReduce algorithm design Indexing and retrieval Case study: statistical machine translation Case study: DNA sequence alignment Concluding thoughts

  39. Graph Algorithms: Topics • Introduction to graph algorithms and graph representations • Single Source Shortest Path (SSSP) problem • Refresher: Dijkstra’s algorithm • Breadth-First Search with MapReduce • PageRank

  40. What’s a graph? • G = (V,E), where • V represents the set of vertices (nodes) • E represents the set of edges (links) • Both vertices and edges may contain additional information • Different types of graphs: • Directed vs. undirected edges • Presence or absence of cycles • ...

  41. Some Graph Problems • Finding shortest paths • Routing Internet traffic and UPS trucks • Finding minimum spanning trees • Telco laying down fiber • Finding Max Flow • Airline scheduling • Identify “special” nodes and communities • Breaking up terrorist cells, spread of avian flu • Bipartite matching • Monster.com, Match.com • And of course... PageRank

  42. Representing Graphs • G = (V, E) • Two common representations • Adjacency matrix • Adjacency list

  43. Adjacency Matrices Represent a graph as an n x n square matrix M • n = |V| • Mij = 1 means a link from node i to node j [Figure: a 4-node example directed graph]

  44. Adjacency Lists Take adjacency matrices… and throw away all the zeros: • 1: 2, 4 • 2: 1, 3, 4 • 3: 1 • 4: 1, 3
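
To make the two representations concrete, here is a small sketch in Java for the 4-node example above (variable names are mine):

    import java.util.List;

    public class GraphRepresentations {
        public static void main(String[] args) {
            // Adjacency matrix for the example: row/column i corresponds to node i + 1,
            // and M[i][j] = 1 means a directed edge from node i + 1 to node j + 1.
            int[][] matrix = {
                {0, 1, 0, 1},  // 1: 2, 4
                {1, 0, 1, 1},  // 2: 1, 3, 4
                {1, 0, 0, 0},  // 3: 1
                {1, 0, 1, 0},  // 4: 1, 3
            };

            // The same graph as adjacency lists: only the non-zero entries survive.
            List<List<Integer>> lists = List.of(
                List.of(2, 4),     // node 1
                List.of(1, 3, 4),  // node 2
                List.of(1),        // node 3
                List.of(1, 3)      // node 4
            );

            // The matrix always costs n * n cells; the lists grow with the number of
            // edges, which is why adjacency lists win for sparse graphs like the web.
            System.out.println("matrix cells: " + matrix.length * matrix[0].length);
            System.out.println("list entries: " + lists.stream().mapToInt(List::size).sum());
        }
    }

The adjacency-list form is also the natural fit for MapReduce graph algorithms, since each node and its outgoing edges can be serialized as a single record.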

  45. Single Source Shortest Path • Problem: find shortest path from a source node to one or more target nodes • First, a refresher: Dijkstra’s Algorithm

  46. Dijkstra’s Algorithm Example [Figure: the example graph from CLR; the source is labeled 0 and all other nodes ∞] Example from CLR

  47. Dijkstra’s Algorithm Example [Figure: after relaxing the source’s outgoing edges; tentative distances 10 and 5] Example from CLR

  48. Dijkstra’s Algorithm Example [Figure: after settling the node at distance 5; tentative distances are now 8, 14, 5, 7] Example from CLR

  49. Dijkstra’s Algorithm Example [Figure: after settling the node at distance 7; tentative distances are now 8, 13, 5, 7] Example from CLR

  50. Dijkstra’s Algorithm Example [Figure: after settling the node at distance 8; distances are now 0, 8, 9, 5, 7] Example from CLR
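
For comparison with the MapReduce formulation of shortest paths covered later in the tutorial, a compact sequential sketch of Dijkstra’s algorithm with a priority queue. The example graph encodes the edge weights of the CLR figure as reconstructed from the slides, so treat the exact numbers as illustrative.

    import java.util.Arrays;
    import java.util.PriorityQueue;

    public class DijkstraSketch {
        // Returns dist[v] = length of the shortest path from source to v.
        // adj[u] is an array of {neighbor, weight} pairs.
        static int[] shortestPaths(int[][][] adj, int source) {
            int[] dist = new int[adj.length];
            Arrays.fill(dist, Integer.MAX_VALUE);
            dist[source] = 0;

            // Priority queue of {node, tentative distance}, ordered by distance
            PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
            pq.add(new int[] {source, 0});

            while (!pq.isEmpty()) {
                int[] top = pq.poll();
                int u = top[0], d = top[1];
                if (d > dist[u]) continue;        // stale entry: u was already settled closer
                for (int[] edge : adj[u]) {       // relax every outgoing edge of u
                    int v = edge[0], w = edge[1];
                    if (dist[u] + w < dist[v]) {
                        dist[v] = dist[u] + w;
                        pq.add(new int[] {v, dist[v]});
                    }
                }
            }
            return dist;
        }

        public static void main(String[] args) {
            // Nodes 0..4 stand in for s, t, x, y, z; weights as read off the slides:
            // s->t 10, s->y 5, t->x 1, t->y 2, y->t 3, y->x 9, y->z 2, x->z 4, z->s 7, z->x 6
            int s = 0, t = 1, x = 2, y = 3, z = 4;
            int[][][] adj = new int[5][][];
            adj[s] = new int[][] {{t, 10}, {y, 5}};
            adj[t] = new int[][] {{x, 1}, {y, 2}};
            adj[x] = new int[][] {{z, 4}};
            adj[y] = new int[][] {{t, 3}, {x, 9}, {z, 2}};
            adj[z] = new int[][] {{s, 7}, {x, 6}};

            System.out.println(Arrays.toString(shortestPaths(adj, s)));
            // Prints [0, 8, 9, 5, 7], matching the distance labels on the last slide
        }
    }

The point to notice for later: Dijkstra relies on a global priority queue, which has no direct MapReduce equivalent; that is why the tutorial turns to a breadth-first-search formulation for the parallel case.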
