ecs289m Fall, 2009
"Facebook and Hadoop" Lecture #04
S. Felix Wu
Computer Science Department
University of California, Davis
wu@cs.ucdavis.edu
http://www.cs.ucdavis.edu/~wu/
Hadoop at Facebook
• Production cluster
• 4800 cores, 600 machines, 16 GB per machine (April 2009)
• 8000 cores, 1000 machines, 32 GB per machine (July 2009)
• 4 SATA disks of 1 TB each per machine
• 2-level network hierarchy, 40 machines per rack
• Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster: 800 cores, 16 GB per machine
• Cloudera: the "Red Hat" for Hadoop cloud computing
Yahoo! Hadoop Clusters
• Yahoo! has ~10,000 machines running Hadoop
• The largest cluster is currently 1,600 nodes
• Nearly 1 petabyte of user data (compressed, unreplicated)
• Runs roughly 10,000 research jobs per week
Web Crawl Problem Detection
• The Problem
• Yahoo! crawls billions of pages per day. How do you detect when one site has a problem?
• The Solution (see the sketch below)
• Load the crawl logs into Hadoop (via a map-reduce job)
• Aggregate reports by site over time and flag sites where the crawl behavior has changed
• This generates a report to customer service every day
• They contact webmasters and get the sites fixed
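A hedged sketch of the aggregation step, not Yahoo!'s actual job: the log format (site, tab, HTTP status) is an assumption, and the day-over-day comparison is left to a downstream check. The mapper emits a (site, error flag) pair per log line; the reducer turns those into a per-site error rate.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CrawlHealth {
      // Map: one crawl-log line -> (site, 1 if a server error, else 0).
      public static class LogMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] f = line.toString().split("\t");   // assumed: site<TAB>httpStatus
          int error = f[1].startsWith("5") ? 1 : 0;   // flag 5xx responses
          ctx.write(new Text(f[0]), new IntWritable(error));
        }
      }
      // Reduce: (site, [flags]) -> (site, error rate for the day).
      public static class RateReducer
          extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        public void reduce(Text site, Iterable<IntWritable> flags, Context ctx)
            throws IOException, InterruptedException {
          long errors = 0, total = 0;
          for (IntWritable f : flags) { errors += f.get(); total++; }
          ctx.write(site, new DoubleWritable((double) errors / total));
        }
      }
    }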
Facebook Data Flow
[Diagram: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster, with Oracle RAC and MySQL as the database tiers.]
Facebook Hadoop and Hive
• Usage statistics:
• 15 TB of uncompressed data ingested per day
• 55 TB of compressed data scanned per day
• 3200+ jobs on the production cluster per day
• 80M compute minutes per day
• Barrier to entry is reduced:
• 80+ engineers have run jobs on the Hadoop platform
• Analysts (non-engineers) are starting to use Hadoop through Hive
What Is Hadoop?
• A distributed computing framework
• For clusters of computers
• Thousands of compute nodes
• Petabytes of data
• Open source, Java
• Google's MapReduce and GFS inspired Yahoo!'s Hadoop
What Is Hadoop?
• The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes:
• Hadoop Common: the common utilities
• Avro: a data serialization system that integrates with scripting languages
• Chukwa: a data collection system for managing large distributed systems
• HBase: a scalable, distributed database for large tables
• HDFS: a distributed file system
• Hive: data summarization and ad hoc querying
• MapReduce: distributed processing on compute clusters
• Pig: a high-level data-flow language for parallel computation
• ZooKeeper: a coordination service for distributed applications
GFS – Google File System
• "Failures" are the norm
• Multi-GB files are common
• Append rather than overwrite
• Random writes are rare
• Can we relax the consistency?
The Master
• Maintains all file system metadata
• Namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.
• Periodically communicates with chunkservers via HeartBeat messages to give instructions and check state
The Master
• Makes sophisticated chunk placement and replication decisions, using global knowledge
• For reading and writing, a client contacts the Master to get chunk locations, then deals directly with chunkservers
• The Master is not a bottleneck for reads/writes
Chunkservers
• Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle
• The handle is assigned by the master at chunk creation
• Chunk size is 64 MB (see the worked example below)
• Each chunk is replicated on 3 servers (by default)
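A tiny worked example (mine, not from the slides): because chunks are a fixed 64 MB, a client can translate a byte offset into a chunk index with simple arithmetic before asking the master for that chunk's handle and replica locations.

    public class ChunkMath {
      static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB, per GFS

      // Which chunk of the file holds this byte offset?
      static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
      }

      public static void main(String[] args) {
        // Byte offset 200 MB falls in chunk 3 (0-based): 200 / 64 = 3.
        System.out.println(chunkIndex(200L * 1024 * 1024));  // prints 3
      }
    }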
Clients
• Linked into apps using the file system API
• Communicate with the master and chunkservers for reading and writing
• Master interactions only for metadata
• Chunkserver interactions for data
• Cache only metadata information
• The data is too large to cache
Chunk Locations
• The master does not keep a persistent record of the locations of chunks and replicas
• It polls chunkservers at startup, and whenever chunkservers join or leave
• It stays up to date by controlling the placement of new chunks and through HeartBeat messages (when monitoring chunkservers)
Atomic Commitment
• Either all replicas made the change or none of them did!
• Asynchronous updates → inconsistency
• "Commit state" and "write state"
• Before you actually write "it" to the public record, you already have it "committed"
Atomic commit protocols
• One-phase atomic commit protocol
• The coordinator tells the participants whether to commit or abort
• What is the problem with that?
• It does not allow one of the servers to decide to abort – it may have discovered a deadlock, or it may have crashed and been restarted
• The decision could be commit or abort – participants record it in permanent store
Atomic commit protocols
• Two-phase atomic commit protocol
• Designed to allow any participant to choose to abort a transaction
• Phase 1: each participant votes. If it votes to commit, it is "prepared" and cannot change its mind. In case it crashes, it must save its updates in permanent store
• Phase 2: the participants carry out the joint decision
• The decision could be commit or abort – participants record it in permanent store
Two-phase commit (2PC)
[Diagram: the coordinator asks every server "What is your result?", collects the replies, and then announces the final consensus to all servers.]
Failure model
• Commit protocols are designed to work in
• an asynchronous system (e.g., messages may take a very long time)
• servers and the coordinator may crash
• messages may be lost
• Assume corrupt and duplicated messages are removed
• No Byzantine faults – servers either crash or they obey their requests
• 2PC is an example of a protocol for reaching consensus
• Crash failures of processes are masked by replacing a crashed process with a new process whose state is set from information saved in permanent storage and information held by other processes
2PC
• Voting phase: the coordinator asks all servers if they can commit
• If yes, a server records its updates in permanent storage and then votes
• Completion phase: the coordinator tells all servers to commit or abort (a sketch follows below)
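A minimal single-process sketch (mine, not from the lecture) of the two phases, with no crashes or message loss modeled: the coordinator collects the canCommit votes, then broadcasts doCommit only if every server voted yes.

    import java.util.Arrays;
    import java.util.List;

    public class TwoPhaseCommitDemo {
      enum State { INIT, READY, ABORTED, COMMITTED }

      static class Server {
        State state = State.INIT;
        private final boolean able;          // can this server commit?
        Server(boolean able) { this.able = able; }

        // Phase 1: vote. A yes vote makes the server "prepared" (READY);
        // from here it may not unilaterally change its mind.
        boolean canCommit() {
          if (able) { state = State.READY; return true; }
          state = State.ABORTED; return false;
        }
        // Phase 2: carry out the coordinator's joint decision.
        void doCommit() { state = State.COMMITTED; }
        void doAbort()  { state = State.ABORTED; }
      }

      static void coordinate(List<Server> servers) {
        boolean allYes = true;
        for (Server s : servers) allYes &= s.canCommit();   // voting phase
        for (Server s : servers) {                          // completion phase
          if (allYes) s.doCommit();
          else if (s.state == State.READY) s.doAbort();
        }
        System.out.println(allYes ? "COMMIT" : "ABORT");
      }

      public static void main(String[] args) {
        // One server refuses, so the joint decision is ABORT.
        coordinate(Arrays.asList(new Server(true), new Server(true), new Server(false)));
      }
    }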
[Diagram: 2PC state machines. Coordinator: INIT → sends "canCommit?" → WAIT → sends "doAbort" or "doCommit" → ABORT or COMMIT. Server: INIT → votes → READY → receives "doAbort" or "doCommit" → ABORT or COMMIT, then replies "haveCommitted".]
Failures
• Some servers missed "canCommit"
• The coordinator missed some "votes"
• Some servers missed "doAbort" or "doCommit"
Failures/Crashes
• Some servers crashed before/after "canCommit"
• The coordinator crashed before/after receiving some "votes"
• Some servers crashed before/after receiving "doAbort" or "doCommit"
Assume the coordinator crashed after the "canCommit" messages have been sent:
(0) Some servers have not received the vote requests. → WAIT/INIT
(1) All good servers are in the WAIT state. → WAIT/INIT
(2) Some servers are in either the ABORT or COMMIT state. → ABORT/COMMIT
(3) All servers are in either the ABORT or COMMIT state. → ABORT/COMMIT
Assume the coordinator crashed after the "canCommit" messages have been sent:
(0) Some servers have not received the vote requests. → ABORT
(1) All good servers are in the WAIT state. → ABORT
(2) Some servers are in either the ABORT or COMMIT state. → ABORT
(3) All servers are in either the ABORT or COMMIT state. → ABORT/COMMIT
Assume the coordinator crashed after the "canCommit" messages have been sent:
(0) Some servers have not received the vote requests. → ABORT
(1) All good servers are in the WAIT state. → ABORT
(2) Some servers are in either the ABORT or COMMIT state. → ABORT/COMMIT
(3) All servers are in either the ABORT or COMMIT state. → ABORT/COMMIT
[Diagram: the coordinator is COMMITTED and one server is COMMITTED and has WRITTEN its update; the decision message M has not yet reached the remaining servers, which are still WAITING.]
Assume the coordinator crashed after the "canCommit" messages have been sent:
(0) Some servers have not received the vote requests. → ABORT
(1) All good servers are in the WAIT state. → ???
(2) Some servers are in either the ABORT or COMMIT state. → ABORT/COMMIT
(3) All servers are in either the ABORT or COMMIT state. → ABORT/COMMIT
2PC
• Concept widely used!
• The only "holding" condition is …
3PC (Skeen & Stonebraker, 1983)
[Diagram: 3PC server states: INIT → (vote) → WAIT (Uncertain) → either ABORT (Aborted) or Pre-COMMIT (Committable) → (ACK) → COMMIT (Committed).]
[Diagram: the same scenario under 3PC: the coordinator is COMMITTED and one server is COMMITTED and has WRITTEN its update, while the remaining servers, which message M has not yet reached, are in PRE-COMMIT / WAITING??]
HDFS Architecture
[Diagram: the client sends (1) a filename to the NameNode, gets back (2) a BlockId and the DataNodes that hold it, and (3) reads the data directly from those DataNodes. The Secondary NameNode sits beside the NameNode; DataNodes report cluster membership to the NameNode.]
• NameNode: maps a file to a file-id and a list of DataNodes
• DataNode: maps a block-id to a physical location on disk
• SecondaryNameNode: periodic merge of the transaction log
(A client-side sketch follows below.)
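A hedged sketch of what steps 1-3 look like through the Java client API; the path is made up, and the FileSystem API hides the NameNode/DataNode split from the caller.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster config
        FileSystem fs = FileSystem.get(conf);       // client handle to HDFS
        // open() asks the NameNode for the block locations (steps 1-2); the
        // returned stream then reads bytes straight from the DataNodes (step 3).
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
          byte[] buf = new byte[4096];
          int n = in.read(buf);
          System.out.write(buf, 0, Math.max(n, 0));
        }
      }
    }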
Map and Reduce
• The idea of Map and Reduce is 40+ years old
• Present in all functional programming languages
• See, e.g., APL, Lisp, and ML
• An alternate name for Map: Apply-All
• Higher-order functions
• take function definitions as arguments, or
• return a function as output
• Map and Reduce are higher-order functions (see the sketch below)
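A minimal illustration (mine, not from the slides) of the same idea in Java, the language Hadoop itself is written in: map and reduce as higher-order functions over a list.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class HigherOrder {
      public static void main(String[] args) {
        List<Integer> v = Arrays.asList(1, 2, 3, 4, 5);
        // map: apply the function x -> x + 1 to every element
        List<Integer> w = v.stream().map(x -> x + 1).collect(Collectors.toList());
        // reduce: fold all elements into one value with a binary function
        int sum = v.stream().reduce(0, Integer::sum);
        System.out.println(w + " " + sum);   // [2, 3, 4, 5, 6] 15
      }
    }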
Map: A Higher-Order Function
• F(x: int) returns r: int
• Let V be an array of integers
• W = map(F, V)
• W[i] = F(V[i]) for all i
• i.e., apply F to every element of V
Map Examples in Haskell
• map (+1) [1,2,3,4,5] == [2,3,4,5,6]
• map toLower "abcDEFG12!@#" == "abcdefg12!@#"
• map (`mod` 3) [1..10] == [1,2,0,1,2,0,1,2,0,1]
Word Count Example
• Read text files and count how often words occur
• The input is text files
• The output is a text file
• Each line: word, tab, count
• Map: produce (word, count) pairs
• Reduce: for each word, sum up the counts
(A Hadoop sketch follows below.)
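A sketch of this job against the 0.20-era org.apache.hadoop.mapreduce API, closely following Hadoop's shipped WordCount example (driver setup omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map: emit (word, 1) for every token in the line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }
      // Reduce: sum the counts for each word; output lines are word<TAB>count.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          result.set(sum);
          context.write(key, result);
        }
      }
    }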
[Diagram: word count on the input "I am a tiger, you are also a tiger". The input is split across three map tasks, each emitting one (word, 1) pair per word; after sort/shuffle, two reduce tasks sum the counts, producing a,2 also,1 am,1 are,1 I,1 tiger,2 you,1.]
Grep Example
• Search input files for a given pattern
• Map: emits a line if the pattern is matched
• Reduce: copies results to the output
(A mapper sketch follows below.)
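A hedged sketch of the mapper only; the configuration key "grep.pattern" is made up, and Hadoop's shipped Grep example is organized differently. With an identity reduce, matching lines simply pass through to the output.

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      private Pattern pattern;

      @Override
      protected void setup(Context context) {
        // "grep.pattern" is an assumed job-configuration key.
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
      }

      @Override
      public void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Emit the whole line iff it matches the pattern.
        if (pattern.matcher(line.toString()).find()) {
          context.write(line, NullWritable.get());
        }
      }
    }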
Inverted Index Example
• Generate an inverted index of words from a given set of files
• Map: parses a document and emits <word, docId> pairs
• Reduce: takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair
(A sketch follows below.)
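A sketch under the same assumed API, using the input file name as the docId; the slides do not say how docIds are assigned, so that choice is an assumption.

    import java.io.IOException;
    import java.util.TreeSet;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndex {
      // Map: emit (word, docId) for every word in the document.
      public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Use the name of the input file as the document id.
          String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
          for (String w : value.toString().split("\\s+")) {
            if (!w.isEmpty()) context.write(new Text(w), new Text(docId));
          }
        }
      }
      // Reduce: sort and deduplicate the docIds, emit (word, list of docIds).
      public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text word, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
          TreeSet<String> ids = new TreeSet<String>();
          for (Text d : docIds) ids.add(d.toString());
          context.write(word, new Text(ids.toString()));
        }
      }
    }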
Execution on Clusters
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Write intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
<Key, Value> Pairs
[Diagram: Map takes rows of raw data as input and emits (key, value) pairs, e.g. key1→val, key2→val; Reduce takes, for a selected key, all of that key's values as input and emits the aggregated output.]
[Diagram: five input splits (split 0 through split 4) in HDFS feed three map tasks; their output is sort/copied and merged into two reduce tasks, which write part0 and part1 back to HDFS.]