15-440, Hadoop Distributed File System Allison Naaktgeboren

15-440, Hadoop Distributed File System Allison Naaktgeboren • Ur doin' it rong kitteh • Wut u mean? I iz loadin a HA-doop fileh

Announcements • Go Vote! • Interpretive Dances happen only after Lecture • Office Hour Change • Mon: 6:30-9:30 • Tues: 6-7:30 • Exams are graded

Hadoop Core at 30,000 ft

Back to the Map Reduce Model • Recall that • map (in_key, in_value) -> (inter_key, inter_value) list • combine (inter_key, inter_value) -> (inter_key, inter_value) • reduce (inter_key, inter_value list) -> (out_key, out_value) • What resource are we most constrained by? • “Oceans of Data, Skinny pipes” • How many types of data will the file system care about? • How long will we need each kind? • What is the common case for each?
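The three signatures above can be sketched as a tiny in-memory word count. This is a minimal illustration, not Hadoop's API: `map_fn`, `combine_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the combiner here is assumed to receive all values one mapper produced for a key, so that less data has to cross the "skinny pipes".

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map (in_key, in_value) -> (inter_key, inter_value) list."""
    return [(word, 1) for word in in_value.split()]


def combine_fn(inter_key, inter_values):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    return (inter_key, sum(inter_values))


def reduce_fn(inter_key, inter_values):
    """reduce (inter_key, inter_value list) -> (out_key, out_value)."""
    return (inter_key, sum(inter_values))


def run_job(inputs):
    """Run map, combine, and reduce over a dict of in_key -> in_value."""
    shuffled = defaultdict(list)
    for in_key, in_value in inputs.items():
        # Group this mapper's output by key so the combiner can shrink it.
        local = defaultdict(list)
        for key, value in map_fn(in_key, in_value):
            local[key].append(value)
        for key, values in local.items():
            combined_key, combined_value = combine_fn(key, values)
            shuffled[combined_key].append(combined_value)
    # Each key's combined values go to a single reduce call.
    return dict(reduce_fn(key, values) for key, values in shuffled.items())


if __name__ == "__main__":
    print(run_job({"a.txt": "the cat the hat", "b.txt": "the end"}))
```

Without the combiner, the mapper for `a.txt` would ship two separate `("the", 1)` pairs; with it, a single `("the", 2)` pair is shipped instead.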

What would a MR Filesystem need? • General Use case: large files • Mostly append to end, long sequential reads, few deletes • Appends might be concurrent • Scalability • Adding (or losing) machines should be relatively painless • Nodes work on nearby data • Minimize moving data between machines • Bandwidth is our limiting resource • Remember how much data there is • Failure (handling) is Common • Yea, yea we know, we took 213, we know hardware sucks • No, really failure (handling) is common (constant) • Disks, processors, whole nodes, racks, and datacenters

Addressing Those Concerns • Sequential reads and appends need to be fast • Deletes can be painful • “Hot plug” machines • Add or lose machines while system is running jobs • System should auto detect the change • HDFS should distribute data somewhat evenly • So that all workers have a reasonable amount of data to chew on • And coordinating with the Jobtracker (job master) • Data Replication • Should be spread out. Why? • What type of problems could arise?

Moving into the Details • Nodes in HDFS • NameNode (master) (like GFS Master) • DataNodes (slaves) (like GFS chunkservers) • NB: Hadoop and HDFS are closely paired • “careful use of jargon defines the true expert” • “worker node A” and “data node 1” are frequently the same machine • Two types of Masters • Jobtracker (Hadoop Job Master) • NameNode (file system Master) • What I mean by 'master' for the rest of the lecture

Your Data goes in .... • Files are divided into Chunks • 64 MB • The mapping between filename and chunks goes to the Master • Each chunk is replicated and sent off to DataNodes • By default, 3 replicas • The Master determines which DataNodes
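A rough sketch of the arithmetic on this slide: splitting a file into 64 MB chunks and picking 3 DataNodes per chunk. The round-robin placement in `place_replicas` is an assumption made purely for illustration; the slides only say that the Master decides, not how. All function and node names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide
REPLICATION = 3                # default replica count given on the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover file_size bytes."""
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def place_replicas(num_chunks, datanodes, replication=REPLICATION):
    """Assign each chunk to `replication` distinct DataNodes.

    Round-robin is an assumed, simplified policy that spreads chunks
    somewhat evenly; it is not the real Master's placement algorithm.
    """
    if len(datanodes) < replication:
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [
            datanodes[(chunk_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    chunks = split_into_chunks(150 * 1024 * 1024)  # a 150 MB file
    print(len(chunks), "chunks")
    print(place_replicas(len(chunks), ["dn1", "dn2", "dn3", "dn4"]))
```

A 150 MB file yields two full 64 MB chunks plus one 22 MB tail chunk, and each chunk lands on three different nodes.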

What the Clients Do • Where the data starts • On file creation, creates a separate file w/checksum • When data is fetched back from a DataNode, the checksum is computed again • Cache file data • Avoid bothering the Master too often • When a Client has 1 chunk's worth of data • Contacts the Master • Master sends names of DataNodes to send it to • ONLY sends it to the 1st
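The checksum round trip described above might look like the following. This is a sketch under assumptions: CRC32 from Python's `zlib` stands in for whatever checksum the client really uses, and `write_with_checksum` and `verify_on_read` are hypothetical helper names.

```python
import zlib


def write_with_checksum(data: bytes):
    """On file creation: return the data plus a checksum stored separately."""
    return data, zlib.crc32(data)


def verify_on_read(data: bytes, stored_checksum: int) -> bool:
    """On fetch from a DataNode: recompute the checksum and compare."""
    return zlib.crc32(data) == stored_checksum


if __name__ == "__main__":
    payload, checksum = write_with_checksum(b"hello hdfs")
    print(verify_on_read(payload, checksum))        # data came back intact
    print(verify_on_read(b"hellO hdfs", checksum))  # one corrupted byte
```

If verification fails, a client can treat that replica as corrupt and fetch the chunk from a different DataNode.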

What the DataNodes Do • Heartbeat to the Master • Opens, closes, or replicates a chunk if requested by the Master • During replication, sends data to next DataNode in chain
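The replication chain can be sketched as below: the client hands the chunk to the first DataNode only, and each node stores it and forwards it to the next. The `DataNode` class and `client_write` function are hypothetical, and real transfers stream over the network rather than through in-process calls.

```python
class DataNode:
    """Toy DataNode that stores chunks in memory (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def receive(self, chunk_id, data, downstream):
        """Store the chunk, then pass it to the next node in the chain."""
        self.chunks[chunk_id] = data
        if downstream:
            downstream[0].receive(chunk_id, data, downstream[1:])


def client_write(chunk_id, data, pipeline):
    """The client contacts ONLY the first DataNode in the pipeline."""
    pipeline[0].receive(chunk_id, data, pipeline[1:])


if __name__ == "__main__":
    nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    client_write(42, b"chunk bytes", nodes)
    print([node.name for node in nodes if 42 in node.chunks])
```

The client's outbound link carries each chunk once; the DataNodes share the cost of producing the remaining replicas.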

What the NameNode Does • System metadata! • Holds Name->ID mapping • Chunk replica locations • Transaction Logs • EditLog • FSImage • It is responsible for coherency • Uses the logs atomically • Addresses the concurrent writes issue • It is checkpointed • Similar to AFS volume snapshots • Will pull last consistent log upon restart
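One way to picture how the EditLog and FSImage interact: the FSImage is the last checkpoint of the namespace, the EditLog records every change made since then, and a restart reloads the checkpoint and replays the log. The sketch below assumes a tiny two-operation log format; `MetadataStore` and its methods are hypothetical names, not HDFS classes.

```python
import copy


def apply_op(namespace, op):
    """Apply one logged operation, (kind, filename, chunk_ids), in place."""
    kind, name, chunk_ids = op
    if kind == "create":
        namespace[name] = list(chunk_ids)
    elif kind == "delete":
        namespace.pop(name, None)
    else:
        raise ValueError(f"unknown op: {kind}")


class MetadataStore:
    def __init__(self):
        self.fsimage = {}    # last checkpoint: filename -> chunk IDs
        self.editlog = []    # operations recorded since that checkpoint
        self.namespace = {}  # live, in-memory view

    def log_and_apply(self, op):
        # Record the operation before applying it, so a crash cannot lose it.
        self.editlog.append(op)
        apply_op(self.namespace, op)

    def checkpoint(self):
        """Fold the current state into the FSImage and truncate the log."""
        self.fsimage = copy.deepcopy(self.namespace)
        self.editlog = []

    def restart(self):
        """Rebuild the live namespace from the FSImage plus the EditLog."""
        self.namespace = copy.deepcopy(self.fsimage)
        for op in self.editlog:
            apply_op(self.namespace, op)


if __name__ == "__main__":
    store = MetadataStore()
    store.log_and_apply(("create", "/a", [1, 2]))
    store.checkpoint()
    store.log_and_apply(("create", "/b", [3]))
    store.log_and_apply(("delete", "/a", []))
    store.namespace = {}  # simulate losing in-memory state in a crash
    store.restart()
    print(store.namespace)
```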

What the NameNode Does • Listens for Heartbeats • Listens for Client Requests • If no heartbeat • marks a node as dead • Its data is deregistered • It selects DataNodes • Which nodes get which chunks • Signals creating, opening, closing • Deletes • Orders move to /trash • Starts delete timer
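The heartbeat bookkeeping above could be sketched as follows. The timeout value, the class name `HeartbeatMonitor`, and the explicit `now` argument (used so the example is deterministic) are all assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Toy version of the Master's liveness tracking (illustration only)."""

    def __init__(self, timeout):
        self.timeout = timeout     # seconds of silence before a node is dead
        self.last_seen = {}        # node name -> time of last heartbeat
        self.chunk_locations = {}  # chunk ID -> set of node names

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def register(self, chunk_id, node):
        self.chunk_locations.setdefault(chunk_id, set()).add(node)

    def sweep(self, now):
        """Mark silent nodes dead and deregister their chunk replicas."""
        dead = [node for node, seen in self.last_seen.items()
                if now - seen > self.timeout]
        for node in dead:
            del self.last_seen[node]
            for nodes in self.chunk_locations.values():
                nodes.discard(node)
        return dead


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=10)
    monitor.heartbeat("dn1", now=0)
    monitor.heartbeat("dn2", now=0)
    monitor.register(7, "dn1")
    monitor.register(7, "dn2")
    monitor.heartbeat("dn1", now=8)  # dn2 stays silent
    print(monitor.sweep(now=12))
    print(monitor.chunk_locations)
```

After the sweep, chunk 7 has a single registered replica, which is the Master's cue to ask a live DataNode to replicate it.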

All together Now!

Additional Resources • Hadoop wiki • YouTube → “Hadoop” → Google developer videos (1-3 will be helpful) • Google University • Includes UW course, the other UW course, a couple others • Use at your own risk • “The Google File System” paper is rather readable as research papers go
