
15-440, Hadoop Distributed File System Allison Naaktgeboren



Presentation Transcript


  1. 15-440, Hadoop Distributed File System, Allison Naaktgeboren • Ur doin' it rong kitteh • Wut u mean? I iz loadin a HA-doop fileh

  2. Announcements • Go Vote! • Interpretive Dances happen only after Lecture • Office Hour Change • Mon: 6:30-9:30 • Tues: 6-7:30 • Exams are graded

  3. Hadoop Core at 30,000 ft

  4. Back to the MapReduce Model • Recall that • map (in_key, in_value) -> (inter_key, inter_value) list • combine (inter_key, inter_value) -> (inter_key, inter_value) • reduce (inter_key, inter_value list) -> (out_key, out_value) • What resource are we most constrained by? • “Oceans of Data, Skinny pipes” • How many types of data will the file system care about? • How long will we need each kind? • What is the common case for each?
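
The three signatures on slide 4 can be made concrete with a word count. This is a minimal single-process sketch, not Hadoop's API: `map_fn`, `combine_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the shuffle between machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map (in_key, in_value) -> (inter_key, inter_value) list."""
    return [(word, 1) for word in in_value.split()]


def combine_fn(inter_key, inter_values):
    """Pre-aggregate one mapper's output locally, before it crosses the network."""
    return inter_key, sum(inter_values)


def reduce_fn(inter_key, inter_values):
    """reduce (inter_key, inter_value list) -> (out_key, out_value)."""
    return inter_key, sum(inter_values)


def run_job(inputs):
    """Run map, combine, and reduce over {in_key: in_value} in one process."""
    shuffled = defaultdict(list)  # stands in for the network shuffle
    for in_key, in_value in inputs.items():
        local = defaultdict(list)  # one mapper's output, grouped by key
        for inter_key, inter_value in map_fn(in_key, in_value):
            local[inter_key].append(inter_value)
        for inter_key, inter_values in local.items():
            key, combined = combine_fn(inter_key, inter_values)
            shuffled[key].append(combined)
    return dict(reduce_fn(key, values) for key, values in shuffled.items())
```

The combiner is what speaks to the "skinny pipes" constraint: each simulated mapper ships one pair per distinct word instead of one pair per occurrence.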

  5. What would an MR Filesystem need? • General use case: large files • Mostly append to end, long sequential reads, few deletes • Appends might be concurrent • Scalability • Adding (or losing) machines should be relatively painless • Nodes work on nearby data • Minimize moving data between machines • Bandwidth is our limiting resource • Remember how much data there is • Failure (handling) is Common • Yeah, yeah, we know, we took 213, we know hardware sucks • No, really, failure (handling) is common (constant) • Disks, processors, whole nodes, racks, and datacenters

  6. Addressing Those Concerns • Sequential reads and appends need to be fast • Deletes can be painful • “Hot plug” machines • Add or lose machines while the system is running jobs • System should auto-detect the change • HDFS should distribute data somewhat evenly • So that all workers have a reasonable amount of data to chew on • And coordinate with the JobTracker (job master) • Data Replication • Should be spread out. Why? • What type of problems could arise?

  7. Moving into the Details • Nodes in HDFS • NameNode (master) (like GFS Master) • DataNodes (slaves) (like GFS chunkservers) • NB: Hadoop and HDFS are closely paired • “careful use of jargon defines the true expert” • “worker node A” and “data node 1” are frequently the same machine • Two types of Masters • JobTracker (Hadoop Job Master) • NameNode (file system Master) • What I mean by 'master' for the rest of the lecture

  8. Your Data goes in .... • Files are divided into Chunks • 64 MB • The mapping between filename and chunks goes to the Master • Each chunk is replicated and sent off to DataNodes • By default, 3 replicas • The Master determines which DataNodes
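
The arithmetic on slide 8 can be sketched as follows. The 64 MB chunk size and the replica count of 3 come from the slide; `split_into_chunks`, `place_replicas`, and the round-robin placement are assumptions made for illustration only, since the slide says just that the Master picks the DataNodes.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide
REPLICATION = 3                # default replica count given on the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return a list of (offset, length) pairs covering file_size bytes."""
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def place_replicas(num_chunks, datanodes, replication=REPLICATION):
    """Map chunk index -> list of distinct DataNodes.

    Round-robin placement is a stand-in (an assumption), not the policy
    the real Master uses.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        chunk: [datanodes[(chunk + r) % len(datanodes)] for r in range(replication)]
        for chunk in range(num_chunks)
    }
```

For example, a 200 MB file splits into three full 64 MB chunks plus a final 8 MB chunk, and each of the four chunks is assigned to three different DataNodes.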

  9. What the Clients Do • Where the data starts • On file creation, creates a separate file with a checksum • When data is fetched back from a DataNode, the checksum is computed again • Cache file data • Avoid bothering the Master too often • When a Client has 1 chunk's worth of data • Contacts the Master • Master sends names of the DataNodes to send it to • Client ONLY sends it to the 1st DataNode
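
The checksum step on slide 9 might look like the sketch below. It assumes CRC32 (via Python's `zlib.crc32`) as the checksum and keeps both data and checksums in dictionaries; `ChecksummedStore` and `ChecksumError` are hypothetical names, and the slide does not say which checksum algorithm the real client uses.

```python
import zlib


class ChecksumError(Exception):
    """Raised when fetched data no longer matches its stored checksum."""


class ChecksummedStore:
    """Toy client-side view: data in one place, checksums kept separately."""

    def __init__(self):
        self.blocks = {}     # stands in for data held on DataNodes
        self.checksums = {}  # stands in for the separate checksum file

    def write(self, name, data):
        # On creation, record the checksum alongside (not inside) the data.
        self.blocks[name] = data
        self.checksums[name] = zlib.crc32(data)

    def read(self, name):
        # On fetch, recompute the checksum and compare before returning.
        data = self.blocks[name]
        if zlib.crc32(data) != self.checksums[name]:
            raise ChecksumError(f"checksum mismatch for {name}")
        return data
```

Keeping the checksum apart from the data means corruption on the DataNode's disk cannot silently alter both at once.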

  10. What the DataNodes Do • Heartbeat to the Master • Opens, closes, or replicates a chunk when requested by the Master • During replication, sends data to the next DataNode in the chain
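
The replication chain can be sketched with in-process objects: the client hands a chunk to the first DataNode only, and each DataNode stores it and forwards it to the next one. The `DataNode` class and its `receive` method are hypothetical, and direct method calls stand in for network transfers.

```python
class DataNode:
    """Toy DataNode that stores chunks and forwards them down a chain."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk_id -> bytes

    def receive(self, chunk_id, data, chain):
        """Store the chunk, then pass it to the next DataNode in the chain."""
        self.chunks[chunk_id] = data
        if chain:
            next_node, rest = chain[0], chain[1:]
            next_node.receive(chunk_id, data, rest)


def client_write(chunk_id, data, datanodes):
    """The client contacts only the first DataNode; the rest are forwarded to."""
    first, rest = datanodes[0], datanodes[1:]
    first.receive(chunk_id, data, rest)
```

With this shape the client uploads each chunk once, and the remaining copies travel between DataNodes rather than over the client's link.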

  11. What the NameNode Does • System metadata! • Holds Name->ID mapping • Chunk replica locations • Transaction Logs • EditLog • FSImage • It is responsible for coherency • Uses the logs atomically • Addresses the concurrent writes issue • It is checkpointed • Similar to AFS volume snapshots • Will pull last consistent log upon restart
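
One way to picture the EditLog and FSImage pair is a checkpoint plus a list of operations recorded since that checkpoint. The sketch below is an assumption-laden toy: `NameNodeMetadata`, its methods, and the two operation names (`create`, `delete`) are hypothetical, and everything lives in memory rather than on disk.

```python
class NameNodeMetadata:
    """Toy namespace with a checkpoint (fsimage) and an operation log (editlog)."""

    def __init__(self):
        self.fsimage = {}    # last checkpoint: filename -> tuple of chunk ids
        self.editlog = []    # operations recorded since that checkpoint
        self.namespace = {}  # live in-memory state

    @staticmethod
    def _apply(namespace, op, name, chunk_ids):
        if op == "create":
            namespace[name] = tuple(chunk_ids)
        else:  # "delete"
            namespace.pop(name, None)

    def record(self, op, name, chunk_ids=()):
        """Log the operation first, then apply it to the live namespace."""
        if op not in ("create", "delete"):
            raise ValueError(f"unknown operation: {op}")
        self.editlog.append((op, name, tuple(chunk_ids)))
        self._apply(self.namespace, op, name, chunk_ids)

    def checkpoint(self):
        """Fold the current state into the image and start a fresh log."""
        self.fsimage = dict(self.namespace)
        self.editlog = []

    def recover(self):
        """Rebuild the namespace after a restart: image first, then replay the log."""
        namespace = dict(self.fsimage)
        for op, name, chunk_ids in self.editlog:
            self._apply(namespace, op, name, chunk_ids)
        return namespace
```

Because every change is logged before it is applied, replaying the log over the last checkpoint reproduces the state the NameNode had before it stopped.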

  12. What the NameNode Does • Listens for Heartbeats • Listens for Client Requests • If no heartbeat • Marks a node as dead • Its data is deregistered • It selects DataNodes • Which nodes get which chunks • Signals creating, opening, closing • Deletes • Orders move to /trash • Starts delete timer
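
The heartbeat bookkeeping can be sketched as a table of last-seen times that is swept periodically. `HeartbeatMonitor`, the timeout value, and the `under_replicated` helper are assumptions for illustration; the slide gives no timeout and does not describe what happens after deregistration. Times are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Toy Master-side tracker for DataNode liveness and replica locations."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence before a node is dead (assumed)
        self.last_seen = {}     # node name -> time of last heartbeat
        self.replicas = {}      # chunk id -> set of node names holding it

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def register(self, chunk_id, node):
        self.replicas.setdefault(chunk_id, set()).add(node)

    def sweep(self, now):
        """Mark silent nodes dead and deregister their data; return their names."""
        dead = [n for n, seen in self.last_seen.items() if now - seen > self.timeout]
        for node in dead:
            del self.last_seen[node]
            for holders in self.replicas.values():
                holders.discard(node)
        return sorted(dead)

    def under_replicated(self, target=3):
        """Chunks with fewer than `target` live replicas (hypothetical helper)."""
        return sorted(c for c, holders in self.replicas.items() if len(holders) < target)
```

After a sweep, any chunk that lost a replica shows up in `under_replicated`, which is the point at which a Master could ask a surviving DataNode to replicate it.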

  13. All together Now!

  14. Additional Resources • Hadoop wiki • YouTube → “Hadoop” → Google developer videos (1-3 will be helpful) • Google University • Includes UW course, the other UW course, a couple others • Use at your own risk • “The Google File System” paper is rather readable as research papers go
