
Hadoop Overview


Presentation Transcript


  1. Hadoop Overview CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Agenda • Why Hadoop? • HDFS • MapReduce

  3. Why Hadoop?

  4. The V’s of Big Data! • Volume • Velocity • Variety • Veracity

  5. Data Sources • Social Media • Web Logs • Video Networks • Sensors • Transactions (banking, etc.) • E-mail, Text Messaging • Paper Documents

  6. Value in all of this! • Fraud Detection • Predictive Models • Recommendations • Analyzing Threats • Market Analysis • Others!

  7. How do we extract value? • Monolithic Computing! • Build bigger, faster computers! • Limited solution • Expensive and does not scale as data volume increases

  8. Enter Distributed Computing • Processing is distributed across many computers • Distribute the workload to powerful compute nodes with some separate storage

  9. New Class of Challenges • Does not scale gracefully • Hardware failure is common • Sorting, combining, and analyzing data spread across thousands of machines is not easy

  10. RDBMS is still alive! • SQL is still very relevant, with many complex business rules mapping to a relational model • But there is much more that can be done by leaving relational behind and looking at all of your data • NoSQL can expand what you have, but it has its disadvantages

  11. NoSQL Downsides • Increased middle-tier complexity • Constraints on query capability • No standard query semantics • Complex to set up and maintain

  12. An Ideal Cluster • Linear horizontal scalability • Analytics run in isolation • Simple API with multiple language support • Available in spite of hardware failure

  13. Enter Hadoop • Hits these major requirements • Two core pieces • Distributed File System (HDFS) • Flexible analytic framework (MapReduce) • Many ecosystem components to expand on what core Hadoop offers

  14. Scalability • Near-linear horizontal scalability • Clusters are built on commodity hardware • Component failure is an expectation

  15. Data Access • Moving data from storage to a processor is expensive • Store data and process it on the same machines • Process data intelligently by keeping computation local to the data

  16. Disk Performance • Disk technology has made significant advancements • Take advantage of multiple disks in parallel • 1 disk, 3TB of data, 300MB/s, ~2.5 hours to read • 1,000 disks, same data, ~10 seconds to read • Distribution of data and co-location of processing makes this a reality
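
  A quick sanity check of those numbers (a minimal sketch; the 3 TB and 300 MB/s figures come straight from the slide):

  // Back-of-the-envelope check of the read times quoted above
  public class DiskReadTime {
      public static void main(String[] args) {
          double dataMB = 3_000_000;       // ~3 TB expressed in MB
          double perDiskMBs = 300;         // ~300 MB/s sustained per disk
          double oneDiskSeconds = dataMB / perDiskMBs;          // 10,000 s
          double thousandDisksSeconds = oneDiskSeconds / 1000;  // 10 s
          // ~2.8 hours, in the ballpark of the slide's ~2.5 hours
          System.out.printf("1 disk:      %.1f hours%n", oneDiskSeconds / 3600);
          System.out.printf("1,000 disks: %.0f seconds%n", thousandDisksSeconds);
      }
  }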

  17. Complex Processing Code • Hadoop framework abstracts the complex distributed computing environment • No synchronization code • No networking code • No I/O code • MapReduce developer focuses on the analysis • Job runs the same on one node or 4,000 nodes!

  18. Fault Tolerance • Failure is inevitable, and it is planned for • Component failure is automatically detected and seamlessly handled • System continues to operate as expected with minimal degradation

  19. Hadoop History • Spun off of Nutch, an open source web search engine • Google Whitepapers • GFS • MapReduce • Nutch re-architected to birth Hadoop • Hadoop is very mainstream today

  20. Hadoop Versions • Hadoop 2.x recently became stable • Two Java APIs, referred to as the old and new APIs • Two task management systems • MapReduce v1 • JobTracker, TaskTracker • MapReduce v2 – A YARN application • MR runs in containers in the YARN framework

  21. HDFS

  22. Hadoop Distributed File System • Inspired by Google File System (GFS) • High performance file system for storing data • Relatively simple centralized management • Fault tolerance through data replication • Optimized for MapReduce processing • Exposing data locality • Linearly scalable

  23. Hadoop Distributed File System • Use of commodity hardware • Files are write once, read many • Leverages large streaming reads vs. random reads • Favors high throughput over low latency • Modest number of huge files

  24. HDFS Architecture • Split large files into blocks • Distribute and replicate blocks to nodes • Two key services • Master NameNode • Many DataNodes • Backup/Checkpoint NameNode for HA • User interacts with it like a UNIX file system
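
  The block-to-node mapping the NameNode maintains is visible to clients through the HDFS Java API. A minimal sketch, assuming a reachable cluster with the usual client configuration on the classpath (the path below is hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);
          Path file = new Path("/data/example.txt");  // hypothetical file
          FileStatus status = fs.getFileStatus(file);
          // One BlockLocation per block, listing the DataNodes that hold its replicas
          for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
              System.out.println("offset " + loc.getOffset() + " -> "
                  + String.join(",", loc.getHosts()));
          }
          fs.close();
      }
  }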

  25. NameNode • Single master service for HDFS • Was a single point of failure • Stores file-to-block-to-location mappings in a namespace • All transactions are logged to disk • Can recover based on checkpoints of the namespace and transaction logs

  26. NameNode Memory • HDFS prefers fewer, larger files as a result of this metadata overhead • Consider 1GB of data with 64MB block size • Stored as one file • Name: 1 item • Blocks: 16*3 = 48 items • Total items: 49 • Stored as 1024 1MB files • Names: 1024 items • Blocks: 1024*3 = 3072 items • Total items: 4096 • Each item is around 200 bytes
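
  The same accounting as a small calculation (the ~200 bytes per item is the rule of thumb from the slide, not a measured value):

  // NameNode metadata item counts for 1 GB of data, following the slide's accounting
  public class NameNodeItems {
      static long items(long files, long blocksPerFile, int replication) {
          return files + files * blocksPerFile * replication;  // name items + block items
      }
      public static void main(String[] args) {
          long oneBigFile = items(1, 16, 3);     // 1 file x 16 64MB blocks x 3 replicas = 49
          long manySmall  = items(1024, 1, 3);   // 1024 files x 1 1MB block x 3 replicas = 4096
          System.out.println("1 x 1GB file:     " + oneBigFile + " items, ~" + oneBigFile * 200 + " bytes");
          System.out.println("1024 x 1MB files: " + manySmall + " items, ~" + manySmall * 200 + " bytes");
      }
  }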

  27. Checkpoint Node (Secondary NN) • Performs checkpoints of the namespace and logs • Not a hot backup! • HDFS 2.0 introduced NameNode HA • Active and a Standby NameNode service coordinated via ZooKeeper

  28. DataNode • Stores blocks on local disk • Sends frequent heartbeats to NameNode • Sends block reports to NameNode • Clients connect to DataNodes for I/O

  29. How HDFS Works - Writes • Client contacts the NameNode to write data • NameNode says write it to these nodes • Client sequentially writes blocks to the DataNodes

  30. How HDFS Works - Writes DataNodes replicate data blocks, orchestrated by the NameNode

  31. How HDFS Works - Reads • Client contacts the NameNode to read data • NameNode says you can find it here • Client sequentially reads blocks from the DataNodes
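
  A minimal client-side sketch of that write/read flow using the HDFS Java API; the NameNode lookup and the DataNode streaming described on these slides happen inside create() and open() (the path is hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteRead {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          Path path = new Path("/tmp/hello.txt");   // hypothetical path

          // Write: the NameNode picks the target DataNodes; the client streams blocks to them
          try (FSDataOutputStream out = fs.create(path, true)) {
              out.writeBytes("hello, hdfs\n");
          }

          // Read: the NameNode returns block locations; the client reads from the DataNodes
          try (FSDataInputStream in = fs.open(path)) {
              byte[] buf = new byte[64];
              int n = in.read(buf);
              System.out.println(new String(buf, 0, n, "UTF-8"));
          }
          fs.close();
      }
  }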

  32. How HDFS Works - Failure Client connects to another node serving that block

  33. HDFS Blocks • Default block size is 64MB • Configurable • Common sizes are 128 MB and 256 MB • Default replication is three • Also configurable, but rarely changed • Stored as files on the DataNode's local file system • A block on disk cannot be associated with its original file by looking at the DataNode alone
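
  Both settings can also be overridden per file at create time; a hedged sketch using one of the FileSystem.create overloads (the values are only for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CustomBlockSize {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          long blockSize = 128L * 1024 * 1024;   // 128 MB instead of the 64 MB default
          short replication = 3;                 // the usual default, set explicitly here
          int bufferSize = 4096;
          try (FSDataOutputStream out =
                  fs.create(new Path("/data/big.dat"), true, bufferSize, replication, blockSize)) {
              out.writeBytes("payload...\n");    // hypothetical content
          }
          fs.close();
      }
  }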

  34. Replication Strategy • First copy is written to the same node as the client • If the client is not part of the cluster, the first copy goes to a random node • Second copy is written to a node on a different rack • Third copy is written to a different node on the same rack as the second copy

  35. Data Locality • Key in achieving performance • MapReduce tasks run as close to data as possible • Some terms • Local • On-Rack • Off-Rack

  36. Data Corruption • Checksums are used to ensure block integrity • A checksum is calculated on read and compared against the checksum stored when the block was written • Fast to calculate and space-efficient • If the checksums differ, the client reads the block from another DataNode • The corrupted block is deleted and re-replicated from a non-corrupt copy • Each DataNode also periodically runs a background thread to do this checksum check • Covers files that aren't read very often

  37. Fault Tolerance • If no heartbeat is received from DN within a configurable time window, it is considered lost • 10 minute default • NameNode will: • Determine which blocks were on the lost node • Locate other DataNodes with valid copies • Instruct DataNodes with copies to replicate blocks to other DataNodes

  38. Interacting with HDFS • Primary interfaces are the CLI and the Java API • Web UI for read-only access • HFTP provides read-only access over HTTP/S • WebHDFS provides RESTful read/write • FUSE allows HDFS to be mounted as a standard file system • Typically used so legacy apps can use HDFS data • 'hdfs dfs -help' will show CLI usage
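
  As one example of the RESTful access mentioned above, WebHDFS can be queried with plain HTTP. A minimal sketch; the host name is hypothetical and 50070 is assumed to be the NameNode's default Hadoop 2.x web port on this cluster:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class WebHdfsListStatus {
      public static void main(String[] args) throws Exception {
          // List the contents of /tmp via the WebHDFS REST API
          URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("GET");
          try (BufferedReader reader =
                  new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);   // JSON directory listing
              }
          }
          conn.disconnect();
      }
  }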

  39. Hadoop MapReduce

  40. MapReduce! • A programming model for processing data • Contains two phases! • Map • Perform a map function on key/value pairs • Reduce • Perform a reduce function on key/value groups • Groups are created by sorting map output • Operations on key/value pairs open the door for very parallelizable algorithms

  41. Hadoop MapReduce • Automatic parallelization and distribution of tasks • Framework handles scheduling tasks and re-running failed tasks • Developer can code many pieces of the puzzle • Framework handles a Shuffle and Sort phase between map and reduce tasks • Developer need only focus on the task at hand, rather than how to manage where data comes from and where it goes
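
  To show how little of the puzzle the developer wires up by hand, here is a minimal sketch of a job driver using the new (org.apache.hadoop.mapreduce) API. The class names follow the word-count example at the end of this deck: WordMapper is the mapper shown on slide 49, and WordReducer is the reducer sketched alongside it.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCountDriver.class);
          job.setMapperClass(WordMapper.class);      // mapper from slide 49
          job.setReducerClass(WordReducer.class);    // reducer sketched with it
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }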

  42. MRv1 and MRv2 • Both manage compute resources, jobs, and tasks • Job API the same • MRv1 is proven in production • JobTracker / TaskTrackers • MRv2 is a new application on YARN • YARN is a generic platform for developing distributed applications

  43. MRv1 - JobTracker • Monitors job and task progress • Issues task attempts to TaskTrackers • Re-tries failed task attempts • Four failed attempts of same task = one failed job • Default scheduler is FIFO order • CapacityScheduler or FairScheduler is preferred • Single point of failure for MapReduce

  44. MRv1 – TaskTrackers • Runs on the same nodes as DataNodes • Sends heartbeats and task reports to the JobTracker • Configurable number of map and reduce slots • Runs map and reduce task attempts in separate JVMs

  45. How MapReduce Works • Client submits a job to the JobTracker • JobTracker submits tasks to TaskTrackers • JobTracker reports metrics • Job output is written to DataNodes with replication

  46. How MapReduce Works - Failure JobTracker assigns task to different node

  47. (diagram) MapReduce data flow: Map Input → Map Output → Reducer Input Groups → Reducer Output

  48. Example -- Word Count • Count the number of times each word is used in a body of text • Uses TextInputFormat and TextOutputFormat
  map(byte_offset, line)
      foreach word in line
          emit(word, 1)
  reduce(word, counts)
      sum = 0
      foreach count in counts
          sum += count
      emit(word, sum)

  49. Mapper Code
  public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, ONE);   // emit (word, 1) for every token
          }
      }
  }
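
  The matching reducer is not shown in this part of the transcript; here is a minimal sketch that follows the pseudocode on slide 48:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text word, Iterable<IntWritable> counts, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : counts) {   // all counts for this word, grouped by the framework
              sum += count.get();
          }
          result.set(sum);
          context.write(word, result);         // emit (word, total occurrences)
      }
  }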

  50. Shuffle and Sort • Mapper outputs to a single partitioned file • Reducers copy their parts • Reducer merges partitions
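
  Which reducer's "part" a given map output record lands in is decided by the partitioner. This sketch mirrors what the default HashPartitioner does: hash the key and take it modulo the number of reduce tasks, so all values for the same word reach the same reducer.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class WordPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
          // Same idea as the default: non-negative hash of the key, mod the reducer count
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
  }

  A custom partitioner like this would be registered on the job with job.setPartitionerClass(WordPartitioner.class).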
