
CS525: Big Data Analytics


Presentation Transcript


  1. CS525: Big Data Analytics. MapReduce Computing Paradigm & Apache Hadoop Open Source. Fall 2013. Elke A. Rundensteiner

  2. Large-Scale Data Analytics • Many enterprises turn to the Hadoop computing paradigm for big data applications; the trade-offs vs. a traditional database: • Hadoop: scalability (petabytes of data, thousands of machines); flexibility in accepting all data formats (no schema); efficient fault tolerance support; commodity, inexpensive hardware • Database: performance (indexing, tuning, data organization techniques); focus on read + write workloads, concurrency, correctness, convenience, high-level access; advanced features such as full query support, clever optimizers, views and security, data consistency, ….

  3. What is Hadoop? • Hadoop is a simple software framework for distributed processing of large datasets across large clusters of commodity-hardware computers: • Large datasets: terabytes or petabytes of data • Large clusters: hundreds or thousands of nodes • Open-source implementation of Google's MapReduce • Simple programming model: MapReduce • Simple data model: flexible enough for any data

  4. Hadoop Framework • Two main layers: • Distributed file system (HDFS) • Execution engine (MapReduce) • Hadoop is designed as a master-slave, shared-nothing architecture

  5. Key Ideas of Hadoop • Automatic parallelization & distribution • Hidden from the end-user • Fault tolerance and automatic recovery • Failed nodes/tasks recover automatically • Simple programming abstraction • Users provide two functions: “map” and “reduce”

  6. Who Uses Hadoop? • Google: invented the MapReduce computing paradigm • Yahoo: developed Hadoop, the open-source implementation of MapReduce • Integrators: IBM, Microsoft, Oracle, Greenplum • Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn • Many others …

  7. Hadoop Distributed File System (HDFS) • Centralized namenode: maintains metadata info about files • Files are divided into blocks (64 MB); a file F is stored as a sequence of blocks 1, 2, 3, … • Many datanodes (1000s): store the actual data blocks • Each block is replicated N times (default N = 3)
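A minimal sketch (not from the slides) of how a client asks the namenode for exactly this metadata, using the standard org.apache.hadoop.fs.FileSystem API; the path /user/cs525/data.txt is a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMetadataSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client handle; metadata calls go to the namenode
        Path file = new Path("/user/cs525/data.txt");  // hypothetical file

        FileStatus status = fs.getFileStatus(file);    // metadata maintained by the namenode
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // Which datanodes hold a replica of each block of the file?
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("Block at offset " + loc.getOffset()
              + " stored on " + String.join(", ", loc.getHosts()));
        }
        fs.close();
      }
    }

The split mirrors the slide: the namenode answers the metadata queries above, while the block contents themselves are streamed from the datanodes when the file is actually opened.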

  8. HDFS File System Properties • Large space: an HDFS instance may consist of thousands of server machines for storage • Replication: each data block is replicated • Failure: failure is the norm rather than the exception • Fault tolerance: automated detection of faults and automatic recovery

  9. MapReduce Execution Engine (Example: Color Count) • Input: blocks of records on HDFS • Map: produces (k, v) pairs, e.g. (color, 1) for each occurrence of a color • Shuffle & sorting based on k: groups the map output by key • Reduce: consumes (k, [v]) pairs, e.g. (color, [1, 1, 1, 1, 1, 1, …]), and produces (k’, v’) pairs, e.g. (color, 100) • Users only provide the “Map” and “Reduce” functions
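As an illustration (not code from the course), the two user-provided functions for this color-count example could look roughly like this with the org.apache.hadoop.mapreduce API, assuming each input record is a line containing one color name; the two public classes are shown together for brevity and would normally live in separate files or as static nested classes of the driver.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for every input record, emit (color, 1)
    public class ColorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        context.write(new Text(line.toString().trim()), ONE);   // e.g. ("blue", 1)
      }
    }

    // Reduce: after shuffle & sort, receive (color, [1, 1, 1, ...]) and sum the list
    public class ColorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text color, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(color, new IntWritable(sum));              // e.g. ("blue", 100)
      }
    }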

  10. MapReduce Engine • Job Tracker is the master node (runs with the namenode) • Receives the user’s job • Decides how many tasks will run (number of mappers) • Decides where to run each mapper (locality) • Example from the figure: the input file has 5 blocks, so 5 map tasks are run; the task reading block “1” is scheduled on Node 1 or Node 3, the nodes holding replicas of that block
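For context, a hedged sketch of submitting such a job with the org.apache.hadoop.mapreduce.Job API (the slide's Job Tracker / Task Tracker terminology is the Hadoop 1.x runtime; the client-side job setup looks the same). It reuses the hypothetical ColorMapper / ColorReducer classes sketched above; note that the number of map tasks is not set by the user but derived from the input splits, roughly one per HDFS block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ColorCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color count");
        job.setJarByClass(ColorCountDriver.class);

        job.setMapperClass(ColorMapper.class);      // hypothetical classes from the sketch above
        job.setReducerClass(ColorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(3);                   // reducers: chosen explicitly by the user

        // No "number of map tasks" setting: one map task is launched per input split,
        // which by default corresponds roughly to one HDFS block of the input paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }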

  11. MapReduce Engine • Task Tracker is the slave node (runs on each datanode) • Receives the task from the Job Tracker • Runs the task to completion (either a map or a reduce task) • Communicates with the Job Tracker to report its progress • In the figure’s example, one map-reduce job consists of 4 map tasks and 3 reduce tasks

  12. About Key-Value Pairs • Developer provides the Mapper and Reducer functions • Developer decides what is the key and what is the value • Developer must follow the key-value pair interface • Mappers: • Consume <key, value> pairs • Produce <key, value> pairs • Shuffling and sorting: • Groups all values with the same key from all mappers, • sorts the keys, and passes them to a certain reducer • in the form of <key, <list of values>> • Reducers: • Consume <key, <list of values>> • Produce <key, value>
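To make that contract concrete without any Hadoop machinery, here is a tiny in-memory simulation (purely illustrative, not Hadoop code) of the three stages the slide lists: mappers emit <key, value> pairs, the shuffle groups and sorts them by key, and each reduce call receives <key, list of values>.

    import java.util.*;

    public class KeyValueContractSketch {
      public static void main(String[] args) {
        // Pretend two mappers emitted these <key, 1> pairs.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("red", 1), Map.entry("blue", 1),
            Map.entry("red", 1), Map.entry("green", 1));

        // Shuffle & sort: group the values of identical keys; keys end up in sorted order.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
          grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce: consume <key, [values]>, produce <key, value>.
        grouped.forEach((key, values) ->
            System.out.println(key + " -> " + values.stream().mapToInt(Integer::intValue).sum()));
      }
    }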

  13. MapReduce Phases

  14. Another Example: Word Count • Job: count occurrences of each word in a data set • (Figure: the job’s map tasks and reduce tasks)
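A sketch of the map side of word count (close to the canonical Hadoop example, but still only an illustration): the mapper tokenizes each input line and emits (word, 1). The reduce side is the same summation pattern as the ColorReducer sketch above, just keyed by word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);      // emit (word, 1) for every word occurrence
        }
      }
    }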

  15. Summary: Hadoop vs. Typical DB
