
Introduction to Hadoop: Scalable Data Processing with MapReduce

This introduction to Hadoop elaborates on the MapReduce framework, emphasizing its fault-tolerant architecture and scalable nature for processing massive datasets in a distributed environment. Utilizing a master-slave configuration and n-way replication, Hadoop allows efficient parallel execution of data mining tasks despite machine failures. The process involves mapping input data into key-value pairs, shuffling these outputs, and reducing them to a final output, enabling calculations like histograms on large volumes of data. User-defined map and reduce functions streamline operations, empowering data analysis at scale.


Presentation Transcript


  1. 15-826: Multimedia Databases and Data Mining. Extra: Intro to Hadoop. C. Faloutsos

  2. Resources
     • Software: http://hadoop.apache.org/
     • The map/reduce paper [Dean & Ghemawat]: http://labs.google.com/papers/mapreduce.html
     • Tutorial (see part 1, foils #9-20): videolectures.net/kdd2010_papadimitriou_sun_yan_lsdm/

  3. Motivation: Scalability
     • Google: > 450,000 processors, in clusters of ~2,000 processors each [Barroso, Dean, Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro 2003]
     • Yahoo: 5 PB of data [Fayyad, KDD '07]
     • Problem: machine failures, on a daily basis. How to parallelize data mining tasks, then?
     • A: map/reduce, and Hadoop, its open-source clone: http://hadoop.apache.org/

  4. A 2-minute intro to Hadoop
     • Master-slave architecture; n-way replication (default n = 3)
     • It is the 'group by' of SQL, done in a parallel, fault-tolerant way
     • E.g., to find a histogram of word frequencies: compute local histograms, then merge them into the global histogram
     • The SQL analogue (details on the next slides):
           select course-id, count(*)
           from ENROLLMENT
           group by course-id

  5. (The same slide, with the two phases labeled: the local histograms are computed by "map", and merged into the global histogram by "reduce". A concrete sketch of this group-by as a Hadoop job follows below.)
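To make the 'group by' analogy concrete, here is a minimal sketch of that query as a Hadoop job. It is not from the slides: it assumes, hypothetically, that the ENROLLMENT table is stored on HDFS as text lines of the form "student-id,course-id", and the class names are illustrative.

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Hypothetical mapper: emits (course-id, 1) for every enrollment record,
        // assuming each input line looks like "student-id,course-id".
        class EnrollMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",");
                ctx.write(new Text(fields[1]), ONE);        // key = course-id
            }
        }

        // Reducer: computes count(*) per course-id by summing the emitted 1s.
        class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text courseId, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(courseId, new IntWritable(sum));
            }
        }

The slide's "compute local histograms, then merge" step corresponds to Hadoop's combiner: registering CountReducer also as the combiner (job.setCombinerClass(CountReducer.class)) lets each mapper pre-aggregate its own counts before anything crosses the network.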

  6. Details: the execution flow (the fork/assign diagram)
     • The user program forks a master and the worker processes.
     • The master assigns map tasks and reduce tasks to the workers.
     • Each mapper reads its split of the input data (Split0, Split1, Split2, on HDFS) and writes its output to local disk.
     • Each reducer remote-reads and sorts the mappers' outputs, then writes its result (Output File0, Output File1).
     • By default: 3-way replication (see the config sketch below); late/dead machines are ignored, transparently (!)
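The n-way replication above is an ordinary HDFS setting. For instance, the cluster-wide replication factor can be changed in hdfs-site.xml (the value shown is just Hadoop's default):

        <!-- hdfs-site.xml: the "n" of n-way replication -->
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>3</value>  <!-- each HDFS block is stored on 3 machines -->
          </property>
        </configuration>

Replication can also be adjusted per file after the fact, e.g., with the hadoop fs -setrep command.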

  7. More details (thanks to U Kang for the animations)

  8. Background: MapReduce
     • The MapReduce/Hadoop framework sits on top of HDFS: a fault-tolerant, scalable, distributed storage system.
     • Mappers (Map 0, Map 1, Map 2) read data from HDFS and output (k, v) pairs.
     • Shuffle: the mappers' output is sorted by key on its way to the reducers.
     • Reducers (Reduce 0, Reduce 1) read the mappers' output and write new (k, v) pairs back to HDFS.
     • Programmers need to provide only the map() and reduce() functions.

  9. Background: MapReduce. Example: histogram of fruit names.
     • Each mapper (Map 0, Map 1, Map 2) emits one (fruit, 1) pair per fruit name it reads, e.g., (apple, 1), (apple, 1), (strawberry, 1), (orange, 1):

           map( fruit ) {
               output(fruit, 1);
           }

     • After the shuffle, each reducer (Reduce 0, Reduce 1) receives all the counts for a given fruit and sums them, writing (orange, 1), (strawberry, 1), (apple, 2) back to HDFS:

           reduce( fruit, v[1..n] ) {
               sum = 0;
               for (i = 1; i <= n; i++)
                   sum = sum + v[i];
               output(fruit, sum);
           }
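In Hadoop's Java API, the map() pseudocode above translates almost line for line. A sketch, assuming the input is plain text whose words are fruit names (the class name is illustrative):

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // map(fruit) { output(fruit, 1); } -- one (fruit, 1) pair per word read
        class FruitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text fruit = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer tok = new StringTokenizer(line.toString());
                while (tok.hasMoreTokens()) {
                    fruit.set(tok.nextToken());
                    ctx.write(fruit, ONE);          // e.g., (apple, 1)
                }
            }
        }

The reduce side has the same shape as the CountReducer sketched earlier: the shuffle hands it, say, (apple, [1, 1]) and it writes (apple, 2) back to HDFS.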

  10. Conclusions
     • A convenient mechanism to process terabytes and petabytes of data, on hundreds of machines or more.
     • The user provides the map() and reduce() functions.
     • Hadoop handles the rest (e.g., machine failures); a minimal driver illustrating this division of labor is sketched below.
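For completeness, a hypothetical driver, reusing the illustrative classes sketched above, shows how little the user writes beyond map() and reduce(); the input and output HDFS paths come from the command line:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        // Hypothetical driver: wires the sketched mapper/reducer into one job.
        // Hadoop takes care of splitting, scheduling, shuffling, and failures.
        public class FruitHistogram {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "fruit histogram");
                job.setJarByClass(FruitHistogram.class);
                job.setMapperClass(FruitMapper.class);
                job.setCombinerClass(CountReducer.class); // optional local merge
                job.setReducerClass(CountReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

It would be run with something like: hadoop jar fruit.jar FruitHistogram in out.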
