This introduction to Hadoop covers the MapReduce framework, emphasizing its fault-tolerant, scalable design for processing massive datasets in a distributed environment. Using a master-slave configuration and n-way replication, Hadoop executes data mining tasks in parallel despite machine failures. The process maps input data into key-value pairs, shuffles these outputs by key, and reduces them to a final output, enabling computations such as histograms over large volumes of data. The user supplies only the map and reduce functions; the framework handles the rest.
15-826: Multimedia Databases and Data Mining
Extra: intro to Hadoop
C. Faloutsos
Resources
• Software: http://hadoop.apache.org/
• Map/reduce paper [Dean & Ghemawat]: http://labs.google.com/papers/mapreduce.html
• Tutorial: see part 1, foils #9-20: videolectures.net/kdd2010_papadimitriou_sun_yan_lsdm/
Motivation: Scalability
• Google: > 450,000 processors, in clusters of ~2,000 processors each [Barroso, Dean, Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro 2003]
• Yahoo: 5 PB of data [Fayyad, KDD'07]
Problem: machine failures, on a daily basis. How to parallelize data mining tasks, then?
A: map/reduce, and Hadoop, its open-source clone: http://hadoop.apache.org/
2' intro to hadoop
• master-slave architecture; n-way replication (default n=3)
• the 'group by' of SQL, done in a parallel, fault-tolerant way
• e.g., find the histogram of word frequencies: compute local histograms (map), then merge them into the global histogram (reduce)

select course-id, count(*)
from ENROLLMENT
group by course-id

(A sketch of this query in Hadoop's Java API follows below; execution details come next.)
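To make the 'group by' analogy concrete, here is a minimal sketch of that query as a Hadoop map()/reduce() pair, in Hadoop's Java API. Everything here is illustrative and not part of the original slides: the input format (one "student-id,course-id" line per enrollment record) and all class names are assumptions.

    // Illustrative sketch; assumed input: one "student-id,course-id" line per record.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class EnrollmentCount {

      // map: emit one (course-id, 1) pair per enrollment record
      public static class EnrollMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text courseId = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split(",");
          courseId.set(fields[1]);   // course-id column (assumed position)
          ctx.write(courseId, ONE);
        }
      }

      // reduce: sum the 1's, i.e., count(*) per course-id
      public static class EnrollReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text courseId, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          ctx.write(courseId, new IntWritable(sum));
        }
      }
    }

Since addition is associative, the same reducer can also run as a combiner on each mapper's machine; that is exactly the 'compute local histograms, then merge' step above.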
Details
[Figure: MapReduce execution overview. The user program forks a master and worker processes; the master assigns map and reduce tasks. Each mapper reads an input split (Split0, Split1, Split2) of the input data on HDFS and writes its output locally; each reducer remote-reads and sorts the mappers' output, then writes an output file (Output File0, File1).]
By default: 3-way replication; late/dead machines are ignored, transparently (!). (An illustrative configuration entry for the replication factor follows.)
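The replication factor mentioned above is an ordinary HDFS configuration knob (dfs.replication, default 3). A minimal, illustrative hdfs-site.xml entry:

    <configuration>
      <property>
        <!-- number of copies kept of each HDFS block; 3 is the default -->
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>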
More details (thanks to U Kang for the animations):
Background: MapReduce
The MapReduce/Hadoop framework:
• HDFS: fault-tolerant, scalable, distributed storage system
• Mapper: reads data from HDFS, outputs (k,v) pairs
• Shuffle: the mappers' output is sorted by key and routed to the reducers
• Reducer: reads the mappers' output, outputs new (k,v) pairs to HDFS
Programmers need to provide only the map() and reduce() functions.
[Figure: pipeline HDFS -> Map 0, Map 1, Map 2 -> Shuffle -> Reduce 0, Reduce 1 -> HDFS]
Background: MapReduce
Example: histogram of fruit names.

    map( fruit ) {
        output(fruit, 1);
    }

Mapper output (before the shuffle): (apple, 1), (apple, 1), (strawberry, 1), (orange, 1)

    reduce( fruit, v[1..n] ) {
        sum = 0;
        for (i = 1; i <= n; i++)
            sum = sum + v[i];
        output(fruit, sum);
    }

Reducer output: (apple, 2), (orange, 1), (strawberry, 1)

(A sketch of a complete Hadoop job driver for such a histogram follows.)
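For completeness, here is a minimal driver sketch that would wire such a mapper/reducer pair (here, the hypothetical EnrollmentCount classes sketched earlier) into a runnable Hadoop job; the class names and command-line argument handling are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EnrollmentCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "enrollment count");
        job.setJarByClass(EnrollmentCountDriver.class);
        job.setMapperClass(EnrollmentCount.EnrollMapper.class);
        // reusing the reducer as a combiner = "local histograms, then merge"
        job.setCombinerClass(EnrollmentCount.EnrollReducer.class);
        job.setReducerClass(EnrollmentCount.EnrollReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, this would run with something like 'hadoop jar enrollcount.jar EnrollmentCountDriver <input> <output>'; Hadoop then takes care of splitting the input, scheduling the mappers and reducers, and re-running tasks from failed machines.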
Conclusions
• Convenient mechanism to process tera- and peta-bytes of data, on hundreds of machines or more
• User provides only the map() and reduce() functions
• Hadoop handles the rest (e.g., machine failures)