
Hadoop



Presentation Transcript


  1. Hadoop

  2. Textbooks
  • 湯秉翰 (2013), 雲端網頁程式設計-Google App Engine應用實作 (2nd ed.), 博碩文化, ISBN 978-986-201-824-8 (book no. PG31356)
  • 鍾葉青、鍾武君 (2013), 雲端計算, 東華書局, ISBN 978-986-157-903-0 (book no. CL009)
  • 許清榮、林奇暻、買大誠 (2012), 掌握Hadoop翱翔雲端-Windoop應用實作指南, 博碩文化, ISBN 978-986-201-673-2 (book no. PG21241)
  • 鍾葉青、李冠憬、許慶賢、賴冠州 (2011), 雲端程式設計: 入門與應用實務, 東華書局, ISBN 978-986-157-812-5 (book no. CL008)

  3. Outline • Hadoop overview • HDFS • MapReduce programming model • HBase

  4. Hadoop • An Apache project • A distributed computing platform • A software framework • Well suited to processing massive amounts of data

  5. Hadoop [stack diagram, top to bottom: Cloud Applications; MapReduce and HBase; Hadoop Distributed File System (HDFS); A Cluster of Machines]

  6. History (2002-2004) • Founder: Doug Cutting • Lucene • A high-performance, full-text search engine library in pure Java • Inverted index • Nutch • Built on the Lucene library • Web-search software

  7. History (the turning point) • Nutch ran into storage problems • Google published its search-engine papers • SOSP 2003: "The Google File System" • OSDI 2004: "MapReduce: Simplified Data Processing on Large Clusters" • OSDI 2006: "Bigtable: A Distributed Storage System for Structured Data"

  8. History (2004-now) • Doug Cutting drew on the papers Google published • Implemented GFS & MapReduce in Nutch • As of Nutch 0.8, Hadoop became an independent project • Yahoo hired Doug Cutting to build its web search engine • Nutch DFS → Hadoop Distributed File System (HDFS)

  9. Hadoop features • Efficiency • Parallel processing on the data nodes • Robustness • Automatically maintains multiple copies of data and automatically re-deploys computing tasks on failures • Cost efficiency • Distributes the data and processing across clusters of commodity computers • Scalability • Reliably stores and processes massive data

  10. Google vs. Hadoop [comparison table: Google File System (GFS) ↔ HDFS, Google MapReduce ↔ Hadoop MapReduce, Bigtable ↔ HBase]

  11. HDFS • HDFS overview • HDFS operations • Development environment

  12. What is HDFS? • Hadoop Distributed File System • Modeled on the Google File System • A distributed file system suited to analyzing large volumes of data • Built on fault-tolerant commodity hardware • The storage layer of Hadoop [stack diagram: Cloud Applications; MapReduce and HBase; HDFS; A Cluster of Machines]

  13. HDFS architecture [diagram: a NameNode coordinates DataNodes, replicating blocks across Rack #1 and Rack #2]

  14. HDFS client block diagram [diagram: on the client computer, an HDFS-aware application sits above both a POSIX API (regular VFS with local and NFS-supported files) and an HDFS API (a separate HDFS view with specific drivers); requests travel over the network stack to the HDFS NameNode and DataNodes]

  15. HDFS operations • Shell commands • HDFS common APIs

  16. HDFS Shell Commands (1/2)

  17. HDFS Shell Commands (2/2)

  18. Example • In <HADOOP_HOME>/ • bin/hadoop fs -ls • Lists the content of the directory at the given HDFS path • ls • Lists the content of the directory at the given local file system path
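
Building on the example above, here is a brief, hedged session of other common hadoop fs subcommands; the directory and file names are hypothetical:

    bin/hadoop fs -mkdir /user/hadoop/input                # create a directory in HDFS
    bin/hadoop fs -put localfile.txt /user/hadoop/input    # copy a local file into HDFS
    bin/hadoop fs -cat /user/hadoop/input/localfile.txt    # print an HDFS file to stdout
    bin/hadoop fs -get /user/hadoop/input/localfile.txt .  # copy an HDFS file back to the local file system
    bin/hadoop fs -rm /user/hadoop/input/localfile.txt     # delete an HDFS file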

  19. HDFS common APIs • Configuration • FileSystem • Path • FSDataInputStream • FSDataOutputStream

  20. Using HDFS Programmatically (1/2)

  21. import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.Path;

      public class HelloHDFS {

        public static final String theFilename = "hello.txt";
        public static final String message = "Hello HDFS!\n";

        public static void main(String[] args) throws IOException {

          Configuration conf = new Configuration();
          FileSystem hdfs = FileSystem.get(conf);

          Path filenamePath = new Path(theFilename);

  22. (FSDataOutputStream extends the java.io.DataOutputStream class; FSDataInputStream extends the java.io.DataInputStream class.)

          try {
            if (hdfs.exists(filenamePath)) {
              // remove the file first
              hdfs.delete(filenamePath, true);
            }

            FSDataOutputStream out = hdfs.create(filenamePath);
            out.writeUTF(message);
            out.close();

            FSDataInputStream in = hdfs.open(filenamePath);
            String messageIn = in.readUTF();
            System.out.print(messageIn);
            in.close();
          } catch (IOException ioe) {
            System.err.println("IOException during operation: " + ioe.toString());
            System.exit(1);
          }
        }
      }

  23. Configuration • Provides access to configuration parameters. • Configuration conf = new Configuration() • A new configuration. • … = new Configuration(Configuration other) • A new configuration with the same settings cloned from another. • Methods:
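
As a minimal illustrative sketch, typical Configuration usage looks like this; the key names shown and the NameNode address are assumptions, not values from the slides:

    import org.apache.hadoop.conf.Configuration;

    public class ConfDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();            // loads the default resources (core-site.xml, ...)
        conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical NameNode URI (fs.defaultFS in newer releases)
        String fs = conf.get("fs.default.name");             // read a parameter back
        int retries = conf.getInt("my.app.retries", 3);      // hypothetical key with a default value
        System.out.println(fs + ", retries=" + retries);
      }
    }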

  24. FileSystem • An abstract base class for a fairly generic file system. • Ex: Configuration conf = new Configuration(); FileSystem hdfs = FileSystem.get(conf); • Methods:
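
A hedged sketch of common FileSystem calls, assuming a running HDFS; the directory name is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsDemo {
      public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hadoop/demo");      // hypothetical directory
        if (!hdfs.exists(dir)) {
          hdfs.mkdirs(dir);                            // like 'hadoop fs -mkdir'
        }
        for (FileStatus s : hdfs.listStatus(dir)) {    // like 'hadoop fs -ls'
          System.out.println(s.getPath() + " " + s.getLen());
        }
        hdfs.delete(dir, true);                        // recursive delete, like 'hadoop fs -rmr'
      }
    }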

  25. Path • Names a file or directory in a FileSystem. • Ex: Path filenamePath = new Path("hello.txt"); • Methods:
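
A short sketch of Path construction and accessors; the path strings are hypothetical:

    import org.apache.hadoop.fs.Path;

    public class PathDemo {
      public static void main(String[] args) {
        Path p = new Path("/user/hadoop/hello.txt");    // hypothetical absolute path
        System.out.println(p.getName());                // "hello.txt"
        System.out.println(p.getParent());              // "/user/hadoop"
        System.out.println(p.isAbsolute());             // true
        Path q = new Path(p.getParent(), "world.txt");  // build a sibling path from a parent
        System.out.println(q);                          // "/user/hadoop/world.txt"
      }
    }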

  26. FSDataInputStream • Utility that wraps an FSInputStream in a DataInputStream and buffers input through a BufferedInputStream. • Inherits from java.io.DataInputStream • Ex: FSDataInputStream in = hdfs.open(filenamePath);

  27. FSDataInputStream Methods:
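
A hedged reading sketch that adds the positioned-access methods (seek, getPos) to the HelloHDFS flow; it assumes hello.txt was written by the earlier example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadDemo {
      public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        FSDataInputStream in = hdfs.open(new Path("hello.txt"));
        String first = in.readUTF();    // DataInputStream method: reads what writeUTF wrote
        in.seek(0);                     // random access added by FSDataInputStream
        System.out.println("offset after seek: " + in.getPos()); // prints 0
        String again = in.readUTF();    // re-reads the same record
        System.out.print(first + again);
        in.close();
      }
    }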

  28. FSDataOutputStream • Utility that wraps an OutputStream in a DataOutputStream, buffers output through a BufferedOutputStream, and creates a checksum file. • Inherits from java.io.DataOutputStream • Ex: FSDataOutputStream out = hdfs.create(filenamePath);

  29. FSDataOutputStream Methods:
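
A matching write sketch; the file name is hypothetical, and as in HelloHDFS, create() overwrites any existing file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteDemo {
      public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        FSDataOutputStream out = hdfs.create(new Path("demo-out.txt")); // hypothetical file
        out.writeUTF("Hello again, HDFS!\n");                    // DataOutputStream method
        System.out.println("bytes written so far: " + out.getPos()); // current offset in the file
        out.close();                                             // flushes the stream and the checksum file
      }
    }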

  30. Development environment • A Linux environment • On a physical or virtual machine • Ubuntu 10.04 • Hadoop environment • See the Hadoop setup guide • user/group: hadoop/hadoop • Single or multiple node(s); the latter is preferred • Eclipse 3.7M2a with the hadoop-0.20.2 plugin

  31. MapReduce • MapReduce overview • Sample code • Program prototype • Programming using Eclipse • Lab requirement

  32. What is MapReduce? • A programming model for expressing distributed computations at a massive scale • A patented software framework introduced by Google • Processes 20 petabytes of data per day • Popularized by the open-source Hadoop project • Used at Yahoo!, Facebook, Amazon, ... [stack diagram: Cloud Applications; MapReduce and HBase; HDFS; A Cluster of Machines]

  33. MapReduce: High Level

  34. Nodes, Trackers, Tasks • JobTracker • Runs on the master node • Accepts job requests from clients • TaskTracker • Runs on the slave nodes • Forks a separate Java process for each task instance

  35. Example: WordCount [diagram: three mappers tokenize input lines built from the words "Hello", "Cloud", "TA", and "cool" and emit <word, 1> pairs; sort/copy and merge group them as <Hello, [1 1]> and <TA, [1 1]> for one reducer and <Cloud, [1]> and <cool, [1 1]> for another; the reducers output Hello 2, TA 2, Cloud 1, cool 2]

  36. MapReduce • MapReduce overview • Sample code • Program prototype • Programming using Eclipse

  37. public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(wordcount.class);
        job.setMapperClass(mymapper.class);
        job.setCombinerClass(myreducer.class);
        job.setReducerClass(myreducer.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }

  38. Mapper
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = ((Text) value).toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

  39. Mapper (cont.) [diagram: the input file /user/hadoop/input/hi contains the line "Hi Cloud TA say Hi"; ((Text) value).toString() recovers the line, StringTokenizer itr = new StringTokenizer(line) splits it into tokens, and the while loop emits the <word, one> pairs <Hi, 1>, <Cloud, 1>, <TA, 1>, <say, 1>, <Hi, 1>]

  40. Reducer
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

  41. Reducer (cont.) [diagram: the reducer receives the grouped <word, one> pairs <Hi, [1 1]>, <Cloud, [1]>, <TA, [1]>, <say, [1]>, sums each value list, and emits the <key, result> pairs <Hi, 2>, <Cloud, 1>, <TA, 1>, <say, 1>]

  42. MapReduce terminology • Job • A "full program": an execution of a Mapper and a Reducer across a data set • Task • An execution of a Mapper or a Reducer on a slice of data • Task Attempt • A particular instance of an attempt to execute a task on a machine

  43. Main Class
      class MR {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "job name");
          job.setJarByClass(thisMainClass.class);
          job.setMapperClass(Mapper.class);
          job.setReducerClass(Reducer.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.waitForCompletion(true);
        }
      }

  44. Job • Identify the classes implementing the Mapper and Reducer interfaces • Job.setMapperClass(), Job.setReducerClass() • Specify the inputs and outputs • FileInputFormat.addInputPath() • FileOutputFormat.setOutputPath() • Optionally, other options too: • Job.setNumReduceTasks(), • Job.setOutputFormatClass(), ...
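
A hedged sketch pulling these optional calls together around the wordcount classes from slides 37-40 (it assumes mymapper and myreducer are in the same package); the two-reducer count and the output format are arbitrary illustrative choices:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class JobDemo {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "job options demo");
        job.setJarByClass(JobDemo.class);
        job.setMapperClass(mymapper.class);               // Mapper from slide 38
        job.setReducerClass(myreducer.class);             // Reducer from slide 40
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // optional settings listed on this slide:
        job.setNumReduceTasks(2);                         // arbitrary choice of two reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class); // new-API counterpart of setOutputFormat
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }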

  45. Class Mapper • Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Maps input key/value pairs to a set of intermediate key/value pairs. • Ex:

  46. Class Mapper
      class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        // global variables (fields) go here

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // local variables go here
          ....
          context.write(key', value');  // emit an intermediate pair
        }
      }
      The first two type parameters are the input (key, value) classes; the last two are the output (key, value) classes.

  47. Text, IntWritable, LongWritable, ... • Hadoop defines its own "box" classes • Strings: Text • Integers: IntWritable • Longs: LongWritable • Any (WritableComparable, Writable) can be sent to the reducer • All keys are instances of WritableComparable • All values are instances of Writable
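
A small sketch of these box classes in action (the values are arbitrary):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class BoxDemo {
      public static void main(String[] args) {
        Text word = new Text("hadoop");              // boxes a String
        IntWritable count = new IntWritable(1);      // boxes an int
        LongWritable offset = new LongWritable(42L); // boxes a long
        count.set(count.get() + 1);                  // unbox, add, re-box
        // Text is WritableComparable, so it can serve as a key and be compared:
        System.out.println(word.compareTo(new Text("hbase")) < 0); // true: "hadoop" sorts first
        System.out.println(word + " " + count + " " + offset);     // hadoop 2 42
      }
    }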

  48. Read Data

  49. Mappers • Upper-case Mapper • Ex: let map(k, v) = emit(k.toUpper(), v.toUpper()) • ("foo", "bar") → ("FOO", "BAR") • ("Foo", "other") → ("FOO", "OTHER") • ("key2", "data") → ("KEY2", "DATA") • Explode Mapper • let map(k, v) = for each char c in v: emit(k, c) • ("A", "cats") → ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s") • ("B", "hi") → ("B", "h"), ("B", "i") • Filter Mapper • let map(k, v) = if (isPrime(v)) then emit(k, v) • ("foo", 7) → ("foo", 7) • ("test", 10) → (nothing)
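
As an illustration, here is the upper-case mapper from this slide written as a Hadoop Mapper; it assumes an input format that delivers Text/Text pairs (for example KeyValueTextInputFormat):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
      public void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        // let map(k, v) = emit(k.toUpper(), v.toUpper())
        context.write(new Text(key.toString().toUpperCase()),
                      new Text(value.toString().toUpperCase()));
      }
    }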

  50. Class Reducer • Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Reduces a set of intermediate values which share a key to a smaller set of values. • Ex:
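
For contrast with the summing myreducer above, a hedged sketch of a reducer that reduces each key's set of values to their maximum:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable val : values) {  // all values share the same key
          max = Math.max(max, val.get());
        }
        result.set(max);
        context.write(key, result);       // a smaller set of values: one per key
      }
    }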
