
Hadoop



Presentation Transcript


  1. Hadoop

  2. Textbooks
  • 湯秉翰 (2013), 雲端網頁程式設計-Google App Engine應用實作 (2nd ed.), 博碩文化, ISBN 978-986-201-824-8 (book no. PG31356)
  • 鍾葉青、鍾武君 (2013), 雲端計算, 東華書局, ISBN 978-986-157-903-0 (book no. CL009)
  • 許清榮、林奇暻、買大誠 (2012), 掌握Hadoop翱翔雲端-Windoop應用實作指南, 博碩文化, ISBN 978-986-201-673-2 (book no. PG21241)
  • 鍾葉青、李冠憬、許慶賢、賴冠州 (2011), 雲端程式設計: 入門與應用實務, 東華書局, ISBN 978-986-157-812-5 (book no. CL008)

  3. Outline • Hadoop overview • HDFS • MapReduce programming model • HBase

  4. Hadoop • An Apache project • A distributed computing platform • A software framework • Well suited to processing massive amounts of data

  5. Hadoop [stack diagram, top to bottom: Cloud Applications; MapReduce and HBase; Hadoop Distributed File System (HDFS); A Cluster of Machines]

  6. History (2002-2004) • Founder: Doug Cutting • Lucene • A high-performance, full-text search engine library in pure Java • Inverted index • Nutch • Built on the Lucene library • Web-search software

  7. History (the turning point) • Nutch ran into storage problems • Google published its search-engine papers • SOSP 2003: "The Google File System" • OSDI 2004: "MapReduce: Simplified Data Processing on Large Clusters" • OSDI 2006: "Bigtable: A Distributed Storage System for Structured Data"

  8. History (2004-now) • Doug Cutting drew on the papers Google published • Implemented GFS & MapReduce in Nutch • As of Nutch 0.8, Hadoop became an independent project • Yahoo hired Doug Cutting to build its web search engine • Nutch DFS → Hadoop Distributed File System (HDFS)

  9. Hadoop features • Efficiency • Parallel processing on the data nodes • Robustness • Automatically maintains multiple copies of data and automatically re-deploys computing tasks on failures • Cost efficiency • Distributes the data and processing across clusters of commodity computers • Scalability • Reliably stores and processes massive data

  10. Google vs. Hadoop [comparison table: Google File System (GFS) ↔ HDFS, Google MapReduce ↔ Hadoop MapReduce, Bigtable ↔ HBase]

  11. HDFS • HDFS overview • HDFS operations • Development environment

  12. What is HDFS? • Hadoop Distributed File System • Modeled on the Google File System • A distributed file system suited to analyzing large volumes of data • Built on fault-tolerant commodity hardware • The storage layer of Hadoop [stack diagram: Cloud Applications; MapReduce and HBase; HDFS; A Cluster of Machines]

  13. HDFS architecture [diagram: a NameNode coordinates DataNodes, replicating blocks across Rack #1 and Rack #2]

  14. HDFS client block diagram [diagram: on the client computer, an HDFS-aware application sits above both a POSIX API (regular VFS with local and NFS-supported files) and an HDFS API (a separate HDFS view with specific drivers); requests travel over the network stack to the HDFS NameNode and DataNodes]

  15. HDFS operations • Shell commands • HDFS common APIs

  16. HDFS Shell Commands (1/2)

  17. HDFS Shell Commands (2/2)

  18. Example • In <HADOOP_HOME>/ • bin/hadoop fs -ls • Lists the content of the directory at the given HDFS path • ls • Lists the content of the directory at the given local file system path
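
Building on the example above, here is a brief, hedged session of other common hadoop fs subcommands; the directory and file names are hypothetical:

    bin/hadoop fs -mkdir /user/hadoop/input                # create a directory in HDFS
    bin/hadoop fs -put localfile.txt /user/hadoop/input    # copy a local file into HDFS
    bin/hadoop fs -cat /user/hadoop/input/localfile.txt    # print an HDFS file to stdout
    bin/hadoop fs -get /user/hadoop/input/localfile.txt .  # copy an HDFS file back to the local file system
    bin/hadoop fs -rm /user/hadoop/input/localfile.txt     # delete an HDFS file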

  19. HDFS common APIs • Configuration • FileSystem • Path • FSDataInputStream • FSDataOutputStream

  20. Using HDFS Programmatically (1/2)

  21. import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.Path;

      public class HelloHDFS {

        public static final String theFilename = "hello.txt";
        public static final String message = "Hello HDFS!\n";

        public static void main(String[] args) throws IOException {

          Configuration conf = new Configuration();
          FileSystem hdfs = FileSystem.get(conf);

          Path filenamePath = new Path(theFilename);

  22. (FSDataOutputStream extends the java.io.DataOutputStream class; FSDataInputStream extends the java.io.DataInputStream class.)

          try {
            if (hdfs.exists(filenamePath)) {
              // remove the file first
              hdfs.delete(filenamePath, true);
            }

            FSDataOutputStream out = hdfs.create(filenamePath);
            out.writeUTF(message);
            out.close();

            FSDataInputStream in = hdfs.open(filenamePath);
            String messageIn = in.readUTF();
            System.out.print(messageIn);
            in.close();
          } catch (IOException ioe) {
            System.err.println("IOException during operation: " + ioe.toString());
            System.exit(1);
          }
        }
      }

  23. Configuration • Provides access to configuration parameters. • Configuration conf = new Configuration() • A new configuration. • … = new Configuration(Configuration other) • A new configuration with the same settings cloned from another. • Methods:
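
As a minimal illustrative sketch, typical Configuration usage looks like this; the key names shown and the NameNode address are assumptions, not values from the slides:

    import org.apache.hadoop.conf.Configuration;

    public class ConfDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();            // loads the default resources (core-site.xml, ...)
        conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical NameNode URI (fs.defaultFS in newer releases)
        String fs = conf.get("fs.default.name");             // read a parameter back
        int retries = conf.getInt("my.app.retries", 3);      // hypothetical key with a default value
        System.out.println(fs + ", retries=" + retries);
      }
    }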

  24. FileSystem • An abstract base class for a fairly generic file system. • Ex: Configuration conf = new Configuration(); FileSystem hdfs = FileSystem.get(conf); • Methods:
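
A hedged sketch of common FileSystem calls, assuming a running HDFS; the directory name is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsDemo {
      public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hadoop/demo");      // hypothetical directory
        if (!hdfs.exists(dir)) {
          hdfs.mkdirs(dir);                            // like 'hadoop fs -mkdir'
        }
        for (FileStatus s : hdfs.listStatus(dir)) {    // like 'hadoop fs -ls'
          System.out.println(s.getPath() + " " + s.getLen());
        }
        hdfs.delete(dir, true);                        // recursive delete, like 'hadoop fs -rmr'
      }
    }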

  25. Path • Names a file or directory in a FileSystem. • Ex: Path filenamePath = new Path("hello.txt"); • Methods:
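
A short sketch of Path construction and accessors; the path strings are hypothetical:

    import org.apache.hadoop.fs.Path;

    public class PathDemo {
      public static void main(String[] args) {
        Path p = new Path("/user/hadoop/hello.txt");    // hypothetical absolute path
        System.out.println(p.getName());                // "hello.txt"
        System.out.println(p.getParent());              // "/user/hadoop"
        System.out.println(p.isAbsolute());             // true
        Path q = new Path(p.getParent(), "world.txt");  // build a sibling path from a parent
        System.out.println(q);                          // "/user/hadoop/world.txt"
      }
    }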

  26. FSDataInputStream • Utility that wraps an FSInputStream in a DataInputStream and buffers input through a BufferedInputStream. • Inherits from java.io.DataInputStream • Ex: FSDataInputStream in = hdfs.open(filenamePath);

  27. FSDataInputStream Methods:
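
A hedged reading sketch that adds the positioned-access methods (seek, getPos) to the HelloHDFS flow; it assumes hello.txt was written by the earlier example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadDemo {
      public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        FSDataInputStream in = hdfs.open(new Path("hello.txt"));
        String first = in.readUTF();    // DataInputStream method: reads what writeUTF wrote
        in.seek(0);                     // random access added by FSDataInputStream
        System.out.println("offset after seek: " + in.getPos()); // prints 0
        String again = in.readUTF();    // re-reads the same record
        System.out.print(first + again);
        in.close();
      }
    }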

  28. FSDataOutputStream • Utility that wraps an OutputStream in a DataOutputStream, buffers output through a BufferedOutputStream, and creates a checksum file. • Inherits from java.io.DataOutputStream • Ex: FSDataOutputStream out = hdfs.create(filenamePath);

  29. FSDataOutputStream Methods:
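
A matching write sketch; the file name is hypothetical, and as in HelloHDFS, create() overwrites any existing file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteDemo {
      public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        FSDataOutputStream out = hdfs.create(new Path("demo-out.txt")); // hypothetical file
        out.writeUTF("Hello again, HDFS!\n");                    // DataOutputStream method
        System.out.println("bytes written so far: " + out.getPos()); // current offset in the file
        out.close();                                             // flushes the stream and the checksum file
      }
    }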

  30. Development environment • A Linux environment • On a physical or virtual machine • Ubuntu 10.04 • Hadoop environment • See the Hadoop setup guide • user/group: hadoop/hadoop • Single or multiple node(s); the latter is preferred • Eclipse 3.7M2a with the hadoop-0.20.2 plugin

  31. MapReduce • MapReduce overview • Sample code • Program prototype • Programming using Eclipse • Lab requirement

  32. What is MapReduce? • A programming model for expressing distributed computations at a massive scale • A patented software framework introduced by Google • Processes 20 petabytes of data per day • Popularized by the open-source Hadoop project • Used at Yahoo!, Facebook, Amazon, ... [stack diagram: Cloud Applications; MapReduce and HBase; HDFS; A Cluster of Machines]

  33. MapReduce: High Level

  34. Nodes, Trackers, Tasks • JobTracker • Runs on the master node • Accepts job requests from clients • TaskTracker • Runs on the slave nodes • Forks a separate Java process for each task instance

  35. Example: WordCount [diagram: three mappers tokenize input lines built from the words "Hello", "Cloud", "TA", and "cool" and emit <word, 1> pairs; sort/copy and merge group them as <Hello, [1 1]> and <TA, [1 1]> for one reducer and <Cloud, [1]> and <cool, [1 1]> for another; the reducers output Hello 2, TA 2, Cloud 1, cool 2]

  36. MapReduce • MapReduce overview • Sample code • Program prototype • Programming using Eclipse

  37. public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(wordcount.class);
        job.setMapperClass(mymapper.class);
        job.setCombinerClass(myreducer.class);
        job.setReducerClass(myreducer.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }

  38. Mapper
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = ((Text) value).toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

  39. Mapper (cont.) [diagram: the input file /user/hadoop/input/hi contains the line "Hi Cloud TA say Hi"; ((Text) value).toString() recovers the line, StringTokenizer itr = new StringTokenizer(line) splits it into tokens, and the while loop emits the <word, one> pairs <Hi, 1>, <Cloud, 1>, <TA, 1>, <say, 1>, <Hi, 1>]

  40. Reducer
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

  41. Reducer (cont.) [diagram: the reducer receives the grouped <word, one> pairs <Hi, [1 1]>, <Cloud, [1]>, <TA, [1]>, <say, [1]>, sums each value list, and emits the <key, result> pairs <Hi, 2>, <Cloud, 1>, <TA, 1>, <say, 1>]

  42. MapReduce terminology • Job • A "full program": an execution of a Mapper and a Reducer across a data set • Task • An execution of a Mapper or a Reducer on a slice of data • Task Attempt • A particular instance of an attempt to execute a task on a machine

  43. Main Class
      class MR {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "job name");
          job.setJarByClass(thisMainClass.class);
          job.setMapperClass(Mapper.class);
          job.setReducerClass(Reducer.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.waitForCompletion(true);
        }
      }

  44. Job • Identify the classes implementing the Mapper and Reducer interfaces • Job.setMapperClass(), Job.setReducerClass() • Specify the inputs and outputs • FileInputFormat.addInputPath() • FileOutputFormat.setOutputPath() • Optionally, other options too: • Job.setNumReduceTasks(), • Job.setOutputFormatClass(), ...
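
A hedged sketch pulling these optional calls together around the wordcount classes from slides 37-40 (it assumes mymapper and myreducer are in the same package); the two-reducer count and the output format are arbitrary illustrative choices:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class JobDemo {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "job options demo");
        job.setJarByClass(JobDemo.class);
        job.setMapperClass(mymapper.class);               // Mapper from slide 38
        job.setReducerClass(myreducer.class);             // Reducer from slide 40
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // optional settings listed on this slide:
        job.setNumReduceTasks(2);                         // arbitrary choice of two reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class); // new-API counterpart of setOutputFormat
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }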

  45. Class Mapper • Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Maps input key/value pairs to a set of intermediate key/value pairs. • Ex:

  46. Class Mapper
      class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        // global variables (fields) go here

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // local variables go here
          ....
          context.write(key', value');  // emit an intermediate pair
        }
      }
      The first two type parameters are the input (key, value) classes; the last two are the output (key, value) classes.

  47. Text, IntWritable, LongWritable, ... • Hadoop defines its own "box" classes • Strings: Text • Integers: IntWritable • Longs: LongWritable • Any (WritableComparable, Writable) can be sent to the reducer • All keys are instances of WritableComparable • All values are instances of Writable
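
A small sketch of these box classes in action (the values are arbitrary):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class BoxDemo {
      public static void main(String[] args) {
        Text word = new Text("hadoop");              // boxes a String
        IntWritable count = new IntWritable(1);      // boxes an int
        LongWritable offset = new LongWritable(42L); // boxes a long
        count.set(count.get() + 1);                  // unbox, add, re-box
        // Text is WritableComparable, so it can serve as a key and be compared:
        System.out.println(word.compareTo(new Text("hbase")) < 0); // true: "hadoop" sorts first
        System.out.println(word + " " + count + " " + offset);     // hadoop 2 42
      }
    }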

  48. Read Data

  49. Mappers • Upper-case Mapper • Ex: let map(k, v) = emit(k.toUpper(), v.toUpper()) • ("foo", "bar") → ("FOO", "BAR") • ("Foo", "other") → ("FOO", "OTHER") • ("key2", "data") → ("KEY2", "DATA") • Explode Mapper • let map(k, v) = for each char c in v: emit(k, c) • ("A", "cats") → ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s") • ("B", "hi") → ("B", "h"), ("B", "i") • Filter Mapper • let map(k, v) = if (isPrime(v)) then emit(k, v) • ("foo", 7) → ("foo", 7) • ("test", 10) → (nothing)
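
As an illustration, here is the upper-case mapper from this slide written as a Hadoop Mapper; it assumes an input format that delivers Text/Text pairs (for example KeyValueTextInputFormat):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
      public void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        // let map(k, v) = emit(k.toUpper(), v.toUpper())
        context.write(new Text(key.toString().toUpperCase()),
                      new Text(value.toString().toUpperCase()));
      }
    }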

  50. Class Reducer • Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Reduces a set of intermediate values which share a key to a smaller set of values. • Ex:
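
For contrast with the summing myreducer above, a hedged sketch of a reducer that reduces each key's set of values to their maximum:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable val : values) {  // all values share the same key
          max = Math.max(max, val.get());
        }
        result.set(max);
        context.write(key, result);       // a smaller set of values: one per key
      }
    }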
