
Developing Hadoop Applications in Java

Learn how to develop Hadoop applications using Java with this comprehensive Hortonworks University training course for developers. Gain knowledge and skills in various topics such as MapReduce, HBase, Pig, and Hive programming.


Presentation Transcript


  1. Developing Hadoop Applications in Java A Hortonworks University Hadoop Training Course for Developers

  2. Course Introduction

  3. Introductions • Your name • Job responsibilities • Previous Hadoop experience (if any) • What brought you here

  4. Course Outline • Day 1 • Unit 1: Understanding Hadoop and MapReduce • Unit 2: Writing MapReduce Applications • Unit 3: Map Aggregation • Day 2 • Unit 4: Partitioning and Sorting • Unit 5: Input and Output Formats • Day 3 • Unit 6: Optimizing MapReduce Jobs • Unit 7: Advanced MapReduce Features • Unit 8: Unit Testing • Unit 9: Defining Workflow • Day 4 • Unit 10: HBase Programming • Unit 11: Pig Programming • Unit 12: Hive Programming

  5. Who is Hortonworks?

  6. Who is Developing Apache Hadoop? Hortonworks has the largest PMC and committer base from any single organization (per the Apache bylaws; see http://hadoop.apache.org/who.html, as of 5/2012).

  7. Balancing Innovation & Stability • Be aggressive: ship early and often • Be predictable: ship when stable

  8. Unit 1: Understanding Hadoop and MapReduce

  9. What is Hadoop?

  10. Features of Hadoop • Hadoop = HDFS + MapReduce, plus the surrounding ecosystem: Hive, Pig, HBase, HCatalog, Mahout, ZooKeeper, etc.

  11. Hortonworks Data Platform

  12. Hortonworks Data Platform: Fully Supported Integrated Platform • Challenge: integrating, managing, and supporting changes across a wide range of open source Hadoop components is time intensive, complex, and expensive • Solution: Hortonworks Data Platform • Integrated, certified platform distributions • Extensive Q/A process • Industry-leading support with clear service levels for updates and patches • Multi-year support and maintenance policy • Technical guidance support for Universe and Multiverse components (Diagram: platform components Hadoop Core, Pig, ZooKeeper, Hive, HCatalog, and HBase, with new versions highlighted)

  13. The Hadoop Distributed File System • NameNode • The “master” node of HDFS • Determines and maintains how the chunks of data are distributed across the DataNodes • DataNode • Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes

  14. (Diagram: putting Big Data into HDFS. The NameNode breaks the data into chunks and distributes them to DataNode 1, DataNode 2, and DataNode 3; the DataNodes then replicate the chunks.)

  15. The JobTracker and TaskTrackers • JobTracker • the “master” daemon of the TaskTrackers • clients submit MapReduce jobs to the JobTracker • distributes the tasks to available TaskTrackers • TaskTracker • runs on DataNodes • performs the actual MapReduce job

  16. (Diagram: MapReduce job execution.) 1. A client submits a job to the JobTracker. 2. The JobTracker distributes tasks to the TaskTrackers based on availability and where the data resides. 3. Each TaskTracker spawns a JVM to execute its task. 4. The TaskTrackers send task status back to the JobTracker.

  17. Job Schedulers • Fair Scheduler • all jobs get, on average, an equal share of resources over time • Capacity Scheduler • jobs are submitted to queues, and queues are allocated a fraction of the total resource capacity • Use mapred.jobtracker.taskScheduler to configure the scheduler
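A hedged illustration of that configuration property (not shown on the slide): on a classic MRv1 cluster the scheduler is typically set in mapred-site.xml on the JobTracker, for example pointing at the Fair Scheduler class, assuming the fair scheduler jar is on the JobTracker's classpath:

  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>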

  18. Hadoop Modes (Diagram: a fully-distributed cluster.) In a fully-distributed cluster, the NameNode, Secondary NameNode, and JobTracker each run on their own machine, and each of the remaining machines runs a DataNode and a TaskTracker.

  19. Installing HDP

  20. HDFS Filesystem Commands • hadoop fs -ls counties • hadoop fs -lsr counties • hadoop fs -mkdir population_data • hadoop fs -put data/*.txt population_data/ • hadoop fs -cat population_data/population_1.txt

  21. The HDFS API
  Configuration conf = new Configuration();
  Path dir = new Path("results");
  FileSystem fs = FileSystem.get(conf);
  if (!fs.exists(dir)) {
    dir.getFileSystem(conf).mkdirs(dir);
  }
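Building on the snippet above, here is a minimal, hedged sketch of writing a file into HDFS and reading it back with the same FileSystem API (the class name, path, and message below are made-up placeholders):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path file = new Path("results/hello.txt"); // hypothetical path

      // Write a small text file into HDFS (true = overwrite if it exists)
      FSDataOutputStream out = fs.create(file, true);
      out.write("hello HDFS\n".getBytes("UTF-8"));
      out.close();

      // Read the file back and print its first line
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
      System.out.println(in.readLine());
      in.close();
    }
  }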

  22. Lab 1.1: Configuring a Hadoop Development Environment Lab 1.2: Putting Files in HDFS with Java

  23. Unit 2: Writing MapReduce Applications

  24. (Diagram: the Map phase, Shuffle/Sort, and Reduce phase. Mappers run on DataNodes 1, 2, and 3; their output is shuffled across the network and sorted, then processed by Reducers running on the DataNodes.)

  25. (Diagram: the map side of a DataNode, where Mapper output = Reducer input. The InputFormat generates <k1,v1> pairs from the input split; the map method outputs <k2,v2> pairs into the MapOutputBuffer; records are sorted and spilled to disk when the buffer reaches a threshold; the spill files are then merged into a single file.)

  26. (Diagram: the reduce side. 1. The Reducer fetches the map output (Mapper output = Reducer input) from the DataNodes that ran the Mappers. 2. The fetched data goes into an in-memory buffer and is spilled to disk as spill files. 3. The spill files are merged into the Reducer's input, which the Reducer processes and writes to HDFS.)

  27. The Key/Value Pairs of MapReduce (Diagram: the Mapper turns <K1, V1> into <K2, V2>; the Shuffle/Sort phase groups them into <K2, (V2,V2,V2,V2)>; the Reducer outputs <K3, V3>.)

  28. The MapReduce API • Develop Java MapReduce applications using the org.apache.hadoop packages • Prior to Hadoop 0.20: the old API • org.apache.hadoop.mapred package • As of Hadoop 0.20: the new API • org.apache.hadoop.mapreduce package
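To make the distinction concrete, here is a hedged comparison of the imports involved; the WordCount classes on the following slides use the new API:

  // Old API (prior to 0.20): org.apache.hadoop.mapred
  // import org.apache.hadoop.mapred.JobConf;
  // import org.apache.hadoop.mapred.Mapper;   // an interface; map() uses an OutputCollector and a Reporter

  // New API (0.20 and later): org.apache.hadoop.mapreduce
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;   // an abstract class; map() uses a Context
  import org.apache.hadoop.mapreduce.Reducer;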

  29. WordCountMapper
  public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String currentLine = value.toString();
      String[] words = currentLine.split(" ");
      for (String word : words) {
        Text outputKey = new Text(word);
        context.write(outputKey, new IntWritable(1));
      }
    }
  }

  30. WordCountReducer
  public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      IntWritable outputValue = new IntWritable(sum);
      context.write(key, outputValue);
    }
  }

  31. WordCountJob
  Job job = new Job(getConf(), "WordCountJob");
  Configuration conf = job.getConfiguration();
  job.setJarByClass(getClass());
  Path in = new Path(args[0]);
  Path out = new Path(args[1]);
  FileInputFormat.setInputPaths(job, in);
  FileOutputFormat.setOutputPath(job, out);
  job.setMapperClass(WordCountMapper.class);
  job.setReducerClass(WordCountReducer.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(IntWritable.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  return job.waitForCompletion(true) ? 0 : 1;

  32. Running a MapReduce Job • To run a job, perform the following steps: • Put the input files into HDFS. • If the output directory exists, delete it. • Use hadoop to execute the job. • View the output files. • hadoop jar wordcount.jar WordCountJob input/file.txt result
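A hedged end-to-end run of those four steps from the shell, using the jar and path names from the slide (the local file name and the part-file name are assumptions):

  hadoop fs -put file.txt input/file.txt                          # 1. put the input file into HDFS
  hadoop fs -rmr result                                           # 2. delete the output directory if it exists
  hadoop jar wordcount.jar WordCountJob input/file.txt result     # 3. execute the job
  hadoop fs -cat result/part-r-00000                              # 4. view the output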

  33. Lab 2.1: Word Count Lab 2.2: Distributed Grep Lab 2.3: Inverted Index

  34. Unit 3: Map Aggregation

  35. (Diagram: word-count records with and without map-side aggregation.) Without aggregation: the Mapper simply outputs every word as it is seen (<“by”, 1>, <“the”, 1>, <“people”, 1>, <“for”, 1>, <“the”, 1>, <“people”, 1>, <“of”, 1>, <“the”, 1>, <“people”, 1>), without performing any computations, and the Reducer processes a large number of records fetched over HTTP across the network. With aggregation: the Mapper combines records in a manner that does not affect the algorithm (<“by”, 1>, <“the”, 3>, <“people”, 3>, <“for”, 1>, <“of”, 1>), and the expensive network traffic is decreased.

  36. Overview of Combiners (Diagram: the map side with a Combiner.) 1. When the MapOutputBuffer is full, a spill to disk occurs. 2. The Combiner is invoked in an attempt to reduce file I/O. 3. The result is fewer <k2,v2> records in the spill files, and therefore fewer records output by the Mapper (Mapper output = Reducer input).

  37. Details of a Combiner (Diagram:) 1. When the MapOutputBuffer is full, a spill to disk occurs. 2. If a Combiner is used, the output is instead sent to Lists in memory (with a List for each key). 3. After a certain number of <key,value> pairs are written to the lists, the lists are sent to the Combiner. 4. The combined records are spilled to disk.

  38. Reduce-side Combining (Diagram: the Combiner is also used in the reduce phase, when the intermediate <key,value> pairs fetched from the different Mappers are spilled from the in-memory buffer and merged into the Reducer's input, before the Reducer writes its output to HDFS.)

  39. Example of a Combiner
  public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable outputValue = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      outputValue.set(sum);
      context.write(key, outputValue);
    }
  }
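The slides do not show it here, but wiring this Combiner into the WordCountJob from Unit 2 is a one-line addition to the driver (a sketch, assuming the class names above):

  // In the WordCountJob driver, alongside setMapperClass()/setReducerClass():
  job.setCombinerClass(WordCountCombiner.class);
  // Since WordCountReducer performs the same associative, commutative sum,
  // job.setCombinerClass(WordCountReducer.class) would also work.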

  40. In-Map Aggregation • The Mapper combines records as they are being processed • The Mapper stores records in memory • If you have a lot of records and storing in memory is prohibitive, then in-map aggregation may not work for you
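As a minimal sketch of that idea (simpler than the TopResults example that follows, and with a made-up class name): a word-count Mapper that aggregates counts in a HashMap and emits the totals from cleanup():

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class InMapAggregationMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Per-word counts held in memory for the lifetime of the Mapper
    private Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split(" ")) {
        Integer current = counts.get(word);
        counts.put(word, current == null ? 1 : current + 1);
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // Emit one record per distinct word instead of one record per occurrence
      for (Map.Entry<String, Integer> entry : counts.entrySet()) {
        context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
      }
    }
  }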

  41. (Diagram: a TopResultsMapper processing “We the People of the United States, in Order to form a more perfect union...” As the input is processed, word counts are kept in an ArrayList (“We”, 1; “the”, 2; “People”, 1; “of”, 1; “United”, 1; “States”, 1; “in”, 1; “order”, 1; ...). After the entire input is processed, the List is converted to a PriorityQueue (“the”, 726; “of”, 493; “shall”, 293; “and”, 262; “to”, 201; “be”, 178; “or”, 157; “in”, 145; “by”, 100) and the top 10 results are sent to the Reducer.)

  42. TopResultsMapper.map()
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // words is a collection of Word objects kept as a field of the Mapper (see slide 41)
    String[] input = StringUtils.split(value.toString(), '\\', ' ');
    for (String word : input) {
      Word currentWord = new Word(word, 1);
      if (words.contains(currentWord)) {
        // increment the existing Word's frequency
        for (Word w : words) {
          if (w.equals(currentWord)) {
            w.frequency++;
            break;
          }
        }
      } else {
        words.add(currentWord);
      }
    }
  }

  43. TopResultsMapper.cleanup()
  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // queue and maxResults are also fields of the Mapper
    Text outputKey = new Text();
    IntWritable outputValue = new IntWritable();
    queue = new PriorityQueue<Word>(words.size());
    queue.addAll(words);
    for (int i = 1; i <= maxResults; i++) {
      Word tail = queue.poll();
      if (tail != null) {
        outputKey.set(tail.value);
        outputValue.set(tail.frequency);
        context.write(outputKey, outputValue);
      }
    }
  }

  44. Counters

  45. User-defined Counters • Write an enum: public enum MyCounters { GOOD_RECORDS, BAD_RECORDS } • Use getCounter to increment a counter: context.getCounter(MyCounters.GOOD_RECORDS).increment(1);
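A hedged sketch of how the two pieces fit together in practice: incrementing the counters inside a map() method and reading the total back in the driver after the job completes (isValid() is a hypothetical helper):

  // Inside a Mapper's map() method:
  if (isValid(value)) {
    context.getCounter(MyCounters.GOOD_RECORDS).increment(1);
  } else {
    context.getCounter(MyCounters.BAD_RECORDS).increment(1);
  }

  // In the driver, after job.waitForCompletion(true) returns:
  long good = job.getCounters().findCounter(MyCounters.GOOD_RECORDS).getValue();
  System.out.println("Good records: " + good);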

  46. Lab 3.1: Using a Combiner Lab 3.2: Computing an Average

  47. Unit 4: Partitioning and Sorting

  48. (Diagram: the Partitioner determines which records from the Mapper get sent to which Reducer; the Reducers run on the DataNodes.)

  49. (Diagram: partitioning.) 1. The Mapper outputs <key,value> pairs (<key1, value>, <key6, value>, <key2, value>, ...). 2. Each <key,value> pair is passed to the Partitioner's public int getPartition() method. 3. The Partitioner returns an int between 0 and the number of Reducers minus 1, which selects the Reducer (Reducer 0 through Reducer 3) that the pair is sent to.

  50. The Default Partitioner
  public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }
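To contrast with the default, here is a hedged sketch of a custom Partitioner that routes records by the first letter of the key, along with the driver calls that register it (the class name and the routing rule are made up for illustration):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      String word = key.toString();
      if (word.isEmpty()) {
        return 0;
      }
      // Send words starting with a-m to the first Reducer, everything else to the second
      char first = Character.toLowerCase(word.charAt(0));
      return (first <= 'm' ? 0 : 1) % numReduceTasks;
    }
  }

  // In the job driver:
  // job.setPartitionerClass(FirstLetterPartitioner.class);
  // job.setNumReduceTasks(2);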
