MapReduce Programming and Cluster Accessing Instructions
Learn how to implement a MapReduce job for an example query, access the Hadoop cluster nodes, and change the code to answer different queries. Full instructions are on the course website.
Gang Luo, Sept. 2, 2010. MapReduce Programming and Cluster Accessing Instructions
Dataflow: (K1, V1) → map → (K2, V2) → shuffle (group by key) → (K2, List<V2>) → reduce → (K3, V3)
A Query Example on Table1: SELECT Year, MAX(Temperature) FROM Table1 WHERE AirQuality IN (0, 1, 4, 5, 9) GROUP BY Year
Implementation in MapReduce • Map: selection + projection, e.g. (1998, 87, 2, …) → (1998, 87) • Reduce: aggregation (MAX), e.g. (1998, [87, 94, 84, 87, 78]) → (1998, 94)
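The map and reduce steps for this query can be sketched in plain Python. This is only a single-process simulation of the dataflow, not Hadoop API code and not necessarily the code used in the course. The field names (`year`, `temperature`, `air_quality`), the sample rows, and the helper names `map_row`, `reduce_max`, and `run_job` are all assumptions made for illustration.

```python
from collections import defaultdict

# Quality codes accepted by the WHERE clause of the example query.
ACCEPTED_QUALITY = {0, 1, 4, 5, 9}

# Hypothetical sample rows; the field names are assumptions for illustration.
SAMPLE_ROWS = [
    {"year": 1998, "temperature": 87, "air_quality": 0},
    {"year": 1998, "temperature": 94, "air_quality": 1},
    {"year": 1998, "temperature": 84, "air_quality": 4},
    {"year": 1998, "temperature": 99, "air_quality": 3},  # rejected by the filter
    {"year": 1999, "temperature": 80, "air_quality": 5},
    {"year": 1999, "temperature": 91, "air_quality": 9},
]


def map_row(row):
    """Map: selection (quality filter) plus projection to (year, temperature)."""
    if row["air_quality"] in ACCEPTED_QUALITY:
        yield row["year"], row["temperature"]


def reduce_max(year, temperatures):
    """Reduce: aggregate all temperatures of one year with MAX."""
    yield year, max(temperatures)


def run_job(rows):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            grouped[key].append(value)  # shuffle: collect values per key
    output = []
    for key in sorted(grouped):
        output.extend(reduce_max(key, grouped[key]))
    return output


if __name__ == "__main__":
    print(run_job(SAMPLE_ROWS))  # [(1998, 94), (1999, 91)]
```

On a real cluster the framework performs the grouping step between map and reduce; here `run_job` stands in for it so the sketch can run on its own.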
Think more! • What if we want the average temperature for each year? • What if we are only interested in the temperature in Durham? (Assume the station ID for Durham is 212.) A small change to the code is enough to answer a different query.
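One possible way to answer both questions at once, again as a plain-Python simulation rather than Hadoop code: the mapper adds a station filter and the reducer computes an average instead of a maximum. The field names (including `station_id`), the sample rows, and the decision to keep the air-quality filter for the Durham query are assumptions; only the station ID 212 comes from the slide.

```python
from collections import defaultdict

ACCEPTED_QUALITY = {0, 1, 4, 5, 9}
DURHAM_STATION_ID = 212  # station ID given on the slide

# Hypothetical sample rows; field names and values are assumptions.
SAMPLE_ROWS = [
    {"station_id": 212, "year": 1998, "temperature": 80, "air_quality": 0},
    {"station_id": 212, "year": 1998, "temperature": 90, "air_quality": 1},
    {"station_id": 212, "year": 1998, "temperature": 70, "air_quality": 2},  # bad quality
    {"station_id": 300, "year": 1998, "temperature": 100, "air_quality": 0},  # not Durham
    {"station_id": 212, "year": 1999, "temperature": 60, "air_quality": 5},
]


def map_durham(row):
    """Map: keep only Durham rows with accepted quality, emit (year, temperature)."""
    if (row["station_id"] == DURHAM_STATION_ID
            and row["air_quality"] in ACCEPTED_QUALITY):
        yield row["year"], row["temperature"]


def reduce_avg(year, temperatures):
    """Reduce: average of all temperatures seen for one year."""
    temperatures = list(temperatures)
    yield year, sum(temperatures) / len(temperatures)


def run_job(rows, mapper, reducer):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


if __name__ == "__main__":
    print(run_job(SAMPLE_ROWS, map_durham, reduce_avg))  # [(1998, 85.0), (1999, 60.0)]
```

Unlike MAX, an average of partial averages is generally not the overall average, so a reducer like `reduce_avg` should not be reused as-is for partial aggregation; emitting (sum, count) pairs is a common way around that.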
Hadoop Cluster • Master node: • hadoop21.cs.duke.edu • Slave nodes: • hadoop22.cs.duke.edu through hadoop36.cs.duke.edu • Online job tracker*: • hadoop21.cs.duke.edu:50030 • Online HDFS info*: • hadoop21.cs.duke.edu:50070 *These pages cannot be accessed from outside the CS trusted network. Workarounds: 1. ssh to any node and browse the pages with lynx. 2. Open an “ssh -D port” connection to any node and configure your browser to use it as a SOCKS proxy.
Now, let’s see how to compile and run a MapReduce job on the cluster. What I will be showing you is covered by the instructions at the course website: http://www.cs.duke.edu/courses/fall10/cps216/Project/cluster_instruction