This document provides an overview of the significant activities during Apache Bigtop's 9th week, focusing on integration testing of Hadoop components in production environments. It covers topics like Java project setup for Hive, JDBC command execution, and the integration of HBase and Pig. Emphasis is placed on best practices for using Hive and Pig, including configuration, server setup, and execution of batch commands. Knowledge of these topics is crucial for effective use of Hadoop ecosystems.
Apache Bigtop Week 9: Integration Testing, M/R Coding
Administration • Yahoo field trip: how Hadoop components are used in a production environment. Must be registered as working group members/B/C members • MSFT Azure talk; volunteers wanted as tech leads to port Bigtop to Azure • Roman's Yahoo HUG presentation next week • Move to ground floor next week? • Machine Learning Solution Architect, 2/16 • List
Review from last time • Hive/Pig/HBase data layer for integration tests • HBase upgrade to 0.92 • JAVA_LIBRARY_PATH for the JVM to point to the .so native libs for Hadoop • hadoop classpath debug to print out the classpath • HBase 0.92 guesses where Hadoop is by using HADOOP_HOME • /etc/hostname screwed up on EC2
Bigtop Data Integration Layer: Hive, Pig, HBase • Hive: • Create a separate Java project • Install Hive locally, verify you can run the command line: > show tables;
Hive Data Layer • Import all the jars under hive-0.8.1/lib into Eclipse
Hive Notes • Hive has two configurations: embedded and server. • To start the server: • Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting. • source ~/hive-0.8.1/conf/hive-env.sh • Verify: echo $HADOOP_HEAPSIZE
Hive Notes • Increase heap size
Hive: Run JDBC Commands • Like connecting to a MySQL/Oracle/MSFT database • Create a Connection, PreparedStatement, ResultSet • Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver"); • Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", ""); • The driver is in the Hive JDBC jar
Hive JDBC Prepared Statement • The create table statement is different: Statement stmt = con.createStatement(); String tableName = "testHiveDriverTable"; stmt.executeQuery("drop table " + tableName); ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'");
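The slide fragments above can be stitched into one runnable sketch. This is a minimal illustration, not the deck's exact code: the class name HiveJdbcDemo and the --connect flag are ours, and the connection part assumes a Hive 0.8.x HiveServer listening on localhost:10000 with the hive-jdbc driver jar on the classpath. Run without arguments it only prints the DDL it would send.

```java
import java.sql.*;

public class HiveJdbcDemo {
    // DDL builder separated out so it can be inspected without a live server
    static String createTableDdl(String tableName) {
        return "create table " + tableName
             + " (key int, value string)"
             + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(createTableDdl("testHiveDriverTable"));

        // Network portion; only attempted when explicitly requested,
        // since it needs a running HiveServer on localhost:10000.
        if (args.length > 0 && args[0].equals("--connect")) {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            stmt.executeQuery("drop table testHiveDriverTable");
            stmt.executeQuery(createTableDdl("testHiveDriverTable"));
            con.close();
        }
    }
}
```

Note the Hive-specific pieces relative to MySQL/Oracle JDBC: the driver class, the jdbc:hive:// URL, and the ROW FORMAT clause in the DDL.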
Verification – server running and table printout (Eclipse output)
Pig: uses the Pig Util class • Util is not in pig-xxx.jar, only in the test package • Local mode only; distributed not debugged Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse")); PigServer ps = new PigServer(ExecType.LOCAL); ps.setBatchOn();
Pig Example String first = " nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); "; String second = " B = foreach nyse generate symbol, dividends; "; String third = " store B into 'nyse'; ";
Pig Example Util.registerMultiLineQuery(ps, first + second + third); ps.executeBatch(); ps.shutdown();
Pig Example Output
12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/02/11 14:08:05 INFO mapred.LocalJobRunner:
12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
12/02/11 14:08:07 INFO mapred.LocalJobRunner:
12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:
Pig Example Output
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.205.0 0.9.2 dc 2012-02-11 14:08:01 2012-02-11 14:08:09 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 B,nyse MAP_ONLY file:///Users/dc/pig-0.9.2/nyse,
Input(s):
Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"
Output(s):
Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"
Pig Example Output Job DAG: job_local_0001 12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!
M/R Pattern Design Review • Why? A correctly designed M/R cluster program, one that runs faster than an individual machine, exercises all the components in a M/R cluster. • Cluster experience with big data on AWS • Important when migrating to production processes • Design patterns in the sample hadoop-examples.jar • WordCount • Word count aggregation • MultiFileWordCount
M/R Design Pattern Review • Word Count, from Lin/Dyer, Data-Intensive Text Processing with MapReduce
M/R Adding Array to Mapper Output • WordCount design process • Mapper(contents of file, tokenize, output) <Object, Text, Text, IntWritable>. Object = file descriptor, Text = file line, Text = word, IntWritable = 1. Two steps to mapper design: 1) split up the input, then 2) output K,V to the reducer. • First step: copy mapper output K,V to the reducer. Reducer(collect mapper output) <Text, IntWritable, Text, IntWritable>. Second step: final output form. • Replace the IntWritable with an ArrayList • Why? From Lin/Dyer, Data-Intensive Text Processing with MapReduce
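The mapper/reducer flow described above can be sketched in plain Java. This is an illustration only, not Hadoop code: the Text/IntWritable types are replaced with String/Integer, and the framework's shuffle (grouping map output by key) is simulated with a TreeMap so the whole pipeline runs standalone. The sample input lines are ours.

```java
import java.util.*;

public class WordCountSketch {
    // "map" step: tokenize one line and emit (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String tok : line.split("\\s+"))
            if (!tok.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(tok, 1));
        return out;
    }

    // "reduce" step: sum all the 1s collected under one word
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = { "one fish two fish", "red fish blue fish" };

        // "shuffle": group mapper output by key, as the framework would
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());

        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            System.out.println(g.getKey() + "=" + reduce(g.getKey(), g.getValue()));
    }
}
```

The grouped List<Integer> handed to reduce() is exactly the role the slide's ArrayList plays when it replaces the single IntWritable in the mapper's output.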
Word Count Notes • Remove constructor calls (new) from map(); create reusable objects once, outside it
Hadoop Avg Coding Demo • Create an AvgPair object that implements Writable • Create ivars: sum, count, key • Auto-generate accessor methods for the ivars • Implement the write and readFields methods • Put the ctors outside map() • Run using the M/R plugin
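The steps above can be sketched as follows. This is our reconstruction, not the demo's actual code: in the real demo AvgPair implements org.apache.hadoop.io.Writable, but the interface is omitted here so the write/readFields round-trip runs with only java.io (DataOutput/DataInput are plain java.io interfaces). The sample key "IBM" and values are invented for illustration.

```java
import java.io.*;

public class AvgPair {
    private double sum;     // ivars, as on the slide
    private long count;
    private String key;

    public AvgPair() {}     // no-arg ctor, required by Writable deserialization
    public AvgPair(String key, double sum, long count) {
        this.key = key; this.sum = sum; this.count = count;
    }

    // Writable-style serialization: field order must match readFields
    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeDouble(sum);
        out.writeLong(count);
    }

    public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        sum = in.readDouble();
        count = in.readLong();
    }

    public double avg() { return count == 0 ? 0.0 : sum / count; }

    public static void main(String[] args) throws IOException {
        AvgPair p = new AvgPair("IBM", 30.0, 4);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));

        AvgPair q = new AvgPair();   // constructed once, outside map()
        q.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(q.key + " avg=" + q.avg());
    }
}
```

Carrying sum and count (rather than the average itself) is what makes averages combinable across mappers; the division happens once, at the end.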
NXServer/NXClient • Remote Desktop to EC2 • 2 options • 1) use prepared AMI by Eric Hammond • 2) Install NXServer
Prepared AMI • http://aws.amazon.com/amis/Europe/1950 • US East AMI ID: ami-caf615a3 • Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition • Update the repos to newer versions of Ubuntu • Create a new user
Ubuntu Create new user • Script for this AMI only • >user-setup
Verify login from desktop • Created user dc, password dc
Download NX Player and install it • Create a new connection, enter the IP address
Installing NXServer • Read the logs under /usr/NX/var/log/install • If installed correctly, you should see the daemons running
Configure sshd • sudo nano /etc/ssh/sshd_config
Same process as before with nxplayer • Enter in ip, user name/password
Clone the instance store if you can't get the NXServer to work • Problem: the EasyNXServer method uses an instance store. How to clone to an EBS volume? • Create a blank volume; the default attach point is /dev/sdf: mkfs.ext3 /dev/sdf; mkdir /newvolume; sudo mount /dev/sdf /newvolume
rsync copy instance store to EBS • Copy the instance-store volume to EBS: rsync -aHxv / /newvolume • Create further snapshots, then create an AMI by specifying the kernel, etc…