This document provides an overview of the significant activities during Apache Bigtop's 9th week, focusing on integration testing of Hadoop components in production environments. It covers topics like Java project setup for Hive, JDBC command execution, and the integration of HBase and Pig. Emphasis is placed on best practices for using Hive and Pig, including configuration, server setup, and execution of batch commands. Knowledge of these topics is crucial for effective use of Hadoop ecosystems.
Apache Bigtop Week 9: Integration Testing, M/R Coding
Administration • Yahoo field trip: how Hadoop components are used in a production environment. Must be registered as working group members/B/C members • MSFT Azure talk; volunteers wanted as tech leads to port Bigtop to Azure • Roman's Yahoo HUG presentation next week • Move to ground floor next week? • Machine Learning Solution Architect, 2/16 • List
Review from last time • Hive/Pig/HBase data layer for integration tests • HBase upgrade to 0.92 • JAVA_LIBRARY_PATH for the JVM to point to the .so native libs for Hadoop • hadoop classpath debug to print out the classpath • HBase 0.92 guesses where Hadoop is by using HADOOP_HOME • /etc/hostname screwed up on EC2
Bigtop Data Integration Layer: Hive, Pig, HBase • Hive: • Create a separate Java project • Install Hive locally, verify you can run the command line: > show tables;
Hive Data Layer • Import all the jars under hive-0.8.1/lib into Eclipse
Hive Notes • Hive has two configurations: embedded and server. • To start the server: • Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting. • source ~/hive-0.8.1/conf/hive-env.sh • Verify: echo $HADOOP_HEAPSIZE
Hive Notes • Increase heap size
Hive: Run JDBC Commands • Like connecting to a MySQL/Oracle/MSFT database • Create a Connection, PreparedStatement, ResultSet • Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver"); • Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", ""); • The driver is in the Hive JDBC jar
Hive JDBC Prepared Statement • The create table statement is different: Statement stmt = con.createStatement(); String tableName = "testHiveDriverTable"; stmt.executeQuery("drop table " + tableName); ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'");
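The slide fragments above can be stitched into one runnable sketch. This is a minimal illustration, not the deck's exact code: the class name HiveJdbcDemo and the --connect flag are ours, and the connection part assumes a Hive 0.8.x HiveServer listening on localhost:10000 with the hive-jdbc driver jar on the classpath. Run without arguments it only prints the DDL it would send.

```java
import java.sql.*;

public class HiveJdbcDemo {
    // DDL builder separated out so it can be inspected without a live server
    static String createTableDdl(String tableName) {
        return "create table " + tableName
             + " (key int, value string)"
             + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(createTableDdl("testHiveDriverTable"));

        // Network portion; only attempted when explicitly requested,
        // since it needs a running HiveServer on localhost:10000.
        if (args.length > 0 && args[0].equals("--connect")) {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            stmt.executeQuery("drop table testHiveDriverTable");
            stmt.executeQuery(createTableDdl("testHiveDriverTable"));
            con.close();
        }
    }
}
```

Note the Hive-specific pieces relative to MySQL/Oracle JDBC: the driver class, the jdbc:hive:// URL, and the ROW FORMAT clause in the DDL.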
Verification – server running and table printout (Eclipse output)
Pig: uses the Pig Util class • Util is not in pig-xxx.jar, only in the test package • Local mode only; distributed not debugged Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse")); PigServer ps = new PigServer(ExecType.LOCAL); ps.setBatchOn();
Pig Example String first = " nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); "; String second = " B = foreach nyse generate symbol, dividends; "; String third = " store B into 'nyse'; ";
Pig Example Util.registerMultiLineQuery(ps, first + second + third); ps.executeBatch(); ps.shutdown();
Pig Example Output
12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/02/11 14:08:05 INFO mapred.LocalJobRunner:
12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
12/02/11 14:08:07 INFO mapred.LocalJobRunner:
12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:
Pig Example Output
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.205.0 0.9.2 dc 2012-02-11 14:08:01 2012-02-11 14:08:09 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 B,nyse MAP_ONLY file:///Users/dc/pig-0.9.2/nyse,
Input(s):
Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"
Output(s):
Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"
Pig Example Output Job DAG: job_local_0001 12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!
M/R Pattern Design Review • Why? A correctly designed M/R cluster program, one that runs faster than an individual machine, exercises all the components in a M/R cluster. • Cluster experience with big data on AWS • Important when migrating to production processes • Design patterns in the sample hadoop-examples.jar • WordCount • Word count aggregation • MultiFileWordCount
M/R Design Pattern Review • Word Count, from Lin/Dyer, Data-Intensive Text Processing with MapReduce
M/R Adding Array to Mapper Output • WordCount design process • Mapper(contents of file, tokenize, output) <Object, Text, Text, IntWritable>. Object = file descriptor, Text = file line, Text = word, IntWritable = 1. Two steps to mapper design: 1) split up the input, then 2) output K,V to the reducer. • First step: copy mapper output K,V to the reducer. Reducer(collect mapper output) <Text, IntWritable, Text, IntWritable>. Second step: final output form. • Replace the IntWritable with an ArrayList • Why? From Lin/Dyer, Data-Intensive Text Processing with MapReduce
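The mapper/reducer flow described above can be sketched in plain Java. This is an illustration only, not Hadoop code: the Text/IntWritable types are replaced with String/Integer, and the framework's shuffle (grouping map output by key) is simulated with a TreeMap so the whole pipeline runs standalone. The sample input lines are ours.

```java
import java.util.*;

public class WordCountSketch {
    // "map" step: tokenize one line and emit (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String tok : line.split("\\s+"))
            if (!tok.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(tok, 1));
        return out;
    }

    // "reduce" step: sum all the 1s collected under one word
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = { "one fish two fish", "red fish blue fish" };

        // "shuffle": group mapper output by key, as the framework would
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());

        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            System.out.println(g.getKey() + "=" + reduce(g.getKey(), g.getValue()));
    }
}
```

The grouped List<Integer> handed to reduce() is exactly the role the slide's ArrayList plays when it replaces the single IntWritable in the mapper's output.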
Word Count Notes • Remove constructor calls (new) from map(); create reusable objects once, outside it
Hadoop Avg Coding Demo • Create an AvgPair object that implements Writable • Create ivars: sum, count, key • Auto-generate accessor methods for the ivars • Implement the write and readFields methods • Put the ctors outside map() • Run using the M/R plugin
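The steps above can be sketched as follows. This is our reconstruction, not the demo's actual code: in the real demo AvgPair implements org.apache.hadoop.io.Writable, but the interface is omitted here so the write/readFields round-trip runs with only java.io (DataOutput/DataInput are plain java.io interfaces). The sample key "IBM" and values are invented for illustration.

```java
import java.io.*;

public class AvgPair {
    private double sum;     // ivars, as on the slide
    private long count;
    private String key;

    public AvgPair() {}     // no-arg ctor, required by Writable deserialization
    public AvgPair(String key, double sum, long count) {
        this.key = key; this.sum = sum; this.count = count;
    }

    // Writable-style serialization: field order must match readFields
    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeDouble(sum);
        out.writeLong(count);
    }

    public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        sum = in.readDouble();
        count = in.readLong();
    }

    public double avg() { return count == 0 ? 0.0 : sum / count; }

    public static void main(String[] args) throws IOException {
        AvgPair p = new AvgPair("IBM", 30.0, 4);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));

        AvgPair q = new AvgPair();   // constructed once, outside map()
        q.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(q.key + " avg=" + q.avg());
    }
}
```

Carrying sum and count (rather than the average itself) is what makes averages combinable across mappers; the division happens once, at the end.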
NXServer/NXClient • Remote Desktop to EC2 • 2 options • 1) use prepared AMI by Eric Hammond • 2) Install NXServer
Prepared AMI • http://aws.amazon.com/amis/Europe/1950 • US East AMI ID: ami-caf615a3 • Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition • Update the repos to newer versions of Ubuntu • Create a new user
Ubuntu Create new user • Script for this AMI only • >user-setup
Verify login from desktop • Created user dc, password dc
Download NX Player and install it • Create a new connection, enter the IP address
Installing NXServer • Read the logs under /usr/NX/var/log/install • If installed correctly, you should see the daemons running
Configure sshd • sudo nano /etc/ssh/sshd_config
Same process as before with nxplayer • Enter in ip, user name/password
Clone the instance store if you can't get the NXServer to work • Problem: the EasyNXServer method uses an instance store. How to clone to an EBS volume? • Create a blank volume; the default attach point is /dev/sdf: mkfs.ext3 /dev/sdf; mkdir /newvolume; sudo mount /dev/sdf /newvolume
rsync copy instance store to EBS • Copy the instance-store volume to EBS: rsync -aHxv / /newvolume • Create further snapshots, then create an AMI by specifying the kernel, etc…