
Apache Bigtop



  1. Apache Bigtop, Week 9: Integration Testing, M/R Coding

  2. Administration • Yahoo field trip: how Hadoop components are used in a production environment. Attendees must be registered as working group members/B/C members. • MSFT Azure talk; volunteers needed as tech leads to port Bigtop to Azure. • Roman’s Yahoo HUG presentation next week • Move to the ground floor next week? • Machine Learning Solution Architect, 2/16 • List

  3. Review from last time • Hive/Pig/HBase data layer for integration tests • HBase upgrade to 0.92 • JAVA_LIBRARY_PATH points the JVM at the .so native libs for Hadoop • hadoop classpath debug to print out the classpath • HBase 0.92 guesses where Hadoop is by using HADOOP_HOME • /etc/hostname misconfigured on EC2

  4. Bigtop Data Integration Layer: Hive, Pig, HBase • Hive: • Create a separate Java project • Install Hive locally and verify you can run the command line: hive> show tables;

  5. Hive Data Layer • Import all the jars under hive-0.8.1/lib into the Eclipse project

  6. Hive Notes • Hive has two configurations, embedded and server. • To start the server: • Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting. • source ~/hive-0.8.1/conf/hive-env.sh • Verify: echo $HADOOP_HEAPSIZE

  7. Start Hive Server from Command Line

  8. Hive Command Line Server

  9. Hive Notes: Increase Heap Size

  10. Hive: Run JDBC Commands • Like connecting to a MySQL/Oracle/MSFT database • Create a Connection, PreparedStatement, and ResultSet • Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver"); • Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", ""); • The driver is in the jar

  11. Hive JDBC Prepared Statement • The create-table statement is different: Statement stmt = con.createStatement(); String tableName = "testHiveDriverTable"; stmt.executeQuery("drop table " + tableName); ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'");

  12. Verification – server running and table printout Eclipse output

  13. Hive Eclipse/Java Code
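
  Putting slides 10–11 together, here is a minimal, self-contained sketch of the JDBC flow, assuming the standalone Hive 0.8.1 server is listening on localhost:10000; the class name HiveJdbcExample and the final "show tables" verification query are illustrative additions, not from the original code.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (shipped in the jar, per slide 10).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Connect to the standalone Hive server.
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        // Drop any leftover table, then recreate it.
        stmt.executeQuery("drop table " + tableName);
        stmt.executeQuery("create table " + tableName
            + " (key int, value string) ROW FORMAT DELIMITED"
            + " FIELDS TERMINATED BY '\t'");
        // Verify: list the table, mirroring "show tables" from slide 4.
        ResultSet res = stmt.executeQuery("show tables '" + tableName + "'");
        while (res.next()) {
          System.out.println(res.getString(1));
        }
        con.close();
      }
    }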

  14. Pig, uses the Pig Util Class • Util is not in Pig-xxx.jar, only in the test package • Local mode only; distributed mode not debugged Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse")); PigServer ps = new PigServer(ExecType.LOCAL); ps.setBatchOn();

  15. Pig Example String first = "nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);"; String second = "B = foreach nyse generate symbol, dividends;"; String third = "store B into 'nyse';";

  16. Pig Example Util.registerMultiLineQuery(ps, first + second + third); ps.executeBatch(); ps.shutdown();
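
  Assembled into a runnable class: since the Util helpers live only in Pig's test sources, this sketch registers the statements via PigServer.registerQuery instead of Util.registerMultiLineQuery; the class name and the deleteDirectory helper are illustrative stand-ins.

    import java.io.File;
    import java.io.IOException;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLocalExample {
      public static void main(String[] args) throws IOException {
        // Clear the output directory from any previous run.
        deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse"));
        PigServer ps = new PigServer(ExecType.LOCAL);
        ps.setBatchOn();
        ps.registerQuery("nyse = load '/Users/dc/programmingpig/data/NYSE_dividends'"
            + " as (exchange:chararray, symbol:chararray,"
            + " date:chararray, dividends:float);");
        ps.registerQuery("B = foreach nyse generate symbol, dividends;");
        ps.registerQuery("store B into 'nyse';");
        ps.executeBatch();   // batch mode defers execution until here
        ps.shutdown();
      }

      // Minimal recursive delete, standing in for Util.deleteDirectory.
      private static void deleteDirectory(File dir) {
        File[] children = dir.listFiles();
        if (children != null) {
          for (File child : children) deleteDirectory(child);
        }
        dir.delete();
      }
    }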

  17. Pig Example Output
  12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
  12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
  12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
  12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
  12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
  12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
  12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
  12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
  12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
  12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
  12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
  12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
  12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
  12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
  12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
  12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
  12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
  12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
  12/02/11 14:08:05 INFO mapred.LocalJobRunner:
  12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
  12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
  12/02/11 14:08:07 INFO mapred.LocalJobRunner:
  12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
  12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
  12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
  12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
  12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:

  18. Pig Example Output
  HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
  0.20.205.0     0.9.2       dc      2012-02-11 14:08:01  2012-02-11 14:08:09  UNKNOWN

  Success!

  Job Stats (time in seconds):
  JobId           Alias   Feature   Outputs
  job_local_0001  B,nyse  MAP_ONLY  file:///Users/dc/pig-0.9.2/nyse,

  Input(s):
  Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"

  Output(s):
  Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"

  19. Pig Example Output
  Job DAG: job_local_0001
  12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!

  20. M/R Pattern Design Review • Why? A correctly designed M/R program, one that runs faster on the cluster than on an individual machine, exercises all the components of an M/R cluster. • Cluster experience with big data on AWS • Important when migrating to production processes • Design patterns in the sample HadoopExamples.jar: • WordCount • Word Count Aggregation • MultiFileWordCount

  21. M/R Design Pattern Review • Word Count, from Lin/Dyer, Data-Intensive Text Processing with MapReduce

  22. M/R: Adding an Array to Mapper Output • WordCount design process • Mapper(contents of file, tokenize, output): <Object, Text, Text, IntWritable>. Object = file descriptor, Text = file line, Text = word, IntWritable = 1. Two steps to mapper design: 1) split up the input, then 2) output K,V to the reducer • First step: copy the mapper output K,V to the reducer. Reducer(collect mapper output): <Text, IntWritable, Text, IntWritable>. Second step: final output form. • Replace the IntWritable with an ArrayList • Why? From Lin/Dyer, Data-Intensive Text Processing with MapReduce

  23. Word Count Notes • Remove ctor calls from map(): allocate the output objects once as instance fields instead of constructing them on every map() call (see the sketch below)
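
  For reference, a minimal sketch of the mapper and reducer with the signatures from slide 22; this mirrors the stock WordCount in the Hadoop examples jar, with the output objects allocated once as instance fields rather than constructed inside map():

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        // Created once here, not inside map() -- per the note above.
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // Step 1: split the input line into tokens.
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // Step 2: emit (word, 1) to the reducer.
            context.write(word, one);
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          // Collect the mapper output: sum the 1s emitted for each word.
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }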

  24. HadoopAvg Coding Demo • Create an AvgPair object that implements Writable • Create ivars: sum, count, key • Auto-generate methods for the ivars • Implement the write and readFields methods • Put the ctors outside map() • Run using the M/R plugin (a sketch of AvgPair follows)
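
  A minimal sketch of the AvgPair Writable described above, with the sum, count, and key ivars from the slide; the field types and accessor names are assumptions, since the actual demo code was shown live:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class AvgPair implements Writable {
      // ivars from the slide: sum, count, key
      private double sum;
      private long count;
      private String key = "";

      // No-arg ctor required by Hadoop for deserialization; construct
      // instances outside map(), per the note above.
      public AvgPair() {}

      public AvgPair(String key, double sum, long count) {
        this.key = key;
        this.sum = sum;
        this.count = count;
      }

      // Auto-generated accessors for the ivars
      public String getKey() { return key; }
      public double getSum() { return sum; }
      public long getCount() { return count; }

      // The average falls out of the running sum and count.
      public double avg() { return count == 0 ? 0.0 : sum / count; }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeDouble(sum);
        out.writeLong(count);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        sum = in.readDouble();
        count = in.readLong();
      }
    }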

  25. NXServer/NXClient • Remote desktop to EC2 • Two options: • 1) Use the prepared AMI by Eric Hammond • 2) Install NXServer yourself

  26. Prepared AMI • http://aws.amazon.com/amis/Europe/1950 • US East AMI ID: ami-caf615a3 • Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition • Update the Repos to newer versions of Ubuntu • Create new user

  27. Ubuntu Create new user • Script for this AMI only • >user-setup

  28. Verify login from desktop • Created user dc, password dc

  29. Download NXPlayer, install • Create a new connection, enter the IP address

  30. Login with username/password

  31. Ubuntu Desktop

  32. Installing NXServer • Read the logs in /usr/NX/var/log/install • If installed correctly, you should see the NX daemons running

  33. Create user

  34. Configure sshd • sudo nano /etc/ssh/sshd_config

  35. Verify ssh login

  36. Same process as before with NXPlayer • Enter the IP and username/password

  37. Clone the instance store if you can't get NXServer to work • Problem: the EasyNXServer method uses an instance store. How do you clone it to an EBS volume? • Create a blank volume; the default attach point is /dev/sdf:
  mkfs.ext3 /dev/sdf
  mkdir /newvolume
  sudo mount /dev/sdf /newvolume

  38. rsync copy instance store to EBS • Copy the instance store volume to EBS: • rsync -aHxv / /newvolume • Create further snapshots, then create an AMI by specifying the kernel, etc…
