
Big Data Practical Computing: Apache Pig (Hadoop course)

Will Y Lin, Trend Micro / TCloud, 2012/10.


Presentation Transcript


  1. Big Data Practical Computing: Apache Pig (Hadoop course) • Will Y Lin • Trend Micro / TCloud • 2012/10

  2. Agenda • Hadoop • Pig • Introduction • Basic • Exercise

  3. Big Data & Hadoop

  4. Big Data A set of files A database A single file

  5. Data Driven World • Modern systems have to deal with far more data than in the past • Yahoo: over 170 PB of data • Facebook: over 30 PB of data • We need a system to handle large-scale computation!

  6. Distributed File System • Characteristics • Distributed systems are groups of networked computers • Allowing developers to use multiple machines for a single job • At compute time, data is copied to the compute nodes • Programming for traditional distributed systems is complex. • Data exchange requires synchronization • Finite bandwidth is available • Temporal dependencies are complicated • It is difficult to deal with partial failures of the system

  7. We need a new approach!

  8. Hadoop - Concept • Distribute the data as it is initially stored in the system • Individual nodes can work on data local to those nodes • Users can focus on developing applications.

  9. Hadoop – Core • HDFS (Hadoop Distributed File System) • Distributed • Scalable • Fault tolerant • High performance • MapReduce • Software framework for parallel, distributed computation • Mainly written in Java

  10. Hadoop - Ecosystem • Hadoop has become the kernel of the distributed operating system for Big Data • A collection of projects at Apache • No one uses the kernel alone

  11. Hadoop Family • Open source software + commodity hardware • Components shown in the diagram: Pig, Oozie, Sqoop/Flume, Hive, Hue, Mahout, Ganglia/Nagios, ZooKeeper, MapReduce, HBase, Hadoop Distributed File System (HDFS)

  12. Hadoop Family • Open source software + commodity hardware • Components shown in the diagram: Pig, Oozie, Sqoop/Flume, Hive, Hue, Mahout, Ganglia/Nagios, ZooKeeper, MapReduce, HBase, Hadoop Distributed File System (HDFS)

  13. Pig Introduction

  14. Apache Pig • Initiated by • Consists of a high-level language for expressing data analysis, which is called “Pig Latin” • Combined with the power of Hadoop and the MapReduce framework • A platform for analyzing large data sets

  15. Example - Data Analysis Task • Q: Find users who tend to visit "good" pages. • Input tables shown on the slide: Visits, Pages

  16. Conceptual Dataflow • Load Visits(user, url, time) • Canonicalize urls • Load Pages(url, pagerank) • Join url = url • Group by user • Compute average pagerank • Filter avgPR > 0.5 • Result

  17. By MapReduce • Low-level programming, which requires more effort to understand and maintain. • You write the join code yourself. • You have to manage multiple MapReduce jobs.

  18. By Pig
      Visits = load '/data/visits' as (user, url, time);
      Visits = foreach Visits generate user, Canonicalize(url) as url, time;
      Pages = load '/data/pages' as (url, pagerank);
      VP = join Visits by url, Pages by url;
      UserVisits = group VP by user;
      UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
      GoodUsers = filter UserPageranks by avgpr > 0.5;
      store GoodUsers into '/data/good_users';
      • Reference (Twitter): Typically a Pig script is 5% of the code of native map/reduce written in about 5% of the time.
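
To make the dataflow concrete, the following is a minimal plain-Python model of the same steps. It is not Pig and not the author's code. The sample rows and the lowercase-only `canonicalize` helper are assumptions for illustration; the real `Canonicalize` is a user-defined function whose behavior the slides do not specify.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; the field layouts follow the slide.
visits = [  # (user, url, time)
    ("amy", "WWW.CNN.COM", "8:00"),
    ("amy", "www.frogs.com", "9:00"),
    ("fred", "www.snails.com", "11:00"),
]
pages = [  # (url, pagerank)
    ("www.cnn.com", 0.75),
    ("www.frogs.com", 0.5),
    ("www.snails.com", 0.25),
]


def canonicalize(url):
    """Stand-in for the Canonicalize UDF (assumption: it only trims and lowercases)."""
    return url.strip().lower()


def good_users(visits, pages, threshold=0.5):
    # foreach Visits generate user, Canonicalize(url) as url, time
    cleaned = [(user, canonicalize(url), time) for user, url, time in visits]
    # join Visits by url, Pages by url (inner join; assumes each url appears once in pages)
    rank_by_url = dict(pages)
    joined = [(user, url, rank_by_url[url])
              for user, url, _ in cleaned if url in rank_by_url]
    # group VP by user
    ranks_by_user = defaultdict(list)
    for user, _, rank in joined:
        ranks_by_user[user].append(rank)
    # foreach UserVisits generate user, AVG(VP.pagerank) as avgpr
    averages = {user: mean(ranks) for user, ranks in ranks_by_user.items()}
    # filter UserPageranks by avgpr > 0.5
    return sorted((user, avg) for user, avg in averages.items() if avg > threshold)


print(good_users(visits, pages))
```

With this sample data only `amy` passes the filter, since her average pagerank is 0.625 while `fred` has 0.25. On a cluster, Pig compiles the equivalent script into MapReduce jobs instead of running everything in one process.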

  19. Why Pig? • Ease of programming • A scripting language (Pig Latin) • Pig is not SQL • Increased productivity • In one test: • 10 lines of Pig Latin ≈ 200 lines of Java. • What took 4 hours to write in Java took 15 minutes in Pig Latin. • Provides common functionality (such as join, group, sort, etc.) • Extensibility through user-defined functions (UDFs)

  20. Pig Basic: Running Pig

  21. Running Pig • You can run Pig (execute Pig Latin statements and Pig commands) using various modes.

  22. Execution Modes • Local mode • Definition: everything is installed and run on your local host and local file system. • How: use the -x flag: pig -x local • Mapreduce mode • Definition: requires access to a Hadoop cluster and an HDFS installation. • How: this is the default mode; you can, but don't need to, specify it with the -x flag: pig or pig -x mapreduce.

  23. Execution Modes (Cont.) • By Pig command:
      /* local mode */
      $ pig -x local ...
      /* mapreduce mode */
      $ pig ...
      or
      $ pig -x mapreduce ...
      • By Java command:
      /* local mode */
      $ java -cp pig.jar org.apache.pig.Main -x local ...
      /* mapreduce mode */
      $ java -cp pig.jar org.apache.pig.Main ...
      or
      $ java -cp pig.jar org.apache.pig.Main -x mapreduce ...

  24. Interactive Mode • Definition: using the Grunt shell. • How: Invoke the Grunt shell using the "pig" command and then enter your Pig Latin statements and Pig commands interactively at the command line. • Invoke with any of:
      $ pig -x local
      $ pig
      $ pig -x mapreduce
      • Example:
      grunt> A = load 'passwd' using PigStorage(':');
      grunt> B = foreach A generate $0 as id;
      grunt> dump B;

  25. Batch Mode • Definition: run Pig in batch mode. • How: Pig scripts + the pig command • Example:
      /* id.pig */
      a = LOAD 'passwd' USING PigStorage(':');
      b = FOREACH a GENERATE $0 AS id;
      STORE b INTO 'id.out';

      $ pig -x local id.pig

  26. Batch Mode (Cont.) • Pig scripts • Place Pig Latin statements and Pig commands in a single file. • It is not required, but it is good practice, to identify the file with the *.pig extension. • Scripts allow "parameter substitution". • Scripts support comments: • For single-line comments, use -- • For multi-line comments, use /* … */

  27. Pig Basic: Useful Pig Commands

  28. Utility Commands • exec: Run a Pig script in batch mode, with no interaction between the script and the Grunt shell (aliases defined in the script are not visible in the shell). • Syntax: exec [-param param_name = param_value] [-param_file file_name] [script]
      grunt> cat myscript.pig
      a = LOAD 'student' AS (name, age, gpa);
      b = ORDER a BY name;
      STORE b into '$out';
      grunt> exec -param out=myoutput myscript.pig;

  29. Utility Commands • run: Run a Pig script in the current Grunt context: the script can use aliases already defined in the shell, and aliases it defines remain available afterwards. • Syntax: run [-param param_name = param_value] [-param_file file_name] script
      grunt> cat myscript.pig
      b = ORDER a BY name;
      c = LIMIT b 10;
      grunt> a = LOAD 'student' AS (name, age, gpa);
      grunt> run myscript.pig
      grunt> d = LIMIT c 3;
      grunt> DUMP d;
      (alice,20,2.47)
      (alice,27,1.95)

  30. Others • File commands • cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm, rmf • Utility commands • exec • run • help • quit • kill jobid • set debug [on|off] • set job.name 'jobname'

  31. Pig Basic: Learning Pig – Fundamentals

  32. Data Type – Simple Type

  33. Data Type – Complex Type: Tuple • Syntax: ( field [, field …] ) • A tuple is like a row with one or more fields. • Each field can be of any data type. • Any field may or may not have data.

  34. Data Type – Complex Type: Bag • Syntax: { tuple [, tuple …] } • A bag contains multiple tuples (1 to n). • Outer bag / inner bag

  35. Data Type – Complex Type: Map • Syntax: [ key#value <, key#value …> ] • As in other languages, keys must be unique within a map.

  36. Example about Data • Labels shown in the figure: field, tuple, relation (outer bag)

  37. Relation And Data Type • Conclusion: • A field is a piece of data. • A tuple is an ordered set of fields. • A bag is a collection of tuples. • A relation is a bag (more specifically, an outer bag). • Pig Latin statements work with relations!
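
The nesting of these types can be sketched with ordinary Python containers. This is only an analogy to build intuition, not how Pig represents data internally. The sample values and the `describe` helper are hypothetical.

```python
# A field is a single value; None models a field that has no data.
name_field = "John"

# A tuple is an ordered sequence of fields.
row = ("John", 18, None)

# A bag is a collection of tuples. A list is used because bags may hold duplicates.
inner_bag = [(3, 8, 9), (1, 4, 7)]

# A map associates unique keys with values (written [key#value] in Pig Latin).
info = {"name": "John", "age": 18}

# A relation is an outer bag. A field of a tuple may itself be
# a tuple, a bag (an inner bag), or a map.
relation = [
    ("John", inner_bag, info),
    ("Mary", [], {"name": "Mary"}),
]


def describe(value):
    """Name the Pig-style kind that a Python value stands for in this sketch."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, tuple):
        return "tuple"
    return "field"


for record in relation:
    print([describe(field) for field in record])
```

Each record in `relation` is a tuple whose second field is an inner bag and whose third field is a map, which is the outer bag / inner bag distinction from slide 34.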

  38. Schemas • Declared with the AS keyword. • Can be defined with the LOAD, STREAM, and FOREACH operators. • Enable better parse-time error checking and more efficient code execution. • Optional, but encouraged.

  39. Schema • Legal forms: • Both the field name and the field type • The field name only, without a field type • No schema at all

  40. Schema – Example 1 • Data:
      John 18 true
      Mary 19 false
      Bill 20 true
      Joe 18 true
      • Field names and types: a = LOAD 'data' AS (name:chararray, age:int, male:boolean);
      • Field names only: a = LOAD 'data' AS (name, age, male);
      • No schema: a = LOAD 'data';

  41. Schema – Example 2 • Data:
      (3,8,9) (mary,19)
      (1,4,7) (john,18)
      (2,5,8) (joe,18)
      • With the tuple keyword: a = LOAD 'data' AS (F:tuple(f1:int, f2:int, f3:int), T:tuple(t1:chararray, t2:int));
      • Short form: a = LOAD 'data' AS (F:(f1:int, f2:int, f3:int), T:(t1:chararray, t2:int));

  42. Schema – Example 3 • Data:
      {(3,8,9)}
      {(1,4,7)}
      {(2,5,8)}
      • With the bag and tuple keywords: a = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
      • Short form: a = LOAD 'data' AS (B: {T: (t1:int, t2:int, t3:int)});

  43. Schema – other cases • What if the field name or field type is not defined? • No field name: you can only refer to the field using positional notation ($0, $1, …). • No field type: the value is treated as bytearray. You can change the default type using the cast operators. • Data:
      John 18 true
      Mary 19 false
      Bill 20 true
      Joe 18 true
      • Example:
      grunt> a = LOAD 'data';
      grunt> b = FOREACH a GENERATE $0;
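
The effect of loading without a schema can be modeled in plain Python. This is a rough sketch, not Pig's implementation. The `load` helper is hypothetical, raw `bytes` stand in for bytearray fields, and a tab delimiter is assumed because tab is PigStorage's default.

```python
def load(lines, delimiter=b"\t"):
    """Model LOAD without a schema: every field stays raw bytes (like bytearray)."""
    return [tuple(line.split(delimiter)) for line in lines]


raw = [b"John\t18\ttrue", b"Mary\t19\tfalse"]
a = load(raw)

# b = FOREACH a GENERATE $0;  -> positional access, the field is still untyped
b = [(t[0],) for t in a]

# An explicit cast, comparable to (int)$1 in Pig Latin
ages = [int(t[1]) for t in a]

print(b)
print(ages)
```

Without field names the only handle on a column is its position, and without types every value has to be cast before it can be used as a number.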

  44. Dereference • When you work with complex data types, use the dereference operator (.) to access the fields within them. • Dereferencing a tuple: you get the specified field. • Dereferencing a bag: you get a bag composed of the specified fields.
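
The difference between dereferencing a tuple and dereferencing a bag can be sketched in plain Python. The helper names `deref_tuple` and `deref_bag` are hypothetical and only mimic what the (.) operator returns in each case.

```python
def deref_tuple(t, schema, field):
    """Model T.field on a tuple: returns the single named field."""
    return t[schema.index(field)]


def deref_bag(bag, schema, *fields):
    """Model B.(f1, f2) on a bag: returns a new bag with only those fields per tuple."""
    positions = [schema.index(f) for f in fields]
    return [tuple(t[i] for i in positions) for t in bag]


schema = ("t1", "t2", "t3")
bag = [(3, 8, 9), (1, 4, 7), (2, 5, 8)]

print(deref_tuple(bag[0], schema, "t2"))
print(deref_bag(bag, schema, "t1", "t3"))
```

Dereferencing the tuple yields one value, while dereferencing the bag yields another bag with one projected tuple per original tuple. This is why an expression such as AVG(VP.pagerank) receives a bag of pagerank values.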

  45. Pig Basic: Pig Latin Statements

  46. Standard Flow

  47. Loading Data • Syntax: LOAD 'data' [USING function] [AS schema]; • Use the LOAD operator to load data from the file system. • The default load function is "PigStorage". • Example:
      a = LOAD 'data' AS (name:chararray, age:int);
      a = LOAD 'data' USING PigStorage(':') AS (name, age);

  48. Working with Data • Transform data into what you want. • Major operators: • FOREACH : work with columns of data • FILTER : work with tuples or rows of data • GROUP : group data in a single relation • JOIN : Performs an inner join of two or more relations based on common field values. • UNION : merge the contents of two or more relations • SPLIT : Partitions a relation into two or more relations.
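
What most of these operators do to a relation can be modeled on a small in-memory list. This is a plain-Python sketch with hypothetical sample rows, meant only to show the shape of each result, not how Pig executes them. JOIN is left out here.

```python
from collections import defaultdict

# Hypothetical relation with fields (name, age, male)
students = [("John", 18, True), ("Mary", 19, False), ("Bill", 20, True), ("Joe", 18, True)]

# FOREACH students GENERATE name, age;   (works on columns)
names_ages = [(name, age) for name, age, _ in students]

# FILTER students BY age > 18;           (works on rows)
older = [t for t in students if t[1] > 18]

# GROUP students BY age;                 (one tuple per key, holding a bag of matching tuples)
groups = defaultdict(list)
for t in students:
    groups[t[1]].append(t)
grouped = sorted(groups.items())

# SPLIT students INTO males IF male == true, females IF male == false;
males = [t for t in students if t[2]]
females = [t for t in students if not t[2]]

# UNION males, females;                  (merges contents; Pig does not preserve order)
merged = males + females

print(grouped)
```

Note that GROUP does not aggregate by itself. It produces one tuple per key whose second field is a bag, and a following FOREACH typically applies a function such as AVG or COUNT to that bag.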

  49. Storing Intermediate Results • Default location is “/tmp” • Could be configured by property “pig.temp.dir”

  50. Storing Final Results • Syntax: STORE relation INTO 'directory' [USING function]; • Use the STORE operator to write output to the file system. • The default store function is "PigStorage". • If the directory already exists, the STORE operation will fail. • Output files are named "part-nnnnn". • Example:
      STORE a INTO 'myoutput' USING PigStorage('*');
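
The two STORE rules above (the target directory must not exist, and the output lands in part files inside it) can be imitated locally in Python. This is a sketch under assumptions. The `store` helper is hypothetical, it writes a single file named `part-00000` following the slide's "part-nnnnn" pattern, and real clusters usually produce several part files.

```python
import os
import tempfile


def store(relation, directory, delimiter="\t"):
    """Write tuples as delimited text into a new directory, as one part file."""
    # os.makedirs raises FileExistsError if the directory already exists,
    # which mirrors STORE failing on an existing output directory.
    os.makedirs(directory)
    path = os.path.join(directory, "part-00000")
    with open(path, "w", encoding="utf-8") as out:
        for t in relation:
            out.write(delimiter.join(str(field) for field in t) + "\n")
    return path


base = tempfile.mkdtemp()
target = os.path.join(base, "myoutput")
part_file = store([("John", 18), ("Mary", 19)], target, delimiter="*")
print(part_file)
```

Calling `store` a second time with the same `target` raises `FileExistsError`, so a rerun has to remove the old output directory first or write to a new one, just as with a Pig script.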
