
Pig: Making Hadoop Easy

Wednesday, June 10, 2009, Santa Clara Marriott.


Presentation Transcript


  1. Pig: Making Hadoop Easy. Wednesday, June 10, 2009, Santa Clara Marriott

  2. What is Pig? An engine that executes Pig Latin locally or on a Hadoop cluster. Pig Latin, a high level data processing language.

  3. An Example Problem • Data • User records • Pages served • Question: the 5 pages most visited by users aged 18 - 25. Data flow: Load Users → Filter by age; Load Pages; Join on name → Group on url → Count clicks → Order by clicks → Take top 5

  4. In Map Reduce

  5. In Pig Latin
       Users = load 'users' as (name, age);
       Fltrd = filter Users by age >= 18 and age <= 25;
       Pages = load 'pages' as (user, url);
       Jnd = join Fltrd by name, Pages by user;
       Grpd = group Jnd by url;
       Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
       Srtd = order Smmd by clicks desc;
       Top5 = limit Srtd 5;
       store Top5 into 'top5sites';
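
To make the semantics of each statement concrete, the same data flow can be mimicked on small in-memory data. This is only a sketch in plain Python, not how Pig executes the script: the sample records, the function name `top_pages`, and the alphabetical tie-break are assumptions added for illustration, and user names are assumed to be unique.

```python
from collections import Counter

# Hypothetical sample data; the slides do not show the contents of 'users' or 'pages'.
users = [("amy", 22), ("bob", 31), ("cat", 19), ("dan", 17)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("bob", "a.com"),
         ("cat", "a.com"), ("cat", "c.com"), ("dan", "b.com")]

def top_pages(users, pages, lo=18, hi=25, n=5):
    # Fltrd = filter Users by age >= 18 and age <= 25
    # (a set is enough here because names are assumed unique)
    names = {name for name, age in users if lo <= age <= hi}
    # Jnd = join Fltrd by name, Pages by user
    joined = [(user, url) for user, url in pages if user in names]
    # Grpd / Smmd = group Jnd by url, then COUNT(Jnd) as clicks
    clicks = Counter(url for _, url in joined)
    # Srtd / Top5 = order by clicks desc, then limit n
    # (ties broken alphabetically by url; the Pig script does not specify this)
    return sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

top5 = top_pages(users, pages)
```

With the sample data above, only amy and cat fall in the 18 to 25 range, so bob's and dan's page views do not contribute to the counts.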

  6. Comparison • 1/20 the lines of code • 1/16 the development time • Performance: 1.5x Hadoop

  7. Pig Compared to Map Reduce • Faster development time • Data flow versus programming logic • Many standard data operations (e.g. join) included • Manages all the details of connecting jobs and data flow • Copes with Hadoop version change issues

  8. And, You Don’t Lose Power • UDFs can be used to load, evaluate, aggregate, and store data • External binaries can be invoked • Metadata is optional • Flexible data model • Nested data types • Explicit data flow programming

  9. Pig Commands

  10. How it Works Pig Latin script is translated to a set of operators, which are placed in one or more MR jobs and executed.
        A = load 'myfile';
        B = filter A by $1 > 0;
        C = group B by $0;
        D = foreach C generate group, COUNT(B) as cnt;
        E = filter D by cnt > 5;
        dump E;
      Operator placement • Map: Filter $1 > 0 • Combiner: COUNT(B) • Reducer: SUM(COUNT(B)), Filter cnt > 5
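
The split of this script across map, combiner, and reducer can be simulated in a few lines of Python. This is a sketch of the general MapReduce pattern the slide describes, not Pig's actual operator code; the function names and the two input splits are hypothetical.

```python
from collections import defaultdict

def map_phase(split):
    # Map: B = filter A by $1 > 0; emit (group key $0, record)
    return [(rec[0], rec) for rec in split if rec[1] > 0]

def combine_phase(mapped):
    # Combiner: partial COUNT(B) per key, computed on one map task's output
    partial = defaultdict(int)
    for key, _ in mapped:
        partial[key] += 1
    return list(partial.items())

def reduce_phase(all_partials, threshold=5):
    # Reducer: SUM the partial counts, then E = filter D by cnt > 5
    totals = defaultdict(int)
    for key, count in all_partials:
        totals[key] += count
    return {key: cnt for key, cnt in totals.items() if cnt > threshold}

# Two hypothetical input splits, each handled by its own map task.
splits = [
    [("x", 1)] * 4 + [("y", 2)] * 2 + [("x", 0)],
    [("x", 3)] * 3 + [("y", 1)] + [("z", -1)],
]
partials = [p for split in splits for p in combine_phase(map_phase(split))]
result = reduce_phase(partials)
```

Because counting is done as a sum of partial counts, the combiner can shrink each map task's output to one pair per key before anything is sent to the reducer.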

  11. What Users Do with Pig • Inside Yahoo (based on user interviews) • 60% of ad hoc and 40% of production MR jobs • Production • Examples: search infrastructure, ad relevance • Attraction: fast development, extensibility via custom code, protection against Hadoop changes, debugability • Ad hoc • Examples: user intent analysis • Attraction: easy to learn, compact readable code, fast iteration when trying new algorithms, easy for collaboration

  12. What Users Do with Pig • Outside Yahoo (based on mailing list responses) • Processing search engine query logs: “Pig programs are easier to maintain, and less error-prone than native java programs. It is an excellent piece of work.” • Image recommendations: “I am using it as a rapid-prototyping language to test some algorithms on huge amounts of data.” • Adsorption Algorithm (video recommendations) • Hofmann’s PLSI implementation: “The E/M logic was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in mapreduce java. Exactly that’s the reason I wanted to try it out in Pig. It took ~ 3-4 days for me to write it, starting from learning pig.”

  13. Users Extending Pig: PigPy • Created by Marshall Weir at Zattoo • Uses Python to create Pig Latin scripts on the fly • Enables looping • Branching based on job results • Submits Pig jobs from Python scripts • Caches intermediate calculations • Avoids variable name collisions in large scripts
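
The idea of generating Pig Latin from Python can be illustrated with plain string templating. The sketch below does not use PigPy's API, which the slides do not show; `age_band_script`, the age bands, and the output path pattern are hypothetical, and the script is only built as text, not submitted to Pig.

```python
def age_band_script(lo, hi):
    """Return Pig Latin text for the top-5 pages query over one age band."""
    s = f"{lo}_{hi}"  # per-band alias suffix, so aliases never collide
    return "\n".join([
        f"Users_{s} = load 'users' as (name, age);",
        f"Fltrd_{s} = filter Users_{s} by age >= {lo} and age <= {hi};",
        f"Pages_{s} = load 'pages' as (user, url);",
        f"Jnd_{s} = join Fltrd_{s} by name, Pages_{s} by user;",
        f"Grpd_{s} = group Jnd_{s} by url;",
        f"Smmd_{s} = foreach Grpd_{s} generate group, COUNT(Jnd_{s}) as clicks;",
        f"Srtd_{s} = order Smmd_{s} by clicks desc;",
        f"Top5_{s} = limit Srtd_{s} 5;",
        f"store Top5_{s} into 'top5sites_{s}';",
    ])

# Looping, which Pig Latin itself lacks, happens in Python (bands are made up).
bands = [(18, 25), (26, 35), (36, 50)]
script = "\n".join(age_band_script(lo, hi) for lo, hi in bands)
```

Suffixing every alias with the band is one simple way to get the collision avoidance the slide mentions when many generated fragments end up in one large script.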

  14. Version 0.2.0 • Released April 2009 • Added type system • ~5x better performance than 0.1 • More aggressive use of the combiner • Map side join • Handles key skew in ORDER BY • Improved error handling • Improved documentation
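
As an illustration of what a map side join means in general, the sketch below joins a small input held in memory against a streamed split of a large input, so no shuffle or reduce phase is needed. It shows the general technique under the assumption that one input fits in memory; it is not Pig's implementation, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def map_side_join(small, large_split):
    # Build an in-memory index of the small input, keyed on the join key.
    index = defaultdict(list)
    for key, value in small:
        index[key].append(value)
    # Stream one split of the large input; each record is joined inside the
    # map task, so nothing has to be shuffled to a reducer.
    for key, value in large_split:
        for small_value in index.get(key, []):
            yield (key, small_value, value)

# Hypothetical data: a small users table and one split of a large pages table.
users = [("amy", 22), ("cat", 19)]
pages_split = [("amy", "a.com"), ("bob", "a.com"), ("cat", "c.com"), ("amy", "b.com")]
joined = list(map_side_join(users, pages_split))
```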

  15. Version 0.3.0 • Release branch created June 8th, 2009 • Supports multiple STOREs in one MR job • Supports multiple GROUP BYs in one MR job
        students = load 'students' as (name, age, gpa);
        a_ed = filter students by age > 25;
        store a_ed into 'adult_ed';
        gname = group a_ed by name;
        cname = foreach gname generate group, COUNT(a_ed);
        store cname into 'count_by_name';
        g_age = group a_ed by age;
        c_age = foreach g_age generate group, COUNT(a_ed);
        store c_age into 'count_by_age';
      In 0.2.0 and before, this would be 3 MR jobs. In 0.3.0 it will be one. Seeing up to 10x speedup for these types of scripts.
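
The benefit of combining these stores into one job comes from scanning the input once and feeding every output from that single scan. The following Python sketch shows that idea on in-memory data; it is an analogy rather than Pig's execution plan, and the function name and sample records are hypothetical.

```python
from collections import Counter

# Hypothetical (name, age, gpa) records standing in for 'students'.
students = [("ann", 30, 3.1), ("bob", 22, 3.5), ("ann", 41, 2.9), ("cy", 30, 3.8)]

def one_pass(students):
    adult_ed = []
    count_by_name, count_by_age = Counter(), Counter()
    for name, age, gpa in students:      # single scan of the input
        if age > 25:                     # a_ed = filter students by age > 25
            adult_ed.append((name, age, gpa))
            count_by_name[name] += 1     # group a_ed by name, COUNT
            count_by_age[age] += 1       # group a_ed by age, COUNT
    return adult_ed, dict(count_by_name), dict(count_by_age)

adult_ed, by_name, by_age = one_pass(students)
```

Running the three outputs as separate jobs would read and filter `students` three times; here the filter is evaluated once per record.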

  16. Currently Working On • Map side merge join • Handling severe skew in join keys • Improving memory footprint • Extending optimizer capabilities
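
For readers unfamiliar with merge joins, the sketch below shows the general technique: when both inputs are already sorted on the join key, they can be joined in one linear pass without building a hash table. This is a generic illustration, not the design being worked on for Pig; the function name and sample data are hypothetical.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1            # left key has no match yet; advance left
        elif lk > rk:
            j += 1            # right key has no match; advance right
        else:
            # Find the run of records sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out

# Hypothetical inputs, both sorted by key.
left = [("amy", 22), ("cat", 19), ("dan", 17)]
right = [("amy", "a.com"), ("amy", "b.com"), ("bob", "a.com"), ("cat", "c.com")]
merged = merge_join(left, right)
```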

  17. SQL • Pig will be bilingual, accepting SQL and Pig Latin • UDFs will work in both languages • Gives users ability to choose appropriate interface level • Administrators have one component to maintain

  18. Metadata for the Grid • Provide metadata model for files and directories as data sets • Usable from Map Reduce and Pig • Attach user defined attributes to data sets • Define hierarchy and associations between data sets • Record data schema and statistics • Browsing, searching, and metadata administration via GUI and web services API • JIRA: PIG-823

  19. Storage Access Layer • Common abstraction to contain storage access features and optimizations • Support fast projection • Support early row filtering • CPU/space efficient data serialization and compression • Usable by Map Reduce and Pig • PIG-833

  20. Learn More • Come to the Hadoop Summit Training, tomorrow • Watch the training by Yahoo! and Cloudera: http://www.cloudera.com/hadoop-training-pig-introduction • Get involved: http://hadoop.apache.org/pig

  21. Q & A
