
Making Hadoop Easy


Presentation Transcript


  1. Pig: Making Hadoop Easy http://hadoop.apache.org/pig

  2. What is Pig?

  3. Pig is a Language and an Engine • Pig Latin, a high-level data processing language. • An engine that executes Pig Latin locally or on a Hadoop cluster.

  4. Pig is a Hadoop Subproject • Apache Incubator: October ’07 to October ’08 • Graduated into a Hadoop subproject • Main page: http://hadoop.apache.org/pig/

  5. Why Pig? • Higher-level languages: • Increase programmer productivity • Decrease duplication of effort • Open the system to more users • Pig insulates you against Hadoop complexity: • Hadoop version upgrades • JobConf configuration tuning • Job chains

  6. An Example Problem Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25. The dataflow: Load Users and Load Pages → Filter Users by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5.

  7. In Map Reduce

  8. In Pig Latin
     Users = load 'users' as (name, age);
     Fltrd = filter Users by age >= 18 and age <= 25;
     Pages = load 'pages' as (user, url);
     Jnd   = join Fltrd by name, Pages by user;
     Grpd  = group Jnd by url;
     Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
     Srtd  = order Smmd by clicks desc;
     Top5  = limit Srtd 5;
     store Top5 into 'top5sites';

  9. Ease of Translation Notice how naturally the components of the job translate into Pig Latin:
     Load Users      → Users = load …
     Load Pages      → Pages = load …
     Filter by age   → Fltrd = filter …
     Join on name    → Jnd = join …
     Group on url    → Grpd = group …
     Count clicks    → Smmd = … COUNT() …
     Order by clicks → Srtd = order …
     Take top 5      → Top5 = limit …

  10. Comparison Compared to the equivalent Map Reduce program: 1/20 the lines of code, 1/16 the development time, performance within 2x.

  11. Pig Compared to Map Reduce • Faster development time • Many standard data operations (project, filter, join) already included. • Pig manages all the details of Map Reduce jobs and data flow for you.

  12. And, You Don’t Lose Power • Easy to provide user code throughout; external binaries can be invoked. • Metadata is not required, but it is supported and used when available. • Pig does not impose a data model on you. • Fine-grained control: one line equals one action. • Complex data types.

  13. Example, User Code
     -- use a custom loader
     Logs = load 'apachelogfile' using CommonLogLoader() as (addr, logname, user, time, method, uri, p, bytes);
     -- apply your own function
     Cleaned = foreach Logs generate addr, canonicalize(uri) as url;
     Grouped = group Cleaned by url;
     -- run the result through a binary
     Analyzed = stream Grouped through 'urlanalyzer.py';
     store Analyzed into 'analyzedurls';
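  To run a script like this, the jar containing the custom loader and UDF would normally be made visible to Pig first. A minimal sketch of that setup (the jar name and Java package below are assumptions, not from the slides):
     register myudfs.jar;                            -- make the loader and UDF classes visible to Pig
     -- if the classes live in a package, 'define' gives them the short names used above
     define CommonLogLoader com.example.CommonLogLoader();
     define canonicalize    com.example.Canonicalize();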

  14. Example, Schema on the Fly
     -- declare your types
     Grades = load 'studentgrades' as (name: chararray, age: int, gpa: double);
     Good = filter Grades by age > 18 and gpa > 3.0;
     -- ordering will be by type
     Sorted = order Good by gpa;
     store Sorted into 'smartgrownups';
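  For contrast with slide 12's point that metadata is optional, here is a minimal sketch (an assumption, not part of the original deck) of the same pipeline with no declared schema, addressing fields by position and casting where needed:
     Grades = load 'studentgrades';                                  -- no schema declared
     Good   = filter Grades by (int)$1 > 18 and (double)$2 > 3.0;    -- $1 = age, $2 = gpa
     Sorted = order Good by $2;
     store Sorted into 'smartgrownups';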

  15. Pig Commands
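  As a rough, illustrative sketch of the kinds of commands Pig Latin provides (the operators are real; the relation and file names below are made up, not from the slide):
     A = load 'data' as (f1, f2, f3);          -- load: read data, optionally with a schema
     B = filter A by f1 > 0;                   -- filter: keep rows matching a condition
     C = foreach B generate f1, f2;            -- foreach ... generate: project / transform
     D = group C by f1;                        -- group: collect rows sharing a key
     E = foreach D generate group, COUNT(C);   -- aggregate within each group
     F = order E by $1 desc;                   -- order: sort
     G = limit F 10;                           -- limit: keep the first n rows
     J = join B by f1, C by f1;                -- join: relational join of two inputs
     store G into 'out';                       -- store: write a relation to storage
     dump J;                                   -- dump: print a relation to the console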

  16. How it Works A Pig Latin script is translated into a set of operators which are placed in one or more Map Reduce jobs and executed. For example:
     A = load 'myfile';
     B = filter A by $1 > 0;
     C = group B by $0;
     D = foreach C generate group, COUNT(B) as cnt;
     E = filter D by cnt > 5;
     dump E;
     Here the filter on $1 runs in the map, COUNT(B) runs in the combiner and again in the reducer, and the filter on cnt > 5 runs in the reducer.

  17. Current Pig Status • 30% of all Hadoop jobs at Yahoo are now Pig jobs, thousands per day. • Graduated from the Apache Incubator in October ’08 and was accepted as a Hadoop sub-project. • In the process of releasing version 0.2.0: • type system • 2-10x speedup • latency within 1.6x of raw Hadoop • Improved user experience: • improved documentation • Pig Tutorial • UDF repository (PiggyBank) • development environment (Eclipse plugin)

  18. What Users Do with Pig • Inside Yahoo (based on user interviews) • Used for both production processes and ad hoc analysis • Production • Examples: search infrastructure, ad relevance • Attraction: fast development, extensibility via custom code, protection against Hadoop changes, debuggability • Research • Examples: user intent analysis • Attraction: easy to learn, compact readable code, fast iteration when trying new algorithms, easy collaboration

  19. What Users Do with Pig • Outside Yahoo (based on mailing list responses) • Processing search engine query logs • “Pig programs are easier to maintain, and less error-prone than native Java programs. It is an excellent piece of work.” • Image recommendations • “I am using it as a rapid-prototyping language to test some algorithms on huge amounts of data.” • Adsorption algorithm (video recommendations) • Hofmann's PLSI implementation in Pig • “The E/M logic was implemented in Pig in 30-35 lines of Pig Latin statements. Took a lot less compared to what it took in implementing the algorithm in MapReduce Java. Exactly that's the reason I wanted to try it out in Pig. It took ~3-4 days for me to write it, starting from learning Pig :)” • Inverted index • “The Pig feature that makes it stand out is the easy native support for nested elements: a tuple can have other tuples nested inside it; they also support maps and a few other constructs. The SIGMOD 2008 paper presents the language and gives examples of how the system is used at Yahoo. Without further ado, a quick example of the kind of processing that would be awkward, if not impossible, to write in regular SQL, and long and tedious to express in Java (even using Hadoop).”
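  To make the nested-elements point above concrete, here is a minimal sketch (the file 'queries' and its fields are assumptions, not from the deck): grouping leaves a bag of tuples nested inside every output row.
     Logs   = load 'queries' as (user: chararray, query: chararray);
     ByUser = group Logs by user;            -- each row of ByUser carries a nested bag of Logs tuples
     Out    = foreach ByUser generate group as user, COUNT(Logs) as n, Logs.query as queries;
     dump Out;                               -- illustrative output shape: (alice, 3, {(pig),(hadoop),(pig latin)})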

  20. What Users Do with Pig • Common asks • Control structures or embedding • UDFs in scripting languages (Perl, Python) • More performance

  21. Roadmap • Performance • Latency: goal of 10-20% overhead compared to Hadoop • Better scalability: memory usage, dealing with skew • Planned improvements • Multi-query support • Rule-based optimizer • Handling skew in joins • Pushing projections to the loader • More efficient serialization • Better memory utilization

  22. Roadmap (cont.) • Functionality • UDFs in languages other than Java • Perl, C++ • New Parser with better error handling

  23. How Do I Get a Pig of My Own? • Need an installation of Hadoop to run on, see http://hadoop.apache.org/core/ • Get the pig jar. You can get release 0.1.0 at http://hadoop.apache.org/pig/releases.html; I strongly recommend using the code from trunk. • Get a copy of the hadoop-site.xml file for your Hadoop cluster. • Run java -cp pig.jar:configdir org.apache.pig.Main where configdir is the directory containing your hadoop-site.xml.

  24. How Do I Make My Pig Work? • Starting pig with no script puts you in the grunt shell, where you can type Pig Latin statements and HDFS navigation commands (a minimal session is sketched below). • Pig Latin can also be put in a file that is then passed to pig. • A JDBC-like interface is available for use from Java. • PigPen, an Eclipse plugin, supports textual and graphical construction of scripts and shows sample data flowing through the script to illustrate how your script will work.
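  A minimal sketch of a grunt session (the paths and field names are assumptions); the same statements could instead be saved to a file, say myscript.pig, and passed as an argument to the java command from the previous slide:
     grunt> ls /data
     grunt> A = load '/data/users' as (name, age);    -- Pig Latin statements build up a dataflow
     grunt> B = filter A by age >= 18;
     grunt> dump B;                                   -- dump (or store) triggers execution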

  25. Q & A
