Pig Latin: A Not-So-Foreign Language For Data Processing
Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Research
Data Processing Renaissance • Internet companies swimming in data • E.g. TBs/day at Yahoo! • Data analysis is “inner loop” of product innovation • Data analysts are skilled programmers
Data Warehousing …?
• Scale: often not scalable enough
• $: prohibitively expensive at web scale
  • Up to $200K/TB
• SQL: little control over execution method
  • Query optimization is hard
    • Parallel environment
    • Little or no statistics
    • Lots of UDFs
New Systems For Data Analysis • Map-Reduce • Apache Hadoop • Dryad . . .
Map-Reduce
[Diagram: input records pass through parallel map tasks, then reduce tasks, to produce output records]
Just a group-by-aggregate?
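To make the "group-by-aggregate" reading concrete, here is a minimal in-memory sketch of one map-reduce pass. It is not Hadoop's API: `map_reduce`, `emit_url`, `count_values`, and the sample records are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run one map-reduce pass in memory: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # The map function may emit any number of (key, value) pairs per record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # The reduce function sees each key once, with all of its values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical visit records: (user, url, time).
visits = [("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.com", 3)]

def emit_url(record):
    _user, url, _time = record
    yield url, 1

def count_values(_key, values):
    return sum(values)

visit_counts = map_reduce(visits, emit_url, count_values)
```

Seen this way, the map step picks the grouping key and the reduce step is the aggregate, which is why a single pass resembles a group-by-aggregate query.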
The Map-Reduce Appeal
• Scale: scalable due to simpler design
  • Only parallelizable operations
  • No transactions
• $: runs on cheap commodity hardware
• SQL: procedural control, a processing “pipe”
Disadvantages
1. Extremely rigid data flow (M → R)
  • Other flows constantly hacked in: join, union, split, chains (M → M → M → R)
2. Common operations must be coded by hand
  • Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
  • Difficult to maintain, extend, and optimize
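As an illustration of the second point, the sketch below hand-codes a join in the map-reduce style (often called a reduce-side join). It runs in memory and is only an assumption about how such code typically looks; `reduce_side_join` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import product

def reduce_side_join(left, right, left_key, right_key):
    """Inner-join two lists of tuples by routing both inputs through one grouping step."""
    # "Map" phase: file each record under its join key, tagged by which input it came from.
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[left_key(rec)][0].append(rec)
    for rec in right:
        groups[right_key(rec)][1].append(rec)
    # "Reduce" phase: for each key, pair every left record with every right record.
    joined = []
    for key in sorted(groups):
        lefts, rights = groups[key]
        joined.extend(l + r for l, r in product(lefts, rights))
    return joined

# Hypothetical inputs: (url, count) and (url, category, pRank).
visit_counts = [("cnn.com", 2), ("bbc.com", 1)]
url_info = [("cnn.com", "news", 0.9), ("espn.com", "sports", 0.7)]

joined = reduce_side_join(visit_counts, url_info,
                          left_key=lambda r: r[0], right_key=lambda r: r[0])
```

Every job that needs a join has to repeat this tagging and pairing logic, which is the kind of boilerplate a single `join` statement replaces.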
Pros And Cons Need a high-level, general data flow language
Enter Pig Latin Need a high-level, general data flow language Pig Latin
Outline • Map-Reduce and the need for Pig Latin • Pig Latin example • Salient features • Implementation
Example Data Analysis Task
Find the top 10 most visited pages in each category
Input tables: Visits, Url Info
Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info
Join on url → Group by category → Foreach category generate top10 urls
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
// SQL equivalent of the join: select * from visitCounts v, urlInfo u where v.url = u.url
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
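For readers who think in general-purpose code, the following Python sketch mirrors what each statement of the script computes on small in-memory inputs. It is an illustration, not how Pig executes the script. The function name and sample rows are hypothetical, and ranking `top` by visit count (descending) is an assumption, since the slide does not state the ordering field.

```python
from collections import Counter, defaultdict

def top_urls_by_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url (inner join: unmatched urls are dropped)
    joined = [(url, visit_counts[url], category)
              for url, category, _p_rank in url_info
              if url in visit_counts]
    # group visitCounts by category
    by_category = defaultdict(list)
    for url, count, category in joined:
        by_category[category].append((url, count))
    # foreach category generate top(visitCounts, n)
    # Assumption: "top" ranks by visit count, highest first; ties broken by url.
    return {category: sorted(rows, key=lambda r: (-r[1], r[0]))[:n]
            for category, rows in by_category.items()}

# Hypothetical sample data.
visits = [("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.com", 3),
          ("cal", "espn.com", 4), ("cal", "unknown.org", 5)]
url_info = [("cnn.com", "news", 0.9), ("bbc.com", "news", 0.8),
            ("espn.com", "sports", 0.7)]

top1 = top_urls_by_category(visits, url_info, n=1)
```

Each Python statement corresponds to one Pig Latin statement, which reflects the step-by-step style the script is meant to show.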
Outline • Map-Reduce and the need for Pig Latin • Pig Latin example • Salient features • Implementation
Step-by-step Procedural Control
Target users are entrenched procedural programmers
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” (Jasmine Novak, Engineer, Yahoo!)
“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.” (David Ciemiewicz, Search Excellence, Yahoo!)
• Automatic query optimization is hard
• Pig Latin does not preclude optimization
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Operates directly over files
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Schemas are optional and can be assigned dynamically
User-Code as a First-Class Citizen
• User-defined functions (UDFs) can be used in every construct
  • Load, Store
  • Group, Filter, Foreach
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Nested Data Model
• Pig Latin has a fully-nestable data model with:
  • Atomic values, tuples, bags (lists), and maps
• More natural to programmers than flat tuples
• Avoids expensive joins
• See paper
[Figure: nested-data example with the values yahoo, finance, email, news]
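The four kinds of value can be approximated with ordinary Python structures. This is a loose analogy rather than Pig's actual representation: a Python list stands in for a bag even though a bag carries no ordering guarantee, and the `site` record with its field names is hypothetical.

```python
from collections import defaultdict

# Atomic values are plain strings and numbers; a tuple is a fixed sequence of fields.
visits = [
    ("amy", "cnn.com", 1),
    ("bob", "cnn.com", 2),
    ("amy", "bbc.com", 3),
]

# A bag is a collection of tuples. Grouping yields one (key, bag) tuple per url,
# so the visits for a url travel with it instead of being re-joined later.
bags = defaultdict(list)
for visit in visits:
    bags[visit[1]].append(visit)
g_visits = [(url, bag) for url, bag in bags.items()]

# A map associates keys with values, and those values may themselves be nested.
# The key names used here are hypothetical.
site = {"name": "yahoo", "sections": [("finance",), ("email",), ("news",)]}
```

Because each group keeps its member tuples inline, a later step can aggregate over them directly, which is the sense in which nesting avoids a join.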
Outline • Map-Reduce and the need for Pig Latin • Pig Latin example • Salient features • Implementation
Implementation
[Diagram with the labels: user, SQL, automatic rewrite + optimize, Pig, Hadoop Map-Reduce, cluster]
• Pig is open-source: http://incubator.apache.org/pig
Compilation into Map-Reduce
• Every group or join operation forms a map-reduce boundary
• Other operations pipelined into map and reduce phases
Map1: Load Visits, Group by url
Reduce1 / Map2: Foreach url generate count, Load Url Info, Join on url
Reduce2 / Map3: Group by category
Reduce3: Foreach category generate top10(urls)
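The boundary rule can be sketched as a small function that cuts a linear list of operations into jobs. This is a simplified illustration under stated assumptions, not Pig's compiler: the plan is treated as a flat sequence, a `load` that follows a reduce phase is assumed to start the next job's map phase, and `split_into_jobs` is a hypothetical name.

```python
def split_into_jobs(operations):
    """Cut a linear plan into map-reduce jobs, with a boundary at every group or join."""
    jobs = []
    current = {"map": [], "reduce": []}
    phase = "map"
    for op in operations:
        kind = op.split()[0].lower()
        if kind in ("group", "join"):
            # A group/join seen during a reduce phase needs a fresh job of its own.
            if phase == "reduce":
                jobs.append(current)
                current = {"map": [], "reduce": []}
            current["map"].append(op)
            phase = "reduce"  # whatever follows runs on the grouped data
        elif kind == "load" and phase == "reduce":
            # Assumption: a new input is read in the map phase of the next job.
            jobs.append(current)
            current = {"map": [op], "reduce": []}
            phase = "map"
        else:
            # Every other operation is pipelined into the phase in progress.
            current[phase].append(op)
    jobs.append(current)
    return jobs

plan = [
    "load visits",
    "group by url",
    "foreach url generate count",
    "load url info",
    "join on url",
    "group by category",
    "foreach category generate top10",
]
jobs = split_into_jobs(plan)
```

On this plan the function yields three jobs, matching the three map-reduce pairs named on the slide, with the second job's reduce phase left empty because the next operation is itself a boundary.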
Usage • First production release about a year ago • 150+ early adopters within Yahoo! • Over 25% of the Yahoo! map-reduce user base
Related Work
• Sawzall
  • Data processing language on top of map-reduce
  • Rigid structure of filtering followed by aggregation
• DryadLINQ
  • SQL-like language on top of Dryad
• Nested data models
  • Object-oriented databases
Future Work
• Optional “safe” query optimizer
  • Performs only high-confidence rewrites
• User interface
  • Boxes and arrows UI
  • Promote collaboration, sharing code fragments and UDFs
• Tight integration with a scripting language
  • Use loops, conditionals of host language
Credits
Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich, Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi
Summary
• Big demand for parallel data processing
  • Emerging tools that do not look like SQL DBMS
  • Programmers like dataflow pipes over static files
  • Hence the excitement about Map-Reduce
• But, Map-Reduce is too low-level and rigid
• Pig Latin: sweet spot between map-reduce and SQL