DISTRIBUTED INFORMATION SYSTEMS
The Pig Experience: Building High-Level Dataflows on top of Map-Reduce
Presenter: Javeria Iqbal
Tutor: Dr. Martin Theobald
Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Optimization • Future Work
Data Processing Renaissance • Internet companies are swimming in data • TBs/day for Yahoo! or Google • PBs/day for Facebook • Data analysis is the "inner loop" of product innovation
Data Warehousing…? • High-level declarative approach (SQL) • Little control over the execution method • Prohibitively expensive at web scale: up to $200K/TB • Often not scalable enough
Map-Reduce • Map: performs filtering • Reduce: performs aggregation • Two high-level primitives that enable parallel processing • BUT no complex database operations, e.g., joins
Execution Overview of Map-Reduce (map phase) • The input is split, and the program starts a master and worker threads • Each map worker reads its split, parses key/value pairs, and passes them to the user-defined Map function • Buffered pairs are written to local disk partitions • The locations of the buffered pairs are sent to the reduce workers
Execution Overview of Map-Reduce (reduce phase) • Each reduce worker sorts the data by the intermediate keys • For each unique key, the key and its values are passed to the user's Reduce function • The output is appended to the output file for that reduce partition
The Map-Reduce Appeal • Scale: scalable due to a simpler design, an explicit programming model, and only parallelizable operations • Price: runs on cheap commodity hardware with less administration • Unlike SQL, procedural control: a processing "pipe"
Disadvantages • 1. Extremely rigid data flow (one Map followed by one Reduce); other flows, such as chains, joins, unions, and splits, must be hacked in • 2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct • 3. Semantics are hidden inside the map and reduce functions, making programs difficult to maintain, extend, and optimize • 4. No combined processing of multiple data sets (joins and other multi-input operations)
Motivation Need a high-level, general data flow language
Enter Pig Latin • Need a high-level, general data flow language: Pig Latin
Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Optimization • Future Work
Pig Latin: Data Types • A rich yet simple data model • Simple types: int, long, double, chararray, bytearray • Complex types: • Atom: a string or number, e.g., 'apple' • Tuple: a sequence of fields, e.g., ('apple', 'mango') • Bag: a collection of tuples, e.g., { ('apple', 'mango'), ('apple', ('red', 'yellow')) } • Map: a set of key/value pairs
Example: Data Model • Atom: a single atomic value, e.g., 'alice' • Tuple: a sequence of fields, e.g., ('alice', ('lakers', 'ipod')) • Bag: a collection of tuples with possible duplicates
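To make the model concrete, here is a minimal sketch of how these types appear in a declared schema (the file name and field names are hypothetical; untyped fields default to bytearray):

users = LOAD 'users.txt' AS (name: chararray, age: int, gpa: double, purchases: bag{t: tuple(item: chararray)}, props: map[]);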
Pig Latin: Input/Output Data
Input:
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Output:
STORE query_revenues INTO 'myoutput' USING myStore();
Pig Latin: General Syntax • Discarding unwanted data: FILTER • Comparison operators such as ==, eq, !=, neq • Logical connectors AND, OR, NOT
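As a minimal sketch, a FILTER over the queries relation loaded earlier (the 'bot' user id is a hypothetical example value):

real_queries = FILTER queries BY userId neq 'bot';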
Pig Latin: FOREACH with FLATTEN
Without FLATTEN (nested result):
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
With FLATTEN (nested bags unnested into tuples):
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
Pig Latin: COGROUP
Getting related data together: COGROUP
Suppose we have two data sets:
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString, revenue BY queryString;
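A typical next step, sketched here assuming a hypothetical user-defined function distributeRevenue that attributes revenue to urls, is to run a UDF over each group and flatten its output:

url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));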
Pig Latin: Map-Reduce
Map-Reduce in Pig Latin:
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);
Pig Latin: Other Commands • UNION: returns the union of two or more bags • CROSS: returns the cross product • ORDER: orders a bag by the specified field(s) • DISTINCT: eliminates duplicate tuples in a bag
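A brief sketch of these commands in use, assuming hypothetical relations good_urls and other_urls that both carry a pagerank field:

all_urls = UNION good_urls, other_urls;
unique_urls = DISTINCT all_urls;
ranked_urls = ORDER unique_urls BY pagerank DESC;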
Pig Latin: Nested Operations
grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue {
    top_slot = FILTER revenue BY adSlot eq 'top';
    GENERATE queryString, SUM(top_slot.amount), SUM(revenue.amount);
};
Pig Latin: Example 1
Suppose we have a table urls: (url, category, pagerank).
A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000
Data Flow • Filter good_urls by pagerank > 0.2 • Group by category • Filter category by count > 10^6 • Foreach category, generate avg. pagerank
Equivalent Pig Latin
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Example 2: Data Analysis Task
Find the top 10 most visited pages in each category, given two data sets: Visits (user, url, time) and Url Info (url, category, pagerank).
Data Flow • Load Visits • Group by url • Foreach url, generate count • Load Url Info • Join on url • Group by category • Foreach category, generate top-10 urls
Equivalent Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
Quick Start and Interoperability • Operates directly over files (see the script above); no separate data import step
Quick Start and Interoperability • Schemas are optional and can be assigned dynamically (the same script works with or without the "as" clauses)
User-Code as a First-Class Citizen • User-defined functions (UDFs) can be used in every construct • Load, Store • Group, Filter, Foreach (e.g., top() in the script above)
Nested Data Model • Pig Latin has a fully nested data model with: • Atomic values, tuples, bags (lists), and maps • Avoids expensive joins [figure: a nested value, e.g., 'yahoo' mapped to finance, email, news]
Nested Data Model • Decouples grouping as an independent operation (group by url) • Common case: aggregation on these nested sets • Power users: sophisticated UDFs, e.g., sequence analysis • Efficient implementation (see paper) • "I frankly like Pig much better than SQL in some respects (group + optional flatten); I love nested data structures." — Ted Dunning, Chief Scientist, Veoh
CoGroup • Example: COGROUP of results and revenue by queryString • Taking the cross product of the two bags within each group would give the natural join
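As a sketch of this equivalence: a JOIN can be expressed as a COGROUP followed by flattening both bags, which forms their cross product within each group.

grouped = COGROUP result BY queryString, revenue BY queryString;
joined = FOREACH grouped GENERATE FLATTEN(result), FLATTEN(revenue);
-- equivalent to: JOIN result BY queryString, revenue BY queryString;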
Pig Features • Explicit data flow language, unlike SQL • Procedural control, but at a higher level than raw Map-Reduce • Quick start and interoperability • Several execution modes (interactive, batch, embedded) • User-defined functions • Nested data model
Outline Map-Reduce and the need for Pig Latin Pig Latin Compilation into Map-Reduce Optimization Future Work
Pig Process Life Cycle • Parser: Pig Latin → logical plan • Logical optimizer • Logical plan → physical plan • Map-Reduce compiler • Map-Reduce optimizer • Hadoop Job Manager
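Each stage of this pipeline can be inspected from Pig's Grunt shell with the EXPLAIN command, which prints the logical, physical, and map-reduce plans for a given alias, e.g. for the earlier example:

EXPLAIN topUrls;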
Pig Latin to Logical Plan
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
[figure: logical plan — two LOADs; FILTER over A; JOIN of C and B on (x, u); GROUP by z; FOREACH projecting (group, count); STORE]
Logical Plan to Physical Plan • Most logical operators (LOAD, FILTER, FOREACH, STORE) map one-to-one to physical operators • JOIN expands into LOCAL REARRANGE → GLOBAL REARRANGE → PACKAGE → FOREACH • GROUP expands into LOCAL REARRANGE → GLOBAL REARRANGE → PACKAGE
Physical Plan to Map-Reduce Plan • Each GLOBAL REARRANGE becomes a map-reduce boundary • Operators up to and including LOCAL REARRANGE run in a map (e.g., FILTER, LOCAL REARRANGE) • PACKAGE and the following FOREACH run in the corresponding reduce
Implementation • Layered architecture: SQL (automatic rewrite + optimize) → Pig → Hadoop Map-Reduce cluster • Pig is open-source: http://hadoop.apache.org/pig • ~50% of Hadoop jobs at Yahoo! are Pig • 1000s of jobs per day
Compilation into Map-Reduce • Every group or join operation forms a map-reduce boundary • Other operations are pipelined into the map and reduce phases • Map1/Reduce1: Load Visits, group by url, foreach url generate count • Map2/Reduce2: Load Url Info, join on url • Map3/Reduce3: group by category, foreach category generate top10(urls)
Nested Sub-Plans • A SPLIT operator feeds a tuple stream into multiple nested sub-plans (e.g., several FILTERs, each with its own downstream pipeline of LOCAL REARRANGE and FOREACH) • On the reduce side, a MULTIPLEX operator routes tuples from the shared GLOBAL REARRANGE/PACKAGE stage to the correct nested sub-plan
Outline Map-Reduce and the need for Pig Latin Pig Latin Compilation into Map-Reduce Optimization Future Work
Using the Combiner • Can pre-process data on the map side to reduce the data shipped • Works for algebraic aggregation functions • Also used for DISTINCT processing
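For example, COUNT is algebraic (a total count can be computed from partial counts), so in the sketch below Pig can apply the combiner to pre-aggregate on the map side; visits is the relation from the earlier example:

gVisits = GROUP visits BY url;
visitCounts = FOREACH gVisits GENERATE group, COUNT(visits); -- partial counts in the combiner, summed in the reducer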
Skew Join • The default join method is a symmetric hash join; the cross product for each key is carried out on a single reducer • This is a problem if some keys have too many values • Skew join samples the data to find frequent keys, then splits their tuples among multiple reducers
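In later Pig releases this strategy can be requested explicitly with a join hint; a sketch using the relations from Example 2:

heavy_join = JOIN visitCounts BY url, urlInfo BY url USING 'skewed';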
Fragment-Replicate Join • Symmetric hash join repartitions both inputs • If size(data set 1) >> size(data set 2), just replicate data set 2 to all partitions of data set 1 • Translates to a map-only job • Data set 2 is opened as a "side file"
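A sketch of the corresponding join hint; Pig replicates the relation listed last, so the small data set goes on the right (the relation names here are hypothetical):

joined = JOIN big_table BY url, tiny_table BY url USING 'replicated';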
Merge Join • Exploits the case where both data sets are already sorted on the join key • Again a map-only job • The other data set is opened as a "side file"
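A sketch of the corresponding hint, assuming both (hypothetical) inputs are already sorted on url:

joined = JOIN sorted_visits BY url, sorted_urlInfo BY url USING 'merge';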