DISTRIBUTED INFORMATION SYSTEMS
The Pig Experience: Building High-Level Dataflows on top of Map-Reduce
Presenter: Javeria Iqbal
Tutor: Dr. Martin Theobald
Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Optimization • Future Work
Data Processing Renaissance • Internet companies are swimming in data • TBs/day for Yahoo! or Google • PBs/day for Facebook • Data analysis is the "inner loop" of product innovation
Data Warehousing…? • High-level declarative approach (SQL) • Little control over the execution method • Prohibitively expensive at web scale: up to $200K/TB • Often not scalable enough
Map-Reduce • Map: performs filtering • Reduce: performs aggregation • Two high-level primitives that enable parallel processing • BUT no complex database operations, e.g., joins
Execution Overview of Map-Reduce (map phase) • The input is split, and the program starts a master and worker threads • Each map worker reads its split, parses key/value pairs, and passes them to the user-defined Map function • Buffered pairs are written to local disk partitions • The locations of the buffered pairs are sent to the reduce workers
Execution Overview of Map-Reduce (reduce phase) • Each reduce worker sorts the data by the intermediate keys • For each unique key, the key and its values are passed to the user's Reduce function • The output is appended to the output file for that reduce partition
The Map-Reduce Appeal • Scale: scalable due to a simpler design, an explicit programming model, and only parallelizable operations • Price: runs on cheap commodity hardware with less administration • Unlike SQL, procedural control: a processing "pipe"
Disadvantages • 1. Extremely rigid data flow (one Map followed by one Reduce); other flows, such as chains, joins, unions, and splits, must be hacked in • 2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct • 3. Semantics are hidden inside the map and reduce functions, making programs difficult to maintain, extend, and optimize • 4. No combined processing of multiple data sets (joins and other multi-input operations)
Motivation Need a high-level, general data flow language
Enter Pig Latin • Need a high-level, general data flow language: Pig Latin
Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Optimization • Future Work
Pig Latin: Data Types • A rich yet simple data model • Simple types: int, long, double, chararray, bytearray • Complex types: • Atom: a string or number, e.g., 'apple' • Tuple: a sequence of fields, e.g., ('apple', 'mango') • Bag: a collection of tuples, e.g., { ('apple', 'mango'), ('apple', ('red', 'yellow')) } • Map: a set of key/value pairs
Example: Data Model • Atom: a single atomic value, e.g., 'alice' • Tuple: a sequence of fields, e.g., ('alice', ('lakers', 'ipod')) • Bag: a collection of tuples with possible duplicates
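To make the model concrete, here is a minimal sketch of how these types appear in a declared schema (the file name and field names are hypothetical; untyped fields default to bytearray):

users = LOAD 'users.txt' AS (name: chararray, age: int, gpa: double, purchases: bag{t: tuple(item: chararray)}, props: map[]);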
Pig Latin: Input/Output Data
Input:
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Output:
STORE query_revenues INTO 'myoutput' USING myStore();
Pig Latin: General Syntax • Discarding unwanted data: FILTER • Comparison operators such as ==, eq, !=, neq • Logical connectors AND, OR, NOT
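As a minimal sketch, a FILTER over the queries relation loaded earlier (the 'bot' user id is a hypothetical example value):

real_queries = FILTER queries BY userId neq 'bot';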
Pig Latin: FOREACH with FLATTEN
Without FLATTEN (nested result):
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
With FLATTEN (nested bags unnested into tuples):
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
Pig Latin: COGROUP
Getting related data together: COGROUP
Suppose we have two data sets:
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString, revenue BY queryString;
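A typical next step, sketched here assuming a hypothetical user-defined function distributeRevenue that attributes revenue to urls, is to run a UDF over each group and flatten its output:

url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));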
Pig Latin: Map-Reduce
Map-Reduce in Pig Latin:
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);
Pig Latin: Other Commands • UNION: returns the union of two or more bags • CROSS: returns the cross product • ORDER: orders a bag by the specified field(s) • DISTINCT: eliminates duplicate tuples in a bag
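A brief sketch of these commands in use, assuming hypothetical relations good_urls and other_urls that both carry a pagerank field:

all_urls = UNION good_urls, other_urls;
unique_urls = DISTINCT all_urls;
ranked_urls = ORDER unique_urls BY pagerank DESC;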
Pig Latin: Nested Operations
grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue {
    top_slot = FILTER revenue BY adSlot eq 'top';
    GENERATE queryString, SUM(top_slot.amount), SUM(revenue.amount);
};
Pig Latin: Example 1
Suppose we have a table urls: (url, category, pagerank).
A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000
Data Flow • Filter good_urls by pagerank > 0.2 • Group by category • Filter category by count > 10^6 • Foreach category, generate avg. pagerank
Equivalent Pig Latin
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Example 2: Data Analysis Task
Find the top 10 most visited pages in each category, given two data sets: Visits (user, url, time) and Url Info (url, category, pagerank).
Data Flow • Load Visits • Group by url • Foreach url, generate count • Load Url Info • Join on url • Group by category • Foreach category, generate top-10 urls
Equivalent Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
Quick Start and Interoperability • Operates directly over files (see the script above); no separate data import step
Quick Start and Interoperability • Schemas are optional and can be assigned dynamically (the same script works with or without the "as" clauses)
User-Code as a First-Class Citizen • User-defined functions (UDFs) can be used in every construct • Load, Store • Group, Filter, Foreach (e.g., top() in the script above)
Nested Data Model • Pig Latin has a fully nested data model with: • Atomic values, tuples, bags (lists), and maps • Avoids expensive joins [figure: a nested value, e.g., 'yahoo' mapped to finance, email, news]
Nested Data Model • Decouples grouping as an independent operation (group by url) • Common case: aggregation on these nested sets • Power users: sophisticated UDFs, e.g., sequence analysis • Efficient implementation (see paper) • "I frankly like Pig much better than SQL in some respects (group + optional flatten); I love nested data structures." — Ted Dunning, Chief Scientist, Veoh
CoGroup • Example: COGROUP of results and revenue by queryString • Taking the cross product of the two bags within each group would give the natural join
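As a sketch of this equivalence: a JOIN can be expressed as a COGROUP followed by flattening both bags, which forms their cross product within each group.

grouped = COGROUP result BY queryString, revenue BY queryString;
joined = FOREACH grouped GENERATE FLATTEN(result), FLATTEN(revenue);
-- equivalent to: JOIN result BY queryString, revenue BY queryString;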
Pig Features • Explicit data flow language, unlike SQL • Procedural control, but at a higher level than raw Map-Reduce • Quick start and interoperability • Several execution modes (interactive, batch, embedded) • User-defined functions • Nested data model
Outline Map-Reduce and the need for Pig Latin Pig Latin Compilation into Map-Reduce Optimization Future Work
Pig Process Life Cycle • Parser: Pig Latin → logical plan • Logical optimizer • Logical plan → physical plan • Map-Reduce compiler • Map-Reduce optimizer • Hadoop Job Manager
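Each stage of this pipeline can be inspected from Pig's Grunt shell with the EXPLAIN command, which prints the logical, physical, and map-reduce plans for a given alias, e.g. for the earlier example:

EXPLAIN topUrls;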
Pig Latin to Logical Plan
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
[figure: logical plan — two LOADs; FILTER over A; JOIN of C and B on (x, u); GROUP by z; FOREACH projecting (group, count); STORE]
Logical Plan to Physical Plan • Most logical operators (LOAD, FILTER, FOREACH, STORE) map one-to-one to physical operators • JOIN expands into LOCAL REARRANGE → GLOBAL REARRANGE → PACKAGE → FOREACH • GROUP expands into LOCAL REARRANGE → GLOBAL REARRANGE → PACKAGE
Physical Plan to Map-Reduce Plan • Each GLOBAL REARRANGE becomes a map-reduce boundary • Operators up to and including LOCAL REARRANGE run in a map (e.g., FILTER, LOCAL REARRANGE) • PACKAGE and the following FOREACH run in the corresponding reduce
Implementation • Layered architecture: SQL (automatic rewrite + optimize) → Pig → Hadoop Map-Reduce cluster • Pig is open-source: http://hadoop.apache.org/pig • ~50% of Hadoop jobs at Yahoo! are Pig • 1000s of jobs per day
Compilation into Map-Reduce • Every group or join operation forms a map-reduce boundary • Other operations are pipelined into the map and reduce phases • Map1/Reduce1: Load Visits, group by url, foreach url generate count • Map2/Reduce2: Load Url Info, join on url • Map3/Reduce3: group by category, foreach category generate top10(urls)
Nested Sub-Plans • A SPLIT operator feeds a tuple stream into multiple nested sub-plans (e.g., several FILTERs, each with its own downstream pipeline of LOCAL REARRANGE and FOREACH) • On the reduce side, a MULTIPLEX operator routes tuples from the shared GLOBAL REARRANGE/PACKAGE stage to the correct nested sub-plan
Outline Map-Reduce and the need for Pig Latin Pig Latin Compilation into Map-Reduce Optimization Future Work
Using the Combiner • Can pre-process data on the map side to reduce the data shipped • Works for algebraic aggregation functions • Also used for DISTINCT processing
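For example, COUNT is algebraic (a total count can be computed from partial counts), so in the sketch below Pig can apply the combiner to pre-aggregate on the map side; visits is the relation from the earlier example:

gVisits = GROUP visits BY url;
visitCounts = FOREACH gVisits GENERATE group, COUNT(visits); -- partial counts in the combiner, summed in the reducer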
Skew Join • The default join method is a symmetric hash join; the cross product for each key is carried out on a single reducer • This is a problem if some keys have too many values • Skew join samples the data to find frequent keys, then splits their tuples among multiple reducers
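In later Pig releases this strategy can be requested explicitly with a join hint; a sketch using the relations from Example 2:

heavy_join = JOIN visitCounts BY url, urlInfo BY url USING 'skewed';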
Fragment-Replicate Join • Symmetric hash join repartitions both inputs • If size(data set 1) >> size(data set 2), just replicate data set 2 to all partitions of data set 1 • Translates to a map-only job • Data set 2 is opened as a "side file"
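A sketch of the corresponding join hint; Pig replicates the relation listed last, so the small data set goes on the right (the relation names here are hypothetical):

joined = JOIN big_table BY url, tiny_table BY url USING 'replicated';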
Merge Join • Exploits the case where both data sets are already sorted on the join key • Again a map-only job • The other data set is opened as a "side file"
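A sketch of the corresponding hint, assuming both (hypothetical) inputs are already sorted on url:

joined = JOIN sorted_visits BY url, sorted_urlInfo BY url USING 'merge';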