Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins

Pig Latin: A Not-So-Foreign Language For Data Processing
Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research
Shimin Chen, Big Data Reading Group Presentation

Data Processing Renaissance
• Internet companies swimming in data
  • E.g. TBs/day at Yahoo!
• Data analysis is “inner loop” of product innovation
• Data analysts are skilled programmers

Data Warehousing …?
• Scale: often not scalable enough
• $: prohibitively expensive at web scale
  • Up to $200K/TB
• SQL: little control over execution method
  • Query optimization is hard
    • Parallel environment
    • Little or no statistics
    • Lots of UDFs

New Systems For Data Analysis
• Map-Reduce
• Apache Hadoop
• Dryad
• . . .

Map-Reduce
[Figure: Input records → map → reduce → Output records, with several map and reduce tasks in parallel]
Just a group-by-aggregate?
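The "group-by-aggregate" reading of map-reduce can be sketched in a few lines of Python. This is a toy, single-process illustration and not Hadoop's API; the names `map_reduce`, `emit_url`, and `count`, as well as the sample records, are assumptions made for this example.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-process map-reduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # The map function may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            groups[key].append(value)
    # The reduce function sees one key and all values emitted for it.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical visit records: (user, url).
visits = [("amy", "cnn.com"), ("bob", "cnn.com"), ("amy", "bbc.com")]

def emit_url(record):
    _user, url = record
    yield url, 1

def count(_key, values):
    return sum(values)

visit_counts = map_reduce(visits, emit_url, count)
print(visit_counts)
```

Seen this way, the map function chooses the grouping key and the reduce function is the aggregate, which is why a single job resembles one group-by-aggregate query.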

The Map-Reduce Appeal
• Scale: scalable due to simpler design
  • Only parallelizable operations
  • No transactions
• $: runs on cheap commodity hardware
• SQL: procedural control, a processing “pipe”

Disadvantages
1. Extremely rigid data flow (M → R)
   • Other flows constantly hacked in: Join, Union, Split, Chains
2. Common operations must be coded by hand
   • Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
   • Difficult to maintain, extend, and optimize

Pros And Cons
Need a high-level, general data flow language

Enter Pig Latin
Need a high-level, general data flow language: Pig Latin

Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin example
• Salient features
• Implementation

Example Data Analysis Task
Find the top 10 most visited pages in each category
[Input tables: Visits, Url Info]

Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info
Join on url → Group by category → Foreach category generate top10 urls

In Pig Latin
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';
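For readers new to Pig Latin, a rough in-memory Python analogue of the same data flow may help. It is only a sketch: the sample rows are invented, the function name `top_urls_per_category` is hypothetical, and it assumes each url appears at most once in the url info table.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for /data/visits and /data/urlInfo.
visits = [
    ("amy", "cnn.com", "8:00"),
    ("amy", "bbc.com", "10:00"),
    ("bob", "cnn.com", "9:00"),
    ("eve", "espn.com", "9:30"),
]
url_info = [
    ("cnn.com", "news", 0.9),
    ("bbc.com", "news", 0.8),
    ("espn.com", "sports", 0.7),
]

def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url; group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:  # inner join: keep urls present on both sides
            by_category[category].append((url, visit_counts[url]))
    # foreach category generate top(visitCounts, n)
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }

top_urls = top_urls_per_category(visits, url_info)
print(top_urls)
```

Each comment names the Pig Latin step that the following lines imitate, so the two versions can be read side by side.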

Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin example
• Salient features
• Implementation

Step-by-step Procedural Control
Target users are entrenched procedural programmers
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
  Jasmine Novak, Engineer, Yahoo!
“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.”
  David Ciemiewicz, Search Excellence, Yahoo!
• Automatic query optimization is hard
• Pig Latin does not preclude optimization

Quick Start and Interoperability
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';
Operates directly over files

Quick Start and Interoperability
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';
Schemas optional; can be assigned dynamically

User-Code as a First-Class Citizen
• User-defined functions (UDFs) can be used in every construct
  • Load, Store
  • Group, Filter, Foreach
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Nested Data Model
• Pig Latin has a fully-nestable data model with:
  • Atomic values, tuples, bags (lists), and maps
• More natural to programmers than flat tuples
• Avoids expensive joins
• See paper
[Figure: nested example with the values yahoo, finance, email, news]
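The four kinds of values can be mimicked with ordinary Python containers. The sketch below is an illustration only: it reuses the words from the slide's figure, the `properties` map and its `rank` key are invented for the example, and a Python list stands in for a bag.

```python
# Atom: a simple value such as a string or a number.
site = "yahoo"

# Bag: a collection of tuples; a list stands in for it here.
categories = [("finance",), ("email",), ("news",)]

# Map: keys mapped to values (hypothetical content).
properties = {"rank": 1}

# Tuple: an ordered sequence of fields, which may themselves be nested.
record = (site, categories, properties)

# A flat model needs one row per (site, category) pair, and a join
# to put the pieces of one site back together.
flat_rows = [(record[0], category) for (category,) in record[1]]
print(flat_rows)
```

Keeping the bag inside the tuple means everything about one site travels together, which is the sense in which nesting can avoid a join.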

Pig Latin Operators (from paper)
• Input/Output: Load, Store
• Operations on a single bag: Foreach, Filter, Order, Group, Distinct
• Operations on multiple bags: Co-group, Join, Union
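The difference between Co-group and Join is easiest to see on small data. The Python sketch below is an assumption-laden illustration, not Pig's implementation: `cogroup` and `join` are hypothetical helper names, rows are plain tuples, and the join shown is an inner equi-join built by flattening the co-grouped bags.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, left_key, right_key):
    """Group two bags by key, keeping each input's rows in its own nested bag."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[left_key(row)][0].append(row)
    for row in right:
        groups[right_key(row)][1].append(row)
    return [(key, lefts, rights) for key, (lefts, rights) in groups.items()]

def join(left, right, left_key, right_key):
    """Inner equi-join: co-group, then pair up the rows within each group."""
    return [
        l + r
        for _key, lefts, rights in cogroup(left, right, left_key, right_key)
        for l, r in product(lefts, rights)
    ]

# Hypothetical sample data.
visit_counts = [("cnn.com", 2), ("bbc.com", 1)]
url_info = [("cnn.com", "news", 0.9), ("espn.com", "sports", 0.7)]

def first_field(row):
    return row[0]

grouped = cogroup(visit_counts, url_info, first_field, first_field)
joined = join(visit_counts, url_info, first_field, first_field)
print(grouped)
print(joined)
```

Co-group keeps keys that appear on only one side, with an empty bag for the other side, while the join drops them because an empty bag contributes no pairs.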

Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin example
• Salient features
• Implementation

Implementation
[Figure labels: user; SQL; automatic rewrite + optimize; Pig; Hadoop Map-Reduce cluster]
Pig is open-source. http://incubator.apache.org/pig

Compilation into Map-Reduce
• Every group or join operation forms a map-reduce boundary
• Other operations pipelined into map and reduce phases
Map1: Load Visits, Group by url
Reduce1 / Map2: Foreach url generate count, Load Url Info, Join on url
Reduce2 / Map3: Group by category
Reduce3: Foreach category generate top10(urls)
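The boundary rule can be illustrated with a small planner over a linear chain of operators. This Python sketch is not Pig's compiler: `split_into_jobs` and the `boundaryN` labels are made up for the example, and the second input (Load Url Info) is left out so that the flow stays a simple chain.

```python
BOUNDARY_OPS = {"group", "join", "cogroup"}

def split_into_jobs(operators):
    """Label each operator of a linear chain with the phase that would run it."""
    plan = []
    job = 1
    phase = "map"
    for op in operators:
        kind = op.split()[0].lower()
        if kind in BOUNDARY_OPS:
            if phase == "reduce":
                job += 1  # a further group/join needs a new map-reduce job
            plan.append((f"boundary{job}", op))
            phase = "reduce"
        else:
            # Non-boundary operators are pipelined into the current phase.
            plan.append((f"{phase}{job}", op))
    return plan

flow = [
    "load visits",
    "group by url",
    "foreach url generate count",
    "join on url",
    "group by category",
    "foreach category generate top10",
]
plan = split_into_jobs(flow)
for stage, op in plan:
    print(stage, "|", op)
```

For this chain the sketch yields three jobs, one per group or join, with each foreach riding along in the reduce phase that follows its group.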

Pig Pen: Debugging Environment
[Figure from paper]

Usage
• First production release about a year ago
• 150+ early adopters within Yahoo!
• Over 25% of the Yahoo! map-reduce user base

Related Work
• Sawzall
  • Data processing language on top of map-reduce
  • Rigid structure of filtering followed by aggregation
• DryadLINQ
  • SQL-like language on top of Dryad
• Nested data models
  • Object-oriented databases

Distributed Sorting in DryadLINQ
    public static IQueryable<TSource> DSort<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        int pcount)
    {
        var samples = source.Apply(x => Sampling(x));
        var keys = samples.Apply(x => ComputeKeys(x, pcount));
        var parts = source.RangePartition(keySelector, keys);
        return parts.OrderBy(keySelector);
    }
From Mihai Budiu’s slides on “Cluster Computing with DryadLINQ”
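The same four steps (sample, compute range keys, range-partition, sort each part) can be written as a single-process Python sketch. It is an analogue for illustration and not the DryadLINQ API: `d_sort`, `sample_size`, and `seed` are assumed names, and the partitions are plain lists rather than distributed data sets.

```python
import bisect
import random

def d_sort(items, key, pcount, sample_size=100, seed=0):
    """Sample-based range partitioning followed by a per-partition sort."""
    rng = random.Random(seed)
    # Sampling: look at a subset of the keys.
    sampled = rng.sample(items, min(sample_size, len(items)))
    samples = sorted(key(x) for x in sampled)
    # ComputeKeys: pick pcount - 1 splitters from the sorted sample.
    splitters = (
        [samples[i * len(samples) // pcount] for i in range(1, pcount)]
        if samples
        else []
    )
    # RangePartition: route every item to the partition owning its key range.
    parts = [[] for _ in range(pcount)]
    for x in items:
        parts[bisect.bisect_right(splitters, key(x))].append(x)
    # OrderBy: sort each partition; read in order, the partitions are globally sorted.
    return [sorted(part, key=key) for part in parts]

data = [5, 3, 9, 1, 7, 2, 8]
parts = d_sort(data, key=lambda x: x, pcount=3)
flattened = [x for part in parts for x in part]
print(parts)
```

Because the partitions cover disjoint, increasing key ranges, each one can be sorted independently, which is what makes the approach parallelizable.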

Sawzall Example
    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: QueryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

Future Work
• Optional “safe” query optimizer
  • Performs only high-confidence rewrites
• User interface
  • Boxes and arrows UI
  • Promote collaboration, sharing code fragments and UDFs
• Tight integration with a scripting language
  • Use loops, conditionals of host language

Credits
Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich, Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi

Summary
• Big demand for parallel data processing
  • Emerging tools that do not look like SQL DBMS
  • Programmers like dataflow pipes over static files
    • Hence the excitement about Map-Reduce
• But, Map-Reduce is too low-level and rigid
Pig Latin: sweet spot between map-reduce and SQL
