Presentation Transcript

  1. Pig Latin: A Not-So-Foreign Language For Data Processing
     Research: Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
     Shimin Chen, Big Data Reading Group Presentation

  2. Data Processing Renaissance • Internet companies swimming in data • E.g. TBs/day at Yahoo! • Data analysis is “inner loop” of product innovation • Data analysts are skilled programmers

  3. Data Warehousing …?
     • Scale: often not scalable enough
     • $: prohibitively expensive at web scale (up to $200K/TB)
     • SQL: little control over execution method
     • Query optimization is hard
       • Parallel environment
       • Little or no statistics
       • Lots of UDFs

  4. New Systems For Data Analysis • Map-Reduce • Apache Hadoop • Dryad . . .

  5. Map-Reduce
     • [Figure: input records pass through parallel map tasks, then reduce tasks, producing output records]
     • Just a group-by-aggregate?
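
The "group-by-aggregate" reading of map-reduce can be sketched in a few lines of Python. This is a minimal single-process illustration, not how Hadoop is implemented; the `map_reduce` helper and the sample `visits` records are hypothetical names introduced only for this sketch.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-process map-reduce: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in map_fn(record):
            # Shuffle: values are grouped by key (the "group-by").
            groups[key].append(value)
    # Reduce phase: one call per distinct key (the "aggregate").
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (user, url) visit records.
visits = [("amy", "cnn.com"), ("bob", "cnn.com"), ("amy", "bbc.com")]

counts = map_reduce(
    visits,
    map_fn=lambda rec: [(rec[1], 1)],        # key by url, emit a 1 per visit
    reduce_fn=lambda url, ones: sum(ones),   # count visits per url
)
print(counts)
```

Counting visits per URL is exactly a group-by on `url` followed by a `sum`, which is why the slide asks whether map-reduce is more than that.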

  6. The Map-Reduce Appeal
     • Scale: scalable due to simpler design
       • Only parallelizable operations
       • No transactions
     • $: runs on cheap commodity hardware
     • SQL: procedural control, a processing “pipe”

  7. Disadvantages
     1. Extremely rigid data flow (map, then reduce)
        • Other flows constantly hacked in: join, union, split, chains
     2. Common operations must be coded by hand
        • Join, filter, projection, aggregates, sorting, distinct
     3. Semantics hidden inside map-reduce functions
        • Difficult to maintain, extend, and optimize

  8. Pros And Cons Need a high-level, general data flow language

  9. Enter Pig Latin • Need a high-level, general data flow language: Pig Latin

  10. Outline • Map-Reduce and the need for Pig Latin • Pig Latin example • Salient features • Implementation

  11. Example Data Analysis Task
      • Find the top 10 most visited pages in each category
      • Input tables: Visits, Url Info

  12. Data Flow
      • Load Visits → Group by url → Foreach url generate count
      • Load Url Info
      • Join on url → Group by category → Foreach category generate top10 urls

  13. In Pig Latin
      visits = load '/data/visits' as (user, url, time);
      gVisits = group visits by url;
      visitCounts = foreach gVisits generate url, count(visits);
      urlInfo = load '/data/urlInfo' as (url, category, pRank);
      visitCounts = join visitCounts by url, urlInfo by url;
      gCategories = group visitCounts by category;
      topUrls = foreach gCategories generate top(visitCounts,10);
      store topUrls into '/data/topUrls';
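
To make the data flow concrete, here is a rough in-memory Python equivalent of the Pig Latin program on this slide. It is only a sketch under stated assumptions: the sample rows are invented, the function name `top_urls_per_category` is hypothetical, each URL is assumed to have at most one Url Info row, and ties are broken alphabetically (the slide does not say how `top` handles ties).

```python
from collections import Counter, defaultdict

# Hypothetical sample data with the schemas from the slide.
visits = [  # (user, url, time)
    ("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.com", 3),
    ("eve", "espn.com", 4), ("bob", "espn.com", 5), ("amy", "espn.com", 6),
]
url_info = [  # (url, category, pRank)
    ("cnn.com", "news", 0.9), ("bbc.com", "news", 0.8),
    ("espn.com", "sports", 0.7),
]

def top_urls_per_category(visits, url_info, n):
    # group visits by url; foreach url generate count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join on url (assumes one url_info row per url)
    category_of = {url: category for url, category, _rank in url_info}
    # group by category
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            by_category[category_of[url]].append((url, count))
    # foreach category generate the n most visited urls
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }

top = top_urls_per_category(visits, url_info, n=10)
print(top)
```

Each Pig Latin statement maps to one commented step, which is the "step-by-step" property the talk emphasizes; the Python version, unlike Pig, runs on a single machine.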

  14. Outline • Map-Reduce and the need for Pig Latin • Pig Latin example • Salient features • Implementation

  15. Step-by-step Procedural Control
      • Target users are entrenched procedural programmers
      • “The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” (Jasmine Novak, Engineer, Yahoo!)
      • “With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.” (David Ciemiewicz, Search Excellence, Yahoo!)
      • Automatic query optimization is hard
      • Pig Latin does not preclude optimization

  16. Quick Start and Interoperability
      visits = load '/data/visits' as (user, url, time);
      gVisits = group visits by url;
      visitCounts = foreach gVisits generate url, count(visits);
      urlInfo = load '/data/urlInfo' as (url, category, pRank);
      visitCounts = join visitCounts by url, urlInfo by url;
      gCategories = group visitCounts by category;
      topUrls = foreach gCategories generate top(visitCounts,10);
      store topUrls into '/data/topUrls';
      • Operates directly over files

  17. Quick Start and Interoperability
      visits = load '/data/visits' as (user, url, time);
      gVisits = group visits by url;
      visitCounts = foreach gVisits generate url, count(visits);
      urlInfo = load '/data/urlInfo' as (url, category, pRank);
      visitCounts = join visitCounts by url, urlInfo by url;
      gCategories = group visitCounts by category;
      topUrls = foreach gCategories generate top(visitCounts,10);
      store topUrls into '/data/topUrls';
      • Schemas optional; can be assigned dynamically

  18. User-Code as a First-Class Citizen
      • User-defined functions (UDFs) can be used in every construct
        • Load, Store
        • Group, Filter, Foreach
      visits = load '/data/visits' as (user, url, time);
      gVisits = group visits by url;
      visitCounts = foreach gVisits generate url, count(visits);
      urlInfo = load '/data/urlInfo' as (url, category, pRank);
      visitCounts = join visitCounts by url, urlInfo by url;
      gCategories = group visitCounts by category;
      topUrls = foreach gCategories generate top(visitCounts,10);
      store topUrls into '/data/topUrls';

  19. Nested Data Model
      • Pig Latin has a fully-nestable data model with:
        • Atomic values, tuples, bags (lists), and maps
      • More natural to programmers than flat tuples
      • Avoids expensive joins
      • See paper
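
As a loose analogy, the nested data model can be mimicked with Python built-ins: tuples for tuples, lists for bags, and dicts for maps. The sketch below is an assumption-laden illustration, not Pig's actual representation; the `group_by` helper and all sample values are hypothetical.

```python
from collections import defaultdict

# A tuple whose fields are an atom, a map, and a bag of tuples
# (values invented for illustration).
user_record = ("amy", {"age": 20}, [("cnn.com", 1), ("bbc.com", 3)])

def group_by(bag, key_index):
    """Group a bag of tuples; each result tuple nests a bag of the members."""
    groups = defaultdict(list)
    for tup in bag:
        groups[tup[key_index]].append(tup)
    return [(key, members) for key, members in groups.items()]

visits = [("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.com", 3)]
grouped = group_by(visits, key_index=1)
print(grouped)
```

Because each group carries its member tuples as a nested bag, later steps can aggregate over that bag directly instead of joining back to a flat table.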

  20. Pig Latin Operators (from paper)
      • Input/Output: Load, Store
      • Operations on a single bag: Foreach, Filter, Order, Group, Distinct
      • Operations on multiple bags: Co-group, Join, Union
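
The relationship between two of the multi-bag operators can be sketched in Python. The Pig Latin paper describes join as co-group followed by flattening; the code below illustrates that idea on in-memory lists. The function names, key-index parameters, and sample rows are assumptions made for this sketch and do not reflect Pig's implementation.

```python
from collections import defaultdict

def cogroup(left, right, left_key, right_key):
    """Return (key, bag_from_left, bag_from_right) for every key in either input."""
    groups = defaultdict(lambda: ([], []))
    for tup in left:
        groups[tup[left_key]][0].append(tup)
    for tup in right:
        groups[tup[right_key]][1].append(tup)
    return [(key, lbag, rbag) for key, (lbag, rbag) in groups.items()]

def join(left, right, left_key, right_key):
    """Equi-join expressed as cogroup followed by flattening the two bags."""
    return [
        ltup + rtup
        for _key, lbag, rbag in cogroup(left, right, left_key, right_key)
        for ltup in lbag
        for rtup in rbag
    ]

# Hypothetical inputs.
visit_counts = [("cnn.com", 2), ("bbc.com", 1)]
url_info = [("cnn.com", "news", 0.9), ("bbc.com", "news", 0.8),
            ("espn.com", "sports", 0.7)]

cogrouped = cogroup(visit_counts, url_info, 0, 0)
joined = join(visit_counts, url_info, 0, 0)
print(joined)
```

Note that `cogroup` keeps `espn.com` with an empty left bag, while `join` drops it because the cross product of an empty bag with anything is empty.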

  21. Outline • Map-Reduce and the need for Pig Latin • Pig Latin example • Novel features • Implementation

  22. Implementation
      • [Figure: the user submits SQL (automatic rewrite + optimize) or Pig; either way the work runs on a Hadoop Map-Reduce cluster]
      • Pig is open-source: http://incubator.apache.org/pig

  23. Compilation into Map-Reduce
      • Every group or join operation forms a map-reduce boundary
      • Other operations pipelined into map and reduce phases
      • Map1: Load Visits, Group by url
      • Reduce1 / Map2: Foreach url generate count, Load Url Info, Join on url
      • Reduce2 / Map3: Group by category
      • Reduce3: Foreach category generate top10(urls)
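
The boundary rule on this slide can be illustrated with a small Python sketch that cuts a plan into map-reduce jobs. It is a deliberate simplification, not Pig's compiler: the plan is assumed to be a linear list of operator strings (so the second load is folded into the join step), and `split_into_jobs` and `BOUNDARY_OPS` are hypothetical names.

```python
BOUNDARY_OPS = ("group", "join")  # operators that need a shuffle

def split_into_jobs(plan):
    """Cut a linear plan into map-reduce jobs at every group or join."""
    jobs = [{"map": [], "reduce": []}]
    in_reduce = False
    for op in plan:
        if op.split()[0] in BOUNDARY_OPS:
            if in_reduce:
                # The current job already shuffled once, so start a new job.
                jobs.append({"map": [], "reduce": []})
            jobs[-1]["map"].append(op)   # shuffle key is produced map-side
            in_reduce = True
        elif in_reduce:
            jobs[-1]["reduce"].append(op)  # pipelined into the reduce phase
        else:
            jobs[-1]["map"].append(op)     # pipelined into the map phase
    return jobs

plan = [
    "load visits",
    "group by url",
    "foreach url generate count",
    "join with urlInfo on url",
    "group by category",
    "foreach category generate top10",
]
jobs = split_into_jobs(plan)
for number, job in enumerate(jobs, start=1):
    print(number, job)
```

On the example plan this yields three jobs, matching the Map1/Reduce1 through Map3/Reduce3 layout on the slide, with the second job's reduce phase left empty in this simplified model.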

  24. Pig Pen: Debugging Environment From paper

  25. Usage • First production release about a year ago • 150+ early adopters within Yahoo! • Over 25% of the Yahoo! map-reduce user base

  26. Related Work
      • Sawzall
        • Data processing language on top of map-reduce
        • Rigid structure of filtering followed by aggregation
      • DryadLINQ
        • SQL-like language on top of Dryad
      • Nested data models
        • Object-oriented databases

  27. Distributed Sorting in DryadLINQ
      public static IQueryable<TSource> DSort<TSource, TKey>(
          this IQueryable<TSource> source,
          Expression<Func<TSource, TKey>> keySelector,
          int pcount)
      {
          var samples = source.Apply(x => Sampling(x));
          var keys = samples.Apply(x => ComputeKeys(x, pcount));
          var parts = source.RangePartition(keySelector, keys);
          return parts.OrderBy(keySelector);
      }
      From Mihai Budiu’s slides on “Cluster Computing with DryadLINQ”
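
The same four steps (sample, compute split keys, range-partition, sort each partition) can be sketched on a single machine in Python. This is an illustration of the general sample-based range-partitioning idea under assumptions, not DryadLINQ's implementation; `range_partition_sort`, `sample_size`, and `seed` are hypothetical names.

```python
import bisect
import random

def range_partition_sort(items, key, pcount, sample_size=100, seed=0):
    """Sort by sampling keys, range-partitioning, then sorting each part."""
    if not items:
        return [[] for _ in range(pcount)]
    rng = random.Random(seed)
    # 1. Sample the input and sort the sampled keys.
    sample = sorted(key(x) for x in rng.sample(items, min(sample_size, len(items))))
    # 2. Pick pcount - 1 split keys at evenly spaced positions in the sample.
    splits = [sample[len(sample) * i // pcount] for i in range(1, pcount)]
    # 3. Range-partition: partition i only holds keys <= every key in partition i + 1.
    parts = [[] for _ in range(pcount)]
    for x in items:
        parts[bisect.bisect_right(splits, key(x))].append(x)
    # 4. Sort each partition independently (the parallelizable step).
    return [sorted(part, key=key) for part in parts]

data = list(range(50))
random.Random(1).shuffle(data)
parts = range_partition_sort(data, key=lambda x: x, pcount=4)
flattened = [x for part in parts for x in part]
print(flattened == sorted(data))
```

Because the partitions cover disjoint, ordered key ranges, concatenating the individually sorted partitions gives a globally sorted result with no merge step.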

  28. Sawzall Example
      proto "querylog.proto"
      queries_per_degree: table sum[lat: int][lon: int] of int;
      log_record: QueryLogProto = input;
      loc: Location = locationinfo(log_record.ip);
      emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
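
For readers unfamiliar with Sawzall, the program above amounts to a per-record filter step feeding a sum table indexed by integer latitude and longitude. A rough Python analogue is sketched below; the `HYPOTHETICAL_LOCATIONS` lookup stands in for `locationinfo` and its IP addresses and coordinates are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for Sawzall's locationinfo(): ip -> (lat, lon).
HYPOTHETICAL_LOCATIONS = {
    "1.2.3.4": (37.4, -122.1),
    "5.6.7.8": (37.9, -122.6),
    "9.9.9.9": (51.5, -0.1),
}

def queries_per_degree(ips):
    """Count queries per (integer latitude, integer longitude) cell."""
    table = Counter()
    for ip in ips:
        lat, lon = HYPOTHETICAL_LOCATIONS[ip]
        # Equivalent of: emit queries_per_degree[int(lat)][int(lon)] <- 1
        table[(int(lat), int(lon))] += 1
    return table

table = queries_per_degree(["1.2.3.4", "5.6.7.8", "9.9.9.9"])
print(table)
```

The rigid shape mentioned under Related Work is visible here: the per-record logic can only emit values into a predeclared aggregation table.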

  29. Future Work
      • Optional “safe” query optimizer
        • Performs only high-confidence rewrites
      • User interface
        • Boxes and arrows UI
        • Promote collaboration, sharing code fragments and UDFs
      • Tight integration with a scripting language
        • Use loops, conditionals of host language

  30. Credits • Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich, Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi

  31. Summary
      • Big demand for parallel data processing
        • Emerging tools that do not look like SQL DBMS
        • Programmers like dataflow pipes over static files
        • Hence the excitement about Map-Reduce
      • But Map-Reduce is too low-level and rigid
      • Pig Latin: sweet spot between map-reduce and SQL