Nova: Continuous Pig/Hadoop Workflows

The storage & processing stack:
• workflow manager, e.g. Nova
• dataflow programming framework, e.g. Pig
• distributed sorting & hashing, e.g. Map-Reduce
• scalable file system, e.g. HDFS
Nova Overview
• Nova: a system for batched incremental processing.
• Scenarios at Yahoo:
  • Ingesting and analyzing user behavior logs
  • Building and updating a search index from a stream of crawled web pages
  • Processing semi-structured data (news, blogs, etc.)
• Two-layer programming model (Nova over Pig):
  • Continuous processing
  • Independent scheduling
  • Cross-module optimization
  • Manageability features
Continuous Processing
• Nova: the outer workflow-manager layer; deals with graphs of interconnected Pig programs, with data passing between them in a continuous fashion.
• Pig/Hadoop: the inner layer; merely transforms static input data into static output data.
• Nova keeps track of "delta" data and routes it to the workflow components in the right order.
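To make the division of labor concrete, here is a minimal Python sketch (all names hypothetical, not Nova's actual API) of the outer layer's bookkeeping: a cursor per (task, channel) pair lets Nova hand each invocation only the blocks that arrived since the last run.

```python
# Minimal sketch of Nova-style delta routing (hypothetical names).
# Each channel is an append-only list of data blocks; a cursor per
# (task, channel) pair records how far each task has read.

class Channel:
    def __init__(self, name):
        self.name = name
        self.blocks = []          # append-only list of data blocks

    def append(self, block):
        self.blocks.append(block)

class WorkflowManager:
    def __init__(self):
        self.cursors = {}         # (task_name, channel_name) -> index

    def new_blocks(self, task_name, channel):
        """Return only the blocks this task has not yet consumed."""
        key = (task_name, channel.name)
        start = self.cursors.get(key, 0)
        delta = channel.blocks[start:]
        self.cursors[key] = len(channel.blocks)   # advance the cursor
        return delta

# Usage: the inner layer (a Pig program, simulated here by a function)
# only ever sees static input -> static output; continuity lives outside.
pages = Channel("crawled_pages")
nova = WorkflowManager()
pages.append(["page1", "page2"])
print(nova.new_blocks("dedup", pages))   # [['page1', 'page2']]
pages.append(["page3"])
print(nova.new_blocks("dedup", pages))   # [['page3']] -- only the delta
```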
Independent Scheduling
Different portions of a workflow may be scheduled at different times/rates.
• Global link analysis algorithms may be run only occasionally, due to their cost and consumers' tolerance for staleness.
• The components that ingest, tag, and index new news articles need to operate continuously.
Cross-module optimization
• Nova can identify and exploit certain optimization opportunities, e.g.:
  • Shared scans: two components read the same input data at the same time.
  • Pipelining: the output of one module feeds directly into a subsequent module => avoid materializing the intermediate result.
Manageability features
• Manage workflow programming and execution.
• Support debugging; keep track of versions of workflow components.
• Capture data provenance and emit notifications of key events.
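As a toy illustration of the pipelining opportunity (illustrative names, not Nova code), streaming records through two composed steps avoids writing the intermediate result out:

```python
# Sketch of the pipelining optimization: instead of materializing the
# output of step1 and re-reading it in step2, stream records through
# both steps with generators. Names are illustrative only.

def step1(records):
    for r in records:
        yield r.lower()            # first module's transformation

def step2(records):
    for r in records:
        if "news" in r:            # second module filters
            yield r

# Unoptimized: materialize the intermediate result (an extra HDFS file
# in the real system), then scan it again.
intermediate = list(step1(["Tech NEWS", "Sports"]))
result = list(step2(intermediate))

# Optimized: pipeline the two modules; no intermediate materialization.
result = list(step2(step1(["Tech NEWS", "Sports"])))
print(result)                      # ['tech news']
```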
Workflow Model
• Workflow
  • Two kinds of vertices: tasks (processing steps) and channels (data containers).
  • Edges connect tasks to channels and vice versa.
• [Task] Consumption mode:
  • ALL: read a complete snapshot of the input channel.
  • NEW: read only data added since the last invocation.
• [Task] Production mode:
  • B: produce a complete new snapshot (a base block).
  • Delta: produce only data that augments any existing data.
Workflow Model
• [Task] Four common patterns of processing (contrasted in the sketch below):
  • Non-incremental (template detection): process data from scratch every time.
  • Stateless incremental (shingling): process new data only; each data item is handled independently.
  • Stateless incremental with lookup table (template tagging): process new data independently, possibly consulting a side lookup table for reference.
  • Stateful incremental (de-duping): process new data while maintaining and referencing state derived from prior input.
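A toy contrast of the stateless and stateful incremental patterns (illustrative code, not Nova's): shingling can treat each new item independently, while de-duping must remember keys seen in earlier invocations.

```python
# Toy contrast of two of the four patterns (illustrative only).

def shingle(doc):
    """Stateless incremental: each new item handled independently."""
    words = doc.split()
    return [" ".join(words[i:i+3]) for i in range(len(words) - 2)]

class Deduper:
    """Stateful incremental: state (keys seen so far) persists across
    invocations and is consulted while processing each new batch."""
    def __init__(self):
        self.seen = set()

    def process(self, new_items):
        fresh = [x for x in new_items if x not in self.seen]
        self.seen.update(fresh)
        return fresh

print(shingle("new york times square"))  # ['new york times', 'york times square']
d = Deduper()
print(d.process(["a", "b"]))   # ['a', 'b']
print(d.process(["b", "c"]))   # ['c'] -- 'b' was seen in a prior run
```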
Workflow Model (Cont.)
• Data and Update Model
  • Blocks: a channel's data is divided into blocks, which vary in size.
    • Blocks are atomic units (either processed entirely or discarded).
    • Blocks are immutable.
  • Base block: contains a complete snapshot of the data on a channel as of some point in time. Base blocks are assigned increasing sequence numbers (B0, B1, B2, …, Bn).
  • Delta block: used in conjunction with incremental processing; contains instructions for transforming a base block into a new base block (e.g. Δ0→1 transforms B0 into B1).
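One minimal way to model the two block kinds in Python (the representation is an assumption; the paper does not prescribe one):

```python
# Sketch of base and delta blocks (hypothetical representation).
# A base block B_n is a full snapshot keyed by primary key; a delta
# block holds instructions for turning base n into a later base.

class BaseBlock:
    def __init__(self, seq, records):
        self.seq = seq                # sequence number: B0, B1, ...
        self.records = dict(records)  # key -> record (full snapshot)

class DeltaBlock:
    def __init__(self, from_seq, to_seq, upserts):
        self.from_seq = from_seq      # transforms B_from_seq ...
        self.to_seq = to_seq          # ... into B_to_seq
        self.upserts = dict(upserts)  # key -> new/updated record

b0 = BaseBlock(0, {"url1": "v1"})
d01 = DeltaBlock(0, 1, {"url1": "v2", "url2": "v1"})
```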
Workflow Model (Cont.)
• Data and Update Model
  • Operators:
    • Merging: combine a base block with a delta block to form a new base block (merge(B0, Δ0→1) = B1).
    • Diffing: compare two base blocks to create a delta block.
    • Chaining: combine multiple delta blocks into one.
  • Upsert model: leverages the presence of a primary-key attribute to encode updates and inserts in a uniform way. With upserts, delta blocks consist of records to be inserted, each one displacing any pre-existing record with the same key => only the most recent record with a given key is retained.
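Under the upsert model the three operators reduce to keyed dictionary operations; a sketch with blocks modeled as plain key->record dicts (illustrative only):

```python
# Sketch of merge / diff / chain under the upsert model. Blocks are
# modeled as plain dicts (key -> record); names are illustrative.

def merge(base, delta):
    """Combine a base block with a delta block: each upsert displaces
    any pre-existing record with the same key."""
    out = dict(base)
    out.update(delta)
    return out

def diff(old_base, new_base):
    """Compare two base blocks to produce a delta block (insert/update
    only; deletions would need tombstone records)."""
    return {k: v for k, v in new_base.items() if old_base.get(k) != v}

def chain(d1, d2):
    """Combine consecutive delta blocks, retaining only the most
    recent record per key."""
    out = dict(d1)
    out.update(d2)
    return out

b0 = {"url1": "v1"}
d01 = {"url1": "v2", "url2": "v1"}
b1 = merge(b0, d01)          # {'url1': 'v2', 'url2': 'v1'}
assert diff(b0, b1) == d01
```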
Workflow Model (Cont.)
• Task/Data Interface:
  • [Task] Consumption mode:
    • ALL: read a complete snapshot of the input channel.
    • NEW: read only data added since the last invocation.
  • [Task] Production mode:
    • B: produce a complete new snapshot (a base block).
    • Delta: produce only data that augments any existing data.
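Stated compactly (hypothetical names, not Nova's actual API), the interface amounts to one consumption flag per input channel and one production flag per output channel:

```python
# Sketch of the task/data interface: each task declares a consumption
# mode per input channel and a production mode per output channel.
from enum import Enum

class Consume(Enum):
    ALL = "read a complete snapshot"
    NEW = "read only data added since the last invocation"

class Produce(Enum):
    BASE = "emit a complete new snapshot (a base block)"
    DELTA = "emit only data that augments what already exists"

# E.g. a stateless incremental task such as shingling:
shingling = {"consume": Consume.NEW, "produce": Produce.DELTA}
# A non-incremental task such as template detection:
template_detection = {"consume": Consume.ALL, "produce": Produce.BASE}
```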
Workflow Model (Cont.)
• Workflow Programming and Scheduling
  • Workflow programming starts with task definitions, which are then composed into "workflowettes".
  • Workflowettes have ports to which input and output channels may be connected.
  • Channels attached to the input and output ports of a workflowette => a bound workflowette.
  • Three types of trigger are associated with a workflowette (sketched below):
    • Data-based trigger.
    • Time-based trigger.
    • Cascade trigger.
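The three trigger kinds can be sketched as simple predicates over channel and clock state (names and signatures are hypothetical):

```python
# Sketch of the three trigger kinds for a bound workflowette.
import time

def data_based_trigger(channel_size, threshold):
    """Fire when enough new data has accumulated on an input channel."""
    return channel_size >= threshold

def time_based_trigger(last_run, period_sec):
    """Fire on a fixed schedule, regardless of data volume."""
    return time.time() - last_run >= period_sec

def cascade_trigger(upstream_finished):
    """Fire when an upstream workflowette that feeds this one completes."""
    return upstream_finished

# E.g. news ingestion might use a short time-based trigger, while a
# costly global link analysis runs on a long one (cf. the earlier
# slide on independent scheduling).
```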
Workflow Model (Cont.)
• Data Compaction and Garbage Collection
  • Data blocks are immutable, and channels accumulate them => channels can grow without bound.
  • If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the compaction operation computes and adds B3 to the channel.
  • After compaction has added B3, if the current consumer cursor is at sequence number 2, then B0, Δ0→1, and Δ1→2 can be garbage-collected.
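A sketch of compaction and garbage collection on a single channel, with blocks again modeled as plain dicts (illustrative only):

```python
# Sketch of compaction and garbage collection (illustrative names).
# A channel holds one base block plus a chain of delta blocks; deltas
# are applied in sequence to produce the compacted base.

def compact(base, deltas):
    """Merge the base with all following deltas into a new base block."""
    records = dict(base)
    for d in deltas:
        records.update(d)
    return records

# Channel state: B0 plus deltas 0->1, 1->2, 2->3.
b0 = {"k": "v0"}
deltas = [{"k": "v1"}, {"k": "v2"}, {"k": "v3", "k2": "x"}]
b3 = compact(b0, deltas)        # add B3 to the channel

# GC: an incremental consumer whose cursor is at sequence number 2
# still needs delta 2->3, so only B0 and the deltas up to its cursor
# (0->1 and 1->2) may be garbage-collected.
cursor = 2
collectable = ["B0"] + [f"delta {i}->{i+1}" for i in range(cursor)]
print(collectable)              # ['B0', 'delta 0->1', 'delta 1->2']
print(b3)                       # {'k': 'v3', 'k2': 'x'}
```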
Tying the model to Pig/Hadoop
• Each data block resides in an HDFS file; a metadata service maintains the mapping.
• The notion of a channel exists only in the metadata.
• Each task is executed as a Pig program.
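Since channels exist only in metadata, the mapping layer can be sketched as a small catalog from (channel, block) pairs to HDFS paths (paths and names are hypothetical):

```python
# Sketch of the metadata layer: channels and blocks exist only here;
# the bytes live in ordinary HDFS files. Paths are hypothetical.

class MetadataService:
    def __init__(self):
        self.catalog = {}   # (channel, block_id) -> HDFS file path

    def register_block(self, channel, block_id, hdfs_path):
        self.catalog[(channel, block_id)] = hdfs_path

    def blocks_of(self, channel):
        """List (block_id, path) pairs for a channel, in order."""
        return sorted((b, p) for (c, b), p in self.catalog.items()
                      if c == channel)

meta = MetadataService()
meta.register_block("crawled_pages", "B0", "/nova/crawled_pages/base-0")
meta.register_block("crawled_pages", "D0_1", "/nova/crawled_pages/delta-0-1")
print(meta.blocks_of("crawled_pages"))
```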