
Scalable stream processing with Storm






Presentation Transcript


  1. Scalable stream processing with Storm Brian Johnson Luke Forehand

  2. Our Ambition: Marketing Decision Platform • [Diagram: a map of marketing decision areas, including brand health and equity, purchase funnel and budget planning, product development and pricing, advertising and media mix (TV, print, digital, social, mobile, out of home), channel and retailer management, consumer segmentation and targeting, loyalty, and PR/influencer activity]

  3. Big Data Analytics • What is “Big Data” to Networked Insights? • Almost exclusively social media posts and metadata • Twitter (~67%), forums, blogs, Facebook, etc. • Total index ~60 billion documents, ~500 TB in production • ~2 billion new documents per month, and increasing • Historical data going back to 2009 • [Diagram: Data → Information via thematic clustering (Doppler)]

  4. Utilizing Social Media Data • We do two things: 1) filter data; 2) analyze data • Our filtering technology must accommodate two scenarios • We analyze two types of information: implicit & explicit

  5. Implicit Information Mining Example • Gender classification – list of methods and features • Author name / author ID analysis: compare both fields against a list of first names from the US Census • Twitter summary field analysis • Post content features: analyze the content for certain clues or common characteristics that one gender has over another • Text formality – males tend to use more formality than females • Suffix preferences – many suffixes show up more in female posts than male • Word classes – 23 different groups of words that reflect certain topics or emotions that skew towards one gender more than another • Lexical words & phrases – certain words/phrases that are giveaways, like “my husband” • POS sequences – certain part-of-speech patterns for unigram, bigram, trigram, and quadgram phrases
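The "lexical words & phrases" signal above can be sketched in a few lines. This is an illustrative toy only: the phrase list and weights below are invented placeholders, not Networked Insights' actual model or features.

```java
import java.util.Map;

// Toy lexical-feature scorer in the spirit of the "giveaway phrase" signal
// described above. Weights and phrases are made-up placeholders.
public class LexicalGenderScorer {
    private final Map<String, Double> phraseWeights;

    public LexicalGenderScorer(Map<String, Double> phraseWeights) {
        // positive weight = skews female, negative = skews male (a convention
        // chosen here for illustration)
        this.phraseWeights = phraseWeights;
    }

    /** Sums the weights of known giveaway phrases found in the post text. */
    public double score(String post) {
        String text = post.toLowerCase();
        double total = 0.0;
        for (Map.Entry<String, Double> e : phraseWeights.entrySet()) {
            if (text.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total; // > 0 suggests one class, < 0 the other, 0 unknown
    }
}
```

A real classifier would combine this with the other features listed (formality, suffixes, word classes, POS sequences) in a trained model rather than a hand-weighted sum.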

  6. Lots of data, lots of routing • [Diagram: original documents and metadata routed through spam classifiers, topical categorization (Timberlake? Bieber? Jay Z? World War Z? Monsters U? White House Down? Taco Bell? McDonald’s? Subway? iPhone? Samsung? BlackBerry? etc.), gender analysis, age classification, and sentiment, feeding the reporting layer and the SocialSense application layer; the routing fabric = Storm]

  7. Storm Agenda • Overview • Architecture • Working Example • Spout API / Reliability • Bolt API / Scalability • Topology Demo • Monitoring

  8–12. Overview • Storm is a realtime distributed processing system • Think of Hadoop, but in realtime • Data can be transformed and grouped in complex ways using simple constructs • Storm is reliable and fault tolerant • Message delivery is guaranteed • Storm is easy to configure and scale • Each component can be scaled independently • Components can be written in any language • Written in Clojure (a functional language), driven by ZeroMQ

  13. Architecture • Components

  14–16. Architecture • Nimbus • The “master” node • Uses Zookeeper to communicate with Supervisors • Responsible for assigning work to Supervisors • Supervisor • Manages a set of workers (JVMs) on each Storm node • Receives work assignments from Nimbus • Worker • Managed by a Supervisor • Responsible for receiving, executing, and emitting data inside a Storm topology

  17. Working Example


  19–24. Working Example • Topology • Defines the logical components of a data flow • Composed of Spouts, Bolts, and Streams • A Spout is a special component that emits data tuples into a topology • A Bolt processes tuples emitted from upstream components and produces zero or more output tuples • A Stream is a flow of tuples from one component to another; there can be many • A Tuple is a single record containing a named list of values
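The "named list of values" idea can be pictured with a minimal stand-in class. This mirrors the concept only; Storm's real Tuple interface is much richer (typed getters, stream/source metadata, and more).

```java
import java.util.List;

// Minimal stand-in for the tuple concept: a schema of field names and a
// parallel list of values, looked up by name.
public class SimpleTuple {
    private final List<String> fields; // declared by the emitting component
    private final List<Object> values; // one value per field

    public SimpleTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("schema/value length mismatch");
        }
        this.fields = fields;
        this.values = values;
    }

    /** Look up a value by its declared field name. */
    public Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }
}
```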

  25. Working Example

  26. Spout API
  ISpout
    void declareOutputFields(OutputFieldsDeclarer declarer)
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
    void nextTuple()
    void close()
  ISpoutOutputCollector
    List<Integer> emit(String streamId, List<Object> tuple, Object messageId)
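To make the lifecycle concrete, here is a plain-Java sketch of what a spout implementation typically does. Storm's types are mocked away (a Queue stands in for the upstream data source, a List for the SpoutOutputCollector), so this compiles on its own; the method names mirror the interface above.

```java
import java.util.List;
import java.util.Queue;

// Sketch of the spout lifecycle: open() acquires a source, nextTuple()
// emits at most one tuple per call, close() releases resources.
public class QueueSpoutSketch {
    private Queue<String> source;
    private List<String> collector; // stand-in for SpoutOutputCollector

    public void open(Queue<String> source, List<String> collector) {
        this.source = source;     // real code: connect to the data source here
        this.collector = collector;
    }

    public void nextTuple() {
        String next = source.poll();
        if (next != null) {
            collector.add(next);  // real code: collector.emit(tuple, messageId)
        }
        // If the source is empty, return without emitting; Storm calls again.
    }

    public void close() {
        source = null;            // real code: release the connection
    }
}
```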

  27–31. Reliability • Each Storm component acknowledges that a tuple has been processed • An ACK is sent to the upstream component, eventually propagating back to the emitting spout • The emitting spout will replay the tuple if an ACK is not received within a configured timeout • Spouts can control the number of “pending” tuples that are in memory in the topology • Spouts need to transact properly with an upstream data source when a tuple is fully acknowledged

  32. Reliability
  ISpout
    void ack(Object msgId)
    void fail(Object msgId)
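The ack/fail contract implies the spout must remember everything it has emitted until it is fully acknowledged. A plain-Java sketch of that bookkeeping (this is the idea behind the callbacks, not Storm's internals):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of spout-side bookkeeping for ack()/fail(): every emitted-but-
// unacknowledged tuple is kept keyed by message id, so a failed or
// timed-out id can be replayed. Storm invokes the ack/fail callbacks.
public class PendingTracker {
    private final Map<Object, String> pending = new LinkedHashMap<>();

    public void emitted(Object msgId, String tuple) {
        pending.put(msgId, tuple);   // remember until fully acked
    }

    public void ack(Object msgId) {
        pending.remove(msgId);       // safe to forget; commit upstream here
    }

    public String fail(Object msgId) {
        return pending.get(msgId);   // still known, so it can be re-emitted
    }

    public int pendingCount() {
        return pending.size();       // the quantity MAX_SPOUT_PENDING bounds
    }
}
```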

  33–34. Reliability • MAX_SPOUT_PENDING is the parameter that controls how many pending tuples a spout can emit into a topology • Be careful not to artificially decrease throughput! • Batching operations with reliability turned on can also create issues

  35. Reliability • If MAX_SPOUT_PENDING is smaller than the batch size, the topology will deadlock • If tuple flow is interrupted, a batch may never fill

  36. Reliability • Solution: time-based batching with TickTuple • A TickTuple exercises the component to prompt a batch commit at a specified interval
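The tick-based fix can be sketched in plain Java. A batch flushes when it is full or when a tick arrives, so a trickle of tuples can never strand a partial batch; in Storm the tick is a TickTuple delivered on a configured interval (Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS), mocked here as an onTick() call, with flushes collected into a list instead of committed.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of time-based batching: flush on size OR on a periodic tick.
public class TickBatcher {
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();

    public TickBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    public void onTuple(String tuple) {
        batch.add(tuple);
        if (batch.size() >= batchSize) {
            flush();                 // normal size-based commit
        }
    }

    public void onTick() {           // in Storm: execute() receives a TickTuple
        if (!batch.isEmpty()) {
            flush();                 // interval-based commit of a partial batch
        }
    }

    private void flush() {
        flushed.add(new ArrayList<>(batch)); // real code: commit, then ack the batch
        batch.clear();
    }

    public List<List<String>> flushedBatches() {
        return flushed;
    }
}
```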

  37. Reliability • Questions?

  38. Bolt API

  39–41. Bolt API • Stream groupings define how bolts receive streams as input; we’ll talk about the two basic types • Shuffle grouping – tuples are randomly distributed across the instances of a bolt • Fields grouping – the stream is partitioned by the fields specified in the grouping, so tuples with a particular named value will always flow to the same bolt instance
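The two groupings boil down to how a task index is chosen per tuple. A plain-Java sketch of the routing decision (equivalent in spirit to Storm's partitioning, not its actual implementation):

```java
import java.util.Random;

// Sketch of the two basic groupings: shuffle picks a random task; fields
// grouping hashes the grouping field so equal values always land on the
// same task instance.
public class GroupingSketch {
    private final int numTasks;
    private final Random random = new Random();

    public GroupingSketch(int numTasks) {
        this.numTasks = numTasks;
    }

    /** Shuffle grouping: uniform random task index. */
    public int shuffle() {
        return random.nextInt(numTasks);
    }

    /** Fields grouping: deterministic task index from the field value. */
    public int byField(Object fieldValue) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

The determinism of byField is what makes per-key state (e.g. per-author counters) safe: all tuples for one key are processed by one bolt instance.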

  42. Bolt API

  43. Bolt API
  IBolt
    void declareOutputFields(OutputFieldsDeclarer declarer)
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector)
    void execute(Tuple input)
    void cleanup()
  IOutputCollector
    List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple)
    void ack(Tuple input)
    void fail(Tuple input)
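The heart of the contract is execute(): transform the input, emit zero or more output tuples, then ack the input so the reliability chain can propagate back to the spout. A plain-Java sketch with Storm's collector mocked as lists:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a bolt's execute() contract: one output tuple per input here,
// followed by an ack of the input.
public class UppercaseBoltSketch {
    private final List<String> emitted = new ArrayList<>(); // stand-in collector
    private final List<String> acked = new ArrayList<>();

    public void execute(String input) {
        emitted.add(input.toUpperCase()); // real code: collector.emit(anchors, tuple)
        acked.add(input);                 // real code: collector.ack(input)
    }

    public List<String> emitted() { return emitted; }
    public List<String> acked()   { return acked; }
}
```

Note the ordering: emitting with the input as an anchor before acking it is what ties downstream failures back to the original spout tuple.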

  44. Bolt API • You can also build the components of your topology in other languages:

  public class MyPythonBolt extends ShellBolt {
      public MyPythonBolt() {
          super("python", "mybolt.py");
      }
      ...
  }

  45–48. Scalability • The goal should be to scale components accordingly in order to keep up with the realtime data flow • Scalability is easy and can happen in several ways • Increase the number of executors (threads) that work within a component (bolt or spout) • Increase the number of workers assigned to a topology • Increase the total workers available in the cluster

  49. Scalability • Example topology: increasing the number of executors per component

  50. Scalability • Example topology: increasing the number of workers in the topology • 2 workers, MySpout with 2 executors, MyBolt with 4 executors • Work will always be spread evenly across the workers when possible • 4 workers, MySpout with 2 executors, MyBolt with 4 executors
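The "spread evenly when possible" behavior in the example (6 executors total: 2 for MySpout plus 4 for MyBolt) can be sketched as a round-robin assignment of executors to workers:

```java
// Sketch of even executor placement: distribute E executors round-robin
// over W workers, so counts differ by at most one.
public class ExecutorSpread {
    public static int[] spread(int executors, int workers) {
        int[] perWorker = new int[workers];
        for (int i = 0; i < executors; i++) {
            perWorker[i % workers]++;   // round-robin assignment
        }
        return perWorker;
    }
}
```

With 2 workers the 6 executors land 3 per worker; with 4 workers the best possible split is uneven by one (2, 2, 1, 1), which is what "when possible" alludes to.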
