Scalable stream processing with Storm
Brian Johnson, Luke Forehand
Our Ambition: Marketing Decision Platform
[Diagram: a map of the marketing decision space — Brand Health (equity; image & personality; perceptions & associations), Brand & Category Environment (category trends; laws & regulations; external forces such as the economy; competitive positioning), Choice & Experience (purchase funnel; experience & usage; loyalty), Budgets & Planning, Marketing & Media Mix (traditional and digital advertising; consumer promotion; owned social engagement; public relations; influence & advocacy), Sales Management, Product Development, Pricing (price optimization; competitive pricing; price/value perception), and Retailer & Channel Management (distribution; assortment; in-store promotion).]
Big Data Analytics
• What is “Big Data” to Networked Insights?
• Almost exclusively social media posts and metadata
• Twitter (~67%), forums, blogs, Facebook, etc.
• Total index: ~60 billion documents, ~500 TB in production
• ~2 billion new documents per month, and increasing
• Historical data going back to 2009
[Diagram: Data → Information via thematic clustering (Doppler)]
Utilizing Social Media Data
• We do two things: 1) filter data; 2) analyze data
• Our filtering technology must accommodate two scenarios
• We analyze two types of information: implicit & explicit
Implicit Information Mining Example
• Gender classification – methods and features:
  • Author name / author ID analysis: compare both fields against a list of first names from the US Census
  • Twitter summary field analysis
  • Post content features: analyze the content for clues or characteristics more common to one gender than the other
    • Text formality – males tend toward more formal text than females
    • Suffix preferences – many suffixes appear more often in female posts than in male posts
    • Word classes – 23 groups of words reflecting topics or emotions that skew toward one gender
    • Lexical words & phrases – certain words/phrases are giveaways, like “my husband”
    • POS sequences – certain part-of-speech patterns across unigram, bigram, trigram, and quadgram phrases
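As a toy illustration of the lexical words & phrases feature (the phrase lists and scoring below are made up for this example — they are not Networked Insights' actual model):

import java.util.Arrays;
import java.util.List;

public class LexicalGenderFeature {
    // Illustrative giveaway phrases only; a real model would learn these from data.
    private static final List<String> FEMALE_CUES = Arrays.asList("my husband", "my boyfriend");
    private static final List<String> MALE_CUES   = Arrays.asList("my wife", "my girlfriend");

    /** Crude score: positive leans female, negative leans male, zero is neutral. */
    public static int score(String post) {
        String text = post.toLowerCase();
        int score = 0;
        for (String cue : FEMALE_CUES) if (text.contains(cue)) score++;
        for (String cue : MALE_CUES)   if (text.contains(cue)) score--;
        return score;
    }
}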
Lots of data, lots of routing
[Diagram: original documents and metadata flow through Storm components (spam classifiers, gender analysis, topical categorization, age classification, sentiment) and fan out to topic filters — “Timberlake? Bieber? Jay Z?”, “World War Z? Monsters U? White House Down?”, “Taco Bell? McDonald’s? Subway?”, “iPhone? Samsung? BlackBerry?”, etc. — feeding a reporting layer and the SocialSense application layer.]
Storm Agenda
• Overview
• Architecture
• Working Example
• Spout API / Reliability
• Bolt API / Scalability
• Topology Demo
• Monitoring
Overview
• Storm is a realtime distributed processing system
  • Think of Hadoop, but in realtime
  • Data can be transformed and grouped in complex ways using simple constructs
• Storm is reliable and fault tolerant
  • Message delivery is guaranteed
• Storm is easy to configure and scale
  • Each component can be scaled independently
• Components can be written in any language
• Written in Clojure (a functional language), driven by ZeroMQ
Architecture
• Components
Architecture
• Nimbus
  • The “master” node
  • Uses Zookeeper to communicate with Supervisors
  • Responsible for assigning work to Supervisors
• Supervisor
  • Manages a set of workers (JVMs) on each Storm node
  • Receives work assignments from Nimbus
• Worker
  • Managed by a Supervisor
  • Responsible for receiving, executing, and emitting data inside a Storm topology
Working Example
• Topology
  • Defines the logical components of a data flow
  • Composed of spouts, bolts, and streams
• A spout is a special component that emits data tuples into a topology
• A bolt processes tuples emitted from upstream components and produces zero or more output tuples
• A stream is a flow of tuples from one component to another; there can be many
• A tuple is a single record containing a named list of values
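A minimal sketch of wiring these pieces together (SentenceSpout and UppercaseBolt are illustrative components sketched in the API sections below; imports assume the pre-Apache backtype.storm packages of this era, later renamed org.apache.storm):

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class ExampleTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout emits tuples into the topology...
        builder.setSpout("sentences", new SentenceSpout());
        // ...and the bolt consumes its stream of tuples.
        builder.setBolt("upper", new UppercaseBolt()).shuffleGrouping("sentences");
        // Run in-process for development; a production deploy would use StormSubmitter.
        new LocalCluster().submitTopology("example", new Config(), builder.createTopology());
    }
}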
Spout API

ISpout
  void declareOutputFields(OutputFieldsDeclarer declarer)
  void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
  void nextTuple()
  void close()

ISpoutOutputCollector
  List<Integer> emit(String streamId, List<Object> tuple, Object messageId)
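A minimal spout sketch using the BaseRichSpout convenience base class (the sentence data and field name are illustrative):

import java.util.Map;
import java.util.UUID;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence")); // names the single output field
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector; // keep the collector for use in nextTuple()
    }

    @Override
    public void nextTuple() {
        // A real spout would read from a queue or feed; this emits a constant.
        // The message ID enables the reliability mechanics on the next slides.
        collector.emit(new Values("the quick brown fox"), UUID.randomUUID().toString());
    }
}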
Reliability
• Each Storm component acknowledges that a tuple has been processed
• An ACK is sent to the upstream component, eventually propagating back to the emitting spout
• The emitting spout will replay the tuple if an ACK is not received within a configured timeout
• Spouts can control the number of “pending” tuples that are in memory in the topology
• Spouts need to transact properly with an upstream data source when a tuple is fully acknowledged
Reliability

ISpout
  void ack(Object msgId)
  void fail(Object msgId)
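A sketch of how a spout might use ack/fail to track pending tuples and replay failures (the queue-backed design here is one possible approach, not the only one):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class ReliableSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<String>();
    private final Map<String, String> pending = new ConcurrentHashMap<String, String>();

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String msg = queue.poll();
        if (msg == null) return;
        String id = UUID.randomUUID().toString();
        pending.put(id, msg);                  // track until fully acknowledged
        collector.emit(new Values(msg), id);
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);                 // fully processed; safe to forget
    }

    @Override
    public void fail(Object msgId) {
        String msg = pending.remove(msgId);
        if (msg != null) queue.offer(msg);     // replay the failed tuple
    }
}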
Reliability
• MAX_SPOUT_PENDING is the parameter that controls how many pending tuples a spout can emit into a topology
  • Be careful not to artificially decrease throughput!
• Batching operations with reliability turned on can also create issues
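Setting the cap looks like this (the value 5000 is illustrative; tune it against your batch sizes and latency budget):

import backtype.storm.Config;

public class PendingConfig {
    public static Config build() {
        Config conf = new Config();
        // Cap on in-flight (unacked) tuples per spout task. Too small a cap
        // starves downstream bolts and artificially limits throughput.
        conf.setMaxSpoutPending(5000);
        return conf;
    }
}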
Reliability
• If MAX_SPOUT_PENDING is smaller than the batch size, the topology can deadlock
• If tuple flow is interrupted, a batch may never fill
Reliability
• Solution: time-based batching with TickTuple
• A TickTuple exercises the component to prompt a batch commit on a specified interval
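A sketch of a bolt that requests tick tuples and flushes its batch on each tick (the 10-second interval and the buffering helpers are placeholders):

import java.util.HashMap;
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class TimedBatchBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple on a fixed interval.
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
        return conf;
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        if (isTickTuple(input)) {
            flushBatch();          // commit whatever has accumulated, full or not
        } else {
            addToBatch(input);     // buffer the tuple; ack after the batch commits
        }
    }

    private boolean isTickTuple(Tuple t) {
        return Constants.SYSTEM_COMPONENT_ID.equals(t.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(t.getSourceStreamId());
    }

    private void addToBatch(Tuple t) { /* buffer tuples here */ }
    private void flushBatch()        { /* write the batch, then ack buffered tuples */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}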
Reliability
• Questions?
Bolt API
• Stream groupings define how bolts receive streams as input; we’ll cover the two basic types
• Shuffle grouping – tuples are randomly distributed across the instances of a bolt
• Fields grouping – the stream is partitioned by the fields specified in the grouping, so tuples with a particular named value will always flow to the same bolt instance
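A sketch showing both groupings in a topology definition (PostSpout, SpamBolt, and GenderBolt are hypothetical stand-ins):

import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExample {
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("posts", new PostSpout());
        // Shuffle grouping: each tuple lands on a random SpamBolt instance.
        builder.setBolt("spam", new SpamBolt(), 4).shuffleGrouping("posts");
        // Fields grouping: every tuple with the same "author" value
        // is routed to the same GenderBolt instance.
        builder.setBolt("gender", new GenderBolt(), 4).fieldsGrouping("spam", new Fields("author"));
        return builder.createTopology();
    }
}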
Bolt API

IBolt
  void declareOutputFields(OutputFieldsDeclarer declarer)
  void prepare(Map stormConf, TopologyContext context, OutputCollector collector)
  void execute(Tuple input)
  void cleanup()

IOutputCollector
  List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple)
  void ack(Tuple input)
  void fail(Tuple input)
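A minimal bolt sketch using BaseRichBolt, showing the anchor-then-ack pattern that ties into the reliability model above (the uppercasing logic is just a placeholder):

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class UppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Emit anchored to the input tuple so reliability tracking covers it...
        collector.emit(input, new Values(input.getString(0).toUpperCase()));
        // ...then ack so the ACK propagates back toward the emitting spout.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("uppercased"));
    }
}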
Bolt API
• You can also build the components of your topology in other languages:

public class MyPythonBolt extends ShellBolt implements IRichBolt {
    public MyPythonBolt() {
        // Run mybolt.py under the python interpreter via Storm's multilang protocol
        super("python", "mybolt.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Illustrative field name; declare whatever mybolt.py actually emits
        declarer.declare(new Fields("word"));
    }
}
Scalability
• The goal should be to scale components so they keep up with the realtime data flow
• Scaling is easy and can happen in several ways (see the sketch below):
  • Increase the number of executors (threads) that work within a component (bolt or spout)
  • Increase the number of workers assigned to a topology
  • Increase the total workers available in the cluster
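A sketch matching the numbers in the diagrams below — 2 workers, MySpout with 2 executors, MyBolt with 4 executors (the component classes come from those slides; the topology name is illustrative):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class ScalingExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("my-spout", new MySpout(), 2);   // 2 executors
        builder.setBolt("my-bolt", new MyBolt(), 4)       // 4 executors
               .shuffleGrouping("my-spout");

        Config conf = new Config();
        conf.setNumWorkers(2);                            // 2 worker JVMs for this topology

        StormSubmitter.submitTopology("scaling-example", conf, builder.createTopology());
    }
}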
Scalability Example
[Diagram: a topology scaled by increasing the number of executors per component]
Scalability Example
[Diagram: the same topology scaled by increasing the number of workers — 2 workers vs. 4 workers, each with MySpout at 2 executors and MyBolt at 4 executors. Work is always spread evenly across the workers when possible.]